
Big Data/Hadoop Developer Resume

Sunnyvale, CA

SUMMARY:

  • 9+ years of experience in data analysis, design, development, testing, customization, bug fixes, enhancement, support, and implementation using Python and Spark programming for Hadoop.
  • Worked in AWS environments with services such as Lambda, serverless applications, EMR, Athena, AWS Glue, IAM policies, S3, CloudFormation templates (CFTs), and EC2.
  • Developed Python and PySpark programs for data analysis on MapR, Cloudera, and Hortonworks Hadoop clusters.
  • Developed PySpark code for AWS Glue jobs and for EMR (see the Glue job sketch after this list).
  • Worked on scalable distributed data systems using the Hadoop ecosystem on AWS EMR and the MapR distribution.
  • Good working experience using Python to develop a custom framework for generating rules (similar to a rules engine). Developed Hadoop streaming jobs in Python to integrate applications with Python API support.
  • Developed Python code to gather data from HBase and designed the solution to process it using PySpark.
  • Used Apache Spark DataFrames/RDDs to apply business transformations and HiveContext objects to perform read/write operations.
  • Worked on Jenkins for CI/CD pipeline in AWS.
  • Developed Airflow job code in Python (DAGs) for scheduling jobs (see the DAG sketch after this list).
  • Rewrote several Hive queries in Spark SQL to reduce the overall batch time.
  • Integrated Oozie with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as MapReduce, Spark, Hive, and Sqoop) as well as system-specific jobs (such as Python programs and shell scripts).
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Highly motivated to work with Python and R scripts for statistical analytics and for generating data quality reports.
  • Good experience reading R code to analyze machine learning models.
  • Worked with Python for statistical analysis of data, including data quality checks and confidence intervals.
  • Good experience in Linux Bash scripting and in following PEP guidelines in Python.
  • Worked on Kafka for data streaming and data ingestion.
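
A minimal sketch of the kind of PySpark code used for the AWS Glue jobs mentioned above, assuming a hypothetical Glue Data Catalog database (sales_db), table (raw_orders), and S3 output path; none of these names come from the projects themselves.

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext
    from pyspark.sql import functions as F

    # Standard AWS Glue job bootstrap; the job name is supplied by the Glue runtime.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a catalog table into a DynamicFrame, then convert to a Spark DataFrame
    # for ordinary transformations (database/table names are placeholders).
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="raw_orders"
    )
    df = dyf.toDF()

    # Example business transformation: keep completed orders and stamp a load date.
    cleaned = (
        df.filter(F.col("status") == "COMPLETED")
          .withColumn("load_date", F.current_date())
    )

    # Write the result back to S3 as partitioned Parquet (placeholder bucket/path).
    cleaned.write.mode("overwrite").partitionBy("load_date").parquet(
        "s3://example-bucket/curated/orders/"
    )

    job.commit()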
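
For the Airflow job code, a DAG along the following lines is a reasonable sketch; the DAG id, schedule, owner, and spark-submit command are illustrative only, and the BashOperator import path shown is the Airflow 2.x one.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator  # Airflow 2.x import path

    default_args = {
        "owner": "data-eng",                 # placeholder owner
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
    }

    # A daily DAG that submits a PySpark batch job; all names are illustrative.
    with DAG(
        dag_id="daily_pyspark_batch",
        default_args=default_args,
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        run_spark_job = BashOperator(
            task_id="spark_submit_batch",
            bash_command=(
                "spark-submit --master yarn --deploy-mode cluster "
                "/opt/jobs/daily_batch.py --run-date {{ ds }}"
            ),
        )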

TECHNICAL SKILLS:

Operating Systems: CentOS, Ubuntu, Red Hat Linux 5.x/6.x, Amazon Linux, Windows 95/98/NT/2000, Windows Vista, Windows 7

Programming Languages: Python, Java, C#, and Scala

Databases: Oracle 10g (SQL), MySQL, SQL Server 2008

Scripting Languages: Shell scripting, PowerShell

Hadoop Ecosystem: Hive, Pig, Flume, Oozie, Sqoop, Spark, Impala, Kafka, and HBase

PROFESSIONAL EXPERIENCE:

Confidential, Sunnyvale, CA

Big Data/Hadoop Developer

Responsibilities:

  • Hands-on experience in Python/PySpark programming on Cloudera, Hortonworks, and MapR Hadoop clusters, AWS EMR clusters, AWS Lambda functions, and CloudFormation templates (CFTs).
  • Handled importing of data from various data sources, performed transformations using Spark, and loaded the data into Hive.
  • Worked on Jenkins for CI/CD pipeline in AWS.
  • Responsible for data analysis and data cleaning using Spark SQL queries.
  • Worked on Python/PySpark programming.
  • Worked with the Spark Core, Spark Streaming, and Spark SQL modules of Spark.
  • Used PySpark to write the code for all Spark use cases, gained extensive experience with Scala for data analytics on the Spark cluster, and performed map-side joins on RDDs.
  • Explored various Spark modules and worked with DataFrames, RDDs, and SparkContext.
  • Built an on-demand, secure EMR launcher with custom spark-submit steps using S3 events, SNS, KMS, and a Lambda function (see the Lambda sketch after this list).
  • Migrated an existing on-premises application to AWS.
  • Used AWS services like EC2 and S3 for small data sets.
  • Used CloudWatch Logs to move application logs to S3 and created alarms based on exceptions raised by applications.
  • Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze data (see the streaming sketch after this list).
  • Good hands-on experience with microservices on Cloud Foundry for real-time data streaming, persisting data to HBase and communicating with RESTful web services using a Java API.
  • Exported the analyzed data to relational databases using Sqoop for visualization and report generation by the BI team; good working experience with Datameer.
  • Installed and configured Flume, Hive, Pig, Sqoop and Oozie on the Hadoop Cluster.
  • Used Flume to collect, aggregate, and store web log data from sources such as web servers and mobile and network devices, and pushed it to HDFS.
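
The on-demand EMR launcher bullet above can be sketched as a boto3 Lambda handler that starts a transient cluster with a single spark-submit step when triggered by an S3 event or SNS notification; the script path, roles, release label, and instance sizing below are assumptions rather than values from this engagement.

    import boto3

    emr = boto3.client("emr")

    def handler(event, context):
        """Launch a transient EMR cluster that runs one spark-submit step."""
        response = emr.run_job_flow(
            Name="on-demand-spark-batch",
            ReleaseLabel="emr-6.10.0",
            Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
            Instances={
                "InstanceGroups": [
                    {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
                     "InstanceCount": 1},
                    {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
                     "InstanceCount": 2},
                ],
                # Terminate the cluster automatically once the step finishes.
                "KeepJobFlowAliveWhenNoSteps": False,
            },
            Steps=[{
                "Name": "spark-submit-step",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster",
                             "s3://example-bucket/jobs/transform.py"],
                },
            }],
            JobFlowRole="EMR_EC2_DefaultRole",
            ServiceRole="EMR_DefaultRole",
            VisibleToAllUsers=True,
        )
        return {"ClusterId": response["JobFlowId"]}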
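
The Kafka/Spark/Hive pipeline could look roughly like the Structured Streaming sketch below; the broker, topic, schema, and output paths are hypothetical, and it assumes a Hive-enabled Spark session with the spark-sql-kafka connector available.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    # Hive-enabled session; the spark-sql-kafka package must be on the classpath.
    spark = (SparkSession.builder
             .appName("kafka-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical JSON schema for the messages on the Kafka topic.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("user_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_time", StringType()),
    ])

    # Read the raw Kafka stream and parse the JSON payload.
    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
           .option("subscribe", "events")                       # placeholder topic
           .load())

    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(F.from_json("json", schema).alias("e"))
              .select("e.*"))

    # Append the parsed events as Parquet files under a Hive-managed location.
    query = (events.writeStream
             .format("parquet")
             .option("path", "/warehouse/events")            # placeholder path
             .option("checkpointLocation", "/tmp/chk/events")
             .outputMode("append")
             .start())

    query.awaitTermination()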

Environment: MapReduce, S3, EC2, EMR, Java, HDFS, Hive, Pig, Tez, Oozie, HBase, Scala, PySpark, Spark SQL, Kafka, Python, Linux, PuTTY, Cassandra, Shell Scripting, ETL, YARN.

Confidential

Sr. Big Data Developer

Responsibilities:

  • Python/PySpark programming on MapR Hadoop clusters, AWS EMR clusters, AWS Lambda functions, and CloudFormation templates (CFTs).
  • Worked on scalable distributed data systems using the Hadoop ecosystem on AWS EMR and MapR (MapR Data Platform).
  • Worked on Jenkins for CI/CD pipeline in AWS.
  • Developed simple to complex MapReduce streaming jobs using Python, Hive, and Pig (see the streaming mapper/reducer sketch after this list).
  • Used various compression mechanisms to optimize MapReduce jobs and use HDFS efficiently.
  • Used the ETL tool Sqoop to extract data from MySQL and load it into HDFS.
  • Wrote Hive queries and Pig scripts to study customer behavior by analyzing the data.
  • Loaded data into Hive tables from the Hadoop Distributed File System (HDFS) to provide SQL-like access to Hadoop data.
  • Strong exposure to UNIX scripting and good hands-on shell scripting experience.
  • Wrote Python scripts to process semi-structured data in formats such as JSON.
  • Worked closely with the data modelers to model new incoming data sets.
  • Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data.
  • Troubleshot and identified bugs in Hadoop applications, working with the testing team to resolve them.
  • Good hands-on experience with the Python API, developing Kafka producers and consumers that write Avro schemas (see the Avro producer sketch after this list).
  • Installed the Ganglia monitoring tool to generate reports on the Hadoop cluster (CPU usage, hosts up/down, etc.) and performed operations to maintain the cluster.
  • Good hands-on experience with real-time data ingestion using Kafka and real-time processing through Storm (spouts, bolts), persisting data into HBase for data analytics.
  • Responsible for data analysis and data cleaning using Spark SQL queries.
  • Handled importing of data from various data sources, performed transformations using Spark, and loaded the data into Hive.
  • Worked with the Spark Core, Spark Streaming, and Spark SQL modules of Spark.
  • Used Scala to write the code for all Spark use cases, gained extensive experience with Scala for data analytics on the Spark cluster, and performed map-side joins on RDDs.
  • Explored various Spark modules and worked with DataFrames, RDDs, and SparkContext.
  • Built an on-demand, secure EMR launcher with custom spark-submit steps using S3 events, SNS, KMS, and a Lambda function.
  • Used CloudWatch Logs to move application logs to S3 and created alarms based on exceptions raised by applications.
  • Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze data.
  • Good hands-on experience with microservices on Cloud Foundry for real-time data streaming to persist data to HBase.
  • Determined the viability of business problems for a big data solution with PySpark.
  • Proactively monitored systems and services; worked on architecture design and implementation of the Hadoop deployment, configuration management, backup, and disaster recovery systems and procedures.
  • Monitored multiple Hadoop cluster environments using Ganglia; monitored workload, job performance, and capacity planning using MapR.
  • Involved in time series data representation using HBase.
  • Wrote MapReduce programs in Java to transform log data into a structured form and derive user location, age group, and time spent.
  • Analyzed web log data using HiveQL to extract the number of unique visitors per day, page views, visit duration, and the most purchased product on the website.
  • Built clusters in the AWS environment using EMR along with S3, EC2, and Redshift.
  • Exported the analyzed data to relational databases using Sqoop for visualization and report generation by the BI team.
  • Strong hands-on experience with PySpark, using Spark libraries through Python scripting for data analysis.
  • Worked with BI (Tableau) teams on dataset requirements; good working experience with data visualization.
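
The Python MapReduce streaming jobs referenced in this list follow the stdin/stdout contract of Hadoop Streaming; the word-count style mapper/reducer below is an illustration of that pattern, not code taken from the engagement.

    #!/usr/bin/env python
    """Minimal Hadoop Streaming mapper and reducer in one file (illustrative).

    Example invocation (paths and jar name are placeholders):
      hadoop jar hadoop-streaming.jar -files wordcount.py \
        -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" \
        -input /data/in -output /data/out
    """
    import sys

    def mapper():
        # Emit "word<TAB>1" for every token read from stdin.
        for line in sys.stdin:
            for word in line.strip().split():
                print(f"{word}\t1")

    def reducer():
        # Input arrives sorted by key, so counts for a word are contiguous.
        current, count = None, 0
        for line in sys.stdin:
            word, value = line.rstrip("\n").split("\t", 1)
            if word == current:
                count += int(value)
            else:
                if current is not None:
                    print(f"{current}\t{count}")
                current, count = word, int(value)
        if current is not None:
            print(f"{current}\t{count}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()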
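
For the Kafka producer/consumer work with Avro schemas, one plausible shape is sketched below using kafka-python and fastavro; both library choices, the broker address, the topic, and the schema itself are assumptions for illustration.

    import io

    from fastavro import parse_schema, schemaless_writer
    from kafka import KafkaProducer

    # Hypothetical Avro schema for user events.
    schema = parse_schema({
        "type": "record",
        "name": "UserEvent",
        "fields": [
            {"name": "user_id", "type": "string"},
            {"name": "action", "type": "string"},
            {"name": "ts", "type": "long"},
        ],
    })

    def serialize(record):
        """Encode one record as schemaless Avro bytes."""
        buf = io.BytesIO()
        schemaless_writer(buf, schema, record)
        return buf.getvalue()

    # Placeholder broker address; the serializer runs on every value sent.
    producer = KafkaProducer(bootstrap_servers="broker1:9092",
                             value_serializer=serialize)

    producer.send("user-events",
                  {"user_id": "u123", "action": "login", "ts": 1700000000})
    producer.flush()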

Environment: MapReduce, AWS S3, EC2, EMR, Java, Hadoop, HDFS, Hive, Pig, Tez, Oozie, HBase, Spark, Scala, PySpark, Spark SQL, Kafka, Python, PuTTY, Cassandra, Shell Scripting, ETL, YARN, Splunk, Sqoop, Linux, Cloudera, Big Data, Ganglia, SQL Server.

Confidential

Software Engineer - Big Data, Python, Spark

Responsibilities:

  • Worked on Python/PySpark programming on MapR, Cloudera, and Hortonworks.
  • Worked on analyzing Hadoop clusters using different big data analytic tools, including Hive, MapReduce, Pig, and Flume.
  • Involved in analyzing system failures, identifying root causes, and recommending courses of action.
  • Managed Hadoop clusters using Cloudera. Extracted, transformed, and loaded (ETL) data from multiple sources such as flat files, XML files, and databases.
  • Worked on Talend ETL to load data from various sources into the data lake. Used tMap, tReplicate, tFilterRow, tSort, and various other Talend components.
  • Developed datasets following CDISC standards and stored them in HDFS (S3).
  • Good experience with the ODM and SEND XML data formats for interacting with user data.
  • Migrated ETL jobs to Pig scripts to perform transformations, joins, and some pre-aggregations before storing the data in HDFS, using UDFs developed in Python and Java (see the Pig UDF sketch after this list).
  • Wrote shell scripts to monitor the health of Hadoop daemon services and respond accordingly to any warning or failure conditions (see the health-check sketch after this list).
  • Good knowledge of Amazon EMR (Elastic MapReduce).
  • Developed Pig UDFs to pre-process the data for analysis and Hive queries for the analysts.
  • Developed workflows in Oozie to automate the tasks of loading data into HDFS and pre-processing with Pig; used ZooKeeper for cluster coordination services.
  • Collected log data from web servers and integrated it into HDFS using Flume.
  • Implemented the Fair Scheduler on the JobTracker to share cluster resources across the MapReduce jobs submitted by users.
  • Worked on designing PoCs for implementing various ETL processes.
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Analyzed large data sets by running Hive queries and Pig scripts.
  • Involved in creating Hive tables and in loading and analyzing data using Hive queries.
  • Extracted data from Teradata into HDFS using Sqoop.
  • Loaded and transformed large sets of structured, semi-structured, and unstructured data.
  • Mentored the analyst and test teams in writing Hive queries.
  • Involved in running Hadoop jobs for processing millions of records of text data.
  • Worked with application teams to install Hadoop updates, patches and version upgrades as required.
  • Developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
  • Implemented best income logic using Pig scripts and UDFs.
  • Implemented test scripts to support test driven development and continuous integration.
  • Worked on performance tuning of Hive and Pig queries.
  • Developed UNIX shell scripts to automate repetitive database processes.
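
The Pig UDFs written in Python are typically Jython UDFs registered from the Pig script; the function and field names below are illustrative only.

    # udfs.py - Pig UDFs in Python (Jython); registered from Pig roughly as:
    #   REGISTER 'udfs.py' USING jython AS myudfs;
    #   cleaned = FOREACH raw GENERATE myudfs.normalize_code(product_code);
    #
    # Pig's Jython script engine injects the outputSchema decorator when the
    # file is registered; the fallback below only exists so the module can also
    # be imported and tested outside Pig.
    try:
        outputSchema
    except NameError:
        def outputSchema(schema):
            def wrap(func):
                return func
            return wrap

    @outputSchema("code:chararray")
    def normalize_code(value):
        """Trim whitespace and upper-case a code field, tolerating nulls."""
        if value is None:
            return None
        return value.strip().upper()

    @outputSchema("bucket:chararray")
    def amount_bucket(amount):
        """Bucket a numeric amount into coarse ranges for pre-aggregation."""
        if amount is None:
            return "unknown"
        amount = float(amount)
        if amount < 100:
            return "small"
        if amount < 1000:
            return "medium"
        return "large"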
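
The daemon health checks were implemented as shell scripts; purely as an illustration in Python, a comparable check driven by jps output might look like the sketch below, with the expected daemon list and the alerting hook treated as assumptions.

    import shutil
    import subprocess
    import sys

    # Daemons expected in the jps output on this node (assumed list).
    EXPECTED_DAEMONS = {"NameNode", "DataNode", "ResourceManager", "NodeManager"}

    def running_java_processes():
        """Return the set of Java process names reported by jps."""
        if shutil.which("jps") is None:
            raise RuntimeError("jps not found on PATH; is a JDK installed?")
        out = subprocess.run(["jps"], capture_output=True, text=True, check=True)
        names = set()
        for line in out.stdout.splitlines():
            parts = line.split()          # each line is "<pid> <ProcessName>"
            if len(parts) >= 2:
                names.add(parts[1])
        return names

    def main():
        missing = EXPECTED_DAEMONS - running_java_processes()
        if missing:
            # The real script would alert or attempt a restart here.
            print("WARNING: Hadoop daemons not running: " + ", ".join(sorted(missing)))
            sys.exit(1)
        print("All expected Hadoop daemons are running.")

    if __name__ == "__main__":
        main()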

Environment: Hadoop, Talend, HBase, Python, MapR, ETL, HDFS, Hive, Java (JDK 1.7), Pig, ZooKeeper, Oozie, Flume, UNIX Shell Scripting, Teradata, Sqoop.

Confidential

System Engineer

Responsibilities:

  • Involved in analyzing system failures, identifying root causes, and recommending courses of action.
  • Python/PySpark programming on MapR Hadoop clusters, AWS EMR clusters, AWS Lambda functions, and CloudFormation templates (CFTs).
  • Worked on scalable distributed data systems using the Hadoop ecosystem on AWS EMR and MapR (MapR Data Platform).
  • Worked on Jenkins for CI/CD pipeline in AWS.
  • Developed simple to complex MapReduce streaming jobs using Python, Hive, and Pig.
  • Used various compression mechanisms to optimize MapReduce jobs and use HDFS efficiently.
  • Used the ETL tool Sqoop to extract data from MySQL and load it into HDFS.
  • Wrote Hive queries and Pig scripts to study customer behavior by analyzing the data.
  • Loaded data into Hive tables from the Hadoop Distributed File System (HDFS) to provide SQL-like access to Hadoop data (see the Hive loading sketch after this list).
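
As a sketch of the Hive loading and querying in this list, the PySpark snippet below reads delimited files from HDFS into a partitioned Hive table and runs a simple customer-behavior query; the paths, database, table, and column names are placeholders.

    from pyspark.sql import SparkSession

    # Hive-enabled session so spark.sql() talks to the Hive metastore.
    spark = (SparkSession.builder
             .appName("hdfs-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    # Read delimited files already landed in HDFS (placeholder path and schema).
    orders = (spark.read
              .option("header", "true")
              .option("inferSchema", "true")
              .csv("hdfs:///data/raw/orders/"))

    # Write into a partitioned Hive table to give analysts SQL-like access.
    spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
    (orders.write
           .mode("overwrite")
           .partitionBy("order_date")
           .saveAsTable("analytics.orders"))

    # Example customer-behavior query over the Hive table.
    spark.sql("""
        SELECT customer_id, COUNT(*) AS order_cnt, SUM(amount) AS total_spend
        FROM analytics.orders
        GROUP BY customer_id
        ORDER BY total_spend DESC
        LIMIT 20
    """).show()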
