Data Engineer Resume
San Jose, CA
SUMMARY:
- Around 5 years of experience in the IT industry as a software engineer, including 3 years of design and development using Hadoop big data ecosystem tools.
- Experience using Hadoop ecosystem components such as HDFS, YARN, MapReduce, Spark, Pig, Sqoop, Hive, Impala, HBase, and Kafka, along with crontab for scheduling.
- Experience in developing ETL applications on large volumes of data using MapReduce, Spark with Python, Spark SQL, Hive, and Pig.
- Well versed in big data on AWS cloud services, including EC2, S3, EMR, Lambda, Redshift, and CloudWatch.
- Experience in importing and exporting data between RDBMS and HDFS/Hive using Sqoop.
- Expertise in creating Spark UDFs in Python to analyze data sets with complex aggregation requirements.
- Experience in developing Spark jobs to run entity resolution on PII data for the United States population.
- Responsible for maintaining, running, and optimizing Spark jobs, along with data validation and automation.
- Experience in performing extensive data analysis in Spark using complex DataFrame and RDD transformations, and creating external Hive tables on top of the aggregated data for downstream consumption.
- Used shell scripting to automate the triggering of Spark jobs.
- Well versed in developing complex SQL queries using Hive and Spark SQL.
- Experienced in preparing and executing unit test plans and unit test cases during software development.
- Strong understanding of object-oriented programming concepts and implementation.
- Experience in providing training and guidance to new team members on the project.
- Used JIRA for creating tasks, logging work, and tracking issues.
- Solid experience in customer specification study, requirements gathering, system architectural design, and turning requirements into the final product or service.
- Experience in interacting with customers and working at client locations for real-time field testing of products and services.
- Ability to communicate and work effectively with associates at all levels within the organization.
- Strong background in mathematics, with very good analytical and problem-solving skills.
TECHNICAL SKILLS:
Programming Skills: Python, SQL, Scala
Big Data Ecosystem: HDFS, YARN, MapReduce, Spark Core, Spark SQL, Impala, Hive, Pig, Sqoop
AWS Stack: EC2, S3, EMR, Lambda, Redshift and CloudWatch
ELK Stack: Elasticsearch, Logstash, and Kibana; also Splunk
Scripting Languages: UNIX Shell scripting and Python scripting.
DBMS / RDBMS: Oracle 11g, SQL Server
BI Tools: Tableau
Version Control: Git and Bitbucket
CI/CD Tool: Jenkins
PROFESSIONAL EXPERIENCE:
Confidential, San Jose, CA
Data Engineer
Responsibilities:
- Develop software that combines data from Snowflake tables and true source data to create a combined layer, stored in AWS S3, using Python, object-oriented programming, Spark, and Hadoop technologies.
- Develop and run serverless Spark-based applications using AWS Lambda and PySpark to compute the enhanced layer and store it in AWS S3 buckets.
- Load data from Teradata using PySpark for some of the portfolios.
- Use Parquet and Avro file formats for different use cases based on downstream application needs.
- Create and update metadata in Nebula for the one lake AWS S3 objects of different portfolio datasets.
- Develop Python, shell, and Spark-based applications using the PyCharm IDE and the Anaconda distribution.
- Push AWS EMR and AWS Lambda logs to Elasticsearch using td-agent and Pied Piper configurations.
- Use Redshift to store and retrieve processed data from the combined layers.
- Create feed analysis dashboards in Kibana from documents stored in Elasticsearch.
- Create Cerebro views to access data in one lake using the Abracadata application through the Apache Presto service.
- Analyze failed jobs on the AWS EMR Spark cluster, identify the causes of failure, and tune the PySpark code to improve job performance and minimize failures.
- Use JIRA to update tasks and stories created by the agile lead.
- Analyze, store, and process data captured from the retail bank and direct bank using AWS cloud services such as S3, CloudFormation, Lambda, and EMR.
- Automate jobs using shell scripting and schedule those jobs to run at a specific time using crontab.
- Develop unit test cases for software before deploying it to production servers.
Environment: Snowflake, Teradata, Spark Core, AWS S3, EMR, Lambda, Athena, CloudFormation, CloudWatch, Python, Hive, Presto, Nebula, Crontab, Elasticsearch, and Kibana
Confidential, Piscataway, NJ
Data Engineer
Responsibilities:
- Extensively used Spark (RDDs, DataFrames, and Spark SQL) to develop multiple applications in both Python and Scala.
- Built multiple data pipelines using Pig scripts for processing data for specific applications.
- Used different file formats such as Parquet, Avro, and ORC for storing and retrieving data in Hadoop.
- Used Spark Streaming to consume event-based data from Kafka and joined it with existing Hive table data to generate performance indicators for an application.
- Developed analytical queries on different tables using Spark SQL to find insights, and built data pipelines that supply this data to data scientists for applying ML models.
- Tuned Spark performance by choosing optimal parallelism, selecting the serialization format for shuffled data, using broadcast variables, and optimizing joins, aggregations, and memory management.
- Wrote multiple custom Sqoop import scripts to load data from Oracle into HDFS directories and Hive tables.
- Used NiFi for automating and managing data flows between multiple systems.
- Used Snappy and Gzip compression when storing data in Hive tables to improve performance.
- Used Impala for faster querying to generate reports for a time-critical application.
- Used HBase for OLTP workloads in an application requiring high scalability on Hadoop.
- Wrote Sqoop export scripts to write data from HDFS into an Oracle database.
- Used Control-M to simplify and automate different batch workload applications.
- Worked closely with multiple data science and machine learning teams in building a data ecosystem to support AI.
- Applied different job tuning techniques while processing data with Hive and Spark to improve job performance.
Environment: Spark Core, Spark Streaming, Python, Hive, Impala, HBase, Sqoop, Kerberos (security), LDAP, and Control-M.
Confidential
Software Engineer
Responsibilities:
- Analyzed the Hadoop cluster and different big data analytic tools, including Pig, Hive, and Sqoop.
- Developed an ETL framework using Pig and Hive.
- Developed Spark programs for batch and real-time processing that consume incoming streams of data from Kafka sources, transform them into DataFrames, and load those DataFrames into Hive and HDFS.
- Developed SQL scripts using Spark for handling different data sets and verifying the performance of MapReduce jobs.
- Developed Spark programs using the Spark SQL library to perform analytics on data in Hive.
- Created multiple MapReduce jobs in Pig and Hive for data cleaning and preprocessing.
- Created Hive views/tables to provide a SQL-like interface.
- Loaded files into Hive and HDFS from Oracle and SQL Server using Sqoop.
- Wrote Hive jobs to parse logs and structure them in tabular format to facilitate effective querying of the log data.
- Used Hive to analyze the partitioned data and compute various metrics for reporting.
- Converted Impala queries into Hive scripts that can be run directly from shell commands for better performance.
- Created shell scripts that can be scheduled using Oozie workflows and Oozie coordinators.
- Developed Oozie workflows to generate monthly report files automatically.
- Managed and reviewed the Hadoop log files.
- Exported data from HDFS into RDBMS using Sqoop for report generation and visualization.
Environment: Hadoop, MapReduce, Sqoop, HDFS, Hive, Pig, Oozie, Oracle 10g, MySQL, and Impala.
