Data Engineer Resume San Jose, CA - Hire IT People

SUMMARY:

Around 5 years of experience in the IT industry as a software engineer, including 3 years of design and development using Hadoop big data ecosystem tools.
Experience using Hadoop ecosystem components such as HDFS, YARN, MapReduce, Spark, Pig, Sqoop, Hive, Impala, HBase, and Kafka, along with crontab for scheduling.
Experience developing ETL applications on large volumes of data using MapReduce, Spark with Python, Spark SQL, Hive, and Pig.
Well versed in big data on AWS cloud services, including EC2, S3, EMR, Lambda, Redshift, and CloudWatch.
Experience importing and exporting data between RDBMS and HDFS/Hive using Sqoop.
Expertise in creating Spark UDFs in Python to analyze data sets for complex aggregation requirements.
Experience developing Spark jobs to run entity resolution on PII data of the United States population.
Responsible for maintaining, running, and optimizing Spark jobs, along with data validation and automation.
Experience performing extensive data analysis in Spark using complex DataFrame and RDD transformations, and creating external Hive tables on top of the aggregated data for downstream consumption.
Used shell scripting to automate the triggering of Spark jobs.
Well versed in developing complex SQL queries using Hive and Spark SQL.
Experienced in preparing and executing unit test plans and unit test cases during software development.
Strong understanding of object-oriented programming concepts and implementation.
Experience providing training and guidance to new team members on the project.
Used JIRA for creating tasks, logging work, and tracking issues.
Very good experience in customer specification study, requirements gathering, and system architectural design, and in turning requirements into the final product or service.
Experience interacting with customers and working at client locations for real-time field testing of products and services.
Ability to communicate and work effectively with associates at all levels within the organization.
Strong background in mathematics, with very good analytical and problem-solving skills.

TECHNICAL SKILLS:

Programming Skills: Python, SQL, Scala

Big Data Ecosystem: HDFS, YARN, MapReduce, Spark Core, Spark SQL, Impala, Hive, Pig, Sqoop

AWS Stack: EC2, S3, EMR, Lambda, Redshift and CloudWatch

ELK Stack / Logging: Elasticsearch, Logstash, Kibana, and Splunk

Scripting Languages: UNIX shell scripting and Python scripting

DBMS / RDBMS: Oracle 11g, SQL Server

BI Tools: Tableau

Version Control: Git and Bitbucket

CI/CD Tool: Jenkins

PROFESSIONAL EXPERIENCE:

Confidential, San Jose, CA

Data Engineer

Responsibilities:

Develop software to combine data from Snowflake tables and true source data into a combined layer and store it in AWS S3, using Python, object-oriented programming, Spark, and Hadoop technologies.
Develop and run serverless Spark-based applications using the AWS Lambda service and PySpark to compute the enhanced layer and store it in AWS S3 buckets.
Load data from Teradata using PySpark for some of the portfolios.
Use Parquet and Avro file formats for different use cases based on downstream application needs.
Create and update metadata for different portfolio datasets for one lake AWS S3 objects in Nebula.
Develop Python, shell-scripting, and Spark-based applications using the PyCharm and Anaconda development environments.
Push AWS EMR and AWS Lambda logs to Elasticsearch using td-agent and Pied piper configurations.
Use Redshift for storing and retrieving processed data from combined layers.
Create feed analysis dashboards from documents in Elasticsearch using the Kibana visualization tool.
Create cerebro views to access the data in one lake using the Abracadata application through the Presto service.
Analyze failed jobs on the AWS EMR Spark cluster, identify the causes of failure, and use PySpark to improve job performance and minimize or avoid failures.
Use JIRA to update tasks and stories created by the agile lead.
Analyze, store, and process data captured from the retail bank and direct bank using AWS cloud services such as S3, CloudFormation, Lambda, and EMR.
Automate jobs using shell scripting and schedule those jobs to run at a specific time using crontab.
Develop unit test cases for the software before deploying it to production servers.

Environment: Snowflake, Teradata, Spark Core, AWS S3, EMR, Lambda, Athena, CloudFormation, CloudWatch, Python, Hive, Presto, Nebula, Crontab, Elasticsearch, and Kibana

Confidential, Piscataway, NJ

Data Engineer

Responsibilities:

Extensively used Spark Core (RDDs and DataFrames) and Spark SQL to develop multiple applications in both Python and Scala.
Built multiple data pipelines using Pig scripts for processing data for specific applications.
Used different file formats such as Parquet, Avro, and ORC for storing and retrieving data in Hadoop.
Used Spark Streaming to consume event-based data from Kafka and joined it with existing Hive table data to generate performance indicators for an application.
Developed analytical queries on different tables using Spark SQL to find insights, and built data pipelines so data scientists could consume this data when applying ML models.
Tuned Spark performance by applying different techniques: choosing optimum parallelism and the serialization format used while shuffling data, using broadcast variables, and optimizing joins, aggregations, and memory management.
Wrote multiple custom Sqoop import scripts to load data from Oracle into HDFS directories and Hive tables.
Used NiFi for automating and managing data flows between multiple systems.
Used different compression techniques (Snappy and Gzip) when storing data in Hive tables to improve performance.
Used Impala for faster querying in a time-critical application to generate reports.
Used HBase for OLTP in an application requiring high scalability on Hadoop.
Wrote Sqoop export scripts to write data from HDFS into an Oracle database.
Used Control-M to simplify and automate different batch workload applications.
Worked closely with multiple data science and machine learning teams in building a data ecosystem to support AI.
Applied different job tuning techniques while processing data with the Hive and Spark frameworks to improve job performance.

Environment: Spark Core, Spark Streaming, Python, Hive, Impala, HBase, Sqoop, Kerberos (security), LDAP, and Control-M.

Confidential

Software Engineer

Responsibilities:

Analyzed the Hadoop cluster and different big data analytic tools including Pig, Hive, and Sqoop.
Developed an ETL framework using Pig and Hive.
Developed Spark programs for batch and real-time processing to process incoming streams of data from Kafka sources, transform them into DataFrames, and load those DataFrames into Hive and HDFS.
Developed SQL scripts using Spark for handling different data sets and verifying the performance of MapReduce jobs.
Developed Spark programs using the Spark SQL library to perform analytics on data in Hive.
Created multiple MapReduce jobs in Pig and Hive for data cleaning and preprocessing.
Created Hive views/tables to provide a SQL-like interface.
Loaded files into Hive and HDFS from Oracle and SQL Server using Sqoop.
Wrote Hive jobs to parse logs and structure them in tabular format to facilitate effective querying of the log data.
Used Hive to analyze the partitioned data and compute various metrics for reporting.
Transformed Impala queries into Hive scripts that can be run directly from shell commands for higher performance.
Created shell scripts that can be scheduled using Oozie workflows and Oozie coordinators.
Developed Oozie workflows to generate monthly report files automatically.
Managed and reviewed Hadoop log files.
Exported data from HDFS into RDBMS using Sqoop for report generation and visualization.

Environment: Hadoop, MapReduce, Sqoop, HDFS, Hive, Pig, Oozie, Oracle 10g, MySQL, and Impala.