Data Engineer Resume

Dallas, TX

SUMMARY

  • 7+ years of professional software development experience, specializing in Java/J2EE technologies and Big Data analytics.
  • 5+ years of experience in ingestion, storage, querying, processing, and analysis of Big Data, with hands-on experience in the Hadoop ecosystem, including MapReduce, HDFS, Hive, Pig, Spark, HBase, ZooKeeper, Sqoop, Flume, and Oozie, on Hortonworks and AWS EMR.
  • In-depth understanding of Hadoop architecture and its components, including HDFS, MapReduce, Hadoop Gen2 Federation, High Availability, and YARN, along with a good understanding of workload management, scalability, and distributed platform architectures.
  • Experience in installing, configuring, supporting, and managing Hadoop clusters both on premises (CDH) and in the cloud (AWS EMR).
  • Experience working with various AWS big data analytics services, including EMR, S3, Redshift Spectrum, Athena, and Glue.
  • Strong experience building production-ready Spark applications, troubleshooting Spark failures, and fine-tuning Spark applications.
  • Extensive experience working with the Spark RDD API, Spark DataFrame API, Spark SQL, and Spark ML.
  • Strong experience and knowledge of real-time data analytics using Kafka, Flume, and Spark Streaming.
  • Experience extending Hive and Pig core functionality with custom UDFs (see the UDF sketch after this list).
  • Experience debugging MapReduce jobs using counters and MRUnit tests.
  • Good understanding of Spark ML algorithms such as Classification, Clustering, and Regression.
  • Experienced in moving data from different sources using Kafka producers and consumers and pre-processing data using Storm topologies.
  • Experienced in migrating data warehousing workloads into Hadoop-based data lakes using MapReduce, Hive, Pig, and Sqoop.
  • Good knowledge of streaming data from sources such as log files, JMS, and applications into HDFS using Flume sources.
  • Experience importing and exporting data between HDFS and relational database systems using Sqoop.
  • Authored Terraform modules for infrastructure management, including a module published to the Terraform Registry.
  • Extensive experience working with relational databases such as Teradata, Oracle, Netezza, SQL Server, and MySQL.
  • Excellent understanding and knowledge of NoSQL databases such as MongoDB, HBase, and Cassandra.
  • Worked on Docker-based containerized applications.
  • Worked on data migration projects from on-premises clusters to the cloud.
  • Knowledge of data warehousing, ETL, and BI tools such as Talend and Power BI.
  • Experienced with cluster monitoring tools such as Cloudera Manager and Ambari.
  • Experience testing MapReduce programs using MRUnit and JUnit.
  • Experience in design and development of web forms using Spring MVC, JavaScript, JSON, and jqPlot.
  • Experience with version control tools such as SVN and Git (GitHub), JIRA for issue tracking, and Crucible for code reviews.
  • Worked with various tools and IDEs, including Eclipse, IBM Rational, Visio, Apache Ant, MS Office, and PL/SQL Developer.
  • Experience in working with Onsite-Offshore model.
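
A minimal, illustrative Scala sketch of the kind of custom UDF work mentioned above, registered here with Spark SQL rather than Hive; the function, column, and path names are hypothetical.

    import org.apache.spark.sql.SparkSession

    object NormalizeCountryUdf {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("normalize-country-udf").getOrCreate()

        // Register a small UDF that normalizes free-form country values.
        spark.udf.register("normalize_country", (raw: String) =>
          Option(raw).map(_.trim.toUpperCase).map {
            case "USA" | "US" | "UNITED STATES" => "US"
            case other                          => other
          }.orNull)

        // Apply it through Spark SQL (input path is hypothetical).
        spark.read.parquet("s3://example-bucket/users/")
          .selectExpr("user_id", "normalize_country(country) AS country")
          .show(5)

        spark.stop()
      }
    }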

TECHNICAL SKILLS

Big Data Ecosystem: Hadoop, MapReduce, Pig, Hive, YARN, Kafka, Flume, Sqoop, Impala, Oozie, ZooKeeper, Spark

Hadoop Distributions: Cloudera, AWS EMR, Hortonworks

Languages: Java, Scala, Python, SQL, Shell Scripting

NoSQL Databases: Cassandra, HBase

IDE: Eclipse, IntelliJ

Database: Oracle 10g, MySQL, MSSQL

AWS Services: AWS EMR, EC2, S3, Redshift, Athena, Glue, Lambda Functions

PROFESSIONAL EXPERIENCE

Confidential, Dallas, TX

Data Engineer

Responsibilities:

  • Ingested user behavioral data from external servers such as FTP servers and S3 buckets on a daily basis using custom input adapters.
  • Created Spark JDBC applications to import/export user profile data between RDBMS and the S3 data lake.
  • Developed Spark applications to perform enrichments of user behavioral (clickstream) data merged with user profile data.
  • Involved in data cleansing, event enrichment, data aggregation, de-normalization, and data preparation needed for downstream model training and reporting.
  • Troubleshot Spark applications to improve error tolerance.
  • Fine-tuned Spark applications/jobs to improve efficiency and overall pipeline processing time.
  • Created a Kafka producer application to send live-stream data to various Kafka topics.
  • Developed Spark Streaming applications to consume data from Kafka topics and insert the processed streams into Redshift (a sketch follows this list).
  • Utilized Spark's in-memory capabilities to handle large datasets.
  • Used broadcast variables in Spark, along with effective and efficient joins, transformations, and other capabilities for data processing.
  • Experienced in working with EMR clusters and S3 in the AWS cloud.
  • Created Hive tables and loaded and analyzed data using Hive scripts; implemented partitioning, dynamic partitions, and bucketing in Hive.
  • Involved in continuous integration of the application using Jenkins.
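
A minimal Scala sketch of the kind of Kafka-to-Redshift streaming consumer described above, assuming Spark Structured Streaming with the Kafka source and a Redshift JDBC driver on the classpath; the brokers, topic, schema, table, and bucket names are hypothetical.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

    object ClickstreamToRedshift {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("clickstream-to-redshift").getOrCreate()

        // Hypothetical schema of the JSON events on the topic.
        val eventSchema = new StructType()
          .add("userId", StringType)
          .add("eventType", StringType)
          .add("eventTime", TimestampType)

        // Read the live stream from Kafka and parse the JSON payload.
        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   // hypothetical brokers
          .option("subscribe", "clickstream-events")            // hypothetical topic
          .load()
          .select(from_json(col("value").cast("string"), eventSchema).as("e"))
          .select("e.*")

        // Append each micro-batch to a Redshift table over JDBC.
        events.writeStream
          .foreachBatch { (batch: DataFrame, batchId: Long) =>
            batch.write
              .format("jdbc")
              .option("url", "jdbc:redshift://example-cluster:5439/analytics") // hypothetical
              .option("dbtable", "public.clickstream_events")
              .option("user", sys.env("REDSHIFT_USER"))
              .option("password", sys.env("REDSHIFT_PASSWORD"))
              .mode("append")
              .save()
          }
          .option("checkpointLocation", "s3://example-bucket/checkpoints/clickstream/")
          .start()
          .awaitTermination()
      }
    }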

Environment: AWS EMR, Spark, Hive, HDFS, Sqoop, Kafka, PySpark, Scala, Lambda functions, MapReduce.

Confidential, Durham, NC

Big Data Engineer

Responsibilities:

  • Developed series of data ingestion jobs for collecting the data from multiple channels and external applications in Scala.
  • Worked on both batch and streaming ingestion of the data.
  • Imported clickstream log data from FTP servers and performed various data transformations using the Spark DataFrame API and Spark SQL (a sketch follows this list).
  • Implemented Java based Kafka Producer applications for streaming messages to Kafka topics.
  • Built Spark Streaming applications for consuming messages and writing to HBase.
  • Worked on troubleshooting and optimizing Spark Applications.
  • Worked on ingesting data from SQL Server to S3 using Sqoop within AWS EMR.
  • Migrated MapReduce jobs to Spark applications built in Scala and integrated with Apache Phoenix and HBase.
  • Involved in loading and transforming large sets of data and analyzed them using Hive Scripts.
  • Implemented SQL queries on AWS platforms such as Athena and Redshift.
  • Queried alert data arriving in S3 buckets with AWS Athena to find the differences in time intervals between the Kafka and Kinesis clusters.
  • Loaded a portion of the processed data into Redshift tables and automated the process.
  • Worked on various performance optimizations in Spark, such as broadcast variables, dynamic allocation, and partitioning, and built custom Spark UDFs.
  • Fine-tuned long-running Hive queries using proven standards such as the Parquet columnar format, partitioning, and vectorized execution.
  • Analyzed the data using Spark DataFrames and a series of Hive scripts to produce summarized results for downstream systems.
  • Worked with Data Science team in developing Spark ML applications to develop various predictive models.
  • Interacted with the project team to organize timelines, responsibilities, and deliverables, and provided all aspects of technical support.
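
A minimal Scala sketch of the kind of DataFrame transformation described above: raw clickstream logs aggregated per day and written as date-partitioned Parquet so downstream Hive/Athena queries can prune partitions; the paths and column names are hypothetical.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, count, lit, to_date}

    object ClickstreamDailyAggregate {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("clickstream-daily-aggregate").getOrCreate()

        // Raw JSON clickstream logs landed by the ingestion jobs (hypothetical path).
        val raw = spark.read.json("s3://example-bucket/raw/clickstream/")

        // Count events per user, type, and day with the DataFrame API.
        val daily = raw
          .withColumn("event_date", to_date(col("eventTime")))
          .groupBy(col("event_date"), col("userId"), col("eventType"))
          .agg(count(lit(1)).as("event_count"))

        // Date-partitioned Parquet keeps long-running Hive/Athena queries fast.
        daily.write
          .mode("overwrite")
          .partitionBy("event_date")
          .parquet("s3://example-bucket/curated/clickstream_daily/")

        spark.stop()
      }
    }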

Environment: Hadoop, Spark, Scala, Hive, Sqoop, Oozie, Kafka, AWS EMR, Redshift, S3, Kinesis, Spark Streaming, Athena, HBase, YARN, JIRA, Shell Scripting, Maven, Git.

Confidential, San Francisco, CA

Big Data Engineer

Responsibilities:

  • Worked on building centralized Data Lake on AWS Cloud utilizing primary services like S3, EMR, Redshift and Athena.
  • Worked on migrating datasets and ETL workloads from on-premises systems to AWS Cloud services.
  • Built a series of Spark applications and Hive scripts to produce various analytical datasets needed for digital marketing teams.
  • Worked extensively on building and automating data ingestion pipelines and moving terabytes of data from existing data warehouses to the cloud.
  • Worked extensively on fine tuning spark applications and providing production support to various pipelines running in production.
  • Worked closely with business teams and data science teams and ensured all the requirements are translated accurately into our data pipelines.
  • Worked on full spectrum of data engineering pipelines: data ingestion, data transformations and data analysis/consumption.
  • Automated infrastructure setup, including launching and terminating EMR clusters.
  • Created Hive external tables on top of datasets loaded into S3 buckets and wrote Hive scripts to produce a series of aggregated datasets for downstream analysis (a sketch follows this list).
  • Created Terraform modules for infrastructure management.
  • Built a real-time streaming pipeline utilizing Kafka, Spark Streaming, and Redshift.
  • Created Kafka producers using the Kafka Java Producer API to connect to an external REST live-stream application and produce messages to Kafka topics.
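
A minimal Scala sketch of registering a Hive external table over curated S3 data and producing one aggregated dataset with Spark SQL, along the lines described above; the database, table, and bucket names are hypothetical.

    import org.apache.spark.sql.SparkSession

    object RegisterClickstreamTable {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("register-clickstream-table")
          .enableHiveSupport()
          .getOrCreate()

        spark.sql("CREATE DATABASE IF NOT EXISTS marketing")

        // External table over curated Parquet data already sitting in S3.
        spark.sql(
          """CREATE EXTERNAL TABLE IF NOT EXISTS marketing.clickstream_daily (
            |  user_id STRING,
            |  event_type STRING,
            |  event_count BIGINT)
            |PARTITIONED BY (event_date DATE)
            |STORED AS PARQUET
            |LOCATION 's3://example-bucket/curated/clickstream_daily/'""".stripMargin)

        // Register the partitions that were written directly to S3.
        spark.sql("MSCK REPAIR TABLE marketing.clickstream_daily")

        // One of the aggregated datasets handed to the digital marketing teams.
        spark.sql(
          """SELECT event_date, event_type, SUM(event_count) AS events
            |FROM marketing.clickstream_daily
            |GROUP BY event_date, event_type""".stripMargin)
          .write
          .mode("overwrite")
          .parquet("s3://example-bucket/reports/events_by_type/")

        spark.stop()
      }
    }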

Environment: AWS S3, EMR, Redshift, Athena, Glue, Spark, Scala, Python, Java, Hive, Kafka, Terraform.

Confidential

Hadoop Developer

Responsibilities:

  • Involved in importing and exporting data between Hadoop Data Lake and Relational Systems like Oracle, MySQL using Sqoop.
  • Involved in developing Spark applications to perform ELT operations on the data.
  • Converted existing MapReduce jobs to Spark transformations and actions using Spark RDDs, DataFrames, and the Spark SQL API.
  • Utilized Hive partitioning and bucketing and performed various kinds of joins on Hive tables.
  • Involved in creating Hive external tables to perform ETL on data produced on a daily basis.
  • Validated the data being ingested into Hive for further filtering and cleansing.
  • Developed Sqoop jobs to perform incremental loads from RDBMS into HDFS and applied Spark transformations on the landed data.
  • Loaded data into Hive tables from Spark using the Parquet columnar format (a sketch follows this list).
  • Created Oozie workflows to automate and productionize the data pipelines.
  • Migrated MapReduce code into Spark transformations using Spark and Scala.
  • Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
  • Documented operational problems, following standards and procedures, using JIRA.
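
A minimal Scala sketch of the load step described above: an incremental extract landed in HDFS by a Sqoop job is appended to a partitioned, Parquet-backed Hive table; the paths, table, and column names are hypothetical, and the Sqoop import is assumed to run with --as-parquetfile.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, to_date}

    object LoadOrdersIncrement {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("load-orders-increment")
          .enableHiveSupport()
          .getOrCreate()

        spark.sql("CREATE DATABASE IF NOT EXISTS sales")

        // Latest incremental extract written to HDFS by the Sqoop job (hypothetical path).
        val increment = spark.read.parquet("hdfs:///data/landing/orders/latest/")

        // Derive the partition column and append to a Parquet-backed Hive table.
        increment
          .withColumn("order_date", to_date(col("created_at")))
          .write
          .mode("append")
          .format("parquet")
          .partitionBy("order_date")
          .saveAsTable("sales.orders")

        spark.stop()
      }
    }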

Environment: Hadoop, Hive, Impala, Oracle, Spark, Pig, Sqoop, Oozie, MapReduce, Git, Confluence, Jenkins.

Confidential

Java Developer

Responsibilities:

  • Designed & developed the application using Spring Framework
  • Developed class diagrams, sequence diagrams, and use case diagrams in UML using Rational Rose.
  • Designed the application with reusable J2EE design patterns
  • Designed DAO objects for accessing RDBMS
  • Developed web pages using JSP, HTML, DHTML and JSTL
  • Designed and developed a web-based client using Servlets, JSP, Tag Libraries, JavaScript, HTML and XML using Struts Framework.
  • Involved in developing JSP forms.
  • Designed and developed web pages using HTML and JSP.
  • Designed various applets using JBuilder.
  • Designed and developed Servlets to communicate between presentation and business layer.
  • Actively worked on and supported the creation of database schema objects (tables, stored procedures, and triggers) using Oracle SQL and PL/SQL.

Environment: Java / J2EE, JSP, Web Services, SOAP, Eclipse, Rational Rose, HTML, XPATH, XSLT, DOM and JDBC.
