
Data Engineer Resume


Saint Louis, MO

SUMMARY

  • 7+ years of technical expertise across the complete software development life cycle (SDLC), including 6 years of Data Engineering experience using Hadoop and the Big Data stack.
  • Hands-on experience working with Spark and Hadoop ecosystem components such as MapReduce, Sqoop, Hive, Flume, Kafka, and Zookeeper, and with NoSQL databases like HBase.
  • Excellent knowledge and understanding of distributed computing and parallel processing frameworks.
  • Strong experience developing end-to-end Spark applications in Scala.
  • Worked extensively on troubleshooting memory management and resource management issues within Spark applications.
  • Strong knowledge of fine-tuning Spark applications and Hive scripts (see the configuration sketch after this list).
  • Wrote complex MapReduce jobs to perform various data transformations on large-scale datasets.
  • Experience in installation, configuration, and monitoring of Hadoop clusters, both in-house and in the cloud (AWS).
  • Good experience working with AWS cloud services such as S3, EMR, Redshift, Athena, and the Glue metastore.
  • Extended Hive core functionality by writing custom UDFs for data analysis.
  • Handled data imports from various sources, performed transformations, and developed and debugged MR2 jobs to process large datasets.
  • Experience in writing queries in HQL (Hive Query Language) to perform data analysis.
  • Created Hive external and managed tables.
  • Implemented partitioning and bucketing on Hive tables for query optimization.
  • Experienced in writing Oozie workflows and coordinator jobs to schedule sequential Hadoop jobs.
  • Experience using Apache Flume to collect, aggregate, and move large amounts of data from application servers.
  • Good experience utilizing Sqoop extensively for ingesting data from relational databases.
  • Good knowledge of Kafka for streaming real-time feeds from external REST applications to Kafka topics.
  • Worked on building real time data workflows using Kafka, Spark Streaming and HBase.
  • Good understanding of Relational Databases like MySQL, Postgres, Oracle, and Teradata.
  • Experienced in using Git and SVN.
  • Ability to work with build tools like Apache Maven and SBT.
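
For illustration, the following is a minimal Scala sketch of the kind of Spark resource and memory tuning referenced above; all configuration values are hypothetical placeholders rather than settings from any specific cluster.

    import org.apache.spark.sql.SparkSession

    // Minimal sketch of a tuned Spark application entry point.
    // Values are illustrative placeholders, not production settings.
    object TunedSparkApp {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("tuned-enrichment-job")
          .config("spark.executor.memory", "8g")          // executor heap size
          .config("spark.executor.cores", "4")            // cores per executor
          .config("spark.sql.shuffle.partitions", "400")  // shuffle parallelism for joins/aggregations
          .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          .enableHiveSupport()
          .getOrCreate()

        // ... job logic ...
        spark.stop()
      }
    }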

TECHNICAL SKILLS

Big Data Ecosystem: HDFS, MapReduce, Pig, Hive, Spark 2.x/1.x, YARN, Kafka 2.10, Flume, Sqoop, Impala, Oozie, Zookeeper, Ambari

Cloud Environment: AWS, Google Cloud

Hadoop Distributions: Cloudera CDH 6.1/5.12/5., Hortonworks, MAPR

ETL: Talend

Languages: Python, Shell Scripting, Scala

NoSQL Databases: MongoDB, HBase, DynamoDB

Development / Build Tools: Eclipse, Git, IntelliJ, Log4j

RDBMS: Oracle 10g, 11i, MS SQL Server, DB2

Testing: MRUnit Testing, Quality Center (QC)

Virtualization: VMWare, AWS/EC2, Google Compute Engine

Build Tools: Maven, Ant, SBT

PROFESSIONAL EXPERIENCE

Confidential, Saint Louis, MO

Data Engineer

Responsibilities:

  • Developed custom input adaptors for ingesting clickstream data from external sources such as FTP servers into S3-backed data lakes on a daily basis.
  • Created various Spark applications in Scala to perform a series of enrichments on this clickstream data, combining it with the users' enterprise data (see the sketch after this list).
  • Implemented batch processing of jobs using Spark Scala API.
  • Developed Sqoop scripts to import/export data from Teradata to HDFS and into Hive tables.
  • Optimized Hive tables using techniques such as partitioning and bucketing to provide better performance for HiveQL queries.
  • Worked with multiple file formats like Avro, Parquet, and ORC.
  • Converted existing MapReduce programs to Spark Applications for handling semi structured data like JSON files, Apache Log files, and other custom log data.
  • Wrote Kafka producers to stream data from external REST APIs to Kafka topics.
  • Wrote Spark-Streaming applications to consume the data from Kafka topics and write the processed streams to HBase.
  • Experienced in handling large datasets using Spark in-memory capabilities, broadcast variables, effective and efficient joins, transformations, and other optimizations.
  • Worked extensively wif Sqoop for importing data from Teradata.
  • Implemented business logic in Hive and wrote UDFs to process the data for analysis.
  • Utilized AWS services such as S3, EMR, Redshift, Athena, and the Glue metastore for building and managing data pipelines within the cloud.
  • Automated EMR Cluster creation and termination using AWS Java SDK.
  • Loaded the processed data to Redshift clusters using Spark-Redshift integration.
  • Created views within Athena to allow the downstream reporting and data analysis teams to query and analyze the results.
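
For illustration, a minimal Scala sketch of the clickstream enrichment pattern described above, assuming hypothetical bucket names, paths, table names, and columns:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    // Illustrative enrichment job: joins raw clickstream events from S3 with
    // enterprise user data and writes the enriched result back to the data lake.
    object ClickstreamEnrichment {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("clickstream-enrichment")
          .enableHiveSupport()
          .getOrCreate()

        // Raw clickstream JSON landed daily by the custom input adaptors
        val clicks = spark.read.json("s3://example-datalake/raw/clickstream/dt=2021-01-01/")

        // Enterprise user dimension maintained as a Hive table
        val users = spark.table("enterprise.user_profile")

        // Broadcast the smaller user dimension to avoid shuffling the large click data
        val enriched = clicks
          .join(broadcast(users), Seq("user_id"), "left")
          .withColumn("event_date", to_date(col("event_ts")))

        // Columnar output partitioned by date; downstream loads (e.g. a Redshift COPY) pick this up
        enriched.write
          .mode("overwrite")
          .partitionBy("event_date")
          .parquet("s3://example-datalake/enriched/clickstream/")

        spark.stop()
      }
    }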

Environment: AWS Services (S3, EMR, Redshift, Athena, Glue metastore), Spark, Hive, Teradata, Scala, Python.

Confidential, Tampa, FL

Data Engineer

Responsibilities:

  • Developed Spark applications using Scala, utilizing DataFrames and the Spark SQL API for faster processing of data.
  • Developed highly optimized Spark applications to perform data cleansing, validation, transformation, and summarization activities according to the requirements.
  • Built a data pipeline consisting of Spark, Hive, Sqoop, and custom-built input adapters to ingest, transform, and analyze operational data.
  • Developed Spark jobs and Hive Jobs to summarize and transform data.
  • Used Spark for interactive queries, processing of streaming data, and integration with a popular NoSQL database to handle large volumes of data.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
  • Analyzed the SQL scripts and designed the solution to implement using Scala.
  • Built real-time data pipelines by developing Kafka producers and Spark Streaming applications to consume them (see the sketch after this list).
  • Ingested syslog messages, parsed them, and streamed the data to Kafka.
  • Handled importing data from different sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and then loaded the data into HDFS.
  • Exported the analyzed data to the relational databases using Sqoop, to further visualize and generate reports for the BI team.
  • Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.
  • Analyzed the data by performing Hive queries (Hive QL) to study customer behavior.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Developed Hive scripts in Hive QL to de-normalize and aggregate the data.
  • Scheduled and executed workflows in Oozie to run various jobs.
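
For illustration, a minimal Scala sketch of the streaming consumer side of the pipeline described above, assuming hypothetical broker addresses, topic name, and paths (requires the spark-sql-kafka connector):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    // Illustrative streaming consumer: reads syslog events from a Kafka topic
    // and lands them on HDFS as Parquet for downstream Hive analysis.
    object SyslogStreamConsumer {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("syslog-stream-consumer")
          .getOrCreate()

        val raw = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "syslog-events")
          .option("startingOffsets", "latest")
          .load()

        // Kafka values arrive as bytes; cast to string and derive a simple host field
        val events = raw.selectExpr("CAST(value AS STRING) AS line", "timestamp")
          .withColumn("host", split(col("line"), " ").getItem(0))

        // Append micro-batches to HDFS
        val query = events.writeStream
          .format("parquet")
          .option("path", "hdfs:///data/streams/syslog/")
          .option("checkpointLocation", "hdfs:///checkpoints/syslog/")
          .outputMode("append")
          .start()

        query.awaitTermination()
      }
    }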

Environment: Hadoop, HDFS, HBase, Spark, Scala, Hive, MapReduce, Sqoop, ETL, Java

Confidential, Sterling, VA

Hadoop Engineer

Responsibilities:

  • Involved in requirement analysis, design, coding, and implementation phases of the project.
  • Loaded the data from Teradata into MapR using Teradata Hadoop connectors.
  • Converted existing MapReduce jobs into Spark transformations and actions using Spark RDDs, DataFrames, and Spark SQL APIs.
  • Wrote new Spark jobs in Scala to analyze customer data and sales history.
  • Used Kafka to get data from many streaming sources into HDFS.
  • Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
  • Applied Hive partitioning and bucketing and performed different types of joins on Hive tables.
  • Created Hive external tables to perform ETL on data that is generated on a daily basis.
  • Wrote HBase bulk-load jobs to load processed data into HBase tables by converting it to HFiles.
  • Performed validation on the data ingested to filter and cleanse the data in Hive.
  • Created Sqoop jobs to handle incremental loads from RDBMS into HDFS and applied Spark transformations.
  • Loaded the data into Hive tables from Spark using the ORC columnar format (see the sketch after this list).
  • Developed Oozie workflows to automate and productionize the data pipelines.
  • Developed Sqoop import Scripts for importing data from Netezza.
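
For illustration, a minimal Scala sketch of writing cleansed data from Spark into a partitioned Hive table stored as ORC, as described above; the database, table, and column names are hypothetical:

    import org.apache.spark.sql.SparkSession

    // Illustrative load of validated data into a partitioned, ORC-backed Hive table.
    object LoadToHiveOrc {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("load-to-hive-orc")
          .enableHiveSupport()
          .getOrCreate()

        // Data already ingested (e.g. via Sqoop incremental loads) and cleansed in Spark
        val sales = spark.table("staging.sales_incremental")
          .filter("amount IS NOT NULL") // simple validation/cleansing step

        // Allow dynamic partition inserts
        spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

        // Write in ORC format, partitioned by sale_date
        sales.write
          .format("orc")
          .partitionBy("sale_date")
          .mode("append")
          .saveAsTable("analytics.sales_orc")

        spark.stop()
      }
    }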

Environment: HDP, MapReduce, Spark, Yarn, Hive, Tez, HBase, Oozie, Sqoop, Flume, Teradata, Netezza.

Confidential

Hadoop Developer

Responsibilities:

  • Worked on migrating MapReduce programs into Spark transformations using Spark and Python.
  • Developed Spark jobs using Python along with YARN/MRv2 for interactive and batch analysis.
  • Queried data using Spark SQL on the Spark engine for faster dataset processing (see the sketch after this list).
  • Extensively used Elastic Load Balancing with Auto Scaling to scale EC2 capacity across multiple availability zones in a region and distribute incoming high traffic to the application with zero downtime.
  • Created Partitioned Hive tables and worked on them using HiveQL.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
  • Used Data Frames and Datasets APIs for performing analysis on Hive tables.
  • Monitored the Hadoop cluster using Cloudera Manager, interacted with Cloudera support, logged issues in the Cloudera portal, and fixed them per the recommendations.
  • Responsible for Cloudera Hadoop upgrades and patches and for installing ecosystem products through Cloudera Manager, along with Cloudera Manager upgrades.
  • Used Sqoop for large data transfers from RDBMS to HDFS/HBase/Hive and vice-versa.
  • Worked with continuous integration tools like Jenkins and automated end-of-day jar builds.
  • Developed Unix shell scripts to load many files into HDFS from Linux File System.
  • Monitored workload, job performance and capacity planning using Cloudera Manager.
  • Used Impala connectivity from the user interface (UI) and queried the results using Impala SQL.
  • Used Zookeeper to coordinate the servers in clusters and to maintain the data consistency.
  • Continuously monitored and managed the Hadoop cluster using Cloudera manager and Web UI.
  • Used OOZIE Operational Services for batch processing and scheduling workflows dynamically.
  • Managed and scheduled several jobs to run over a certain period on Hadoop cluster using Oozie.
  • Supported the setting up of the QA environment and implemented scripts with Pig, Hive, and Sqoop.
  • Followed Agile Methodology for entire project and supported testing teams.
  • Worked with customers and the product manager to prioritize and validate requirements.
  • Completed plans for long term goals using Microsoft Project.
  • Coordinated the work efforts of an 8-person team across various projects, helped the team complete tasks successfully and on time, and resolved obstacles encountered by team members.
  • Coordinated and participated in weekly estimation meetings to provide high-level estimates (Story Points) for backlog items.
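
The work on this project used Python/PySpark; for illustration, the following is an equivalent Scala sketch of querying a partitioned Hive table through Spark SQL, with hypothetical table and column names:

    import org.apache.spark.sql.SparkSession

    // Illustrative Spark SQL query against a partitioned Hive table.
    // The predicate on the partition column (event_date) enables partition pruning.
    object HiveSparkSqlQuery {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-spark-sql-query")
          .enableHiveSupport()
          .getOrCreate()

        val dailyCounts = spark.sql(
          """SELECT event_date, COUNT(*) AS events
            |FROM analytics.web_events
            |WHERE event_date >= '2020-01-01'
            |GROUP BY event_date""".stripMargin)

        dailyCounts.show(20)
        spark.stop()
      }
    }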

Environment: Hadoop, HDFS, Hive, MapReduce, Impala, Sqoop, SQL, Talend, Python, PySpark, Yarn, Pig, Oozie, Linux-Ubuntu, AWS, Tableau, Maven, Jenkins, Cloudera, JUnit, Agile methodology.

Confidential

Java Developer

Responsibilities:

  • Reviewed requirements wif the support group and developed an initial prototype.
  • Involved in the analysis, design, and development of application components using JSP and Servlets, following J2EE design patterns.
  • Wrote specifications for the development.
  • Wrote JSPs and Servlets and deployed them on the WebLogic application server.
  • Implemented the Struts framework based on the Model-View-Controller design paradigm.
  • Implemented the MVC architecture using Struts.
  • Struts-Config XML file was created, and Action mappings were done.
  • Designed the application by implementing Struts based on the MVC architecture, with simple JavaBeans as the Model, JSP UI components as the View, and the ActionServlet as the Controller.
  • Wrote Oracle PL/SQL Stored procedures, triggers, views for backend database access.
  • Used JSP and HTML on the front end, Servlets as front controllers, and JavaScript for client-side validations.
  • Participated in server-side and client-side programming.
  • Wrote SQL stored procedures, used JDBC to connect to database.
  • Designed, developed, and maintained the data layer using JDBC and performed configuration of the Java application framework.
  • Worked on triggers and stored procedures on Oracle database.
  • Worked on Eclipse IDE to write the code and integrate the application.
  • Communicated between different applications using JMS.
  • Extensively worked on PL/SQL, SQL.
  • Developed different modules using J2EE (Servlets, JSP, JDBC, JNDI).
  • Tested and validated the application on different testing environments.
  • Performed functional, integration and validation testing.

Technical Environment: Java, J2EE, Struts, JSP, HTML, Servlets, Java Script, Rational Rose, SQL, PL-SQL, JDBC, MS Excel, UML, Apache Tomcat.
