Data Engineer Resume
Saint Louis, MO
SUMMARY
- 7+ years of technical expertise across the complete software development life cycle (SDLC), including 6 years of data engineering experience with Hadoop and the Big Data stack.
- Hands-on experience with the Spark and Hadoop ecosystems, including MapReduce, Sqoop, Hive, Pig, Flume, Kafka, and ZooKeeper, and with NoSQL databases such as HBase.
- Excellent knowledge and understanding of distributed computing and parallel processing frameworks.
- Strong experience developing end-to-end Spark applications in Scala.
- Worked extensively on troubleshooting memory-management and resource-management issues within Spark applications.
- Strong knowledge of fine-tuning Spark applications and Hive scripts (a representative sketch follows this list).
- Wrote complex MapReduce jobs to perform data transformations on large-scale datasets.
- Experience installing, configuring, and monitoring Hadoop clusters both in-house and in the cloud (AWS).
- Good experience working with AWS cloud services such as S3, EMR, Redshift, Athena, and the Glue metastore.
- Extended Hive core functionality by writing custom UDFs for data analysis.
- Handled data imports from various sources, performed transformations, and developed and debugged MR2 jobs to process large datasets.
- Experience writing queries in HQL (Hive Query Language) for data analysis.
- Created Hive external and managed tables.
- Implemented partitioning and bucketing on Hive tables for query optimization.
- Experienced in writing Oozie workflows and coordinator jobs to schedule sequential Hadoop jobs.
- Experience using Apache Flume to collect, aggregate, and move large amounts of data from application servers.
- Extensive experience using Sqoop to ingest data from relational databases.
- Good knowledge of Kafka for streaming real-time feeds from external REST applications into Kafka topics.
- Built real-time data workflows using Kafka, Spark Streaming, and HBase.
- Good understanding of relational databases such as MySQL, Postgres, Oracle, and Teradata.
- Experienced with Git and SVN.
- Comfortable with build tools such as Apache Maven and SBT.
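As a concrete illustration of the Spark fine-tuning referenced above, here is a minimal sketch of a broadcast join with a partitioned output; the bucket paths, tuning values, and column names are hypothetical, not taken from any actual engagement.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object TunedJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tuned-join-example")
      // Illustrative tuning knobs: shuffle parallelism and Kryo serialization.
      .config("spark.sql.shuffle.partitions", "200")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    val events = spark.read.parquet("s3://example-bucket/events/") // large fact data
    val users  = spark.read.parquet("s3://example-bucket/users/")  // small dimension

    // Broadcasting the small side avoids shuffling the large dataset.
    val joined = events.join(broadcast(users), Seq("user_id"))

    // A partitioned layout keeps downstream queries pruned to one day's data.
    joined.write
      .partitionBy("event_date")
      .mode("overwrite")
      .parquet("s3://example-bucket/enriched/")
  }
}
```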
TECHNICAL SKILLS
Big Data Ecosystem: HDFS, MapReduce, Pig, Hive, Spark 2.x/1.x, YARN, Kafka 2.10, Flume, Sqoop, Impala, Oozie, ZooKeeper, Ambari
Cloud Environment: AWS, Google Cloud
Hadoop Distributions: Cloudera CDH 6.1/5.12/5., Hortonworks, MapR
ETL: Talend
Languages: Python, Shell Scripting, Scala
NoSQL Databases: MongoDB, HBase, DynamoDB
Development / Build Tools: Eclipse, Git, IntelliJ, Log4j
RDBMS: Oracle 10g, 11i, MS SQL Server, DB2
Testing: MRUnit Testing, Quality Center (QC)
Virtualization: VMware, AWS/EC2, Google Compute Engine
Build Tools: Maven, Ant, SBT
PROFESSIONAL EXPERIENCE
Confidential, Saint Louis, MO
Data Engineer
Responsibilities:
- Developed custom input adapters to ingest clickstream data from external sources such as FTP servers into S3-backed data lakes on a daily basis.
- Created Spark applications in Scala to perform a series of enrichments on the clickstream data, combining it with enterprise user data.
- Implemented batch processing jobs using the Spark Scala API.
- Developed Sqoop scripts to import and export data between Teradata and HDFS and load it into Hive tables.
- Optimized Hive tables with techniques such as partitioning and bucketing to improve HiveQL query performance.
- Worked with multiple file formats, including Avro, Parquet, and ORC.
- Converted existing MapReduce programs to Spark applications for handling semi-structured data such as JSON files, Apache log files, and other custom log data.
- Wrote Kafka producers to stream data from external REST APIs to Kafka topics.
- Wrote Spark Streaming applications to consume data from Kafka topics and write the processed streams to HBase (see the sketch after this list).
- Experienced in handling large datasets using Spark's in-memory capabilities, broadcast variables, effective and efficient joins, transformations, and other optimizations.
- Worked extensively with Sqoop for importing data from Teradata.
- Implemented business logic in Hive and wrote UDFs to process the data for analysis.
- Utilized AWS services such as S3, EMR, Redshift, Athena, and the Glue metastore for building and managing data pipelines in the cloud.
- Automated EMR cluster creation and termination using the AWS Java SDK.
- Loaded the processed data into Redshift clusters using the Spark-Redshift integration.
- Created views within Athena to allow downstream reporting and data analysis teams to query and analyze the results.
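A minimal sketch of the Kafka-to-HBase streaming path described above, using the Spark Streaming Kafka 0.10 integration and the HBase client API; the broker address, consumer group, topic, and table names are hypothetical.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

object ClicksToHBase {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("clicks-to-hbase"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092", // hypothetical broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "clicks-consumer",
      "auto.offset.reset" -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("clicks"), kafkaParams))

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // Open one HBase connection per partition, not per record.
        val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("clicks"))
        records.foreach { rec =>
          // Fall back to the Kafka offset when the message has no key.
          val rowKey = Option(rec.key).getOrElse(rec.offset.toString)
          val put = new Put(Bytes.toBytes(rowKey))
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("event"), Bytes.toBytes(rec.value))
          table.put(put)
        }
        table.close()
        conn.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```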
Environment: AWS services (S3, EMR, Redshift, Athena, Glue metastore), Spark, Hive, Teradata, Scala, Python.
Confidential, Tampa, FL
Data Engineer
Responsibilities:
- Developed Spark applications in Scala using DataFrames and the Spark SQL API for faster data processing.
- Developed highly optimized Spark applications to perform data cleansing, validation, transformation, and summarization activities according to requirements.
- Built a data pipeline consisting of Spark, Hive, Sqoop, and custom-built input adapters to ingest, transform, and analyze operational data.
- Developed Spark jobs and Hive jobs to summarize and transform data.
- Used Spark for interactive queries, streaming data processing, and integration with popular NoSQL databases for high data volumes.
- Converted Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
- Analyzed SQL scripts and designed solutions for implementation in Scala.
- Built real-time data pipelines by developing Kafka producers and Spark Streaming applications to consume the data (see the producer sketch after this list).
- Ingested syslog messages, parsed them, and streamed the data to Kafka.
- Imported data from different sources using Sqoop, performed transformations using Hive and MapReduce, and then loaded the data into HDFS.
- Exported the analyzed data to relational databases using Sqoop for visualization and report generation by the BI team.
- Collected and aggregated large amounts of log data using Flume and staged it in HDFS for further analysis.
- Analyzed the data with Hive queries (HiveQL) to study customer behavior.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Developed HiveQL scripts to de-normalize and aggregate the data.
- Scheduled and executed workflows in Oozie to run various jobs.
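A minimal sketch of a syslog-style Kafka producer like the one described above; the broker address, topic name, and assumed line format are hypothetical.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SyslogProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // hypothetical broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("acks", "all") // wait for the full in-sync replica set

    val producer = new KafkaProducer[String, String](props)
    try {
      // Read syslog lines from stdin; a real adapter would listen on a socket.
      scala.io.Source.stdin.getLines().foreach { line =>
        // Key by hostname (4th whitespace field in classic syslog) so one
        // host's messages land on one partition and stay ordered.
        val host = line.split("\\s+").lift(3).getOrElse("unknown")
        producer.send(new ProducerRecord[String, String]("syslog", host, line))
      }
    } finally {
      producer.close()
    }
  }
}
```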
Environment: Hadoop, HDFS, HBase, Spark, Scala, Hive, MapReduce, Sqoop, ETL, Java
Confidential, Sterling, VA
Hadoop Engineer
Responsibilities:
- Involved in requirement analysis, design, coding, and implementation phases of the project.
- Loaded data from Teradata to MapR using the Teradata Hadoop connectors.
- Converted existing MapReduce jobs into Spark transformations and actions using Spark RDDs, DataFrames, and the Spark SQL APIs.
- Wrote new Spark jobs in Scala to analyze customer and sales-history data.
- Used Kafka to get data from many streaming sources into HDFS.
- Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
- Applied Hive partitioning and bucketing and performed different types of joins on Hive tables.
- Created Hive external tables to perform ETL on data generated on a daily basis.
- Wrote HBase bulk-load jobs to load processed data into HBase tables by converting it to HFiles.
- Performed validation on the data ingested to filter and cleanse the data in Hive.
- Created Sqoop jobs to handle incremental loads from RDBMS into HDFS and applied Spark transformations.
- Loaded data into Hive tables from Spark using the ORC columnar format (see the sketch after this list).
- Developed Oozie workflows to automate and productionize the data pipelines.
- Developed Sqoop import Scripts for importing data from Netezza.
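A minimal sketch of the Spark-to-Hive ORC load described above; the database, table, and partition column names are hypothetical.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object LoadOrcTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("load-orc-table")
      .enableHiveSupport() // talk to the Hive metastore
      .getOrCreate()

    // Hypothetical staging table of already-validated records.
    val sales = spark.table("staging.sales_clean")

    // Append into a date-partitioned Hive table stored as ORC.
    sales.write
      .format("orc")
      .partitionBy("sale_date")
      .mode(SaveMode.Append)
      .saveAsTable("warehouse.sales")
  }
}
```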
Environment: HDP, MapReduce, Spark, YARN, Hive, Tez, HBase, Oozie, Sqoop, Flume, Teradata, Netezza.
Confidential
Hadoop Developer
Responsibilities:
- Worked on migrating MapReduce programs into Spark transformations using Spark and Python.
- Developed Spark jobs in Python on YARN/MRv2 for interactive and batch analysis.
- Queried data using Spark SQL with the Spark engine for faster dataset processing (see the sketch after this list).
- Extensively used Elastic Load Balancing with Auto Scaling to scale EC2 capacity across multiple Availability Zones in a region, distributing high incoming traffic for the application with zero downtime.
- Created Partitioned Hive tables and worked on them using HiveQL.
- Created Hive tables, loaded them with data, and wrote Hive queries that run internally as MapReduce jobs.
- Used the DataFrame and Dataset APIs to perform analysis on Hive tables.
- Monitored the Hadoop cluster using Cloudera Manager, interacted with Cloudera support, logged issues in the Cloudera portal, and fixed them per the recommendations.
- Responsible for Cloudera Hadoop upgrades and patches and for installing ecosystem products through Cloudera Manager, along with Cloudera Manager upgrades.
- Used Sqoop for large data transfers from RDBMS to HDFS/HBase/Hive and vice-versa.
- Worked with continuous-integration tools such as Jenkins and automated end-of-day JAR builds.
- Developed Unix shell scripts to load many files into HDFS from the Linux file system.
- Monitored workload, job performance and capacity planning using Cloudera Manager.
- Used Impala connectivity from the user interface (UI) and queried the results using Impala SQL.
- Used ZooKeeper to coordinate the servers in clusters and maintain data consistency.
- Continuously monitored and managed the Hadoop cluster using Cloudera Manager and the web UI.
- Used Oozie operational services for batch processing and for scheduling workflows dynamically.
- Managed and scheduled several jobs to run over a certain period on Hadoop cluster using Oozie.
- Supported the setup of the QA environment and implemented scripts with Pig, Hive, and Sqoop.
- Followed Agile methodology for the entire project and supported testing teams.
- Worked with customers and the product manager to prioritize and validate requirements.
- Completed plans for long-term goals using Microsoft Project.
- Coordinated the work efforts of an 8-person team across various projects, helping the team complete tasks successfully and on time and resolving obstacles encountered by team members.
- Coordinated and participated in weekly estimation meetings to provide high-level estimates (Story Points) for backlog items.
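A minimal sketch of the kind of Spark SQL query against a partitioned Hive table described above, shown in Scala for consistency with the earlier sketches (this project itself used Python/PySpark); the table, column names, and date are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object CustomerMetrics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("customer-metrics")
      .enableHiveSupport()
      .getOrCreate()

    // The filter on the partition column prunes the scan to one day's data.
    val daily = spark.sql(
      """SELECT customer_id, COUNT(*) AS visits, SUM(amount) AS spend
        |FROM warehouse.orders
        |WHERE order_date = '2016-03-01'
        |GROUP BY customer_id""".stripMargin)

    daily.show(20)
  }
}
```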
Environment: Hadoop, HDFS, Hive, MapReduce, Impala, Sqoop, SQL, Talend, Python, PySpark, YARN, Pig, Oozie, Linux-Ubuntu, AWS, Tableau, Maven, Jenkins, Cloudera, JUnit, Agile methodology.
Confidential
Java Developer
Responsibilities:
- Reviewed requirements with the support group and developed an initial prototype.
- Involved in the analysis, design, and development of application components using JSP and Servlets, following J2EE design patterns.
- Wrote specifications for the development.
- Wrote JSPs and Servlets and deployed them on the WebLogic application server.
- Implemented the Struts framework based on the Model-View-Controller (MVC) design paradigm.
- Created the Struts config XML file and defined the action mappings.
- Designed the application by implementing Struts MVC, with simple JavaBeans as the model, JSP UI components as the view, and the ActionServlet as the controller.
- Wrote Oracle PL/SQL stored procedures, triggers, and views for backend database access.
- Used JSP and HTML on the front end, Servlets as front controllers, and JavaScript for client-side validations.
- Participated in server-side and client-side programming.
- Wrote SQL stored procedures and used JDBC to connect to the database (see the sketch after this list).
- Designed, developed, and maintained the data layer using JDBC, and configured the Java application framework.
- Worked on triggers and stored procedures on the Oracle database.
- Worked on Eclipse IDE to write the code and integrate the application.
- Communicated between different applications using JMS.
- Worked extensively with PL/SQL and SQL.
- Developed different modules using J2EE (Servlets, JSP, JDBC, JNDI).
- Tested and validated the application on different testing environments.
- Performed functional, integration and validation testing.
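A minimal sketch of calling a stored procedure over JDBC as described above, shown in Scala for consistency with the earlier sketches (the original work was in Java, and the java.sql API is the same); the connection details and procedure name are hypothetical.

```scala
import java.sql.{Connection, DriverManager}

object OrderDao {
  def updateOrderStatus(orderId: Long, status: String): Unit = {
    // Hypothetical Oracle connection; requires the Oracle JDBC driver on the classpath.
    val conn: Connection = DriverManager.getConnection(
      "jdbc:oracle:thin:@dbhost:1521:orcl", "app_user", "secret")
    try {
      // Invoke a hypothetical PL/SQL stored procedure via a CallableStatement.
      val stmt = conn.prepareCall("{call update_order_status(?, ?)}")
      stmt.setLong(1, orderId)
      stmt.setString(2, status)
      stmt.execute()
      stmt.close()
    } finally {
      conn.close()
    }
  }
}
```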
Technical Environment: Java, J2EE, Struts, JSP, HTML, Servlets, JavaScript, Rational Rose, SQL, PL/SQL, JDBC, MS Excel, UML, Apache Tomcat.