Lead Data Engineer Resume
California
SUMMARY
- 9+ years of hands-on experience with Big Data technologies and the Hadoop ecosystem, including Hive, Spark, NiFi, Sqoop, Flume, Oozie, Kafka, HBase, Lucidworks Solr, MapReduce, AWS, and Java.
- Experience in building data pipelines with Kafka, Spark Streaming, and HBase.
- Hands-on experience migrating Flume Java interceptors to the Spark framework.
- Hands-on experience developing an application with Event Hubs, ADLS, and Redis Cache, with an ADF pipeline.
- Hands-on experience developing standalone Solr applications to update data fields.
- Developed data tracer application to analyze the gap from source ingestion to destination at different stages in the pipeline.
- Hands-on experience migrating Solr and HBase data from HDP 2.5.6 to HDP 2.6.5.
- Experience in developing Spark applications in Scala for data extraction and analysis.
- Experience in analyzing data using HiveQL and custom MapReduce programs in Java.
- Experienced in loading large sets of structured, semi-structured, and unstructured data, and in importing and exporting data to and from HDFS and Hive using Sqoop.
- Experience in developing data ingestion and workflow management scripts using NiFi.
- Hands-on experience working with NiFi to load data from multiple sources directly into HDFS.
- Worked on big data projects such as Streamline analytics and Data consolidation.
- Worked on real-time data integration using Kafka, Spark, and Cassandra.
- Proficient in writing Spark applications in Scala and Java.
- Experience in developing Hive UDFs and Spark UDFs.
- Good knowledge of NoSQL databases such as HBase, MemSQL, and Cassandra.
- Good understanding and working knowledge of the Agile software development process. Sound understanding of Agile tools such as Jira and Octane.
- Good knowledge of ETL and BI tools such as Tableau and Cognos.
- Hands-on experience in monitoring and maintaining Hadoop clusters.
- Hands-on experience with Amazon EMR.
- Hands-on experience building Hortonworks Data Platform (HDP) 2.5.x and 2.6.x clusters.
- Working on a Cloudera Data Platform (CDP) POC to migrate HDP 2.6.5 Spark applications to the CDP environment.
- Proficient in developing web-based applications and client-server distributed architecture applications in Java/J2EE technologies using object-oriented methodology.
- Strong knowledge of the full software development life cycle: software analysis, design, architecture, development, and maintenance.
TECHNICAL SKILLS
Big Data Technologies: HDFS, Hive, Spark, NiFi, MapReduce, YARN, Sqoop, Oozie, ZooKeeper, Lucidworks Solr, and Flume.
Scripting Languages: Shell, Scala, and Python.
Programming Languages: Java and SQL.
Web Services: AWS EC2, S3, EMR, DynamoDB, SOAP, and REST.
Databases: SQL Server, Oracle, and MySQL.
DW & BI Tools: DataStage, Cognos, and Tableau.
NoSQL Databases: HBase, MemSQL, Redis Cache, Cosmos DB, and Cassandra.
Hadoop Environments: HDP 2.5.x, HDP 2.6.x, and CDP.
Tools: Eclipse, IntelliJ, JBuilder, and GitLab.
Operating Systems: Mesos DC/OS, Linux, UNIX, Mac, Windows 7, and Windows 8.
PROFESSIONAL EXPERIENCE
Confidential, California
Lead Data Engineer
Responsibilities:
- Involved in the data ingestion and data processing phases.
- Lead the design and engineering of application development; accountable for the implementation and production rollout of the solutions.
- Built legacy data pipelines, such as Flume-Kafka integration with Lucidworks Solr.
- Worked on migrating Flume Java interceptor logic to Spark Streaming integrated with Kafka.
- Developed data pipelines with Spark Streaming, Kafka, and Lucidworks Solr.
- Developed a Spark application to stream processed data to Solr with the SolrJ client.
- Developed additional components, such as HBase integration, for the existing data pipelines to persist data in HBase tables after Spark processing in Kerberized HDP 2.5.x / 2.6.x environments.
- Developed the data tracer application to analyze the gap from source ingestion (Kafka topics) to destination (Solr, HDFS) at different stages in the pipeline.
- Developed a Spark Streaming application to consume messages from a Kafka topic, process the data from HDFS using Spark DataFrames, and persist the DataFrames to HBase tables.
- Developed a Spark preprocessor application for Lam-specific data, integrating it with the existing data pipeline without breaking the flow.
- Working on HDFS data migration to ADLS storage.
- Working on migrating the existing data pipeline's storage from HDFS to ADLS using the Java ADLS API.
- POC experience migrating applications from the current HDP 2.6.x platform to the CDP platform.
- Currently working on the HDInsight migration project from the HDP 2.6.x platform.
- Utilized Agile Scrum Methodology to help manage and organize a team of four developers with regular code review sessions.
- Documented operational problems, following standards and procedures, using the reporting tools Confluence and Jira.
- Optimize and tune the Hadoop environments to meet performance requirements.
- Performance analysis and debugging of slow running development and production processes.
- Assist the admin and support teams in maintaining high-level and low-level technical documentation.
Confidential, Chicago
Hadoop/Spark Lead Developer
Responsibilities:
- Involved in the data ingestion, data processing, and reporting phases.
- Led the design and engineering of application development; accountable for the implementation and production rollout of the solutions.
- Developed NiFi workflow scripts to pull data from vendor data sources.
- Developed a Spark application in Scala to extract input files based on their respective schemas.
- Developed a custom Spark application for common processing functionality.
- Developed a custom NiFi framework to automate the ingestion mechanism with minimal effort.
- Developed a Spark application to read MemSQL data into Spark DataFrames.
- Worked on data ingestion automation from multiple data sources into Hadoop Distributed File System.
- Developed business logic using Scala, Spark, and Hive.
- Implemented the factory design pattern to handle vendor-specific data enrichment for multiple vendors.
- Developed custom Spark jobs to move large datasets between platforms, such as SQL Server to Hadoop and MemSQL to Hadoop, and vice versa.
- Developed Selenium automation jobs for downloading data from vendor portals.
- Developed a custom NiFi processor to download attachments from the Confidential email box.
- Developed email notification and scheduling jobs in Java for the Selenium automation jobs on Windows Server.
- Developed Spark SQL scripts for implementing data transformations.
- Developed Hive and SQL scripts to load data into Hive external tables and to create views consumed by the Tableau visualization tool.
- Developed date and currency type-conversion and sales Spark UDFs to reduce the load on Tableau.
Confidential, California
Hadoop Developer
Responsibilities:
- Worked on Streamline analytics and data consolidation projects on the product Connect Home.
- Integrated Kafka, Spark and Cassandra for streamline analytics.
- Worked on ingesting, reconciling, compacting, and purging base table and incremental table data using Hive and job scheduling through Oozie.
- Applied DevOps principles to ensure operational excellence before deploying to production.
- Operated the cluster on AWS using EC2, EMR, S3, and CloudWatch.
- Imported structured data using Sqoop from MySQL and Oracle into HDFS, and vice versa, on a regular basis.
- Developed scripts and batch jobs to schedule various Hadoop programs using Oozie.
- Implemented Spark Streaming on a variety of data using optimization and performance-tuning techniques.
- Gathered the business requirements from the Business Partners and Subject Matter Experts.
- Optimized Hive queries using compact and bitmap indexing for quick lookup inside tables.
- Created 50 buckets for each Hive ORC table, clustered by client ID, for better performance when updating the tables.
- Used Java UDFs for performance tuning in Hive and Pig by manually driving the MR part.
- Wrote Spark programs to model data for extraction, transformation, and aggregation from multiple file formats, including XML, JSON, CSV, and other compressed file formats.
- Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
- Developed Pig UDFs to pre-process data for analysis.
- Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
- Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Utilized Agile Scrum Methodology to help manage and organize a team of four developers with regular code review sessions.
- Prepared developer (unit) test cases and executed developer testing.
- Documented operational problems, following standards and procedures, using the reporting tool Jira.
Confidential
Hadoop/Java Developer
Responsibilities:
- Responsible for managing data from multiple sources.
- Involved in the migration part of the project for 30 different sources.
- Worked with IBM's data extraction application DataStage for ETL to land the data on the edge node.
- Developed cron jobs to write the input data files to the HDFS location and the archive location.
- Developed MapReduce applications for schema validation and row count validation.
- Imported and exported data into and out of HDFS, and assisted in exporting analyzed data to relational databases, using Sqoop.
- After schema and row count validation, wrote MapReduce jobs to create the Avro schema.
- Developed applications to convert .dat files to the Avro data format.
- Wrote MapReduce jobs to create a superset schema from the different Avro schemas.
- Created Hive tables from the superset schema.
- Participated in whiteboard sessions to gather task requirements.
