Data Engineer Resume
Cincinnati, OH
SUMMARY
- Experienced in building highly scalable Big Data solutions on multiple Hadoop distributions (Cloudera, Hortonworks) and NoSQL platforms (HBase and Cassandra).
- Experience across the software development life cycle (SDLC) for various applications, including analysis, design, development, implementation, maintenance, and support.
- Hands-on experience writing Spark SQL scripts and implementing Spark RDD transformations and actions in Python and Scala.
- Experience with Spark Core, Spark Streaming, HiveContext, and Spark SQL for data analysis.
- Good exposure to performance tuning of Hive queries and MapReduce-style jobs on the Spark framework.
- Expertise in designing and deploying Hadoop clusters and Big Data analytic tools, including Pig, Hive, HBase, ZooKeeper, Sqoop, Flume, Kafka, and Spark, in both Cloudera and Hortonworks environments.
- Experience developing MapReduce jobs in Java for data cleaning, transformation, pre-processing, and analysis.
- Hands-on experience designing Apache Airflow orchestrations for data ingestion and processing, both on-premises and on Google Cloud Platform.
- Experienced in working with cloud services such as Google Cloud.
- Good knowledge of distributed systems, HDFS architecture, and the internal workings of the MapReduce and Spark processing frameworks.
- Good understanding of machine learning, data mining, and algorithms.
- Good understanding of messaging services such as Apache Kafka.
- Good understanding of cloud services on Amazon Web Services (AWS): EC2, S3, RDS, Lambda, etc.
- Analyzed streaming data and identified important trends for further analysis using Spark Streaming.
- Experience with, and a good understanding of, the internal workings of Apache Kafka.
- End-to-end experience designing and building data visualizations using Tableau.
- Participated in detailed object-oriented analysis and design, developing code in accordance with the design.
- Experienced in using relational databases such as MySQL and MS SQL Server, including writing SQL queries, stored procedures, and triggers.
- Familiar with the Java Virtual Machine (JVM) and multi-threaded processing.
- Ability to adapt to evolving technology, strong sense of responsibility and accomplishment.
TECHNICAL SKILLS
Hadoop/Big Data Technologies: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, ZooKeeper, Kafka, Impala, Apache Spark, Spark Streaming, Spark SQL, Hue.
Programming Languages: Python, Scala, SQL, HQL
Databases: Oracle, MySQL, HBase
IDE Tools: VS-Code, IntelliJ
Frameworks: Hibernate, Spring, Struts
Web Technologies: HTML5, CSS3, JavaScript
Reporting/ETL Tools: Tableau, Microsoft Power BI
PROFESSIONAL EXPERIENCE
Data Engineer
Confidential, Cincinnati, OH
Responsibilities:
- Involved in various stages of the project's data flow, such as control validation, data quality, and change data capture.
- Participated in the entire software development life cycle (SDLC) of the project, including analysis, design, development, implementation, maintenance, and support.
- Built various data stores for specific business functions (transaction, product, store, card, etc.).
- Built segmentations on the data assets to narrow down specific areas for targeted campaigns.
- Performed transformations, cleaning, and filtering on imported data using Python, Jupyter Notebooks, and Visual Studio Code, and loaded the results into a data lake on HDFS.
- Developed workflows in Apache Airflow to automate processes (see the Airflow sketch after this list).
- Implemented Spark RDD transformations and actions in Apache Spark to support business analysis (a PySpark sketch follows this list).
- Imported and exported data between RDBMS and HDFS using Sqoop.
- Built and maintained CI/CD pipelines on TeamCity to facilitate continuous delivery and deployment.
- Worked with Jupyter Notebooks on Google Cloud Platform (GCP).
- Built orchestrations with Apache Airflow on GCP.
- Worked closely with data scientists to accommodate their changing data requirements.
- Created partitioned and bucketed tables based on the hierarchy of the dataset.
- Applied various performance optimizations, such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Proficient in reading PL/SQL code and rebuilding equivalent functionality in Python and Spark.
- Experience tuning Spark applications.
- Good understanding of Spark SQL, the Spark transformation engine, and Spark Streaming.
- Experience using version control services (GitHub).
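A minimal sketch of an Airflow ingestion workflow like those described above, assuming the Airflow 1.x PythonOperator API; the DAG name, schedule, and task callables are hypothetical placeholders.

    # Illustrative sketch only: a daily ingestion DAG in Apache Airflow.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def ingest():
        pass  # e.g., pull source files into the HDFS data lake

    def validate():
        pass  # e.g., run control validation / data quality checks

    with DAG(dag_id="daily_ingestion",            # hypothetical name
             start_date=datetime(2019, 1, 1),
             schedule_interval="@daily",
             catchup=False) as dag:
        ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
        validate_task = PythonOperator(task_id="validate", python_callable=validate)
        ingest_task >> validate_task              # validate runs after ingest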
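And a minimal PySpark sketch of RDD transformations and actions of the kind referenced above; the HDFS path and record layout (store_id,card_id,amount) are hypothetical.

    # Illustrative sketch only: RDD transformations and actions in PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("transaction-cleanup").getOrCreate()
    sc = spark.sparkContext

    # Load raw CSV records from the data lake (hypothetical path/layout)
    raw = sc.textFile("hdfs:///data/lake/transactions/raw")

    # Transformations: parse, drop malformed rows, key by store
    parsed = (raw.map(lambda line: line.split(","))
                 .filter(lambda cols: len(cols) == 3 and cols[2])
                 .map(lambda cols: (cols[0], float(cols[2]))))

    # Action: total spend per store, sampled for inspection
    spend_per_store = parsed.reduceByKey(lambda a, b: a + b)
    print(spend_per_store.take(10))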
Environment: Cloudera Manager 5.15, HDFS, Hive, Spark 2.2, Airflow, Python, Jupyter Notebooks, Visual Studio Code, TeamCity, GitHub, Oracle.
Hadoop/Spark Developer
Confidential, Denver, CO
Responsibilities:
- Involved in various stages of the project's data flow, such as control validation, data quality, and change data capture.
- Performed data mining tasks depending on business scenarios.
- Experience with the Cloudera distribution of Hadoop (CDH 5.10).
- Participated in the entire software development life cycle (SDLC) of the project, including analysis, design, development, implementation, maintenance, and support.
- Performed transformations, cleaning, and filtering on imported data using Hive and MapReduce, and loaded the final data into HDFS.
- Wrote SQL stored procedures in Hue to access data from Hive.
- Implemented Spark RDD transformations and actions in Apache Spark to support business analysis.
- Created Hive tables and integrated them per the design using the Parquet file format.
- Handled delta processing and incremental updates using Hive.
- Executed dynamic partitioning in Hive to segregate the customer database by age (see the partitioning sketch after this list).
- Designed and developed Pig Latin scripts and Pig command-line transformations for data processing.
- Wrote various joins in MySQL according to client requirements.
- Developed Hive scripts to meet analysts' requirements.
- Stored data in Hive and enabled end users to access it through Impala.
- Imported and exported data between RDBMS and HDFS using Sqoop (a Sqoop example follows this list).
- Created partitioned and bucketed tables in Hive based on the hierarchy of the dataset.
- Created several UDFs in Pig and Hive to provide additional support for the project.
- Good understanding of Spark SQL, the Spark transformation engine, and Spark Streaming.
- Applied various performance optimizations, such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Involved in cluster maintenance and monitoring.
- Used the Scala programming language extensively with Apache Spark for data processing.
- Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
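A minimal sketch of the dynamic-partitioning approach mentioned above, expressed as HiveQL issued from PySpark; it is written against the Spark 2.x API for brevity (this project used Spark 1.6 with HiveContext), and the table and column names are hypothetical.

    # Illustrative sketch only: dynamic partitioning in Hive from PySpark.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.appName("customer-partitioning")
             .enableHiveSupport().getOrCreate())

    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    spark.sql("""
        CREATE TABLE IF NOT EXISTS customers_by_age (
            customer_id BIGINT,
            name        STRING)
        PARTITIONED BY (age_band STRING)
        STORED AS PARQUET
    """)

    # The partition column must come last in the SELECT list
    spark.sql("""
        INSERT OVERWRITE TABLE customers_by_age PARTITION (age_band)
        SELECT customer_id, name,
               CASE WHEN age < 30 THEN 'under_30'
                    WHEN age < 60 THEN '30_to_59'
                    ELSE '60_plus' END AS age_band
        FROM customers_raw
    """)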
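And a sketch of a Sqoop import of the kind used for the RDBMS-to-HDFS transfers above, wrapped in Python for consistency with the other examples; the JDBC connection string, credentials, table, and target directory are placeholders.

    # Illustrative sketch only: Sqoop import from MySQL into HDFS.
    import subprocess

    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost:3306/sales",  # hypothetical DB
        "--username", "etl_user", "-P",   # -P prompts for the password
        "--table", "orders",
        "--target-dir", "/data/raw/orders",
        "--num-mappers", "4",
    ], check=True)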
Environment: MapReduce, Cloudera Manager 5.10, HDFS, Hive, Spark 1.6, Kafka, Scala, MySQL, Java (JDK 1.6), Eclipse.
Hadoop Developer
Confidential, CA
Responsibilities:
- Responsible for running Hadoop streaming jobs to process terabytes of XML-format data.
- Created Hive tables, loaded them with data, and wrote Hive queries that run internally as MapReduce jobs.
- Optimized Hive joins for large tables and developed MapReduce code for a full outer join of two large tables.
- Designed and developed Pig Latin scripts and Pig command-line transformations for data joins and custom processing of MapReduce outputs.
- Developed Spark scripts using Scala shell commands as required.
- Developed MapReduce jobs using the Java API to parse raw data and store the refined data.
- Created HBase tables for random reads/writes by MapReduce programs.
- Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
- Tuned Hive and Pig scripts to improve performance and solved issues in both through an understanding of joins, groups, and aggregations.
- Developed Sqoop scripts to extract data from Oracle source databases into HDFS.
- Migrated MapReduce programs to Spark transformations using Spark and Scala.
- Supported setup of the QA environment and updated configurations for implementing scripts with Pig and Sqoop; handled cluster coordination through ZooKeeper.
- Implemented Cloudera Manager on an existing cluster.
- Extensively worked with the Cloudera Distribution of Hadoop, CDH 5.x.
- Performed data ingestion from multiple internal clients using Apache Kafka (see the Kafka sketch after this list).
- Integrated Apache Storm with Kafka to perform web analytics; uploaded click-stream data from Kafka to HDFS, HBase, and Hive by integrating with Storm.
- Responsible for developing, supporting, and maintaining ETL (Extract, Transform, Load) processes using Talend.
- Developed Talend jobs and ensured data was loaded into Hive tables and HDFS files.
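A minimal sketch of Kafka-based ingestion like that described above, using the kafka-python client; the broker address, topic name, and event fields are hypothetical.

    # Illustrative sketch only: click-stream ingestion through Kafka.
    import json
    from kafka import KafkaProducer, KafkaConsumer

    # Producer side: an internal client publishes JSON events
    producer = KafkaProducer(
        bootstrap_servers="broker:9092",
        value_serializer=lambda event: json.dumps(event).encode("utf-8"))
    producer.send("clickstream", {"user": "u123", "page": "/home"})
    producer.flush()

    # Consumer side: read events for landing into HDFS/HBase/Hive
    consumer = KafkaConsumer(
        "clickstream",
        bootstrap_servers="broker:9092",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")))
    for message in consumer:
        print(message.value)  # downstream write would go here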
Environment: Hadoop, MapReduce, HDFS, Pig, Hive, HBase, Impala, Cassandra, Kafka, SQL, Python, Spark, Linux, Java.
Java Developer
Confidential
Responsibilities:
- Implemented the Struts framework with MVC architecture.
- Developed the presentation layer using JSP, HTML, and CSS, with client-side validations in JavaScript.
- Collaborated with the ETL/Informatica team to determine the data models and UI designs needed to support Cognos reports.
- Performed several data quality checks and found potential issues, designed Ab Initio graphs to resolve them.
- Deployed and tested the application on the Tomcat web server.
- Involved in coding, code reviews, and JUnit testing; prepared and executed unit test cases.
- Used JUnit for unit testing and as the integration testing tool.
- Used Oracle coherence for real-time cache updates, live event processing, in-memory grid computations.
- Developed UI for Customer Service modules and reports using JSF, JSPs, and MyFaces components.
- Created custom JSP tags for maximum re-usability of user interface components.
- Tested and deployed the application on Tomcat.
Environment: Java, JSP, Hibernate, JUnit, JavaScript, Servlets, Struts, EJB, JSF, Ant, Tomcat, CVS, Eclipse, SQL Developer, Oracle.
