Data Engineer Resume
Dallas, TX
SUMMARY
- Hadoop Developer with 7+ years of IT experience in designing and implementing complete end-to-end Hadoop ecosystems, including HDFS, MapReduce, YARN, Pig, Hive, HBase, Flume, Sqoop, Oozie, Spark, Kafka and ZooKeeper.
- Excellent hands-on knowledge of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode and the MapReduce programming paradigm.
- Hands-on experience with real-time streaming into HDFS using Spark Streaming and Kafka (a minimal sketch follows this summary).
- Experience in handling large datasets using partitions, Spark in-memory capabilities and broadcast variables in Spark.
- Strong knowledge of Spark concepts such as RDD operations, caching and persistence.
- Developed analytical components using Spark SQL and Spark Streaming.
- Experience in analyzing data using Hive UDFs, Hive UDTFs and custom MapReduce programs in Java.
- Expert in working with the Hive data warehouse tool: creating tables and distributing data by implementing partitions and bucketing.
- Expertise in Hive functionality and in migrating data from databases such as Oracle, DB2, MySQL and MongoDB.
- Handled upgrades of Apache Ambari, CDH and HDP clusters.
- Experienced working with Amazon EMR, Cloudera (CDH3 & CDH4) and Hortonworks Hadoop distributions.
- Hands-on experience with AWS infrastructure services, including Amazon Simple Storage Service (Amazon S3) and Amazon Elastic Compute Cloud (Amazon EC2).
- Experienced in writing and implementing unit test cases using testing frameworks such as JUnit, EasyMock and Mockito.
- Experienced with the Oozie workflow scheduler, managing Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
- Hands-on experience in application development, writing MapReduce jobs in Java and Python, Hive queries, RDBMS code and Linux shell scripts.
- Experienced in writing build scripts using Maven and working with continuous integration systems such as Jenkins.
- Well experienced with application servers such as WebLogic and WebSphere and with Java tools in client-server environments.
- Excellent knowledge of Python, Java, Collections, J2EE, Servlets, JSP, Spring, Hibernate and JDBC/ODBC.
- Hands-on experience with NoSQL databases such as HBase and Cassandra, and relational databases such as Oracle and MySQL.
- Deep analytical understanding of big data and algorithms using Hadoop, MapReduce, NoSQL and distributed computing tools.
- Experienced with data warehousing and ETL concepts using Informatica PowerCenter, OLAP, OLTP and AutoSys.
- Good understanding of XML methodologies (XML, XSL, XSD), including Web Services and SOAP.
- Experienced in designing both time-driven and data-driven automated workflows using Oozie to run Hadoop MapReduce and Pig jobs.
- Good experience in project impact assessment, project schedule planning, onsite-offshore team coordination and end-user coordination, from requirement gathering through live support.
- Successful in meeting new technical challenges and finding solutions to meet the needs of the customer.
- Successful working in fast-paced environments, both independently and in collaborative teams.
- Strong Business, Analytical and Communication Skills.
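The Spark and Kafka streaming experience above can be illustrated with a minimal PySpark Structured Streaming sketch that reads a Kafka topic and lands it in HDFS as Parquet. The broker address, topic name and HDFS paths below are hypothetical placeholders, and the spark-sql-kafka connector package is assumed to be available on the cluster.

```python
# Minimal sketch: Kafka topic -> HDFS (Parquet) with Spark Structured Streaming.
# Broker, topic and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("kafka-to-hdfs")
         .getOrCreate())

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
          .option("subscribe", "events")                        # placeholder topic
          .option("startingOffsets", "latest")
          .load()
          .select(col("key").cast("string"), col("value").cast("string")))

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/events")             # placeholder HDFS path
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()
```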
TECHNICAL SKILLS
Big Data Technologies: Hadoop, HDFS, Spark, Hive, Pig, HBase, Sqoop, Oozie, Flume, Kafka, ZooKeeper.
Real-Time/Stream Processing: Apache Storm, Apache Spark, Flume.
Distributed Message Broker: Apache Kafka
Databases/NoSQL: Oracle 9i/10g/11g/12c, MS SQL Server, MySQL, DB2, HBase, MongoDB, Cassandra.
Scripting Languages: JavaScript, Shell, Python.
Network & Protocols: TCP/IP, Telnet, HTTP, HTTPS, FTP, SNMP, LDAP, DNS.
Operating Systems: Linux, UNIX, macOS, Windows NT/98/2000/XP/Vista/7/8.
PROFESSIONAL EXPERIENCE
Confidential - Dallas, TX
Data Engineer
Responsibilities:
- Involved in requirement gathering and business analysis, and translated business requirements into technical designs on Hadoop and big data.
- Hands-on experience working with Hadoop ecosystem components such as Pig, Hive, Sqoop, Spark and Kafka.
- In-depth understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming and Spark MLlib.
- Implemented an incremental load approach in Spark for very large tables (sketched after this list).
- Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Worked with different file formats such as JSON, CSV and XML using Spark SQL.
- Imported and exported data between relational databases and HDFS using Sqoop.
- Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on AWS.
- Worked with both CDH4 and CDH5 applications; transferred large datasets back and forth between the development and production clusters.
- Created partitions and buckets for both managed and external Hive tables to optimize performance.
- Implemented Hive join queries to join multiple source-system tables and load the results into Elasticsearch tables.
- Built reusable Hive UDF libraries for business requirements, enabling users to call these UDFs in Hive queries (a PySpark-equivalent registration is also sketched below).
- Developed a data lake as a data management platform for Hadoop.
- Used Talend to run ETL processes instead of Hive queries.
- Moved data from Hadoop to Cassandra using the BulkOutputFormat class.
- Extracted data from a Teradata database and loaded it into the data warehouse using Spark JDBC.
- Handled data movement, transformation, analysis and visualization across the lake by integrating it with various tools.
- Experienced with code repositories such as GitHub.
- Good understanding of NoSQL databases and hands-on experience writing applications on HBase, Cassandra and MongoDB.
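A minimal sketch of the incremental load approach mentioned above, assuming a watermark column on the source table; the JDBC URL, table and column names are hypothetical placeholders, and the appropriate JDBC driver is assumed to be on the classpath.

```python
# Sketch of an incremental (delta) load in PySpark; names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = (SparkSession.builder
         .appName("incremental-load")
         .enableHiveSupport()
         .getOrCreate())

# 1. Find the high-water mark already present in the target Hive table.
last_loaded = (spark.table("warehouse.orders")          # hypothetical target table
               .agg({"updated_at": "max"})
               .collect()[0][0])

# 2. Pull only rows newer than the watermark from the source over JDBC.
source = (spark.read
          .format("jdbc")
          .option("url", "jdbc:oracle:thin:@//source-host:1521/ORCL")  # placeholder URL
          .option("dbtable", "SALES.ORDERS")                           # placeholder table
          .option("fetchsize", "10000")
          .load())

delta = source.filter(col("UPDATED_AT") > lit(last_loaded)) if last_loaded else source

# 3. Append only the delta to the target table (insertInto matches columns by position).
delta.write.mode("append").insertInto("warehouse.orders")
```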
Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Python, Kafka, Hive, Sqoop, Amazon AWS, Elasticsearch, Impala, Cassandra, Tableau, Talend, Oozie, Jenkins, Cloudera, Oracle 12c, Linux
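The reusable UDF libraries mentioned above were Java-based Hive UDFs; purely as an illustration, and to keep all sketches in one language, the snippet below shows the PySpark-equivalent pattern of registering a Python function for use in SQL against a Hive table. The function, table and column names are hypothetical.

```python
# Sketch of exposing a reusable function to SQL queries (PySpark equivalent of a Hive UDF).
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .appName("udf-demo")
         .enableHiveSupport()
         .getOrCreate())

def normalize_phone(raw):
    """Keep digits only, e.g. '(214) 555-0100' -> '2145550100'."""
    return "".join(ch for ch in raw if ch.isdigit()) if raw else None

# Register the Python function so it can be called from SQL, much like a Hive UDF.
spark.udf.register("normalize_phone", normalize_phone, StringType())

# Use the registered function against a (hypothetical) Hive table.
spark.sql("""
    SELECT customer_id, normalize_phone(phone) AS phone_digits
    FROM crm.customers
""").show(5)
```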
Confidential - Denver, CO
Data Engineer
Responsibilities:
- Responsible for Building Scalable Distributed Data solutions using Hadoop.
- Built real-time data processing pipelines with Kafka, Spark Streaming and Spark Structured Streaming; worked with Spark SQL, Structured Streaming, MLlib and the core Spark API to explore Spark features and build data pipelines, and implemented and fine-tuned Spark streaming applications to reduce shuffling (a tuning and broadcast-join sketch follows this list).
- Handled large datasets using partitions, Spark in-memory capabilities, broadcasts, efficient joins and other transformations during the ingestion process itself.
- Worked on performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism and appropriate memory settings.
- Worked on a cluster of 105 nodes.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.
- Imported data from various sources, performed transformations using Hive and MapReduce, loaded the data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
- Implemented schema extraction for Parquet and Avro file formats in Hive.
- Used Hive, created Hive tables and loaded data from the local file system into HDFS.
- Used Hive for transformations, event joins and pre-aggregations before storing the data in HDFS.
- Implemented partitioning, dynamic partitions and buckets on huge datasets to analyze and compute various metrics for reporting.
- Involved in HBase setup and stored data in HBase for future analysis.
- Worked with Tableau and Spotfire and enabled JDBC/ODBC connectivity from those tools to Hive tables.
- Used Oozie workflows to coordinate Pig and Hive scripts.
- Used Stash (Bitbucket) for code control and worked with AWS components such as Airflow, Elastic MapReduce (EMR), Athena and Snowflake.
- Wrote Hive queries for ad hoc data analysis to meet business requirements.
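A minimal sketch of the tuning and broadcast-join pattern described above; the application name, memory and parallelism settings, table names and output path are hypothetical placeholders chosen for illustration, not recorded values from the project.

```python
# Sketch: Spark session tuning plus a broadcast join to avoid shuffling the large side.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (SparkSession.builder
         .appName("tuned-pipeline")
         .config("spark.sql.shuffle.partitions", "400")   # match cluster parallelism (placeholder)
         .config("spark.executor.memory", "8g")           # placeholder memory setting
         .config("spark.executor.cores", "4")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .enableHiveSupport()
         .getOrCreate())

facts = spark.table("analytics.click_events")      # hypothetical large fact table
dims  = spark.table("analytics.device_lookup")     # hypothetical small dimension table

# Broadcasting the small side avoids a full shuffle of the large table.
enriched = facts.join(broadcast(dims), on="device_id", how="left")

# Cache only what is reused downstream, then persist the partitioned result.
enriched.cache()
enriched.write.mode("overwrite").partitionBy("event_date").parquet(
    "hdfs:///data/curated/click_events_enriched")
```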
Environment: HDFS, MapReduce, Hive, Pig, Sqoop, HBase, Oozie, Flume, Kafka, ZooKeeper, Amazon AWS, Spark SQL, Spark DataFrames, PySpark, Python, Java, JSON, SQL scripting, Linux shell scripting, Avro, Parquet, Hortonworks.
Confidential, Herndon, VA
Data Engineer
Responsibilities:
- Worked on analyzing the Hadoop cluster and various big data analytics tools, including Pig, the HBase database and Sqoop.
- Configured Sqoop jobs to import data from RDBMS into HDFS using Oozie workflows.
- Experienced in job management using the Fair Scheduler and developed job-processing scripts using Oozie workflows.
- Involved in the installation and configuration of Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and processing.
- Loaded and transformed huge datasets of structured and semi-structured data using Hive.
- Responsible for developing a data pipeline with Amazon AWS to extract data from weblogs and store it in HDFS.
- Created Hive tables and developed Hive queries for de-normalizing the data (a join sketch follows this list).
- Created Pig Latin scripts to sort, group, join and filter the enterprise-wide data.
- Involved in moving all log files generated from various sources to HDFS for further processing through Flume.
- Created batch analysis job prototypes using Hadoop, Pig, Oozie, Hue and Hive.
- Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX systems, NoSQL stores and a variety of portfolios.
- Worked on root cause analysis for all issues that occurred in batch processing and provided permanent fixes.
- Involved in analyzing system failures, identifying root causes and recommending courses of action.
- Created and maintained technical documentation for all the workflows.
- Created database access layer using JDBC and SQL stored procedures.
- Collaborated with the infrastructure, network, database, application and BI teams to ensure data quality and availability.
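A minimal sketch of the Hive de-normalization pattern referenced above; the role used Hive queries directly, and the query is expressed here through spark.sql only to keep all sketches in one language. Database, table and column names are hypothetical placeholders.

```python
# Sketch: flatten normalized staging tables into one wide reporting table.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("denormalize")
         .enableHiveSupport()
         .getOrCreate())

# Join the normalized staging tables into one de-normalized result.
denorm = spark.sql("""
    SELECT o.order_id,
           o.order_date,
           c.customer_name,
           c.region,
           p.product_name,
           o.quantity * o.unit_price AS order_value
    FROM   staging.orders    o
    JOIN   staging.customers c ON o.customer_id = c.customer_id
    JOIN   staging.products  p ON o.product_id  = p.product_id
""")

# Persist the wide table for reporting queries.
denorm.write.mode("overwrite").saveAsTable("reporting.orders_denorm")
```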
Environment: Hadoop YARN, Hive, Sqoop, Amazon AWS, Java, Python, Oozie, Jenkins, Cassandra, Oracle 12c, Linux.
Confidential - Lakewood, OH
Data Engineer
Responsibilities:
- Responsible for Cluster maintenance, adding and removing cluster nodes, Cluster Monitoring and Troubleshooting, Manage and review data backups and log files.
- Used Oozie to orchestrate the workflow.
- Involved in loading data from the Linux file system into HDFS.
- Analyzed data using the Hadoop components Hive and Pig.
- Responsible for importing data into HDFS using Sqoop from different RDBMS servers, and exporting aggregated data back to the RDBMS servers using Sqoop for other ETL operations.
- Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics (a comparison-query sketch follows this list).
- Moved data from Oracle and MS SQL Server into HDFS using Sqoop and imported various formats of flat files into HDFS.
- Wrote HBase client programs in Java and web services.
- Mentored analysts and the test team in writing Hive queries.
- Implemented test scripts to support test-driven development and continuous integration.
- Worked with application teams to install operating system and Hadoop updates, patches and version upgrades as required.
- Gained excellent hands-on knowledge of Hadoop clusters, MapReduce jobs and data migration concepts in Hive.
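A minimal sketch of the trend-comparison query pattern described above; the role used Hive queries directly, and the same kind of HiveQL is submitted here through spark.sql only to keep all sketches in one language. Table names, column names and the 20% deviation threshold are hypothetical placeholders.

```python
# Sketch: compare current-week metrics against EDW reference averages to flag trends.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("trend-check")
         .enableHiveSupport()
         .getOrCreate())

# Flag categories whose current-week sales deviate from the 52-week reference average.
trends = spark.sql("""
    SELECT f.product_category,
           f.weekly_sales,
           r.avg_weekly_sales_52w,
           (f.weekly_sales - r.avg_weekly_sales_52w) / r.avg_weekly_sales_52w AS pct_change
    FROM   staging.sales_current_week  f
    JOIN   edw.sales_weekly_reference  r
      ON   f.product_category = r.product_category
    WHERE  ABS(f.weekly_sales - r.avg_weekly_sales_52w) > 0.2 * r.avg_weekly_sales_52w
""")

trends.show(20)
```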
Environment: Hadoop, MapReduce, HDFS, Sqoop, Hive, Java, Cloudera, Pig, HBase, Linux, XML, MySQL Workbench, Eclipse, Oracle 10g, PL/SQL, SQL*Plus.