
Lead Data Engineer Resume


Dallas, TX

SUMMARY

  • Over 10 years of experience in Information Technology, including 8+ years with Big Data and the Hadoop ecosystem.
  • Hands-on experience configuring HDFS and Hadoop ecosystem components such as HBase, Solr, Hive, Tez, Sqoop, Pig, Flume, Oozie, and ZooKeeper.
  • Hands-on experience writing scripts and Hive Query Language (HQL) queries.
  • Experience in database development using SQL and PL/SQL, working with Oracle, Informix, and SQL Server.
  • Upgraded Hadoop clusters from CDH3 to CDH4, set up high-availability clusters, and integrated Hive with existing applications.
  • Expert in data analysis, gap analysis, business coordination, requirements gathering, and technical documentation. Experienced with multiple Hadoop distributions, including Hortonworks and Cloudera.
  • Hands-on experience with build tools such as Jenkins, Maven, and Ant, with containers (Docker), and with virtualization hypervisors (ESXi, ESX).
  • Experience working with NoSQL databases such as HBase, with knowledge of Cassandra, Redis, and MongoDB.
  • Experience using Sqoop to import data from RDBMS into HDFS and export it back.
  • In-depth knowledge of Cassandra and hands-on experience installing, configuring, and monitoring DataStax Enterprise clusters.
  • Experience with Hadoop architecture and its components, including HDFS, NameNode, DataNode, JobTracker, TaskTracker, YARN, and MapReduce.
  • Developed a data pipeline using Kafka, HBase, Mesos, Spark, and Hive to ingest, transform, and analyze customer behavioral data.
  • Working experience with databases such as Oracle, SQL Server, Sybase, and DB2, covering object-relational DBMS architecture, physical and logical database structure, application tuning, and query optimization.
  • Created virtual machines using VMware and CHP software.
  • Handled virtual machine migrations from Azure Classic to Azure Resource Manager (ARM) using PowerShell.
  • Experience installing and configuring WebSphere, WebLogic, and Tomcat, and deploying 3-tier applications.
  • Proficient in SQL and PL/SQL using Oracle, DB2, Sybase, and SQL Server.
  • Installed and configured Talend ETL in single- and multi-server environments.
  • Experienced with Apache Airflow; built multiple data pipelines with it.
  • Created standards and best practices for Talend ETL components and jobs.
  • Effective team player with excellent communication skills and the insight to set priorities, schedule work, and meet critical deadlines.
  • Strong technical and architectural knowledge in solution development.
  • Effective in working independently and collaboratively in teams.
  • Good analytical, communication, problem solving and interpersonal skills.
  • Flexible and ready to take on new challenges.

TECHNICAL SKILLS

Big Data Ecosystem: Hadoop, HDFS, HBase, ZooKeeper, Hive, Pig, Sqoop, Spark, Oozie, Flume, Impala, Tez, Kafka, Storm, Sonar, HCatalog, YARN, Cassandra, and Mesos

Big Data Distributions: Hortonworks, Cloudera, Apache

Programming: Python and PL/SQL

Databases: Oracle 10g, DB2, SQL, NoSQL (MongoDB, Cassandra, HBase), Snowflake

Web/App Servers: WebSphere Application Server 7.0, Apache Tomcat 5.x/6.0, JBoss 4.0

Web Languages: XML, XSL, HTML, JavaScript, jQuery, and JSON

ETL: Talend, Informatica 9.x/8.x (Integration Service/PowerCenter), IWX (Infoworks)

Messaging Systems: JMS, Kafka, and IBM MQ Series

Version Control: Git, SVN, and CVS

Scripts: Shell, Python, Maven, and Ant

OS & Others: Windows, Linux, SVN, ClearCase, PuTTY, WinSCP, and FileZilla

Cloud: AWS (EC2, S3, CloudWatch, RDS, ElastiCache, ELB, IAM), Rackspace, OpenStack, Cloud Foundry

PROFESSIONAL EXPERIENCE

Confidential, Dallas, TX

Lead Data Engineer

Responsibilities:

  • Worked hands-on with various Big Data technologies, including Hadoop, MapReduce, and HDFS.
  • Worked with data analysts to create metrics based on business requirements, evaluated them in the prod-stage environment, and then promoted them to production.
  • Played a vital role in creating technical SQL for Snowflake based on business SQL provided by the data analysts.
  • Troubleshot post-deployment production support issues and delivered solutions as required.
  • Implemented CI/CD using in-house automation frameworks with GitHub as the version control system.
  • Reviewed software documentation for technical accuracy, compliance, and completeness, with a focus on mitigating risk.
  • Designed test plans, scenarios, scripts, and procedures to assess product quality and release readiness.
  • Executed tests and captured results documentation by running shell scripts on EMR.
  • Worked on auto scaling instances to design cost-effective, fault-tolerant, and highly reliable systems.
  • Used Amazon EMR to process Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
  • Developed AWS CloudFormation templates to create custom-sized VPCs, subnets, EC2 instances, ELBs, and security groups.
  • Extracted, transformed, and loaded data from JSON, DB2, SQL Server, Excel, and flat files.
  • Developed streaming solutions based on the Kafka Streams API.
  • Implemented windowing using the Confluent Kafka Streams API.
  • Integrated Kafka with Spark Structured Streaming (see the first sketch after this list).
  • Improved query performance and daily ETL job runtimes through efficient design of partitioned tables.
  • Developed data extraction pipelines using pandas DataFrames in Python (see the pandas sketch after this list).
  • Performed initial debugging by reviewing configuration files, logs, and code to determine the source of failures.
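
The Kafka-to-Spark Structured Streaming integration above can be illustrated with a minimal PySpark sketch; the broker address, topic name, and event schema are hypothetical placeholders rather than the production configuration, and the job assumes the spark-sql-kafka connector is available.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col, window
    from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("kafka-windowed-agg").getOrCreate()

    # Hypothetical event schema for a customer-behavior topic
    schema = (StructType()
              .add("user_id", StringType())
              .add("amount", DoubleType())
              .add("event_time", TimestampType()))

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
              .option("subscribe", "customer-events")             # placeholder topic
              .load()
              .select(from_json(col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    # 10-minute tumbling windows with a 15-minute watermark for late events
    windowed = (events
                .withWatermark("event_time", "15 minutes")
                .groupBy(window(col("event_time"), "10 minutes"), col("user_id"))
                .sum("amount"))

    (windowed.writeStream
             .outputMode("update")
             .format("console")          # swap for a real sink (Kafka, S3, etc.) in practice
             .start()
             .awaitTermination())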
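
Similarly, a minimal sketch of a pandas-based extraction pipeline; the file paths and column names are illustrative assumptions only.

    import pandas as pd

    SRC = "/data/raw/transactions.csv"          # hypothetical source extract
    DEST = "/data/curated/daily_spend.parquet"  # hypothetical curated output

    def extract_transform(src: str, dest: str) -> None:
        # Read the raw extract and keep only the rows/columns downstream metrics need
        df = pd.read_csv(src, parse_dates=["txn_date"])
        df = df.dropna(subset=["customer_id"])
        df["amount"] = df["amount"].astype(float).round(2)

        # Aggregate to one row per customer per day
        daily = (df.groupby(["customer_id", df["txn_date"].dt.date])["amount"]
                   .sum()
                   .reset_index(name="daily_spend"))
        daily.to_parquet(dest, index=False)

    if __name__ == "__main__":
        extract_transform(SRC, DEST)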

Environments/Tools: GitHub, SQL, AWS (EC2, S3, EMR), Docker, Control-M, UNIX, shell scripts, Python, Spark, Snowflake, Nebula (Master Data Management tool), Jupyter Notebooks.

Confidential, Atlanta, GA

Lead Hadoop Developer

Responsibilities:

  • Developed Spark SQL scripts to handle different data sets and verified their performance against MapReduce jobs.
  • Imported real-time data into Hadoop using Kafka and implemented Oozie jobs.
  • Loaded data from the Linux file system into HDFS.
  • Wrote MapReduce jobs for text mining and worked with the predictive analytics team to validate output against requirements.
  • Developed custom aggregate functions using Spark SQL and performed interactive querying (see the first sketch after this list).
  • Used Sqoop to load data into HBase and Hive.
  • Worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode high availability, capacity planning, and slot configuration.
  • Created Hive tables with dynamic partitions and buckets for sampling, and worked with them.
  • Implemented partitioning, dynamic partitions, and bucketing in Hive (the second sketch after this list shows the approach).
  • Developed Hive scripts to create views and apply transformation logic in the target database.
  • Monitored multiple Hadoop cluster environments using Cloudera Manager, tracking workload and job performance and collecting cluster metrics as required.
  • Designed ETL data pipeline flows to ingest data from RDBMS sources into Hadoop using shell scripts, Sqoop, and MySQL.
  • Optimized Hive tables using techniques such as partitioning and bucketing to improve query performance.
  • Managed and supported the Teradata EDW, including client tools such as Teradata Studio and SQL Assistant connecting to the data lake.
  • Designed a workflow that downloads binary files directly into a cluster or local directory.
  • Developed a MapR document that spins up in the OpenShift environment when the job runs successfully.
  • Used Sonar to check for code issues and improve code clarity.
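
A minimal PySpark sketch of a custom aggregate function, using a Spark 3.x Series-to-scalar pandas UDF as a stand-in for the custom aggregates mentioned above; the table and column names are hypothetical.

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    @pandas_udf("double")
    def weighted_avg(amount: pd.Series, weight: pd.Series) -> float:
        # Custom aggregate: weight-adjusted average per group
        return float((amount * weight).sum() / weight.sum())

    # Hypothetical Hive table with amount/weight columns
    df = spark.table("transactions")
    (df.groupBy("customer_id")
       .agg(weighted_avg("amount", "weight").alias("weighted_avg_amount"))
       .show())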
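
And a sketch of the dynamic-partitioning side of the Hive work, run here through spark.sql although the same HQL could live in a standalone Hive script; the staging and curated table names are assumptions for illustration.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-dynamic-partitions")
             .enableHiveSupport()
             .getOrCreate())

    # Allow dynamic-partition inserts
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    # Hypothetical curated table partitioned by load date
    spark.sql("""
        CREATE TABLE IF NOT EXISTS curated.orders (
            order_id    STRING,
            customer_id STRING,
            amount      DOUBLE
        )
        PARTITIONED BY (load_date STRING)
        STORED AS ORC
    """)

    # Dynamic-partition insert from a hypothetical staging table;
    # the partition column must come last in the SELECT
    spark.sql("""
        INSERT OVERWRITE TABLE curated.orders PARTITION (load_date)
        SELECT order_id, customer_id, amount, load_date
        FROM staging.orders_raw
    """)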

Environments/Tools: Apache Hadoop 2.x, HDFS, YARN, Map Reduce, Hive, HBase, Splunk, Kafka, LDAP, Kerberos, Oracle Server, MySQL Server, Elasticsearch, Crontab, Core Java, Linux, Bash scripts

Confidential, Dallas, TX

Hadoop Developer

Responsibilities:

  • Ingested data from sources such as Oracle and DB2, performed data transformations, and exported the transformed data to cubes per business requirements.
  • Handled data imports from various sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from Oracle, DB2, and Teradata into HDFS using Sqoop.
  • Created Hive tables and loaded them with data using HQL scripts, which run internally as MapReduce jobs.
  • Wrote custom Hive UDFs in Java where the required functionality was too complex for built-in functions.
  • Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning, and buckets.
  • Ran Hive queries on data dumps and generated aggregated datasets for downstream systems to analyze further.
  • Applied window functions, aggregations, and time/date functions to the data per the business logic (see the sketch after this list).
  • Developed dynamically partitioned Hive tables storing data by timestamp and source type for efficient performance tuning.
  • Scheduled Sqoop ingestions and Hive transformations (HQL scripts) using the Oozie and Maestro schedulers.
  • Worked with different file formats, such as TEXTFILE and ORC, for Hive querying and processing.
  • Queried data using Spark SQL on the Spark engine and implemented Spark RDDs in Scala.
  • Wrote Python applications on Apache Spark to parse and convert txt and xls files (also covered in the sketch after this list).
  • Integrated Apache Kafka with Apache Spark for real-time processing.
  • Created HBase tables to load large sets of semi-structured data from various sources.
  • Performed performance optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
  • Imported and exported data between Oracle/DB2 and HDFS/Hive using Sqoop for analysis, visualization, and reporting.
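
A minimal PySpark sketch of the file parsing and window-function work above; the delimiter, paths, and column names are assumptions for illustration.

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("parse-and-window").getOrCreate()

    # Hypothetical pipe-delimited text extract: account_id|txn_date|amount
    raw = (spark.read
           .option("sep", "|")
           .option("header", "true")
           .csv("/data/landing/transactions.txt"))

    txns = (raw
            .withColumn("txn_date", F.to_date("txn_date", "yyyy-MM-dd"))
            .withColumn("amount", F.col("amount").cast("double")))

    # Rank transactions per account by recency and keep a running total by date
    w_recent = Window.partitionBy("account_id").orderBy(F.col("txn_date").desc())
    w_running = (Window.partitionBy("account_id").orderBy("txn_date")
                 .rowsBetween(Window.unboundedPreceding, Window.currentRow))

    result = (txns
              .withColumn("recency_rank", F.row_number().over(w_recent))
              .withColumn("running_total", F.sum("amount").over(w_running)))

    result.write.mode("overwrite").parquet("/data/curated/transactions_enriched")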

Environment: HDFS, Hive, Map Reduce, Java, HBase, Pig, Sqoop, Oozie, MySQL, SQL Server, Windows and Linux.
