Lead Data Engineer Resume
Dallas, TX
SUMMARY
- Overall 10 years of experience in Information Technology, including 8+ years with Big Data and Hadoop ecosystem technologies.
- Hands-on experience configuring HDFS and Hadoop ecosystem components such as HBase, Solr, Hive, Tez, Sqoop, Pig, Flume, Oozie, and ZooKeeper.
- Hands-on experience writing shell scripts and Hive Query Language (HQL).
- Experience in database development using SQL and PL/SQL, working with databases such as Oracle, Informix, and SQL Server.
- Upgraded Hadoop clusters from CDH3 to CDH4, set up high-availability clusters, and integrated Hive with existing applications.
- Expert in data analysis, gap analysis, business coordination, requirements gathering, and preparation of technical documents. Experience with multiple distributions, including Hortonworks and Cloudera.
- Hands-on experience with build tools such as Jenkins, Maven, and Ant; virtualization and containers (Docker); and hypervisors (ESXi, ESX).
- Experience working with NoSQL databases such as HBase, with knowledge of Cassandra, Redis, and MongoDB.
- Experience using Sqoop to import data into HDFS from RDBMS and vice-versa.
- In-depth knowledge of Cassandra and hands-on experience installing, configuring, and monitoring DataStax Enterprise clusters.
- Experience with Hadoop architecture and its components, including HDFS, NameNode, DataNode, JobTracker, TaskTracker, YARN, and MapReduce.
- Developed a data pipeline using Kafka, HBase, Mesos, Spark, and Hive to ingest, transform, and analyze customer behavioral data.
- Working experience with databases such as Oracle, SQL Server, Sybase, and DB2 in the areas of object-relational DBMS architecture, physical and logical database structures, application tuning, and query optimization.
- Worked on creating virtual machines using VMware and CHP software.
- Handled virtual machine migrations from Azure Classic to Azure Resource Manager (ARM) using PowerShell.
- Experience with installation and configuration of WebSphere, WebLogic, and Tomcat, and with deployment of 3-tier applications.
- Proficient in SQL and PL/SQL using Oracle, DB2, Sybase and SQL Server.
- Installed and configured Talend ETL in single- and multi-server environments.
- Experienced with Apache Airflow; created multiple data pipelines as Airflow DAGs (a brief DAG sketch follows this summary).
- Created standards and best practices for Talend ETL components and jobs.
- Effective team player with excellent communication skills and the insight to determine priorities, schedule work, and meet critical deadlines.
- Strong technical and architectural knowledge in solution development.
- Effective in working independently and collaboratively in teams.
- Good analytical, communication, problem solving and interpersonal skills.
- Flexible and ready to take on new challenges.
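The Airflow pipelines referenced above can be illustrated with a minimal, hypothetical sketch, assuming Airflow 2.x; the DAG id, task ids, and script paths are placeholders rather than details from an actual engagement.

```python
# Minimal Airflow DAG sketch: a daily two-step ingest-then-transform pipeline.
# DAG id, task ids, and script paths are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_customer_ingest",      # hypothetical pipeline name
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Pull the day's extract from the source system (placeholder script path).
    ingest = BashOperator(
        task_id="ingest_raw_data",
        bash_command="python /opt/pipelines/ingest_raw_data.py --date {{ ds }}",
    )

    # Transform the landed data once ingestion has completed.
    transform = BashOperator(
        task_id="transform_to_curated",
        bash_command="python /opt/pipelines/transform_to_curated.py --date {{ ds }}",
    )

    ingest >> transform
```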
TECHNICAL SKILLS
Big Data Ecosystem: Hadoop, HDFS, HBase, ZooKeeper, Hive, Pig, Sqoop, Spark, Oozie, Flume, Impala, Tez, Kafka, Storm, Sonar, HCatalog, YARN, Cassandra, and Mesos
Big Data Distributions: Hortonworks, Cloudera, Apache
Programming: Python and PL/SQL
Databases: Oracle 10g, DB2, SQL, NoSQL (MongoDB, Cassandra, HBase), Snowflake
Web/App Servers: WebSphere Application Server 7.0, Apache Tomcat 5.x/6.0, JBoss 4.0
Web Languages: XML, XSL, HTML, JavaScript, jQuery, and JSON
ETL: Talend, Informatica 9.x/8.x (Integration Service / PowerCenter), Infoworks (IWX)
Messaging Systems: JMS, Kafka and IBM MQ Series
Version Tools: Git, SVN, and CVS
Scripts: Shell, Python, Maven and ANT
OS & Others: Windows, Linux, SVN, Clear Case, Putty, WinSCP and FileZilla
Cloud: AWS (EC2, S3, CloudWatch, RDS, ElastiCache, ELB, IAM), Rackspace, OpenStack, Cloud Foundry
PROFESSIONAL EXPERIENCE
Confidential, Dallas, TX
Lead Data Engineer
Responsibilities:
- Hands-on experience with various Big Data technologies, including Hadoop, MapReduce, and HDFS.
- Working with data analysts to create metrics based on business requirements, evaluating them in the Prodstage environment, and then promoting them to the production environment.
- Playing a vital role in creating technical SQL for Snowflake based on the business SQL provided by the data analysts.
- Troubleshooting production support issues post-deployment and providing solutions as required.
- Implementing CI/CD using in-house automation frameworks with GitHub as the version control system.
- Reviewing software documentation to ensure technical accuracy, compliance, or completeness, with a focus on mitigating risks.
- Designing test plans, scenarios, scripts, and/or procedures to determine product quality or release readiness
- Performing test execution and capturing results by running shell scripts on EMR.
- Worked on auto-scaling instances to design cost-effective, fault-tolerant, and highly reliable systems.
- Implemented Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Developed AWS CloudFormation templates to create custom-sized VPCs, subnets, EC2 instances, ELBs, and security groups.
- Expertise in extracting, transforming, and loading data from JSON, DB2, SQL Server, Excel, and flat files.
- Developed streaming solutions based on the Kafka Streams API.
- Implemented windowing techniques using the Confluent Kafka Streams API.
- Experienced in integrating Kafka with Spark Structured Streaming (a brief sketch follows this section).
- Enhanced the performance of queries and daily ETL jobs through efficient design of partitioned tables.
- Developed data extraction pipelines using pandas DataFrames in Python.
- Performing initial debugging by reviewing configuration files, logs, or code to determine the source of failures.
Environments/Tools: GitHub, SQL, AWS (EC2, S3, EMR), Docker, Control-M, UNIX, shell scripts, Python, Spark, Snowflake, Nebula (Master Data Management tool), Jupyter Notebooks.
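The Kafka with Spark Structured Streaming integration noted above is sketched below. This is a hypothetical PySpark example, assuming the spark-sql-kafka connector is on the classpath; the broker address, topic name, and event schema are placeholders.

```python
# Sketch: read a Kafka topic with Spark Structured Streaming and apply a
# tumbling-window aggregation. Broker, topic, and schema are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-structured-streaming-sketch").getOrCreate()

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("action", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "customer-events")             # placeholder topic
    .load()
)

# Kafka delivers bytes; parse the JSON payload into typed columns.
events = (
    raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Count actions per user in 10-minute tumbling windows, tolerating late data.
counts = (
    events
    .withWatermark("event_time", "15 minutes")
    .groupBy(window(col("event_time"), "10 minutes"), col("user_id"))
    .count()
)

query = (
    counts.writeStream
    .outputMode("update")
    .format("console")
    .start()
)
query.awaitTermination()
```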
Confidential, Atlanta, GA
Lead Hadoop Developer
Responsibilities:
- Developed Spark SQL scripts for handling different data sets and verified their performance against MapReduce jobs.
- Imported real-time data into Hadoop using Kafka and implemented the corresponding Oozie jobs.
- Involved in loading data from the Linux file system into HDFS.
- Wrote MapReduce jobs for text mining and worked with the predictive analytics team to validate output against requirements.
- Developed custom aggregate functions using Spark SQL and performed interactive querying.
- Used Sqoop to load data into HBase and Hive.
- Worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode high availability, capacity planning, and slot configuration.
- Creating Hive tables, dynamic partitions, and buckets for sampling, and working on them using HiveQL.
- Implemented partitioning, dynamic partitions, and bucketing in Hive (a brief sketch follows this section).
- Developed Hive Scripts to create the views and apply transformation logic in the Target Database.
- Monitored multiple Hadoop cluster environments using Cloudera Manager; tracked workload and job performance and collected cluster metrics as required.
- Designed ETL data pipeline flows to ingest data from RDBMS sources into Hadoop using shell scripts, Sqoop, and MySQL.
- Optimized Hive tables using techniques such as partitioning and bucketing to provide better query performance.
- Managed and supported the Teradata EDW, including client tools such as Teradata Studio and SQL Assistant, connecting to the data lake.
- Designed a workflow that can download binary files directly into a cluster or local directory.
- Developed a MapR document that spins up in the OpenShift environment when the job runs successfully.
- Used Sonar to check for code issues, improving code clarity.
Environments/Tools: Apache Hadoop 2.x, HDFS, YARN, MapReduce, Hive, HBase, Splunk, Kafka, LDAP, Kerberos, Oracle Server, MySQL Server, Elasticsearch, Crontab, Core Java, Linux, Bash scripts
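The Hive partitioning and bucketing mentioned above can be sketched as follows. This is a hypothetical example expressed through Spark's Hive support in Python rather than the project's actual HQL scripts; the database, table, and column names are placeholders.

```python
# Sketch: create a partitioned, bucketed Hive table and load it with a
# dynamic-partition insert via Spark's Hive support. Names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-partitioning-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Allow non-strict dynamic partitioning for the insert below.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("""
    CREATE TABLE IF NOT EXISTS curated.orders_part (
        order_id STRING,
        amount   DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    CLUSTERED BY (order_id) INTO 16 BUCKETS
    STORED AS ORC
""")

# Dynamic-partition insert: partitions are derived from the load_date column,
# which must be the last column in the SELECT list.
spark.sql("""
    INSERT OVERWRITE TABLE curated.orders_part PARTITION (load_date)
    SELECT order_id, amount, load_date
    FROM staging.orders_raw
""")
```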
Confidential, Dallas, TX
Hadoop Developer
Responsibilities:
- Ingested data from sources such as Oracle and DB2, performed data transformations, and exported the transformed data to cubes as per the business requirements.
- Handled importing of data from various sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from Oracle, DB2, and Teradata into HDFS using Sqoop.
- Involved in creating Hive tables and loading them with data using HQL scripts, which run internally as MapReduce jobs.
- Wrote custom Hive UDFs in Java where the required functionality was too complex.
- Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning, and buckets.
- Ran various Hive queries on the data dumps and generated aggregated datasets for downstream systems for further analysis.
- Applied window functions, aggregations, and time and date functions on the data as per the business logic (a brief sketch follows this section).
- Developed dynamically partitioned Hive tables and stored data by timestamp and source type for efficient performance tuning.
- Scheduled Sqoop ingestions and Hive transformations (HQL scripts) using the Oozie and Maestro schedulers.
- Worked with different file formats such as TEXTFILE and ORC for Hive querying and processing.
- Queried data using Spark SQL on top of the Spark engine and implemented Spark RDDs in Scala.
- Worked on Apache Spark, writing Python applications to parse and convert txt and xls files.
- Integrated Apache Kafka with Apache Spark for real-time processing.
- Used NoSQL databases such as HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
- Performed various performance optimizations, such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Worked on importing and exporting data between Oracle/DB2 and HDFS/Hive using Sqoop for analysis, visualization, and report generation.
Environment: HDFS, Hive, MapReduce, Java, HBase, Pig, Sqoop, Oozie, MySQL, SQL Server, Windows, and Linux.
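The window-function and date-function work noted above is illustrated with a minimal, hypothetical sketch. The original logic was written in HQL; this restates the same idea in PySpark for consistency with the other sketches, and the table and column names are placeholders.

```python
# Sketch: window functions plus date functions over a Hive table via PySpark.
# Table and column names are hypothetical placeholders.
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number, sum as sum_, to_date

spark = SparkSession.builder.appName("windowing-sketch").enableHiveSupport().getOrCreate()

txns = spark.table("curated.transactions")   # placeholder Hive table

# Latest transaction per account, plus a running total ordered by timestamp.
w_latest = Window.partitionBy("account_id").orderBy(col("txn_ts").desc())
w_running = (
    Window.partitionBy("account_id")
    .orderBy("txn_ts")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

result = (
    txns
    .withColumn("txn_date", to_date(col("txn_ts")))
    .withColumn("rn", row_number().over(w_latest))
    .withColumn("running_amount", sum_("amount").over(w_running))
)

result.filter(col("rn") == 1).select("account_id", "txn_date", "running_amount").show()
```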