Data Engineer Resume
Orlando, FL
SUMMARY
- 8+ years of professional experience in IT, including comprehensive experience working with Big Data/Hadoop ecosystem components, Hive, HBase, Spark, streaming, RDBMS, cloud engineering platforms (AWS, GCP, Azure), Python, Unix, REST APIs and ETL data processing.
- Good understanding of Big Data ecosystem tools and technologies.
- Hands-on working experience with Cloudera (CDH) and Hortonworks (HDP) clusters.
- Good knowledge of creating event-processing data pipelines using NiFi, Kafka and Pub/Sub.
- Expertise in data transformation and analysis using Spark, Pig, Hive and SQL.
- Used Apache Solr to query indexed data for analytics and to accelerate search operations.
- Implemented ETL functionality to process real-time data using Spark's built-in APIs and Spark SQL.
- Worked on real-time data integration using Kafka, NiFi, Pub/Sub pipelines, Spark Streaming and HBase (a minimal streaming sketch follows this summary).
- Worked on ingesting, reconciling, compacting and purging base and incremental table data using Hive and HBase, with job scheduling through Oozie.
- Hands-on experience with AWS services such as S3, EMR, Lambda functions, SQS, SNS and EC2.
- Worked on all major Hadoop ecosystem components, such as HDFS, Hive, Pig, Oozie, Sqoop, MapReduce and YARN, on Cloudera, MapR and Hortonworks distributions.
- Worked on setting up AWS EMR and EC2 clusters and a multi-node Hadoop cluster in the development environment.
- Developed scripts and batch jobs to monitor and schedule various Spark jobs.
- Exposure to Cloudera installation on Azure cloud instances.
- Worked on importing and exporting data between databases such as Oracle, Teradata, MySQL, DB2, Netezza and MS SQL Server and HDFS/Hive tables using Sqoop.
- Extensive work experience with managed, Kudu and external tables.
- Worked on collecting streaming data into HDFS using Kafka and NiFi.
- Wrote Spark SQL, Hive and Pig Latin queries for data analysis to meet business requirements.
- Created base and incremental tables in Hive with partitioning and bucketing, wrote UDFs, and applied query optimization and tuning.
- Worked on NoSQL databases such as HBase, and on Hive-HBase integration using bidirectional tables.
- Experience with the Oozie workflow engine to automate and parallelize Hadoop MapReduce, Hive and Pig jobs.
- Implemented automated workflows and job scheduling using Oozie, ZooKeeper and Ambari.
- Good working experience with Hadoop cluster architecture and cluster monitoring; in-depth understanding of data structures and algorithms.
- Built and configured Apache Tez on Hive and Pig to achieve better response times than MapReduce jobs.
- Extended Hive and Pig core functionality by writing custom UDFs, UDTFs and UDAFs.
- Implemented proofs of concept on the Hadoop stack and various big data analytics tools, including migration from different databases to Hadoop.
- Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on formats such as text, Avro and Parquet files.
- Worked across the full Software Development Life Cycle (SDLC).
- Experienced in creating PL/SQL stored procedures, functions and cursors against Oracle (10g, 11g).
- Key participant in all phases of the software development life cycle: analysis, design, development, integration, implementation, debugging, and testing of software applications in client-server, object-oriented, and web-based environments.
- Experienced in scheduling automated jobs using AutoSys, CA7, BMC Control-M and One Automation scheduling tools.
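The real-time integration bullets above describe Kafka-to-Hadoop pipelines built with Spark Streaming. The snippet below is a minimal, illustrative PySpark Structured Streaming sketch of that pattern rather than code from these projects; the broker address, topic, schema and HDFS paths are hypothetical placeholders, and the spark-sql-kafka connector package is assumed to be available.

```python
# Minimal, illustrative PySpark Structured Streaming sketch (not project code):
# read JSON events from a Kafka topic and append them to a date-partitioned Parquet
# dataset on HDFS. Broker, topic, schema, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_date
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("payload", StringType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
    .option("subscribe", "events_topic")                 # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withColumn("event_date", to_date(col("event_ts")))
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "/data/warehouse/events")             # hypothetical HDFS path
    .option("checkpointLocation", "/data/checkpoints/events")
    .partitionBy("event_date")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```

Partitioning the sink by event date mirrors the partitioned Hive/HDFS layouts referenced throughout this summary.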
TECHNICAL SKILLS
Big Data Ecosystem: HDFS, MapReduce, Pig, Hive, Spark 1.x/2.x, YARN, Kafka 2.6, Flume, Sqoop, Impala, Oozie, ZooKeeper, Ambari, HBase, Beeline, NiFi, StreamSets, Steam-H2O, ELK
Cloud Environment: AWS, Azure, GCP
Hadoop Distributions: Cloudera CDH 6.1/5.12/5, Hortonworks HDP 2.6 & 3.5, MapR
ETL: Talend, Ab Initio
Languages: Python, Unix Shell Scripting, SQL, Spark, Java, Scala
NoSQL Databases: MongoDB, HBase
RDBMS: Oracle 10g/11g, MS SQL Server, DB2, Teradata, Netezza
Testing: MRUnit, Quality Center (QC), HP ALM, Pytest framework
Virtualization: VMware, AWS EC2, GCP (GCE, GAE, GKE)
Build Tools: Maven, Ant, SBT, Git, IBM uDeploy
PROFESSIONAL EXPERIENCE
Confidential, Orlando, FL
Data Engineer
Responsibilities:
- Involved in the complete project life cycle, from design discussions to production deployment.
- Installed Hadoop (MapReduce, HDFS) on AWS and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
- Developed a job server (REST API, Spring Boot, Oracle DB) and job shell for job submission, job profile storage, and job data (HDFS) querying/monitoring.
- Supported the Message Studio and Selligent Marketing Cloud application stacks.
- Implemented solutions for ingesting data from various sources and processing the data using big data technologies such as Hive, Pig, Sqoop, HBase and MapReduce.
- Designed and developed a daily process to incrementally import raw data from DB2 into Hive tables using Sqoop (see the sketch at the end of this section).
- Involved in debugging MapReduce jobs using the MRUnit framework and in optimizing MapReduce jobs.
- Extensively used HiveQL to query data in Hive tables and to load data into Hive tables.
- Developed data pipelines using Flume, Sqoop, Pig and MapReduce to ingest data into HDFS for analysis.
- Used Oozie and ZooKeeper for workflow scheduling and monitoring.
- Effectively used Sqoop to transfer data from relational databases (SQL, Oracle) to HDFS and Hive.
- Integrated Apache Storm with Kafka to perform web analytics; loaded clickstream data from Kafka into HDFS, HBase and Hive through Storm.
- Designed Hive external tables using a shared metastore instead of Derby, with dynamic partitioning and bucketing.
- Worked on big data integration and analytics based on Hadoop, Solr, Spark, Kafka, NiFi and webMethods.
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala.
- Design & implement ETL process using Abinito to load data from Worked extensively with Sqoop for importing and exporting teh data from HDFS to Relational Database systems/mainframe and vice-versa. Loading data into HDFS.
- Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
- Enabled concurrent access to Hive tables with shared/exclusive locks by implementing ZooKeeper in the cluster.
- Strongly recommended adopting Elasticsearch and was responsible for its installation, configuration and administration.
- Used PySpark and SQL for faster testing and processing of data, and streamed data in real time with Kafka.
- Used Oozie operational services for batch processing and dynamic workflow scheduling; worked on end-to-end data pipeline orchestration using Oozie.
- Populated HDFS and Cassandra with massive amounts of data using Apache Kafka.
- Involved in designing and developing Kafka- and Storm-based data pipelines with the infrastructure team.
- Worked on major Hadoop ecosystem components including Hive, Pig, HBase, HBase-Hive integration, PySpark, Sqoop and Flume.
- Developed Hive scripts, Pig scripts and Unix shell scripts for all ETL loading processes, converting files to Parquet in HDFS.
- Worked with Oozie and ZooKeeper to manage job workflows and job coordination in the cluster.
Environment: Hadoop, Hive, Impala, Oracle, Spark, Pig, Sqoop, Oozie, MapReduce, SQL, Ab Initio, AWS (S3, Redshift, CFT, EMR, CloudWatch), Kafka, ZooKeeper, PySpark
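As referenced above, the daily DB2-to-Hive incremental load paired a Sqoop import into a staging table with a dynamically partitioned external Hive table. The PySpark/Spark SQL sketch below is an illustrative reconstruction of the merge step only, under stated assumptions: the sales.orders and sales.orders_staging names, columns and HDFS location are hypothetical, and the staging table is assumed to be populated by the Sqoop job.

```python
# Illustrative reconstruction (assumptions noted above): merge a Sqoop-staged daily
# extract into a dynamically partitioned external Hive table with Spark SQL.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("daily-incremental-load-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Allow dynamic partition inserts for the daily load.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

spark.sql("CREATE DATABASE IF NOT EXISTS sales")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales.orders (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(12,2)
    )
    PARTITIONED BY (load_date STRING)
    STORED AS PARQUET
    LOCATION '/data/warehouse/sales/orders'
""")

# sales.orders_staging is assumed to be populated by the daily Sqoop incremental import.
spark.sql("""
    INSERT OVERWRITE TABLE sales.orders PARTITION (load_date)
    SELECT order_id, customer_id, amount, load_date
    FROM sales.orders_staging
""")
```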
Confidential, Bellevue, WA
Data Engineer
Responsibilities:
- Created pipelines in ADF using linked services, datasets and pipelines to extract, transform and load data between sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse and a write-back tool.
- Developed Spark applications using Scala and Spark SQL for data extraction, transformation and aggregation across multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Responsible for estimating cluster size and for monitoring and troubleshooting the Hadoop cluster.
- Used Zeppelin, Jupyter notebooks and spark-shell to develop, test and analyze Spark jobs before scheduling customized Spark jobs.
- Undertook data analysis and collaborated with the downstream analytics team to shape the data according to their requirements.
- Developed and supervised the development of ETL pipelines from diverse sources (point-of-sale system, Google, Selligent, IntelliShop) into a centralized data warehouse.
- Experienced in performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism, and memory tuning.
- Wrote UDFs in PySpark to meet specific business requirements (see the sketch at the end of this section).
- Used Kusto Explorer for log analytics and faster query response.
- Replaced the existing MapReduce programs and Hive queries with Spark applications written in Scala.
- Deployed and tested the developed code (CI/CD) using Visual Studio Team Services (VSTS).
- Conducted code reviews for team members to ensure proper test coverage and consistent code standards.
- Responsible for documenting the process and cleaning up unwanted data.
- Responsible for data ingestion and for maintaining production pipelines for real business needs.
- Expertise in creating HDInsight clusters and storage accounts with an end-to-end environment for running jobs.
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the Cosmos activity.
- Hands-on experience developing PowerShell scripts for automation.
- Created build and release pipelines for multiple projects (modules) in the production environment using Visual Studio Team Services (VSTS).
- Hands-on experience with Spark SQL queries and DataFrames: importing data from data sources, performing transformations and read/write operations, and saving results to output directories in HDFS.
- Involved in running Cosmos scripts in Visual Studio 2017/2015 to check diagnostics.
- Worked in an Agile development environment with two-week sprint cycles, dividing and organizing tasks.
Environment: Hadoop, MapReduce, HDFS, Pig, Hive, Spark, Kafka, IntelliJ, Cosmos, SBT, Zeppelin, YARN, PySpark, Scala, SQL, Git.
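The PySpark UDF work mentioned above can be illustrated with a minimal sketch; the normalize_sku rule, column names and sample data below are hypothetical stand-ins for the actual business logic.

```python
# Minimal PySpark UDF sketch; the SKU-normalization rule and sample data are
# hypothetical stand-ins for the actual business requirement.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("pyspark-udf-sketch").getOrCreate()

@udf(returnType=StringType())
def normalize_sku(raw_sku):
    """Trim whitespace and upper-case a product SKU; return None for empty values."""
    if raw_sku is None:
        return None
    cleaned = raw_sku.strip().upper()
    return cleaned or None

df = spark.createDataFrame([(" ab-123 ",), ("cd-456",), (None,)], ["sku"])
df.withColumn("sku_clean", normalize_sku(col("sku"))).show()
```

In practice, logic this simple is usually better expressed with built-in functions (trim/upper) so Spark can optimize it; a Python UDF is shown only because the bullet calls one out.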
Confidential, DC
BIG DATA ENGINEER
Responsibilities:
- Anchored artifacts for multiple milestones (application design, code development, testing and deployment) in the software lifecycle.
- Developed an Apache Storm program to consume alarms streamed in real time from Kafka, enrich them, and pass them to the EEIM application.
- Created a rules engine in Apache Storm to categorize alarms into detection, interrogation and association types before processing.
- Developed an Apache Spark program in Python (PySpark) to establish a connection between MongoDB and the EEIM application.
- Responsible for developing the EEIM application as an Apache Maven project and committing the code to Git.
- Analyzed alarms and enhanced the EEIM application using Apache Storm to predict the root cause of an alarm and the exact device where the network failure occurred.
- Stored EEIM alarm data in MongoDB (NoSQL) and retrieved it when necessary.
- Built Fiber-to-the-Node (FTTN) and Fiber-to-the-Premises (FTTP) topologies using Apache Spark and Apache Hive.
- Reviewed system performance and re-evaluated the platform by running full system regression tests under heavy data load, capturing logs and performance metrics.
- Processed system logs using Logstash, stored them in Elasticsearch, and created dashboards using Kibana.
- Regularly tuned Hive query performance to improve data processing and retrieval.
- Provided technical support for debugging, code fixes, platform issues, missing data points, unreliable data-source connections and big data transit issues.
- Developed Java and Python applications to call external REST APIs to retrieve weather, traffic and geocode information (see the sketch at the end of this section).
- Worked on the TAX POC from scratch using Sqoop and Hive.
- Reviewed unit, integration, system and regression test results for data pipelines in the development environment and provided go/no-go decisions for promotion to production.
- Conducted code reviews on a regular basis and ad hoc/on demand when AT&T deemed necessary.
- Created simulation tools and data sets for unit and integration testing.
- Provided advanced yet simple approaches by researching the data using machine learning and deep learning techniques.
- Provided analytics on the most frequently failing equipment in the topology using the Steam-H2O analytics tool and built a dashboard.
- Experienced with Jira, Bitbucket, source control systems such as Git and SVN, and development tools such as Jenkins and Artifactory.
Environment: PySpark, MapReduce, HDFS, Sqoop, Flume, Kafka, Hive, Pig, HBase, SQL, Shell Scripting, Eclipse, SQL Developer, Git, SVN, JIRA, Unix.
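The REST API enrichment work above can be illustrated with a small Python sketch; the geocoding endpoint, query parameters and response fields are hypothetical placeholders, not the actual services used on this project.

```python
# Small Python sketch of REST-based enrichment; the geocoding endpoint, parameters,
# and response fields are hypothetical placeholders.
import requests


def fetch_geocode(address, timeout=10):
    """Call a geocoding REST endpoint and return (lat, lon), or None if nothing is found."""
    resp = requests.get(
        "https://geocode.example.com/v1/search",   # hypothetical endpoint
        params={"q": address, "format": "json"},
        timeout=timeout,
    )
    resp.raise_for_status()
    results = resp.json()
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])


if __name__ == "__main__":
    print(fetch_geocode("350 5th Ave, New York, NY"))
```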
Confidential, Kansas City, MO
DATA ANALYST/ENGINEER
Responsibilities:
- Involved in the complete project life cycle, from design discussions to production deployment.
- Worked closely with the business team to gather requirements and new support features.
- Developed a 16-node cluster while designing the data lake on the Cloudera distribution.
- Responsible for building scalable distributed data solutions using Hadoop.
- Implemented and configured a high-availability Hadoop cluster.
- Installed and configured Hadoop clusters with the required services (HDFS, Hive, HBase, Spark, ZooKeeper).
- Developed Hive scripts to analyze data; PHI was categorized into different segments, and promotions were offered to customers based on those segments.
- Extensive experience in writing Pig scripts to transform raw data into baseline data.
- Developed UDFs in Java as needed for use in Pig and Hive queries.
- Worked on Oozie workflow engine for job scheduling.
- Created Hive tables and partitions and loaded the data for analysis using HiveQL queries.
- Created different staging tables, such as ingestion and preparation tables, in the Hive environment.
- Optimized Hive queries and used Hive on top of the Spark engine.
- Worked on sequence files, map-side joins, bucketing, and static and dynamic partitioning for Hive performance enhancement and storage improvement (see the sketch at the end of this section).
- Experience retrieving data from Oracle using PHP and Java.
- Tested Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, on Pig and Hive jobs.
- Created tables in HBase to store variable-format data coming from different upstream sources.
- Experience in managing and reviewing Hadoop log files.
- Good understanding of ETL tools and how they can be applied in a Big Data environment.
- Followed Agile methodologies while working on the project.
- Bug fixing and 24/7 production support for running processes.
Environment: Hadoop, MapReduce, HDFS, Sqoop, Flume, Kafka, Hive, Pig, HBase, SQL, Shell Scripting, Eclipse, DBeaver, DataGrip, SQL Developer, IntelliJ, Git, SVN, JIRA, Unix
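The partitioning and bucketing work described above can be sketched as follows. This is a hedged illustration using Spark's native bucketing via saveAsTable rather than the original Hive DDL; the database, table and column names are hypothetical.

```python
# Hedged sketch of the partitioning/bucketing pattern described above, using Spark's
# native bucketing via saveAsTable (not the original Hive DDL). Database, table, and
# column names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partition-bucket-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Assume a staging (ingestion) table like the ones described above already exists.
staged = spark.table("staging.member_claims")

(
    staged.write
    .mode("overwrite")
    .partitionBy("claim_date")    # partition pruning on the date column
    .bucketBy(32, "member_id")    # bucket on the usual join key
    .sortBy("member_id")
    .format("parquet")
    .saveAsTable("curated.member_claims")
)

# Joins on the bucketed column can then avoid shuffling the bucketed side.
members = spark.table("curated.members")   # hypothetical dimension table
spark.table("curated.member_claims").join(members, "member_id").explain()
```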
Confidential
DATA ANALYST
Responsibilities:
- Developed Oracle PL/SQL stored procedures, functions, packages and SQL scripts (see the sketch at the end of this section).
- Participated in designing database schemas to ensure that relationships between data are enforced by tightly bound key constraints.
- Worked with users and application developers to identify business needs and provide solutions.
- Created Database Objects, such as Tables, Indexes, Views, and Constraints.
- Extensive experience with data definition, data manipulation, data query and transaction control language.
- Enforced database integrity using primary keys and foreign keys.
- Tuned pre-existing PL/SQL programs for better performance.
- Created many complex SQL queries and used them in Oracle Reports to generate reports.
- Implemented data validations using Database Triggers.
- Used import/export utilities such as UTL_FILE for data transfer between tables and flat files.
- Performed SQL tuning using EXPLAIN PLAN.
- Provided support during project implementation.
- Worked with built-in Oracle standard packages such as DBMS_SQL, DBMS_JOB and DBMS_OUTPUT.
- Created and implemented report modules in the database from the client system using Oracle Reports, per business requirements.
- Used dynamic PL/SQL procedures during package creation.
Environment: Oracle 9i, Oracle Reports, SQL, PL/SQL, SQL*Plus, SQL*Loader, Unix, Windows XP.
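To keep code samples in Python like the rest of this document, the sketch below shows how one of the PL/SQL stored procedures described above might be invoked from Python via cx_Oracle. The credentials, DSN, package, procedure name and parameter are hypothetical placeholders, not the original implementation.

```python
# Hypothetical sketch: invoking a PL/SQL stored procedure of the kind described above
# from Python via cx_Oracle. Credentials, DSN, package, procedure, and parameter are
# placeholders, not the original implementation.
import cx_Oracle


def refresh_monthly_report(report_month):
    """Call a reporting package procedure that rebuilds one month's report data."""
    with cx_Oracle.connect(user="report_user",
                           password="report_pwd",
                           dsn="dbhost:1521/ORCLPDB") as connection:
        with connection.cursor() as cursor:
            cursor.callproc("reporting_pkg.refresh_monthly_report", [report_month])
        connection.commit()


if __name__ == "__main__":
    refresh_monthly_report("2009-06")
```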