Data Engineer (Hadoop Developer) Resume
SUMMARY
- 6+ years of strong experience working on Apache Hadoop ecosystem components such as MapReduce, HDFS, HBase, Hive, Spark, Sqoop, Pig, Zookeeper, and Flume, and on AWS cloud computing with EC2 and S3.
- Worked with distributions of Hadoop like Hortonworks.
- Excellent understanding of Hadoop architecture and various components such as HDFS, Job Tracker, Task Tracker, NameNode, DataNode and MapReduce programming paradigm.
- Extensively worked on Spark with Scala on the cluster for computational analytics; installed Spark on top of Hadoop and built advanced analytical applications using Spark SQL.
- Experience in Hadoop cluster performance tuning by gathering and analyzing the existing infrastructure.
- Uploaded and processed terabytes of data from various structured and unstructured sources into HDFS using Sqoop.
- Extensively worked on Spark Streaming and Apache Kafka to fetch live stream data.
- Experience in converting Hive/SQL queries into RDD transformations using Apache Spark, Scala and Python.
- Developed complex Hive queries for business use cases and performed performance tuning for long-running queries.
- Experience in data processing tasks such as collecting, aggregating, and moving data from various sources like Kafka.
- Experience in improving the performance and optimization of existing algorithms in Hadoop using SparkContext, SQLContext, SparkSession, Datasets, Pair RDDs, and Spark on YARN.
- Developed a Spark application to filter JSON source data from HDFS and store the filtered data in S3 buckets (a minimal sketch follows this summary).
- Experience in creating Spark Scala JARs using the IntelliJ IDE and executing them.
- Developed NiFi flows to move data from different sources to HDFS and from HDFS to S3 buckets.
- Experienced in moving data from different sources using Kafka producers and consumers and preprocessing the data with NiFi.
- Good knowledge of various scripting languages, including Linux/Unix shell scripting and Python.
- Good working experience on different file formats (TEXTFILE, AVRO, ORC and EBCDIC).
- Experience with Amazon AWS services such as EMR, EC2, S3, CloudFormation, Redshift, and DynamoDB, which provide fast and efficient processing of Big Data.
- Migrated an existing on-premises application to AWS; used AWS services like EC2 and S3 for small data set processing and storage.
- Proficient in Data Warehousing, Data Mining concepts and ETL transformations from Source to target systems.
- Used Ab Initio for extraction, transformation, and loading (ETL) of information from numerous sources such as flat files, XML documents, and databases.
- Exported and imported data to and from Teradata and Oracle using SQL Developer for analysis.
- Good experience in using Sqoop for data pulls from Teradata and Oracle.
- Scheduled ETL jobs using IBM Maestro and Autosys.
- Experience in integrating various data sources with multiple relational databases such as Oracle, SQL Server, and MS Access, and in integrating data from flat files.
- Good experience with Agile engineering practices, Scrum, Test-Driven Development, and Waterfall methodologies.
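Illustrative sketch of the kind of Spark filtering job described above (Scala); the HDFS path, filter condition, and S3 bucket name are hypothetical examples, not actual project code:

    import org.apache.spark.sql.SparkSession

    object JsonFilterToS3 {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("FilterJsonToS3").getOrCreate()

        // Read semi-structured JSON source data from HDFS (hypothetical path)
        val source = spark.read.json("hdfs:///data/source/events/")

        // Keep only the records of interest (hypothetical filter condition)
        val filtered = source.filter("status = 'ACTIVE' AND event_id IS NOT NULL")

        // Store the filtered data in an S3 bucket, written here in Parquet format
        filtered.write.mode("overwrite").parquet("s3a://example-bucket/filtered/events/")

        spark.stop()
      }
    }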
TECHNICAL SKILLS
Big Data Technologies: Hadoop, MapReduce, Pig, Hive, HBase, Sqoop, YARN, Flume, Oozie, Kafka, Apache Solr, Apache NiFi, Proteus (open-source ETL tool for Big Data)
ETL Tools: Ab Initio, Informatica PowerCenter.
Databases: Oracle, SQL Server, MySQL, NoSQL, RDS, Redshift, HBase and Teradata
Programming: Python, Scala, Java, SQL, Linux, shell scripting
AWS Tools: AWS EC2, S3, EMR, Redshift, DynamoDB, IAM, AWS Data Pipeline, AWS Kinesis.
IDEs: Eclipse, Jupyter Notebook, IntelliJ
PROFESSIONAL EXPERIENCE
Confidential
Data Engineer (Hadoop Developer)
Responsibilities:
- Working on developing a data pipeline that migrates an existing Ab Initio ETL process to an ETL process built with NiFi and Spark.
- Created Spark jobs that process the source files and performed various transformations on the source data using the Spark DataFrame/Dataset and Spark SQL APIs.
- Currently responsible for understanding the various factors that affect Spark job performance and for tuning the jobs accordingly.
- Part of the initial data pipeline architecture design that defines the data movement from source to Hadoop.
- Worked with the data modeling team to create Hive tables that improve the performance of the queries run against them.
- Worked on creating custom NiFi flows for batch processing.
- The data pipeline includes Apache Spark, Apache NiFi and Hive.
- Excellent knowledge of the standard FlowFile processors in NiFi used for data routing, transformation, and mediation between systems, e.g. GetFile, PutFile, PutKafka, and PutHDFS.
- Strong hands-on experience publishing messages to various Kafka topics using Apache NiFi and consuming the messages into HBase and MySQL tables using Spark and Scala.
- Assisted in developing, testing and enhancing Hive and Spark scripts for building the ETL (extract, transform and load) pipeline for this program.
- Involved in establishing a connection between Spark and MySQL using JDBC connectors and mapping HDFS files to MySQL tables (a minimal JDBC sketch appears at the end of this section).
- Developed Sqoop scripts to migrate data from Teradata and Oracle to the big data environment.
- Extensively worked with Avro and Parquet files and converted data between the two formats; parsed semi-structured JSON data and converted it to Parquet using DataFrames in PySpark.
- Created Hive DDL on Parquet and Avro data files residing in HDFS.
- Involved in file movements between HDFS and AWS S3 using NiFi.
- Worked with different file formats like JSON, Avro, and Parquet, and compression techniques like Snappy.
- Developed shell scripts to add dynamic partitions to the Hive staging table, verify JSON schema changes in source files, and check for duplicate files in the source location.
- Converted Hive queries into Spark transformations using Spark RDDs.
- Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, Pair RDDs, and Spark on YARN.
- Wrote ETL scripts using Hive and processed the data as per business logic.
- Developed a preprocessing job using Spark DataFrames to flatten JSON files into flat files.
- Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.
- Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
- Imported data from different sources such as AWS S3 and the local file system into Spark RDDs.
Environment: Hadoop, AWS, HDP, Elastic MapReduce, Hive, Spark, Kafka, Python, S3, SQL Workbench, ETL jobs, Ab Initio, IBM Maestro, Autosys, NiFi, Teradata, Oracle.
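A minimal sketch of the Spark-to-MySQL JDBC mapping mentioned above (Scala); the JDBC URL, credentials, HDFS path, and table name are hypothetical, and the actual job used the project's own connection settings:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object HdfsToMySqlMapping {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("HdfsToMySqlMapping").getOrCreate()

        // Hypothetical connection details; real values came from project configuration
        val jdbcUrl = "jdbc:mysql://dbhost:3306/analytics"
        val props = new java.util.Properties()
        props.setProperty("user", "etl_user")
        props.setProperty("password", sys.env.getOrElse("DB_PASSWORD", ""))
        props.setProperty("driver", "com.mysql.cj.jdbc.Driver")

        // Read staged files from HDFS (Parquet in this sketch)
        val staged = spark.read.parquet("hdfs:///data/stage/claims/")

        // Map the HDFS data onto a MySQL table through the JDBC connector
        staged.write.mode(SaveMode.Append).jdbc(jdbcUrl, "claims_summary", props)

        // The reverse direction: load a MySQL table back into a DataFrame
        val fromMySql = spark.read.jdbc(jdbcUrl, "claims_summary", props)
        fromMySql.show(10)

        spark.stop()
      }
    }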
Confidential
Hadoop Developer
Responsibilities:
- Implemented solutions utilizing advanced Big Data/Hadoop distribution frameworks: MapReduce, HBase, Zookeeper, YARN, Hive, Spark, Pig, Flume, Sqoop, and Kafka on the Hortonworks environment.
- Performed Data Ingestion, Batch Processing, Data Extraction, Transformation, Loading and Real Time Streaming using Hadoop Frameworks.
- Optimized Audit Control Framework (File count, Record Count, Claim Count) on Incoming and Outgoing files from and to Data Lake, and automated the process to get daily mails.
- Worked with Spark on improving the performance and optimization of existing applications in Hadoop using SparkContext, Spark SQL, DataFrames, Pair RDDs, and Spark on YARN.
- Expert in importing and exporting terabytes of data into HDFS and Hive using Sqoop, WebSphere, Kafka, and Flume from traditional relational database systems, online feeds, web applications, etc.
- Imported data from different sources (HDFS/HBase) into Spark RDDs.
- Developed Kafka consumers in Scala for consuming data from Kafka topics.
- Implemented Data Ingestion in real time processing using Kafka.
- Expertise in integrating Kafka with Spark streaming for high speed data processing.
- Developed multiple Kafka Producers and Consumers as per the software requirement specifications.
- Configured Spark Streaming to receive real time data and store the stream data to HDFS.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Documented the requirements including the available code which should be implemented using Spark, Hive and HDFS.
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved the data in Parquet format in HDFS (a minimal streaming sketch appears at the end of this section).
- Developed a data pipeline using Kafka, HBase, Mesos, Spark, and Hive to ingest, transform, and analyze customer behavioral data.
- Loaded CSV/TXT/Avro/Parquet files using Scala in the Spark framework, processed the data by creating Spark DataFrames and RDDs, and saved the files in Parquet format in HDFS for loading into the fact table using the ORC reader.
- Good knowledge in setting up batch intervals, split intervals and window intervals in Spark Streaming using Scala Programming language.
- Imported real time weblogs using Kafka as a messaging system and ingested the data to Spark Streaming.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala.
- Used Spark Streaming APIs to perform transformations and actions on the fly for building common learner data model which gets the data from Kafka in near real time and persist it to Cassandra.
- Ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra as per the business requirements.
- Worked on PySpark SQL tasks to fetch the non-null data from two different tables and load the results.
- Worked closely with the Technical and Data Architects in resolving complex Data Movement and suggested workable Models.
Environment: Hadoop, Hive, YARN, Spark, Scala, Python, Java, Eclipse, UNIX, Linux, Cassandra, MySQL, Ab Initio, IBM Maestro, Autosys, NiFi, Teradata, Oracle.
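A minimal sketch of the Kafka-to-HDFS streaming work described above, written here with Spark Structured Streaming rather than the DStream API used on the project; the broker address, topic name, schema, and paths are hypothetical:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

    object KafkaToHdfsParquet {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("KafkaToHdfsParquet").getOrCreate()
        import spark.implicits._

        // Hypothetical payload schema for the weblog events
        val schema = new StructType()
          .add("userId", StringType)
          .add("action", StringType)
          .add("ts", TimestampType)

        // Receive the real-time feed from Kafka
        val raw = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "weblogs")
          .load()

        // Parse the JSON payload into typed columns
        val events = raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json($"json", schema).as("e"))
          .select("e.*")

        // Persist the stream to HDFS in Parquet format
        val query = events.writeStream
          .format("parquet")
          .option("path", "hdfs:///data/weblogs/parquet/")
          .option("checkpointLocation", "hdfs:///checkpoints/weblogs/")
          .start()

        query.awaitTermination()
      }
    }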
Confidential
Hadoop Developer
Responsibilities:
- Replaced the default metadata storage system for Hive with MySQL.
- Executed queries using Hive and developed Map-Reduce jobs to analyze data.
- Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS.
- Performed load testing by increasing throughput and comparing the processing speed of the application.
- Developed Hive queries for the analysts.
- Applied analytical and problem-solving skills to the Big Data domain.
- Involved in implementing Scala code for Spark RDDs (a minimal sketch appears at the end of this section).
- Implemented and exposed Spark SQL/HiveQL queries.
- Utilized the Apache Hadoop distribution environment from Hortonworks.
- Involved in loading data from Linux and UNIX file systems to HDFS.
- Supported setting up the QA environment and updating configurations for implementing scripts with Hive.
Environment: Apache Hadoop, MapReduce, Pig, Hive, Sqoop, Flume, Kafka, HBase, Spark, Scala.
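A minimal sketch of the Spark/RDD and Spark SQL work described above (Scala); the log layout, HDFS path, and view name are hypothetical examples rather than the project's actual code:

    import org.apache.spark.sql.SparkSession

    object WebLogRddAnalysis {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("WebLogRddAnalysis").getOrCreate()
        val sc = spark.sparkContext
        import spark.implicits._

        // Hypothetical web-server log layout: timestamp<TAB>url<TAB>statusCode
        val lines = sc.textFile("hdfs:///data/weblogs/raw/")

        // RDD transformations: parse each line and count hits per URL
        val hits = lines.map(_.split("\t"))
          .filter(_.length >= 3)
          .map(fields => (fields(1), 1))
          .reduceByKey(_ + _)

        // Expose the result through Spark SQL for the analysts
        val hitsDf = hits.toDF("url", "hit_count")
        hitsDf.createOrReplaceTempView("url_hits")
        spark.sql("SELECT url, hit_count FROM url_hits ORDER BY hit_count DESC LIMIT 20").show()

        spark.stop()
      }
    }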
Confidential
ETL Developer
Responsibilities:
- Worked as a junior developer on the complex ETL development projects with multiple team members.
- Worked as lead for the creation of all design review artifacts during the project design phase; facilitated design reviews, captured review feedback, and scheduled additional detailed design sessions as necessary.
- Involved in the requirements definition and analysis in support of the extensive data analysis.
- Designed and developed Logical and Physical Models using Erwin.
- Developed demographic reports on daily production loads to forecast data behavior.
- Tuned existing Ab Initio processes for efficiency using parallel components, lookup files, and MFS concepts.
- Developed a process for efficiently testing changes to the current production systems.
- Developed UNIX shell scripts and was extensively involved in automating the entire load process through Autosys.
- Created different types of profiling reports to perform functional dependency analysis.
- Involved in discussions with BA's and Business users in building business rules for the ETL applications.
- Created batch processes using FastLoad, FastExport, MultiLoad, UNIX shell, and Teradata SQL to transfer, clean up, and summarize data.
- Used Teradata utilities to load data into incremental/staging tables and then move data from staging into base tables.
- Integrated many common environment APIs into the ETL applications to increase code reusability and reduce development resource hours.
- Developed detailed, easy-to-understand tech specs to make coding easier for the offshore team.
- Worked on creating Application Support Guide, Implementation Plans, Trace Matrix and Application SLAs.
- Designed many UNIX APIs to generate dynamic XFRs and dynamic DMLs.
- Wrote various shell wrappers to incorporate Ab Initio graphs and shell script calls.
- Designed and developed Tivoli Work Scheduler scripts as database outage friendly by using global resources.
- Implemented many defensive programming techniques to avoid customer-impacting issues by detecting them ahead of time in the process.
- Worked with high volumes of data files on Windows and NT platforms.
- Good exposure to handling mainframe EBCDIC-format files with Ab Initio.
- Team member for Design, Coding and Implementation
- Design, Development, and Enhancement of new functionalities
- Defined production support methodologies and strategies.
- Logged the different issues arising during the development phase.
Environment: Ab Initio, Teradata, UNIX, IBM Maestro, Ctrl-M scheduler, Oracle 10g, Salesforce.com, SAS Marketing Automation, DB2.