
Data Engineer Resume


SUMMARY

  • 9+ years of IT experience in software design, development, implementation, and support of business applications for the Telecom, Health, and Insurance industries
  • Experience in Big Data Hadoop and Hadoop ecosystem components such as MapReduce, Sqoop, Flume, Kafka, Pig, Hive, Spark, Storm, HBase, Airflow, Oozie, and Zookeeper
  • Worked extensively on installing and configuring Hadoop ecosystem components such as Hive, Sqoop, HBase, Zookeeper, and Flume
  • Good knowledge of writing Spark applications in Python (PySpark)
  • Worked on data extraction, transformation, and loading using Hive, Sqoop, and HBase
  • Hands-on experience in designing and developing applications in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle
  • Implemented ETL operations on Big Data platforms
  • Hands-on experience with streaming data ingestion and processing
  • Experienced in designing time-driven and data-driven automated workflows using Airflow (a minimal DAG sketch follows this summary)
  • Used Spark MLlib for predictive intelligence and customer segmentation within Spark Streaming applications
  • Skilled in choosing efficient Hadoop ecosystem components and providing effective solutions to Big Data problems
  • Well versed with Design and Architecture principles to implement Big Data Systems.
  • Experience in configuring Zookeeper to coordinate servers in a cluster and maintain data consistency
  • Skilled in data migration from relational databases to the Hadoop platform using Sqoop
  • Experienced in migrating ETL transformations to Pig Latin scripts, including transformations and join operations
  • Good understanding of MPP databases such as HP Vertica and Impala.
  • Hands on experience in configuring and working with Flume to load the data from multiple sources directly into HDFS
  • Expertise in relational databases such as Oracle, MySQL, and SQL Server
  • Strong analytical and problem-solving skills; highly motivated, good team player with strong communication and interpersonal skills
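
A minimal Airflow DAG sketch of the kind of time-driven workflow described above. The DAG id, schedule, and task commands are hypothetical placeholders rather than actual production jobs.

```python
# Hypothetical daily, time-driven workflow: a Sqoop-style ingest step followed
# by a Hive load step. All names and commands are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_ingest_and_load",         # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 2 * * *",          # time-driven: every day at 02:00
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_from_rdbms",
        bash_command="echo 'sqoop import ...'",  # placeholder for the real ingest command
    )
    load = BashOperator(
        task_id="load_into_hive",
        bash_command="echo 'hive -f load.hql'",  # placeholder for the real load script
    )

    ingest >> load                          # the load runs only after the ingest succeeds
```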

TECHNICAL SKILLS

Big Data Ecosystem: Hadoop, MapReduce, Hive, YARN, Kafka, Flume, Sqoop, Oozie, Zookeeper, Spark.

Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, Amazon EMR

Languages: Python, SQL, Scala

NoSQL Databases: HBase

ETL Tools: Informatica

Operating Systems: UNIX, Linux, macOS, and Windows variants

Development / Build Tools: Eclipse, Ant, Maven, IntelliJ, JUnit

App/Web servers: JBoss and Tomcat

DB Languages: MySQL, PL/SQL, PostgreSQL and Oracle

RDBMS: Teradata, Oracle 9i/10g/11g, MS SQL Server, MySQL, and DB2

PROFESSIONAL EXPERIENCE

Confidential

Data Engineer

Responsibilities:

  • Expertise in designing and deploying Hadoop clusters and various Big Data analytics tools, including Pig, Hive, HBase, Oozie, Sqoop, Flume, Spark, and Impala
  • Ingested data from relational databases into HDFS using Sqoop
  • Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Python
  • Implemented Spark with Python and Spark SQL for faster testing and processing of data (see the Spark SQL sketch after this list)
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala
  • Worked with Spark to create structured data from the pool of unstructured data received.
  • Implemented intermediate functionality such as event and record counts from Flume sinks or Kafka topics by writing Spark programs in Java and Python
  • Documented the requirements including the available code which should be implemented using Spark, Hive, HDFS.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format to HDFS (a streaming sketch follows this list)
  • Experienced in transferring streaming data and data from different sources into HDFS and NoSQL databases
  • Created ETL Mapping with Talend Integration Suite to pull data from Source, apply transformations, and load data into target database.
  • Encoded and decoded JSON objects using PySpark to create and modify DataFrames in Apache Spark
  • Ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra per the business requirements
  • Developed multiple Kafka producers and consumers from scratch per the software requirement specifications
  • Worked with Apache Spark, which provides a fast and general engine for large-scale data processing, integrated with Python
  • Implemented Spark applications in Scala, utilizing DataFrames and the Spark SQL API for faster data processing
  • Streamed data in real time using Spark with Kafka
  • Designed and developed data loading strategies and transformations so the business could analyze the datasets
  • Processed flat files in various formats and stored them in various partition models in HDFS
  • Responsible for building, developing, and testing shared components used across modules
  • Experienced in extracting appropriate features from datasets in order to handle bad, null, and partial records using Spark SQL
  • Collected data using Spark Streaming in near real time, performed the necessary transformations and aggregations to build the data model, and persisted the data in HDFS
  • Implemented Spark with Scala and Spark SQL for faster testing and processing of data; responsible for managing data from different sources
  • Processed input from multiple data sources into the same reducer using GenericWritable and MultipleInputs
  • Involved in performance tuning of Spark jobs using caching and taking full advantage of the cluster environment
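
A minimal PySpark sketch of the Spark SQL usage referenced above: the same Hive-style aggregation expressed once through spark.sql() and once through the DataFrame API. The table and column names are hypothetical, and PySpark is used here for illustration even where the bullets mention Scala.

```python
# Hypothetical example: run one aggregation through Spark SQL and through the
# DataFrame API. In practice the input would be a Hive table rather than an
# in-memory DataFrame.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hive_to_spark_example").getOrCreate()

calls = spark.createDataFrame(
    [("NY", 120), ("NY", 80), ("CA", 200)],
    ["state", "duration"],
)
calls.createOrReplaceTempView("calls")

# 1) Spark SQL: the original Hive-style query runs unchanged.
sql_result = spark.sql(
    "SELECT state, SUM(duration) AS total_duration FROM calls GROUP BY state"
)

# 2) DataFrame API: the same aggregation expressed as transformations.
df_result = calls.groupBy("state").agg(F.sum("duration").alias("total_duration"))

sql_result.show()
df_result.show()
```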
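A minimal sketch of the Kafka-to-HDFS flow described above, written with Spark Structured Streaming rather than the RDD/DStream API mentioned in the bullet; the broker address, topic name, and HDFS paths are hypothetical.

```python
# Hypothetical Structured Streaming job: read a Kafka topic, keep the message
# value as a string, and continuously append the stream to Parquet files on HDFS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka_to_parquet_example").getOrCreate()

# Read from Kafka (requires the spark-sql-kafka connector package on the classpath).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
    .option("subscribe", "cdr_events")                    # hypothetical topic
    .load()
    .selectExpr("CAST(value AS STRING) AS value")         # Kafka payload as text
)

# Write the stream as Parquet into HDFS; checkpointing tracks progress for recovery.
query = (
    events.writeStream.format("parquet")
    .option("path", "hdfs:///data/cdr_events")            # hypothetical output path
    .option("checkpointLocation", "hdfs:///checkpoints/cdr_events")
    .start()
)

query.awaitTermination()
```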

Environment: Hadoop, Hive, Flume, MapReduce, Sqoop, Kafka, Spark, YARN, Cassandra, Oozie, Shell Scripting, Scala, Maven, MySQL

Confidential

Hadoop Developer

Responsibilities:

  • Worked extensively on Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN, Spark, and MapReduce programming
  • Converted the existing relational database model to the Hadoop ecosystem.
  • Worked with Linux systems and RDBMS database on a regular basis in order to ingest data using Sqoop.
  • Strong experience in working with Elastic MapReduce (EMR) and setting up environments on Amazon EC2 instances.
  • Ability to spin up different AWS instances, including EC2-Classic and EC2-VPC, using CloudFormation templates.
  • Collected data from AWS S3 buckets using Spark Streaming in near real time, performed the necessary transformations and aggregations to build the data model, and persisted the data in HDFS
  • Managed and reviewed Hadoop and HBase log files.
  • Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive.
  • Designed and implemented Hive queries and functions for evaluating, filtering, loading, and storing data.
  • Analyzed table data and implemented compression techniques such as Teradata Multi-Value Compression
  • Involved in ETL process from design, development, testing and migration to production environments.
  • Involved in writing the ETL test scripts and guided the testing team in executing the test scripts.
  • Involved in performance tuning of the ETL process by addressing various performance issues at the extraction and transformation stages.
  • Provided guidance to the development team working with PySpark as an ETL platform.
  • Wrote Hadoop MapReduce jobs to run on Amazon EMR clusters and created workflows for running those jobs
  • Generated analytics reports on probe data by writing EMR (Elastic MapReduce) jobs to run on an Amazon VPC cluster and using AWS Data Pipeline for automation.
  • Good understanding of Teradata MPP architecture, including partitioning and primary indexes
  • Good knowledge of Teradata Unity, Teradata Data Mover, OS PDE kernel internals, and backup and recovery
  • Created HBase tables to store variable data formats of data coming from different portfolios.
  • Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing
  • Created partitions and buckets based on state for further processing using bucket-based Hive joins (see the partitioned-table sketch after this list)
  • Involved in transforming data from mainframe tables to HDFS and HBase tables using Sqoop
  • Creating Hive tables and working on them using HiveQL.
  • Created and truncated HBase tables in Hue and took backups of submitter IDs
  • Developed a data pipeline using Kafka to store data in HDFS.
  • Used the Spark API over Hadoop YARN as the execution engine for data analytics with Hive.
  • Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
  • Involved in review of functional and non-functional requirements.
  • Developed ETL processes using Hive and HBase.
  • Worked as an ETL Architect/ETL Technical Lead and provided the ETL framework Solution for the Delta process, Hierarchy Build and XML generation.
  • Prepared the Technical Specification document for the ETL job development.
  • Responsible to manage data coming from different sources.
  • Loaded CDRs into the Hadoop cluster from relational databases using Sqoop and from other sources using Flume.
  • Installed and configured Apache Hadoop, Hive and Pig environment.
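
A minimal PySpark sketch of the partitioning and bucketing approach mentioned above: writing a Hive table partitioned by state and bucketed on a key so bucket-based joins can avoid a full shuffle. The table and column names are hypothetical.

```python
# Hypothetical example: write a Hive table partitioned by state and bucketed
# by customer_id. bucketBy/sortBy must be used together with saveAsTable.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("partition_bucket_example")
    .enableHiveSupport()          # needed so saveAsTable creates a Hive-managed table
    .getOrCreate()
)

customers = spark.createDataFrame(
    [(1, "NY", 120.0), (2, "CA", 75.5), (3, "NY", 12.3)],
    ["customer_id", "state", "bill_amount"],
)

(
    customers.write.partitionBy("state")        # one partition directory per state
    .bucketBy(8, "customer_id")                 # 8 buckets keyed on customer_id
    .sortBy("customer_id")
    .mode("overwrite")
    .saveAsTable("customers_bucketed")          # hypothetical Hive table name
)
```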

Environment: Hadoop, HDFS, Pig, Hive, Flume, Sqoop, Oozie, Python, Shell Scripting, SQL, Talend, Spark, HBase, Elasticsearch, Linux (Ubuntu), Kafka.

Confidential 

Data Analyst

Responsibilities:

  • Gathered data and business requirements from end users and management; designed and built data solutions to migrate existing source data from the data warehouse to the Atlas Data Lake (Big Data)
  • Performed all Technical Data Quality (TDQ) validations, including header/footer, record count, data lineage, data profiling, checksum, empty file, duplicate, delimiter, threshold, and DC validations for all data sources (a simple validation sketch follows this list).
  • Analyzed huge volumes of data and devised simple and complex Hive and SQL scripts to validate data flow in various applications. Performed Cognos report validation. Used MHUB for validating data profiling and data lineage.
  • Devised PL/SQL statements (stored procedures, functions, triggers, views, and packages). Used indexing, aggregation, and materialized views to optimize query performance.
  • Created reports using Tableau, Power BI, and Cognos to perform data validation.
  • Set up a governance process around Tableau dashboard processes
  • Worked with senior management to plan, define, and clarify Tableau dashboard goals, objectives, and requirements.
  • Created Tableau dashboards using stacked bars, bar graphs, scatter plots, geographical maps, Gantt charts, etc., with the Show Me functionality; built dashboards and stories as needed using Tableau Desktop and Tableau Server.
  • Responsible for daily communications to management and internal organizations regarding status of all assigned projects and tasks.
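
A small PySpark sketch of the kind of record-count, duplicate, and empty-file checks listed above; the file path, key column, and rules are hypothetical stand-ins for the TDQ validations described in the bullets.

```python
# Hypothetical data-quality checks: record count, duplicate keys, and null keys
# on an incoming feed. Path and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tdq_checks_example").getOrCreate()

feed = spark.read.option("header", True).csv("hdfs:///landing/claims_feed.csv")

# 1) Empty-file / record-count check.
record_count = feed.count()
if record_count == 0:
    raise ValueError("Empty file: feed contains no records")

# 2) Duplicate check on a hypothetical business key.
duplicate_keys = (
    feed.groupBy("claim_id")
    .count()
    .filter(F.col("count") > 1)
    .count()
)

# 3) Null check on the same required key column.
null_claims = feed.filter(F.col("claim_id").isNull()).count()

print(f"records={record_count}, duplicate_keys={duplicate_keys}, null_claim_ids={null_claims}")
```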
