Data Engineer Resume
SUMMARY
- Overall, 11+ years of IT industry experience, including 4+ years of strong experience in the design and development of Big Data/Apache Hadoop ecosystem components covering data ingestion, data modeling, querying, processing, storage, analysis, and data integration, and in implementing enterprise-level systems that transform Big Data on the cloud using the Apache Spark framework.
- Experience in the design, development, and testing of large-scale distributed data processing solutions using Big Data/Hadoop ecosystem tools such as Spark, Hadoop, YARN, Sqoop, S3, HDFS, Hive, Impala, Spark SQL, Pig, HBase, Oozie, and Kafka, with ORC and Parquet format files on the cloud.
- Experienced in application development using Spark, MapReduce, Hive, Spark SQL, and Linux shell scripting.
- Experience in various Big Data application phases such as data ingestion, data transformation, and data visualization.
- Strong experience in Spark SQL, Spark SQL UDFs, and performance tuning.
- Experience in efficient processing of large datasets using Spark features such as partitions, in-memory capabilities, broadcast variables, joins, RDD transformations, actions, and case classes.
- In-depth understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, and Spark Streaming.
- Experience working with relational databases such as MySQL, Oracle, MS SQL Server and NoSQL databases such as HBase.
- Hands-on experience working with file formats such as ORC, Avro, SequenceFile, Text, and Parquet.
- Strong AWS experience setting up EC2 instances and RDS with Oracle, MySQL, PostgreSQL, Aurora, Redshift, etc., and experience working with S3, EBS, RDS, Lambda functions, and VPC.
- Good troubleshooting skills and an understanding of system capacity, bottlenecks, and the basics of memory, CPU, OS, storage, and networking.
- Developed reports using Tableau 9 for data visualization, reporting, and analysis.
- Ability to meet deadlines and handle multiple tasks; decisive, with strong leadership qualities, flexibility in work schedules, and good communication skills.
- Extensively involved throughout the Software Development Life Cycle (SDLC), from initial planning through implementation, using Agile and Waterfall methodologies.
- Skilled in using version control software such as Git and GitHub, and in using Jira for project tracking.
- Extensively used ETL methodology to support data extraction, transformation, and loading processes in corporate-wide ETL solutions using IBM InfoSphere DataStage 8.5, 9.1, and 11.5.
- Good team player with the ability to solve problems and to organize and prioritize multiple tasks.
- Good interpersonal skills; committed, result-oriented, and hardworking, with a zeal to learn new technologies.
- Flexible and versatile, able to adapt to any new environment and work on any project.
TECHNICAL SKILLS
Big Data Technologies: Spark/Scala, HDFS, MapReduce, Tez, Pig, Sqoop, Hive, Impala, Spark Streaming, Lambda, Kafka, and Oozie
File Formats: ORC, Avro, SequenceFile, Text, JSON, and Parquet
ETL/BI Tools: IBM InfoSphere DataStage, AWS Glue, Tableau
Scripting Languages: Shell Scripting
Distributions: AWS EMR, Cloudera CDH
Database Technologies: Oracle, MS SQL Server, MySQL, HBase
Software IDE Tools: Eclipse and IntelliJ
Integration Tools: Dell Boomi
Version Control Systems: Git and GitHub
Automation tools: Jenkins
Scheduling tools: Control-M, Airflow
PROFESSIONAL EXPERIENCE
Confidential
Data Engineer
Environment: AWS EMR, Spark/Scala, Oracle, SQL Server, S3, Hive, Spark SQL, AWS Athena, Linux, Control-M
Responsibilities:
- Used AWS EMR to provision a fully managed distributed Hadoop framework.
- Implemented the process to migrate data from multiple source systems to AWS S3.
- Built Apache Sqoop jobs for data ingestion from different relational databases (SQL Server and Oracle) into the S3 landing/raw area.
- Wrote shell and Python scripts to extract JSON data from several web portals via REST APIs into an S3 bucket.
- Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files (see the JSON-flattening sketch after this list).
- Wrote Spark/Scala programs to validate the transactional data in Parquet files.
- Created Hive tables on S3 files to implement business logic using Spark DataFrames.
- Developed and enhanced Spark jobs to cleanse and transform the data in Hive tables and S3 files.
- Worked on the design, build and system testing for supporting complex pipelines by aggregating data leveraging Spark/Scala.
- Used broadcast variables, effective and efficient joins, transformations, and other Spark capabilities for data processing.
- Used Spark libraries to push data from S3 into RDBMS tables (see the broadcast-join and JDBC sketch after this list).
- Built Athena external tables to query data in AWS S3.
- Coordinated and completed various Business Intelligence functions, including data preparation (sourcing, acquisition, and integration), data warehousing, data exploration, and information delivery.
- Worked with internal clients to ensure that operational systems and IT functions were maintained at peak efficiency with no disruptions or unscheduled downtime.
- Extensively used Git and GitHub for code repositories, code reviews, and version control, and used JIRA for project tracking.
- Built Control-M jobs to schedule the Spark jobs and coordinated with other platforms to schedule a series of pipeline jobs, including cluster creation, Spark job execution, and S3 data operations.
- Enhanced and supported legacy data warehouse DataStage jobs.
- Reviewed code with the technical team and attended code review sessions.
- Tested all components in the development and test environments and updated the test results document.
- Responsible for production support and involved in on-call rotation.
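JSON-flattening sketch (referenced above): a minimal Spark/Scala example of the preprocessing pattern described in the bullets, assuming hypothetical S3 paths, a nested "items" array, and a "customer" struct; the real job's schema and layout may differ.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

object FlattenJsonJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FlattenJsonPreprocessing")
      .getOrCreate()

    // Read raw JSON documents landed in the S3 raw area (bucket and prefix are assumed).
    val raw = spark.read.json("s3://landing-bucket/raw/transactions/")

    // Flatten one level of nesting: explode the assumed array column and promote struct fields.
    val flat = raw
      .withColumn("item", explode(col("items")))
      .select(
        col("transactionId"),
        col("customer.id").alias("customer_id"),
        col("item.sku").alias("sku"),
        col("item.amount").alias("amount")
      )

    // Write the flattened output back to S3 as Parquet for downstream Hive/Spark use.
    flat.write.mode("overwrite").parquet("s3://landing-bucket/curated/transactions/")

    spark.stop()
  }
}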
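Broadcast-join and JDBC sketch (referenced above): a minimal Spark/Scala example of joining a large dataset against a small broadcast reference set and pushing the result into an RDBMS table; the JDBC URL, table names, and credential handling are hypothetical placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object LoadToRdbmsJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("LoadToRdbms").getOrCreate()

    val transactions = spark.read.parquet("s3://landing-bucket/curated/transactions/")
    val products     = spark.read.parquet("s3://landing-bucket/reference/products/")

    // Broadcast the small reference table so the large side is not shuffled for the join.
    val enriched = transactions.join(broadcast(products), Seq("sku"), "left")

    // Push the result into an RDBMS table through Spark's JDBC data source.
    enriched.write
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL") // assumed endpoint
      .option("dbtable", "ANALYTICS.ENRICHED_TRANSACTIONS")   // assumed target table
      .option("user", sys.env.getOrElse("DB_USER", ""))
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .mode("append")
      .save()

    spark.stop()
  }
}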
Confidential
Big Data Developer
Environment: Cloudera CDH, Spark/Scala, Hive, Pig, RDBMS, Sqoop, Spark SQL, Kafka, HDFS, Oozie, HBase.
Responsibilities:
- Part of the design and implementation team responsible for interacting with users to fully understand requirements and implement them.
- Involved in the requirements and design phases to implement a real-time streaming architecture using Spark and Kafka (see the streaming sketch after this list).
- Used the Spark Core/SQL APIs to read text file formats from HDFS and convert them into DataFrames by applying an explicit schema for better performance (see the HDFS-to-Hive sketch after this list).
- Analyzed data by writing Hive queries (HiveQL) and running Pig scripts (Pig Latin).
- Created Pig scripts for sorting, joining, and grouping the data.
- Worked with SequenceFile and ORC file formats, map-side joins, bucketing, and partitioning to improve Hive performance and storage.
- Loaded processed data into Hive tables and HDFS, and finally into an RDBMS (using Sqoop) for storage and BI requirements.
- Developed watcher jobs using shell scripting to notify process owners of any delay or missed file generation against the agreed schedule.
- Used the Oozie workflow service to create workflows and automate Hive and Pig jobs.
- Effectively followed Agile methodology and participated in sprints and daily scrums to deliver software tasks on time and with good quality, in coordination with onsite and offshore teams.
- Reviewed code with the technical team and attended code review sessions.
- Used JIRA for incident creation and bug tracking, and Bitbucket to check in and check out code changes.
- Involved in the post-implementation support phase of the project.
- Involved in unit testing and unit test case documentation.
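Streaming sketch (referenced above): a minimal Spark Structured Streaming example of consuming Kafka events and landing them on HDFS. Broker addresses, the topic name, and paths are hypothetical; the original pipeline may have used the DStream API instead, and the spark-sql-kafka connector is assumed to be on the classpath.

import org.apache.spark.sql.SparkSession

object KafkaStreamingJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KafkaIngest").getOrCreate()

    // Subscribe to a Kafka topic; each record arrives as key/value byte arrays.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092") // assumed brokers
      .option("subscribe", "transactions")                            // assumed topic
      .option("startingOffsets", "latest")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    // Land the raw events on HDFS; checkpointing provides fault-tolerant progress tracking.
    val query = events.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/streaming/transactions")
      .option("checkpointLocation", "hdfs:///checkpoints/transactions")
      .start()

    query.awaitTermination()
  }
}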
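HDFS-to-Hive sketch (referenced above): a minimal Spark/Scala example of reading delimited text from HDFS with an explicit schema and loading it into a partitioned ORC Hive table. The delimiter, schema, paths, and table/database names are hypothetical placeholders, and the target Hive database is assumed to exist.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

object HdfsTextToHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HdfsTextToHive")
      .enableHiveSupport()
      .getOrCreate()

    // Declaring the schema up front avoids a costly inference pass over the text files.
    val schema = StructType(Seq(
      StructField("order_id", StringType),
      StructField("customer_id", StringType),
      StructField("amount", DoubleType),
      StructField("order_date", StringType)
    ))

    val orders = spark.read
      .option("sep", "|") // assumed delimiter
      .schema(schema)
      .csv("hdfs:///data/raw/orders/")

    // Write into a Hive table stored as ORC, partitioned by date so queries can prune partitions.
    orders.write
      .mode("overwrite")
      .format("orc")
      .partitionBy("order_date")
      .saveAsTable("analytics.orders")

    spark.stop()
  }
}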