Data Engineer Resume
SUMMARY
- Overall, 11+ years of IT industry experience, including 4+ years of strong experience in the design and development of Big Data/Apache Hadoop ecosystem components covering data ingestion, data modeling, querying, processing, storage, analysis, and data integration, and in implementing enterprise-level systems that transform Big Data on the cloud using the Apache Spark framework.
- Experience in the design, development, and testing of large-scale distributed data processing solutions using Big Data/Hadoop ecosystem tools such as Spark, Hadoop, YARN, Sqoop, S3, HDFS, Hive, Impala, Spark SQL, Pig, HBase, Oozie, and Kafka, with ORC and Parquet format files on the cloud.
- Experienced in application development using Spark, MapReduce, Hive, Spark SQL, and Linux shell scripting.
- Experience in various Big Data application phases such as data ingestion, data transformation, and data visualization.
- Strong experience in Spark SQL, Spark SQL UDFs, and performance tuning.
- Experience in efficient processing of large datasets using Spark features such as partitions, in-memory capabilities, broadcast variables, joins, RDD transformations, actions, and case classes.
- In-depth understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, and Spark Streaming.
- Experience working with relational databases such as MySQL, Oracle, MS SQL Server and NoSQL databases such as HBase.
- Hands-on experience working with file formats such as ORC, Avro, SequenceFile, Text, and Parquet.
- Strong AWS experience setting up EC2 instances and RDS with Oracle, MySQL, PostgreSQL, Aurora, Redshift, etc., and experience working with S3, EBS, RDS, Lambda functions, and VPC.
- Good troubleshooting skills and an understanding of system capacity, bottlenecks, and the basics of memory, CPU, OS, storage, and networking.
- Developed reports using Tableau 9 for data visualization, reporting, and analysis.
- Ability to meet deadlines and handle multiple tasks; decisive, with strong leadership qualities, flexibility in work schedules, and good communication skills.
- Extensively involved throughout the Software Development Life Cycle (SDLC), from initial planning through implementation, using Agile and Waterfall methodologies.
- Skilled in using version control software such as Git and GitHub, and in using Jira for project tracking.
- Extensively used ETL methodology to support data extraction, transformation, and loading processes in corporate-wide ETL solutions using IBM InfoSphere DataStage 8.5, 9.1, and 11.5.
- Good team player with the ability to solve problems and to organize and prioritize multiple tasks.
- Good interpersonal skills; committed, result-oriented, and hardworking, with a zeal to learn new technologies.
- Flexible and versatile, able to adapt to any new environment and work on any project.
TECHNICAL SKILLS
Big Data Technologies: Spark/Scala, HDFS, MapReduce, Tez, Pig, Sqoop, Hive, Impala, Spark Streaming, Lambda, Kafka, and Oozie
File Formats: ORC, Avro, SequenceFile, Text, JSON, and Parquet
ETL/BI Tools: IBM InfoSphere DataStage, AWS Glue, Tableau
Scripting Languages: Shell Scripting
Distributions: AWS EMR, Cloudera CDH
Database Technologies: Oracle, MS SQL Server, MySQL, HBase
Software IDE Tools: Eclipse and IntelliJ
Integration Tools: Dell Boomi
Version Control Systems: Git and GitHub
Automation tools: Jenkins
Scheduling tools: Control-M, Airflow
PROFESSIONAL EXPERIENCE
Confidential
Data Engineer
Environment: AWS EMR, Spark/Scala, Oracle, SQL Server, S3, Hive, Spark SQL, AWS Athena, Linux, Control-M
Responsibilities:
- Used AWS EMR to provision a fully managed distributed Hadoop framework.
- Implemented the process to migrate data from multiple source systems to AWS S3.
- Built Apache Sqoop jobs for data ingestion from different relational databases (SQL Server and Oracle) into the S3 landing/raw area.
- Wrote shell and Python scripts to extract JSON data from several web portals via REST APIs into an S3 bucket.
- Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files (see the JSON-flattening sketch after this list).
- Wrote Spark/Scala programs to validate the transactional data in Parquet files.
- Created Hive tables on S3 files to implement business logic using Spark DataFrames.
- Developed and enhanced Spark jobs to cleanse and transform the data in Hive tables and S3 files.
- Worked on the design, build and system testing for supporting complex pipelines by aggregating data leveraging Spark/Scala.
- Used broadcast variables, effective and efficient joins, transformations, and other Spark capabilities for data processing.
- Used Spark libraries to push data from S3 into RDBMS tables (see the broadcast-join and JDBC sketch after this list).
- Built Athena external tables to query data in AWS S3.
- Coordinated and completed various Business Intelligence functions, including data preparation (sourcing, acquisition, and integration), data warehousing, data exploration, and information delivery.
- Worked with internal clients to ensure that operational systems and IT functions were maintained at peak efficiency with no disruptions or unscheduled downtime.
- Extensively used Git and GitHub for code repositories, code reviews, and version control, and used JIRA for project tracking.
- Built Control-M jobs to schedule the Spark jobs and coordinated with other platforms to schedule a series of pipeline jobs, including cluster creation, Spark job execution, and S3 data operations.
- Enhanced and supported legacy data warehouse DataStage jobs.
- Reviewed code with the technical team and attended code review sessions.
- Tested all components in the development and test environments and updated the test results document.
- Responsible for production support and involved in on-call rotation.
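JSON-flattening sketch (referenced above): a minimal Spark/Scala example of the preprocessing pattern described in the bullets, assuming hypothetical S3 paths, a nested "items" array, and a "customer" struct; the real job's schema and layout may differ.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

object FlattenJsonJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FlattenJsonPreprocessing")
      .getOrCreate()

    // Read raw JSON documents landed in the S3 raw area (bucket and prefix are assumed).
    val raw = spark.read.json("s3://landing-bucket/raw/transactions/")

    // Flatten one level of nesting: explode the assumed array column and promote struct fields.
    val flat = raw
      .withColumn("item", explode(col("items")))
      .select(
        col("transactionId"),
        col("customer.id").alias("customer_id"),
        col("item.sku").alias("sku"),
        col("item.amount").alias("amount")
      )

    // Write the flattened output back to S3 as Parquet for downstream Hive/Spark use.
    flat.write.mode("overwrite").parquet("s3://landing-bucket/curated/transactions/")

    spark.stop()
  }
}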
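Broadcast-join and JDBC sketch (referenced above): a minimal Spark/Scala example of joining a large dataset against a small broadcast reference set and pushing the result into an RDBMS table; the JDBC URL, table names, and credential handling are hypothetical placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object LoadToRdbmsJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("LoadToRdbms").getOrCreate()

    val transactions = spark.read.parquet("s3://landing-bucket/curated/transactions/")
    val products     = spark.read.parquet("s3://landing-bucket/reference/products/")

    // Broadcast the small reference table so the large side is not shuffled for the join.
    val enriched = transactions.join(broadcast(products), Seq("sku"), "left")

    // Push the result into an RDBMS table through Spark's JDBC data source.
    enriched.write
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL") // assumed endpoint
      .option("dbtable", "ANALYTICS.ENRICHED_TRANSACTIONS")   // assumed target table
      .option("user", sys.env.getOrElse("DB_USER", ""))
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .mode("append")
      .save()

    spark.stop()
  }
}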
Confidential
Big Data Developer
Environment: Cloudera CDH, Spark/Scala, Hive, Pig, RDBMS, Sqoop, Spark SQL, Kafka, HDFS, Oozie, HBase.
Responsibilities:
- Part of the design and implementation team responsible for interacting with users to fully understand requirements and implement them.
- Involved in the requirements and design phases to implement a real-time streaming architecture using Spark and Kafka (see the streaming sketch after this list).
- Used the Spark Core/SQL APIs to read text file formats from HDFS and convert them into DataFrames by applying an explicit schema for better performance (see the HDFS-to-Hive sketch after this list).
- Analyzed data by writing Hive queries (HiveQL) and running Pig scripts (Pig Latin).
- Created Pig scripts for sorting, joining, and grouping the data.
- Worked with SequenceFile and ORC file formats, map-side joins, bucketing, and partitioning to improve Hive performance and storage.
- Loaded processed data into Hive tables and HDFS, and finally into an RDBMS (using Sqoop) for storage and BI requirements.
- Developed watcher jobs using shell scripting to notify process owners of any delay or missed file generation against the agreed schedule.
- Used the Oozie workflow service to create workflows and automate Hive and Pig jobs.
- Effectively followed Agile methodology and participated in sprints and daily scrums to deliver software tasks on time and with good quality, in coordination with onsite and offshore teams.
- Reviewed code with the technical team and attended code review sessions.
- Used JIRA for incident creation and bug tracking, and Bitbucket to check in and check out code changes.
- Involved in the post-implementation support phase of the project.
- Involved in unit testing and unit test case documentation.
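Streaming sketch (referenced above): a minimal Spark Structured Streaming example of consuming Kafka events and landing them on HDFS. Broker addresses, the topic name, and paths are hypothetical; the original pipeline may have used the DStream API instead, and the spark-sql-kafka connector is assumed to be on the classpath.

import org.apache.spark.sql.SparkSession

object KafkaStreamingJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KafkaIngest").getOrCreate()

    // Subscribe to a Kafka topic; each record arrives as key/value byte arrays.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092") // assumed brokers
      .option("subscribe", "transactions")                            // assumed topic
      .option("startingOffsets", "latest")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    // Land the raw events on HDFS; checkpointing provides fault-tolerant progress tracking.
    val query = events.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/streaming/transactions")
      .option("checkpointLocation", "hdfs:///checkpoints/transactions")
      .start()

    query.awaitTermination()
  }
}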
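HDFS-to-Hive sketch (referenced above): a minimal Spark/Scala example of reading delimited text from HDFS with an explicit schema and loading it into a partitioned ORC Hive table. The delimiter, schema, paths, and table/database names are hypothetical placeholders, and the target Hive database is assumed to exist.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

object HdfsTextToHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HdfsTextToHive")
      .enableHiveSupport()
      .getOrCreate()

    // Declaring the schema up front avoids a costly inference pass over the text files.
    val schema = StructType(Seq(
      StructField("order_id", StringType),
      StructField("customer_id", StringType),
      StructField("amount", DoubleType),
      StructField("order_date", StringType)
    ))

    val orders = spark.read
      .option("sep", "|") // assumed delimiter
      .schema(schema)
      .csv("hdfs:///data/raw/orders/")

    // Write into a Hive table stored as ORC, partitioned by date so queries can prune partitions.
    orders.write
      .mode("overwrite")
      .format("orc")
      .partitionBy("order_date")
      .saveAsTable("analytics.orders")

    spark.stop()
  }
}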