Sr. Data Engineer Resume
Columbus, Indiana
SUMMARY
- 8+ years of experience in project development, implementation, deployment, and maintenance using the Big Data Hadoop ecosystem, PySpark, and cloud-related technologies across various sectors, with programming expertise in Scala, Java, and Python.
- In-depth knowledge of architecting distributed systems and parallel computing.
- Understanding of Hadoop architecture and its components like HDFS (1.0, 2.0, and 3.0), JobTracker, TaskTracker, NameNode, DataNode, and MapReduce.
- Knowledge of NoSQL databases such as HBase and Cassandra.
- Extensive experience in importing and exporting data using Sqoop from Relational Database Management Systems to HDFS and vice versa.
- Hands-on experience with Sqoop incremental imports (append and lastmodified) on structured and semi-structured data, along with performance tuning.
- Experience with partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
- Knowledge and experience with Linux commands and file systems, and UNIX shell scripting with parameterization.
- Experience with different file formats like Avro, Parquet, ORC, JSON, and XML.
- Developed data applications using AWS Services like EC2, S3, Redshift, Elastic Load Balancer, Kinesis, and other cloud services.
- Knowledge of monitoring tools like AWS CloudWatch, CloudTrail, CloudFormation, and queuing services like SQS and SNS for notifications.
- Experience working with serverless architecture using AWS Lambda, Glue, Data Pipeline, and Step Functions.
- Expertise in writing Spark RDD transformations, actions, DataFrames, and case classes for the required input data, and in performing data transformations using the Spark context to convert RDDs to DataFrames.
- Strong experience in tuning Spark configurations such as broadcast join thresholds, shuffle partitions, caching, and repartitioning to improve job performance (see the sketch after this list).
- Good understanding of data warehousing concepts, including dimensional modeling, fact and dimension tables, data lakes, and star and snowflake schemas.
- Experience with the application of Scrum, Waterfall, and Agile methodologies to develop processes that facilitate continual progress and team achievement.
- Experience with CI/CD process using Git (version control), Jenkins, and other repository managers.
- Experience in writing complex SQL queries, data aggregations, and performance tuning.
- Experience in implementing end-to-end data pipelines for serving reporting and data science capabilities.
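The sketch below illustrates the Spark tuning configurations referenced above; it is a minimal example with assumed values, and the app name, S3 path, and column name are placeholders rather than project specifics.

```python
# Minimal sketch, assuming illustrative values; app name, bucket path, and column name are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuning-example")                                              # hypothetical app name
    .config("spark.sql.shuffle.partitions", "400")                          # raise shuffle partitions for wide aggregations
    .config("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 * 1024))  # broadcast small tables up to ~64 MB
    .getOrCreate()
)

df = spark.read.parquet("s3://example-bucket/input/")    # placeholder input path
df = df.repartition(200, "customer_id").cache()          # repartition on a hypothetical join key and cache for reuse
```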
PROFESSIONAL EXPERIENCE
Sr. Data Engineer
Confidential, Columbus, Indiana
Responsibilities:
- Worked as a Sr. Data Engineer with Big Data and Hadoop ecosystem components.
- Involved in converting Hive/SQL queries into Spark transformations using Scala.
- Created Spark data frames using Spark SQL and prepared data for data analytics by storing it in AWS S3.
- Loaded and transformed large data sets of structured and semi-structured data using PySpark.
- Responsible for loading data from Kafka into HBase using REST API.
- Developed batch scripts to fetch data from AWS S3 storage and perform the required transformations in Python using the Spark framework (see the sketch at the end of this section).
- Performed administration of cloud environments using AWS IAM.
- Created scripts to spin up transient EC2 instances to run jobs in the cloud and terminate them once the jobs finished.
- Helped in migrating the data from the on-premises data lake to AWS S3 using DataSync and used Athena to run queries for analytics.
- Used Spark Streaming APIs to perform on-the-fly transformations and actions for building a common learner data model that receives data from Kafka in near real time and persists it to HBase.
- Created Sqoop scripts to import and export customer profile data from RDBMS to S3 buckets.
- Developed various enrichment applications in Spark using Scala for cleansing and enrichment of clickstream data with customer profile lookups.
- Troubleshot Spark applications to improve fault tolerance and reliability.
- Used the Spark DataFrame API to implement batch processing jobs.
- Used Apache Kafka and Spark Streaming to ingest data from Adobe live stream REST API connections.
- Automated creation and termination of AWS EMR clusters.
- Prepared clickstream data for analytics by cleaning, normalizing, and enriching data sets using AWS Glue.
- Leveraged Redshift Spectrum's ability to query data directly in our Amazon S3 data lake, and integrated it with new data sources to control infrastructure costs.
- Worked on fine-tuning and performance enhancement of various Spark applications and Hive scripts.
- Used Spark concepts such as broadcast variables, caching, and dynamic allocation to design more scalable Spark applications.
- Identified source systems, their connectivity, related tables, and fields; ensured data suitability for mapping; prepared unit test cases; and supported the testing team in fixing defects.
- Defined HBase tables to store various data formats of incoming data from different portfolios.
- Developed the verification and control process for daily data loading.
- Involved in daily production support to monitor and troubleshoot Hive and Spark jobs.
Environment: AWS EMR, S3, Spark, Hive, Sqoop, Scala, MySQL, Oracle DB, Athena, Redshift
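A minimal sketch of the kind of S3 batch transformation described above; the bucket, schema, and column names are assumptions for illustration only.

```python
# Minimal sketch (assumed bucket, schema, and column names) of an S3 batch transformation in PySpark.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("s3-batch-transform").getOrCreate()

raw = spark.read.json("s3://example-bucket/raw/clickstream/")     # placeholder input path
cleaned = (
    raw.dropDuplicates(["event_id"])                              # hypothetical key column
       .withColumn("event_date", F.to_date("event_ts"))           # derive a partition column
       .filter(F.col("event_type").isNotNull())                   # drop malformed records
)

# Write curated, partitioned Parquet back to S3 for downstream analytics.
cleaned.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-bucket/curated/clickstream/"                    # placeholder output path
)
```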
Data Engineer
Confidential, Nashville, TN
Responsibilities:
- Extensively worked in Sqoop to migrate data from RDBMS to HDFS.
- Ingested data from various source systems like Teradata, MySQL, and Oracle databases.
- Developed Spark applications to perform Extract, Transform, and Load (ETL) using Spark RDDs and DataFrames.
- Created and managed a data pipeline that captures near real-time data using AWS Kinesis and, after performing transformations, stores it in a partitioned S3 data lake.
- Created Hive external tables on top of HDFS data and wrote ad hoc Hive queries to analyze the data based on business requirements (see the sketch at the end of this section).
- Utilized partitioning and bucketing in Hive to improve query processing times.
- Performed incremental data ingestion using Sqoop, as the existing application generates data on a daily basis.
- Migrated MapReduce jobs to Spark applications for better performance.
- Handled data in different file formats like Avro and Parquet.
- Extensively used Cloudera Hadoop distributions within the project.
- Used Git for maintaining and versioning the code.
- Created Oozie workflows to automate the data pipelines.
- Involved in a fully automated CI/CD pipeline process through GitHub and Jenkins.
- Used Cloudera Manager for installation and management of Hadoop Cluster.
- Exported data from the HDFS environment into RDBMS using Sqoop for report generation and visualization purposes.
- Involved in moving all log files generated from various sources to HDFS for further processing through Flume.
- Involved in creating Hive tables, loading them with data, and writing Hive queries, which invoke MapReduce jobs in the backend. Experienced in handling large datasets using partitions, Spark in-memory capabilities, Spark broadcasts, and effective, efficient joins.
- Worked on the design and deployment of a Hadoop cluster and different big data analytic tools, including Pig, Hive, Oozie, Zookeeper, Sqoop, Flume, Impala, and Cassandra, with the Hortonworks distribution.
- Utilized the Apache Hadoop environment provided by Cloudera; monitored and debugged Spark jobs running on the Spark cluster using Cloudera Manager.
- Wrote HiveQL queries for ad hoc data analysis to meet business requirements.
- Delivered unit test plans and was involved in unit testing and documentation.
Environment: Cloudera (CDH 5.x), Spark, Scala, Sqoop, Oozie, Hive, HDFS, MySQL, Oracle DB, Teradata
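A minimal sketch of registering a partitioned Hive external table from Spark, as referenced above; the database, table, column names, and HDFS location are hypothetical.

```python
# Minimal sketch (assumed database, table, columns, and HDFS path): a partitioned external Hive table
# registered over existing HDFS data and queried from Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-external-table").enableHiveSupport().getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")   # hypothetical database
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.orders (
        order_id BIGINT,
        customer_id BIGINT,
        amount DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/curated/orders'
""")

# Register partitions already present on HDFS, then run an ad hoc aggregation.
spark.sql("MSCK REPAIR TABLE analytics.orders")
daily = spark.sql("SELECT order_date, SUM(amount) AS total FROM analytics.orders GROUP BY order_date")
daily.show()
```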
Big Data Developer
Confidential, New York, NY
Responsibilities:
- Evaluated business requirements and prepared detailed design documents that follow project guidelines and SLAs required for procuring data from all upstream data sources and developing the corresponding programs.
- Wrote Spark SQL queries and Python scripts to design the solutions and implemented them using PySpark.
- Imported data from different sources like HDFS and HBase into Spark RDDs.
- Worked on Spark Context, Spark SQL, DataFrames, RDDs, and Spark on YARN.
- Worked on analyzing the Hadoop cluster and different Big Data analytic tools, including Pig, Hive, HBase, and Sqoop.
- Developed Spark jobs to create data frames from the source system, process, and analyze the data in data frames based on business requirements.
- Performance-tuned Spark jobs using broadcast variables, persisting, caching, data serialization, parallelization, and memory management techniques.
- Developed serverless data pipelines using S3, Lambda, Glue, and DynamoDB.
- Coordinated with business customers to gather business requirements, interacted with other technical peers to derive technical requirements, and delivered the BRD and TDD documents.
- Extensively involved in the Design phase and delivered Design documents.
- Experienced in writing Hadoop jobs for analyzing data using HiveQL queries and MapReduce programs in Java.
- Exported and imported structured data into HDFS and Hive using Sqoop.
- Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables.
- Developed Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying of the log data.
- Designed and developed a system to collect data from multiple portals using Kafka and then process it using Spark (see the sketch at the end of this section).
- Involved in creating Hive tables and applying HiveQL queries on them, which automatically invoke and run MapReduce jobs.
- Used Kafka streaming for building real-time data pipelines between clusters.
- Utilized the Apache Hadoop environment provided by Cloudera; monitored and debugged Spark jobs running on the Spark cluster using Cloudera Manager.
- Wrote HiveQL queries for ad hoc data analysis to meet business requirements.
- Delivered unit test plans and was involved in unit testing and documentation.
- Processed bulk data in daily batches and landed it in relational tables.
- Worked on the Oozie workflow engine for job scheduling.
Environment: Spark 2.0+, Hadoop, HDFS, Hive, Kafka, Sqoop, Scala 2.11, Cassandra, Oozie, Cloudera, Flume, Netezza, Linux, Control-M, Oracle, DB2, ETL, AWS, Redshift
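A minimal sketch of a Kafka-to-Spark pipeline like the one described above, shown here with Structured Streaming; the broker address, topic name, schema, and paths are assumptions, and the job would need the spark-sql-kafka connector on the classpath.

```python
# Minimal sketch: Kafka source into Spark Structured Streaming, landing to HDFS as Parquet.
# Broker, topic, schema, and paths are placeholders; requires the spark-sql-kafka connector.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Hypothetical schema for the portal events.
schema = StructType([
    StructField("portal_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("payload", StringType()),
])

events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
         .option("subscribe", "portal-events")               # placeholder topic
         .load()
         .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
         .select("e.*")
)

# Persist the parsed stream with a checkpoint for fault tolerance.
query = (
    events.writeStream.format("parquet")
          .option("path", "hdfs:///data/streams/portal-events")          # placeholder sink
          .option("checkpointLocation", "hdfs:///checkpoints/portal-events")
          .start()
)
query.awaitTermination()
```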
AWS Big Data Developer
Confidential
Responsibilities:
- Worked on multiple AWS accounts with different VPCs for Prod and Non-Prod where key objectives included automation, build-out, integration, and cost control.
- Launched Amazon EC2 cloud instances using Amazon Web Services (Linux/CentOS/Ubuntu/RHEL) and configured the launched instances with respect to specific applications.
- Assisted in migrating the existing data center into the AWS environment.
- Worked on AWS services including EC2, Auto Scaling for launching EC2 instances, Elastic Load Balancer, Elastic Beanstalk, S3, Glacier, CloudFront, RDS, VPC, CloudWatch, CloudFormation, EMR, IAM, and SNS.
- Created S3 buckets, managed their policies, and utilized S3 and Glacier for archival storage and backup on AWS.
- Implemented data warehouse solutions in Confidential Redshift; worked on various projects to migrate data from on-premises databases to Confidential Redshift, RDS, and S3.
- Generated the schema for semi-structured data using AWS Glue and created ETL code to transform, flatten, and enrich data, loading the transformed data to a warehouse on a recurring basis.
- Utilized Kubernetes and Docker for the runtime environment of the CI/CD system to build, test, and deploy.
- Managed containers using Docker by writing Dockerfiles, set up automated builds on Docker Hub, and installed and configured Kubernetes.
- Designed and implemented topic configurations in a new Kafka cluster.
- Installed Kafka Manager to track consumer lag and monitor Kafka metrics; also used it for adding topics, partitions, etc.
- Used security groups, network ACLs, Internet Gateways, NAT instances, and Route tables to ensure a secure zone for organizations in AWS public cloud.
- Responsible for build and deployment automation using VMware ESX, Docker, Kubernetes containers, and Ansible.
- Used Ansible and Ansible Tower as a Configuration Management tool, to automate repetitive tasks, quickly deploy critical applications, and proactively manage change.
- Used Ansible Playbooks to set up Continuous Delivery Pipeline. Deployed microservices, including provisioning AWS environments using Ansible Playbooks.
- Created scripts in Python that integrated with the Amazon API to control instance operations (see the sketch at the end of this section).
- Worked on Administration and Architecture of Cloud platforms.
- Maintained the monitoring and alerting of production and corporate servers using the CloudWatch service.
- Migrated applications from the internal data center to AWS.
- Deployed and configured Git repositories with branching, tagging, and notifications. Experienced and proficient in deploying and administering GitHub.
Environment: VPC, EC2, Elastic Load Balancer, Elastic Beanstalk, S3, Glacier, CloudFront, Redshift, Kafka, Docker, CloudWatch, Python
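A minimal sketch of controlling EC2 instance operations from Python with boto3, as referenced above; the region, tag key, and tag value are hypothetical.

```python
# Minimal sketch (assumed region and tag names) of controlling EC2 instances with boto3.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # placeholder region

def stop_tagged_instances(tag_value: str) -> list:
    """Stop all running instances carrying a hypothetical Environment tag."""
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Environment", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if ids:
        ec2.stop_instances(InstanceIds=ids)   # stop the matched instances in one call
    return ids

stopped = stop_tagged_instances("non-prod")   # hypothetical tag value
print(stopped)
```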
Data Analyst
Confidential
Responsibilities:
- Successfully completed a Junior Data Analyst internship at Confidential.
- Built an Expense Tracker and Zonal Desk.
- Identified inconsistencies and corrected them or escalated the problems to the next level.
- Assisted in the development of interface testing and implementation plans.
- Analyzed data for data quality and validation issues.
- Analyzed websites regularly to ensure site traffic and conversion funnels were performing well.
- Collaborated with sales and marketing teams to optimize processes that communicate insights effectively.
- Created and maintained automated reports using SQL.
- Understood the Hadoop architecture and drove the related meetings.
- Conducted safety checks to make sure the team felt safe during retrospectives.
- Aided in data profiling by examining the source data.
- Extracted features from the given data set and used them to train and evaluate different classifiers available in the WEKA tool; using these features, differentiated spam messages from legitimate messages (see the sketch at the end of this section).
- Created numerous SQL queries to modify data based on data requirements and added enhancements to existing procedures.
- Implemented statistical modelling techniques in Python.
- Performed data mappings to map the source data to the destination data.
- Developed Use Case Diagrams to identify the users involved. Created Activity diagrams and Sequence diagrams to depict the process flows.
Environment: Python, MATLAB, Oracle, HTML5, Tableau, MS Excel, Server Services, Informatica PowerCenter, SQL, Microsoft Test Manager, Adobe Connect, MS Office Suite, LDAP, Hive, Spark, Pig, Oozie.
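The spam-classification work above used the WEKA toolkit; purely as an illustration of the same feature-extraction and classification step, here is a scikit-learn sketch with a placeholder dataset and hypothetical column names.

```python
# Illustrative sketch only: the original work used WEKA; this shows an analogous
# feature-extraction and classification step in scikit-learn with placeholder names.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

messages = pd.read_csv("messages.csv")   # hypothetical dataset with 'text' and 'label' columns

X_train, X_test, y_train, y_test = train_test_split(
    messages["text"], messages["label"], test_size=0.2, random_state=42
)

vectorizer = TfidfVectorizer(stop_words="english")            # turn raw text into TF-IDF features
clf = MultinomialNB().fit(vectorizer.fit_transform(X_train), y_train)

# Evaluate the spam-vs-legitimate classifier on the held-out split.
print(classification_report(y_test, clf.predict(vectorizer.transform(X_test))))
```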