
Sr Data Engineer Resume


Melville, NY

SUMMARY

  • Dynamic and motivated IT professional with over 8 years of experience as a Big Data Engineer, with expertise in designing data-intensive applications using the Hadoop Ecosystem, Big Data Analytics, Cloud Data Engineering, Data Science, Business Analytics, Data Warehousing, Data Visualization, Reporting, and Data Quality solutions.
  • Solid Big Data Analytics work experience, including installing, configuring, and using components like Hadoop MapReduce, HDFS, HBase, Zookeeper, Hive, Sqoop, Pig, Flume, Cassandra, Kafka, and Spark.
  • Experience in data architecture including data ingestion pipeline design, Hadoop information architecture, data modeling, data mining, and advanced data processing.
  • Excellent knowledge of Hadoop architecture and various components such as HDFS, MapReduce, Hadoop Gen2 Federation, High Availability, YARN architecture, workload management, schedulers, scalability, and distributed platform architectures.
  • Experience with the various tools and frameworks that enable capabilities within the Hadoop ecosystem, such as MapReduce, Impala, HDFS, Hive, Pig, HBase, Flume, Sqoop, Oozie, Kafka, Spark, and Zookeeper.
  • Expertise in developing Spark applications using the PySpark and Spark Streaming APIs in Python, and deploying them on YARN clusters in both client and cluster modes.
  • Thorough understanding of Hive partitioning and bucketing concepts, as well as the design of managed and external tables to enhance performance (see the sketch after this list).
  • Experience in importing and exporting data between HDFS and relational database systems using Sqoop.
  • Experience in using Cloudera Manager for installation and management of single-node and multi-node Hadoop clusters (CDH4&CDH5).
  • Hands-on experience with VPN, PuTTY, WinSCP, and CI/CD (Jenkins).
  • Solid understanding and experience with large-scale data warehousing and E2E data integration solutions on Snowflake Cloud and AWS Redshift.
  • Well versed with Big Data on cloud services, i.e., Snowflake Cloud, EC2, Athena, S3, Glue, DynamoDB, and Redshift.
  • Experience in migrating on-premise ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer.
  • Experienced in Snowflake utilities like SnowSQL, Snowpipe, connectors, data sharing, cloning, and creating tasks.
  • Created Spark Applications for data extraction, transformation, and aggregation from a variety of file types, as well as data analysis and transformation to reveal customer usage trends.
  • Expertise in Spark Applications performance tuning, including determining the proper batch interval duration, parallelism level, and memory tuning.
  • Experience in managing and reviewing data backups and Hadoop log files.
  • Good knowledge in using job scheduling tools like Oozie and Airflow.
  • Experience in building Docker Images to run airflow on local environment to test the Ingestion as well as ETL pipelines.
  • Improved continuous integration workflow, project testing, and deployments with Jenkins.
  • Hands-on experience in SQL programming, creating databases, tables, and joins; designing Oracle databases; application development; and strong knowledge of PL/SQL.
  • Experience in all the phases of Data warehouse life cycle involving Requirement Analysis, Design, Coding, Testing, and Deployment.
  • Good knowledge of NoSQL Database and knowledge of writing applications on HBase.
  • Experience on UNIX commands and Shell Scripting.
  • Extensive knowledge of cloud-based technologies using Amazon Web Services (AWS): VPC, EC2, Route 53, S3, RRS, Kinesis, Redshift, DynamoDB, ElastiCache, Glacier, SQS, SNS, RDS, CloudWatch, and CloudFront.
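
The Hive partitioning and bucketing design mentioned above can be illustrated with a minimal sketch. This is not code from any of the engagements below; the table, columns, and paths are hypothetical, and it assumes a Spark session with Hive support enabled.

```python
# Illustrative sketch only: external vs. managed Hive tables with
# partitioning and bucketing, created through Spark SQL.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-table-design-sketch")
         .enableHiveSupport()
         .getOrCreate())

# External table: Hive owns only the metadata; the data stays at LOCATION
# even if the table is dropped.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_ext (
        order_id    BIGINT,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
    LOCATION '/data/warehouse/sales_ext'
""")

# Managed table: dropping it also deletes the underlying data.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_managed (
        order_id    BIGINT,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS ORC
""")
```

Partitioning prunes whole directories at query time, while bucketing fixes the number of files per partition and speeds up joins and sampling on the bucketed column.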

TECHNICAL SKILLS

Big Data Ecosystem: HDFS, Hive, MapReduce, HBase, Spark, Spark Streaming, YARN, Zookeeper, Kafka, Pig, Sqoop, Flume, Oozie

Hadoop Distributions: Apache Hadoop 2.x/1.x, Cloudera CDP, Hortonworks HDP, Amazon EMR (EMR, EC2, EBS, RDS, S3, Athena, Glue, Elasticsearch, Lambda, DynamoDB, Redshift, ECS, QuickSight), Azure HDInsight (Databricks, Data Lake, Blob Storage, Data Factory (ADF), SQL DB, SQL DWH, Cosmos DB, Azure AD)

Programming Languages: Python, R, Scala, Spark SQL, SQL, HiveQL, PL/SQL, UNIX shell Scripting, Pig Latin

Cloud: GCP, AWS, Azure

SQL Databases: Snowflake, MySQL, Teradata, Oracle, MS SQL SERVER, PostgreSQL, DB2

NoSQL Databases: HBase, Cassandra, MongoDB, DynamoDB, and Cosmos DB

DevOps Tools: Jenkins, Docker, Maven

PROFESSIONAL EXPERIENCE

Sr Data Engineer

Confidential, Melville, NY

Responsibilities:

  • Developed Sqoop scripts to load data from Oracle into Hive external tables.
  • Installed NameNode, Secondary NameNode, YARN (ResourceManager, NodeManager, ApplicationMaster), and DataNode services.
  • Worked with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
  • Used the Cloud Shell SDK in GCP to configure the Dataproc, Storage, and BigQuery services.
  • Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
  • Used Git, GitHub, and SourceTree for version control and team collaboration.
  • Implemented a REST API with the Bottle micro-framework, using MongoDB (NoSQL) as the back-end database.
  • Created Spark scripts utilizing Scala shell commands in accordance with the requirements.
  • Developed Spark code using Scala and Spark SQL for faster processing and testing.
  • Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with differing schemas into Hive ORC tables (see the first sketch after this list).
  • Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using PySpark.
  • Using the Hibernate API, designed and implemented Hibernate persistence classes.
  • Used SQL Navigator to create Stored Procedures, Triggers, and Functions to accomplish Oracle database tasks.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Assisted with Hadoop-Java API administration and HDFS maintenance.
  • Configured real-time streaming pipeline from DB2 to HDFS using Apache Kafka.
  • Worked with Agile Scrum, which comprised iterative application development, weekly Sprints, and stand-up meetings.
  • Worked extensively on AWS components such as Airflow, Elastic MapReduce (EMR), Athena, and Snowflake.
  • Created on-demand tables on S3 files using Lambda functions and AWS Glue (EMR) using Python and PySpark.
  • Used AWS Glue for data transformation, validation and cleansing.
  • Processed data from different sources to AWS Redshift using EMR - Spark, Python programming.
  • Implemented a one-time migration of multi-state-level data from Oracle to Snowflake using Python and SnowSQL (see the second sketch after this list).
  • Developed major regulatory and financial reports using advanced SQL queries in Snowflake.
  • Created external and managed Hive tables and worked on them using HiveQL.
  • Validated MapReduce, Pig, and Hive scripts by pulling data from Hadoop and reconciling it against the data in the source files and reports.
  • Provided a new Web Service and Client using Spring-WS to get the alternate contractor details.
  • Designed and implemented MapReduce jobs to support distributed processing using Java, Hive, and Apache Pig.
  • Implemented robust and scalable data pipelines using Python, SQL, parallel processing frameworks, and other GCP cloud solutions.
  • Responsible for designing the pipelines that parse data from various file formats, followed by processing and loading them to GCP
  • Created a Python program to handle PL/SQL constructs such as cursors and loops that are not supported by Snowflake.
  • Converted Hive/SQL queries into RDD transformations in Apache Spark using Scala.
  • Troubleshot production-level issues in the cluster and its functionality.
  • Implemented AWS solutions using EC2, S3, RDS, EBS, Elastic Load Balancer, Auto-scaling groups.
  • Streamed data from Oracle to Apache Kafka topics using Apache Flume.
  • Used Quay to manage Docker images. And used JIRA for the issue tracking and bug reporting.
  • Installed Oozie workflow engine to run multiple Hive and Pig Jobs.
  • Skilled in using Scala and Spark SQL to implement Spark for faster data testing and processing, as well as managing data from various sources.
  • Wrote BigQuery queries for data wrangling with the help of Dataflow in GCP.
  • Monitored systems and services through Cloudera Manager to make the clusters available for the business.
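
The CSV-to-Hive ORC loading pattern referenced above can be sketched as follows. This is an illustrative example, not project code: the target columns, paths, and table names are hypothetical, and casting missing columns to string is only a placeholder for matching the real table types.

```python
# Minimal PySpark sketch: load CSV files with varying schemas into a Hive ORC table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = (SparkSession.builder
         .appName("csv-to-hive-orc-sketch")
         .enableHiveSupport()
         .getOrCreate())

TARGET_COLUMNS = ["customer_id", "event_type", "event_ts", "amount"]  # hypothetical

def load_csv_batch(path: str, target_table: str) -> None:
    """Read one CSV drop, align it to the target schema, and append it as ORC."""
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(path))

    # Add any target columns missing from this file's schema as nulls,
    # then project in a fixed order so every batch matches the Hive table.
    for col_name in TARGET_COLUMNS:
        if col_name not in df.columns:
            df = df.withColumn(col_name, lit(None).cast("string"))
    aligned = df.select(*TARGET_COLUMNS)

    (aligned.write
     .mode("append")
     .format("orc")
     .saveAsTable(target_table))

load_csv_batch("hdfs:///landing/events/2021-06-01/*.csv", "analytics.events_orc")
```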
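For the Oracle-to-Snowflake migration and the replacement of PL/SQL cursor logic, the general shape of the Python side looks like the sketch below. It assumes the snowflake-connector-python package; the account details, tables, and SQL are hypothetical.

```python
# Sketch only: replacing a PL/SQL-style cursor loop with Python iteration
# over a Snowflake cursor.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # hypothetical account identifier
    user="etl_user",
    password="********",
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

try:
    cur = conn.cursor()
    # Equivalent of OPEN ... FETCH ... LOOP in PL/SQL:
    cur.execute("SELECT state_code, load_date FROM staging.state_batches")
    for state_code, load_date in cur:
        # Per-row work that the PL/SQL block would have done inside the loop.
        conn.cursor().execute(
            "INSERT INTO warehouse.state_summary "
            "SELECT %s, %s, COUNT(*) FROM staging.records "
            "WHERE state_code = %s AND load_date = %s",
            (state_code, load_date, state_code, load_date),
        )
finally:
    conn.close()
```

Bulk loads themselves would normally go through SnowSQL or Snowpipe with staged files; the Python loop only covers the procedural pieces that PL/SQL used to handle.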

Senior Data Engineer

Confidential, Phoenix, AZ

Responsibilities:

  • Managed and supported the operation of a corporate Data Warehouse, as well as the creation of big data advanced predictive applications using Cloudera and Hortonworks HDP.
  • Created PIG scripts to turn raw data into intelligent data according to business users' specifications.
  • Extensively worked on Spark using Scala for computational analytics built on top of Hadoop, and used Spark with Hive and Snowflake to perform complex analytical applications.
  • Developed Talend Big Data jobs to load heavy volumes of data into the S3 data lake and then into the Snowflake data warehouse.
  • Worked on the design and implementation of a Hadoop cluster as well as a variety of Big Data analytical tools such as Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Flume, Spark, Impala, and Cassandra with the Hortonworks distribution.
  • Assisted with Hadoop infrastructure upgrades, configuration, and maintenance, such as Pig, Hive, and HBase.
  • Used Amazon Redshift to store and retrieve data from data warehouses.
  • Used the Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
  • Improved the performance and optimization of existing Hadoop methods using Spark Context, Spark SQL, DataFrames, Pair RDDs, and Spark on YARN.
  • Developed Spark code for faster data testing and processing utilizing Scala and Spark-SQL/Streaming.
  • Developed a Spark job in Java which indexes data into Elasticsearch from external Hive tables which are in HDFS.
  • Using Spark and AWS EMR, imported data from transactional source systems into the Redshift data warehouse.
  • Performed advanced procedures like text analytics and processing using in-memory computing capabilities of Spark using Python on EMR.
  • Generated ETL scripts to transform, flatten and enrich the data from source to target using AWS Glue and created event driven ETL pipelines with AWS Glue.
  • Used AWS Glue catalog with crawler to get the data from S3 and perform SQL query operations.
  • Performed transformations, cleaning and filtering on imported data using Hive, Map Reduce, and loaded final data into HDFS.
  • Used the Oozie workflow scheduler to manage Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
  • Involved in converting Hive queries into spark transformations using Spark RDDs, PySpark and Scala.
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, Pair RDDs, and Spark on YARN.
  • Imported data from different sources like HDFS/HBase into Spark RDDs and developed a data pipeline using Kafka and Storm to store data in HDFS.
  • Used Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS and NoSQL databases such as HBase and Cassandra using Scala (see the sketch after this list).
  • Documented the requirements including the available code which should be implemented using Spark, Hive, HDFS, HBase and Elasticsearch.
  • Performed transformations like event joins, filter bot traffic and some pre-aggregations using Pig.
  • Developed Spark code using PySpark, Scala, and Spark SQL/Streaming for faster testing and processing of data.
  • Executed Hive queries on Parquet tables stored in Hive to perform data analysis to meet the business requirements.
  • Configured Oozie workflow to run multiple Hive and Pig jobs which run independently with time and data availability.
  • Used Azure Data Factory extensively for ingesting data from disparate source systems.
  • Used Azure Data Factory as an orchestration tool for integrating data from upstream to downstream systems.
  • Automated jobs in ADF using various triggers (Event, Scheduled, and Tumbling).
  • Imported and exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Used Spark as an ETL tool to remove duplicates, and to join and aggregate the input data before storing it in Blob storage.
  • Scheduled daily jobs in Airflow and developed DAGs for daily production run.
  • Developed Python code for tasks, dependencies, SLA watchers, and time sensors for each job, enabling workflow management and automation with Airflow.
  • Designed and developed a new solution to process the NRT data by using Azure Stream Analytics, Azure Event Hub and Service Bus Queue.
  • Optimized MapReduce code, pig scripts, and performance tuning and analysis, as well as created and maintained several Shell and Python scripts for automating various tasks.
  • Improved performance by reducing the time required to analyze streaming data, and lowered costs for the firm by reducing cluster run time.
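
The Kafka-to-HDFS streaming path referenced above can be sketched as follows. The production job was written in Scala with Spark Streaming; this PySpark Structured Streaming version is illustrative only, the brokers, topic, and paths are hypothetical, and it assumes the spark-sql-kafka connector is on the classpath.

```python
# Illustrative PySpark sketch: stream a Kafka topic into HDFS as Parquet.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kafka-to-hdfs-sketch")
         .getOrCreate())

# Read the Kafka topic as a streaming DataFrame.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "events_topic")
          .option("startingOffsets", "latest")
          .load()
          .selectExpr("CAST(key AS STRING) AS key",
                      "CAST(value AS STRING) AS value"))

# Continuously land the stream on HDFS, with checkpointing so the job
# can recover its Kafka offsets after a restart.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/streams/events")
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()
```

Writing to HBase or Cassandra instead of Parquet follows the same structure with a different sink connector.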

Hadoop/Big Data Engineer

MT Bank, Buffalo, New York

Responsibilities:

  • Designed and developed scalable and cost-effective architecture in AWS Big Data services for data life cycle, collection, ingestion, storage, processing, and visualization.
  • Analyzed and understood the design document and mapping document.
  • Used Spark SQL with Scala for creating data frames and performed transformations on data frames.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Developed code to read the data stream from Kafka and route it to the respective bolts through their corresponding streams.
  • Created Airflow production job status reports for stakeholders.
  • Developed Spark scripts and Python functions that perform transformations and actions on data sets.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done using Python (PySpark).
  • Extracted data from multiple systems and sources using Python and loaded the data into AWS EMR.
  • Converted all Hadoop jobs to run in EMR by configuring the cluster according to the data size.
  • Responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue using Python and PySpark.
  • Created monitors, alarms, notifications and logs for Lambda functions, Glue jobs, EC2 hosts using CloudWatch and used AWS Glue for data transformation, validation, and cleansing.
  • Performed quality checks on the existing code to improve performance.
  • Involved in data warehousing and Business Intelligence systems.
  • Involved in developing, building, testing, and deploying to the Hadoop cluster in distributed mode.
  • Good knowledge of configuration management tools such as Bitbucket/GitHub and Bamboo (CI/CD).
  • Developed Spark applications using RDDs and DataFrames.
  • Worked extensively on Hive to analyze the data and create data quality reports.
  • Implemented Partitioning, Dynamic Partitions and Buckets in HIVE for increasing performance benefit and helping in organizing data in a logical fashion.
  • Worked with configuration management tools such as SVN, CVS, and GitHub.
  • Configured Event Engine nodes to import and export the data from Teradata to HDFS and vice-versa.
  • Worked in the BI team in Big Data Hadoop cluster implementation and data integration in developing large-scale system software.
  • Created and modified shell scripts for scheduling various data cleansing scripts and the ETL loading process.
  • Supported and assisted QA Engineers in understanding, testing, and troubleshooting.
  • Worked on developing and predicting trends for business intelligence.
  • Designed data warehouses/databases and data models, building SQL objects such as tables, views, user-defined/table-valued functions, stored procedures, triggers, and indexes.
  • Designed and implemented scalable cloud data and analytics solutions for various public and private cloud platforms using Azure.
  • Involved in migrating Spark jobs from Qubole to Databricks.
  • Designed and implemented distributed data processing pipelines using Apache Spark, Hive, Python, Airflow DAGs, and other tools and languages in the Hadoop ecosystem (see the Airflow sketch below).
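
A minimal Airflow DAG of the kind used for these daily pipelines is sketched below. The DAG id, schedule, scripts, and SLA values are hypothetical; it assumes Airflow 2.x and simply chains an ingestion step into a transformation step.

```python
# Minimal Airflow DAG sketch: daily ingest-then-transform pipeline.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
    "sla": timedelta(hours=2),   # flag tasks that run past the SLA
}

with DAG(
    dag_id="daily_ingest_and_transform",    # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 6 * * *",          # every day at 06:00
    catchup=False,
    default_args=default_args,
) as dag:

    ingest = BashOperator(
        task_id="ingest_to_hdfs",
        bash_command="spark-submit /jobs/ingest_to_hdfs.py",     # hypothetical script
    )

    transform = BashOperator(
        task_id="transform_to_hive",
        bash_command="spark-submit /jobs/transform_to_hive.py",  # hypothetical script
    )

    ingest >> transform
```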

Data Analyst

Confidential

Responsibilities:

  • Worked with client’s strategy team in its plan to include CGM devices into all insurance plans for Type-1 diabetics.
  • Provided the incidence rates and prevalence for Type-1 diabetics split by year, plan group and various other demographics such as Age, Gender, Plan Type & Region to see the trends of newly diagnosed and existing patients
  • Identified different treatment methods used by Type-1 patients based on the products they claimed since diagnosis date
  • Analyzed the total Type-1 patient cohort and provided the total patients falling under each treatment method
  • Provided patient journeys for all the Type-1 diabetic patients to showcase how they switched different therapies over the period
  • Identified different comorbidities and complications that are associated with Type-1 patients and created a concatenated flag & variable indicating total comorbidities and complications
  • Provided the average diabetes treatment cost incurred by the patient and the insurance company for each treatment method. CGM therapy was found to be costlier to the scheme than other treatment methods, as it involves buying the high-cost CGM device
  • Performed risk adjustment on gender, age group, and comorbidity and complication counts to remove bias from the data, i.e., comparing the costs for a Type-1 diabetic patient in the 25-35 age group with another patient of the same age group
  • Provided the total costs (medication costs, hospital costs, consultation costs, etc.) of all comorbidities and complications of Type-1 diabetic patients
  • Although the comorbidity costs came out to be in the same range for all patients across different therapies, diabetes-related complication costs are much higher for patients who are not on CGM therapy
  • The analysis has helped the strategy team prove their point that CGM therapy has lower downstream costs compared to other therapies even if it involves buying the high-cost device
  • Migrated the analysis code from Scala to Snowflake as part of the Snowflake migration
  • Also conducted various sessions across the organization on Snowflake, Excel, and PowerPoint
  • Mentored junior data engineers in the team on skill development, career path and provided necessary guidance

Data Analyst

Confidential

Responsibilities:

  • Worked along with Customer Care team to proactively mitigate the calls received from stores by identifying the root cause for call volume drivers and by assessing the associate’s productivity & effectiveness
  • Extracted data present as JSON key-value pairs and converted it to tabular form using Python nested dictionaries (see the sketch after this list)
  • Created a master table in Teradata containing all the relevant metrics by joining multiple tables and optimized the whole query to meet hourly cadence requirement
  • Took the complete ownership of data loading and quality management of the master table and automated the complete workflow using Oozie
  • Performed ad-hoc analysis during BTC & BTS season for Merchandise category to see the complaint volume in the category at various levels
  • Collaborated with marketing team on profiling the company’s focus customer segment called Busy Families. Since these are the customers with high time and money constraints, solving for them would solve for all customers
  • Worked with the business team to define busy families as families with at least one child that tend to shop online as well as at Walmart brick-and-mortar stores. Busy families constitute around 19% of Walmart customers and contribute 27% of total revenue
  • Provided insights for busy families at various demographic levels such as age group, level of education, location of residence, and income. Busy families were found to be educated, high-income, young individuals compared to average Walmart households
  • Performed comparison analysis in Hive for busy families versus average households on various KPIs such as sales per household, units per household, visits per household, units per basket, spend per basket, and spend per unit in Walmart stores and Walmart.com
  • Provided visit distributions for busy families and others based on visit types such as immediate need, destination, fill in, stock up & seasonal
  • Created a seasonal holiday dashboard in Tableau to better understand various trends across seasons and events
  • Designed the code flow and created the master table in Hive, which is used as an input for the dashboard
  • Provided total sales & visits with YoY comparison by SBU, Portfolio & Department as a summary view
  • Calculated visits, sales & units at province & merchandise level
  • Provided the sales and visits contribution during holidays for all customers by various demographics such as ethnicity, income group, household size & age group
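
The JSON-to-tabular extraction mentioned above follows the pattern sketched here. The record structure and field names are hypothetical; the point is the recursive flattening of nested dictionaries into flat rows.

```python
# Sketch: flatten nested JSON key-value pairs into tabular rows with plain Python.
import json

raw = """
[
  {"call_id": "C-1001",
   "store": {"id": 42, "region": "East"},
   "issue": {"category": "billing", "resolved": true}},
  {"call_id": "C-1002",
   "store": {"id": 17, "region": "West"},
   "issue": {"category": "delivery", "resolved": false}}
]
"""

def flatten(record: dict, parent_key: str = "", sep: str = "_") -> dict:
    """Recursively collapse nested dictionaries into a single flat row."""
    row = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            row.update(flatten(value, new_key, sep))
        else:
            row[new_key] = value
    return row

rows = [flatten(rec) for rec in json.loads(raw)]
# rows[0] -> {'call_id': 'C-1001', 'store_id': 42, 'store_region': 'East',
#             'issue_category': 'billing', 'issue_resolved': True}
```

Rows in this shape can then be loaded into a Teradata or Hive table for the downstream joins.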
