
Sr. Data Engineer Resume


Philadelphia, PA

SUMMARY

  • Around 7 years of extensive experience in Information Technology with expertise in Data Analytics, Data Architecture, Design, Development, Implementation, Testing, and Deployment of software applications in the Banking, Finance, Insurance, Retail, and Telecom domains.
  • Managed databases and Azure Data Platform services (Azure Data Lake Storage (ADLS), Data Factory (ADF), Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB) as well as SQL Server, Oracle, and data warehouses; built multiple data lakes.
  • Created DataFrames and performed analysis using Spark SQL.
  • Strong knowledge of Spark Streaming and the Spark machine learning libraries (MLlib).
  • Hands-on expertise in writing RDD (Resilient Distributed Dataset) transformations and actions using Scala, Python, and Java.
  • Excellent understanding of the Spark architecture and framework: SparkContext, APIs, RDDs, Spark SQL, DataFrames, Streaming, and MLlib.
  • Worked on reading and writing multiple data formats such as JSON, ORC, and Parquet on HDFS using PySpark (see the sketch after this list).
  • Worked on agile projects delivering end-to-end continuous integration/continuous delivery (CI/CD) pipelines by integrating tools such as Jenkins and AWS for VM provisioning.
  • Experienced in writing automated scripts for monitoring file systems and key MapR services.
  • Implemented continuous integration and deployment (CI/CD) through Jenkins for Hadoop jobs.
  • Good knowledge of Cloudera distributions and of Amazon Simple Storage Service (S3), AWS Redshift, Lambda, Amazon EC2, and Amazon EMR.
  • Excellent understanding of Hadoop architecture and good exposure to Hadoop components such as MapReduce, HDFS, HBase, Hive, Sqoop, Cassandra, and Kafka, as well as Amazon Web Services (AWS); tested, documented, and monitored APIs with Postman, integrating the tests into build automation.
  • Used Sqoop to import data from relational databases (RDBMS) into HDFS and Hive, storing it in formats such as Text, Avro, Parquet, SequenceFile, and ORC with compression codecs such as Snappy and GZip.
  • Performed transformations on the imported data and exported it back to the RDBMS.
  • Worked on Amazon Web Services (AWS) to integrate EMR with Spark 2, S3 storage, and Snowflake.
  • Experience in writing HQL (Hive Query Language) queries to perform data analysis.
  • Created Hive External and Managed Tables.
  • Implemented Partitioning and Bucketing in Hive tables for Hive Query Optimization.
  • Used Apache Flume to ingest data from different sources into sinks such as Avro and HDFS.
  • Implemented custom interceptors for Flume to filter data and defined channel selectors to multiplex the data into different sinks.
  • Excellent knowledge of the Kafka architecture.
  • Integrated Flume with Kafka, using Flume both as a producer and a consumer (the Flafka pattern).
  • Used Kafka for activity tracking and log aggregation.
  • Experienced in writing Oozie workflows and coordinator jobs to schedule sequential Hadoop jobs.
  • Experience working with Text, SequenceFile, XML, Parquet, JSON, ORC, and Avro file formats and clickstream log files.
  • Familiar with data architecture, including data ingestion pipeline design, Hadoop architecture, data modeling, data mining, and advanced data processing.
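
A minimal PySpark sketch of the JSON/ORC/Parquet work referenced above; the HDFS paths and the event_date partition column are illustrative assumptions rather than details from any specific project.

```python
# Hedged sketch: read JSON from HDFS and rewrite it as compressed Parquet and ORC.
# Paths and column names below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("format-conversion-sketch")
    .getOrCreate()
)

# Read raw JSON records from an assumed HDFS location.
events = spark.read.json("hdfs:///data/raw/events/")

# Write Snappy-compressed Parquet, partitioned by an assumed date column.
(events.write
    .mode("overwrite")
    .option("compression", "snappy")
    .partitionBy("event_date")
    .parquet("hdfs:///data/curated/events_parquet/"))

# The same DataFrame can also be written as ORC for Hive consumption.
(events.write
    .mode("overwrite")
    .option("compression", "snappy")
    .orc("hdfs:///data/curated/events_orc/"))
```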

TECHNICAL SKILLS

Data Engineer / Big Data Tools / Cloud / Visualization / Other Tools: Databricks, Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, MapReduce, Spring Boot, Flume, YARN, Hortonworks, Cloudera, Mahout, MLlib, Oozie, Zookeeper, AWS, Azure Databricks, Azure Data Explorer, Azure HDInsight, Salesforce, NiFi, GCP, Google Cloud Shell, Linux, BigQuery, Bash Shell, Unix, Tableau, Power BI, SAS, Web Intelligence, Crystal Reports, Dashboard Design.

Hadoop Distributions: Apache Hadoop 1.x/2.x, Cloudera CDP, Hortonworks HDP

Languages: Python, Scala, Java, Pig Latin, HiveQL, Shell Scripting.

Software Methodologies: Agile, Waterfall (SDLC).

Databases: MySQL, Oracle, DB2, PostgreSQL, DynamoDB, MS SQL Server, Snowflake.

NoSQL: HBase, MongoDB, Cassandra.

ETL/BI: Power BI, Tableau, Informatica.

Version control: GIT, SVN, Bitbucket.

Operating Systems: Windows (XP/7/8/10), Linux (Unix, Ubuntu), Mac OS.

Cloud Technologies: AWS, Azure

PROFESSIONAL EXPERIENCE

Confidential, Philadelphia, PA

Sr. Data Engineer

Responsibilities:

  • Deployed Snowflake following best practices and provided subject matter expertise in data warehousing, specifically with Snowflake.
  • Developed PySpark code for AWS Glue jobs and for EMR.
  • Designed the business requirement collection approach based on the project scope and SDLC methodology.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala, and used Sqoop for importing and exporting data between RDBMS and HDFS.
  • Experienced in building models with deep learning frameworks such as TensorFlow, PyTorch, and Keras.
  • Created monitors, alarms, notifications, and logs for Lambda functions, Glue jobs, and EC2 hosts using CloudWatch, and used AWS Glue for data transformation, validation, and cleansing.
  • Worked with cloud-based technologies such as Redshift, S3, and EC2; extracted data from Oracle Financials and the Redshift database, created Glue jobs in AWS, and loaded incremental data into the S3 staging and persistence areas.
  • Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.
  • Deployed applications using Jenkins, integrating Git version control.
  • Used the Agile Scrum methodology across the different phases of the software development life cycle.
  • Improved the performance of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and YARN.
  • Scheduled Airflow DAGs to run multiple Hive and Pig jobs that run independently based on time and data availability, and performed exploratory data analysis and data visualization using Python and Tableau.
  • Designed, developed, and implemented ETL pipelines using the Python API of Apache Spark (PySpark) on AWS EMR.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.
  • Used SSIS to build automated multi-dimensional cubes.
  • Used Spark Streaming to receive real-time data from Kafka and stored the stream data in HDFS and in NoSQL databases such as HBase and Cassandra using Python.
  • Worked on dimensional and relational data modeling using Star and Snowflake schemas, OLTP/OLAP systems, and conceptual, logical, and physical data models.
  • Automated data processing with Oozie, including data loading into the Hadoop Distributed File System.
  • Developed automated regression scripts in Python to validate the ETL process across multiple databases such as AWS Redshift, Oracle, MongoDB, and SQL Server (T-SQL).
  • Created data pipelines for the Kafka cluster, processed the data using Spark Streaming, and consumed data from Kafka topics to load into the landing area for near-real-time reporting (see the sketch at the end of this section).
  • Used Informatica PowerCenter to extract, transform, and load data into the data warehouse from various sources such as Oracle and flat files.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
  • Good knowledge of configuration management and CI/CD tools such as Bitbucket/GitHub and Bamboo.
  • Collected data using Spark Streaming and loaded it into HBase and Cassandra; used the Spark-Cassandra Connector to load data to and from Cassandra.

Environment: AWS, Python, ETL, Hive, PySpark, MongoDB, Kafka, SQL, T-SQL, Airflow, Snowflake
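
A hedged sketch of the Kafka-to-landing-area streaming pipeline described above; broker addresses, the topic name, and the S3 paths are placeholders, and the code assumes the spark-sql-kafka connector is available on the cluster.

```python
# Sketch: consume a Kafka topic with Spark Structured Streaming and append
# micro-batches to a landing area for near-real-time reporting.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-landing-sketch").getOrCreate()

# Subscribe to an assumed Kafka topic.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "clickstream")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the value to string for downstream parsing.
messages = raw.select(col("value").cast("string").alias("payload"))

# Append to the landing area in Parquet; the checkpoint tracks consumed offsets.
query = (
    messages.writeStream
    .format("parquet")
    .option("path", "s3a://landing-bucket/clickstream/")
    .option("checkpointLocation", "s3a://landing-bucket/_checkpoints/clickstream/")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```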

Confidential, NYC, NY

Data Engineer

Responsibilities:

  • Evaluated business requirements and prepared detailed specifications that follow project guidelines for designing large-scale system software, and worked on Big Data Hadoop cluster deployment and data integration.
  • Developed Spark programs to process raw data, populate staging tables, and store refined data (JSON, XML, CSV, etc.) in partitioned tables in the enterprise data warehouse.
  • Developed streaming applications that accept messages from Amazon AWS Kinesis queues and publish data to AWS S3 buckets using Spark and Kinesis.
  • Used AWS EFS to provide scalable file storage with AWS EC2.
  • Integrated data from data warehouses and data marts into cloud-based data structures using T-SQL.
  • Worked on EMR clusters to add steps by writing shell and Python scripts to modify parquet files stored in S3.
  • Developed DDL and DML scripts in SQL and HQL for analytics applications in RDBMS and Hive.
  • Developed and implemented HQL scripts to generate Hive partitioned and bucketed tables for faster data access, and wrote Hive UDFs to implement custom aggregation functions in Hive (a related sketch follows at the end of this section).
  • Developed merge jobs in Python to extract and load data into a SQL database and used a test-driven approach for developing applications.
  • Shell and Python scripts were written to parameterize Hive activities in Oozie workflow and to schedule tasks.
  • Used Amazon EKS to run, scale, and deploy applications in the cloud or on premises.
  • Developed PySpark code to mimic the transformations performed in the on-premises environment, and analyzed the SQL scripts and designed solutions to implement them on the cloud using PySpark.
  • Rewrote existing Python modules to deliver data in the required format.
  • Used Sqoop widely for importing and exporting data from HDFS to Relational Database Systems/Mainframes, as well as loading data into HDFS.
  • Analyzed large datasets to identify trends and patterns by performing exploratory data analysis in Python.
  • SSIS Designer was used to create SSIS Packages for exporting heterogeneous data from OLE DB Sources and Excel Spreadsheets to SQL Server.
  • Used Python libraries and SQL queries/subqueries to create several datasets that produced summary statistics, tables, figures, charts, and graphs.
  • Monitored YARN applications and troubleshot and addressed cluster-specific system issues.
  • Customized logic around error handling and logging of Ansible/Jenkins job results.
  • Used the Oozie scheduler to automate the pipeline process and coordinate the MapReduce operations that extracted data, while Zookeeper provided cluster coordination services.
  • Created Hive queries to assist data analysts in identifying developing patterns by comparing new data to EDW (enterprise data warehouse) reference tables and previous measures.
  • Involved in specification design, design documents, data modeling, and data warehouse design; evaluated existing EDW (enterprise data warehouse) technologies and processes to ensure that the EDW/BI design fits the demands of the company and organization while allowing for future expansion.
  • Worked on Hadoop, SOLR, Spark, and Kinesis-based Big Data Integration and Analytics.
  • Developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/Text files) into AWS Redshift.
  • Established big data jobs to load large amounts of data into the S3 data lake and ultimately into AWS Redshift, and created a pipeline to allow continuous data loads; developed data pipelines for integrated data analysis using Hive, Spark, Python, Sqoop, and MySQL.
  • Optimized long-running Hive queries using joins, vectorization, partitioning, bucketing, and indexing.
  • Tuned Spark applications by adjusting memory and resource allocation settings, determining the best batch interval time, and adjusting the number of executors to match rising demand over time; deployed Spark and Hadoop jobs on the EMR cluster.
  • Deep experience improving performance of dashboards and creating incremental refreshes for data sources on Tableau server.

Environment: Hadoop, HDFS, AWS, Hive, SSIS, Redshift, Sqoop, HBase, Oozie, Storm, YARN, NiFi, Cassandra, Zookeeper, Spark.
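
A hedged PySpark sketch in the spirit of the partitioned and bucketed Hive tables described above; the database, table, S3 path, and column names are assumptions for illustration only.

```python
# Sketch: write a DataFrame as a partitioned, bucketed Hive table in ORC format.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partitioned-bucketed-table-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Assumed staging data already landed in S3 as Parquet.
orders = spark.read.parquet("s3a://example-bucket/staging/orders/")

# partitionBy prunes directories at query time; bucketBy pre-hashes rows on a
# join/filter key. Bucketing requires saveAsTable (a metastore-managed table).
(orders.write
    .mode("overwrite")
    .partitionBy("order_date")
    .bucketBy(32, "customer_id")
    .sortBy("customer_id")
    .format("orc")
    .saveAsTable("analytics.orders_bucketed"))
```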

Confidential, Bellevue, WA

Data Engineer

Responsibilities:

  • Implemented installation and configuration of a multi-node cluster in the cloud using Amazon Web Services (AWS) on EC2.
  • Handled AWS management tools such as CloudWatch and CloudTrail.
  • Stored the log files in AWS S3 and used versioning in the S3 buckets where the highly sensitive information is stored.
  • Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Working experience with data streaming processes using Kafka, Apache Spark, and Hive.
  • Analyzed the SQL scripts and designed the solution to implement using Scala.
  • Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake Schemas.
  • Optimized the PySpark jobs to run on Kubernetes Cluster for faster data processing.
  • Developed parallel reports using SQL and Python to validate the daily, monthly, and quarterly reports.
  • Designed and Developed Real time Stream processing Application using Spark, Kafka, Scala, and Hive to perform Streaming ETL and apply Machine Learning.
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
  • Primarily responsible for converting a manual reporting system into a fully automated CI/CD data pipeline that ingests data from different marketing platforms into the AWS S3 data lake.
  • Designed and developed AWS architecture, cloud migration, AWS EMR, DynamoDB, Redshift, and event processing using Lambda functions.
  • Conducted ETL data integration, cleansing, and transformations using AWS Glue Spark scripts.
  • Worked on ETL Migration services by developing and deploying AWS Lambda functions for generating a serverless data pipeline which can be written to Glue Catalog and can be queried from Athena.
  • Developed AWS Lambda functions to trigger Step Functions that process the scheduled EMR job.
  • Created Partitioned and Bucketed Hive tables in Parquet File Formats with Snappy compression and then loaded data into Parquet hive tables from Avro hive tables.
  • Involved in designing and developing tables in HBase and storing aggregated data from Hive Table.
  • Analyzed the SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Developed Spark applications in Python (PySpark) on a distributed environment to load a huge number of CSV files with different schemas into Hive ORC tables.
  • Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, designing and developing POCs using Scala, Spark SQL, and the MLlib libraries.
  • Created PySpark code that uses Spark SQL to generate DataFrames from the Avro-formatted raw layer and writes them to data service layer internal tables in ORC format.
  • Used SQL Server Integrations Services (SSIS) for extraction, transformation, and loading data into target system from multiple sources.
  • Developed Airflow DAGs in Python using the Airflow libraries (see the sketch at the end of this section).

Environment: AWS, JMeter, Kafka, Ansible, Jenkins, Docker, Maven, Linux, Red Hat, GIT, Cloud Watch, Python, Shell Scripting, Golang, Web Sphere, Splunk, Tomcat, Soap UI, Kubernetes, Terraform, PowerShell.
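
A minimal Airflow DAG sketch of the kind referenced above (Airflow 2-style imports); the DAG id, schedule, and orchestrated commands are placeholders rather than values from any actual pipeline.

```python
# Sketch: a two-task daily DAG that lands raw files and then runs a Spark job.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_ingest_sketch",
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Assumed step 1: pull raw files into the landing zone (command is illustrative).
    extract = BashOperator(
        task_id="extract_raw",
        bash_command="echo 'extract raw files to the landing zone'",
    )

    # Assumed step 2: run the PySpark transformation job.
    transform = BashOperator(
        task_id="transform_with_spark",
        bash_command="spark-submit transform_job.py",
    )

    extract >> transform
```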

Confidential

Data Analyst

Responsibilities:

  • Worked on the design, development, and testing of mappings, sessions, and workflows to transfer data from the policy center to the BIC (Business Intelligence Center).
  • Developed a solution to determine at which stage of a policy life cycle an underwriting issue occurred; worked on root cause analysis for policy cancellations using SAS and SQL.
  • Conducted user interviews, gathered requirements, and analyzed and prioritized the product backlog.
  • Designed and developed use cases, flow diagrams, and business functional requirements for Scrum.
  • Functioned as a liaison between the Scrum Master, QA Manager, and end users in defect tracking, prioritization, escalation, and resolution (Environment: Windows 7, Oracle, Mainframes, SharePoint, structured data, semi-structured data, unstructured data).
  • Applied models and data to understand and predict infrastructure costs, presenting findings to stakeholders.
  • Created an interactive cohort analysis report in Tableau.
  • Built forecasts using parameters, trend lines, and reference lines; implemented security guidelines using user filters and row-level security; performed data wrangling, web scraping, streaming data ingestion, and data parsing in Python (see the sketch at the end of this section).

Environment: Data Warehousing, Python/R, Snowflake, Redshift, Data Visualization (SAS/Tableau), Data Science Research Methods (Power BI), Statistical Computing Methods, Experimental Design & Analysis, JSON, SQL, PowerShell, Git, and GitHub.
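
A hedged pandas sketch of the Python data wrangling and parsing mentioned above; the CSV file and column names are illustrative assumptions, not actual policy data.

```python
# Sketch: load, clean, and summarize a policy extract with pandas.
import pandas as pd

# Assumed input file with an effective_date column.
policies = pd.read_csv("policies.csv", parse_dates=["effective_date"])

# Basic cleanup: normalize column names and drop exact duplicates.
policies.columns = [c.strip().lower().replace(" ", "_") for c in policies.columns]
policies = policies.drop_duplicates()

# Simple cohort-style rollup by policy start month and status.
policies["cohort_month"] = policies["effective_date"].dt.to_period("M")
summary = (
    policies.groupby(["cohort_month", "status"])
    .size()
    .unstack(fill_value=0)
)
print(summary)
```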
