Data Engineer Resume
Atlanta, GA
SUMMARY
- Data Engineering professional with solid foundational skills and a proven track record of implementation across a variety of data platforms. Self-motivated with a strong adherence to personal accountability in both individual and team scenarios.
- 6+ years of experience in Data Engineering, Data Pipeline Design, Development, and Implementation as a Sr. Data Engineer/Data Developer and Data Modeler.
- Strong experience in the Software Development Life Cycle (SDLC), including Requirements Analysis, Design Specification, and Testing, in both Waterfall and Agile methodologies.
- Strong experience in writing data analysis scripts using the Python, PySpark, and Spark APIs.
- Extensively used Python libraries including PySpark, PyTest, PyMongo, cx_Oracle, PyExcel, Boto3, Psycopg, embedPy, NumPy, and Beautiful Soup.
- Experience with Google Cloud components (Dataflow, Dataproc, Cloud Storage, Cloud Datastore, Cloud SQL, BigQuery), Google container builders, GCP client libraries, and the Cloud SDK.
- Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and Spark SQL to manipulate DataFrames in Scala.
- Expertise in Python and Scala; developed user-defined functions (UDFs) for Hive and Pig using Python.
- Experience in developing MapReduce programs using Apache Hadoop to analyze big data as per requirements.
- Hands-on experience with Spark MLlib utilities such as classification, regression, clustering, collaborative filtering, and dimensionality reduction.
- Experience in working with Flume and NiFi for loading log files into Hadoop.
- Experience in working with NoSQL databases like HBase and Cassandra.
- Experienced in creating shell scripts to push data loads from various sources from the edge nodes onto the HDFS.
- Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
- Worked with Cloudera and Hortonworks distributions.
- Expert in developing SSIS/DTS Packages to extract, transform and load (ETL) data into data warehouse/data marts from heterogeneous sources.
- Good working knowledge of the Amazon Web Services (AWS) Cloud Platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
- Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data governance and Metadata Management, Master Data Management and Configuration Management.
- Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality (see the PySpark sketch at the end of this summary).
- Expertise in designing complex mappings, performance tuning, and building Slowly Changing Dimension tables and Fact tables.
- Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
- Experienced in building automated regression scripts in Python for validating ETL processes between multiple databases such as Oracle, SQL Server, Hive, BigQuery, and MongoDB.
- Developed Spark jobs using Scala/PySpark and Spark SQL for faster data processing.
- Proficient in SQL across several dialects, including MySQL, PostgreSQL, Redshift, Cloud SQL, BigQuery, SQL Server, and Oracle.
- Expert in building Enterprise Data Warehouse (EDW/DWH) appliances from scratch using both the Kimball and Inmon approaches.
- Experience in designing Star and Snowflake schemas for Data Warehouse and ODS architectures.
- Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
- Well experienced in Normalization and De-Normalization techniques for optimum performance in relational and dimensional database environments.
- Good knowledge of Data Marts, OLAP, and Dimensional Data Modeling with the Ralph Kimball methodology (Star Schema and Snowflake modeling for Fact and Dimension tables) using Analysis Services.
- Strong analytical and problem-solving skills and the ability to follow through with projects from inception to completion.
- Ability to work effectively in cross-functional team environments, excellent communication, and interpersonal skills.
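A minimal sketch of the Python UDF work described above, registering a Python function for use with Spark SQL against Hive tables; the table and column names ("orders", "amount") are illustrative assumptions, not taken from any specific project below.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Spark session with Hive support so the registered UDF can be used in HiveQL queries.
    spark = SparkSession.builder.appName("udf-sketch").enableHiveSupport().getOrCreate()

    def amount_band(amount):
        # Bucket a numeric amount into a coarse label; None-safe.
        if amount is None:
            return "unknown"
        return "high" if amount > 1000 else "low"

    # Register for DataFrame use and for Spark SQL / HiveQL queries.
    amount_band_udf = udf(amount_band, StringType())
    spark.udf.register("amount_band", amount_band, StringType())

    # Hypothetical Hive table and column, used only for illustration.
    df = spark.table("orders").withColumn("band", amount_band_udf("amount"))
    df.show()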
TECHNICAL SKILLS
Programming Language: Python 3, Java 8, Scala 2.x
Hadoop/Spark Ecosystem: Hadoop 2.x, Spark 2.x, Hive 2.x, HBase 2.2.0, NiFi 1.9.2, Sqoop 1.4.6, Flume 1.9.0, Kafka 2.3.0, Cassandra 3.11, Yarn 1.17.3, Mesos 1.8.0, ZooKeeper 3.4.x
Database: Oracle, MySQL, SQL Server 2008
Java/J2EE: Servlet, JSP, Struts2, Spring, Spring Boot, Spring Batch, EJB, JDBC, Hibernate, MyBatis, Web Services, SOAP, Rest, RabbitMQ, MVC, HTML, CSS, JavaScript, jQuery, XML, JSON, Log4j, JUnit, EasyMock, Mockito
AWS: ElasticSearch, Redshift, Lambda, DynamoDB, Kinesis, EMR, S3, RDS
Azure: Data Factory, Stream Analytics, Synapse, CosmosDB, SQL, Blob Storage
Linux: Shell, Bash
Dev Tools: Git, JIRA, Maven, sbt, Vim, nano, pip, JUnit, Jenkins, PySpark
Operating System: Windows, Linux (RedHat, CentOS, Ubuntu), macOS
PROFESSIONAL EXPERIENCE
Confidential | Atlanta, GA
Data Engineer
Responsibilities:
- Exposure to Full Lifecycle (SDLC) of Data Warehouse projects including Dimensional Data Modeling.
- Built data warehouse structures and created fact, dimension, and aggregate tables using dimensional modeling with Star and Snowflake schemas.
- Created and worked with various SQL and NoSQL databases such as MySQL, DynamoDB and MongoDB and connected to the database through DB Instances using the AWS Java SDK.
- Developed APIs using AWS Lambda to manage servers and run code against the database.
- Involved in developing AWS Lambda functions to manage some of the AWS services.
- Experience in designing Terraform configurations and deploying them through Cloud Deployment Manager to spin up resources such as cloud virtual networks and Compute Engine instances in public and private subnets, along with autoscalers, in Google Cloud Platform (GCP).
- Extracted, transformed, and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL activity.
- Created sessions, extracted data from various sources, transformed the data according to requirements, and loaded it into the data warehouse.
- Used transformations such as Filter, Expression, Sequence Generator, Update Strategy, Joiner, Router, and Aggregator to create robust mappings in the Informatica PowerCenter Designer.
- Used Informatica PowerCenter for ETL (extraction, transformation, and loading) of data from heterogeneous source systems into the target database.
- Used ER Studio for Creating/Updating Data Models.
- Created mappings using Designer and extracted data from various sources, transformed data according to the requirement.
- Developed Spark applications on Databricks using Python, Scala and Spark-SQL for data extraction, transformation and aggregation from multiple file formats for analyzing & transforming the data.
- Developed Spark applications in PySpark in a distributed environment to load large numbers of CSV files with different schemas into Hive ORC tables.
- Designed Star and Snowflake Data Models for Enterprise Data Warehouse using ER Studio.
- Created and implemented ER models and dimensional models (star schemas).
- Translated the business requirements into workable functional and non-functional requirements at a detailed production level using Workflow Diagrams, Sequence Diagrams, Activity Diagrams, and Use Case Modeling with the help of ER Studio.
- Migrated data from Hive to MySQL, to be displayed in the UI, using a PySpark job that runs across different environments.
- Evaluated Snowflake design considerations for any change in the application and built the logical and physical data models for Snowflake as per the required changes.
- Built and maintained Docker container clusters managed by Kubernetes on Azure using Linux, Bash, Git, and Docker. Utilized Kubernetes and Docker as the runtime environment of the CI/CD system to build, test, and deploy during production.
- Queried multiple databases like Snowflake, UDB and MySQL for data processing.
- Analyzed and processed complex data sets using advanced querying with Presto, Hive, and Teradata, visualization with Tableau, and analytics tools such as Python and SAS.
- Developed ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake SnowSQL, writing SQL queries against Snowflake.
- Created use case specific implementation guides in accordance with HL7 standards including FHIR.
- Familiarity with standards such as CCD/CCDA, DIRECT, HIE, MDM, FHIR
- Knowledge of existing HL7 standards and the evolving Fast Healthcare Interoperability Resources (FHIR) standard
- Installed and configured Apache Airflow for workflow management and created workflows in Python.
- Developed Python code for tasks, dependencies, SLA watchers, and time sensors for each job, for workflow management and automation with Airflow (see the sketch after this role).
- Automated workflows with Airflow and CI/CD tooling.
Environment: MySQL, Python, Databricks, Snowflake, SQL Server, Oracle, Teradata, flat files, SharePoint, Scala, Hive, Kafka, MapReduce, Sqoop, Spark, GitHub, Jenkins
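A minimal sketch of the Airflow pattern described in this role (time sensor, task dependency, and an SLA on the load step), assuming Airflow 2.x; the DAG id, schedule, and callable are hypothetical placeholders.

    from datetime import datetime, time, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.sensors.time_sensor import TimeSensor

    def load_to_warehouse():
        # Placeholder for the actual load step.
        print("loading batch to warehouse")

    with DAG(
        dag_id="daily_warehouse_load",      # hypothetical DAG id
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Hold the pipeline until 02:00 before running the load.
        wait_until_2am = TimeSensor(task_id="wait_until_2am", target_time=time(2, 0))

        load = PythonOperator(
            task_id="load_to_warehouse",
            python_callable=load_to_warehouse,
            sla=timedelta(hours=1),          # flag the task if it runs past its SLA window
        )

        wait_until_2am >> load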
Confidential | Saint Louis, MO
Data Engineer Intern
Responsibilities:
- Responsible for the design, implementation, and architecture of very large-scale data intelligence solutions around big data platforms.
- Developed multiple POCs using Spark and Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
- Created and worked with various SQL and NoSQL databases such as MySQL, DynamoDB and MongoDB and connected to the database through DB Instances using the AWS Java SDK.
- Developed APIs using AWS Lambda to manage servers and run code against the database.
- Involved in developing AWS Lambda functions to manage some of the AWS services.
- Used Amazon Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as the storage mechanism.
- Executed startup and stop scripts in Python for starting the Hadoop NameNode, DataNode, and Spark worker and master servers.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Capable of using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS (see the boto3 sketch after this role).
- Wrote custom PySpark scripts to replicate the functionality of the Teradata code in the process of transforming data from Teradata to the AWS one lake.
- Extracted, transformed, and loaded data using SAS/ETL.
- Developed mappings, sessions, and workflows to extract, validate, and transform data according to business rules using DataStage.
- Maintain AWS Data pipeline as web service to process and move data between Amazon S3, Amazon EMR and Amazon RDS resources.
Environment: AWS, DynamoDB, MySQL, Python, Databricks, Snowflake, SQL Server, Oracle, Teradata, flat files, SharePoint, Scala, Hive, Kafka, MapReduce, Sqoop, Spark, GitHub, Jenkins
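A minimal sketch of using boto3 to submit and monitor a Spark step on EMR, in the spirit of the EMR/S3/CloudWatch work above; the cluster id, bucket, and script path are hypothetical placeholders.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Submit a spark-submit step to an existing cluster (placeholder cluster id and S3 path).
    response = emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXX",
        Steps=[{
            "Name": "pyspark-transform",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-bucket/jobs/transform.py"],
            },
        }],
    )
    step_id = response["StepIds"][0]

    # Block until the step completes (or fails), then report its final state.
    emr.get_waiter("step_complete").wait(ClusterId="j-XXXXXXXXXXXX", StepId=step_id)
    status = emr.describe_step(ClusterId="j-XXXXXXXXXXXX", StepId=step_id)
    print(status["Step"]["Status"]["State"])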
Confidential
Data Engineer Intern
Responsibilities:
- Responsible for ingesting large volumes of user behavioral data and customer profile data to Analytics Data store.
- Built Sqoop jobs for ingesting data from FTP servers into databases.
- Developed Scala-based Spark applications for data cleansing, event enrichment, data aggregation, de-normalization, and data preparation needed by the machine learning and reporting teams (see the PySpark sketch after this role).
- Worked on troubleshooting Spark applications to make them more error tolerant.
- Worked on fine-tuning Spark applications to improve the overall processing time of the pipelines.
- Experienced in handling large datasets using Spark's in-memory capabilities, broadcast variables, and effective and efficient joins, transformations, and other operations.
- Experience developing pipelines in GCP and Azure.
- Collaborated with the infrastructure, network, database, application and BA teams to ensure data quality and availability.
- Diagnosed and documented operational problems by following standards and procedures using JIRA.
- Developed custom multi-threaded Python-based jobs.
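A minimal sketch of the cleansing and enrichment pattern described in this role, shown in PySpark for brevity (the production jobs were Scala); the paths and column names are illustrative assumptions.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("event-enrichment-sketch").getOrCreate()

    # Hypothetical input locations for raw behavioral events and customer profiles.
    events = spark.read.parquet("s3://example-bucket/raw/events/")
    profiles = spark.read.parquet("s3://example-bucket/raw/profiles/")

    cleaned = (
        events
        .dropDuplicates(["event_id"])            # de-duplicate replayed events
        .filter(F.col("user_id").isNotNull())    # drop malformed rows
        .join(F.broadcast(profiles), "user_id")  # enrich with profile attributes via a broadcast join
    )

    # De-normalized daily aggregate handed off to reporting/ML consumers.
    daily_counts = (
        cleaned
        .withColumn("event_date", F.to_date("event_ts"))
        .groupBy("event_type", "event_date")
        .count()
    )
    daily_counts.write.mode("overwrite").partitionBy("event_date").parquet(
        "s3://example-bucket/curated/daily_event_counts/"
    )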
Confidential
Data Engineer
Responsibilities:
- Executed entire Data Life Cycle and actively involved in all the phases of project life cycle including data acquisition, data cleaning, data engineering, feature scaling, feature engineering and statistical modeling.
- Created multiple workbooks, dashboards, and charts using calculated fields, quick table calculations, Custom hierarchies, sets & parameters to meet business needs using Tableau.
- Created Hive queries to join multiple tables of a source system and load them into Elasticsearch tables, and used HiveQL scripts to perform the incremental loads.
- Worked with different sources such as Oracle, Teradata, SQL Server 2012, Excel, flat files, complex flat files, Cassandra, MongoDB, HBase, and COBOL files.
- Used Python, R, and SQL to create statistical models including Multivariate Regression, Linear Regression, Logistic Regression, PCA, Random Forests, Decision Trees, and Support Vector Machines for estimating the risks of welfare dependency (see the sketch at the end of this section).
- Led teams performing Data mining, Data Modeling, Data/Business Analytics, Data Visualization, Data Governance & Operations, and Business Intelligence (BI) Analysis and communicated insights and results to the stakeholders.
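A minimal sketch of the Python side of the modeling work above, using scikit-learn logistic regression; the input file, feature columns, and target column are hypothetical placeholders, not the actual project data.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Hypothetical modeling extract; column names are assumptions for illustration.
    df = pd.read_csv("welfare_features.csv")
    X = df[["age", "household_size", "prior_claims"]]
    y = df["dependency_flag"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Fit a logistic regression and report test-set AUC.
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))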