Software/Data Engineer Resume
Richmond, VA
SUMMARY
- 12+ years of experience implementing Big Data Engineering, Cloud Data Engineering, Data Warehouse, Data Mart, Data Visualization, Reporting, Data Quality, and Data Virtualization solutions
- Proficient with the Apache Spark ecosystem, including Spark Streaming, using Scala and Python
- Led production support for the team, including EMR AMI rehydration, deployments, incident response, and all other production change orders
- Experience in data transformation, source-to-target data mapping across database schemas, and data cleansing procedures
- In-depth knowledge of Hadoop architecture and its components, such as HDFS, YARN, Resource Manager, Node Manager, Job History Server, Job Tracker, Task Tracker, NameNode, DataNode, and MapReduce
- Adept in programming languages such as Scala and Python, as well as Big Data technologies such as Hadoop and Hive
- Developed highly optimized Spark applications to perform data cleansing, validation, transformation, and summarization activities according to requirements
- Developed Spark Structured Streaming and batch applications for various business use cases in Java, Scala, and Python
- Experience in Extract, Transform, and Load (ETL) of data from various sources into stores such as Cassandra, DynamoDB, and AWS S3, as well as data processing (integrating, aggregating, and moving data) using Apache Kafka, Snowflake, and AWS S3
- Expertise in AWS cloud services such as EMR, EC2, S3, Lambda, DynamoDB, SNS, CloudWatch, EventBridge, and Data Pipeline for big data development
- Worked with various file formats such as CSV, JSON, XML, ORC, Avro, and Parquet
- Experienced in data management solutions covering DWH/data architecture design, data governance implementation, and Big Data
- Experienced in handling Big Data using Hadoop ecosystem components such as Sqoop and Hive
- Experience in designing, building, and implementing a complete Hadoop ecosystem comprising MapReduce, HDFS, Hive, Sqoop, Oozie, HBase, MongoDB, and Spark
- Expertise in Python and shell scripting; experienced in writing Spark scripts in Python, Scala, and SQL for development and analysis
- Proficient in building PySpark, Scala & Java applications for interactive analysis, batch processing, and stream processing
- Involved in all the phases of the Software Development Life Cycle (Requirements Analysis, Design, Development, Testing, Deployment, and Support) and Agile methodologies
- Experienced in Normalization (1NF, 2NF, 3NF, and BCNF) and denormalization techniques for effective and optimal performance in OLTP and OLAP environments
- Excellent Knowledge of Relational Database Design, Data Warehouse/OLAP concepts, and methodologies
- Experience in designing Star schema and Snowflake schema for Data Warehouse and ODS architectures
- Expertise in OLTP/OLAP System Study, Analysis and E-R modeling, developing Database Schemas like Star schema and Snowflake schema used in relational, dimensional, and multidimensional modeling
- Experience in coding SQL for developing Procedures, Triggers, and Packages
- Experience in creating separate virtual data warehouses with different size classes in AWS Snowflake
- Experience writing Spark Streaming and Spark batch jobs and using Spark MLlib for analytics
- Experience in importing and exporting data using Sqoop between HDFS and relational database systems (RDBMS) such as Oracle, DB2, and SQL Server
- Experienced in data analysis, design, development, implementation, and testing using data conversions, Extraction, Transformation and Loading (ETL), SQL Server, Oracle, and other relational and non-relational databases
- Well experienced in normalization, denormalization, and standardization techniques for optimal performance in relational and dimensional database environments
- Solid understanding of AWS (Redshift, S3, EC2), Apache Spark, and Scala processes and concepts
- Hands on experience in machine learning, big data, data visualization, R and Python development, Linux, SQL, GIT/GitHub
- Experienced in data modeling, covering RDBMS concepts, logical and physical data modeling up to 3NF, and multidimensional data modeling schemas (Star schema, Snowflake modeling, facts, and dimensions)
- Experienced working on NoSQL databases like Cassandra and DynamoDB
- Worked and extracted data from various database sources like Oracle, SQL Server, and DB2
- Extensive working experience with Python, including scikit-learn, SciPy, Pandas, and NumPy, for developing machine learning models and manipulating and handling data
- Expertise in complex data design/development, master data, and metadata, with hands-on experience in data analysis for planning, coordinating, and executing on records and databases
- Implemented machine learning algorithms on large datasets to understand hidden patterns and capture insights
TECHNICAL SKILLS
Big Data Tools: Hadoop, HDFS, Sqoop, HBase, Hive, MapReduce, Spark, Kafka
Cloud Technologies: Snowflake, SnowSQL, Azure, Databricks, AWS (EMR, EC2, S3, CloudWatch, EventBridge, Lambda, SNS)
ETL Tools: SSIS, Informatica PowerCenter
Modeling and Architecture Tools: Erwin, ER Studio, Star-Schema and Snowflake-Schema Modeling, Fact and Dimension Tables, Pivot Tables
Database: Snowflake Cloud Database, Oracle, MS SQL Server, MySQL, Cassandra, DynamoDB
Operating Systems: Microsoft Windows, Unix, Linux
Reporting Tools: MS Excel, Tableau, Tableau server, Tableau Reader, Power BI, QlikView
Methodologies: Agile, UML, System Development Life Cycle (SDLC), Ralph Kimball, Waterfall Model
Machine Learning: Regression Models, Classification Models, Clustering, Linear Regression, Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, K-Nearest Neighbors (KNN), K-Means, Naïve Bayes, Time Series Analysis, PCA, Avro, MLbase
Python and R Libraries: R - tidyr, tidyverse, dplyr, lubridate, ggplot2, tseries; Python - Beautiful Soup, NumPy, SciPy, Matplotlib, Seaborn, Pandas, scikit-learn
Programming Languages: SQL, R (Shiny, RStudio), Python (Jupyter Notebook, PyCharm IDE), Scala
PROFESSIONAL EXPERIENCE
Confidential, Richmond, VA
Software/Data Engineer
Responsibilities:
- Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, Redshift, IAM)
- Utilize programming languages such as Java, Scala, and Python, along with NoSQL databases and cloud-based data warehousing
- Aggregate and ingest real-time transactional data into DataStax Cassandra DB
- Develop and implement POCs for Spark batch and Structured Streaming applications in Python and Scala
- Develop Spark Structured Streaming and batch applications in Java, Scala, and Python for data aggregation and ingestion
- Consume Avro data from multiple Kafka topics with Spark Structured Streaming applications (see the sketch after this list)
- Ingested Avro, JSON, and Parquet data from the Data Lake/OneLake into Cassandra DB through Spark batch ETL, calling an API for consumer authorization using Spring Framework jars
- Develop a Spark Scala application that ingests data from the Snowflake warehouse using snowflake-spark connector jars to prime data into Cassandra DB
- Develop Lambda functions with AWS S3 and CloudWatch for creating infrastructure and scheduling PySpark applications
- Develop a PySpark batch application for computing and ingesting data from Snowflake into AWS S3, creating multiple CSV files of more than 5 GB each based on the use case
- Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
- Implement a PySpark batch process for ingesting CSV data from AWS S3 into DynamoDB with scheduled scaling
- Performed migration of Hive and MapReduce jobs from on-premises MapR to the AWS cloud using EMR and Qubole
- Implement batch writes with primary keys and secondary indexes to remove duplication and expedite writes in DynamoDB
- Implement a checkpointing/offset technique and toggle offsets to track DynamoDB batch insertions
- Develop custom checkpointing/offsets to S3 using the Spark 3.x timestamp checkpointing feature and the Akka framework to prevent data loss during regional rehydration of Structured Streaming Spark applications
- Utilize Jenkins for configuration management of the CI/CD pipeline (Bogie/One pipeline) to create Lambda and EMR infrastructure
- Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on Amazon SageMaker.
- Ingest data from relational database systems into the AWS S3 cloud environment using Sqoop
- Write Bash bootstrap scripts for infrastructure creation, vulnerability remediation, and improvements
- Lead AWS EMR version upgrades, making all Spark applications compatible with the affected Spark and Hadoop versions
- Managed large datasets using Pandas data frames
- Worked on ETL migration services by developing and deploying AWS Lambda functions to generate a serverless data pipeline for EMR infrastructure creation, deploying Step Functions, and monitoring them using CloudWatch rules
- Schedule and execute Spark applications through Step Functions, implemented on AWS EMR/EC2 instances
- Create source-to-target mappings to drive database design, table structures, and definitions
- Monitor and re-engineer EMR cluster sizes/EC2 instances based on Spark job configuration requirements
- Perform performance tuning for Spark jobs (including GC tuning)
- Involved in designing, building, and maintaining scalable data pipelines and working with the ETL framework
- Develop unit test cases using JUnit/ScalaTest (FunSuite) and resolve vulnerabilities flagged by Whitesource/Eratocode/Qualys scans
- Developed Spark jobs on Databricks to perform tasks like data cleansing, data validation, and standardization, and then applied transformations as per the use cases.
- Familiar with data architecture including data ingestion pipeline design, data modelling and data mining.
- Develop test data by producing data on Kafka topics using JMeter for performance testing
- Work with analytics tools such as Splunk, Ganglia, and DataStax OpsCenter to monitor Spark applications and database memory utilization
- Lead production support for the team, including EMR AMI rehydration, deployments, incident response, and all other production change orders
- Worked in an Agile environment, using GitHub for version control and TeamCity for continuous builds
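The following is an illustrative sketch, not the production code: a minimal PySpark Structured Streaming job that consumes Avro records from a Kafka topic and writes each micro-batch to Cassandra via foreachBatch, along the lines described above. The broker address, topic, Avro schema, keyspace/table names, and checkpoint path are hypothetical placeholders, and the spark-avro and spark-cassandra-connector packages are assumed to be on the classpath.

```python
# Illustrative only: Kafka (Avro) -> Spark Structured Streaming -> Cassandra.
from pyspark.sql import SparkSession
from pyspark.sql.avro.functions import from_avro

spark = SparkSession.builder.appName("kafka-avro-to-cassandra").getOrCreate()

# Avro schema of the Kafka message value (normally served by a schema registry)
value_schema = """
{"type": "record", "name": "Txn",
 "fields": [{"name": "id", "type": "string"},
            {"name": "amount", "type": "double"}]}
"""

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "transactions")
       .load())

# Decode the binary Kafka value into columns
parsed = raw.select(from_avro("value", value_schema).alias("txn")).select("txn.*")

def write_to_cassandra(batch_df, batch_id):
    # spark-cassandra-connector writes each micro-batch to the target table
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="payments", table="transactions")
     .mode("append")
     .save())

query = (parsed.writeStream
         .foreachBatch(write_to_cassandra)
         .option("checkpointLocation", "s3://example-bucket/checkpoints/txn")
         .start())
query.awaitTermination()
```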
Environment: Spark (Scala/Java/Python), Hadoop, SQL, Snowflake Warehouse, NoSQL, Cassandra DB, AWS DynamoDB, AWS (EMR, EC2, S3, CloudWatch, EventBridge, Lambda, SNS), CI/CD pipeline (EMR infrastructure, Lambda, S3 trigger & CloudWatch/EventBridge event creation)
Confidential, Boston, MA
Data Engineer
Responsibilities:
- Involved in Agile development methodology as an active member of Scrum meetings
- Involved in data profiling and merging data from multiple data sources
- Developed a Python-based REST API to track performance using Flask, SQLAlchemy, and PostgreSQL
- Involved in Big Data requirement analysis and in designing and developing solutions for ETL and Business Intelligence platforms
- Designed 3NF data models for ODS and OLTP systems, and dimensional data models using Star and Snowflake schemas
- Orchestrate, manage, and schedule data workflows by creating Airflow DAGs in Python (see the sketch after this list)
- Used the Spark SQL Scala and Python interfaces, which automatically convert case-class RDDs to schema RDDs
- Worked in the Snowflake environment to remove redundancy and loaded real-time data from various data sources into HDFS using Kafka
- Used Spark Streaming to collect data from Kafka in near-real-time, perform the necessary transformations and aggregations to build the common learner data model, and store the data in a NoSQL store (HBase)
- Utilized SnowSQL (CLI Client) to connect to the data warehouse and performed loading and unloading of the data.
- Automated data ingestion or data loading with SnowSQL using Python.
- Executed SQL queries and performed all DDL and DML operations & developed batch scripts using SnowSQL.
- Designed and implemented a fully operational, production-grade, large-scale data solution on the Snowflake Data Warehouse
- Performed end-to-end architecture and implementation assessments of various AWS Cloud services, including Amazon EMR, Redshift, Glue, IAM, RDS, Lambda, CloudWatch, and Athena
- Deploy new hardware and software environments required for PostgreSQL/Hadoop and expand existing environments
- Work with structured/semi-structured data ingestion and processing on the AWS Cloud using S3 and Python; migrate on-premises big data workloads to AWS Databricks
- Worked on AWS Data Pipeline to configure data loads from S3 into Redshift
- Involved in migrating data from the existing RDBMS to Hadoop using Sqoop for processing, and evaluated the performance of various algorithms/models/strategies on real-world data sets
- Created Hive tables for loading and analyzing data and developed Hive queries to process data and generate data cubes for visualization
- Build Jenkins jobs to create the platform for the Hadoop and Spark environment on EMR
- Extracted data from HDFS using Hive and Presto, performed data analysis using Spark with Scala and PySpark, and performed feature selection and created nonparametric models in Spark
- Handled importing data from various data sources, performed transformations using Hive and loaded data into HDFS
- Captured otherwise unused unstructured data and stored it in HDFS and HBase; scraped data using Beautiful Soup and saved it into MongoDB (JSON format)
- Worked on AWS S3 buckets and secure intra-cluster file transfer between PNDA and S3
- Designed and implemented Data Marts, coordinated with DBAs, and generated and used DDL and DML
- Provide data architecture support to enterprise data management efforts, such as development of the enterprise data model and master and reference data, as well as project support such as development of physical data models, data warehouses, and data marts
- Used PySpark and Pandas to calculate the moving average and RSI score of stocks and loaded the results into the data warehouse
- Develop, prototype, and test predictive algorithms; filter and clean data, and review reports and performance indicators
- Conducted data blending and data preparation using Alteryx and SQL for Tableau consumption, and published data sources to Tableau Server
- Developed a NiFi workflow to pick up data from the Data Lake as well as from servers and send it to the Kafka broker
- Used Jenkins for CI/CD, Docker as a container tool and Git as a version control tool.
- Create, modify, and execute DDL on AWS Redshift and Snowflake tables to load data
- Worked with data governance, data quality, data lineage, and data architecture teams to design various models and processes
- Independently coded new programs and designed tables to load and test programs effectively for given POCs using Big Data/Hadoop
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS
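A minimal sketch of the kind of Airflow DAG referenced above, assuming Airflow 2.x: two placeholder tasks chained on a daily schedule. The DAG id, task callables, and schedule are illustrative assumptions, not the actual production workflow.

```python
# Illustrative only: a two-task daily Airflow DAG (Airflow 2.x style).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_from_source(**context):
    # Pull the day's records from the upstream system (placeholder logic)
    pass

def load_to_warehouse(**context):
    # Write the cleaned records into the warehouse (placeholder logic)
    pass

with DAG(
    dag_id="daily_ingest",            # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_from_source)
    load = PythonOperator(task_id="load", python_callable=load_to_warehouse)
    extract >> load                   # run extract, then load
```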
Environment: Python, R/RStudio, SQL, Oracle, Cassandra, MongoDB, AWS, Snowflake, Azure Databricks, Hadoop, Hive, MapReduce, Scala, Spark, Kafka, MLlib, regression, Tableau
Confidential, Boston MA
Data Engineer
Responsibilities:
- Gathered, analyzed, and translated business requirements to technical requirements, communicated with other departments to collect client business requirements and access available data
- Acquired, cleaned, and structured data from multiple sources and maintained databases/data systems; identified, analyzed, and interpreted trends and patterns in complex data sets
- Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB
- Developed and implemented data collection systems and other strategies that optimize statistical efficiency and data quality
- Create and statistically analyze large data sets of internal and external data
- Worked closely with the marketing team to deliver actionable insights from huge volumes of data coming from different marketing campaigns and customer interaction metrics such as web portal usage, email campaign responses, public site interaction, and other customer-specific parameters
- Performed incremental and full loads to transfer data from OLTP to a Snowflake-schema Data Warehouse using different data flow and control flow tasks, and maintained existing jobs
- Design and implement secure data pipelines into a Snowflake data warehouse from on-premises and cloud data sources
- Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on Amazon SageMaker.
- Created best practices and standards for data pipelining and integration with Snowflake data warehouses
- Developed a NiFi workflow to pick up data from the Data Lake as well as from servers and send it to the Kafka broker
- Responsible for data cleaning, feature scaling, and feature engineering using NumPy and Pandas in Python
- Conducted exploratory data analysis using Python Matplotlib and Seaborn to identify underlying patterns and correlations between features
- Designed and built Big Data ingestion and query platforms with Spark, Hadoop, Hive, Oozie, Sqoop, Presto, Amazon EMR, Amazon S3, EC2, AWS CloudFormation, RDS, Glue, IAM, and Control-M
- Used the Spark SQL Scala and Python interfaces, which automatically convert case-class RDDs to schema RDDs
- Worked with NoSQL databases such as HBase, creating tables to load large sets of semi-structured data coming from source systems
- Configured Spark Streaming to receive ongoing information from Kafka and store the stream data in HDFS
- Used information value, principal component analysis, and Chi-square feature selection techniques
- Used Python and R scripting to implement machine learning algorithms for data prediction and forecasting
- Designed and developed the core data pipeline code, involving work in Python and built on Kafka and Storm.
- Developed Data Migration and Cleansing rules for Integration Architecture (OLTP, ODS, DW)
- Created Lambda functions with Boto3 to deregister unused AMIs across all application regions to reduce EC2 costs (see the sketch after this list)
- Designed tables and columns in Redshift for data distribution across data nodes in the cluster, keeping columnar database design considerations in mind
- Tested the ETL process both before and after data validation
- Experience in developing packages in RStudio with a Shiny interface
- Experimented with multiple classification algorithms, such as Logistic Regression, Support Vector Machine (SVM), Random Forest, AdaBoost, and Gradient Boosting, using Python scikit-learn and evaluated performance on customer discount optimization across millions of customers
- Built models using Python and PySpark to predict the probability of attendance for various campaigns and events
- Implemented classification algorithms such as Logistic Regression, K-Nearest Neighbors (KNN), and Random Forests to predict customer churn and customer interface
- Used NiFi to automate data flow between disparate systems
- Implemented a CI/CD pipeline using Jenkins and Airflow for Docker containers
- Used GIT for version control.
- Performed data visualization and designed dashboards with Tableau, generating complex reports including charts, summaries, and graphs to present findings to the team and stakeholders
- Followed Agile methodology, including test-driven development and pair-programming concepts
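A hedged sketch of the AMI-cleanup Lambda described above: a Boto3 handler that deregisters account-owned AMIs not referenced by any instance in the listed regions. The region list and the "in use" check are simplifying assumptions; pagination and other AMI references (e.g., Launch Templates) are omitted for brevity.

```python
# Illustrative only: deregister account-owned AMIs not used by any instance.
import boto3

REGIONS = ["us-east-1", "us-west-2"]  # assumed application regions

def handler(event, context):
    for region in REGIONS:
        ec2 = boto3.client("ec2", region_name=region)

        # AMIs owned by this account in the region
        images = ec2.describe_images(Owners=["self"])["Images"]

        # Image IDs referenced by existing instances (single page for brevity)
        in_use = {
            instance["ImageId"]
            for reservation in ec2.describe_instances()["Reservations"]
            for instance in reservation["Instances"]
        }

        for image in images:
            if image["ImageId"] not in in_use:
                ec2.deregister_image(ImageId=image["ImageId"])
```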
Environment: OLTP Data Warehouse, Hadoop, Hive, HBase, Spark, Snowflake, R/RStudio, Python (Pandas, NumPy, scikit-learn, TensorFlow, SciPy, Seaborn, Matplotlib), SQL, Machine Learning, ggplot, lattice, MASS, mice, and logit
Confidential - McLean, VA
Data Engineer
Responsibilities:
- Worked with Hadoop eco system covering HDFS, HBase, YARN and MapReduce
- Worked with the Oozie Workflow Engine to run workflow jobs with actions that execute Hadoop MapReduce, Hive, and Spark jobs
- Performed data mapping and data design (data modeling) to integrate data across multiple databases into the EDW
- Responsible for design and development of advanced R/Python programs to prepare, transform, and harmonize data sets in preparation for modeling
- Hands-on experience with Hadoop/Big Data technologies for storage, querying, processing, and analysis of data
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment for big data resources; used clustering techniques such as K-Means to identify outliers and classify unlabeled data
- Transformed raw data into actionable insights by applying statistical techniques, data mining, data cleaning, and data quality/integrity checks using Python (scikit-learn, NumPy, Pandas, and Matplotlib) and SQL
- Calculated errors for various machine learning algorithms such as Linear Regression, Ridge Regression, Lasso Regression, Elastic Net Regression, KNN, Decision Tree Regressor, SVM, Bagged Decision Trees, Random Forest, AdaBoost, and XGBoost, and chose the best model based on MAE (see the sketch after this list)
- Experimented with ensemble methods, using different bagging and boosting techniques to increase the accuracy of the training model
- Worked with ETL processes to transfer/migrate data from relational databases and flat files into common staging tables in various formats and on to meaningful data in Oracle and MS SQL
- Developed Talend Big Data jobs to load heavy volumes of data into S3
- Created a task scheduling application to run in an EC2 environment on multiple servers
- Identified target groups by conducting segmentation analysis using clustering techniques such as K-Means
- Conducted model optimization and comparison using stepwise selection based on AIC values
- Used cross-validation to test models on different batches of data, optimizing models and preventing overfitting
- Ingested data from relational database systems into the AWS S3 cloud environment using Sqoop
- Worked and collaborated with various business teams (operations, commercial, innovation, HR, logistics, safety, environmental, accounting) to analyze and understand changes in key financial metrics and provide ad-hoc analysis that can be leveraged to build long term points of view where value can be captured
- Developed a fully automated continuous integration system using Git, Jenkins, and custom tools developed in Python and Bash
- Involved in the development of Agile, iterative, and proven data modeling patterns that provide flexibility.
- Involved in loading data from REST endpoints to Kafka producers and transferring the data to Kafka brokers
- Supported data quality management by implementing proper data quality checks in data pipelines.
- Explored and analyzed customer-specific features using Matplotlib and Seaborn in Python and dashboards in Tableau
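A minimal sketch of the model-comparison step described above: several scikit-learn regressors scored by cross-validated MAE. The synthetic dataset and the model list are placeholders for the actual features and candidate algorithms.

```python
# Illustrative only: compare regressors by cross-validated MAE with scikit-learn.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real feature matrix and target
X, y = make_regression(n_samples=500, n_features=10, noise=0.3, random_state=42)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.01),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=42),
}

for name, model in models.items():
    # neg_mean_absolute_error reports "higher is better", so flip the sign
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    print(f"{name}: MAE = {-scores.mean():.3f} (+/- {scores.std():.3f})")
```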
Environment: Machine Learning, R Language, AWS, Hadoop, Big Data, Python, DB2, MongoDB, Web Services
Confidential - Bellevue, WA
Data Engineer
Responsibilities:
- Analyzed and translated Functional Specifications and Change Requests into Technical Specifications
- Designed and implemented Big Data analytics architecture, transferring data from the Oracle data warehouse, external APIs, and flat files to Hadoop using Hortonworks
- Designed and developed Use Cases, Activity Diagrams, Swim Lane Diagrams, and process flows using Unified Modeling Language (UML)
- Ran SQL queries for data validation and performed quality analysis on data extracts to ensure data quality and integrity across various database systems
- Involved with Data Profiling activities for new sources before creating new subject areas in warehouse
- Created DDL scripts for implementing Data Modeling changes
- Responsible for various data mapping activities from source systems
- Performed extensive data cleansing, data manipulation, data transformation, and data auditing
- Involved in SQL development, unit testing, and performance tuning, ensuring testing issues were resolved based on defect reports
- Involved in Data mapping specifications to create and execute detailed system test plans. Data mapping specifies what data will be extracted from an internal data warehouse, transformed and sent to an external entity
- Developed data pipelines using Big Data/Hadoop tools such as Flume, Sqoop, HBase, Spark, Pig, and MapReduce to ingest customer behavioral data
- Parsed complex files using Informatica data transformations (Normalizer, Lookup, Source Qualifier, Expression, Aggregator, Sorter, Rank, and Joiner) and loaded them into databases
- Worked on implementing a log producer in Scala that watches application logs, transforms incremental log entries, and sends them to a Kafka- and ZooKeeper-based log collection platform (see the sketch after this list)
- Followed Agile methodologies across various projects, setting up two-week sprints and daily stand-up meetings
- Implemented a Python-based distributed random forest via Python streaming
- Performed data discovery and built a stream that automatically retrieves data from a multitude of sources (SQL databases, external data such as social network data and user reviews) to generate KPIs using Tableau
- Installed Oozie workflow engine to run multiple Hive and Pig Jobs.
- Used GIT for version control.
- Wrote SQL queries for visualization and reporting systems; good experience with the visualization tool Tableau
- Wrote ETL scripts in SQL for extracting and validating data
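A hedged Python analogue of the log producer described above (the original was implemented in Scala): tail an application log file and publish new lines to a Kafka topic with kafka-python. The file path, topic name, and broker address are hypothetical.

```python
# Illustrative only: tail an application log and publish new lines to Kafka.
import time

from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(bootstrap_servers="broker:9092")  # placeholder broker

def tail_and_publish(path: str, topic: str) -> None:
    with open(path, "r") as log:
        log.seek(0, 2)  # start at end of file: only new (incremental) lines
        while True:
            line = log.readline()
            if not line:
                time.sleep(0.5)
                continue
            producer.send(topic, value=line.encode("utf-8"))

if __name__ == "__main__":
    tail_and_publish("/var/log/app/application.log", "app-logs")  # placeholders
```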
Environment: SQL Server, ETL, SSIS, SSRS, Tableau, Excel