Software/Data Engineer Resume
Richmond, VA
SUMMARY
- 12+ years of experience implementing Big Data Engineering, Cloud Data Engineering, Data Warehouse, Data Mart, Data Visualization, Reporting, Data Quality, and Data Virtualization solutions
- Proficient with the Apache Spark ecosystem, including Spark Streaming, using Scala and Python
- Led production support for the team, including EMR AMI rehydration, deployments, incident response, and all other production change orders
- Experience in data transformation, source-to-target data mapping across database schemas, and data cleansing procedures
- In-depth knowledge of Hadoop architecture and its components, such as HDFS, YARN, Resource Manager, Node Manager, Job History Server, Job Tracker, Task Tracker, NameNode, DataNode, and MapReduce
- Adept in programming languages such as Scala and Python, as well as Big Data technologies such as Hadoop and Hive
- Developed highly optimized Spark applications to perform data cleansing, validation, transformation, and summarization activities according to requirements
- Developed Spark Structured Streaming and batch applications for various business use cases in Java, Scala, and Python
- Experience in Extract, Transform, and Load (ETL) of data from various sources into stores such as Cassandra, DynamoDB, and AWS S3, as well as data processing (integrating, aggregating, and moving data) using Apache Kafka, Snowflake, and AWS S3
- Expertise in AWS cloud services such as EMR, EC2, S3, Lambda, DynamoDB, SNS, CloudWatch, EventBridge, and Data Pipeline for big data development
- Worked with various file formats such as CSV, JSON, XML, ORC, Avro, and Parquet
- Experienced in data management solutions covering DWH/data architecture design, data governance implementation, and Big Data
- Experienced in handling Big Data using Hadoop ecosystem components such as Sqoop and Hive
- Experience in designing, building, and implementing a complete Hadoop ecosystem comprising MapReduce, HDFS, Hive, Sqoop, Oozie, HBase, MongoDB, and Spark
- Expertise in Python and shell scripting; experienced in writing Spark scripts in Python, Scala, and SQL for development and analysis
- Proficient in building PySpark, Scala & Java applications for interactive analysis, batch processing, and stream processing
- Involved in all the phases of the Software Development Life Cycle (Requirements Analysis, Design, Development, Testing, Deployment, and Support) and Agile methodologies
- Experienced in Normalization (1NF, 2NF, 3NF, and BCNF) and denormalization techniques for effective and optimal performance in OLTP and OLAP environments
- Excellent Knowledge of Relational Database Design, Data Warehouse/OLAP concepts, and methodologies
- Experience in designing Star schema and Snowflake schema for Data Warehouse and ODS architectures
- Expertise in OLTP/OLAP System Study, Analysis and E-R modeling, developing Database Schemas like Star schema and Snowflake schema used in relational, dimensional, and multidimensional modeling
- Experience in coding SQL for developing Procedures, Triggers, and Packages
- Experience in creating separate virtual data warehouses with different size classes in AWS Snowflake
- Experience writing Spark Streaming and Spark batch jobs and using Spark MLlib for analytics
- Experience in importing and exporting data using Sqoop between HDFS and relational database systems (RDBMS) such as Oracle, DB2, and SQL Server
- Experienced in data analysis, design, development, implementation, and testing using data conversions, Extraction, Transformation and Loading (ETL), SQL Server, Oracle, and other relational and non-relational databases
- Well experienced in normalization, denormalization, and standardization techniques for optimal performance in relational and dimensional database environments
- Solid understanding of AWS (Redshift, S3, EC2), Apache Spark, and Scala processes and concepts
- Hands on experience in machine learning, big data, data visualization, R and Python development, Linux, SQL, GIT/GitHub
- Experienced in data modeling, covering RDBMS concepts, logical and physical data modeling up to 3NF, and multidimensional data modeling schemas (Star schema, Snowflake modeling, facts, and dimensions)
- Experienced working on NoSQL databases like Cassandra and DynamoDB
- Worked and extracted data from various database sources like Oracle, SQL Server, and DB2
- Extensive working experience with Python, including scikit-learn, SciPy, Pandas, and NumPy, for developing machine learning models and manipulating and handling data
- Expertise in complex data design/development, master data, and metadata, with hands-on experience in data analysis for planning, coordinating, and executing on records and databases
- Implemented machine learning algorithms on large datasets to understand hidden patterns and capture insights
TECHNICAL SKILLS
Big Data Tools: Hadoop, HDFS, Sqoop, HBase, Hive, MapReduce, Spark, Kafka
Cloud Technologies: Snowflake, SnowSQL, Azure, Databricks, AWS (EMR, EC2, S3, CloudWatch, EventBridge, Lambda, SNS)
ETL Tools: SSIS, Informatica PowerCenter
Modeling and Architecture Tools: Erwin, ER Studio, Star-Schema and Snowflake-Schema Modeling, Fact and Dimension Tables, Pivot Tables
Database: Snowflake Cloud Database, Oracle, MS SQL Server, MySQL, Cassandra, DynamoDB
Operating Systems: Microsoft Windows, Unix, Linux
Reporting Tools: MS Excel, Tableau, Tableau server, Tableau Reader, Power BI, QlikView
Methodologies: Agile, UML, System Development Life Cycle (SDLC), Ralph Kimball, Waterfall Model
Machine Learning: Regression Models, Classification Models, Clustering, Linear Regression, Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, K-Nearest Neighbors (KNN), K-Means, Naïve Bayes, Time Series Analysis, PCA, Avro, MLbase
Python and R Libraries: R - tidyr, tidyverse, dplyr, lubridate, ggplot2, tseries; Python - Beautiful Soup, NumPy, SciPy, Matplotlib, Seaborn, Pandas, scikit-learn
Programming Languages: SQL, R (Shiny, RStudio), Python (Jupyter Notebook, PyCharm IDE), Scala
PROFESSIONAL EXPERIENCE
Confidential, Richmond, VA
Software/Data Engineer
Responsibilities:
- Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, Redshift, IAM)
- Utilize programming languages such as Java, Scala, and Python, along with NoSQL databases and cloud-based data warehousing
- Aggregate and ingest real-time transactional data into DataStax Cassandra DB
- Develop and implement POCs for Spark batch and Structured Streaming applications in Python and Scala
- Develop Spark Structured Streaming and batch applications in Java, Scala, and Python for data aggregation and ingestion
- Consume Avro data from multiple Kafka topics with Spark Structured Streaming applications (see the sketch after this list)
- Ingested Avro, JSON, and Parquet data from the Data Lake/OneLake into Cassandra DB through Spark batch ETL, calling an API for consumer authorization using Spring Framework jars
- Develop a Spark Scala application that ingests data from the Snowflake warehouse using snowflake-spark connector jars to prime data into Cassandra DB
- Develop Lambda functions with AWS S3 and CloudWatch for creating infrastructure and scheduling PySpark applications
- Develop a PySpark batch application for computing and ingesting data from Snowflake into AWS S3, creating multiple CSV files of more than 5 GB each based on the use case
- Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
- Implement a PySpark batch process for ingesting CSV data from AWS S3 into DynamoDB with scheduled scaling
- Performed migration of Hive and MapReduce jobs from on-premises MapR to the AWS cloud using EMR and Qubole
- Implement batch writes with primary keys and secondary indexes to remove duplication and expedite writes in DynamoDB
- Implement a checkpointing/offset technique and toggle offsets to track DynamoDB batch insertions
- Develop custom checkpointing/offsets to S3 using the Spark 3.x timestamp checkpointing feature and the Akka framework to prevent data loss during regional rehydration of Structured Streaming Spark applications
- Utilize Jenkins for configuration management of the CI/CD pipeline (Bogie/One pipeline) to create Lambda and EMR infrastructure
- Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on Amazon SageMaker.
- Ingest data from relational database systems into the AWS S3 cloud environment using Sqoop
- Write Bash bootstrap scripts for infrastructure creation, vulnerability remediation, and improvements
- Lead AWS EMR version upgrades, making all Spark applications compatible with the affected Spark and Hadoop versions
- Managed large datasets using Pandas data frames
- Worked on ETL migration services by developing and deploying AWS Lambda functions to generate a serverless data pipeline for EMR infrastructure creation, deploying Step Functions, and monitoring them using CloudWatch rules
- Schedule and execute Spark applications through Step Functions, implemented on AWS EMR/EC2 instances
- Create source-to-target mappings to drive database design, table structures, and definitions
- Monitor and re-engineer EMR cluster sizes/EC2 instances based on Spark job configuration requirements
- Perform performance tuning for Spark jobs (including GC tuning)
- Involved in designing, building, and maintaining scalable data pipelines and working with the ETL framework
- Develop unit test cases using JUnit/ScalaTest (FunSuite) and resolve vulnerabilities flagged by Whitesource/Eratocode/Qualys scans
- Developed Spark jobs on Databricks to perform tasks like data cleansing, data validation, and standardization, and then applied transformations as per the use cases.
- Familiar with data architecture including data ingestion pipeline design, data modelling and data mining.
- Develop test data by producing data on Kafka topics using JMeter for performance testing
- Work with analytics tools such as Splunk, Ganglia, and DataStax OpsCenter to monitor Spark applications and database memory utilization
- Lead production support for the team, including EMR AMI rehydration, deployments, incident response, and all other production change orders
- Worked in an Agile environment, using GitHub for version control and TeamCity for continuous builds
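The following is an illustrative sketch, not the production code: a minimal PySpark Structured Streaming job that consumes Avro records from a Kafka topic and writes each micro-batch to Cassandra via foreachBatch, along the lines described above. The broker address, topic, Avro schema, keyspace/table names, and checkpoint path are hypothetical placeholders, and the spark-avro and spark-cassandra-connector packages are assumed to be on the classpath.

```python
# Illustrative only: Kafka (Avro) -> Spark Structured Streaming -> Cassandra.
from pyspark.sql import SparkSession
from pyspark.sql.avro.functions import from_avro

spark = SparkSession.builder.appName("kafka-avro-to-cassandra").getOrCreate()

# Avro schema of the Kafka message value (normally served by a schema registry)
value_schema = """
{"type": "record", "name": "Txn",
 "fields": [{"name": "id", "type": "string"},
            {"name": "amount", "type": "double"}]}
"""

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "transactions")
       .load())

# Decode the binary Kafka value into columns
parsed = raw.select(from_avro("value", value_schema).alias("txn")).select("txn.*")

def write_to_cassandra(batch_df, batch_id):
    # spark-cassandra-connector writes each micro-batch to the target table
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="payments", table="transactions")
     .mode("append")
     .save())

query = (parsed.writeStream
         .foreachBatch(write_to_cassandra)
         .option("checkpointLocation", "s3://example-bucket/checkpoints/txn")
         .start())
query.awaitTermination()
```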
Environment: Spark (Scala/Java/Python), Hadoop, SQL, Snowflake Warehouse, NoSQL, Cassandra DB, AWS DynamoDB, AWS (EMR, EC2, S3, CloudWatch, EventBridge, Lambda, SNS), CI/CD pipeline (EMR infrastructure, Lambda, S3 trigger & CloudWatch/EventBridge event creation)
Confidential, Boston, MA
Data Engineer
Responsibilities:
- Involved in Agile development methodology as an active member of Scrum meetings
- Involved in data profiling and merging data from multiple data sources
- Developed a Python-based REST API to track performance using Flask, SQLAlchemy, and PostgreSQL
- Involved in Big Data requirement analysis and in designing and developing solutions for ETL and Business Intelligence platforms
- Designed 3NF data models for ODS and OLTP systems, and dimensional data models using Star and Snowflake schemas
- Orchestrate, manage, and schedule data workflows by creating Airflow DAGs in Python (see the sketch after this list)
- Used the Spark SQL Scala and Python interfaces, which automatically convert case-class RDDs to schema RDDs
- Worked in the Snowflake environment to remove redundancy and loaded real-time data from various data sources into HDFS using Kafka
- Used Spark Streaming to collect data from Kafka in near-real-time, perform the necessary transformations and aggregations to build the common learner data model, and store the data in a NoSQL store (HBase)
- Utilized SnowSQL (CLI Client) to connect to the data warehouse and performed loading and unloading of the data.
- Automated data ingestion or data loading with SnowSQL using Python.
- Executed SQL queries and performed all DDL and DML operations & developed batch scripts using SnowSQL.
- Designed and implemented a fully operational, production-grade, large-scale data solution on the Snowflake Data Warehouse
- Performed end-to-end architecture and implementation assessments of various AWS Cloud services, including Amazon EMR, Redshift, Glue, IAM, RDS, Lambda, CloudWatch, and Athena
- Deploy new hardware and software environments required for PostgreSQL/Hadoop and expand existing environments
- Work with structured/semi-structured data ingestion and processing on the AWS Cloud using S3 and Python; migrate on-premises big data workloads to AWS Databricks
- Worked on AWS Data Pipeline to configure data loads from S3 into Redshift
- Involved in migrating data from the existing RDBMS to Hadoop using Sqoop for processing, and evaluated the performance of various algorithms/models/strategies on real-world data sets
- Created Hive tables for loading and analyzing data and developed Hive queries to process data and generate data cubes for visualization
- Build Jenkins jobs to create the platform for the Hadoop and Spark environment on EMR
- Extracted data from HDFS using Hive and Presto, performed data analysis using Spark with Scala and PySpark, and performed feature selection and created nonparametric models in Spark
- Handled importing data from various data sources, performed transformations using Hive and loaded data into HDFS
- Captured otherwise unused unstructured data and stored it in HDFS and HBase; scraped data using Beautiful Soup and saved it into MongoDB (JSON format)
- Worked on AWS S3 buckets and secure intra-cluster file transfer between PNDA and S3
- Designed and implemented Data Marts, coordinated with DBAs, and generated and used DDL and DML
- Provide data architecture support to enterprise data management efforts, such as development of the enterprise data model and master and reference data, as well as project support such as development of physical data models, data warehouses, and data marts
- Used PySpark and Pandas to calculate the moving average and RSI score of stocks and loaded the results into the data warehouse
- Develop, prototype, and test predictive algorithms; filter and clean data, and review reports and performance indicators
- Conducted data blending and data preparation using Alteryx and SQL for Tableau consumption, and published data sources to Tableau Server
- Developed a NiFi workflow to pick up data from the Data Lake as well as from servers and send it to the Kafka broker
- Used Jenkins for CI/CD, Docker as a container tool and Git as a version control tool.
- Create, modify, and execute DDL on AWS Redshift and Snowflake tables to load data
- Worked with data governance, data quality, data lineage, and data architecture teams to design various models and processes
- Independently coded new programs and designed tables to load and test programs effectively for given POCs using Big Data/Hadoop
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS
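A minimal sketch of the kind of Airflow DAG referenced above, assuming Airflow 2.x: two placeholder tasks chained on a daily schedule. The DAG id, task callables, and schedule are illustrative assumptions, not the actual production workflow.

```python
# Illustrative only: a two-task daily Airflow DAG (Airflow 2.x style).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_from_source(**context):
    # Pull the day's records from the upstream system (placeholder logic)
    pass

def load_to_warehouse(**context):
    # Write the cleaned records into the warehouse (placeholder logic)
    pass

with DAG(
    dag_id="daily_ingest",            # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_from_source)
    load = PythonOperator(task_id="load", python_callable=load_to_warehouse)
    extract >> load                   # run extract, then load
```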
Environment: Python, R/RStudio, SQL, Oracle, Cassandra, MongoDB, AWS, Snowflake, Azure Databricks, Hadoop, Hive, MapReduce, Scala, Spark, Kafka, MLlib, regression, Tableau
Confidential, Boston MA
Data Engineer
Responsibilities:
- Gathered, analyzed, and translated business requirements to technical requirements, communicated with other departments to collect client business requirements and access available data
- Acquired, cleaned, and structured data from multiple sources and maintained databases/data systems; identified, analyzed, and interpreted trends and patterns in complex data sets
- Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB
- Developed and implemented data collection systems and other strategies that optimize statistical efficiency and data quality
- Create and statistically analyze large data sets of internal and external data
- Worked closely with the marketing team to deliver actionable insights from huge volumes of data coming from different marketing campaigns and customer interaction metrics such as web portal usage, email campaign responses, public site interaction, and other customer-specific parameters
- Performed incremental and full loads to transfer data from OLTP to a Snowflake-schema Data Warehouse using different data flow and control flow tasks, and maintained existing jobs
- Design and implement secure data pipelines into a Snowflake data warehouse from on-premises and cloud data sources
- Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on Amazon SageMaker.
- Created best practices and standards for data pipelining and integration with Snowflake data warehouses
- Developed a NiFi workflow to pick up data from the Data Lake as well as from servers and send it to the Kafka broker
- Responsible for data cleaning, feature scaling, and feature engineering using NumPy and Pandas in Python
- Conducted exploratory data analysis using Python Matplotlib and Seaborn to identify underlying patterns and correlations between features
- Designed and built Big Data ingestion and query platforms with Spark, Hadoop, Hive, Oozie, Sqoop, Presto, Amazon EMR, Amazon S3, EC2, AWS CloudFormation, RDS, Glue, IAM, and Control-M
- Used the Spark SQL Scala and Python interfaces, which automatically convert case-class RDDs to schema RDDs
- Worked with NoSQL databases such as HBase, creating tables to load large sets of semi-structured data coming from source systems
- Configured Spark Streaming to receive ongoing information from Kafka and store the stream data in HDFS
- Used information value, principal component analysis, and Chi-square feature selection techniques
- Used Python and R scripting to implement machine learning algorithms for data prediction and forecasting
- Designed and developed the core data pipeline code, involving work in Python and built on Kafka and Storm.
- Developed Data Migration and Cleansing rules for Integration Architecture (OLTP, ODS, DW)
- Created Lambda functions with Boto3 to deregister unused AMIs across all application regions to reduce EC2 costs (see the sketch after this list)
- Designed tables and columns in Redshift for data distribution across data nodes in the cluster, keeping columnar database design considerations in mind
- Tested the ETL process both before and after data validation
- Experience in developing packages in RStudio with a Shiny interface
- Experimented with multiple classification algorithms, such as Logistic Regression, Support Vector Machine (SVM), Random Forest, AdaBoost, and Gradient Boosting, using Python scikit-learn and evaluated performance on customer discount optimization across millions of customers
- Built models using Python and PySpark to predict the probability of attendance for various campaigns and events
- Implemented classification algorithms such as Logistic Regression, K-Nearest Neighbors (KNN), and Random Forests to predict customer churn and customer interface
- Used NiFi to automate data flow between disparate systems
- Implemented a CI/CD pipeline using Jenkins and Airflow for Docker containers
- Used GIT for version control.
- Performed data visualization and designed dashboards with Tableau, generating complex reports including charts, summaries, and graphs to present findings to the team and stakeholders
- Followed Agile methodology, including test-driven development and pair-programming concepts
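A hedged sketch of the AMI-cleanup Lambda described above: a Boto3 handler that deregisters account-owned AMIs not referenced by any instance in the listed regions. The region list and the "in use" check are simplifying assumptions; pagination and other AMI references (e.g., Launch Templates) are omitted for brevity.

```python
# Illustrative only: deregister account-owned AMIs not used by any instance.
import boto3

REGIONS = ["us-east-1", "us-west-2"]  # assumed application regions

def handler(event, context):
    for region in REGIONS:
        ec2 = boto3.client("ec2", region_name=region)

        # AMIs owned by this account in the region
        images = ec2.describe_images(Owners=["self"])["Images"]

        # Image IDs referenced by existing instances (single page for brevity)
        in_use = {
            instance["ImageId"]
            for reservation in ec2.describe_instances()["Reservations"]
            for instance in reservation["Instances"]
        }

        for image in images:
            if image["ImageId"] not in in_use:
                ec2.deregister_image(ImageId=image["ImageId"])
```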
Environment: OLTP Data Warehouse, Hadoop, Hive, HBase, Spark, Snowflake, R/RStudio, Python (Pandas, NumPy, scikit-learn, TensorFlow, SciPy, Seaborn, Matplotlib), SQL, Machine Learning, ggplot, lattice, MASS, mice, and logit
Confidential - McLean, VA
Data Engineer
Responsibilities:
- Worked with Hadoop eco system covering HDFS, HBase, YARN and MapReduce
- Worked with the Oozie Workflow Engine to run workflow jobs with actions that execute Hadoop MapReduce, Hive, and Spark jobs
- Performed data mapping and data design (data modeling) to integrate data across multiple databases into the EDW
- Responsible for design and development of advanced R/Python programs to prepare, transform, and harmonize data sets in preparation for modeling
- Hands-on experience with Hadoop/Big Data technologies for storage, querying, processing, and analysis of data
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment for big data resources; used clustering techniques such as K-Means to identify outliers and classify unlabeled data
- Transformed raw data into actionable insights by applying statistical techniques, data mining, data cleaning, and data quality/integrity checks using Python (scikit-learn, NumPy, Pandas, and Matplotlib) and SQL
- Calculated errors for various machine learning algorithms such as Linear Regression, Ridge Regression, Lasso Regression, Elastic Net Regression, KNN, Decision Tree Regressor, SVM, Bagged Decision Trees, Random Forest, AdaBoost, and XGBoost, and chose the best model based on MAE (see the sketch after this list)
- Experimented with ensemble methods, using different bagging and boosting techniques to increase the accuracy of the training model
- Worked with ETL processes to transfer/migrate data from relational databases and flat files into common staging tables in various formats and on to meaningful data in Oracle and MS SQL
- Developed Talend Big Data jobs to load heavy volumes of data into S3
- Created a task scheduling application to run in an EC2 environment on multiple servers
- Identified target groups by conducting segmentation analysis using clustering techniques such as K-Means
- Conducted model optimization and comparison using stepwise selection based on AIC values
- Used cross-validation to test models on different batches of data, optimizing models and preventing overfitting
- Ingested data from relational database systems into the AWS S3 cloud environment using Sqoop
- Worked and collaborated with various business teams (operations, commercial, innovation, HR, logistics, safety, environmental, accounting) to analyze and understand changes in key financial metrics and provide ad-hoc analysis that can be leveraged to build long term points of view where value can be captured
- Developed a fully automated continuous integration system using Git, Jenkins, and custom tools developed in Python and Bash
- Involved in the development of Agile, iterative, and proven data modeling patterns that provide flexibility.
- Involved in loading data from REST endpoints to Kafka producers and transferring the data to Kafka brokers
- Supported data quality management by implementing proper data quality checks in data pipelines.
- Explored and analyzed customer-specific features using Matplotlib and Seaborn in Python and dashboards in Tableau
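A minimal sketch of the model-comparison step described above: several scikit-learn regressors scored by cross-validated MAE. The synthetic dataset and the model list are placeholders for the actual features and candidate algorithms.

```python
# Illustrative only: compare regressors by cross-validated MAE with scikit-learn.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real feature matrix and target
X, y = make_regression(n_samples=500, n_features=10, noise=0.3, random_state=42)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.01),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=42),
}

for name, model in models.items():
    # neg_mean_absolute_error reports "higher is better", so flip the sign
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    print(f"{name}: MAE = {-scores.mean():.3f} (+/- {scores.std():.3f})")
```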
Environment: Machine Learning, R Language, AWS, Hadoop, Big Data, Python, DB2, MongoDB, Web Services
Confidential - Bellevue, WA
Data Engineer
Responsibilities:
- Analyzed and translated Functional Specifications and Change Requests into Technical Specifications
- Designed and implemented Big Data analytics architecture, transferring data from the Oracle data warehouse, external APIs, and flat files to Hadoop using Hortonworks
- Designed and developed Use Cases, Activity Diagrams, Swim Lane Diagrams, and process flows using Unified Modeling Language (UML)
- Ran SQL queries for data validation and performed quality analysis on data extracts to ensure data quality and integrity across various database systems
- Involved with Data Profiling activities for new sources before creating new subject areas in warehouse
- Created DDL scripts for implementing Data Modeling changes
- Responsible for various data mapping activities from source systems
- Performed extensive data cleansing, data manipulation, data transformation, and data auditing
- Involved in SQL development, unit testing, and performance tuning, ensuring testing issues were resolved based on defect reports
- Involved in Data mapping specifications to create and execute detailed system test plans. Data mapping specifies what data will be extracted from an internal data warehouse, transformed and sent to an external entity
- Developed data pipelines using Big Data/Hadoop tools such as Flume, Sqoop, HBase, Spark, Pig, and MapReduce to ingest customer behavioral data
- Parsed complex files using Informatica data transformations (Normalizer, Lookup, Source Qualifier, Expression, Aggregator, Sorter, Rank, and Joiner) and loaded them into databases
- Worked on implementing a log producer in Scala that watches application logs, transforms incremental log entries, and sends them to a Kafka- and ZooKeeper-based log collection platform (see the sketch after this list)
- Followed Agile methodologies across various projects, setting up two-week sprints and daily stand-up meetings
- Implemented a Python-based distributed random forest via Python streaming
- Performed data discovery and built a stream that automatically retrieves data from a multitude of sources (SQL databases, external data such as social network data and user reviews) to generate KPIs using Tableau
- Installed Oozie workflow engine to run multiple Hive and Pig Jobs.
- Used GIT for version control.
- Wrote SQL queries for visualization and reporting systems; good experience with the visualization tool Tableau
- Wrote ETL scripts in SQL for extracting and validating data
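A hedged Python analogue of the log producer described above (the original was implemented in Scala): tail an application log file and publish new lines to a Kafka topic with kafka-python. The file path, topic name, and broker address are hypothetical.

```python
# Illustrative only: tail an application log and publish new lines to Kafka.
import time

from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(bootstrap_servers="broker:9092")  # placeholder broker

def tail_and_publish(path: str, topic: str) -> None:
    with open(path, "r") as log:
        log.seek(0, 2)  # start at end of file: only new (incremental) lines
        while True:
            line = log.readline()
            if not line:
                time.sleep(0.5)
                continue
            producer.send(topic, value=line.encode("utf-8"))

if __name__ == "__main__":
    tail_and_publish("/var/log/app/application.log", "app-logs")  # placeholders
```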
Environment: SQL Server, ETL, SSIS, SSRS, Tableau, Excel