
Sr. Data Engineer Resume

Seattle, WA

SUMMARY:

  • Over 7 years of experience as a Data Engineer and Data Analyst, including designing, developing, and implementing data models for enterprise-level applications and systems. Proficient in data analysis, gathering business requirements, and handling requirements management.
  • Good experience importing and exporting data between HDFS/Hive and relational database systems such as MySQL using Sqoop (a minimal ingestion sketch appears after this list).
  • Developed data pipelines on premises and on cloud platforms such as AWS.
  • Good knowledge of NoSQL databases including HBase, MongoDB, and MapR-DB.
  • Experience in Power BI.
  • Experience in designing Star schema, Snowflake schema, and ODS architectures.
  • Expertise in data migration, data profiling, data cleansing, transformation, integration, data import, and data export using multiple ETL tools such as Informatica PowerCenter. Performed data collection, preprocessing, feature engineering, data visualization, and analysis on large volumes of unstructured data using Python and R (Scikit-learn, Matplotlib, Pandas, NumPy, Seaborn, ggplot2, dplyr).
  • Exposure to different types of testing like Automation testing, System & Integration testing.
  • Experience with Client-Server application development using Oracle, SQL PLUS, SQL Developer, TOAD, and SQL LOADER.
  • Web Development using Azure, Java, C#, .NET.
  • Experienced in building automated regression scripts in Python for validation of ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB.
  • Experience in using Python packages such as NumPy, Pandas, SciPy, and Scikit-learn, and IDEs including PyCharm, Spyder, Anaconda, and Jupyter.
  • Hands-on experience with AWS services including S3, EC2, EMR, SNS, SQS, Lambda, Redshift, Data Pipeline, Athena, AWS Glue, S3 Glacier, CloudWatch, CloudFormation, IAM, AWS Single Sign-On, Key Management Service, AWS Transfer for SFTP, VPC, SES, CodeCommit, and CodeBuild.
  • Experienced in developing Python code to retrieve and manipulate data from AWS Redshift, Oracle 11g/12c, T-SQL, MongoDB, MS SQL Server, Excel and Flat files.
  • Strong experience with architecting highly performant databases using PostgreSQL, PostGIS, MySQL, and Cassandra.
  • Extensive experience in using ER modeling tools such as Erwin and ER/Studio.
  • Experience in the AWS Cloud platform and its features, including EC2, AMI, EBS, CloudWatch, AWS Config, Auto Scaling, IAM user management, and AWS S3.
  • Excellent working experience in Scrum / Agile framework and Waterfall project execution methodologies.
  • Strong Experience in working with Databases like Teradata and proficiency in writing complex SQL, PL/SQL for creating tables, views, indexes, stored procedures and functions.
  • Experience in importing and exporting Terabytes of data between HDFS and Relational Database Systems using Sqoop.
  • Experienced in configuring and administering Hadoop clusters using major distributions such as Apache Hadoop and Cloudera, along with ecosystem components such as Apache Spark and Kafka.
  • Hands-on experience in normalization (1NF, 2NF, 3NF, and BCNF) and denormalization techniques for effective and optimal performance in OLTP and OLAP environments.
  • Experience in transferring data from AWS S3 to AWS Redshift using Informatica.
  • Extensive experience in performing ETL on structured, semi-structured data using Pig Latin Scripts.
  • Expertise in moving structured schema data between Pig and Hive using HCatalog.
  • Solid knowledge of Data Marts, Operational Data Store (ODS), OLAP, and Dimensional Data Modeling with the Ralph Kimball methodology (Star Schema and Snowflake Modeling for fact and dimension tables) using Analysis Services.
  • Experience in migrating the data using Sqoop from HDFS and Hive to Relational Database System and vice-versa according to client's requirement.
  • Proficient knowledge and hands on experience in writing shell scripts in Linux.
  • Experience on developing MapReduce jobs for data cleaning and data manipulation as required for the business.
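
Below is a minimal, illustrative PySpark sketch of the RDBMS-to-Hive ingestion pattern referenced in the Sqoop bullets above; Sqoop itself is a command-line tool, so this shows the equivalent Spark JDBC approach. The connection URL, credentials, table names, and partition column are hypothetical.

    # Hypothetical sketch: load a MySQL table into a partitioned Hive table via Spark JDBC.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("mysql_to_hive_ingest")
        .enableHiveSupport()  # needed to write managed Hive tables
        .getOrCreate()
    )

    # Read the source table over JDBC (a Sqoop import would be the CLI equivalent).
    orders = (
        spark.read.format("jdbc")
        .option("url", "jdbc:mysql://db-host:3306/sales")  # assumed endpoint
        .option("dbtable", "orders")
        .option("user", "etl_user")
        .option("password", "****")
        .option("driver", "com.mysql.cj.jdbc.Driver")
        .load()
    )

    # Land the data in a Hive table partitioned by the order date.
    orders.write.mode("overwrite").partitionBy("order_date").saveAsTable("staging.orders")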

PROFESSIONAL EXPERIENCE:

Sr. Data Engineer

Confidential, Seattle, WA

Responsibilities:

  • Implemented a log producer in Scala that watches for application logs, transforms incremental logs, and sends them to a Kafka- and ZooKeeper-based log collection platform.
  • Ingested real-time data with Azure IoT Hub.
  • Responsible for loading data from different sources (Oracle, SQL Server, and DB2) into HDFS using Sqoop and loading it into partitioned Hive tables.
  • Used Python to retrieve and manipulate data from AWS Redshift, Oracle 11g/12c, MongoDB, MS SQL Server (T-SQL), Excel, and flat files.
  • Evaluated data import-export capabilities, data analysis performance of Apache Hadoop framework.
  • Developed Hive UDFs to bring customer email IDs into a structured format.
  • Excellent programming skills with experience in Java, C, SQL, and Python.
  • Ingested data from source RDBMS systems (SQL Server, Oracle) and flat files into the Hadoop data lake using the DataStage ETL and Sqoop frameworks.
  • Developed automated regression scripts in Python for validation of ETL processes between multiple databases such as AWS Redshift, Oracle, MongoDB, and SQL Server (T-SQL).
  • Developed Bash scripts to pull T-log files from the FTP server and process them for loading into Hive tables.
  • Insert-overwrote Hive data with HBase data daily to keep the data fresh, and used Sqoop to load data from DB2 into the HBase environment.
  • Designed custom transformation processes via Azure Data Factory and automation pipelines.
  • Working experience with Azure Data Warehouse, Blob Storage, and integration between on-premises systems and the Azure cloud.
  • Worked on Azure (IaaS) and was involved in the planning, design, and deployment of cloud solutions.
  • Developed predictive models on large scale datasets to address various business problems through leveraging advanced statistical modeling, machine learning and deep learning.
  • Implemented public segmentation with unsupervised machine learning by applying the K-means algorithm in PySpark after data munging (see the sketch after this list).
  • Experience in machine learning for NLP text classification using Python.
  • Worked on different machine learning models such as Logistic Regression, the Multilayer Perceptron classifier, and K-means clustering.
  • Also saved data to a Kafka topic in JSON format.
  • Implemented dynamic data masking through the Azure portal.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala, with good experience using Spark Shell and Spark Streaming.
  • Imported millions of structured records from relational databases using Sqoop, processed them with Spark, and stored the data in HDFS in CSV format.
  • Designed and implemented machine learning solutions including Supervised and Unsupervised models.
  • Implemented Feature Engineering, Feature extraction and Feature selection for efficient models.
  • Worked with Support Vector Machines, Random Forests, and Gradient Boosting to obtain better results.
  • Involved in all stages of development of machine-learning-based systems: visualization, data labeling, algorithm selection, training, model development, regression, debugging, and updates.
  • Used gradient descent optimizers such as Adam, RMSprop, and Stochastic Gradient Descent (SGD).
  • Developed UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation queries, and exported results back to OLTP systems through Sqoop.
  • Implemented real-time analytics with Azure HDInsight (Storm).
  • Developed an ETL workflow that pushes web server logs to a Confidential S3 bucket.
  • Knowledge of Java IDEs such as Eclipse and IntelliJ.
  • Automated and monitored Jenkins CI/CD build and deployment jobs to reduce code-patch breakages from the data team.
  • Developed applications/GUIs in C# for users to interface with SQL Server.
  • Demonstrated a full understanding of the fact/dimension data warehouse design model, including star and snowflake design methods.
  • Extensive experience with Tableau and Power BI in the design and development of various dashboards and visualizations.
  • Designed appropriate partitioning/bucketing schemas to allow faster data access during analysis using Hive.
  • Developed Oozie workflows for daily incremental loads that pull data from Teradata and import it into Hive tables.
  • Created/updated various databases and scripts to support teams' daily activities in Databricks.
  • Worked with star schema and snowflake schema dimensions and SSRS to support large reporting needs.
  • Developed Python scripts to extract Health Safe member data from MySQL and email reports in Excel format to users.
  • Imported, exported, and modeled large data sets, performed statistical analysis, and visualized results using SQL and Python libraries (NumPy, pandas, matplotlib, seaborn).
  • Interested in building front-end technologies such as JavaScript, CSS, and HTML.
  • Led evergreening initiatives ensuring upgrades for the EDW and the Java web application, including provisioning of WAS web servers and DB instances, upgrades to the applications' Java code, and switching the Java application from DB2 to Teradata.
  • Built Jenkins jobs, supported code deployment into production, and fixed post-production defects so the MapReduce code worked as expected.
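
A minimal sketch of the PySpark K-means segmentation mentioned above, assuming a cleaned feature table and illustrative column names and cluster count; it is not the exact production pipeline.

    # Hypothetical sketch: K-means segmentation with pyspark.ml on a prepared feature table.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("segmentation").enableHiveSupport().getOrCreate()

    # Assume an upstream data-munging step produced this feature table.
    customers = spark.table("analytics.customer_features")

    # Assemble numeric features into the single vector column K-means expects.
    assembler = VectorAssembler(
        inputCols=["recency_days", "order_frequency", "monetary_value"],
        outputCol="features",
    )
    features = assembler.transform(customers)

    # Fit K-means with an illustrative k; in practice k would be tuned (e.g. elbow method).
    kmeans = KMeans(k=5, seed=42, featuresCol="features", predictionCol="segment")
    model = kmeans.fit(features)

    model.transform(features).select("customer_id", "segment").show(10)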

Environment: Hadoop, HDFS, CI/CD, HTML, Power BI, Spark, Storm, Kafka, Snowflake, MapReduce, Databricks, Hive, Pig, Sqoop, Oozie, DB2, Scala, Python, Pandas, Splunk, UNIX shell scripting, SQL, MySQL, Azure, EMR, JavaScript, ETL

Data Engineer

Confidential, Columbus, IN

Responsibilities:

  • Extracted files from DB2 through Kettle, placed them in HDFS, and processed them.
  • Analyzed large data sets by running Hive queries and Pig scripts.
  • Developed a real-time Spark job API that consumes from a Kafka topic at the enriched level.
  • Contributed to the creation of automated procedures to deploy Hadoop-based clusters in AWS using Bamboo and Chef.
  • Created an EMR log parsing tool to triage container, application, and daemon logs.
  • Created an R data pipeline that automatically pulls data from separate IRS data sets and merges them into a single source of truth.
  • Generated PL/SQL scripts for data manipulation, validation and materialized views for remote instances.
  • Experience with Databricks, Confidential web services, One Lake, and Python scripting.
  • Designed AWS architecture and cloud migration using AWS EMR, DynamoDB, and Redshift, with event processing using Lambda functions.
  • Created and modified several database objects such as tables, views, indexes, constraints, stored procedures, packages, functions, and triggers using SQL and PL/SQL. Created and managed a complete data pipeline using Kafka with data in JSON format.
  • Extracted real-time feeds using Kafka and FF, converted them to RDDs, and processed the data as DataFrames before saving.
  • Developed batch processing solutions with Azure SQL Data Warehouse.
  • Created large datasets by combining individual datasets using various inner and outer joins in SAS/SQL and dataset sorting and merging techniques using SAS/Base.
  • Expertise in Power BI, Power BI Pro, Power BI Mobile
  • Used SQL Server Job System to run weekly automated backups on specific SQL Databases
  • Worked with the pandas API to store data in tabular and time-series formats, making timestamp data easy to retrieve and manipulate.
  • Extensive experience with Tableau and Power BI in the design and development of various dashboards and visualizations.
  • Wrote MapReduce programs and Hive UDFs in Java.
  • Designed/maintained the data model supporting the HealtheNotes project in Erwin, and created DDL to deploy tables to the DEV, TEST, STAGE, and PROD databases in MySQL.
  • Created R functions that require table arguments to automatically generate SQL queries, significantly reducing coding time.
  • Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
  • Extensively worked on Shell scripts for running SAS programs in batch mode on UNIX.
  • Used Jenkins for the CI/CD process and exported data to Snowflake for Tableau dashboards.
  • Wrote Python scripts to parse XML documents and load the data into a database, used Python to extract weekly information from XML files, and developed Python scripts to clean the raw data.
  • Fixed plan regression issues after upgrading from SQL Server 2014 to SQL Server 2017.
  • Designed and developed Hadoop MapReduce programs and algorithms for analysis of cloud-scale classified data stored in Cassandra.
  • Pulled data into Power BI from various sources such as SQL Server, SAP BW, Oracle, and SQL Azure.
  • Developed the backend for the NIH Data Commons Computational Genomic Platform (CGP) (Python 3, Docker, AWS Lambda, AWS EC2, AWS S3, Elasticsearch)
  • Used Apache Spark to execute Scala source code for JSON data processing.
  • Worked on building input adapters for data dumps from FTP servers using Apache Spark.
  • Built a database in SQL Server from MySQL and designed Fraud Model and created a Tableau Dashboard.
  • Used the AWS CLI with IAM roles to load data into the Redshift cluster.
  • Developed ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.
  • Designed and developed an enterprise-scale cloud alert mechanism using Azure Databricks, the Spark/Spark UI data processing framework (Python/Scala), and Azure Data Factory. Built data pipelines to transform, aggregate, and process data using Azure Databricks, Azure ADLS, Blob, Azure Delta, and Airflow.
  • Coded ads with HTML5 to animate and function across multiple ad servers.
  • Parsed the online-orders JSON files to CSV using Python and used them as a source in the ETL process (see the sketch after this list).
  • Optimized the performance of queries with modification in T-SQL queries, established joins and created clustered indexes
  • Used Hive and Sqoop utilities and Oozie workflows for data extraction and data loading.
  • Used DataStage ETL to load data from the upstream source system into our landing schema.
  • Responsible for the design, implementation and architecture of very large-scale data intelligence solutions around Snowflake Data Warehouse.
  • Styled the reporting suite using a combination of R and CSS/HTML.
  • Migrated some of the existing pipelines to Azure Databricks using PySpark notebooks for the analytics team.
  • Developed dashboards in Tableau Desktop and published them onto Tableau Server, allowing end users to understand the data on the fly with quick filters for on-demand information.
  • Worked on a POC involving scripting with PySpark in Azure Databricks.
  • Worked on QA of the data and on adding data sources, snapshots, and caching to the report.
  • Performed ad-hoc ETL changes as per the Business user's request.
  • Implemented and created jobs to perform CI/CD using tools such as Bitbucket, Stash, and Bamboo.
  • Involved in troubleshooting at database levels, error handling and performance tuning of queries and procedures.
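
A small Python sketch of the JSON-to-CSV parsing step described above (see the online-orders bullet); the directory layout and field names are assumptions.

    # Hypothetical sketch: flatten online-order JSON files into a CSV staging file.
    import csv
    import json
    from pathlib import Path

    def json_orders_to_csv(input_dir: str, output_file: str) -> None:
        fields = ["order_id", "customer_id", "order_total", "order_date"]
        rows = []
        for path in Path(input_dir).glob("*.json"):
            with path.open() as fh:
                order = json.load(fh)
            # Map the (assumed) JSON field names onto flat CSV columns.
            rows.append({
                "order_id": order.get("orderId"),
                "customer_id": order.get("customerId"),
                "order_total": order.get("total"),
                "order_date": order.get("orderDate"),
            })

        with open(output_file, "w", newline="") as out:
            writer = csv.DictWriter(out, fieldnames=fields)
            writer.writeheader()
            writer.writerows(rows)

    if __name__ == "__main__":
        json_orders_to_csv("incoming/orders", "staging/orders.csv")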

Environment: SAS, SQL, MySQL, Teradata R15, Netezza, PL/SQL, UNIX, XML, ETL, HTML, Python, AWS, SSRS, T-SQL, Hive 2.3, Sqoop, Apache Spark, Apache Hadoop, Kafka, Lambda, AWS EC2, AWS S3, Elasticsearch, Power BI, R, Bamboo, Snowflake, Databricks, Pandas, EMR

Data Analyst/ Software Engineer

Confidential, Schenectady, NY

Responsibilities:

  • Researched, evaluated, architected, and deployed new tools, frameworks, and patterns to build sustainable big data platforms.
  • Responsible for the data architecture design delivery, data model development, review, approval and Data warehouse implementation.
  • Designed and developed architecture for data services ecosystem spanning Relational, NoSQL, and Big Data technologies.
  • Enhanced existing automated regression scripts in Python for validation of ETL processes between multiple databases such as Oracle, SQL Server, Spark, and MongoDB.
  • Developed Sqoop jobs to import data incrementally from MySQL to HDFS
  • Involved in OLAP modeling based on dimensions and facts for efficient data loads, using multidimensional models such as Star and Snowflake schemas across reporting levels.
  • Developed data analysis tools using SQL and Python code.
  • Developed ETL batch automation using shell scripting for QA functional testing.
  • Worked with Hadoop eco system covering HDFS, HBase, YARN and MapReduce.
  • Performed data mapping and data design (data modeling) to integrate data across multiple databases into the EDW.
  • Designed both 3NF Data models and dimensional Data models using Star and Snowflake schemas.
  • Worked on Data modeling, Advanced SQL with Columnar Databases using AWS.
  • Developed visualizations in Power BI using Python (Matplotlib, NumPy, and Pandas) for assessing the ongoing possibility of predicting the overall manufacturing outcome from a reduced early sample of outcomes (see the sketch after this list).
  • Developed an application using Java and MySQL to process database information
  • Cleansed, extracted, and analyzed business data on a daily basis and prepared ad-hoc analytical reports using Excel and T-SQL.
  • Developed a Spark job to consume data from a Kafka topic at the enriched level and save it into Oracle in real time.
  • Completed proofs of concept using Power BI tools (Power Query/Power View) for Excel, and with Airflow.
  • Performed ad-hoc ETL changes per business users' requests. Primary tools and methods included Python, Jenkins CI/CD, Bash, JIRA, LISP, PyTorch, test-driven development, and Scrum.
  • Extensive experience working with ETL of large datasets using PySpark in Spark on HDFS.
  • Developed and maintained Java/J2EE code required for the web application.
  • Coordinated and guided support teams in their daily activities in Databricks.
  • Handled performance requirements for databases in OLTP and OLAP models.
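
A minimal pandas/Matplotlib sketch of the kind of outcome-prediction visual mentioned above. In a Power BI Python visual the selected fields arrive as a dataframe named dataset; the stand-in data and column names here are purely illustrative.

    # Hypothetical sketch: plot predicted overall outcome against early sample size.
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    # Stand-in for the `dataset` dataframe Power BI passes to a Python visual.
    dataset = pd.DataFrame({
        "sample_size": np.arange(10, 210, 10),
        "predicted_yield": np.clip(0.6 + 0.002 * np.arange(10, 210, 10), 0, 1),
    })

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(dataset["sample_size"], dataset["predicted_yield"], marker="o")
    ax.set_xlabel("Early sample size")
    ax.set_ylabel("Predicted overall outcome")
    ax.set_title("Predicting manufacturing outcome from reduced early samples")
    plt.tight_layout()
    plt.show()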

Environment: MapReduce, JavaScript, Kafka, YARN, HBase, HDFS, CI/CD, ETL, Hadoop 3.0, Erwin 9.1, AWS, EMR, T-SQL, OLTP, OLAP, ODS, DW, Snowflake, EDW, Databricks, MySQL, Python, Pandas

Data Analyst

Confidential, Richmond, VA

Responsibilities:

  • Participated in requirement gathering, business logic analysis, design, implementation, and deployment phase of development life cycle (SDLC).
  • Provided post-release data validation and worked with the project team and internal/external stakeholders to improve existing database applications in Snowflake.
  • Worked with the pandas API to store data in tabular and time-series formats, making timestamp data easy to retrieve and manipulate (see the sketch after this list).
  • Developed a data pipeline using Kafka to store data in HDFS.
  • Good knowledge of Azure cloud services and Azure Storage for managing and configuring data.
  • Wrote User Defined functions (UDFs) for special functionality for Apache Spark.
  • Designed and built use case and sequence diagrams for both management systems using Visual Studio.
  • Designed the Entity Relationship Diagrams (ERDs) of these two management systems in Microsoft SQL Server.
  • Defined the Data Contract in the service to describe the format of the data and how it should be serialized/deserialized.
  • Designed and developed front-end pages based on ASP.NET and MVC.
  • Utilized the Database First approach to generate all entity models from the database and used LINQ to manipulate entities that reflect tables.
  • Designed and developed Power BI graphical and visualization solutions with business requirement documents and plans for creating interactive dashboards.
  • Used Python libraries such as PySpark, pytest, cx_Oracle, and PyMongo based on the modules and business requirements.
  • Worked with star schema and snowflake schema dimensions and SSRS to support large reporting needs.
  • Implemented state management in MVC using Session and Cookie with the help of ViewData and TempData.
  • Utilized Agile and Scrum methodology for team and project management.
  • Exposure to Databricks for generating PySpark scripts to automate reports.
  • Used JIRA to track work and collaborate with colleagues.
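
A short pandas sketch of the timestamp-indexed storage and retrieval pattern noted above; the file name, column names, and dates are assumptions.

    # Hypothetical sketch: index event data by timestamp for easy time-based slicing.
    import pandas as pd

    events = pd.read_csv("events.csv", parse_dates=["event_time"])
    events = events.set_index("event_time").sort_index()

    # Retrieve a single day and resample it to hourly event counts.
    one_day = events.loc["2019-06-01"]
    hourly_counts = one_day.resample("1H").size()

    print(hourly_counts.head())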

Environment: MS SQL Server 12, HTML5, CSS3, ASP.NET MVC, JavaScript, Kafka, TFS, Power BI, Apache Spark, Snowflake, SQL, MySQL, Visual Studio, Databricks, Python, Pandas, AWS, EMR
