Sr Data Engineer Resume

Branchburg, NJ

SUMMARY

  • 8+ years of hands-on IT experience in the analysis, design, development, testing, and implementation of ETL (Informatica) and Business Intelligence solutions using Data Warehouse/Data Mart design, ETL, SQL Server, MSBI, Power BI, and Azure data engineering.
  • Extensively worked on system analysis, design, development, testing, and implementation of projects; capable of working independently as well as being a proactive team member.
  • Excellent knowledge of the entire Software Development Life Cycle (SDLC) and of methodologies such as Agile, Scrum, and Waterfall, as well as project management methodologies.
  • Worked on data virtualization using Teiid and Spark, RDF graph data, Solr search, and fuzzy algorithms.
  • Experience in designing, developing, and deploying Business Intelligence solutions using SSIS, SSRS, SSAS, Power BI.
  • Hands-on expertise with AWS databases such as RDS (Aurora), Redshift, DynamoDB, and ElastiCache (Memcached & Redis).
  • Responsible for designing and building a data lake using Hadoop and its ecosystem components.
  • Built a data warehouse on SQL Server & Azure Database.
  • Working experience in creating real-time data streaming solutions using Apache Spark/Spark Streaming and Kafka, and in building Spark DataFrames using Python.
  • Experience with ETL workflow management tools like Apache Airflow, and significant experience writing Python scripts to implement workflows.
  • Hands-on experience working with Azure Data Lake Analytics to analyze structured and unstructured data from various sources.
  • Strong knowledge of Massively Parallel Processing (MPP) databases, in which data is partitioned across multiple servers or nodes, each with its own memory and processors to process data locally.
  • Experienced in working with various Azure services, such as Data Lake, to store and analyze data.
  • Experience in developing OLAP cubes using SQL Server Analysis Services (SSAS), defining data sources, data source views, dimensions, measures, hierarchies, attributes, calculations using Multidimensional Expressions (MDX), perspectives, and roles.
  • Extensive experience in dynamic SQL, records, arrays, exception handling, data sharing, data caching, and data pipelining, including complex processing using nested arrays and collections.
  • Built and published Power BI reports utilizing complex calculated fields, table calculations, filters, and parameters.
  • Designed and developed matrix and tabular reports with drill down, drill through using SSRS.
  • Involved in migration of legacy data by creating various SSIS packages.
  • Expert in Extract, Transform, Load (ETL) using various tools such as SQL Server Integration Services (SSIS), DTS, Bulk Insert, UNIX shell scripting, SQL, PL/SQL, SQL*Loader, and BCP.
  • Expertise in developing parameterized, chart, graph, linked, dashboard, scorecard, drill-down, drill-through, and cascading reports using SSRS, including reports on SSAS cubes using MDX.
  • Experience in handling heterogeneous data sources such as Oracle and Teradata databases and CSV and XML files using SSIS.
  • Hands-on experience with databases like MongoDB, MySQL, and Cassandra.
  • Strong knowledge and exposure to creating Jobs, Alerts, SQL Mail Agent and scheduled SSIS Packages.
  • Tested and deployed code across SDLC environments such as TEST, UAT, and Production.
  • Working knowledge of SQL Trace, TKPROF, Explain Plan, and SQL*Loader for performance tuning and database optimization.
  • Implemented Slowly changing dimensions and change data capture using Informatica.
  • Extensively developed Complex mappings using various transformations such as Unconnected / Connected lookups, Router, Filter, Expression, Aggregator, Joiner, Update Strategy, Union and more.
  • Extensive experience in writing UNIX shell scripts and automation of the ETL processes using UNIX shell scripting.
  • Experience with ETL tool Informatica in designing and developing complex Mappings, Mapplets, Transformations, Workflows, Worklets, and scheduling the Workflows and sessions.
  • Experience in using Informatica command-line utilities like pmcmd to execute workflows in non-Windows environments.
  • Experience in integrating databases like MongoDB and MySQL with web pages built with HTML, PHP, and CSS to update, insert, delete, and retrieve data with simple ad-hoc queries.
  • Developed heavy load Spark Batch processing on top of Hadoop for massive parallel computing.
  • Strong analytical, problem-solving, communication, learning and team skills.
  • Experience in using automation and scheduling tools like AutoSys and Control-M.
  • Developed distributed data processing jobs using the Spark RDD and DataFrame APIs.
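
The MPP point in the summary (data partitioned across nodes, with each node processing its own share locally) can be illustrated with a minimal Python sketch. This is a single-process simulation, not a real MPP database: the function names, the `sales` sample rows, and the choice of `crc32` as the distribution hash are all assumptions made for illustration.

```python
from collections import defaultdict
from zlib import crc32


def partition_rows(rows, key, num_nodes):
    """Assign each row to a simulated node by hashing its distribution key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        # crc32 is a stand-in hash (assumption); real engines use their own.
        node_id = crc32(str(row[key]).encode("utf-8")) % num_nodes
        nodes[node_id].append(row)
    return nodes


def local_sum(node_rows, group_key, value_key):
    """Aggregate on one node using only the rows stored on that node."""
    totals = defaultdict(float)
    for row in node_rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


def mpp_sum(rows, group_key, value_key, num_nodes=3):
    """Partition rows, aggregate per node, then merge the partial results."""
    merged = defaultdict(float)
    for node_rows in partition_rows(rows, group_key, num_nodes):
        partial = local_sum(node_rows, group_key, value_key)
        for group, subtotal in partial.items():
            merged[group] += subtotal
    return dict(merged)


# Hypothetical sample data, for illustration only.
sales = [
    {"region": "east", "amount": 10.0},
    {"region": "west", "amount": 5.0},
    {"region": "east", "amount": 2.5},
    {"region": "north", "amount": 4.0},
]
totals = mpp_sum(sales, "region", "amount")
```

Because rows are distributed on the same key they are grouped by, every row for a given region lands on one node, so the final merge only combines disjoint partial results.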

TECHNICAL SKILLS

Areas of expertise: Big Data, ETL Pipelines, Data Pipelines, AWS, Data Science, Machine Learning, Data Visualization, Data Mining, Data Modeling and Data Warehousing, Natural Language Processing (NLP), Cloud Computing, RESTful API structuring, Python, NumPy, SciPy, Pandas, PySpark, MongoDB, Flask, Business Development, Data Analysis, Project Management, Strong Statistical Knowledge, Spark, JavaScript, SQL

Programming Languages: Python, R, C#, JavaScript, SQL

Python Libraries/Packages: NumPy, SciPy, PySide, PyTables, Pandas, TensorFlow, PyTorch, Matplotlib, SQLAlchemy, PyQuery

Statistical Analysis Skills: A/B Testing, Time Series Analysis, Marko

Big Data Tools: Hadoop, HDFS, MapReduce, Hive, HBase, Spark, Kafka, Scala, ZooKeeper, Pig, Sqoop, Cassandra, Oozie, MongoDB, Flume.

IDE: PyCharm, PyScripter, Spyder, PyDev, IDLE, NetBeans, Sublime Text, Visual Studio Code

Machine Learning and Analytical Tools: Supervised Learning (Linear Regression, Logistic Regression, Decision Tree, Random Forest, SVM, and Classification), Unsupervised Learning (Clustering, KNN, Factor Analysis, PCA), Natural Language Processing, Google Analytics, Fiddler, Tableau.

Cloud Computing: AWS (EC2, EMR and S3), Azure, Rackspace, OpenStack

AWS Services: Amazon EC2, Amazon S3, Amazon SimpleDB, Amazon MQ, Amazon ECS, AWS Lambda, Amazon SageMaker, Amazon RDS, Elasticsearch, Amazon SQS, AWS IAM, Amazon CloudWatch, Amazon EBS, AWS CloudFormation; also Snowflake and Databricks on AWS

Databases: MySQL, SQLite3, Cassandra, Redis, PostgreSQL, CouchDB, MongoDB, DynamoDB, Teradata

ETL: Informatica 9.6, DataStage, SSIS, SQL*Loader

Miscellaneous: Git, GitHub, SVN, CVS

Build and CI tools: Docker, Kubernetes, Maven, Gradle, Jenkins, Hudson, Bamboo

SDLC/Testing Methodologies: Agile, Waterfall, Scrum, TDD (Test Driven Development)

Operating Systems: Linux, UNIX, AIX, Windows NT/2000/2003/XP/7

PROFESSIONAL EXPERIENCE

Sr Data Engineer

Confidential, Branchburg, NJ

Responsibilities:

  • As a Data Engineer, worked closely with Business Analysts to gather requirements and design reliable and scalable data pipelines.
  • Pulled data from the data lake (HDFS) and transformed it with various RDD transformations.
  • Designed and developed POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Developed PySpark and Spark SQL code to process data in Apache Spark on Amazon EMR, performing the necessary transformations based on the source-to-target mappings (STMs) developed.
  • Developed high speed data ingestion pipeline using Scala, Apache Spark, Akka Streams, HDFS, Hive, and Cassandra.
  • Developed Python scripts for data cleaning, analysis, and automating day-to-day activities.
  • Developed shell scripts and SQL for data analysis and quality checks.
  • Developed automated reports using Tableau, Python, and MySQL to reduce manual intervention, saving 20 hours a month.
  • Designed, built, and managed ELT data pipelines, leveraging Airflow, Python, dbt, Stitch Data, and GCP solutions.
  • Optimized SQL queries and denormalized tables to improve query execution speed while lowering query costs. Accessed Hive tables using the Spark HiveContext (Spark SQL) and used Scala for interactive operations.
  • Oversaw the GitHub repository, enforcing version-control best practices and approving PRs.
  • Single-handedly migrated the data warehouse from Postgres to BigQuery.
  • Designed and launched a CI/CD framework with CircleCI, which led to more reliable code and cleaner data.
  • Interfaced with stakeholders to assess needs, development requirements, and timelines.
  • Mentored and guided analysts on building purposeful analytics tables in dbt for cleaner schemas.
  • Organized and facilitated daily scrum meetings, sprint planning, sprint reviews, and sprint retrospectives.
  • Modeled, lifted and shifted custom SQL, and transposed LookML into dbt for materializing incremental views.
  • Performed Data Analysis and provided statistical insights for the same using Python and Statistical Algorithms. Developed dashboards using Grafana with TSDB as data source.
  • Developed multiple dashboards and reports in Splunk to improve the agility of Security Analysts.
  • Regularly provided statistics and clear insights to internal stakeholders by leveraging data from disparate sources using DB/ETL/BI tools, uncovering opportunities in customer retention and engagement, collateral generation, product development, and product marketing.
  • Worked on data modeling and on building data visualizations, BI reports, and dashboards for multiple applications using Sisense and Power BI.
  • Assisted in database design and maintenance for various products in PostgreSQL and MS SQL Server, creation of a data lake in Snowflake using Stitch, application testing, and production support.
  • Created complete pipelines for static and dynamic data using the Flask framework, deployed on AWS.
  • Fetched data from the web by structuring API calls, and scraping using BeautifulSoup and Splinter libraries.
  • ETL performed on data collected from the web or provided by the client, using primarily Python (Pandas, SQLAlchemy).
  • Used the Flask framework to bring together Python, SQL or NoSQL (MongoDB) databases, and JavaScript to create dynamic visualizations in D3, Plotly, and/or Leaflet (Mapbox).
  • Used a MITM proxy on a mobile device to identify API requests (POST, GET) exchanged with the server and replicated them in a Python script.
  • Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python for developing various machine learning algorithms.
  • Built linear and non-linear regression models using the scikit-learn Python library to check correlations within a dataset and present them in tables and charts.
  • Implemented low-latency, fault-tolerant data pipelines for extraction, transformation, and loading of data into HDFS using technologies like Sqoop, Hive, and Java, for use in machine learning modeling.
  • Hydrated the data lake by creating data ingestion pipelines using technologies like Attunity Replicate, Kafka, Spark, HDFS, Hive, Java, Bash scripting, HBase, Teradata, QueryGrid, and CA7. The data is then used by stakeholders for machine learning and reporting.
  • Set up storage and data analysis tools in Amazon Web Services (AWS) cloud computing infrastructure.

Data Engineer

Confidential, St. Louis, MO

Responsibilities:

  • Used custom-developed PySpark scripts to pre-process and transform data and map it to tables inside the CIF (Non-corporate Information Factory) data warehouse.
  • Developed shell scripts for Sqoop jobs to load periodic incremental imports of structured data from various RDBMSs to S3, and used Kafka to ingest real-time website traffic data into HDFS.
  • As part of reverse engineering, discussed issues and complex code to be resolved, translated them into Informatica logic, and prepared ETL design documents.
  • Worked with the team and lead developers, interfaced with business analysts, coordinated with management, and gained an understanding of the end-user experience.
  • Used Informatica Designer to create complex mappings using different transformations to move data to a Data Warehouse.
  • Developed mappings in Informatica to load data from various sources into the Data Warehouse using transformations such as Source Qualifier, Expression, Lookup, Aggregator, Update Strategy, and Joiner.
  • Optimized the performance of the mappings by various tests on sources, targets and transformations.
  • Scheduled sessions to extract, transform, and load data into the warehouse database per business requirements using a scheduling tool.
  • Extracted data (flat files, mainframe files), transformed it, and loaded it into the landing area, then into the staging area, followed by the integration and semantic layers of the Data Warehouse (Teradata), using Informatica mappings and complex transformations (Aggregator, Joiner, Lookup, Update Strategy, Source Qualifier, Filter, Router, and Expression).
  • Optimized the existing ETL pipelines by tuning SQL queries and data partitioning techniques.
  • Created independent data marts from the existing data warehouse as per application requirements and updated them on a bi-weekly basis.
  • Decreased cloud billing by moving from Redshift storage to Hive tables for unpaid services, and implemented techniques such as partitioning and bucketing on Hive tables to improve query performance.
  • Used the Presto distributed query engine over Hive tables for its high performance.

Data Engineer

Confidential, New York, NY

Responsibilities:

  • Developed ETLs using Informatica Cloud Services (ICS) with third-party data connectors (e.g., Salesforce, Zuora, Oracle EBS) and change data capture.
  • Exported and imported data between Teradata and Hive/HDFS using Sqoop and the Hortonworks Connector for Teradata.
  • Worked with the Spark ecosystem using Spark SQL and Scala queries on different formats such as text and CSV files.
  • Worked on Kafka producer and consumer API configuration, upgrades (including rolling upgrades), topic-level configs, Kafka Connect configs, streams configs, consumer rebalancing, operations, replication, message delivery semantics, and end-to-end batch compression.
  • Hands-on experience with Informatica PowerCenter and PowerExchange in integrating with different applications and relational databases.
  • Built an AWS CI/CD data pipeline and an AWS data lake using EC2, AWS Glue, and AWS Lambda.
  • Worked with the different API endpoint types in AWS API Gateway: edge-optimized, regional, and private.
  • Configured controls at different levels, such as API key, method level, and account level.
  • Applied AWS API Gateway protection strategies such as resource policies, IAM, Lambda authorizers, and Cognito authentication.
  • Developed Spark code using Scala and Spark-SQL for faster testing and data processing.
  • Built ETL data pipelines on Hadoop/Teradata using Pig, Hive, and UDFs. Utilized Kubernetes and Docker as the runtime environment for the CI/CD system to build, test, and deploy.
  • Expertise in implementing DevOps culture through CI/CD tools like Repos, CodeDeploy, CodePipeline, and GitHub.
  • Created a continuous integration and continuous delivery (CI/CD) pipeline on AWS to automate steps in the software delivery process.
  • Set up AWS CodePipeline to build, test, and deploy code on every code change, based on the release process models.
  • Created a pipeline that uses AWS CodeDeploy to deploy applications from an Amazon S3 bucket and an AWS CodeCommit repository to Amazon EC2 instances running Amazon Linux.

Data Engineer

Confidential

Responsibilities:

  • Worked closely with a project team for gathering the business requirements and interacted with business analysts to translate business requirements into technical specifications.
  • Conducted independent data analysis and gap analysis, wrote mid-level SQL queries with interpretation, and generated reports with graphs as per specifications.
  • Performed text analytics, generated data visualizations using Python and R, and created dashboards using tools like Tableau.
  • Performed data analysis on a dataset of more than 100,000 rows using RStudio and generated financial report analysis using the ggplot2 and lattice packages.
  • Predicted the net profit of the next quarter using linear regression which helped in the expansion of the company by estimating the budget for the next year.
  • Designed and implemented data integration modules for Extract/Transform/Load (ETL) functions.
  • Involved in data warehouse design.
  • Worked with internal architects in the development of current and target state data architectures.
  • Worked with project team representatives to ensure that logical and physical ER/Studio data.
  • Involved in defining the source to target data mappings, business rules and data definitions.
  • Responsible for defining the key identifiers for each mapping/interface.
  • Worked on data modeling, data mapping, data cleansing, and data visualization.
  • Under the supervision of a Sr. Data Scientist, performed data transformations for rescaling.
  • Used SQL queries on the internal database and performed CRUD operations to maintain the database for data tracking purposes.
  • Gathered requirements and created Use Cases, Use Case Diagrams, Activity Diagrams using MS Visio
  • Performed Gap Analysis to check the compatibility of the existing system infrastructure with the new business requirements.
  • Worked with the Hadoop ecosystem, covering HDFS, HBase, YARN, and MapReduce.
  • Worked with the Oozie workflow engine to run workflow jobs with actions that run Hadoop MapReduce, Hive, and Spark jobs.
  • Performed data mapping and data design (data modeling) to integrate data across multiple databases into the EDW.
  • Responsible for the design and development of advanced Python programs to prepare, transform, and harmonize data sets in preparation for modeling.
  • Hands-on experience with Hadoop and Big Data technologies for storage, querying, processing, and analysis of data.
  • Developed Spark/Scala and Python code for a regular expression (RegEx) project in a Hadoop/Hive environment for big data resources.
  • Automated the monthly data validation process to validate the data for nulls and duplicates and created reports and metrics to share it with business teams
  • Used clustering techniques like K-means to identify outliers and to classify unlabeled data
  • Data gathering, data cleaning and data wrangling performed using Python
  • Transformed raw data into actionable insights by incorporating various statistical techniques, data mining, data cleaning, data quality, integrity utilizing Python (Scikit-Learn, NumPy, Pandas, and Matplotlib) and SQL
  • Evaluated the prediction errors of various machine learning algorithms, such as Linear Regression, Ridge Regression, Lasso Regression, Elastic Net Regression, KNN, Decision Tree Regressor, SVM, Bagging Decision Trees, Random Forest, AdaBoost, and XGBoost, and selected the best model based on MAE.
  • Experimented with Ensemble methods to increase accuracy of training model with different Bagging and Boosting methods
  • Identified target groups by conducting Segmentation analysis using Clustering techniques like K-means
  • Conducted model optimization and comparison using stepwise function based on AIC value
  • Used cross-validation to test models with different batches of data to optimize models and prevent overfitting.
  • Worked and collaborated with various business teams (operations, commercial, innovation, HR, logistics, safety, environmental, accounting) to analyze and understand changes in key financial metrics and provide ad-hoc analysis that can be leveraged to build long term points of view where value can be captured
  • Explored and analyzed customer specific features by using Matplotlib, Seaborn in Python and dashboards in Tableau
  • Utilized domain knowledge and application portfolio knowledge to play a key role in defining the future state of large, business technology programs.
  • Used Kibana, an open-source plugin for Elasticsearch, for analytics and data visualization.
  • Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
  • Experimented with multiple classification algorithms, such as Logistic Regression, Support Vector Machine (SVM), Random Forest, AdaBoost, and Gradient Boosting, using Python scikit-learn, and evaluated performance on customer discount optimization across millions of customers.
  • Built models using Python and PySpark to predict probability of attendance for various campaigns and events
  • Implemented classification algorithms such as Logistic Regression, K-Nearest Neighbors (KNN), and Random Forests to predict customer churn and customer interface.
  • Performed data visualization, designed dashboards, and generated complex reports, including charts, summaries, and graphs, to interpret findings for the team and stakeholders.

Environment: Hadoop, HDFS, HBase, Oozie, Spark, Machine Learning, Big Data, Python, PySpark, DB2, MongoDB, Elasticsearch, Web Services.
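
As an illustration of the monthly validation for nulls and duplicates mentioned in this role, here is a minimal Python sketch of such a check. It is not the original implementation: the function name, column names, and sample `records` are hypothetical, and a production version would more likely run in SQL or Spark.

```python
from collections import Counter


def validate_rows(rows, required, key):
    """Count null values per required column and find duplicated keys."""
    null_counts = {col: 0 for col in required}
    for row in rows:
        for col in required:
            # A missing column is treated the same as an explicit None.
            if row.get(col) is None:
                null_counts[col] += 1

    # Rows count as duplicates when they share the same composite key.
    key_counts = Counter(tuple(row.get(k) for k in key) for row in rows)
    duplicates = {k: n for k, n in key_counts.items() if n > 1}

    return {"rows": len(rows), "nulls": null_counts, "duplicates": duplicates}


# Hypothetical monthly extract, for illustration only.
records = [
    {"id": 1, "month": "2020-01", "value": 3.2},
    {"id": 2, "month": "2020-01", "value": None},
    {"id": 2, "month": "2020-01", "value": 4.1},
    {"id": 3, "month": None, "value": 0.7},
]
report = validate_rows(records, required=["month", "value"], key=["id", "month"])
```

The returned dictionary can be turned directly into the kind of summary metrics shared with business teams, for example null counts per column and the list of repeated keys.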

Data Analyst

Confidential

Responsibilities:

  • Handled a data set containing 64 million observations and 36 variables (IU Methodist Hospital EHR data).
  • Connected to a SQL database from Python using the pyodbc module.
  • Performed a retrospective cohort study and carried out Principal Component Analysis (PCA) for dimensionality reduction. Calculated relative risk along with confidence intervals.
  • Utilized Python to build regression and classification models, including logistic regression, Random Forest, and SVM models, and performance analysis procedures.
  • Created dashboards in Tableau for data reporting.
  • Work extensively with multiple data sources, including claims data, clinical data, and quality measures, to analyze and report ACO trends and insights to internal and external stakeholders.
  • Perform data cleaning and data standardization of Excel data before importing it into the SQL database.
  • Assist in developing SQL queries to build data extracts based on the requirements of the Community Health System’s (CHS) ACO clients and non-CHS ACO clients.
  • Apply predictive modeling techniques using Python and R to enhance the patient experience and health outcomes.
  • Develop user-friendly, informative Tableau dashboards based on client requirements and publish them to the production site after testing their usability on the test site.
  • Assist in standardizing and uniformly staging data across all visualizations, and in securing the interfaces by restricting data access based on the role of the user.
  • Worked extensively on Data Profiling, Data cleansing, Data Mapping and Data Quality.
  • Created Tableau Dashboards with interactive views, trends and drill downs along with user level security. Assisted the team for standardization of reports using SAS macros and SQL.
  • Responsible for designing, developing, and testing software (Informatica, PL/SQL, UNIX shell scripts) to maintain the data marts (loading data and analyzing it using OLAP tools).
  • Experience in building Data Integration, Workflow Solutions and Extract, Transform, and Load (ETL) solutions for data warehousing using SQL Server Integration Service (SSIS).
  • Facilitated Joint Application Development (JAD) sessions to identify business rules and requirements, and documented them in a format that can be reviewed and understood by both business and technical stakeholders.
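
For reference, the relative risk and confidence interval calculation mentioned in this role can be sketched in Python as below. This is a minimal illustration using the standard log-based (Katz) interval, which is an assumption about the method since the resume does not say which one was used. The counts are hypothetical and are not drawn from the EHR data.

```python
import math


def relative_risk(exposed_cases, exposed_total,
                  unexposed_cases, unexposed_total, z=1.96):
    """Return (RR, lower, upper) using the log-based interval; z=1.96 gives ~95%."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    rr = risk_exposed / risk_unexposed

    # Standard error of ln(RR) for a cohort-style 2x2 table.
    se_log_rr = math.sqrt(
        1 / exposed_cases - 1 / exposed_total
        + 1 / unexposed_cases - 1 / unexposed_total
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical cohort: 30 of 100 exposed and 15 of 100 unexposed had the outcome.
rr, lower, upper = relative_risk(30, 100, 15, 100)
```

With these hypothetical counts the point estimate is 2.0, meaning the outcome is twice as likely in the exposed group. An interval that excludes 1 suggests the difference is unlikely to be due to chance alone.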