Data Engineer Resume
Needham, MA
SUMMARY
- Around 7 years of IT experience with multinational clients, including Big Data architecture experience developing Spark/Hadoop applications.
- Developed end-to-end pipelines using Airflow and Databricks-mounted notebooks to perform ETL operations.
- Used AWS S3, Redshift, Redshift Spectrum, and Athena for business-user reporting.
- Developed shell scripts to schedule jobs on Airflow
- Developed multiple notification applications and automatic alert mechanisms using Python modules.
- Applied aggregations on various data sources using pandas and delivered outputs in CSV format (a minimal sketch follows this summary).
- Used NumPy for array processing.
- Implemented a POC to migrate existing Spark code to Spark DataFrames using Python.
- Experienced with the Spark ecosystem, using PySpark, Spark SQL, and Scala queries on data file formats such as .txt and .csv.
- Also working toward deeper knowledge of NoSQL databases such as MongoDB.
- Hands-on scripting experience in Python and Linux/UNIX shell.
- Experience in developing web-based applications.
- Comfortable working with different methodologies such as Agile, Waterfall, and Scrum.
- Excellent communication and analytical skills; flexible in adapting to evolving technology.
- Experience in building visualizations and infographics to deliver meaningful insights from data using Excel, Tableau, and R Shiny.
- Experience in building data pipelines, data engineering, data mining, and programming machine learning algorithms (supervised and unsupervised) to derive insights from data.
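For illustration only, a minimal sketch of the pandas aggregation-to-CSV work mentioned above; the file names, column names, and grouping keys are hypothetical placeholders, not actual client data.

```python
# Minimal sketch: aggregate a source extract with pandas and write a CSV output.
# "sales_extract.csv" and the column/grouping names are hypothetical placeholders.
import pandas as pd

def aggregate_to_csv(in_path: str, out_path: str) -> None:
    df = pd.read_csv(in_path, parse_dates=["order_date"])
    summary = (
        df.groupby(["region", df["order_date"].dt.to_period("M")])
          .agg(total_sales=("amount", "sum"), orders=("order_id", "count"))
          .reset_index()
    )
    summary.to_csv(out_path, index=False)

if __name__ == "__main__":
    aggregate_to_csv("sales_extract.csv", "sales_summary.csv")
```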
TECHNICAL SKILLS
Programming skills: Python, R, C/C++, Java/Scala, Unix, Bash scripting, PySpark, React, LaTeX
Apache Technologies: Apache Spark
Big Data / Cloud technologies: Databricks, Spark, Kafka, Redshift, Airflow, Docker, Kubernetes, Google Cloud Platform, AWS, Azure DevOps, Hadoop, JIRA, CI/CD
Databases: PL/SQL, Postgres, MS Azure, MS SQL Server 2017, SSIS, ERwin modeler, T-SQL, MySQL, Cassandra, HBase, DynamoDB
Analytical skills: ETL, data warehousing, Informatica, data management, data collection, predictive models, data modeling, TensorFlow, Spark ML, A/B testing, data analysis, Redshift, Parquet
Business Intelligence: Tableau, SAS, Looker, Power BI, Cognos, Matplotlib, Seaborn, A/B testing, BigQuery, Alteryx, SSIS
Machine Learning: logistic regression, random forests/decision trees, statistical models, neural networks, SVM, predictive analytics, ensembles, NLP, Caffe, MXNet, PyTorch, Keras, RNN, attribution/forecasting, scikit-learn, SciPy, Matplotlib, pandas
PROFESSIONAL EXPERIENCE
Data Engineer
Confidential
Responsibilities:
- Delivered the Sales CDL team's effort to upgrade the pipeline from CDL 1.0 to CDL 3.0 in DEV, UAT, and PROD environments; worked with GitLab and CI/CD
- Developed the Metric Engine, a core component of the iDNA platform; created a comprehensive Airflow UI to drive the platform, removing operational pain points and increasing business-client satisfaction
- Orchestrated automation tools to speed up the iDNA CDL process for the patient domain
- Led efforts to standardize cluster parameters for the iDNA platform and implemented new features to modify parameters flexibly
- Implemented CDL ingestion pipelines for sales, multi-channel marketing, and Confidential MDH data; worked with data formats such as zip, txt, gz, bzip2, and csv from S3 to Redshift for business users
- Accelerated the data-validation process by reducing manual work and making migration code easier to debug via a PySpark-based automation script
- Drove Sales Data Services weekly/monthly execution, resolving Airflow and Databricks issues such as partitioning and concurrency (an illustrative Airflow DAG sketch follows this role's tools list).
- Maintained quality reference data in RDS through cleaning and transformation, ensuring integrity in a relational environment.
- Analyzed customer requirements and the sales database, and used Spark SQL to build ETL on Databricks for downstream systems.
- Developed multiple pipeline rules in Graylog to process different types of data.
- Developed big data applications using Python modules such as PySpark
- Designed and developed Spark code with Python, PySpark, and Spark SQL for high-speed data processing to meet critical business requirements.
- Measured the performance of the existing classification process and reimplemented it using Spark 2.2 and Oozie
- Worked on all four stages: data ingest, data transform, data tabulate, and data export.
- Maintained fully automated CI/CD pipelines for code deployment (GitLab/Jenkins).
- Built code using Java, Spring Boot, Maven, and Jenkins to build and automate our data workflow
- Applied object-oriented programming concepts to build UI components reusable across web applications; worked with frameworks such as Spring, version-control tools such as Git and GitHub, and iterative development tools such as Atlassian Bitbucket and JIRA.
Tools used: HDFS, Spark, Spark SQL, Oozie, PySpark, Kafka, Hive, HBase, MapReduce, Databricks, AWS (Redshift, S3, EC2, EMR).
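For illustration only, a minimal Airflow DAG sketch of the ingest-transform-export orchestration pattern described in this role; the DAG id, schedule, and task callables are hypothetical placeholders, not the actual CDL pipeline.

```python
# Minimal sketch of an Airflow DAG wiring ingest -> transform -> export tasks.
# DAG id, schedule, and the task functions are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest(**context):
    print("pull source files from S3")

def transform(**context):
    print("run PySpark/Databricks transformation")

def export(**context):
    print("load curated data into Redshift")

with DAG(
    dag_id="cdl_pipeline_sketch",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_ingest = PythonOperator(task_id="ingest", python_callable=ingest)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_export = PythonOperator(task_id="export", python_callable=export)

    t_ingest >> t_transform >> t_export
```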
Data Engineer
Confidential | Needham, MA
Responsibilities:
- Developed a data pipeline using Flume, Spark, and Hive to ingest, transform, and analyze data
- Implemented various Pig UDFs to convert unstructured data into structured data.
- Loaded data into Spark RDDs and performed in-memory computation to generate output responses.
- Implemented Spark jobs in Python, using DataFrames and the Spark SQL API for faster data processing (see the PySpark sketch after this role's tools list).
- Developed custom web pages using the Airflow boilerplate, HTML, JavaScript, and Flask to generate line plots for all DAGs and identify optimization requirements for various tasks.
- Built and maintained a PL/SQL code base across its lifecycle to generate ad hoc and weekly financial reports for external and internal clients.
- Created Python tools using pandas to automate content creation in presentations and docs, reducing data-validation cost to zero.
- Achieved a ~95% speed improvement in Signal Spotting Trend Prediction, cutting runtime from 500+ minutes to 20 minutes with a NumPy back end
- Created an ETL pipeline in Python, integrated the data with Tableau, and generated custom reports and dashboards.
- Built regression models such as Market Mix Modelling to estimate the impact of marketing channels on sales and automated data preparation.
- Analyzed data to identify purchase KPIs and use cases, built storytelling dashboards, communicated insights to leadership, and improved strategy
- Optimized existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Developed an Apache Spark, Flume, and HDFS integration project for real-time data analysis
- Used SQL to extract data from various client sources such as AWS S3 and Redshift and aggregated it into a database.
- Established system support for SQL Server and tuned SQL query performance for incoming data, reducing data-ingestion time by 25%
- Designed, developed, and implemented data models, with quality and integrity top of mind, to support our products.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data.
- Performed analysis using high-level languages such as Python.
- Launched Amazon EC2 cloud instances from Amazon machine images and configured them for specific applications.
Tools used: Databricks, PySpark, Airflow, AWS, Python, JavaScript, jQuery, R, pandas, NumPy, SQL, D3.js, GitLab, Hadoop, Pig, Sqoop, Oozie, MapReduce, HDFS, Hive, Java Eclipse, UNIX shell scripting
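For illustration only, a minimal PySpark sketch of the DataFrame/Spark SQL processing pattern mentioned in this role; the input path, view name, and columns are hypothetical placeholders.

```python
# Minimal sketch: read a CSV extract, register it as a view, aggregate with Spark SQL,
# and write the result as Parquet. Paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl_sketch").getOrCreate()

raw = spark.read.csv("s3a://example-bucket/raw/events.csv", header=True, inferSchema=True)
raw.createOrReplaceTempView("events")

daily = spark.sql("""
    SELECT event_date, COUNT(*) AS event_count
    FROM events
    GROUP BY event_date
    ORDER BY event_date
""")

daily.write.mode("overwrite").parquet("s3a://example-bucket/curated/daily_events/")
```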
Data Engineer
Confidential
Responsibilities:
- Executed web scraping using Python and built databases of audit and compliance data for capital-market companies
- Wrote Python code to web-scrape and load the data into a Postgres relational database; used the Facebook Graph API to pull posts, fan counts, comments, and likes from more than 2,000 artist fan pages and identify the top 100 artists (a minimal scrape-and-load sketch follows this role's bullets)
- Used the Toad Data Modeler tool to design relationships between artist entities in the relational database
- Applied SQL queries to identify artist KPIs across the audience using Facebook as a data source
- Collaborated on sentiment-analysis research using the Natural Language Toolkit (NLTK) to gain insight into opinion of artist influence; wrote further SQL queries for time-series analysis to identify trends and develop metrics
- Implemented a frontend website using HTML and CSS and hosted it on a Linux server
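For illustration only, a minimal sketch of the scrape-and-load pattern described above; the URL, table name, and connection string are hypothetical placeholders, and the Facebook Graph API specifics are omitted.

```python
# Minimal sketch: fetch a page with requests, parse it with BeautifulSoup,
# and insert rows into Postgres with psycopg2. URL, table, and DSN are hypothetical.
import requests
from bs4 import BeautifulSoup
import psycopg2

def scrape_and_load(url: str, dsn: str) -> None:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Collect (link text, href) pairs as a stand-in for the real scraped fields.
    rows = [(a.get_text(strip=True), a["href"]) for a in soup.select("a[href]")]

    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS scraped_links (title TEXT, url TEXT)")
        cur.executemany("INSERT INTO scraped_links (title, url) VALUES (%s, %s)", rows)

if __name__ == "__main__":
    scrape_and_load("https://example.com", "dbname=scrapes user=postgres")
```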
Data Scientist
Confidential | Needham, MA
Responsibilities:
- Established data pipelines into SQL Server for incoming-data analysis using SQL, reducing data-ingestion time by 25%
- Proposed solutions, debugged, and analyzed A/B tests to increase the efficiency of marketing campaigns, improving product sales by 25% (an illustrative analysis sketch follows this role's tools list).
- Recommended design policies by developing an ETL pipeline using Talend and SSMS with Tableau for problem solving and reporting
- Developed and executed unit-test plans using JUnit, ensuring that results were documented and reviewed with the Quality Assurance teams responsible for integration testing.
- Developed the user interface using React, HTML5, Spring Web Flow, XHTML, DHTML, and CSS3.
- Involved in all phases of the Software Development Life Cycle (SDLC): analysis, design, development, testing, and finalization.
- Used Agile software development with Scrum methodology.
- Worked on user validations using Angular 2.0.
- Implemented web services to integrate different applications (internal and third-party components) using SOAP and RESTful services.
- Performed branching, tagging, and release activities with version-control tools: SVN, GitHub
- Derived high-quality industry trends and engagement prediction models using random forests, helping lift retail sales retention by 12%
- Built data-driven R drake pipelines for marketing models such as Market Mix Modelling, automating data preparation and reporting.
- Led the marketing analytics team through dashboards, reporting, and statistical analyses, capturing insights from marketing campaigns
- Created an ETL pipeline using Talend and executed scraping with Python from PCAOB, Nasdaq, and company finance and auditor websites.
- Performed requirements gathering and visualized data in Tableau to establish metrics, check integrity, and surface insights for healthcare prediction
Tools used: Python, Tableau, SQL, Docker, AML H2O, AWS S3, TPOT, AWS EC2, R, MySQL
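For illustration only, a minimal sketch of the kind of A/B-test analysis referenced in this role, using a chi-squared test from SciPy; the conversion counts and sample sizes are made-up numbers, not actual campaign data.

```python
# Minimal sketch: analyze a two-variant A/B test with a chi-squared test (SciPy).
# Conversion counts and sample sizes are made-up illustrative numbers.
from scipy.stats import chi2_contingency

# Rows: variants A and B; columns: converted, not converted.
table = [
    [320, 5000 - 320],
    [370, 5000 - 370],
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Difference between variants is statistically significant at the 5% level.")
else:
    print("No statistically significant difference detected.")
```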