Data Engineer Resume
SUMMARY
- Data Engineer with 7 years of experience in the Healthcare, e-commerce, Automobile and Insurance domains. Skilled at performing Data Extraction, Data Screening, Data Cleaning, Data Exploration, Data Visualization and Statistical Modelling on varied structured and unstructured datasets, as well as implementing large-scale Machine Learning algorithms and Statistical Testing to deliver actionable insights and inferences that significantly impact business revenue and user experience.
- Experienced in facilitating the entire lifecycle of a data science project: Data Extraction, Data Pre-Processing, Feature Engineering, Algorithm Implementation & Selection, Back Testing and Validation.
- Expert at working with statistical tests such as the t-test and ANOVA, along with non-parametric tests: Chi-squared, Mann-Whitney U and Kruskal-Wallis.
- Skilled in using Python libraries NumPy, Pandas, Seaborn and Matplotlib for performing Exploratory Data Analysis.
- Proficient in data transformations using log, square-root, reciprocal, cube-root, square and Box-Cox transformations, depending upon the dataset.
- Experience with Agile methodology using Jira.
- Adept at handling missing data by exploring the causes such as MAR (missing at random), MCAR (missing completely at random) and MNAR (missing not at random), analyzing correlations and similarities, introducing dummy variables and applying various imputation methods.
- Experienced in Machine Learning techniques such as Regression and Classification models like Linear Regression, Logistic Regression, Decision Trees, Support Vector Machines and K-NN, and clustering methods like K-Means, using scikit-learn in Python.
- Hands-on experience with NLP libraries like NLTK, and with text vectorization and word embedding techniques like CountVectorizer, TF-IDF and Word2Vec.
- In-depth knowledge of Dimensionality Reduction (PCA, LDA), Hyper-parameter tuning, Model Regularization (Ridge, Lasso, Elastic Net) and Grid Search techniques to optimize model performance (an illustrative sketch follows this summary).
- Skilled in Big Data technologies like Apache Spark, Spark SQL, PySpark, HDFS, Hive, Pig, MapReduce (Hadoop ecosystem) and Apache Kafka.
- Proficient in data visualization tools such as Tableau and Power BI; Big Data tools such as Hadoop HDFS, Spark and MapReduce; MySQL, Oracle SQL and Redshift SQL; and Microsoft Excel (VLOOKUP, pivot tables).
- Working knowledge of database creation and maintenance of physical data models with Oracle, DB2 and SQL Server databases, as well as normalizing databases up to third normal form (3NF) using SQL functions.
- Experience in web data mining with Python’s Scrapy and BeautifulSoup packages, along with working knowledge of Natural Language Processing (NLP) to analyze text patterns.
- Proficient in Natural Language Processing (NLP) concepts like Tokenization, Stemming, Lemmatization, Stop Words, Phrase Matching and libraries like SpaCy and NLTK.
- Proficient in machine learning algorithms like K-Nearest Neighbors and unsupervised clustering methods like K-Means.
- Good experience working with Jenkins and Artifactory for continuous integration and deployment.
- Experience in using different Hadoop ecosystem components such as HDFS, YARN, MapReduce, Spark, Pig, Sqoop, Hive, Impala, HBase, Kafka and cron/crontab tools.
- Skilled at data visualization with Tableau, Power BI and Python’s visualization libraries (Seaborn, Matplotlib, Bokeh, and interactive graphs using Plotly & Cufflinks).
- Familiar with Graph Databases such as Unicorn, FlockDB, Neo4j and Apache Giraph.
- Performed data visualization in RStudio using ggplot2, Esquisse, lattice, highcharter, Leaflet, sunburstR and rgl to build rich, interactive plots.
- Utilized R libraries like dplyr, lubridate, knitr, mlr, quanteda and caret for statistical analysis, data wrangling, dynamic report generation, machine learning tasks, text analysis and drawing inferences from data.
- Used HBase for OLTP workloads requiring high scalability on Hadoop, and for real-time, low-latency reads and writes across multiple applications.
- Skilled in Data profiling, Data Cleansing, Data mapping, creating workflows and Data Validation using data integration tools like Informatica and Talend Open Studio during the ETL and ELT processes.
- Familiarity with development best practices such as code reviews, unit testing, system integration testing (SIT) and user acceptance testing (UAT).
- Knowledge of Cloud services like Amazon Web Services (AWS) and Microsoft Azure for building, training and deploying scalable models.
- Extensive experience in Requirement Analysis and Application Design & Development, with profound knowledge of the SDLC using Agile and V-model.
- Worked on PySpark code for AWS Glue jobs and for EMR.
- Worked on Azure Cloud Services, Azure Cloud Storage and Azure SQL.
- Proficient in using PostgreSQL, Microsoft SQL Server, T-SQL, Oracle and MySQL to extract data using multiple types of SQL queries, including CREATE, SELECT, JOIN, DROP, CASE and conditional statements.
- Adept at writing HiveQL queries and performing ETL tasks, reporting and data analysis.
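To illustrate the regularization and grid-search workflow mentioned above, here is a minimal sketch that combines imputation, scaling and Elastic Net regularization in a scikit-learn pipeline tuned with GridSearchCV. The data is synthetic and the parameter ranges are hypothetical, not values from any project described here.

# Minimal illustrative sketch: imputation + scaling + Elastic Net with grid search.
# Data and parameter grids are hypothetical.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real dataset; a few values are set to NaN
# so the imputation step has something to do.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
rng = np.random.default_rng(42)
X[rng.integers(0, 500, 50), rng.integers(0, 10, 50)] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # handle missing values
    ("scale", StandardScaler()),                    # feature scaling
    ("model", ElasticNet(max_iter=5000)),           # Ridge/Lasso blend
])

# Hyper-parameter tuning over the regularization strength and the L1/L2 mix.
param_grid = {
    "model__alpha": [0.01, 0.1, 1.0, 10.0],
    "model__l1_ratio": [0.2, 0.5, 0.8],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Hold-out R^2:", search.best_estimator_.score(X_test, y_test))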
TECHNICAL SKILLS
Languages and Platforms: Python (NumPy, Pandas, scikit-learn, TensorFlow, etc.), Spyder, JupyterLab, RStudio (ggplot2, dplyr, lattice, highcharter, etc.), SQL, SAS.
Regression Methods: Linear, Polynomial, Decision Trees
Classification: Logistic Regression, K-NN, Naïve Bayes, Support Vector Machines (SVM)
Ensemble Learning: Random Forests, Gradient Boosting, Bagging
Clustering: K-means clustering, Hierarchical clustering
Deep Learning: Artificial Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks
Dimensionality Reduction: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA)
Time-Series Forecasting: Python (statsmodels) and R for time-series modelling
Database: SQL, PostgreSQL, MongoDB, Microsoft SQL Server, NoSQL, Oracle
Statistical Tests: Hypothesis Testing, ANOVA, z-test, t-test, Chi-Squared Fit test
Validation Techniques: Monte Carlo simulations, k-fold cross validation, A/B Testing
Optimization Techniques: Gradient Descent, Stochastic Gradient Descent, and gradient optimizers (Momentum, RMSProp, Adam)
Big Data: Apache Hadoop, HDFS, MapReduce, Apache Spark, Giraph, HiveQL, Pig, Kafka
Data Visualization: Tableau, Microsoft PowerBI, ggplot2, Matplotlib, Seaborn, Bokeh, Plotly
Data Modeling: Entity Relationship Diagrams (ERD), Snowflake Schema, SPSS Modeler
Operating Systems: Microsoft Windows, iOS, Linux Ubuntu
Database Systems: SQL Server, Oracle, MySQL, Teradata Processing System, NoSQL (MongoDB, HBase, Cassandra), AWS (DynamoDB, ElastiCache)
Cloud Platforms: Amazon Web Services (Glue, Redshift, S3, EC2, SageMaker), Microsoft Azure (HDInsight, Data Lake, ADF, AML), Google Cloud
PROFESSIONAL EXPERIENCE
Confidential
Data Engineer
Responsibilities:
- Worked closely with Business Analysts to understand business requirements and derive solutions accordingly.
- Worked with Data Engineers and Data Analysts in a cross-functional team on model deployment and project delivery.
- Built real-time data pipelines by developing Kafka producers and Spark Streaming applications to consume the data.
- Extracted data from HTML and XML files by web-scraping customer reviews using Beautiful Soup and Scrapy, and pre-processed raw data from the company’s data warehouse.
- Extracted data using PL/SQL queries, cleaned, imputed missing values and made the datasets ready for analysis and visualization on Tableau.
- Performed Exploratory Data Analysis and Advanced Analytics with Python and Visualization tools like Matplotlib, Seaborn to identify the patterns and correlations between the features.
- Performed feature scaling, feature engineering, validation, visualization and data resampling, reported findings and developed strategic uses of data with Python libraries like NumPy, Pandas, SciPy, scikit-learn and TensorFlow.
- Experience using third-party tools like Telerik, DevExpress and Kendo UI controls; worked on containerizing applications using Docker and Vagrant; familiar with JSON-based REST, SOAP and Amazon Web Services.
- Integrated Grafana API with Airflow for job monitoring.
- Implemented Continuous Integration using Jenkins and Git.
- Optimized datasets by creating dynamic partitions and bucketing in Hive (see the partitioning sketch after this list).
- Worked on analyzing the Hadoop cluster using different Big Data analytic tools including Flume, Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Spark and Kafka.
- Followed the Agile development methodology to develop the application and used Python to refactor and redesign databases.
- Involved in various pre-processing phases of text data, such as tokenization, stemming and lemmatization, converting the raw text into structured data.
- Performed data post-processing using NLP techniques like TF-IDF, Word2Vec and Bag-of-Words (BOW) to identify the most pertinent Product Subject Headings terms that describe items.
- Implemented various statistical techniques to manipulate the data, such as missing-data imputation, Principal Component Analysis and t-SNE for dimensionality reduction.
- Knowledge of continuous deployment using Heroku and Jenkins; experienced with cloud offerings including Infrastructure as a Service (IaaS).
- Implemented Kafka producer and consumer applications on a Kafka cluster set up with the help of ZooKeeper.
- Developed Spark programs for batch and real-time processing that consume incoming streams of data from Kafka sources, transform them into DataFrames and load those DataFrames into Hive and HDFS (see the streaming sketch after this list).
- Performed K-NN and K-Means clustering to categorize and classify customers into groups based on their product selections, and built a recommender algorithm for suggesting similar products using content-based recommendation.
- Extensively worked on Jenkins and Artifactory for Continuous Integration and Deployment.
- Experience in building ETL pipelines using NiFi.
- Performed collaborative filtering to generate item recommendations. Rank-based and content-based recommendations were used to address the problem of cold start.
- Worked on streaming data using Apache Kafka and AWS Kinesis to capture real-time clicks from customers logged in on the website and serve recommendations based on the product the customer is viewing.
- Responsible for loading the streaming data into Hadoop Distributed File System (HDFS).
- Created custom clusters in AWS Redshift based on different product selections to provide custom search results for customer product searches.
- Involved in creating a Data Lake by extracting customers' Big Data from various data sources (Excel, flat files, Oracle, SQL Server, MongoDB, HBase, Teradata, as well as log data from servers) into Hadoop HDFS.
- Created ETL jobs in AWS Glue to load vendor data from different sources, with transformations involving data cleaning, data imputation and data mapping, and stored the results in S3 buckets; the stored data was later queried using AWS Athena.
- Exported data from the HDFS environment into an RDBMS using Sqoop for report generation and visualization purposes.
- Worked on ETL processes in AWS Glue to migrate campaign data from external sources like S3 (ORC/Parquet/Avro/text files) into AWS Redshift.
- Configured instances in EC2 and utilized them for Elastic MapReduce (EMR) jobs, while writing scripts in AWS Lambda.
- Implemented custom Hive UDFs for comprehensive data analysis.
- Performed Linear Regression on the customer clusters derived from classification and clustering with K-NN and K-Means.
- Worked with Hadoop Ecosystem covering HDFS, HBase, YARN and MapReduce.
- Trained machine learning models using Logistic Regression, Random Forest and Support vector machines (SVM) on selected features to predict Customer churn.
- Employed statistical methodologies such as A/B testing, experiment design and hypothesis testing.
- Created a data pipeline using the scikit-learn Pipeline API to store the workflow and set up all data transformations for reuse.
- Successfully loaded files into Hive and HDFS from Oracle and SQL Server using Sqoop.
- Created various types of charts like heat maps, geocoding, symbol maps, pie charts, bar charts, tree maps, Gantt charts, circle views, line charts, area charts, scatter plots, bullet graphs and histograms in R (Shiny), Tableau Desktop, Power BI and Excel to provide better data visualization.
- Used AWS SageMaker to train the model using the protobuf RecordIO format and deploy it, owing to its relative simplicity and computational efficiency over Elastic Beanstalk.
- Employed various metrics such as Cross-Validation, Confusion Matrix, ROC and AUC to evaluate the performance of each model.
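A minimal sketch of the Kafka-to-Spark-to-HDFS streaming flow referenced above, shown here with Spark Structured Streaming purely for illustration; the broker address, topic name, event schema and output paths are hypothetical placeholders, not values from the project.

# Illustrative sketch: consume a Kafka topic with Spark Structured Streaming,
# parse JSON payloads into a DataFrame, and append Parquet files to an HDFS
# path that a Hive external table can be defined over. All names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-clickstream-sketch").getOrCreate()

# Hypothetical schema for incoming click events.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("product_id", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder brokers
    .option("subscribe", "clickstream")                  # placeholder topic
    .load()
)

events = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), event_schema).alias("e"))
    .select("e.*")
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/clickstream/")              # placeholder output path
    .option("checkpointLocation", "hdfs:///checkpoints/clicks")
    .outputMode("append")
    .start()
)
query.awaitTermination()  # block and keep the stream running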
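The partitioning and bucketing optimization referenced above can be sketched with Spark's DataFrameWriter API as a stand-in for the equivalent Hive DDL (CREATE TABLE ... PARTITIONED BY ... CLUSTERED BY ... INTO n BUCKETS). The table name, columns and sample rows below are invented for illustration.

# Illustrative sketch: write a DataFrame out as a partitioned, bucketed table.
# Assumes a Spark build with Hive support; names and sample rows are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partition-bucket-sketch")
    .enableHiveSupport()   # register the table in the Hive metastore when available
    .getOrCreate()
)

sales = spark.createDataFrame(
    [
        ("o-1", "p-10", 19.99, "2020-01-01"),
        ("o-2", "p-11", 5.49, "2020-01-01"),
        ("o-3", "p-10", 42.00, "2020-01-02"),
    ],
    ["order_id", "product_id", "amount", "order_date"],
)

(
    sales.write
    .partitionBy("order_date")          # one directory per order_date value
    .bucketBy(16, "product_id")         # 16 buckets hashed on product_id
    .sortBy("product_id")
    .format("parquet")
    .mode("overwrite")
    .saveAsTable("sales_bucketed")      # managed table in the metastore
)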
Environment: Python 3.6 (NumPy, Pandas, Seaborn, Matplotlib, NLTK, scikit-learn), PySpark 2.4.1, Jenkins, AWS, SQL Server, RStudio
Confidential
Data Scientist/Data Engineer
Responsibilities:
- Performed data collection, data cleaning, data profiling, data visualization and report creation.
- Extracted required medical data from Azure Data Lake Storage into a PySpark DataFrame for further exploration and visualization of the data to find insights and build a prediction model.
- Developed Hive tables on the data using different SerDes, storage formats and compression techniques.
- Performed data cleaning on the medical dataset which had missing data and extreme outliers from PySpark data frames and explored data to draw relationships and correlations between variables.
- Used Spark SQL to read data from external sources and process the data.
- Worked on Docker container snapshots, attaching to a running container, removing images, managing Directory structures, and managing containers.
- Involved in loading data from the UNIX file system into HDFS.
- Designed and developed multiple real-time high-volume, scalable batch and streaming Hadoop architectures for structured and unstructured data ingestion and processing using Apache Spark, Kafka and Flume.
- Worked on the Jenkins continuous integration tool for project deployment.
- Performed data extraction, aggregation and consolidation within AWS Glue using PySpark (see the Glue-style sketch after this list).
- Optimized SQL queries for transforming raw data into MySQL with Informatica to prepare structured data for machine learning; used Excel VBA for basic visualizations and performed MDM using Informatica MDM and PowerCenter.
- Developed workflows in Oozie to automate loading data into HDFS and pre-processing it with Sqoop scripts, Pig scripts and Hive queries.
- Performed data cleaning, feature scaling, feature engineering, validation, visualization and data resampling, reported findings and developed strategic uses of data with Python libraries like NumPy, Pandas, Seaborn and Matplotlib.
- Developed a flash-report application POC to display orders and sales in real time using Apache Spark, Kafka and Tableau.
- Implemented data pre-processing using scikit-learn; steps included imputation of missing values, scaling, logarithmic transforms and one-hot encoding (see the pre-processing sketch after this list).
- Analyzed the applicants' medical data to find various relationships, which were plotted using Alteryx, Tableau and Power BI to gain a better understanding of the available data.
- Exported data from the HDFS environment into an RDBMS using Sqoop for report generation and visualization purposes.
- Created batch (MapReduce/Spark) jobs in Talend Cloud Big Data 7.1 and PySpark to perform the transformations and load/create the target files.
- Utilized Python's data visualization libraries like Matplotlib and Seaborn to communicate findings to the data science, marketing and engineering teams and to stakeholders.
- Performed univariate, bivariate and multivariate analysis on the BMI, age and employment to check how the features were related in conjunction to each other and the risk factor.
- Trained machine learning models using Logistic Regression, Random Forest and Support vector machines (SVM) on selected features to predict Customer churn.
- Performed Naïve Bayes, K-NN, Logistic Regression, Random Forest and K-Means to categorize and classify customers into groups.
- Worked on statistical methods like data-driven hypothesis testing and A/B testing to draw inferences, determine significance levels, derive p-values and evaluate the impact of various risk factors.
- Furthered hypothesis testing by evaluating Type I and Type II errors to eliminate skewed inferences.
- Implemented and tested the model on AWS EC2 using data exported to an AWS S3 bucket, and collaborated with the development team to find the best algorithms and parameters.
- Prepared data visualizations, designed dashboards with Tableau and generated complex reports, including summaries and graphs, to interpret the findings for the team.
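The Glue extraction and aggregation work referenced above reduces to PySpark transformations; the sketch below shows a generic version using plain Spark APIs (Glue-specific wrappers such as GlueContext and DynamicFrame are omitted), and the S3 paths, columns and grouping logic are assumptions for illustration only.

# Illustrative sketch: the kind of PySpark extraction/aggregation that runs
# inside an AWS Glue job, shown with plain Spark APIs. Paths, columns and the
# grouping logic are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("glue-style-aggregation-sketch").getOrCreate()

# Extract: read raw vendor files from S3 (placeholder bucket/prefix).
claims = spark.read.parquet("s3://example-bucket/raw/claims/")

# Transform: basic cleaning, imputation and consolidation.
cleaned = (
    claims
    .dropDuplicates(["claim_id"])
    .withColumn("amount", F.coalesce(F.col("amount"), F.lit(0.0)))  # impute missing amounts
    .filter(F.col("claim_date").isNotNull())
)

monthly = (
    cleaned
    .withColumn("month", F.date_format("claim_date", "yyyy-MM"))
    .groupBy("vendor_id", "month")
    .agg(
        F.count("claim_id").alias("claim_count"),
        F.sum("amount").alias("total_amount"),
    )
)

# Load: write consolidated results back to S3 for downstream querying (e.g. Athena).
monthly.write.mode("overwrite").partitionBy("month").parquet("s3://example-bucket/curated/claims_monthly/")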
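A minimal sketch of the pre-processing steps referenced above (imputation, scaling, log transform, one-hot encoding) wired together with scikit-learn's ColumnTransformer; the column names and sample values are hypothetical and not taken from the actual medical dataset.

# Illustrative sketch: imputation, scaling, log transform and one-hot encoding
# combined in a ColumnTransformer. Columns and data are hypothetical.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [34, 45, np.nan, 29],
    "bmi": [22.1, 31.4, 27.8, np.nan],
    "charges": [1200.0, 8900.0, 430.0, 2750.0],   # skewed, so log-transformed
    "smoker": ["no", "yes", "no", np.nan],
})

numeric_cols = ["age", "bmi"]
skewed_cols = ["charges"]
categorical_cols = ["smoker"]

preprocess = ColumnTransformer([
    ("numeric", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("log", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("log", FunctionTransformer(np.log1p)),      # log(1 + x) for skewed values
        ("scale", StandardScaler()),
    ]), skewed_cols),
    ("categorical", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

features = preprocess.fit_transform(df)
print(features.shape)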
Environment: Python (NumPy, Pandas, Matplotlib, scikit-learn), AWS, Jupyter Notebook, HDFS, Hadoop MapReduce, PySpark, Tableau, SQL, Microsoft Azure Stack (HDInsight, Data Lake, ADF, AML)
Confidential
Data Analyst/ Scientist
Responsibilities:
- Imported and exported data between relational databases and HDFS using scripting languages and tools (Python, MySQL, Flume, Sqoop).
- Performed exploratory data analysis and data cleaning along with data visualization using Seaborn and Matplotlib.
- Performed feature engineering to create new features and determine transactions associated with completed offers.
- Performed advanced analysis and data profiling using complex SQL queries on various source systems including Oracle, Teradata and the data warehouse, using SSIS packages.
- Performed data collection, data cleaning, feature scaling, feature engineering, visualization and data manipulation to create reports, communicate findings and develop strategic uses of data with Python libraries like NumPy, Pandas, Seaborn and Matplotlib.
- Optimized SQL queries for transforming raw data into MySQL with Informatica to prepare structured data for machine learning; used Excel VBA for basic visualizations and performed MDM using Informatica MDM.
- Performed data visualization using Matplotlib and Seaborn on features like age, income, membership duration, etc.
- Built machine learning models using customer and transaction data.
- Built a classification model to classify customers for promotional deals to increase likelihood of purchase using Logistic Regression and Decision Tree Classifier.
- Tested the performance of classifiers like Logistic Regression, Naïve Bayes, Decision Trees and Support Vector classifiers.
- Employed ensemble learning techniques like Random Forests and AdaBoost/Gradient Boosting to improve model performance by 15%.
- Wrote Hive queries for Data analysis to meet the requirements.
- Picked the final model using ROC & AUC and fine-tuned the hyperparameters of the above models using Grid Search to find optimal values (see the model-selection sketch after this list).
- Used k-fold cross-validation to test and verify model accuracy.
- Prepared a dashboard in Power BI to summarize the model and present its performance measures.
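A minimal sketch of the model-selection approach referenced above: candidate classifiers are compared with k-fold cross-validated ROC AUC, and the best-performing family is then fine-tuned with grid search. The data is synthetic and the hyper-parameter grid is hypothetical.

# Illustrative sketch: compare candidate classifiers with 5-fold cross-validated
# ROC AUC, then fine-tune one of them with grid search. Synthetic data only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

# 5-fold cross-validation scored by ROC AUC for each candidate.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")

# Fine-tune the ensemble with a small, hypothetical hyper-parameter grid.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)
print("Best params:", search.best_params_, "best AUC:", round(search.best_score_, 3))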
Environment: Python (NumPy, Pandas, Matplotlib, scikit-learn), R (tidyverse), MySQL, pgAdmin