
Data Engineer/ Data Scientist Resume


San Diego, CA

SUMMARY

  • 8+ years of expertise as a Data Engineer / Data Scientist in the Retail, Logistics, Healthcare and Banking industries, using Big Data, Spark, real-time streaming, Kafka, Data Science, Machine Learning, NLP and Cloud (AWS, Azure, GCP).
  • Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
  • Proficient in Statistical Modeling and Machine Learning techniques (Linear and Logistic Regression, Decision Trees, Random Forest, SVM, K-Nearest Neighbors, Bayesian methods, XGBoost) in Forecasting/Predictive Analytics, Segmentation methodologies, Regression-based models, A/B testing, Hypothesis testing, Factor Analysis/PCA, and Ensembles.
  • Expertise in data modeling for data warehouse/data mart development, SQL, and analysis of Online Transaction Processing (OLTP), Online Analytical Processing (OLAP) data warehouse, and Business Intelligence (BI) applications.
  • Expertise in utilizing AWS services such as EC2, RDS, S3, EFS, Glacier, Storage Gateway, DynamoDB, ElastiCache, Redshift, VPC, CloudFront, Route53, Direct Connect, API Gateway, EBS, AMI, SNS, CloudWatch, ELB, Auto Scaling, IAM.
  • Used R and Python (with PySpark, HQL and AWS Redshift) for exploratory data analysis, A/B testing, ANOVA and hypothesis testing to compare and identify the effectiveness of creative campaigns and provide recommendations.
  • Serialized models using the Pickle library for deployment (a minimal sketch follows this list).
  • Built data pipelines for batch and real-time streaming using Azure Synapse.
  • Experience in automation of code deployment, support and administrative tasks across multiple cloud providers such as Amazon Web Services, Microsoft Azure, Google Cloud.
  • Hands-on experience in Microsoft Azure Cloud Services (PaaS & IaaS), Storage, Active Directory, Visual Studio Online (VSO), SQL Azure, HDInsight, Data Factory and Azure Data Lake Store.
  • Hands-on experience using Python 3.x for data analytics and visualization with core libraries such as NumPy, SciPy, pandas, Keras, PyTorch, TensorFlow and scikit-learn.
  • Profound knowledge of various supervised and unsupervised machine learning algorithms such as ensemble methods, clustering algorithms, classification algorithms and time series models (LSTM, GRU).
  • Extensive hands-on experience and high proficiency with structured, semi-structured and unstructured data, using a broad range of data science programming languages and big data tools including R, Python, Spark, SQL, scikit-learn and Hadoop MapReduce.
  • Used Spark Streaming to process streaming data and analyze continuous datasets with PySpark.
  • Resolved complex issues in Azure Databricks and HDInsight reported by Azure end customers.
  • Skilled at using big data tools (Sqoop, MapReduce, Hive, Spark) and the HDFS storage system.
  • Good Expertise in ingesting, processing, exporting, analyzing Terabytes of structured and unstructured data on Hadoop clusters in Information Security and Technology domains.
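
Below is a minimal sketch of the model serialization mentioned above, assuming a scikit-learn estimator; the synthetic data, model and file name are illustrative rather than taken from the original projects.

    import pickle

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Train a simple model on synthetic data (stand-in for a real campaign model).
    X, y = make_classification(n_samples=500, n_features=10, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X, y)

    # Serialize the fitted model to disk for deployment.
    with open("model.pkl", "wb") as f:
        pickle.dump(model, f)

    # In the serving environment, load it back and score new data.
    with open("model.pkl", "rb") as f:
        restored = pickle.load(f)
    print(restored.predict(X[:5]))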

TECHNICAL SKILLS

Programming Languages: SQL, R, Python, Scala, Java

Cloud Tools: AWS (Athena, Redshift, EMR, S3, Lambda, Glue, Data Pipeline, SageMaker, RDS, CloudFront, VPC, Route53), Azure (Data Factory, ADLS, Synapse Analytics, HDInsight, Blob Storage, Databricks, Data Lake, Azure SQL, Machine Learning Studio), GCP (BigQuery, Dialogflow, Dataproc, Google Vision, Google Colab, Google Cloud Natural Language)

Machine Learning: PyTorch, TensorFlow, Keras, scikit-learn; Linear Regression, Logistic Regression, Gradient Boosting, Random Forests, Maximum Likelihood Estimation, Clustering, Classification, Association Rules, K-Nearest Neighbors (KNN), K-Means Clustering, Decision Trees (CART & CHAID), Neural Networks, Principal Component Analysis, Weight of Evidence (WOE) and Information Value (IV), Factor Analysis, Sampling Design, Time Series Analysis, ARIMA, ARMA, GARCH, Market Basket Analysis, Text Mining

Data Warehouse: AWS Redshift, Cloudera, Spark, Star Schema, Snowflake Schema, SAS, SSIS and Splunk

BI/Analytic Tool: Tableau, Azure ML, SSRS

Big Data tools: HDFS, Sqoop, Hive, Spark, Kafka, HBase, Airflow

Hadoop Distributions: Cloudera, Hortonworks, MapR

PROFESSIONAL EXPERIENCE

Confidential, San Diego, CA

Data Engineer/ Data Scientist

Responsibilities:

  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources like S3 (ORC/Parquet/Text files) into the database.
  • Performed data extraction, aggregation and consolidation of Adobe data within AWS Glue using PySpark (see the first sketch after this list).
  • Built architecture for storing and pipelining data using in-house data warehouses and Amazon Web Services (AWS), including DynamoDB, Redshift, Kinesis, Lambda, EC2 and S3; created external tables with partitions using Hive, AWS Athena and Redshift.
  • Implemented pipelines using AWS Data Pipeline with CI/CD.
  • Created a Data Lake by extracting customer information from various data sources (Teradata, Mainframes, RDBMS, CSV, Excel) into HDFS.
  • Migrated the on-premises database structure to the Confidential Redshift data warehouse and wrote various data normalization jobs for new data ingested into Redshift.
  • Supported continuous storage in AWS using Elastic Block Storage, S3 and Glacier; created volumes and configured snapshots for EC2 instances.
  • Migrated the on-premises database structure to the Confidential Redshift data warehouse and was responsible for setting up the ETL and data validation using SQL Server Integration Services.
  • Performed sentiment analysis (NLP) on customer email feedback to determine the tone behind a series of words using neural network techniques such as Long Short-Term Memory (LSTM) cells in Recurrent Neural Networks (RNN).
  • Used Long Short-Term Memory (LSTM) networks for analyzing time series data in PyTorch (see the second sketch after this list).
  • Leveraged NLP libraries (NLTK, textblob, spacy, gensim) to improve the way end user team access, understand, and infer textual information from defect libraries and analytical dashboards.
  • Performed Customer Segmentation based on demographics using K-means Clustering.
  • Designed and developed applications using Apache Spark, Scala, Python, NiFi, S3 and AWS EMR on the AWS cloud to format, cleanse, validate, create schemas and build data stores on S3.
  • Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS (see the third sketch after this list).
  • Developed data processing applications in Scala using SparkRDD as well as Dataframes using SparkSQL APIs.
  • Managed Data quality & integrity using skills in Data Warehousing, Databases & ETL.
  • Applied multiple Machine Learning (ML) and Data Mining techniques to improve the quality of product ads and personalized recommendations.
  • Collaborated wif database engineers to implement ETL process, wrote and optimized SQL queries to perform data extraction and merging from SQL server database.
  • Extensively worked on CI/CD pipeline for code deployment by engaging different tools (Git, Jenkins, CodePipeline) in the process right from developer code check-in to Production deployment
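
First sketch: a minimal AWS Glue (PySpark) job along the lines of the S3-to-database migration and Adobe data aggregation bullets above; the bucket paths and column names are hypothetical placeholders.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read campaign files (Parquet) landed in S3 into a DynamicFrame.
    campaigns = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://example-campaign-bucket/raw/"]},
        format="parquet",
    )

    # Aggregate with plain PySpark before writing to the curated zone.
    df = campaigns.toDF()
    daily = df.groupBy("campaign_id", "event_date").count()
    daily.write.mode("overwrite").parquet(
        "s3://example-campaign-bucket/curated/daily_counts/"
    )

    job.commit()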
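
Second sketch: a minimal PyTorch LSTM classifier of the kind used for the email sentiment analysis bullet; the vocabulary size, dimensions and dummy batch are illustrative.

    import torch
    import torch.nn as nn

    class SentimentLSTM(nn.Module):
        """Embedding -> LSTM -> linear head producing class logits."""

        def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128, num_classes=2):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, num_classes)

        def forward(self, token_ids):
            embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
            _, (hidden, _) = self.lstm(embedded)      # hidden: (1, batch, hidden_dim)
            return self.fc(hidden[-1])                # logits per class

    model = SentimentLSTM()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # One training step on a dummy batch of tokenized feedback emails.
    tokens = torch.randint(0, 5000, (8, 40))   # 8 emails, 40 tokens each
    labels = torch.randint(0, 2, (8,))         # 0 = negative, 1 = positive
    optimizer.zero_grad()
    loss = criterion(model(tokens), labels)
    loss.backward()
    optimizer.step()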
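
Third sketch: a minimal Spark Structured Streaming pipeline in the spirit of the near-real-time S3 ingestion bullet; the event schema, bucket and HDFS paths are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("learner-stream").getOrCreate()

    # Pick up newline-delimited JSON files as they land in the S3 bucket.
    events = (
        spark.readStream
        .schema("learner_id STRING, event_type STRING, ts TIMESTAMP")
        .json("s3a://example-bucket/incoming/")
    )

    # Lightweight on-the-fly aggregation: event counts per 5-minute window.
    counts = (
        events
        .withWatermark("ts", "10 minutes")
        .groupBy(F.window("ts", "5 minutes"), "event_type")
        .count()
    )

    # Persist results to HDFS; the checkpoint keeps the stream restartable.
    query = (
        counts.writeStream
        .outputMode("append")
        .format("parquet")
        .option("path", "hdfs:///curated/learner_event_counts/")
        .option("checkpointLocation", "hdfs:///checkpoints/learner_event_counts/")
        .start()
    )
    query.awaitTermination()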

Environment: AWS Data Pipeline, Databricks, CI/CD, NLP, Hadoop, MapReduce, HDFS, Sqoop, Oozie, Keras, PyTorch, WinSCP, Python, Hive, Impala, LSTM, RNN, Kinesis, Athena, Redshift, Tableau, S3, SQL Server Integration Services, AWS Data Migration Services, JIRA etc.

Confidential, Chicago IL

Data Engineer / Data Scientist

Responsibilities:

  • Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the Cosmos activity.
  • Implemented OLAP multidimensional cube functionality using Azure SQL Data Warehouse
  • Leveraged NLP libraries (NLTK, textblob, spacy, gensim) to improve the way end user team access, understand, and infer textual information from defect libraries and analytical dashboards.
  • Utilized Apache Spark with Python to develop and execute Big Data analytics and machine learning applications; executed machine learning use cases under Spark ML and MLlib. Involved in collecting and aggregating large amounts of log data using Storm and staging the data in HDFS for further analysis.
  • Built deep neural networks (DNN) using Keras, Azure ML Studio and TensorFlow, and built a Support Vector Machine (SVM) to predict whether an invoice would result in a claim or not based on past data.
  • Performed data integrity checks, data cleansing, exploratory analysis and feature engineering using Python libraries like Pandas, Matplotlib etc.
  • Programmed in Spark and Python to streamline the incoming data and build the data pipelines to get the useful insights, and orchestrated pipelines.
  • Used Spark Streaming to process streaming data and analyze continuous datasets with PySpark.
  • Resolved complex issues in Azure Databricks and HDInsight reported by Azure end customers.
  • Built different HDInsight clusters (Hive, Spark, HBase, Kafka and LLAP Interactive Query) with the Enterprise Security Package and virtual networks.
  • Worked with the Spark Session object on Spark SQL and DataFrames for faster execution of Hive queries.
  • Built predictive models including Support Vector Machine, Decision Tree, Naive Bayes classifier and Neural Network, plus ensembles of these models, to evaluate how the likelihood to recommend would change for customer groups under different sets of services, using Python scikit-learn (see the sketch after this list).
  • Used Elasticsearch to retrieve data into the application as required.
  • Managed Data quality & integrity using skills in Data Warehousing, Databases & ETL.
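
A minimal scikit-learn sketch of the kind of predictive modeling described in the likelihood-to-recommend bullet above; the synthetic features, labels and hyperparameters are illustrative stand-ins for the real survey/usage data.

    import numpy as np
    from sklearn.ensemble import VotingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for customer features and a "would recommend" label.
    rng = np.random.default_rng(7)
    X = rng.normal(size=(800, 10))
    y = (X[:, 1] - 0.3 * X[:, 4] + rng.normal(scale=0.8, size=800) > 0).astype(int)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=7
    )

    # Individual models combined into a soft-voting ensemble.
    ensemble = VotingClassifier(
        estimators=[
            ("svm", SVC(probability=True)),
            ("tree", DecisionTreeClassifier(max_depth=5)),
            ("nb", GaussianNB()),
        ],
        voting="soft",
    )
    ensemble.fit(X_train, y_train)
    print("Ensemble accuracy:", ensemble.score(X_test, y_test))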

Environment: Azure, ADF, Data Lake, Databricks, Tableau, Spark SQL, HBase, Kafka, HDFS, Hive, Apache Sqoop, Spark, Python, YARN, Agile Methodology, Cloudera, MySQL, Spark ML

Confidential, Springfield, IL

Responsibilities:

  • Extensively involved in all phases of data acquisition, data collection, data cleaning, model development, model validation, and visualization to deliver data science solutions.
  • Extracted data from different company sources and performed SQL queries to transform and load data into structured form. Worked on data cleaning and ensured data quality, consistency and integrity using pandas and NumPy. Performed Exploratory Data Analysis (EDA) with various plots and graphs using Python's matplotlib and seaborn libraries to understand and discover patterns in the data, examined feature correlations using heatmaps, and performed hypothesis testing to check the significance of the features.
  • Developed Python scripts to automate data sampling process. Ensured the data integrity by checking for completeness, duplication, accuracy, and consistency
  • Created a Data Lake by extracting customer information from various data sources (Teradata, Mainframes, RDBMS, CSV, Excel) into HDFS.
  • Built architecture for storing and pipelining data using in-house data warehouses and Amazon Web Services (AWS), including DynamoDB, Redshift, Kinesis, Lambda, EC2 and S3.
  • Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from Redshift.
  • Developed PySpark and SparkSQL code to process the data in Apache Spark on Amazon EMR to perform the necessary transformations based on the STMs developed
  • Used Spark Data Frame Operations to perform required validations on the data.
  • Responsible in performing sort, join, aggregations, filter, and other transformations on the datasets.
  • Created Hive tables and worked on them for data analysis to meet the requirements.
  • Implemented Hive partitioning and bucketing for data analytics (see the sketch after this list).
  • Analyzed the data using HQL and Spark SQL.
  • Loaded the cleaned data into Hive tables and performed analytical functions based on requirements.
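
A minimal PySpark sketch of the EMR-style transformations and the Hive partitioning/bucketing mentioned above; the S3 path, database, table and column names are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Hive support registers the partitioned/bucketed table in the metastore.
    spark = (
        SparkSession.builder
        .appName("emr-transform")
        .enableHiveSupport()
        .getOrCreate()
    )

    orders = spark.read.parquet("s3a://example-bucket/raw/orders/")

    # Typical transformations: filter, derive a date column, aggregate.
    daily = (
        orders
        .filter(F.col("status") == "COMPLETED")
        .withColumn("order_date", F.to_date("order_ts"))
        .groupBy("region", "order_date")
        .agg(F.sum("amount").alias("total_amount"),
             F.count("*").alias("order_count"))
    )

    # Persist as a Hive table partitioned by date and bucketed by region
    # (assumes the target database already exists).
    (
        daily.write
        .mode("overwrite")
        .partitionBy("order_date")
        .bucketBy(8, "region")
        .sortBy("region")
        .saveAsTable("analytics.daily_orders")
    )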

Environment: AWS, Teradata, Redshift, Kinesis, S3, Lambda, Jupyter Notebook, Hadoop, SQL Azure, Hive, HBase, Sqoop, Windows 10, Linux, Oozie.

Confidential

Data Engineer

Responsibilities:

  • Worked on data pre-processing and cleansing to perform feature engineering, and applied data imputation techniques for missing values in the dataset using Python (see the sketch after this list).
  • Developed Python scripts to automate the data sampling process. Ensured data integrity by checking for completeness, duplication, accuracy and consistency.
  • Extracted, transformed and loaded data from source systems to generate CSV data files with Python programming and SQL queries.
  • Performed Data Integration, Extraction, Transformation, and Load (ETL) Processes
  • Created pipelines in ADF using Linked Services/Datasets/Pipelines to extract, transform and load data from different sources such as Azure SQL, Blob Storage and Azure SQL Data Warehouse, and to write data back.
  • Used Python to identify trends and relationships between different pieces of data and drew appropriate conclusions.
  • Involved in ingesting data from source systems into the Hadoop environment.
  • Involved in writing test cases for validating output reports.
  • Worked with product owners to establish the design of experiments and the measurement system for evaluating the effectiveness of product improvements.
  • Worked with Project Management to provide timely estimates, updates and status.
  • Worked closely with data scientists to assist with feature engineering, model training frameworks, and model deployments at scale.
  • Developed MapReduce, Hive and Pig scripts for ETL jobs.
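
A minimal pandas sketch of the missing-value imputation and integrity checks described above; the toy DataFrame and column names are illustrative.

    import numpy as np
    import pandas as pd

    # Illustrative raw extract with gaps and a duplicate record.
    df = pd.DataFrame({
        "customer_id": [1, 2, 2, 3, 4],
        "age": [34, np.nan, np.nan, 45, 29],
        "segment": ["A", "B", "B", None, "A"],
        "spend": [120.0, 80.5, 80.5, np.nan, 60.0],
    })

    # Integrity checks: completeness and duplication.
    print(df.isna().sum())
    df = df.drop_duplicates(subset="customer_id")

    # Simple imputation: median for numeric columns, mode for categoricals.
    df["age"] = df["age"].fillna(df["age"].median())
    df["spend"] = df["spend"].fillna(df["spend"].median())
    df["segment"] = df["segment"].fillna(df["segment"].mode().iloc[0])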

Environment: Hadoop, MapReduce, Spark, Spark MLlib, Tableau, SQL, Excel, Pig, Hive, Ambari, AWS, PostgreSQL, Azure, Cosmos, Python, PySpark, Flink, Kafka, SQL Server 2012, T-SQL, CI/CD, Git, XML.

Confidential

Data Engineer

Responsibilities:

  • Analyzed and prepared data, identifying patterns in the dataset by applying historical models.
  • Improved efficiency and accuracy by evaluating the model in Python.
  • Performed data manipulation, data transformation, data preparation, normalization, and predictive modeling.
  • Used Python programming to improve the model and develop different statistical machine learning models.
  • Performed the data cleaning process and applied backward/forward filling methods on the dataset to handle missing values.
  • Implemented machine learning and deep learning (ML/DL) to analyze and predict from the given data.
  • Developed predictive models using Decision Trees, Random Forests (using their APIs), Cluster Analysis, and Neural Networks.
  • Applied boosting methods to the predictive model to improve its efficiency.
  • Built recommendation engines for both content-based filtering and collaborative filtering mechanisms for the client.
  • Segmented customers based on demographics using K-means clustering (see the sketch after this list).
  • Explored different regression and ensemble models in machine learning to perform forecasting.
  • Implemented public segmentation using unsupervised machine learning algorithms by implementing k-means algorithm.
  • Used classification techniques including Random Forest and Logistic Regression to quantify the likelihood of each user referring.
  • Developed Tableau data visualization using Cross tabs, Heat maps, Box and Whisker charts, Scatter Plots, Geographic Map, Pie Charts and Bar Charts and Density Chart.
  • Created various types of data visualizations using R, Python and Tableau. Presented dashboards to higher management for more insights using Power BI.
  • Designed and implemented system architecture for Amazon EC2 based cloud-hosted solution for client.

Environment: Hadoop (Hortonworks 2.2), Hive, Pig, HBase, Scala, Sqoop and Flume, Oozie, AWS, S3, EC2, EMR, Spring, Kafka, SQL, Python, UNIX, Teradata
