
Data Engineer / Scientist Resume


VA

SUMMARY

  • 4+ years of professional experience in statistical modeling, machine learning, data visualization and Big Data.
  • Expertise in transforming business requirements into models, algorithms, data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
  • Excellent understanding of Hadoop architecture and daemons such as HDFS, NameNode, DataNode, Job Tracker, Task Tracker and MapReduce concepts.
  • Hands-on experience installing, configuring and using Hadoop ecosystem components such as HDFS, MapReduce, Hive, Pig, Sqoop, HBase, Impala, Oozie, ZooKeeper, Spark and SOLR with the Cloudera distribution.
  • Hands on experience in various big data application phases like data ingestion, data analytics and data visualization.
  • In - depth understanding of Spark Architecture including Spark Core, Spark SQL, Spark Streaming.
  • Extensive experience in Natural Language Processing (NLP) tasks such as sentiment analysis and text analytics, developing statistical machine learning and data mining solutions to various business problems, and generating data visualizations using R, Python and Tableau.
  • Experience in data integration, profiling, validation, cleansing transformations and data visualization using R, SAS and Python.
  • Extensive working experience with Python libraries including scikit-learn, Pandas, NumPy, H2O and PySpark.
  • Expert in data ingestion tools like Sqoop, Flume, Kafka, Spark Streaming.
  • Experience writing Big Data cleansing scripts in Spark, MapReduce and Pig, and developing custom UDFs in Java to extend Hive and Pig Latin functionality.
  • Intensive experience in Hive and exposure to NoSQL databases such as HBase, Cassandra and MongoDB.
  • Ingested data from different sources such as Oracle, Teradata and SQL Server.
  • Experience with data migration and data generation in the Big Data ecosystem.
  • Experience in managing and reviewing Hadoop log files.
  • Experience in developing pipelines in Spark using Scala and Python.
  • Experience in developing streaming pipelines using Kafka and Storm.
  • Orchestrated multiple Hadoop application jobs using Sqoop and implemented optimization techniques in Hive and Spark.
  • Experience in Python and shell scripting.
  • Experience working with cloud platforms such as Amazon Web Services and Azure.
  • Experienced in improving the performance and optimizing existing algorithms in Hadoop using Apache Spark Context, Spark SQL, DataFrames, pair RDDs and Spark on YARN.
  • Worked with Apache Spark, which provides a fast, general engine for large-scale data processing integrated with the functional programming language Scala.
  • Hands-on experience with sequence files, combiners, counters, dynamic partitioning and bucketing for best-practice performance tuning.
  • Used Scala and Python to convert Hive/SQL queries into RDD transformations in Apache Spark (see the PySpark sketch after this list).
  • Proficient in Data visualization tools such as Tableau, Plotly, Python Matplotlib and Seaborn.
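
As a minimal illustration of the Hive/SQL-to-RDD conversions mentioned above, the PySpark sketch below rewrites a simple Hive aggregation as RDD transformations; the transactions table and its category/amount columns are hypothetical placeholders, not the actual schema.

    # Minimal PySpark sketch: a Hive aggregation rewritten as RDD
    # transformations. Table and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-to-rdd")
             .enableHiveSupport()
             .getOrCreate())

    # Original Hive query:
    #   SELECT category, SUM(amount) FROM transactions GROUP BY category
    df = spark.table("transactions")

    # Equivalent RDD transformation chain
    totals = (df.rdd
              .map(lambda row: (row["category"], row["amount"]))
              .reduceByKey(lambda a, b: a + b))

    for category, total in totals.collect():
        print(category, total)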

TECHNICAL SKILLS

Programming Languages: Python (CARET, glmnet, forecast, XGBoost, scikit-learn); SAS (Forecast Server, SAS procedures and data modeling); Spark (MLlib); SQL (analytical & windowing functions, subqueries, joins, DDL/DML statements); R (CARET, Random Forest)

Big Data Ecosystem: Hadoop, MapReduce, Pig, Hive, YARN, Sqoop, Impala, Oozie, Zookeeper, Apache Spark, Spark SQL, Spark Streaming, Apache Kafka, Apache Flume and Cassandra

Statistical Methods: Supervised Learning (Linear/Logistic Regression, Lasso, Ridge, Decision Trees, Ensemble Methods, Random Forests, Support Vector Machines, Gradient Boosting, XGBoost, Neural Networks); Unsupervised Learning (Principal Component Analysis (PCA), K-Means, Hierarchical Clustering, Market Basket Analysis, Collaborative Filtering, Low-Rank Matrix Factorization); Sampling Methods (bootstrap sampling, stratified sampling); Model Tuning/Selection (Cross-Validation, AIC/BIC criteria, Grid Search, Regularization, Dimension Reduction)

BI/Visualization Tools: Tableau, Advanced Excel, ggplot2

Natural Language Processing / Text Mining: Document Term Matrix (DTM), Stemming, Lemmatization, Word Embeddings, Semantics, Term Frequency, Dependency Parsing, Sentiment Analysis, Natural Language Generation (NLG), Word Cloud, Named Entity Recognition (NER), Part-of-Speech (POS) Tagging

IDEs & Frameworks: Spyder, RStudio, Jupyter, Anaconda, H2O, PySpark, Flask, Django, Docker, R Shiny

Other: Git, Statistics, Microsoft Azure, Google Cloud Platform, Amazon Web Services (AWS), Hadoop

Operating Systems: Windows, Linux, macOS

PROFESSIONAL EXPERIENCE

Confidential, VA

Data Engineer / Scientist

Responsibilities:

  • Performed data profiling to learn about behavior across features such as traffic pattern, location, date and time; integrated external data sources and APIs to discover interesting trends.
  • Evaluated business requirements and prepared detailed specifications, following project guidelines, for program development.
  • Installed, configured, supported and managed Hadoop clusters using Apache and Cloudera (CDH 5.x) distributions.
  • Worked on the Cloudera distribution of the Hadoop ecosystem; installed and configured Flume, Hive, Pig, Sqoop, Oozie and Automic on the Hadoop cluster.
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs and Spark on YARN.
  • Built machine learning models to identify fraudulent loan pre-approval applications and fraudulent credit card transactions from customer transaction history using supervised learning methods (see the fraud-model sketch after this list).
  • Developed Oozie workflows to schedule scripts on a daily basis.
  • Developed Spark jobs using PySpark to create a generic framework for processing all kinds of flat files (a sketch follows this list).
  • Performed data cleaning, feature scaling, featurization and feature engineering.
  • Managed and reviewed Hadoop log files to identify issues when jobs fail, and used HUE for UI-based Pig script execution and Automic scheduling.
  • Involved in creating a data lake by extracting customer data from various sources into HDFS, including data from Excel, databases and server log data.
  • Designed the number of partitions and the replication factor for Kafka topics based on business requirements, and worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially implemented in Python (PySpark).
  • Used various Spark transformations and actions to cleanse the input data, and used the Spark application master to monitor Spark jobs and capture their logs.
  • Refactored the existing Spark batch processes for different logs, written in Scala.
  • Implemented Big Data tools such as Spark with Scala, utilizing DataFrames and the Spark SQL API for faster data processing, and worked on an extensible framework for building high-performance batch and interactive data processing applications on Hive.
  • Extracted real-time feeds using Spark Streaming, converted them to RDDs, processed the data into DataFrames and loaded it into Cassandra.
  • Performed customer segmentation based on behavior and characteristics such as age, region, income and geographic location, applying clustering algorithms to group customers with similar behavior patterns (see the segmentation sketch after this list).
  • Used the segmentation results to learn the Customer Lifetime Value of each segment, discover high-value and low-value segments, and improve customer service to retain customers.
  • Performed clustering with historical, demographic and behavioral data as features to implement personalized marketing that offers the right product to the right person at the right time on the right device.
  • Addressed overfitting and underfitting by tuning the hyperparameters of the algorithms and by using L1 and L2 regularization.
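
A minimal sketch of the kind of supervised fraud model described above, using scikit-learn; the random feature matrix and labels are synthetic stand-ins for real transaction history, and the choice of gradient boosting is an assumption.

    # Illustrative fraud-classification sketch (scikit-learn). X and y are
    # synthetic placeholders for real transaction features and fraud labels.
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 8))            # placeholder transaction features
    y = (X[:, 0] + X[:, 1] > 1).astype(int)   # placeholder fraud labels

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0)
    model = GradientBoostingClassifier().fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))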
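
A minimal sketch of a generic PySpark flat-file framework as referenced above; the FILE_SPECS registry, paths, formats and options are hypothetical, and the real framework presumably handled many more file types.

    # Config-driven flat-file reader in PySpark. The registry entries,
    # paths and options below are hypothetical placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("flat-file-framework").getOrCreate()

    FILE_SPECS = {
        "orders":    {"path": "/data/in/orders", "format": "csv",
                      "options": {"header": "true", "sep": "|"}},
        "customers": {"path": "/data/in/customers", "format": "json",
                      "options": {}},
    }

    def load(name):
        """Read any registered flat file into a DataFrame using its spec."""
        spec = FILE_SPECS[name]
        return (spark.read.format(spec["format"])
                .options(**spec["options"])
                .load(spec["path"]))

    orders = load("orders")
    orders.printSchema()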
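
A minimal sketch of the customer-segmentation approach described above: K-Means on scaled demographic and behavioral features. The toy DataFrame and its columns are hypothetical.

    # Illustrative K-Means customer segmentation (scikit-learn).
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    customers = pd.DataFrame({            # hypothetical customer features
        "age":    [23, 45, 31, 52, 38, 27],
        "income": [40000, 90000, 55000, 120000, 70000, 48000],
        "visits": [12, 3, 8, 2, 5, 10],
    })

    # Scale features so income does not dominate the distance metric,
    # then assign each customer to one of two segments.
    features = StandardScaler().fit_transform(customers)
    customers["segment"] = KMeans(n_clusters=2, n_init=10,
                                  random_state=0).fit_predict(features)
    print(customers)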

Confidential, NJ

Software Engineer

Responsibilities:

  • Extensively involved in all phases of data acquisition, data collection, data cleaning, model development, model validation and visualization to deliver data science solutions.
  • Extracted data from database, copied into Hadoop Distributed File system (HDFS) and used Hadoop tools such as Hive and Pig Latin to retrieve the data required for building models.
  • Worked on data cleaning and ensured data quality, consistency, integrity using Pandas, NumPy.
  • Tackled a highly imbalanced fraud dataset using sampling techniques such as down-sampling, up-sampling and SMOTE (Synthetic Minority Over-sampling Technique) with Python scikit-learn (see the SMOTE sketch after this list).
  • Worked on loading AVRO/Parquet/TXT files into the Spark framework using Scala, creating Spark DataFrames and RDDs to process the data and saving files in Parquet format in HDFS for loading into fact tables using the ORC reader.
  • Migrated MapReduce programs into Spark transformations using Scala.
  • Implemented a Python-based distributed random forest via PySpark and MLlib (a sketch follows this list).
  • Used AWS S3, DynamoDB, AWS Lambda and AWS EC2 for data storage and model deployment.
  • Created and maintained Tableau reports to display the status and performance of deployed models and algorithms.
  • Implemented CI/CD pipelines allowing deployment to multiple client Kubernetes/AWS environments.
  • Worked on Hive to implement web interfacing and stored the data in Hive external tables.
  • Implemented Hive Partitioning and Bucketing on the collected data in HDFS.
  • Involved in data querying and summarization using Hive and created UDFs, UDAFs and UDTFs.
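
A minimal sketch of rebalancing an imbalanced fraud dataset with SMOTE, assuming the imbalanced-learn library; the synthetic dataset stands in for the real one.

    # SMOTE oversampling sketch (imbalanced-learn). The dataset is synthetic.
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                               random_state=0)
    print("before:", Counter(y))          # heavily skewed toward class 0

    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print("after: ", Counter(y_res))      # classes balanced by synthesis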
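
A minimal sketch of a distributed random forest with the Spark ML API, as referenced above; the tiny in-memory DataFrame and its column names are hypothetical.

    # Distributed random forest sketch (PySpark ML).
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    spark = SparkSession.builder.appName("rf-sketch").getOrCreate()
    df = spark.createDataFrame(
        [(0.0, 1.2, 3.4, 0.0), (1.0, 0.1, 2.2, 1.0),
         (0.5, 2.3, 1.1, 0.0), (1.5, 0.4, 2.9, 1.0)],
        ["f1", "f2", "f3", "label"])

    # Assemble feature columns into a single vector, then train.
    assembled = VectorAssembler(inputCols=["f1", "f2", "f3"],
                                outputCol="features").transform(df)
    model = RandomForestClassifier(labelCol="label", numTrees=50).fit(assembled)
    model.transform(assembled).select("label", "prediction").show()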

Confidential, TX

Jr. Software Engineer

Responsibilities:

  • Performed advanced and predictive data analytics using data science techniques to predict whether a medical claim is legitimate or fraudulent, using effective and powerful machine learning algorithms.
  • Worked on a POC for extracting real-time data using Kafka and Spark Streaming by creating DStreams, converting them into RDDs, processing the data and storing it in Cassandra.
  • Imported data from RDBMS systems like MySQL into HDFS using Sqoop.
  • Worked on different file formats (ORC, Parquet, Avro) and different compression codecs (GZIP, Snappy, LZO).
  • Experience with the CDH distribution and Cloudera Manager to manage and monitor Hadoop clusters.
  • Developed Oozie bundles to schedule Pig, Sqoop and Hive jobs to create data pipelines.
  • Created data pipelines per business requirements and scheduled them using Oozie coordinators.
  • Used Spark SQL to load JSON data, create SchemaRDDs and load them into Hive tables, and handled structured data using Spark SQL (see the JSON-to-Hive sketch after this list).
  • Used Spark SQL to load data into Hive tables and wrote queries to fetch data from those tables.
  • Processed raw data from CSV files into organized form by applying data cleaning techniques using Pandas and NumPy.
  • Performed analysis using CountVectorizer, TF-IDF and LinearSVC, pipelining them to develop a model that predicts spam and classifies which emails to respond to and which to send to junk (see the pipeline sketch after this list).
  • Built a confusion matrix and classification report to evaluate the performance of the algorithm.
  • Ensured jobs ran successfully by resolving data quality issues using SQL, efficient coding practices, macros and stored procedures.
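
A minimal sketch of the Spark SQL JSON-to-Hive loading described above; the input path, table name and event_type column are hypothetical placeholders.

    # Load JSON with Spark SQL and persist it as a Hive table.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("json-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    events = spark.read.json("/data/raw/events.json")  # schema inferred
    events.write.mode("overwrite").saveAsTable("events")

    # Query the structured data back out of Hive.
    spark.sql("SELECT event_type, COUNT(*) AS n "
              "FROM events GROUP BY event_type").show()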
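
A minimal sketch of the CountVectorizer/TF-IDF/LinearSVC spam pipeline described above; the four sample messages are placeholders for the real email corpus.

    # Spam-classification pipeline sketch (scikit-learn).
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.svm import LinearSVC

    texts = ["win a free prize now", "meeting moved to 3pm",
             "free cash offer", "see agenda attached"]
    labels = [1, 0, 1, 0]   # 1 = spam, 0 = ham

    clf = Pipeline([
        ("counts", CountVectorizer()),     # document-term matrix
        ("tfidf",  TfidfTransformer()),    # TF-IDF weighting
        ("svc",    LinearSVC()),           # linear SVM classifier
    ]).fit(texts, labels)

    print(clf.predict(["claim your free prize", "agenda for tomorrow"]))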
