Data Scientist / Big Data Engineer Resume
Austin, TX
SUMMARY
- 8+ years of IT experience in Big Data, Data Science (Machine Learning, Deep Learning, NLP/Text Mining), Data/Business Analytics, Data Visualization, Data Operations, and BI.
- Experience in Data Science/Machine Learning across domains such as Data Analytics, Predictive Modeling, Natural Language Processing (NLP) and Deep Learning.
- Proficient with a wide variety of Data Science languages and libraries: Python, R, SQL, PySpark, scikit-learn, NumPy, SciPy, Pandas, NLTK, TextBlob, Gensim, spaCy, Keras and TensorFlow.
- Excellent understanding of Hadoop architecture, Hadoop daemons and components such as HDFS, YARN, Resource Manager, Node Manager, NameNode, DataNode and the MapReduce programming paradigm.
- Extensive experience with the Big Data ecosystem using the Hadoop framework and related technologies such as HDFS, MapReduce, Hive, Pig, HBase, Storm, YARN, Oozie, Sqoop, Airflow and ZooKeeper, along with working experience in Spark Core, Spark SQL, Spark Streaming, Scala and Kafka.
- Experienced in facilitating the entire lifecycle of a data science project: Data Cleaning, Data Extraction, Data Pre-Processing, Dimensionality Reduction, Algorithm implementation, Back Testing and Validation.
- Expert in Machine Learning algorithms such as Ensemble Methods (Random forests), Linear, Polynomial, Logistic Regression, Regularized Linear Regression, Support Vector Machines (SVM), Deep Neural Networks, Extreme Gradient Boosting, Decision Trees, K-Means, K-NN, Gaussian Mixture Models, Naive Bayes.
- Experienced in working with Datasets, Spark SQL, DataFrames and RDDs during the ingestion process itself, handling large DataFrames using partitioning, Spark in-memory capabilities, effective and efficient joins, broadcast variables, User Defined Functions (UDFs), User Defined Aggregated Functions (UDAFs), actions and transformations.
- Experience in converting Hive/SQL queries into RDD transformations in a Spark environment using Scala and Python.
- Well versed in working with structured and unstructured data, time series data and statistical methodologies such as hypothesis testing, ANOVA, multivariate statistics, modeling, decision theory and time-series analysis.
- Proficient in data transformations using log, square root, reciprocal, cube root, square and Box-Cox transformations depending upon the dataset (a brief sketch follows this list).
- Experience with relational and non-relational databases such as MySQL, SQL Server, Oracle, MongoDB, Cassandra and PostgreSQL.
- Adroit at employing various Data Visualization tools like Tableau, Matplotlib, Seaborn, ggplot2, and Plotly.
- Experience with practical implementation of cloud-specific AWS technologies including IAM, Elastic Compute Cloud (EC2), Simple Storage Service (S3), Virtual Private Cloud (VPC), Lambda, EBS, and EMR.
- Proficient with container systems like Docker and container orchestration such as EC2 Container Service (ECS) and Kubernetes; worked with Terraform.
- Used Kubernetes to orchestrate the deployment, scaling and management of Docker containers.
- Expertise in building and publishing customized interactive reports and dashboards with customized parameters and user filters using Tableau.
- Experience with complex Data processing pipelines, including ETL and Data ingestion dealing with unstructured and semi-structured Data.
- Good communication and presentation skills, willing to learn, adapt to new technologies and third-party products.
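The sketch below illustrates the distribution-correcting transformations mentioned above (log, square root, Box-Cox) on a hypothetical skewed column; the DataFrame and column name are placeholders, not data from any project.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical right-skewed feature, used only for illustration
df = pd.DataFrame({"claim_amount": np.random.lognormal(mean=3.0, sigma=1.0, size=1000)})

# Simple monotonic transforms to reduce skew
df["claim_amount_log"] = np.log1p(df["claim_amount"])
df["claim_amount_sqrt"] = np.sqrt(df["claim_amount"])

# Box-Cox searches for the power (lambda) that best normalizes the data;
# it requires strictly positive input values.
df["claim_amount_boxcox"], fitted_lambda = stats.boxcox(df["claim_amount"])

print(f"Fitted Box-Cox lambda: {fitted_lambda:.3f}")
print(df[["claim_amount", "claim_amount_log", "claim_amount_boxcox"]].skew())
```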
TECHNICAL SKILLS
Languages: Python, R, Java, SQL, PySpark, SAS, Scala, C++, C
Big Data Technologies: HDFS, MapReduce, Hive, Pig, YARN, Sqoop, Flume, HBase, Kafka, Impala, Oozie, Spark, ZooKeeper, Airflow.
Hadoop Distributions: Apache Hadoop, Cloudera CDP, Hortonworks HDP
Databases: SQL Server, Oracle, SQLite, HBase, MongoDB, Cassandra, PostgreSQL, DynamoDB.
Operating Systems: Windows, Linux, Unix, Mac OS.
Cloud Technologies: AWS, Azure, Google Cloud Platform (GCP).
Reporting Tools: Tableau, Power BI, MS BI (SSIS, SSRS, SSAS)
Machine Learning: RNN, CNN, Regression (Linear and Logistic), Decision trees, Random Forest, SVM, KNN, PCA.
ML Frameworks: PyTorch, Pandas, Keras, NumPy, TensorFlow, scikit-learn, NLTK, OpenCV, Caffe.
Methodologies: Software Development Lifecycle (SDLC), Waterfall, Agile
PROFESSIONAL EXPERIENCE
Confidential, Austin, TX
Data Scientist / Big Data Engineer
Responsibilities:
- Participated in all phases of the project life cycle including data collection, development, validation and delivery of algorithms, statistical models and reports.
- Involved in team meetings, discussions with business teams to understand the business use cases.
- Used supervised and unsupervised techniques such as Logistic Regression, Random Forest, autoencoder neural networks, DBSCAN, Isolation Forest, Local Outlier Factor, Elliptic Envelope and One-Class Support Vector Machines to classify providers into fraud and non-fraud categories (a brief sketch follows this list).
- Performed computer vision tasks such as object detection, image classification and image anomaly detection on CT scan and MRI images using CNNs and the Mask R-CNN (Mask Region-based Convolutional Neural Network) model with Keras and TensorFlow.
- Analyzed data using SQL, R, Scala, Python, and presented analytical reports to management and technical teams.
- Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on AWS.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
- Developed batch scripts to fetch data from AWS S3 storage and perform the required transformations in Scala using the Spark framework.
- Created generic UDFs in Hive to process business logic that varies based on policy.
- Moved relational database data into Hive dynamic partition tables using Sqoop and staging tables.
- Performed data cleaning and feature selection using the machine learning package in PySpark and worked with deep learning frameworks such as Caffe, TensorFlow and Keras.
- Used a computer vision API for Optical Character Recognition (OCR) to extract text and digits from images and PDF documents.
- Performed Natural Language Processing (NLP) tasks such as sentiment analysis, entity recognition, topic modeling and text summarization using Python libraries such as NLTK, TextBlob, spaCy and Gensim.
- Created real-time data streaming solutions and batch-style, large-scale distributed computing applications using Apache Spark, Spark Streaming and Kafka.
- Designed, built and deployed a set of Python modeling APIs for customer analytics that integrate multiple machine learning techniques for user behavior prediction and support multiple marketing segmentation programs.
- Generated various models using different machine learning and deep learning frameworks and tuned the best-performing model using Signal Hub and AWS SageMaker/Databricks.
- Implemented AWS solutions using EC2, S3, RDS, EBS, Elastic Load Balancer and Auto Scaling groups.
- Deployed, managed and operated scalable, highly available and fault-tolerant systems on AWS.
- Worked on NoSQL databases like MongoDB and HBase.
- Executed Hadoop/Spark jobs on AWS EMR with input and output data stored in S3 buckets.
- Extracted data from HDFS using Hive, performed data analysis using Spark with Scala, PySpark and Redshift, performed feature selection and created nonparametric models in Spark.
- Scheduled Airflow workflows to run multiple Hive jobs.
- Loaded clickstream data from Kafka into HDFS, HBase and Hive.
- Experience with container-based deployments using Docker, working with Docker images, Docker Hub, Docker registries and Kubernetes.
- Used Jenkins pipelines to drive all microservice builds out to the Docker registry and deploy them to Kubernetes; created and managed Pods using Kubernetes.
- Segmented customers based on demographic, geographic, behavioral and psychographic data using K-Means clustering. Designed and implemented end-to-end systems for data analytics and automation, integrating custom visualization tools using Python and Tableau.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time.
- Extensively used Power BI, pivot tables and Tableau to manipulate large datasets and develop visualization dashboards.
- Used Apache Spark for big data processing, streaming, SQL and machine learning (ML).
- Used Pandas DataFrames, NumPy, Jupyter Notebook, SciPy, scikit-learn, TensorFlow, Keras and Theano as tools for machine learning and deep learning.
- Wrote complex SQL statements against the RDBMS database to filter data and perform data analytics.
- Worked in an Agile environment, participating in sprint planning and daily stand-ups, managing project timelines, and communicating with clients to ensure projects progressed satisfactorily.
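A minimal sketch of combining an unsupervised anomaly detector with a supervised classifier for the fraud/non-fraud provider classification described above; the file path, feature columns and parameters are hypothetical placeholders, not details of the actual engagement.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical provider-level aggregates with a labeled fraud flag
providers = pd.read_csv("provider_features.csv")
X = providers.drop(columns=["provider_id", "is_fraud"])
y = providers["is_fraud"]

# Unsupervised view: Isolation Forest flags anomalous providers without using labels
iso = IsolationForest(contamination=0.05, random_state=42)
providers["anomaly_flag"] = iso.fit_predict(X)  # -1 = anomalous, 1 = normal

# Supervised view: Random Forest trained on the labeled fraud / non-fraud providers
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```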
Environment: AWS, EC2, EMR, Hadoop Framework, S3, Map Reduce, HDFS, Spark (Pyspark, MLlib, Spark SQL), Python (Scikit-Learn/Scipy/Numpy/Pandas/NLTK/Matplotlib/Seaborn), Hive, Scala, Docker, Kubernetes, Jenkins, Tableau Desktop, Tableau Server, Machine Learning (Regressions, KNN, SVM, Decision Tree, Random Forest, XGboost, LightGBM, Ensemble), NLP, Power BI, Kafka, AirFlow, Agile/SCRUM
Confidential, Dallas, TX
Data Scientist / Big Data Engineer
Responsibilities:
- Responsible for clarifying business objectives, data cleaning, data preprocessing, exploratory data analysis, feature scaling, machine learning modeling, model tuning and model testing.
- Worked closely with internal stakeholders such as business teams, product managers, engineering teams and partner teams.
- Implemented different kinds of visualizations in Tableau such as text tables, packed bubbles, horizontal stacked bars, pie charts, bar graphs and tree maps. Created data stories and dashboards in Tableau and presented the results to stakeholders in an innovative, informative format.
- Developed MapReduce/Spark modules for machine learning and predictive analytics in Hadoop on AWS.
- Implemented usage of Amazon EMR for processing Big Data across Hadoop Cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Used Python to analyze data, plot visualizations with matplotlib and seaborn, and implement ML algorithms for large dataset analysis.
- Selected features and built and optimized classifiers using machine learning techniques.
- Evaluated the accuracy and precision of the algorithms using a variety of validation techniques.
- Implemented machine learning models in Spark using PySpark.
- Evaluated the models using cross-validation and ROC curves; tested the models with performance metrics such as F1 score, precision, recall, log loss, accuracy and AUC (a brief sketch follows this list).
- Performed data wrangling to clean, transform and reshape data using the NumPy and Pandas libraries.
- Proactively monitored systems and services; handled architecture design and implementation of Hadoop deployment, configuration management, backup and disaster recovery systems and procedures.
- Involved in transferring Streaming data from different data sources into HDFS and NoSQL databases.
- Implemented Spark jobs using Scala, utilizing DataFrames and the Spark SQL API for faster data processing.
- Configured Spark Streaming to receive real-time data from Apache Kafka and store the stream data in HDFS using Scala.
- Developed custom Apache Spark programs in Scala to analyze and transform unstructured data.
- Updated Python scripts to match training data with our database stored in AWS CloudSearch, so that we could assign each document a response label for further classification.
- Built an enterprise Spark ingestion framework to ingest data from different sources; the framework is 100% metadata-driven with full code reuse, letting other developers concentrate on core business logic rather than Spark/Scala coding.
- Created Spark jobs to process complex/nested JSON data, apply transformations based on business rules and write the DataFrames to upsert records into RDBMSs (PostgreSQL and DB2).
- Ran spark-submit with dynamic memory allocation, tuned shuffle partitions and read configurations from an application config file using the Typesafe Config dependency.
- Developed Sqoop scripts to handle the interaction between Hive and the Vertica database.
- Developed solutions to process data into HDFS and analyzed it using MapReduce and Hive to produce summary results from Hadoop for downstream systems.
- Worked on data loading and transformation tasks using external sources, merged data, performed data enrichment and loaded it into target data destinations. Used Hive to analyze the partitioned and bucketed data and compute various metrics for dashboard reporting.
- Developed data pipelines for real-time use cases using Kafka and Spark Streaming; tuned multiple Spark applications for better performance (a brief sketch follows this list).
- Involved in development of test environment on Docker containers and configuring the Docker containers using Kubernetes.
- Created Tableau dashboards and reports to regularly communicate results and monitor key metrics.
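A minimal sketch of the cross-validation and ROC/metric-based model evaluation described above, shown on a synthetic imbalanced dataset from scikit-learn; the model choice and parameters are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, log_loss, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for a real labeled dataset (roughly 10% positive class)
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=0)

model = LogisticRegression(max_iter=1000)

# 5-fold stratified cross-validation scored on ROC AUC
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("CV AUC: %.3f +/- %.3f" % (auc_scores.mean(), auc_scores.std()))

# Hold-out evaluation with several of the metrics listed above
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
model.fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]
preds = (proba >= 0.5).astype(int)
print("AUC:", roc_auc_score(y_te, proba))
print("Precision:", precision_score(y_te, preds), "Recall:", recall_score(y_te, preds))
print("F1:", f1_score(y_te, preds), "Log loss:", log_loss(y_te, proba))
```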
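A minimal sketch of a Kafka-to-HDFS streaming pipeline like the ones described above, written here with PySpark Structured Streaming for illustration (the streaming work described above was implemented in Scala); broker, topic and path names are placeholders, and the Kafka source requires the spark-sql-kafka connector on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Read a Kafka topic as an unbounded streaming DataFrame
raw = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")
            .option("subscribe", "clickstream")
            .option("startingOffsets", "latest")
            .load())

# Kafka delivers key/value as binary; cast the value to string before parsing
events = raw.select(F.col("value").cast("string").alias("json_payload"),
                    F.col("timestamp"))

# Write the stream to HDFS as Parquet, with checkpointing for fault tolerance
query = (events.writeStream
               .format("parquet")
               .option("path", "hdfs:///data/clickstream/raw")
               .option("checkpointLocation", "hdfs:///checkpoints/clickstream")
               .outputMode("append")
               .start())

query.awaitTermination()
```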
Environment: Python, HDFS, Apache Spark, Spark Machine Learning, AWS, EMR, Pandas, DataFrames, TensorFlow, Docker, Kubernetes, NumPy, SciPy, scikit-learn, Spark SQL, Scala, MapReduce, RDD, Hive, HBase, Kafka, Git, GitLab, Shell scripts, Tableau.
Confidential, Dearborn, MI
Data Scientist / Big Data Engineer
Responsibilities:
- Performed variable identification and checked for percentage of missing values, data types, outliers, etc.
- Performed univariate analysis, examining descriptive statistics such as mean, median, mode, range, standard deviation and variance; checked for missing data, detected outliers, checked normality with skewness and kurtosis, and presented the results with histograms, box plots, etc.
- Performed bivariate analysis using correlation and inferential statistical tests such as Z-test, t-test, Chi-square and ANOVA to check for multicollinearity and singularity, and presented the results using scatter plots, bar charts, line charts, etc.
- Performed outlier detection and treatment in Python using techniques such as Median Absolute Deviation (MAD), Minimum Covariance Determinant, histograms and box plots.
- Performed feature engineering such as missing value imputation, normalization and scaling, outlier detection and treatment, one-hot encoding and feature splitting, and used LabelEncoder to convert categorical variables to numerical values with the Python scikit-learn library.
- Performed Exploratory Data Analysis (EDA) through various plots and graphs using the matplotlib, NumPy, Pandas, scikit-learn and seaborn libraries in Python to understand and discover patterns in the data. Calculated the Pearson correlation coefficient to deal with multicollinearity.
- Applied various classification models such as Naïve Bayes, Logistic Regression, Random Forests and Support Vector Classifiers from the scikit-learn library and improved model performance using ensemble learning methods such as Random Forests, XGBoost and Gradient Boosting in scikit-learn.
- Addressed overfitting and underfitting by using K-fold cross-validation.
- Applied K-Means clustering to look for churn patterns among customers based on various features.
- Used confusion matrices and classification reports to evaluate the accuracy and performance of the different models. Evaluated model performance using metrics such as precision, recall, F-score and AUC-ROC, and used cross-validation to test the models with different batches of data and optimize them.
- Applied manual hyperparameter tuning using grid search to train better-performing models. Created a 3-node Spark cluster and applied different transformations and actions in Spark.
- Used different Spark APIs: Spark SQL to create Spark DataFrames, and spark.ml/spark.mllib to build machine learning models in Spark (a brief sketch follows this list).
- Migrated existing MapReduce programs to Spark models using Python.
- The system pulls information from multiple data sources and ingests it into the system data lake.
- Developed a Sqoop incremental import job and shell script for importing data into HDFS.
- Imported data from HDFS into Hive using Hive commands.
- Implemented a POC to migrate MapReduce jobs to Spark RDD transformations using Scala.
- Assisted reporting teams in developing Tableau visualizations and dashboards using Tableau Desktop.
- Created Hive Tables, loaded retail transactional data from Teradata using Sqoop.
- Worked on Kafka to collect and load the data on Hadoop file systems.
- Migrated code from Hive to Apache Spark using Spark, Spark SQL and RDDs.
- Created Oozie workflows for file-watcher configuration and job scheduling.
- Hands-on experience in using Flume to transfer log data files to Hadoop Distributed File System.
- Worked on Caching, Persisting and Repartitioning the DataFrames.
- Created Rundeck jobs to run Spark jobs on an AWS EMR cluster.
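A minimal sketch of building a model with Spark SQL DataFrames and spark.ml as referenced above; the Hive table name, feature columns and label are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler

spark = SparkSession.builder.appName("churn-model-sketch").enableHiveSupport().getOrCreate()

# Hypothetical Hive table of customer features with a categorical "churned" label
df = spark.table("analytics.customer_features")

indexer = StringIndexer(inputCol="churned", outputCol="label")
assembler = VectorAssembler(
    inputCols=["tenure", "monthly_charges", "num_support_calls"],
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[indexer, assembler, lr])
train, test = df.randomSplit([0.8, 0.2], seed=42)

model = pipeline.fit(train)
predictions = model.transform(test)
auc = BinaryClassificationEvaluator(metricName="areaUnderROC").evaluate(predictions)
print(f"Test AUC: {auc:.3f}")
```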
Environment: Python, Spark, Scala, AWS, EMR, Hive, Oozie, HDFS, Kafka, SQL, HBase, MapReduce, Tableau, Shell Script.
Confidential, Atlanta, GA
Big Data Engineer
Responsibilities:
- Responsible for building scalable distributed data solutions using a Hadoop cluster environment with the Hortonworks distribution.
- Used Sqoop to load data from relational databases.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
- Worked with CSV, JSON, Avro and Parquet file formats.
- Worked with Elastic App Search, which provides a rich set of APIs for ingesting and searching content, an intuitive UI for analyzing and tuning relevance, and an open-source library for quickly implementing rich search experiences.
- Used Hive to form an abstraction on top of structured data residing in HDFS and implemented partitions and buckets on Hive tables.
- In-depth understanding of MapReduce and AWS cloud concepts and their critical role in analyzing huge and complex datasets.
- Developed and implemented real-time data pipelines with Spark Streaming.
- Designed, developed data integration programs in a Hadoop environment with NoSQL data store HBase for data access and analysis.
- Worked with Python to develop analytical jobs using Spark's PySpark API.
- Used the Apache Oozie job scheduler to execute workflows.
- Used Ambari to monitor node health and job status and to run analytics jobs on Hadoop clusters.
- Used PySpark and Python scripting with Spark libraries for data analysis.
- Worked on Tableau to build customized interactive reports, worksheets, and dashboards.
- Performed performance tuning of Spark jobs using caching and by taking full advantage of the cluster environment (a brief sketch follows this list).
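A minimal sketch of the caching and repartitioning style of Spark job tuning mentioned above, using PySpark; the S3 paths, column names and partition count are placeholders chosen for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tuning-sketch").getOrCreate()

# Hypothetical events dataset
events = spark.read.parquet("s3://example-bucket/events/")

# Repartition on the aggregation key so work is spread evenly across executors
events = events.repartition(200, "customer_id")

# Cache a DataFrame that several downstream aggregations reuse
events.cache()

daily_counts = (events
                .groupBy("customer_id", F.to_date("event_ts").alias("event_date"))
                .count())
totals = events.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))

daily_counts.write.mode("overwrite").parquet("s3://example-bucket/output/daily_counts/")
totals.write.mode("overwrite").parquet("s3://example-bucket/output/totals/")

events.unpersist()
```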
Environment: Hadoop, Spark, Scala, Python, AWS, Talend, MapReduce, Hive, Sqoop, PySpark, Ambari, Oozie, HBase, Tableau, Jenkins, Hortonworks.
Confidential
Hadoop Developer
Responsibilities:
- Documented the complete process flow to describe program development, logic, testing, implementation, application integration, and coding.
- Recommended structural changes and enhancements to systems and databases.
- Conducted design and technical reviews with other project stakeholders.
- Was part of the complete project life cycle, from requirements through production support.
- Created test plan documents for all back-end database modules.
- Used MS Excel, MS Access and SQL to write and run various queries.
- Worked extensively on creating tables, views and SQL queries in Oracle 10g.
- Worked with internal architects, assisting in the development of current- and target-state data architectures.
- Coordinated with business users to design new reporting capabilities in an appropriate, effective and efficient way, based on user needs and existing functionality.
- Remained knowledgeable in all areas of business operations to identify systems needs and requirements.
Environment: UNIX, SQL, Oracle 10g, MS Office and MS Visio.
