Data Science Engineer Resume
Plano, TX
SUMMARY:
- 5+ years of experience on various IT systems and applications built with open-source technologies, covering analysis, design, coding, testing, implementation, and training; strong client-server computing skills and a good understanding of big data technologies and machine learning.
- Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, and scikit-learn in Python to develop machine learning workflows, applying algorithms such as linear regression, multivariate regression, Naive Bayes, random forests, k-means, and KNN for data analysis (see the sketch after this list).
- Experience with Lucene index-based search in Elasticsearch and with the ELK log-analytics stack (Elasticsearch, Logstash, Kibana).
- Designed NoSQL database schemas to help migrate a legacy application's datastore to Elasticsearch.
- Designed Elasticsearch/Logstash/Kibana-based logs and metrics pipelines and performed KPI-based cloud monitoring.
- Experienced in in-memory data processing for batch, real-time, and advanced analytics using Apache Spark (Spark SQL and spark-shell).
- Good knowledge of integrating Spark Streaming with Kafka for real-time processing of streaming data.
- Aggregated data through Kafka, HDFS, Hive, Scala, and Spark Streaming on Amazon AWS.
- Worked on big data analytics with the Hadoop ecosystem (Hadoop, Hive) and Spark, including integration with R.
- Extensive knowledge of implementing machine learning programs in Python and R.
- Experience using and developing solutions with the Hadoop ecosystem (Hadoop, Spark, Hive, Sqoop, ZooKeeper, Kafka) and NoSQL databases such as HBase.
- Experience working with cloud infrastructure such as Amazon Web Services (AWS).
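A minimal sketch of the kind of scikit-learn workflow referenced above; the input file, column names, and model settings are illustrative placeholders, not taken from a specific project.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error

    # Hypothetical dataset: feature columns plus a numeric "target" column.
    df = pd.read_csv("measurements.csv")
    X = df.drop(columns=["target"])
    y = df["target"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Fit and compare two of the model families named above.
    for model in (LinearRegression(), RandomForestRegressor(n_estimators=100, random_state=42)):
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        print(type(model).__name__, mean_squared_error(y_test, preds))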
TECHNICAL SKILLS:
Roles: Data Science Engineer, Big Data Engineer, Spark Developer, Data Analyst, Project Engineer
Programming: Python, R, C, SQL (familiar with Scala, SAS)
Tools: Spyder, IPython Notebook/Jupyter, Spark Notebook, Zeppelin Notebook (familiar with Git, Docker)
Cloud: AWS (EMR, EC2, S3; also Hadoop run directly on EC2)
Big Data: ELK Stack, Spark, Hadoop, Hive, Pig, Sqoop (familiar with Cloudera Search)
DB Languages: SQL, PL/SQL, Oracle, Hive, Spark SQL, MemSQL
Domain: Big Data, Data Mining, Data Analytics, Machine Learning, Natural Language Processing
PROFESSIONAL EXPERIENCE:
Data Science Engineer
Confidential, Plano, TX
Responsibilities:
- Served as the technical point of contact for upper management, business analysts, project management, and other groups on the proactive monitoring project.
- Analyzed and solved an anomaly-detection problem for root-cause events behind set-top box (STB) failures; selected and prototyped the ELK stack.
- Core developer of the Elasticsearch and X-Pack machine learning solution for proactive monitoring, anomaly detection, and alert generation.
- Created real-time dashboards (KPI reporting, performance monitoring, geo-based error display, historical search) used by IHD and Command Center reps to support customers' STB troubleshooting and diagnostics.
- Created machine learning jobs in the Kibana X-Pack ML component for anomaly detection, with watchers on those jobs for root-cause analysis to predict KPI metrics and errors; configured anomaly alerting so the team is notified when an anomaly occurs (a minimal job-creation sketch follows this list).
- Generated scheduled reports for Kibana dashboards and visualizations.
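A hedged sketch of creating an X-Pack ML anomaly-detection job of the kind described above, issued over the REST API from Python. The cluster URL, field names, job id, and credentials are placeholders, and the endpoint prefix depends on the Elasticsearch version (_xpack/ml on 5.x/6.x, _ml on 7.x+).

    import requests

    ES_URL = "http://localhost:9200"  # assumed cluster address

    job_config = {
        "description": "High error-count anomalies per STB model",
        "analysis_config": {
            "bucket_span": "15m",
            "detectors": [
                {"function": "high_count", "partition_field_name": "stb_model"}
            ],
        },
        "data_description": {"time_field": "@timestamp"},
    }

    # Create the anomaly-detection job (5.x/6.x-style endpoint shown).
    resp = requests.put(
        f"{ES_URL}/_xpack/ml/anomaly_detectors/stb-error-anomalies",
        json=job_config,
        auth=("elastic", "changeme"),  # placeholder credentials
    )
    resp.raise_for_status()
    print(resp.json())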
Technologies used: Elasticsearch, Logstash, Kibana, Kafka, Machine Learning, Python
Big Data Engineer
Confidential, Warren, NJ
Responsibilities:
- Designed the architecture and rewrote the DMAT application from scratch using the ELK stack, integrating it with other applications.
- Strategy to improve business KPIs: analyzed existing products and KPIs and recommended short-term and long-term ideas to improve them via DMAT.
- Built an ETL process for continuously bulk-importing DMAT data from SQL Server into Elasticsearch.
- Designed and implemented large-scale pub-sub message queues using Apache Kafka (see the Kafka sketch after this list).
- Configured ZooKeeper, Kafka, and Logstash clusters for data ingestion, tuned Elasticsearch performance, and used Kafka for live streaming of data.
- Indexed and queried a substantial number of documents (~400 million) in Elasticsearch; created a Kibana dashboard for sanity-checking the data and another showing overall build status with drill-down features.
- Set up and optimized the ELK (Elasticsearch, Logstash, Kibana) stack and integrated Apache Kafka for data ingestion.
- Created geo-mapping visualizations in Kibana to show data points on a US map and used Kibana reporting.
- Developed a Spark job that loads large volumes of data from HDFS, applies transformations and pre-processing on the fly, and loads the data into Elasticsearch (a PySpark sketch follows this list).
- Migrated data into Elasticsearch through the ES-Spark integration and created mappings and indices in Elasticsearch for quick retrieval.
- Created data discovery views, visualizations, and dashboards in Kibana for quick analysis of the data.
- Proof-of-concept work on replacing SQL Server-backed data-point retrieval with Elasticsearch, resulting in a thousand-fold speedup.
- Designed and developed data import, aggregation, and advanced analytics on top of MemSQL as quick POCs in the initial stages of the product.
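A minimal sketch of the Kafka pub-sub pattern mentioned above, using the kafka-python client; the broker address, topic name, and message fields are illustrative placeholders.

    import json
    from kafka import KafkaProducer, KafkaConsumer

    # Publish a JSON event to a topic.
    producer = KafkaProducer(
        bootstrap_servers="broker:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("dmat-events", {"device_id": "abc123", "status": "ERROR"})
    producer.flush()

    # Consume events from the same topic as part of the ingestion pipeline.
    consumer = KafkaConsumer(
        "dmat-events",
        bootstrap_servers="broker:9092",
        group_id="dmat-ingest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        print(message.value)  # hand off downstream (e.g. to Logstash/Elasticsearch)
        break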
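A hedged sketch of the HDFS-to-Elasticsearch Spark job described above, assuming the elasticsearch-hadoop connector jar is on the Spark classpath; paths, index name, and column names are placeholders.

    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("dmat-es-load")
             .config("spark.es.nodes", "es-host")
             .config("spark.es.port", "9200")
             .getOrCreate())

    # Load raw JSON records from HDFS.
    raw = spark.read.json("hdfs:///data/dmat/raw/")

    # Light pre-processing on the fly: drop incomplete records, parse timestamps.
    cleaned = (raw
               .filter(F.col("status").isNotNull())
               .withColumn("event_ts", F.to_timestamp("event_time")))

    # Write into Elasticsearch via the ES-Spark integration.
    (cleaned.write
        .format("org.elasticsearch.spark.sql")
        .option("es.resource", "dmat-events/_doc")
        .mode("append")
        .save())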
Environment: Python, Spark, Elasticsearch, Hive, HDFS, Kafka, Logstash, Kibana, Jupyter, IntelliJ, MemSQL
Data Analyst
Confidential
Responsibilities:
- Conducted exploratory analysis of the data and studied different imputation methods that could be applied to it.
- Performed ad-hoc data visualizations using ggplot2 in R to evaluate existing models.
- Implemented Spark jobs in Python, using Spark SQL to access Hive tables in Spark for faster data processing.
- Programmed in Python (Pandas, NumPy, scikit-learn, Matplotlib) and R (ggplot2).
- Converted Hive/SQL queries into Spark transformations using Spark RDDs.
- Used Spark transformations such as map, reduceByKey, and filter to clean the input data (see the sketch after this list).
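A minimal sketch of the Hive-to-Spark flow and RDD-style cleanup described above; the table and column names are illustrative placeholders.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-cleanup")
             .enableHiveSupport()  # lets Spark SQL read existing Hive tables
             .getOrCreate())

    # Pull a Hive table into Spark and drop down to the RDD API.
    rows = spark.sql("SELECT device_id, status FROM ops.device_events").rdd

    # Clean and aggregate with classic RDD transformations.
    counts = (rows
              .filter(lambda r: r.status is not None)   # discard bad records
              .map(lambda r: (r.device_id, 1))          # key by device
              .reduceByKey(lambda a, b: a + b))         # events per device

    print(counts.take(5))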
Environment: Python, R, Machine learning, AWS, Apache Hadoop, HDFS, Hive, Pig, Apache Spark, Spark Streaming, Spark SQL, HBase, Kafka, Sqoop, Git.
Project Engineer
Confidential
Responsibilities:
- Configured essential parameters before deploying the Elasticsearch cluster to production.
- Built an ETL process for continuously bulk-importing Teamcenter data from SQL Server into Elasticsearch (see the bulk-import sketch after this list).
- Set up Logstash to centralize and analyze Teamcenter data-management and exchange operations.
- Used the Kibana interface to filter and visualize log messages gathered by the ELK stack.
- Generated histograms and date histograms (histograms over time) in Elasticsearch by giving the aggregation an interval that buckets the data into two-week windows (an aggregation sketch also follows this list).
- Learned to index and search/query millions of documents inside Elasticsearch.
- Analyzed Electrolux sales data, wrote the results to HDFS in Avro as well as to Elasticsearch, and created a Kibana dashboard for sanity-checking the data.
- Involved in data migration from one cluster to another.
- Handled importing data from various data sources and performed transformations using Hive (external tables, partitioning).
- Configured, researched, and developed various use cases; supported the use and operation of BI tools, predictive data modeling, data analytics, and integration platform software such as Talend.
- Worked on Teamcenter implementation and data migration projects, including Teamcenter Engineering and Teamcenter Manufacturing, to meet the requirements of AB Electrolux.
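A hedged sketch of the continuous bulk-import ETL described above: read rows from SQL Server with pyodbc and index them with the Elasticsearch bulk helper. The connection string, table, and index name are placeholders.

    import pyodbc
    from elasticsearch import Elasticsearch, helpers

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sqlhost;DATABASE=teamcenter;"
        "UID=etl_user;PWD=secret"  # placeholder connection string
    )
    es = Elasticsearch(["http://localhost:9200"])

    def row_actions():
        # Stream rows out of SQL Server as bulk index actions.
        cursor = conn.cursor()
        cursor.execute("SELECT item_id, item_name, last_modified FROM tc_items")
        columns = [c[0] for c in cursor.description]
        for row in cursor:
            yield {"_index": "teamcenter-items", "_source": dict(zip(columns, row))}

    # Index in batches; rerun on a schedule (or filter by last_modified) to stay current.
    helpers.bulk(es, row_actions())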
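A hedged sketch of the two-week date-histogram aggregation mentioned above, sent through the official Python client; the index and timestamp field are placeholders, and newer Elasticsearch versions spell the parameter fixed_interval instead of interval.

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    query = {
        "size": 0,
        "aggs": {
            "ops_over_time": {
                "date_histogram": {
                    "field": "@timestamp",
                    "interval": "14d"  # bucket the data into two-week windows
                }
            }
        }
    }

    resp = es.search(index="teamcenter-logs", body=query)
    for bucket in resp["aggregations"]["ops_over_time"]["buckets"]:
        print(bucket["key_as_string"], bucket["doc_count"])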
Technologies used: Hadoop, MapReduce, HBase, Sqoop, Elasticsearch, Talend, Teamcenter