- Over 5 years of experience in the manipulation, wrangling, modeling, and visualization of large data sets.
- Analytical, detail-oriented data science professional with a proven record of success in collecting and manipulating large datasets.
- Demonstrated expertise in decisive leadership and in delivering research-based, data-driven solutions that move an organization's vision forward.
- Highly competent at researching, visualizing, and analyzing raw data to identify recommendations for meeting organizational challenges.
- Proven excellence in personnel management and program development.
- Able to perform data preparation and exploration to build the appropriate machine learning model.
- Proficient in statistical modeling and machine learning techniques for predictive analytics: segmentation methodologies, regression-based models, hypothesis testing, PCA, and ensembles.
- Expertise in machine learning models such as linear and logistic regression, decision trees, Naive Bayes, SVM, neural networks, k-nearest neighbors, and clustering (k-means, hierarchical).
- Apply machine learning techniques to structured and unstructured data with equal proficiency.
- Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, Pivot Tables.
- Able to apply dimensionality reduction and regularization techniques.
- Highly skilled in visualization tools such as Matplotlib, ggplot2, and Seaborn for creating dashboards.
- Experience working with Big Data tools such as Hadoop - HDFS and MapReduce, Hive, Sqoop, and Apache Spark (PySpark).
- Experience working with RDBMS such as SQL Server, MySQL and NoSQL databases such as MongoDB, Cassandra, HBase.
- Experience importing and exporting data between different RDBMS (MySQL, Oracle, and SQL Server) and HDFS/Hive using Sqoop.
- Good knowledge of scalable, secure cloud architecture based on Amazon Web Services (AWS cloud services: EC2, EMR, and S3).
- Strong communication skills, a professional attitude, and the ability to work enthusiastically under pressure.
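As a minimal illustration of the regularization techniques listed above: for one-feature, no-intercept ridge regression, the L2 penalty λ shrinks the fitted weight toward zero via the closed form w = Σxy / (Σx² + λ). The data below is hypothetical, for illustration only:

```python
# One-feature, no-intercept ridge regression in closed form:
#   minimize sum((y - w*x)^2) + lam * w^2  =>  w = sum(x*y) / (sum(x*x) + lam)
# Pure-Python sketch with made-up data.

def ridge_weight(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]        # exactly y = 2x
w_ols = ridge_weight(xs, ys, 0.0)    # lam = 0 recovers the unregularized fit, 2.0
w_ridge = ridge_weight(xs, ys, 1.4)  # lam > 0 shrinks the weight below 2.0
```

Larger λ shrinks the weight further, trading a little bias for lower variance.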
Programming: Python, R, Scala
Python: data manipulation, NumPy, Pandas, Matplotlib, Seaborn, Plotly, scikit-learn, and other machine learning libraries
Big Data: Hadoop, MapReduce, HDFS, Hive, Kafka, Pig, Oozie, Flume, Sqoop, Impala, Spark
Spark: Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX, PySpark, DataFrame
Platforms: Ubuntu, Linux, MacOS
Analytical Tools: SQL, Jupyter Notebook, Apache Zeppelin, MS Excel
Methodologies: Agile, Scrum, Software Development Life Cycle (SDLC)
NoSQL: MongoDB, Cassandra, HBase
Others: AWS, S3, EC2, EMR, MySQL, PostgreSQL
Confidential - Bothell, WA
- Studied the business, the problem statement, and the manual approaches the company had followed for years
- Gathered all required data from multiple sources, such as the data warehouse and the billing department
- Involved in creating a data lake by extracting the customer's big data from various sources into Hadoop HDFS.
- This included data from Excel, flat files, relational databases such as SQL Server, and HBase, as well as server log data
- Performed data cleaning and transformations suitable for applying models, using Pandas and NumPy
- Performed transformations of data using Spark and Hive to generate the final dataset to be consumed by analytical applications
- Performed Exploratory Data Analysis (EDA)
- Participated in feature engineering, including feature generation, PCA, and feature normalization with scikit-learn preprocessing
- Analyzed data and predicted end customer behaviors and product performance by applying machine learning algorithms using Spark MLlib
- Experimented with and built predictive models using logistic regression, decision trees, support vector machines, and KNN to predict customer churn
- Evaluated model performance using the confusion matrix, precision, and recall
- Developed a logistic regression model with 61 percent accuracy
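The churn-model evaluation described above (confusion matrix, precision, recall) can be sketched as follows; the labels and predictions are made-up illustrative data, not the project's actual results:

```python
# Evaluate binary churn predictions (1 = churned) with a confusion
# matrix, precision, and recall. Pure-Python sketch; data illustrative.

def confusion_matrix(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def precision_recall(y_true, y_pred):
    tp, tn, fp, fn = confusion_matrix(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted churners, how many churned
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of actual churners, how many were caught
    return precision, recall

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # illustrative ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # illustrative model predictions
p, r = precision_recall(y_true, y_pred)
```

Precision and recall matter more than raw accuracy here because churners are typically the minority class.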
Environment: HDFS, Hive, Sqoop, Spark, Spark MLlib, SQL, Excel, MongoDB, Python 3 (scikit-learn/SciPy/NumPy/Pandas/Matplotlib/Seaborn), Machine Learning (logistic regression/random forests/KNN/k-means clustering/PCA)
Confidential - San Francisco, CA
- Responsible for researching and developing the action plan required for the development of the model
- Participated in Data Acquisition with Data Engineer team to extract historical and real-time data by using Sqoop, Spark, Hive, Kafka, MapReduce and HDFS
- Worked with datasets of varying size and complexity, including both structured and unstructured data
- Performed data integrity checks, data cleansing, exploratory analysis, and feature engineering using Python and data visualization packages such as Matplotlib and Seaborn
- Utilized data wrangling tools and advanced statistical/machine learning techniques to create high-performing predictive models and actionable insights to address business objectives and client needs
- Used various metrics (RMSE, MAE, F-Score, ROC and AUC) to evaluate the performance of each model
- Used big data tools Spark (PySpark, SparkSQL, MLlib) to conduct real time analysis of customer behavior
- Communicated effectively with internal stakeholders on product design, data specifications, and model implementations; with partners on collaboration ideas and specifics; and with clients and account teams on project and test results
- Recommended and evaluated marketing approaches based on quality analytics on customer behavior
- Designed rich data visualizations to model data into human-readable form with Seaborn and Matplotlib
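Among the evaluation metrics listed above, ROC AUC has a simple rank-based interpretation: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A pure-Python sketch with illustrative labels and scores:

```python
# ROC AUC via the rank-based (Mann-Whitney) formulation: the fraction of
# positive/negative pairs where the positive outscores the negative,
# with ties counting half. Pure-Python sketch; data illustrative.

def roc_auc(y_true, scores):
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 0, 1, 0]               # illustrative labels
scores = [0.9, 0.5, 0.6, 0.4, 0.8, 0.2]   # illustrative model scores
auc = roc_auc(y_true, scores)             # 8 of 9 pairs correctly ranked
```

An AUC of 0.5 means the scores rank no better than chance; 1.0 means every positive outranks every negative.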
Environment: Hadoop, Spark, HDFS, Hive, MongoDB, Cassandra, Kafka, Sqoop, SQL, Python 3 (scikit-learn/SciPy/NumPy/Pandas/Matplotlib/Seaborn), Machine Learning (logistic regression/random forests/KNN/k-means clustering/PCA)
Confidential - Dublin, CA
Big Data Engineer
- Responsible for data engineering functions such as data extraction, ingestion, and transformation
- Imported data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from RDBMS into HDFS using Sqoop
- Optimized Hive pipelines in the data lake by implementing partitioning and bucketing to improve performance
- Exported the analyzed data back to the RDBMS using Sqoop for visualization and for generating reports for the BI team
- Stored the resulting transformed data in HBase, MongoDB, and Parquet file format
- Worked closely with data scientists to assist with feature engineering, model training frameworks, and model deployments at scale
- Worked with application developers and DBAs to diagnose and resolve query performance problems
- Collaborated with Marketing, Finance, Business Development, Product & other teams to help them uncover the insights from the data
Environment: HDFS, Pig, Hive, MapReduce, Linux, HBase, Flume, Sqoop, R, VMware, Cloudera, Python, MongoDB, Cassandra, MySQL
Confidential - Emeryville, CA
- Worked on a Hadoop cluster with 30 nodes and 50 TB capacity
- Extracted and loaded customer data from databases into HDFS and Hive tables using Sqoop
- Loaded the data into Hive managed tables using partitions and buckets
- Performed data transformations, cleaning and filtering, using Hive and Pig
- Analyzed and studied customer behavior by running Hive queries
- Stored transformation results in Parquet, SequenceFile, and Avro file formats
- Worked closely with the business and analytics teams to gather system requirements
- Documented day-to-day tasks
Environment: Hadoop, HDFS, YARN, MapReduce, Hive, Pig, Sqoop, Linux, Python.
Confidential - Mountain View, CA
Jr. SQL Developer
- Worked closely with all teams within the organization to understand business processes, gather requirements, understand complexities and end goals to come up with the best plan of execution
- Created database objects like Tables, Indexes, Stored Procedures, Views, User Defined Functions, Cursors and Triggers.
- Developed reports using SQL Server Reporting Services (SSRS)
- Assisted managers and business analysts in developing reports, presentations, and analysis for upper management