Data Scientist Resume
SUMMARY
- Over 5 years of experience in data manipulation, wrangling, model building, and visualization with large data sets.
- An analytical, detail-oriented data science professional with a proven record of success in the collection and manipulation of large datasets.
- Demonstrated expertise in decisive leadership and in delivering research-based, data-driven solutions that move the organization's vision forward.
- Highly competent at researching, visualizing, and analyzing raw data to identify recommendations for meeting organizational challenges.
- Proven excellence in personal management and program development.
- Ability to perform Data preparation and exploration to build the appropriate machine learning model.
- Proficient in statistical modeling and machine learning techniques, including predictive analytics, segmentation methodologies, regression-based models, hypothesis testing, PCA, and ensembles.
- Expertise in machine learning models such as Linear and Logistic Regression, Decision Trees, Naive Bayes, SVM, Neural Networks, K-Nearest Neighbors, and clustering (K-means, Hierarchical).
- Implement and practice Machine learning techniques on structured and unstructured data with equal proficiency.
- Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, Pivot Tables.
- Ability to use dimensionality reduction techniques and regularization techniques.
- Highly skilled in using visualization tools like Matplotlib, ggplot2 and Seaborn for creating dashboards.
- Experience working with Big Data tools such as Hadoop (HDFS and MapReduce), Hive, Sqoop, and Apache Spark (PySpark).
- Experience working with RDBMS such as SQL Server, MySQL and NoSQL databases such as MongoDB, Cassandra, HBase.
- Experience in importing and exporting data from different RDBMS such as MySQL, Oracle, and SQL Server into HDFS and Hive using Sqoop.
- Good knowledge of scalable, secure cloud architecture on Amazon Web Services (AWS), including EC2, EMR, and S3.
- Strong communication skills and a professional attitude; able to work under pressure with enthusiasm and full commitment.
TECHNICAL SKILLS
Programming: Python, R, Scala
Python: Data Manipulation, NumPy, Pandas, Matplotlib, Seaborn, Plotly, Scikit-learn (machine learning libraries and others)
Big Data: Hadoop, MapReduce, HDFS, Hive, Kafka, Pig, Oozie, Flume, Sqoop, Impala, Spark
Spark: Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX, PySpark, DataFrame
Platforms: Ubuntu, Linux, macOS
Analytical Tools: SQL, Jupyter Notebook, Apache Zeppelin, MS Excel
Methodologies: Agile, Scrum, Software Development Life Cycle (SDLC)
NoSQL: MongoDB, Cassandra, HBase
Others: AWS, S3, EC2, EMR, MySQL, PostgreSQL
PROFESSIONAL EXPERIENCE
Confidential
Data Scientist
Responsibilities:
- Developed an understanding of the business, the problem statement, and the manual approaches the company had followed for years
- Gathered the required data from multiple sources, such as the data warehouse and the billing department
- Helped create a data lake by extracting customer big data from various sources into Hadoop HDFS, including Excel, flat files, RDBMS, SQL Server, HBase, and server log data
- Performed data cleaning and transformations with Pandas and NumPy to prepare the data for modeling
- Performed transformations of data using Spark and Hive to generate the final dataset to be consumed by analytical applications
- Performed Exploratory Data Analysis (EDA)
- Participated in feature engineering, including feature generation, PCA, and feature normalization with Scikit-learn preprocessing
- Analyzed data and predicted end customer behaviors and product performance by applying machine learning algorithms using Spark MLlib
- Experimented and built predictive models using Logistic regression, Decision Tree, Support Vector Machine and KNN to predict customer churn
- Evaluated model performance using the confusion matrix, precision, and recall
- Developed a logistic regression model with 61 percent accuracy
Environment: HDFS, Hive, Sqoop, Spark, Spark MLlib, SQL, Excel, MongoDB
Confidential - San Francisco, CA
Data Scientist
Responsibilities:
- Responsible for researching and developing the action plan required for the development of the model
- Participated in Data Acquisition with Data Engineer team to extract historical and real-time data by using Sqoop, Spark, Hive, Kafka, MapReduce and HDFS
- Worked with datasets of varying size and complexity, including both structured and unstructured data
- Performed data integrity checks, data cleansing, exploratory analysis, and feature engineering using Python and data visualization packages such as Matplotlib and Seaborn
- Utilized data wrangling tools and advanced statistical/machine learning techniques to create high-performing predictive models and actionable insights to address business objectives and client needs
- Used various metrics (RMSE, MAE, F-Score, ROC and AUC) to evaluate the performance of each model
- Used the big data tool Spark (PySpark, Spark SQL, MLlib) to conduct real-time analysis of customer behavior
- Communicated effectively with internal stakeholders on product design, data specifications, and model implementations; with partners on collaboration ideas and specifics; and with clients and account teams on project and test results
- Recommended and evaluated marketing approaches based on quality analytics on customer behavior
- Designed rich data visualizations to model data into human-readable form with Seaborn and Matplotlib
Environment: Hadoop, Spark, HDFS, Hive, MongoDB, Cassandra, Kafka, Sqoop, SQL, Python 3 (Scikit-learn/ SciPy/ NumPy/ Pandas/ Matplotlib/ Seaborn), Machine Learning (Logistic regression/ Random Forests/ KNN/ K-Means Clustering/ PCA)
Confidential - San Francisco, CA
Big Data Engineer
Responsibilities:
- Responsible for data engineering functions such as data extraction, ingestion, and transformation
- Imported data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from RDBMS into HDFS using Sqoop
- Optimized Hive pipelines in the data lake by implementing partitioning and bucketing to improve performance
- Exported the analyzed data to the RDBMS using Sqoop for visualization and to generate reports for the BI team
- Stored the transformed data in HBase and MongoDB, and also in Parquet file format
- Worked closely with data scientists to assist with feature engineering, model training frameworks, and model deployments at scale
- Worked with application developers and DBAs to diagnose and resolve query performance problems
- Collaborated with Marketing, Finance, Business Development, Product & other teams to help them uncover the insights from the data
Environment: HDFS, Pig, Hive, MapReduce, Linux, HBase, Flume, Sqoop, R, VMware, Cloudera, Python, MongoDB, Cassandra, MySQL
Confidential - Emeryville, CA
Data Engineer
Responsibilities:
- Worked on a 30-node Hadoop cluster with 50 TB capacity
- Extracted and loaded customer data from databases into HDFS and Hive tables using Sqoop
- Loaded the data into Hive managed tables using partitions and buckets
- Performed data transformations, cleaning and filtering, using Hive and Pig
- Analyzed and studied customer behavior by running Hive queries
- Stored the transformed data in Parquet, SequenceFile, and Avro file formats
- Worked closely with the business and analytics teams to gather system requirements
- Documented day-to-day tasks
Environment: Hadoop, HDFS, YARN, MapReduce, Hive, Pig, Sqoop, Linux, Python
Confidential
Jr. SQL Developer
Responsibilities:
- Worked closely with all teams within the organization to understand business processes, gather requirements, understand complexities and end goals to come up with the best plan of execution
- Created database objects such as tables, indexes, stored procedures, views, user-defined functions, cursors, and triggers
- Developed reports using SQL Server Reporting Services (SSRS)
- Assisted managers and business analysts in developing reports, presentations, and analysis for upper management
Environment: Python, MySQL, SQL
