- Professionally qualified Data Scientist/Data Analyst with 8+ years of experience in Data Science and Analytics, including Data Mining, Deep Learning/Machine Learning, and Statistical Analysis
- Involved in the entire data science project life cycle, including data cleaning, data extraction, and data visualization with large data sets of structured and unstructured data; created ER diagrams and schemas
- Experienced with machine learning algorithms such as logistic regression, KNN, SVM, random forest, neural networks, linear regression, lasso regression, and k-means
- Implemented Bagging and Boosting to enhance model performance
- Experience in implementing data analysis with various analytic tools such as Anaconda 4.0, Jupyter Notebook 4.x, R 3.0 (ggplot2, dplyr, caret), and Excel
- Solid ability to write and optimize diverse SQL queries, with working knowledge of RDBMS like SQL Server 2008/2010/2012 and NoSQL databases like MongoDB 3.2
- Excellent understanding of Agile and Scrum development methodologies
- Used version control tools like Git 2.x and build tools like Apache Maven/Ant
- Passionate about gleaning insightful information from massive data assets and developing a culture of sound, data-driven decision making
- Ability to maintain a fun, casual, professional and productive team atmosphere
- Experienced in the full software development life cycle (SDLC) with Agile, DevOps, and Scrum methodologies, including creating requirements and test plans.
- Skilled in Advanced Regression Modeling, Correlation, Multivariate Analysis, Model Building, Business Intelligence tools and application of Statistical Concepts.
- Developed predictive models using Decision Tree, Naive Bayes, Logistic Regression, Random Forest, Social Network Analysis, Cluster Analysis, and Neural Networks.
- Experienced in Machine Learning and Statistical Analysis with Python Scikit-Learn.
- Experienced in using Python to manipulate data for loading and extraction; worked with Python libraries like Matplotlib, SciPy, NumPy, and Pandas for data analysis.
- Worked with applications such as R, R Shiny, SAS, Plotly, ArcGIS, MATLAB, and SPSS to develop neural networks and cluster analyses.
- Strong SQL programming skills, with experience in working with functions, packages and triggers.
- Expertise in transforming business requirements into algorithm designs, analytical models, model builds, and data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
- Skilled in performing data parsing, data manipulation, data architecture, data ingestion, and data preparation with methods including describing data contents, computing descriptive statistics, regex, split and combine, merge, remap, subset, reindex, melt, and reshape.
- Worked with NoSQL databases including HBase, Cassandra, and MongoDB.
- Experienced in Big Data with Hadoop, MapReduce, HDFS and Spark.
- Experienced in Data Integration, Validation, and Data Quality controls for ETL processes and Data Warehousing using MS Visual Studio, SSAS, SSIS, and SSRS.
- Proficient in Tableau and R-Shiny data visualization tools to analyze and obtain insights into large datasets, create visually powerful and actionable interactive reports and dashboards.
- Automated recurring reports using SQL and Python and visualized them on BI platforms like Tableau.
- Worked in development environment like Git and VM.
- Excellent communication skills; work successfully in fast-paced, multitasking environments, both independently and in collaborative teams; a self-motivated, enthusiastic learner.
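The bagging and boosting ensembles mentioned above can be sketched with scikit-learn; this is a minimal illustration on a synthetic dataset (the parameters and data are invented for the example, not the production models):

```python
# Illustrative sketch only: bagging vs. boosting on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging averages many trees trained on bootstrap samples (reduces variance);
# boosting fits trees sequentially on the residual errors (reduces bias).
bagging = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
boosting = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

print("bagging:", bagging.score(X_te, y_te))
print("boosting:", boosting.score(X_te, y_te))
```

Either ensemble typically beats a single base learner on held-out accuracy, which is the "enhanced model performance" the bullet refers to.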
BigData/Hadoop Technologies: Hadoop, HDFS, YARN, MapReduce, Hive, Pig, Impala, Sqoop, Flume, Spark, Kafka, Storm, Drill, Zookeeper and Oozie
Languages: HTML5, DHTML, WSDL, CSS3, C, C++, XML, R/RStudio, SAS Enterprise Guide, SAS, R (caret, Weka, ggplot), Perl, MATLAB, Mathematica, FORTRAN, DTD, Schemas, JSON, AJAX, Java, Scala, Python (NumPy, SciPy, Pandas, Gensim, Keras), JavaScript, Shell Scripting
NoSQL Databases: Cassandra, HBase, MongoDB, MariaDB
Business Intelligence Tools: Tableau Server, Tableau Reader, Tableau, Splunk, SAP BusinessObjects, OBIEE, SAP Business Intelligence, QlikView, Amazon Redshift, Azure Data Warehouse
Development Tools: Microsoft SQL Studio, IntelliJ, Eclipse, NetBeans.
Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall
Build Tools: Jenkins, Toad, SQL Loader, Maven, ANT, RTC, RSA, Control-M, Oozie, Hue, SOAP UI
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos 7.0/6.0
Databases: Microsoft SQL Server 2008/2010/2012, MySQL 4.x/5.x, Oracle 11g/12c, DB2, Teradata, Netezza
Operating Systems: All versions of Windows, UNIX, Linux, macOS, Sun Solaris
Confidential, Downers Grove, IL
Data Scientist/ Machine Learning
- Extracted data from HDFS and prepared data for exploratory analysis using data munging
- Built models using Statistical techniques like Bayesian HMM and Machine Learning classification models like XGBoost, SVM, and Random Forest.
- Participated in all phases of data mining, data cleaning, data collection, developing models, validation, visualization and performed Gap analysis.
- Completed a highly immersive Data Science program involving data manipulation and visualization, web scraping, machine learning, Python programming, SQL, Git, MongoDB, and Hadoop.
- Set up storage and data analysis tools in the AWS cloud computing infrastructure.
- Installed and used the Caffe Deep Learning Framework.
- Worked on different data formats such as JSON, XML and performed machine learning algorithms in Python.
- Worked with Data Architects and IT Architects to understand the movement of data and its storage, using ER Studio 9.7.
- Used pandas, numpy, seaborn, matplotlib, scikit-learn, scipy, NLTK in Python for developing various machine learning algorithms.
- Performed data manipulation and aggregation from different sources using Nexus, Business Objects, Toad, Power BI, and Smart View.
- Implemented Agile Methodology for building an internal application.
- Focused on integration overlap and Informatica's newer commitment to MDM with the acquisition of Identity Systems.
- Coded proprietary packages to analyze and visualize SPC file data to identify bad spectra and samples, reducing unnecessary procedures and costs.
- Programmed a utility in Python that used multiple packages (NumPy, SciPy, Pandas).
- Implemented Classification using supervised algorithms like Logistic Regression, Decision trees, Naive Bayes, KNN.
- As an architect, delivered various complex OLAP databases/cubes, scorecards, dashboards, and reports.
- Updated Python scripts to match training data with our database stored in AWS Cloud Search, so that we would be able to assign each document a response label for further classification.
- Used Teradata utilities such as FastExport and MLOAD to handle data migration/ETL tasks from OLTP source systems to OLAP target systems.
- Performed data transformation from various sources, data organization, and feature extraction from raw and stored data.
- Validated the machine learning classifiers using ROC Curves and Lift Charts.
Environment: Unix, Python 3.5.2, MLLib, SAS, regression, logistic regression, Hadoop 2.7.4, NoSQL, Teradata, OLTP, random forest, OLAP, HDFS, ODS, NLTK, SVM, JSON, XML and MapReduce.
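The ROC-curve validation described in this role can be sketched as follows; a synthetic dataset and a logistic regression stand in for the actual classifiers (all names here are illustrative):

```python
# Illustrative sketch: validating a classifier with an ROC curve and AUC.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]   # predicted probability of class 1
fpr, tpr, _ = roc_curve(y_te, scores)    # points on the ROC curve
auc = roc_auc_score(y_te, scores)        # area under that curve
print("AUC:", round(auc, 3))
```

An AUC near 1.0 means the classifier ranks positives above negatives almost everywhere on the curve; an AUC of 0.5 is no better than chance.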
Confidential, Boston, MA
Data Scientist/ Machine Learning
- Utilized Spark, Scala, Hadoop, HQL, VQL, Oozie, PySpark, Data Lake, TensorFlow, HBase, Cassandra, Redshift, MongoDB, Kafka, Kinesis, Spark Streaming, Edward, CUDA, MLlib, AWS, and Python, along with a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
- Utilized the engine to increase user lifetime by 45% and triple user conversions for target categories.
- Applied various machine learning algorithms and statistical modeling techniques such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering to identify volume, using the scikit-learn package in Python and MATLAB.
- Worked on analyzing data from Google Analytics, AdWords, Facebook, etc.
- Evaluated models using cross-validation, the log loss function, ROC curves, and AUC for feature selection; worked with Elastic technologies like Elasticsearch and Kibana.
- Performed Data Profiling to learn about behavior with various features such as traffic pattern, location, Date and Time etc.
- Categorized comments into positive and negative clusters from different social networking sites using Sentiment Analysis and Text Analytics
- Performed Multinomial Logistic Regression, Decision Tree, Random Forest, and SVM to classify whether a package would be delivered on time for the new route.
- Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database, and used ETL for data transformation.
- Performed Data Cleaning, features scaling, features engineering using pandas and numpy packages in python.
- Explored DAGs, their dependencies, and logs using Airflow pipelines for automation.
- Performed data cleaning and feature selection using the MLlib package in PySpark and worked with deep learning frameworks such as Caffe and Neon.
- Developed Spark/Scala, R, and Python code for a regular expression (regex) project in the Hadoop/Hive environment with Linux/Windows for big data resources.
- Used clustering technique K-Means to identify outliers and to classify unlabeled data.
- Tracked operations using sensors until certain criteria were met, using Airflow.
- Responsible for various data mapping activities from source systems to Teradata, using utilities like TPump, FEXP, BTEQ, MLOAD, FLOAD, etc.
- Analyzed traffic patterns by calculating autocorrelation with different time lags.
- Ensured that the model had a low false positive rate; performed text classification and sentiment analysis on unstructured and semi-structured data.
- Addressed overfitting by implementing regularization methods such as L1 and L2.
- Used Principal Component Analysis in feature engineering to analyze high dimensional data.
- Used MLlib, Spark's Machine learning library to build and evaluate different models.
- Implemented a rule-based expert system from the results of exploratory analysis and information gathered from people in different departments.
- Created and designed reports that used gathered metrics to infer and draw logical conclusions about past and future behavior.
- Developed MapReduce pipeline for feature extraction using Hive and Pig.
- Created Data Quality Scripts using SQL and Hive to validate successful data load and quality of the data. Created various types of data visualizations using Python and Tableau.
- Communicated results to the operations team to support better decisions.
- Collected data needs and requirements by interacting with other departments.
Environment: Python 2.x, CDH5, HDFS, Hadoop 2.3, Hive, Impala, AWS, Linux, Spark, Tableau Desktop, SQL Server 2014, Microsoft Excel, Matlab, Spark SQL, Pyspark.
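Among the techniques in this role, the K-Means bullet (identifying outliers in unlabeled data) lends itself to a short sketch; the two-blob dataset, the planted outlier, and the 3-sigma distance threshold are all invented for the illustration:

```python
# Illustrative sketch: flagging outliers as points far from their K-Means centroid.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
blob1 = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
blob2 = rng.normal(loc=5.0, scale=1.0, size=(50, 2))
points = np.vstack([blob1, blob2, [[20.0, 20.0]]])  # one planted outlier (index 100)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
# Distance of each point to its assigned centroid; large distances are suspects.
dist = np.linalg.norm(points - km.cluster_centers_[km.labels_], axis=1)
outliers = np.where(dist > dist.mean() + 3 * dist.std())[0]
print(outliers)
```

In practice the threshold (here mean + 3 standard deviations of the distances) is tuned to the data; the same distances can also be used to assign labels to the remaining unlabeled points.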
Confidential, Deerfield, IL
- Worked with the BI team to gather report requirements and used Sqoop to export data into HDFS and Hive.
- Involved in the following phases of analytics using R, Python, and Jupyter Notebook:
  a. Data collection and treatment: analysed existing internal and external data, worked on entry errors and classification errors, and defined criteria for missing values.
  b. Data mining: used cluster analysis to identify customer segments, decision trees for profitable and non-profitable customers, and Market Basket Analysis for customer purchasing behaviour and part/product association.
- Developed multiple Map Reduce jobs in Java for data cleaning and preprocessing.
- Assisted with data capacity planning and node forecasting.
- Installed, configured, and managed the Flume infrastructure.
- Administered Pig, Hive, and HBase, installing updates, patches, and upgrades.
- Worked closely with the claims processing team to obtain patterns in filing of fraudulent claims.
- Worked on performing major upgrade of cluster from CDH3u6 to CDH4.4.0
- Developed Map Reduce programs to extract and transform the data sets and results were exported back to RDBMS using Sqoop.
- Patterns were observed in fraudulent claims using text mining in R and Hive.
- Exported the required information to the RDBMS using Sqoop, making the data available to the claims processing team to assist in processing claims.
- Developed Map Reduce programs to parse the raw data, populate staging tables and store the refined data in partitioned tables in the EDW.
- Created tables in Hive and loaded the structured data resulting from MapReduce jobs.
- Developed many queries using HiveQL and extracted the required information.
- Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
- Responsible for importing data (mostly log files) from various sources into HDFS using Flume.
- Enabled speedy reviews and first mover advantages by using Oozie to automate data loading into the Hadoop Distributed File System and PIG to pre-process the data.
- Provided design recommendations and thought leadership to sponsors/stakeholders that improved review processes and resolved technical problems.
- Managed and reviewed Hadoop log files.
- Tested raw data and executed performance scripts.
Environment: HDFS, PIG, HIVE, Map Reduce, Linux, HBase, Flume, Sqoop, R, VMware, Eclipse, Cloudera, Python.
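The MapReduce extract-and-transform jobs above can be illustrated with a tiny pure-Python map/reduce over toy records (the real jobs were written in Java and ran on Hadoop; the token data here is made up):

```python
# Conceptual word-count-style map/reduce: the mapper emits (key, 1) pairs,
# the reducer sums counts per key — the same shape as a Hadoop MapReduce job.
from collections import defaultdict
from itertools import chain

records = ["red green", "green blue", "red red"]  # toy stand-in for raw log lines

def mapper(line):
    for token in line.split():
        yield token, 1

def reducer(pairs):
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

features = reducer(chain.from_iterable(mapper(r) for r in records))
print(features)  # {'red': 3, 'green': 2, 'blue': 1}
```

Hadoop additionally shuffles and sorts the mapper output by key before the reduce phase, so each reducer sees all values for a key together; this sketch collapses that into a single in-memory reduce.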
Confidential, Plano, TX
- Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
- Analyzed data using Hadoop Components Hive and Pig.
- Worked on Hadoop MapReduce, HDFS, developed multiple MapReduce jobs in java for data cleaning and preprocessing.
- Involved in loading data from UNIX file system to HDFS.
- Involved in development using Cloudera distribution system.
- Worked hands-on with ETL processes.
- Developed Hadoop Streaming jobs to ingest large amounts of data.
- Load and transform large data sets of structured, semi structured and unstructured data using Hadoop/Big Data concepts.
- Imported data using Sqoop from Teradata using Teradata connector.
- Created sub-queries for filtering and faster query execution.
- Created multiple join tables and fetched the required data.
- Worked with Hadoop clusters using Cloudera (CDH5) distributions.
- Performed importing and exporting of data using Sqoop between HDFS and relational database systems.
- Installed and set up HBase and Impala.
- Used Python libraries like Beautiful Soup, NumPy, and SQLAlchemy.
- Used Apache Impala to read, write, and query Hadoop data in HDFS, HBase, and Cassandra.
- Implemented Partitioning, Dynamic Partitions and Buckets in Hive.
- Supported MapReduce programs running on the cluster.
- Worked on debugging and performance tuning of Hive and Pig jobs.
- Bulk loaded data into Oracle using the JDBC template.
- Worked on Python OpenStack APIs and used NumPy for Numerical analysis.
Environment: Cloudera, HDFS, Pig, Hive, Map Reduce, python, Sqoop, Storm, Kafka, LINUX, Hbase, Impala, Java, SQL, Cassandra, MongoDB, SVN.
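The sub-query-then-join pattern used in this role can be sketched with the standard-library sqlite3 module standing in for the actual Teradata/Oracle systems; the `orders`/`customers` tables and values are invented for the example:

```python
# Illustrative sketch: a subquery filters rows first, then a join attaches
# the lookup columns — the same shape as the filtering/join work described above.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders(id INTEGER, customer TEXT, amount REAL);
    CREATE TABLE customers(name TEXT, region TEXT);
    INSERT INTO orders VALUES (1,'ada',120.0),(2,'bob',80.0),(3,'ada',200.0);
    INSERT INTO customers VALUES ('ada','east'),('bob','west');
""")
rows = con.execute("""
    SELECT c.region, o.amount
    FROM (SELECT * FROM orders WHERE amount > 100) AS o
    JOIN customers AS c ON c.name = o.customer
""").fetchall()
print(sorted(rows))  # [('east', 120.0), ('east', 200.0)]
```

Filtering inside the subquery shrinks the row set before the join runs, which is the "faster execution" idea the bullet refers to; a good optimizer often does this pushdown automatically.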
- Involved in the design, development and testing phases of application using AGILE methodology.
- Designed and maintained databases using Python and developed Python based API (RESTful Web Service) using Flask, SQLAlchemy and PostgreSQL.
- Participated in requirement gathering and worked closely with the architect in designing and modeling.
- Worked on RESTful web services enforcing a stateless client-server model; supported moving a few services from SOAP to RESTful technology; involved in detailed analysis based on the requirement documents.
- Involved in writing SQL queries implementing functions, triggers, cursors, object types, sequences, indexes etc.
- Created and managed all hosted and local repositories through SourceTree's simple Git client interface; collaborated using the Git command line and Stash.
- Responsible for setting up a Python REST API framework using Django.
- Developed consumer-based features and applications using Python, Django, HTML, Behavior-Driven Development (BDD), and pair programming.
- Designed and developed components using Python with the Django framework; implemented code in Python to retrieve and manipulate data.
- Involved in development of the enterprise social network application using Python, Twisted, and Cassandra.
- Used Python and Django for creating graphics, XML processing of documents, data exchange, and business logic implementation between servers. Worked closely with back-end developers to find ways to push the limits of existing web technology.
- Designed and developed the UI for the website with HTML, XHTML, CSS, JavaScript, and AJAX.
- Used AJAX and JSON communication for accessing RESTful web service data payloads.
- Created and implemented SQL Queries, Stored procedures, Functions, Packages and Triggers in SQL Server.
- Successfully implemented Auto Complete/Auto Suggest functionality using jQuery, AJAX, web services, and JSON.
Environment: Python 2.5, Java/J2EE, Django 1.0, HTML, CSS, Linux, Shell Scripting, JavaScript, AJAX, jQuery, JSON, XML, PostgreSQL, Jenkins, ANT, Maven, Subversion
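A minimal sketch of the kind of Python REST API described in this role, using Flask; the `/users` route and the in-memory dict are hypothetical stand-ins (the real service used SQLAlchemy with PostgreSQL behind the API layer):

```python
# Illustrative sketch: a tiny stateless JSON REST API in Flask.
from flask import Flask, jsonify, request

app = Flask(__name__)
users = {}  # hypothetical in-memory stand-in for the SQLAlchemy/PostgreSQL layer

@app.route("/users", methods=["POST"])
def create_user():
    data = request.get_json()          # JSON body, e.g. {"name": "ada"}
    users[data["name"]] = data
    return jsonify(data), 201          # 201 Created with the stored resource

@app.route("/users/<name>", methods=["GET"])
def get_user(name):
    return jsonify(users.get(name, {}))

if __name__ == "__main__":
    app.run()
```

Each request carries everything the handler needs (the stateless client-server constraint mentioned above); swapping the dict for a SQLAlchemy session is the only change needed to back it with PostgreSQL.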
- Designed the application by implementing Struts based on the MVC architecture, using simple Java Beans as the Model, JSP/UI components as the View, and ActionServlet as the Controller.
- Implemented EJB's Container-Managed Persistence strategy.
- Requirement gathering, Design Analysis and Code development.
- Implemented Struts framework based on the Model View Controller design paradigm.
- Implemented the MVC architecture using Struts MVC.
- Used JDBC for data access from Oracle tables.
- Worked on triggers and stored procedures on Oracle database.
- Worked on Eclipse IDE to write the code and integrate the application.
- Application was deployed on WebSphere Application Server.
- Coordinated with testing team for timely release of product.
- Apache ANT was used for the entire build process.
- JUnit was used to implement test cases for beans.