- Highly analytical and process-oriented data analyst with 7+ years of experience in data analysis, data modeling, and data management, with a proven ability to work efficiently both independently and in teams. Excellent team player with good communication and interpersonal skills and solid team-leading capabilities.
- Expertise and experience in domains such as retail solutions, finance, healthcare, banking, digital advertising, and e-commerce.
- 5 years of experience working with the Python, R, and SAS analytical platforms.
- Expertise in SQL queries, with 6 years of experience creating and populating databases and extracting data from tables, including the creation of tables, subqueries, joins, views, indexes, SQL functions, and other functionality.
- Proficient knowledge of the SDLC and extensive experience with Agile (Scrum and XP) and Waterfall models and the CRISP-DM data cycle.
- Experience in data modeling and data analysis, and in working with OLTP and OLAP systems and with data warehousing approaches such as EDW, MOLAP, DM, and ROLAP.
- Worked with various RDBMS such as Oracle, MySQL, and SQL Server, with expertise in creating tables, populating data, and extracting data from these databases.
- Worked with NoSQL databases such as Apache Cassandra for stream processing and real-time analysis of unstructured data using Kafka.
- Strong experience implementing data warehouse solutions in Oracle and SQL Server.
- Experience in extracting, transforming and loading (ETL) data from spreadsheets, database tables, flat files and other sources using Informatica.
- Skilled in Data chunking, Data profiling, Data Cleansing, Data mapping, creating workflows and Data Validation using data integration tools like Informatica during the ETL and ELT processes.
- Good knowledge of normalization and denormalization techniques for optimal schema design.
- Extensive experience in ERD and UML modeling, and in deriving physical models from logical models.
- Experience with data warehousing concepts such as star, galaxy, and snowflake schemas, data marts, and the Kimball methodology used in relational and multidimensional data modeling.
- Experience with the Apache Hadoop ecosystem, with good knowledge of the Hadoop Distributed File System (HDFS), MapReduce, Hive, Pig, Python, HBase, Sqoop, Kafka, Flume, Cassandra, Oozie, and Spark.
- Experience with conceptual, logical, and physical data modeling in line with metadata standards.
- Experience with DBA tasks involving database creation, performance tuning, index creation, and creating and modifying tablespaces for optimization purposes.
- Knowledge of machine learning and statistical techniques such as regression models, artificial neural networks, clustering analysis, decision trees, ANOVA, t-tests, natural language processing (NLP), and SVM.
- Experience in Base SAS, R, SQL, Tableau, Python, and MS Excel (pivot charts, macros).
- Expertise in creating Tableau dashboards for data visualization and deploying them to servers.
Analytical Techniques: Hypothesis testing, Predictive analysis, Machine Learning, Regression Modelling, Logistic Modelling, Time Series Analysis, Decision Tree, Neural Networks, Support Vector Machines (SVM), Monte Carlo methods, Random Forest
Analytical Tools: RStudio, SAS, Jupyter Notebook, NLP, MATLAB, ggplot, WEKA
Data Visualization Tools: Tableau, Microsoft Power BI, Excel, Visio
Data Modeling: Entity Relationship Diagrams (ERD), Snowflake schema, Star schema, Erwin, ER/Studio
Languages: SQL, HiveQL, C, R, Python, SAS
Database Systems: SQL Server 10.0/11.0/13.0, Oracle, MySQL 5.1/5.6/5.7
NoSQL Databases: HBase, Apache Cassandra
ETL Tools: Informatica PowerCenter 9.0, Informatica IDQ, Kafka, Flume
Big Data: Apache Hadoop, HDFS, Sqoop, Flume, Kafka, Hive, Impala, MapReduce, Splunk ML-SPL, Splunk Hadoop Connect, Oozie, Spark
SDLC Methodology and Tools: Waterfall, Agile / Scrum Methodology / XP, SeeNowDoScrum, MS Project, CRISP
Confidential, MOLINE, ILLINOIS
- Responsible for data identification, collection, exploration, and cleaning for modeling; participated in model development.
- Performed data cleaning, feature scaling, and feature engineering.
- Responsible for loading, extracting, and validating client data.
- Created statistical models using distributed and standalone models to build various diagnostic, predictive, and prescriptive solutions.
- Performed missing-value treatment, outlier capping, and anomaly treatment using statistical methods; derived customized key metrics.
- Created conceptual, logical, and physical data models with ER/Studio and held design review meetings with team members to finalize the models.
- Performed analysis using industry leading text mining, data mining, and analytical tools and open source software.
- Understood and implemented text mining concepts, graph processing, and semi-structured and unstructured data processing.
- Worked on AWS, architecting a solution to load data, create data models, and run BI on them.
- Optimized the ETL workflows for better performance in data migration and performed the required transformation based on the requirements of the project.
- Applied analysis methods such as hypothesis testing and analysis of variance (ANOVA) to validate the existing models against the observed data.
- Used Pandas, NumPy, seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python to develop various machine learning models, and applied algorithms such as linear regression, multivariate regression, Naive Bayes, Random Forests, K-means, and KNN for data analysis.
- Evaluated models using Cross Validation, Log loss function, ROC curves and used AUC for feature selection.
- Set up storage and data analysis tools in the Amazon Web Services cloud computing infrastructure.
- Created dummy variables for certain datasets to feed into the regression.
- Built multiple machine learning features using Python and R as needed.
- Strong skills in data visualization with libraries such as Matplotlib and seaborn.
- Created different charts such as heat maps, bar charts, and line charts to better visualize results in R and Python.
- Developed predictive models using Decision Tree, Random Forest, Naïve Bayes, Logistic Regression, Cluster Analysis, and Neural Networks to support an analytical online advertising pricing model that maximizes the client's net revenues, to predict accurate revenue-per-click estimates, and to build a fraud traffic detection system that flags potential bot sessions causing inflated billings to the client's customers.
- Used Microsoft Visio to do process mining for processes across different departments.
- Visualized, interpreted, and reported findings, and developed strategic uses of data with Python libraries such as NumPy, scikit-learn, and Matplotlib.
Environment: Python 3.x, R, HDFS, Hive, Linux, Spark, Tableau Desktop, SQL Server 2012, Microsoft Excel, MATLAB, Spark SQL, PySpark, Apache Hadoop Distribution 2.7.X, Machine Learning, Data Marts, Data Warehouse, Informatica, SQL, HiveQL, Kafka, Flume, SciPy, scikit-learn, NLP, Data integration, NumPy, Pandas, MySQL, Decision Tree, Salesforce, Random Forest, Naïve Bayes, Logistic Regression, Cluster Analysis, Neural Networks.
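The missing-value treatment and outlier capping listed above could look roughly like the sketch below. It is an illustration only, not the project's actual code: the median fill, the 5th/95th percentile bounds, the function names, and the sample values are all assumptions, and it uses only the Python standard library rather than Pandas.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def cap_outliers(values, lower_pct=5, upper_pct=95):
    """Clip values to the given percentile bounds (winsorization)."""
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]


# Made-up sample with two missing values and one extreme value.
raw = [12.0, 15.0, None, 14.0, 13.0, 250.0, 16.0, None, 11.0, 15.0]
clean = cap_outliers(impute_median(raw))
```

Imputing before capping keeps the list length unchanged and ensures the percentile bounds are computed on complete data.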
Confidential, PHILADELPHIA, PENNSYLVANIA
- Participated in all phases of data mining, data cleaning, data collection, developing models, validation, and visualization.
- Created DDL scripts using Erwin and source-to-target mappings to bring data from the source systems to the warehouse.
- Developed machine learning models that predicted click propensity of users based on attributes such as user demographics, historic click behaviour and other related attributes. Predicting user propensity to click helped show and place relevant features on the website.
- Clustered the supply chain of stores based on volume, volatility in demand and proximity to warehouses using Hierarchical clustering and identified strategies for each of the clusters to better optimize the service level to stores.
- Integrated Salesforce and SQL Server using SQL Server Integration Services (SSIS).
- Predicted the likelihood of customer attrition by developing classification models based on customer attributes such as user demographics, historic clicks, and user acquisition channels. The models, deployed in the production environment, helped detect churn in advance and helped sales and marketing teams plan retention strategies such as tailored promotions and custom offers.
- Hands-on experience in data modeling for AWS platforms such as AWS Redshift.
- Experimented with predictive models including Logistic Regression, Support Vector Machine, Random Forest, and XGBoost, and identified the best models based on accuracy and explainability.
- Forecasted sales and improved accuracy by 10-20% by implementing advanced forecasting algorithms that detected seasonality and trends and incorporated exogenous covariates. The increased accuracy helped the business plan better for budgeting and for sales and operations planning.
- Tuned model parameters (p, d, q for ARIMA) using walk-forward validation for optimal model performance.
- Developed and refined complex marketing mix statistical models in a team environment and worked with diverse functional groups with over $100MM in annual marketing spend.
- Responsible for all stages in the modeling process, from collecting, verifying, & cleaning data to visualizing model results, presenting results, and making client recommendations.
- Developed customer segments using unsupervised learning techniques such as K-Means. The clusters helped the business reduce complex patterns to a manageable set of 5 patterns, which helped set strategic and tactical objectives for customer retention, acquisition, spend, and loyalty.
Environment: Apache Hadoop, HBase, Apache Spark, Kafka, Informatica, Tableau, R, SAS, Spark SQL, Predictive analysis, Machine Learning, MS Office suite, Hive (UDF), NLP (Natural Language Processing), Python, Salesforce, NumPy, SciPy, scikit-learn, SQL, Flume, ggplot, AWS Redshift, Cassandra, Oozie, HiveQL, Logistic Regression, Support Vector Machine, MySQL, Data mart, Data warehouse, Random Forest, K-Means clustering.
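Walk-forward validation, mentioned above for tuning the ARIMA order, can be illustrated with the minimal sketch below. To stay dependency-free it substitutes a simple moving-average forecaster for ARIMA, so the tuned parameter is a window size rather than (p, d, q). The function names, the error metric (mean absolute error), and the sales figures are all hypothetical.

```python
def moving_average_forecast(history, window):
    """One-step-ahead forecast: mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def walk_forward_mae(series, window, min_train=4):
    """Mean absolute error of one-step forecasts over an expanding history."""
    errors = []
    for t in range(min_train, len(series)):
        history = series[:t]  # only data observed before time t
        prediction = moving_average_forecast(history, window)
        errors.append(abs(series[t] - prediction))
    return sum(errors) / len(errors)


def select_window(series, candidates, min_train=4):
    """Return the candidate window with the lowest walk-forward error."""
    return min(candidates, key=lambda w: walk_forward_mae(series, w, min_train))


# Made-up, steadily trending sales series.
sales = [100, 104, 109, 113, 118, 124, 129, 133, 138, 144]
best = select_window(sales, candidates=[1, 2, 3])
```

Each forecast uses only observations that precede the point being predicted, which is what distinguishes walk-forward validation from a random train/test split on time series data.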
Confidential, MELVIN, NEW YORK
Data Scientist/ Data Analyst
- Collaborated with data engineers and the operations team to implement the ETL process; wrote and optimized SQL queries to extract data to fit the analytical requirements.
- Worked very closely with the data architects and DBA team to implement data model changes in the database across all environments.
- Performed data modeling in Erwin; designed target data models for the enterprise data warehouse.
- Worked extensively on creating the migration plan to Amazon Web Services (AWS).
- Worked on data cleaning and ensured data quality, consistency, and integrity using Pandas and NumPy.
- Explored and analyzed customer-specific features using Matplotlib and Seaborn in Python and dashboards in Tableau.
- Performed data imputation using Scikit-learn package in Python.
- Participated in feature engineering such as feature generation, PCA, feature normalization, and label encoding with scikit-learn preprocessing.
- Used Python 2.x/3.x (NumPy, SciPy, Pandas, scikit-learn, Seaborn) to develop a variety of models and algorithms for analytic purposes.
- Experimented with and built predictive models, including ensemble methods such as gradient boosting trees and neural networks built with Keras, to predict sales amounts.
- Analyzed patterns in customers' shopping habits across different locations, categories, and months using time series modeling techniques.
- Used RMSE/MSE to evaluate different models' performance.
- Designed rich data visualizations to model data into human-readable form with Tableau and Matplotlib.
Environment: Python 2.x/3.x, scikit-learn/SciPy, NumPy, Pandas, Matplotlib, Seaborn, Apache Hadoop, HBase, Apache Spark, Tableau, Machine Learning algorithms, Data Mart, Natural Language Processing (NLP), Data Warehouse, Salesforce, Hive, Cassandra, Oozie, Random Forest, Gradient Boosting tree, AWS Redshift, Neural network by Keras, SQL, HiveQL, Kafka, Flume, MS Office suite, Principal Component Analysis (PCA), MySQL, Factor analysis.
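As a small illustration of the RMSE/MSE comparison mentioned above, the sketch below scores two sets of predictions against the same actual values and picks the one with the lower RMSE. The model names and all numbers are made up for the example.

```python
from math import sqrt


def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    return sqrt(mse(actual, predicted))


# Hypothetical sales amounts and predictions from two candidate models.
actual = [200.0, 150.0, 320.0, 275.0]
model_a = [210.0, 140.0, 300.0, 280.0]
model_b = [190.0, 155.0, 330.0, 270.0]

scores = {"model_a": rmse(actual, model_a), "model_b": rmse(actual, model_b)}
best_model = min(scores, key=scores.get)  # lower RMSE is better
```

RMSE is often reported alongside MSE because it is expressed in the same units as the predicted quantity, which makes it easier to interpret.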
Data Scientist/ Data Analyst
- Involved in all phases of data acquisition, data collection, data cleaning, model development, model validation, and visualization to deliver data science solutions.
- Created classification models to recognize web requests with product associations in order to classify orders and score products for analytics, which improved the online sales percentage by 13%.
- Used the NLTK library in Python for sentiment analysis of customer product reviews and of third-party websites gathered through web scraping.
- Used Pandas, NumPy, and scikit-learn in Python to develop various machine learning models such as Decision Tree and Random Forest.
- Used cross-validation to test the models with different batches of data to optimize the models and prevent overfitting.
- Developed a fraud detection model by implementing a feed-forward multilayer perceptron, a type of artificial neural network (ANN).
- Used pruning algorithms to cut away connections and perceptrons, significantly improving the performance of the back-propagation algorithm.
- Implemented a structured learning method based on a search-and-scoring approach.
- Performed customer segmentation based on behaviour or specific characteristics such as age, region, income, and geographical location, applying clustering algorithms to group customers by similar behaviour patterns.
- Created and maintained Tableau reports to display the status and performance of deployed models and algorithms.
- Worked with numerous data visualization tools in Python such as Matplotlib, seaborn, and ggplot.
Environment: PyCharm, Pandas, scikit-learn, NumPy, Tableau, SPSS, Excel, Python, R, SAS/STAT, SQL, MySQL, Predictive analysis, Neural Networks, Apache Hadoop Distribution 2.7.X, Machine learning, Natural Language Processing (NLP), HDFS, Linux, Microsoft Power BI, MS Office suite, ggplot, Cassandra, Oozie, HiveQL, HBase, Kafka.
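The cross-validation over different batches of data mentioned above can be sketched as a plain k-fold index split. This is a generic illustration under assumed names (`k_fold_indices` is hypothetical), not the project's code. In practice a library routine such as scikit-learn's `KFold` would normally be used instead.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


# Every sample lands in exactly one test fold across the k rounds.
folds = list(k_fold_indices(10, 3))
```

Each model is trained on the `train` indices and scored on the held-out `test` indices, and the scores are averaged over the folds. A large gap between training and held-out scores is a sign of overfitting.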
- Involved in Data mapping specifications to create and execute detailed system test plans. The data mapping specifies what data will be extracted from an internal data warehouse, transformed and sent to an external entity.
- Worked closely with stakeholders to understand, define, and document the business questions to be answered.
- Reviewed system/application requirements (functional specifications), test results, and metrics for quality and completeness.
- Designed and developed Oracle PL/SQL procedures for data import/export and data conversions.
- Analyzed source data from SQL Server, Oracle, and flat files such as Access and Excel, and worked with business users and developers to develop the model.
- Used Informatica Data Quality as the ETL tool to transform data from various sources into one common format and load it into the target database for analysis from the data warehouse.
- Executed SQL queries to validate actual test results and match expected results as per financial rules.
- Responsible for maintaining the integrity of the SQL database and reporting any issues to the database architect.
- Designed and modeled the reporting data warehouse considering current and future reporting requirements.
- Involved in the daily maintenance of the database, which included monitoring the daily script runs and troubleshooting any errors in the process.
Environment: Windows XP/NT/2000, SQL Server 2005/2008, SQL, MySQL, Oracle, Microsoft Visio, MS Office 2010, MS Access 2010, Tableau, Informatica, Microsoft Excel, Data Warehouse.