- Around 8 years of IT experience as a Data Scientist, with deep expertise in statistical data analysis, including transforming business requirements into analytical models and designing algorithms and strategic solutions that scale across massive volumes of data.
- Proficient in Statistical Methods like Regression models, hypothesis testing, confidence intervals, principal component analysis and dimensionality reduction.
- Experience in Data Analysis, Data Migration, Data Cleansing, Transformation, Integration, Data Import, and Data Export using multiple ETL tools such as Ab Initio and Informatica PowerCenter.
- Worked extensively with Spark using Scala on a cluster for computational analytics; installed it on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle.
- Expert in R and Python scripting. Used NumPy for statistical functions, Matplotlib/Seaborn for visualization, and Pandas for organizing data.
- Experience in using various packages in R and Python like ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, twitteR, NLP, reshape2, rjson, plyr, SciPy, scikit-learn, Beautiful Soup, rpy2.
- Installing, configuring and maintaining Continuous Integration, Automation and Configuration Management tools
- Extensive experience in Text Analytics, generating data visualizations using R and Python, and creating dashboards using tools like Tableau.
- Experience in creating stored procedures and functions in SQL Server to import data into Elasticsearch and convert relational data into documents.
- Experience in writing code in R and Python to manipulate data for data loads, extracts, statistical analysis, modeling, and data munging.
- Utilized analytical applications like R, SPSS, Rattle and Python to identify trends and relationships between different pieces of data, draw appropriate conclusions and translate analytical findings into risk management and marketing strategies that drive value.
- Highly skilled in using visualization tools like Tableau, ggplot2 and d3.js for creating dashboards.
- Experience in setting up the whole app stack, and in setting up and debugging Logstash to send Apache logs to AWS Elasticsearch.
- Continuous Integration-Continuous Delivery: Administered and implemented CI tools like Atlassian Bamboo and AnthillPro for automated builds. Automated CI-CD pipelines using Bamboo/Jenkins, Rundeck, AnthillPro, and UrbanCode Deploy.
- Professional working experience in Machine Learning algorithms such as LDA, linear regression, logistic regression, Naive Bayes, Decision Trees, Clustering, and Principal Component Analysis.
- Hands-on experience with big data tools like Hadoop, Spark, Hive, Pig, Impala, PySpark, Spark SQL.
- Experience working with data modeling tools like Erwin, Power Designer and ER Studio.
- Experience in designing star schema, Snowflake schema for Data Warehouse, ODS Architecture.
- Good knowledge in Database Creation and maintenance of physical data models with Oracle, Teradata, Netezza, DB2, MongoDB, HBase and SQL Server databases.
- Experienced in writing complex SQL queries, including stored procedures, triggers, joins, and subqueries.
- Experience with proofs of concept (PoCs) and gap analysis; gathered the necessary data for analysis from various sources and prepared it for data exploration using data munging and Teradata.
- Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables and OLAP reporting.
- Able to work with managers and executives to understand business objectives and deliver according to business needs; a firm believer in teamwork.
- Experience and domain knowledge in various industries such as healthcare, insurance, retail, banking, media and technology.
- Work closely with customers, cross-functional teams, research scientists, software developers, and business teams in an Agile/Scrum work environment to drive data model implementations and algorithms into practice.
- Strong written and oral communication skills for giving presentations to non-technical stakeholders.
Bigdata/Hadoop Technologies: Hadoop, HDFS, YARN, MapReduce, Hive, Pig, Impala, Sqoop, Flume, Spark, Kafka, Storm, Drill, Zookeeper
Languages: HTML5, DHTML, WSDL, CSS3, C, C++, XML, R/R Studio, SAS Enterprise Guide, SAS, R (Caret, Weka, ggplot), Perl, MATLAB, Mathematica, FORTRAN, DTD, Schemas, JSON, Ajax, Java, Scala, Python (NumPy, SciPy, Pandas, Gensim, Keras), JavaScript, Shell Scripting
NoSQL Databases: Cassandra, HBase, MongoDB, MariaDB
Business Intelligence Tools: Tableau Server, Tableau Reader, Tableau, Splunk, SAP Business Objects, OBIEE, SAP Business Intelligence, QlikView, Amazon Redshift, Azure Data Warehouse
Deployment Tools: Bamboo, Jenkins, Rundeck.
Development Tools: Microsoft SQL Studio, IntelliJ, Eclipse, NetBeans.
Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall
Build Tools: Jenkins, Toad, SQL Loader, Maven, ANT, RTC, RSA, Control-M, Oozie, Hue, SOAP UI
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos 7.0/6.0.
Databases: Microsoft SQL Server 2008, 2010/2012, MySQL 4.x/5.x, Oracle 11g, 12c, DB2, Teradata, Netezza
Operating Systems: All versions of Windows, UNIX, LINUX, Macintosh HD, Sun Solaris
Confidential, Minneapolis, MN
- Extracted data from HDFS and prepared data for exploratory analysis using data munging
- Built models using Statistical techniques like Bayesian HMM and Machine Learning classification models like XGBoost, SVM, and Random Forest.
- Participated in all phases of data mining, data cleaning, data collection, developing models, validation, visualization and performed Gap analysis.
- A highly immersive Data Science program involving Data Manipulation & Visualization, Web Scraping, Machine Learning, Python programming, SQL, Git, MongoDB, and Hadoop.
- Set up storage and data analysis tools in AWS cloud computing infrastructure.
- Installed and used Caffe Deep Learning Framework.
- Developed Spark scripts by using Scala IDE as per the business requirement.
- Developed models on Scala and Spark for users, including prediction models and sequential algorithms.
- Worked on different data formats such as JSON and XML, and applied machine learning algorithms in Python.
- Configured, implemented, and automated Continuous Integration and Deployment pipelines for software delivery teams utilizing Jenkins and other supporting tools.
- Worked with Data Architects and IT Architects to understand the movement of data and its storage, using ER Studio 9.7.
- Used pandas, numpy, seaborn, matplotlib, scikit-learn, scipy, NLTK in Python for developing various machine learning algorithms.
- Performed the ongoing delivery, migrating client mini data warehouses or functional data marts from different environments to MS SQL Server.
- Developed SSIS packages to export data from Excel (Spreadsheets) to SQL Server, automated all the SSIS packages and monitored errors using SQL Job daily.
- Developed Hive queries and UDFs to analyze/transform the data in HDFS.
- Data Manipulation and Aggregation from different source using Nexus, Business Objects, Toad, Power BI and Smart View.
- Experience in handling multiple relational databases like SQL Server and Oracle.
- Implemented Agile Methodology for building an internal application.
- Focused on integration overlap and Informatica's newer commitment to MDM following the acquisition of Identity Systems.
- Good knowledge of Spark components like Spark SQL, MLlib, Spark Streaming and GraphX.
- Worked extensively with Spark Streaming and Apache Kafka to fetch live stream data.
- Implemented a novel algorithm for the test and control team using Spark/Scala, Oozie, HDFS and Python on a P&G YARN cluster.
- Skilled in using dplyr in R and pandas in Python for performing exploratory data analysis.
- Developed a scalable model using Spark (RDD, MLlib, ML, DataFrames) in Scala.
- Integrated Tesseract and Ghostscript with Spark to access data in HDFS and save data to Hive tables.
- Programmed a utility in Python that used multiple packages (numpy, scipy, pandas)
- Implemented Classification using supervised algorithms like Logistic Regression, Decision trees, Naive Bayes, KNN.
- Imported and exported data into HDFS and Hive using Sqoop.
- As Architect, delivered various complex OLAP databases/cubes, scorecards, dashboards and reports.
- Updated Python scripts to match training data with our database stored in AWS CloudSearch, so that we would be able to assign each document a response label for further classification.
- Used Teradata utilities such as FastExport and MultiLoad (MLOAD) for data migration/ETL tasks from OLTP source systems to OLAP target systems.
- Performed data transformation from various sources, data organization, and feature extraction from raw and stored data.
- Validated the machine learning classifiers using ROC Curves and Lift Charts.
Environment: Python, BI, ER Studio 9.7, JSON, XML, XGBoost, CI/CD, HDFS, OLTP, Scala, etc.
- Responsible for analyzing large data sets to develop multiple custom models and algorithms to drive innovative business solutions.
- Performed preliminary data analysis and handled anomalies such as missing values, duplicates, outliers, and irrelevant data, applying imputation where appropriate.
- Removed outliers using proximity (distance-based) and density-based techniques.
- Involved in Analysis, Design and Implementation/translation of Business User requirements.
- Experienced in using supervised, unsupervised and regression techniques in building models.
- Created an automated build process using the Jenkins CI tool.
- Performed Market Basket Analysis to identify groups of assets moving together and advised the client on the associated risks.
- Determined trends and significant data relationships using advanced Statistical Methods.
- Implemented techniques like forward selection, backward elimination and the stepwise approach for selecting the most significant independent variables.
- Applied feature selection and feature extraction (dimensionality reduction) methods to identify significant variables.
- Used RMSE score, confusion matrix, ROC, cross-validation and A/B testing to evaluate model performance in both simulated and real-world environments.
- Performed Exploratory Data Analysis using R. Also involved in generating various graphs and charts for analyzing the data using Python Libraries.
- Involved in the execution of multiple business plans and projects, ensuring business needs were met; interpreted data to identify trends that carry across future data sets.
- Developed interactive dashboards, created various Ad Hoc reports for users in Tableau by connecting various data sources.
Environment: Python, SQL server, Hadoop, HDFS, HBase, MapReduce, Hive, Impala, Pig, Sqoop, Mahout, Spark MLlib, MongoDB, Tableau, ETL, Unix/Linux.
Confidential, Pleasanton, CA
- Responsible for applying machine learning techniques (regression/classification) to predict outcomes.
- Responsible for design and development of advanced R/Python programs to prepare, transform, and harmonize data sets in preparation for modeling.
- Created data visualization with ggplot2 in R to understand annual sales pattern.
- Applied concepts of probability, distribution and statistical inference on given dataset to unearth interesting findings through use of comparison, T-test, F-test, R-squared, P-value etc.
- Applied linear regression, multiple regression, the ordinary least squares method, mean-variance, the law of large numbers, logistic regression, dummy variables, residuals, the Poisson distribution, Bayes, Naive Bayes, function fitting, etc. to data with the help of the scikit-learn, SciPy, NumPy and Pandas modules of Python.
- Applied clustering algorithms, i.e. hierarchical and K-means, with the help of scikit-learn and SciPy, and developed visualizations and dashboards using ggplot2 and Tableau.
- Used Python and R scripting to wrangle and aggregate a raw dataset consisting of 2+ million records in inconsistent formats, using functions such as is.na and median and filters like which().
- Reset the data frame index in R for misaligned data and generated qplot charts for data visualization.
- Developed large data sets from structured and unstructured data. Performed data mining.
- Partnered with modelers to develop data frame requirements for projects and converted vector data into matrices using the rbind() and cbind() functions.
- Performed Ad-hoc reporting/customer profiling, segmentation using R/Python.
- Tracked various campaigns, generating customer profiling analysis and data manipulation.
- Provided R/SQL programming, with detailed direction, in the execution of data analysis that contributed to the final project deliverables. Responsible for data mining.
- Analyzed large datasets to answer business questions by generating reports and outcome.
- Worked in a team of programmers and data analysts to develop insightful deliverables that support data-driven marketing strategies.
- Executed SQL queries from R/Python on complex table configurations.
- Retrieved data from the database through SQL as per business requirements.
- Prepared data frames using the gsub() function in R to identify missing data for production data analysis.
- Created, maintained, modified and optimized SQL Server databases and troubleshot server problems.
- Collected, cleaned, filtered and transformed data into the specified format.
- Prepared the workspace for Markdown.
- Accomplished Data analysis, statistical analysis, generated reports, listings, and graphs.
- Worked on R and Python to identify business performance via Classification, tree map, and regression models along with visualizing data for interactive understanding and decision-making.
- Documented all programs and procedures to ensure an accurate historical record of work completed on an assigned project, which improved quality and efficiency of process by 15%.
- Adhering to best practices for project support and documentation.
- Managed reporting and dashboarding for the key metrics of the business.
Environment: MS Excel, PL/SQL, R, Python, SAS, SQL, MS Word, Hadoop, and Tableau
Confidential, Santa Ana, CA
- Worked on Spark SQL queries and DataFrames: imported data from data sources, performed transformations and read/write operations, and saved the results to an output directory in HDFS.
- Installed Kafka on a Hadoop cluster and used it for streaming and cleansing of raw data; extracted useful information using Hive and stored the results in HBase.
- Used R and Python to identify chemical performance via Classification, tree map and regression models along with visualizing data for interactive understanding and decision-making.
- Identified outliers, anomalies and trends in any given data sets by using R.
- Provided daily change management process support, ensuring that all changes to program baselines were properly documented and approved; maintained, managed and issued change schedules.
- Developed, installed, maintained and monitored company databases in high performance/high availability environment with supported configuration, performance tuning to ensure optimal resource usage.
- Documented all programs and procedures to ensure an accurate historical record of work completed on assigned project as well as to improve quality and efficacy
- Produced quality reports for management decision-making and participated in all phases of research including data collection, data cleaning, data mining, developing models and visualizations.
- Performed data imputation using Scikit-learn package in Python.
- Performed data processing using Python libraries like NumPy and Pandas, and performed data analysis using the ggplot2 library in R to create data visualizations for better understanding of customers' behaviors.
Environment: Machine learning, AWS, MS Azure, Cassandra, Spark, HDFS, Hive, Pig, Linux, Python (Scikit-Learn/Scipy/ Numpy/Pandas), R, SAS, SPSS, MySQL, Eclipse, PL/SQL, SQL connector, Tableau 14.
- Worked with several R packages including dplyr, Spark, Causal Infer, spacetime.
- Implemented end-to-end systems for Data Analytics, Data Automation and integrated with custom visualization tools using R and Hadoop.
- Gathering all the data that is required from multiple data sources and creating datasets that will be used in analysis.
- Worked with XML files, extracting tag information using XPath and Scala XML libraries from compressed blob datatypes.
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done using Python (PySpark).
- Developed Spark jobs using Scala on top of YARN/MRv2 for interactive and batch analysis.
- Performed Exploratory Data Analysis and Data Visualizations using R and Tableau.
- Worked with data governance, data quality, and data lineage teams and data architects to design various models and processes.
- Independently coded new programs and designed tables to load and test the programs effectively for the given PoCs using Big Data/Hadoop.
- Designed data models and data flow diagrams using Erwin and MS Visio.
- Reviewed the logical model with business users, the ETL team, DBAs and the testing team to provide information about the data model and business requirements.
- Worked extensively in Oracle SQL, PL/SQL, SQL*Loader, and query performance tuning; created DDL scripts and database objects like Tables, Views, Indexes, Synonyms and Sequences.
- Strong programming skills using R, Elastic Search & Machine Learning Algorithms.
- Designed and implemented machine learning algorithms to enhance existing data mining capabilities.
- Used a variety of analytical tools and techniques (regression, logistic, GLM, decision trees, machine learning etc.) to carry out analysis and derive conclusions.
- Visualized, interpreted, and reported findings and developed strategic uses of data.
Environment: Unix, Python 3.5, MLlib, SAS, regression, logistic regression, Hadoop 2.7, NoSQL, Teradata, OLTP, random forest, OLAP, HDFS, ODS, NLTK, SVM, JSON, XML.
- Involved in the design, development and testing phases of application using AGILE methodology.
- Designed and maintained databases using Python and developed a Python-based API (RESTful Web Service) using Flask, SQLAlchemy and PostgreSQL.
- Participated in requirement gathering and worked closely with the architect in designing and modeling.
- Worked on RESTful web services that enforced a stateless client-server model and supported JSON, with a few changes made from SOAP to RESTful technology. Involved in detailed analysis based on the requirement documents.
- Involved in writing SQL queries implementing functions, triggers, cursors, object types, sequences, indexes etc.
- Created and managed all hosted or local repositories through the SourceTree Git client, and collaborated using the Git command line and Stash.
- Responsible for setting up Python REST API framework and Spring framework using Django.
- Developed consumer-based features and applications using Python, Django, HTML, Behavior-Driven Development (BDD) and pair programming.
- Designed and developed components using Python with Django framework. Implemented code in python to retrieve and manipulate data.
- Involved in development of the enterprise social network application using Python, Twisted, and Cassandra.
- Used Python and Django for creating graphics, XML processing of documents, data exchange and business logic implementation between servers. Worked closely with back-end developers to find ways to push the limits of existing Web technology.
- Designed and developed the UI for the website with HTML, XHTML, CSS, JavaScript and AJAX.
- Used AJAX and JSON communication for accessing RESTful web services data payload.
- Created and implemented SQL Queries, Stored procedures, Functions, Packages and Triggers in SQL Server.
- Successfully implemented Auto Complete/Auto Suggest functionality using jQuery, Ajax, Web Service and JSON.
Environment: Python 2.5, Java/J2EE, Django 1.0, HTML, CSS, Linux, Shell Scripting, JavaScript, Ajax, jQuery, JSON, XML, PostgreSQL, Jenkins, ANT, Maven, Subversion