
Data Scientist Resume


Chicago, IL

SUMMARY

  • 8+ years of experience in IT, including 5+ years as a Data Scientist, with strong technical expertise, business experience, and communication skills to drive high-impact business outcomes through data-driven innovations and decisions.
  • Hands-on experience with Spark MLlib utilities such as classification, regression, clustering, collaborative filtering, and dimensionality reduction (a minimal example follows this summary).
  • Extensive experience in text analytics, developing statistical machine learning and data mining solutions to various business problems, and generating data visualizations using R, Python, and Tableau.
  • Strong knowledge of statistical methods (regression, time series, hypothesis testing, randomized experiments), machine learning, algorithms, data structures, and data infrastructure.
  • Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
  • Extensive hands-on experience and high proficiency with structured, semi-structured, and unstructured data, using a broad range of data science programming languages and big data tools including R, Python, Spark, SQL, scikit-learn, and Hadoop MapReduce.
  • Expertise in the implementation of core Java and JEE technologies: JSP, Servlets, JSTL, EJB, JMS, Struts, Spring, Hibernate, JDBC, XML, Web Services, and JNDI.
  • Extensive experience working in Test-Driven Development and Agile-Scrum development.
  • Experience working on Windows, Linux, and UNIX platforms, including programming and debugging skills in UNIX shell scripting.
  • Flexible with Unix/Linux and Windows environments, working with operating systems such as CentOS 5/6, Ubuntu 13/14, and Cosmos.
  • Defined job flows in the Hadoop environment using tools like Oozie for data scrubbing and processing.
  • Experience in data migration from existing data stores to Hadoop.
  • Developed MapReduce programs to perform data transformation and analysis.
  • Experience in analyzing data with Hive and Pig using schema-on-read.
  • Created development environments in Amazon Web Services using services such as VPC, ELB, EC2, ECS, and RDS.
  • Strong experience in the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies.
  • Proficient in data science programming in R, Python, and SQL.
  • Proficient in SQL, databases, data modeling, data warehousing, ETL, and reporting tools.
  • Strong knowledge of NoSQL databases such as HBase, Cassandra, and MongoDB, and their integration with Hadoop clusters.
  • Proficient in using AJAX for implementing dynamic web pages.
  • Solid team player, team builder, and excellent communicator.
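
A minimal sketch of the Spark MLlib usage summarized above, using the DataFrame-based pyspark.ml API; the data, column names, and app name are illustrative assumptions, not taken from an actual project:

    # Sketch: train and apply a Spark ML logistic regression model.
    # Data, column names, and app name are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()
    df = spark.createDataFrame(
        [(1.0, 2.0, 0.0), (2.0, 1.0, 0.0), (5.0, 6.0, 1.0), (6.0, 5.0, 1.0)],
        ["f1", "f2", "label"],
    )
    assembled = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)
    model = LogisticRegression(featuresCol="features", labelCol="label").fit(assembled)
    model.transform(assembled).select("label", "prediction").show()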

TECHNICAL SKILLS

Languages: Java 8, Python, R

Packages: ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, twitteR, NLP, reshape2, rjson, plyr, pandas, NumPy, seaborn, SciPy, matplotlib, scikit-learn, Beautiful Soup, rpy2, SQLAlchemy.

Web Technologies: JDBC, HTML5, DHTML, XML, CSS3, Web Services, WSDL

Tools: Erwin r9.6/9.5/9.1/8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner.

Big Data Technologies: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka.

Databases: SQL, Hive, Impala, Pig, Spark SQL, SQL Server, MySQL, MS Access, HDFS, HBase, Teradata, Netezza, MongoDB, Cassandra.

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio), Tableau, Crystal Reports XI, Business Intelligence, SSRS, Business Objects 5.x/6.x, Cognos 7.0/6.0.

ETL Tools: Informatica PowerCenter, SSIS.

Version Control Tools: SVN, GitHub.

Methodologies: Ralph Kimball and Bill Inmon data warehousing methodology, Rational Unified Process (RUP), Rapid Application Development (RAD), Joint Application Development (JAD).

BI Tools: Tableau, Tableau Server, Tableau Reader, SAP Business Objects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure Data Warehouse

Operating Systems: Windows, Linux, UNIX, macOS, Red Hat.

PROFESSIONAL EXPERIENCE

Confidential, Dallas, TX

Data Scientist

Responsibilities:

  • Identified the customer and account attributes required for MDM implementation from disparate sources and prepared detailed documentation.
  • Performed data profiling and analysis on the different source systems required for the Customer Master.
  • Worked closely with the Data Governance Office team in assessing the source systems for project deliverables.
  • Used Confidential-SQL queries to pull data from disparate systems and the data warehouse in different environments.
  • Used data quality validation techniques to validate Critical Data Elements (CDEs) and identified various anomalies (a minimal validation sketch follows this list).
  • Extensively used the open-source tools RStudio (R) and Spyder (Python) for statistical analysis and building machine learning models.
  • Involved in defining source-to-target data mappings, business rules, and data definitions.
  • Presented DQ analysis reports and scorecards on all validated data elements to the business teams and stakeholders.
  • Performed data validation and data reconciliation between disparate source and target systems (Salesforce, Cisco UIC, Cognos, data warehouse) for various projects.
  • Interacted with business teams and project managers to clearly articulate anomalies, issues, and findings during data validation.
  • Wrote complex SQL queries to validate the data against different kinds of reports generated by Cognos.
  • Extracted data from different databases per the business requirements using SQL Server Management Studio.
  • Interacted with the ETL and BI teams to understand and support various ongoing projects.
  • Extensively used MS Excel for data validation.
  • Generated weekly and monthly reports for various business users according to the business requirements.
  • Manipulated and mined data from database tables (Redshift, Oracle, data warehouse).
  • Provided analytical network support to improve quality and standardize work results.
  • Created statistical models, both distributed and standalone, to build diagnostic, predictive, and prescriptive solutions.
  • Interfaced with other technology teams to extract, transform, and load (ETL) data from a wide variety of data sources.
  • Utilized a broad variety of statistical and big data packages, including SAS, R, MLlib, Hadoop, Spark, and MapReduce.
  • Provided input and recommendations on technical issues to business and data analysts, BI engineers, and data scientists.
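
A minimal sketch of the Critical Data Element validation described in this list, using pandas; the column names and rules are hypothetical (the real CDEs and thresholds were project-specific):

    # Sketch: CDE validation with pandas. Columns and rules are hypothetical.
    import pandas as pd

    def validate_cdes(df: pd.DataFrame) -> pd.DataFrame:
        """Return one row per rule with pass/fail counts."""
        rules = {
            "customer_id_not_null": df["customer_id"].notna(),
            "email_format_ok": df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False),
            "open_date_parses": pd.to_datetime(df["account_open_date"], errors="coerce").notna(),
        }
        return pd.DataFrame(
            [{"rule": name, "passed": int(mask.sum()), "failed": int((~mask).sum())}
             for name, mask in rules.items()]
        )

    # Example: report = validate_cdes(pd.read_sql("SELECT * FROM customer_master", conn))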

Environment: Data Governance, SQL Server, ETL, MS Office Suite - Excel (Pivot Tables, VLOOKUP), DB2, R, Python, Visio, HP ALM, Agile, Spyder, Word, Azure, MDM, SharePoint, Data Quality, Tableau, and Reference Data Management.

Confidential, Chicago, IL

Data Scientist

Responsibilities:

  • Utilized the engine to increase user lifetime by 45% and triple user conversations for target categories.
  • Worked on analyzing data from Google Analytics, AdWords, Facebook, etc.
  • Evaluated models using cross-validation, the log loss function, and ROC curves, and used AUC for feature selection; worked with Elastic technologies such as Elasticsearch and Kibana.
  • Performed data profiling to learn about user behavior with various features such as traffic pattern, location, date, and time.
  • Categorized comments from different social networking sites into positive and negative clusters using sentiment analysis and text analytics.
  • Used Python scripts to update content in the database and manipulate files.
  • Used dplyr (R) and pandas (Python) for exploratory data analysis.
  • Applied multinomial logistic regression, decision tree, random forest, and SVM models to classify whether a package would be delivered on time for a new route.
  • Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database, and used ETL for data transformation.
  • Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
  • Explored DAGs, their dependencies, and logs using Airflow pipelines for automation.
  • Performed data cleaning and feature selection using the MLlib package in PySpark, and worked with deep learning frameworks such as Caffe and Neon.
  • Developed Spark/Scala, R, and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows big data resources.
  • Used the K-Means clustering technique to identify outliers and to classify unlabeled data.
  • Tracked operations with Airflow sensors until certain criteria were met.
  • Responsible for data mapping activities from source systems to Teradata using utilities such as TPump, FEXP, BTEQ, MLOAD, and FLOAD.
  • Analyzed traffic patterns by calculating autocorrelation at different time lags.
  • Ensured that the model had a low false positive rate; performed text classification and sentiment analysis on unstructured and semi-structured data.
  • Addressed overfitting by implementing regularization methods such as L1 and L2 (see the sketch after this list).
  • Used Principal Component Analysis in feature engineering to analyze high-dimensional data.
  • Used MLlib, Spark's machine learning library, to build and evaluate different models.
  • Implemented a rule-based expert system from the results of exploratory analysis and information gathered from people in different departments.
  • Designed and created reports that use gathered metrics to infer and draw logical conclusions about past and future behavior.
  • Developed a MapReduce pipeline for feature extraction using Hive and Pig.
  • Communicated results to the operations team to support better decisions.
  • Collected data needs and requirements by interacting with other departments.
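
A minimal sketch of the regularization and evaluation approach described above (L1/L2-penalized logistic regression scored with cross-validated AUC and log loss), using scikit-learn on placeholder arrays standing in for the real feature matrix:

    # Sketch: compare L1 and L2 regularization with cross-validated AUC / log loss.
    # X and y are placeholder arrays; the real features came from the Hadoop/Oracle extracts.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))
    y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

    for penalty, solver in [("l1", "liblinear"), ("l2", "lbfgs")]:
        model = LogisticRegression(penalty=penalty, C=1.0, solver=solver, max_iter=1000)
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        log_loss = -cross_val_score(model, X, y, cv=5, scoring="neg_log_loss").mean()
        print(f"{penalty}: AUC={auc:.3f}, log loss={log_loss:.3f}")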

Environment: Python 2.x, CDH5, HDFS, Hadoop 2.3, Hive, Impala, AWS, Linux, Spark, Tableau Desktop, SQL Server 2014, Microsoft Excel, MATLAB, Spark SQL, PySpark.

Confidential, Washington, District of Columbia

Data Analyst

Responsibilities:

  • Worked with the BI team in gathering report requirements and used Sqoop to move data into HDFS and Hive.
  • Involved in the following phases of analytics using R, Python, and Jupyter notebooks:
  • Data collection and treatment: analyzed existing internal and external data, addressed entry and classification errors, and defined criteria for missing values.
  • Data mining: used cluster analysis to identify customer segments (a minimal clustering sketch follows this list).
  • Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
  • Assisted with data capacity planning and node forecasting.
  • Installed, configured, and managed the Flume infrastructure.
  • Administered Pig, Hive, and HBase, installing updates, patches, and upgrades.
  • Worked closely with the claims processing team to identify patterns in the filing of fraudulent claims.
  • Performed a major upgrade of the cluster from CDH3u6 to CDH4.4.0.
  • Developed MapReduce programs to extract and transform the data sets, and exported the results back to the RDBMS using Sqoop.
  • Observed patterns in fraudulent claims using text mining in R and Hive.
  • Exported the required information to the RDBMS using Sqoop to make the data available to the claims processing team for processing claims.
  • Developed MapReduce programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables in the EDW.
  • Adept in statistical programming languages such as R and Python, as well as big data technologies such as Hadoop and Hive.
  • Experience working as a data engineer, big data Spark developer, front-end developer, and research assistant.
  • Created tables in Hive and loaded the structured data resulting from MapReduce jobs.
  • Developed many queries using HiveQL and extracted the required information.
  • Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
  • Responsible for importing data (mostly log files) from various sources into HDFS using Flume.
  • Enabled speedy reviews and first-mover advantages by using Oozie to automate data loading into the Hadoop Distributed File System and Pig to pre-process the data.
  • Provided design recommendations and thought leadership to sponsors/stakeholders that improved review processes and resolved technical problems.
  • Managed and reviewed Hadoop log files.
  • Tested raw data and executed performance scripts.
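
A minimal sketch of the cluster-analysis step used for customer segmentation, with hypothetical numeric features (the real segmentation variables came from the Hive/EDW extracts):

    # Sketch: K-Means customer segmentation on hypothetical features.
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Placeholder customer features; real inputs came from Hive/EDW extracts.
    customers = pd.DataFrame({
        "claims_filed":  [1, 0, 3, 7, 2, 0, 5, 8],
        "tenure_years":  [2, 5, 1, 0.5, 4, 7, 1, 0.3],
        "avg_claim_amt": [200, 0, 900, 4200, 350, 0, 2500, 5100],
    })

    X = StandardScaler().fit_transform(customers)
    customers["segment"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
    print(customers.groupby("segment").mean())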

Environment: HDFS, Pig, Hive, MapReduce, Linux, HBase, Flume, Sqoop, R, VMware, Eclipse, Cloudera, Python.

Confidential, San Francisco, CA

Python Developer

Responsibilities:

  • Developed a portal to manage entities in a content management system using Flask.
  • Designed the database schema for the content management system.
  • Designed email marketing campaigns and created responsive web forms that saved data into a database using Python and the Django framework.
  • Worked on Hadoop single-node, Apache Spark, and Hive installations.
  • Developed views and templates in Django to create a user-friendly website interface.
  • Configured Django to manage URLs and application parameters.
  • Supported MapReduce programs running on the cluster.
  • Worked with CSV files while getting input from the MySQL database.
  • Wrote programs for performance calculations using NumPy and SQLAlchemy.
  • Administered and monitored a multi-datacenter Cassandra cluster based on an understanding of the Cassandra architecture.
  • Worked extensively with Informatica in designing and developing ETL processes to load data from XML sources into the target database.
  • Designed and automated the installation and configuration of secure DataStax Enterprise Cassandra using Chef.
  • Wrote Python scripts to parse XML documents and load the data into the database (a minimal sketch follows this list).
  • Worked in stages such as analysis and design, development, testing, and debugging.
  • Built more user-interactive web pages using jQuery plugins for drag-and-drop and autocomplete, along with JSON, AngularJS, and JavaScript.
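
A minimal sketch of a script of the kind described above for parsing XML and loading the data into MySQL via SQLAlchemy; the file name, element names, connection string, and target table are hypothetical:

    # Sketch: parse an XML document and load rows into MySQL via SQLAlchemy.
    # File name, element/field names, DSN, and target table are hypothetical.
    import xml.etree.ElementTree as ET
    from sqlalchemy import create_engine, text

    engine = create_engine("mysql+pymysql://user:pass@localhost/appdb")  # placeholder DSN

    def load_xml(path: str) -> int:
        root = ET.parse(path).getroot()
        rows = [
            {"name": item.findtext("name"), "value": item.findtext("value")}
            for item in root.iter("item")
        ]
        if rows:
            with engine.begin() as conn:
                conn.execute(
                    text("INSERT INTO items (name, value) VALUES (:name, :value)"),
                    rows,
                )
        return len(rows)

    # Example: print(load_xml("feed.xml"), "rows loaded")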

Environment: Python 2.7, Windows, MySQL, ETL, Ansible, Flask, and Python libraries such as NumPy, SQLAlchemy, and MySQLdb; AngularJS.

Confidential

SAS Programmer

Responsibilities:

  • Analyzed high-volume, high-dimensional client and survey data from different sources using SAS and R.
  • Manipulated large financial datasets, primarily in SQL and R.
  • Used R for large matrix computations.
  • Developed algorithms (data mining queries) to extract data from the data warehouse and databases to build rules for the analyst and models team.
  • Used R to import high volumes of data.
  • Highly efficient in the use of statistical modeling tools such as SAS, SPSS, and R.
  • Developed predictive models in R to predict customer churn and classify customers (an illustrative sketch follows this list).
  • Worked on a Shiny (R) application displaying machine learning results to improve business forecasting.
  • Developed, reviewed, tested, and documented SAS programs and macros.
  • Created templates using SAS macros for existing reports to reduce manual intervention.
  • Created self-service data retrieval tools for the onshore/offshore teams.
  • Worked on daily reports and used them for further analysis.
  • Developed and designed templates for new data extraction requests.
  • Executed weekly reports for the Commercial Data Analytics team.
  • Communicated progress to key business partners and analysts through status reports and tracked issues until resolution.
  • Created predictive and other analytically derived models for assessing sales.
  • Provided support in the design and implementation of ad hoc requests for sales-related portfolio data.
  • Responsible for preparing test case documents and technical specification documents.
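
The churn models above were built in R; purely as an illustration, the same idea expressed in Python (scikit-learn) with hypothetical customer fields looks roughly like this:

    # Illustrative only: the original churn models were developed in R/SAS.
    # Fields and values are hypothetical; real inputs came from warehouse extracts.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    df = pd.DataFrame({
        "tenure_months": [3, 40, 12, 60, 5, 24, 2, 48],
        "monthly_spend": [80, 20, 55, 15, 95, 40, 110, 25],
        "support_calls": [4, 0, 2, 1, 5, 1, 6, 0],
        "churned":       [1, 0, 1, 0, 1, 0, 1, 0],
    })

    features = df.drop(columns="churned")
    model = LogisticRegression(max_iter=1000).fit(features, df["churned"])
    df["churn_prob"] = model.predict_proba(features)[:, 1]
    print(df[["tenure_months", "monthly_spend", "churn_prob"]])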

Confidential

SAS Developer/Analyst

Responsibilities:

  • Integrated all transaction data from multiple data sources used by Actuarial into a single repository.
  • Implemented and executed monthly incremental updates to the data environment.
  • Interacted with IT and Finance and executed data validation tie-out reports.
  • Developed new programs and modified existing programs, passing SAS macro variables to improve ease, efficiency, and consistency of results.
  • Created data transformation and data loading (ETL) scripts for data warehouses.
  • Implemented a fully automated data flow into Actuarial front-end (Excel) models using SAS processes.
  • Created SAS programs using SAS DI Studio.
  • Validated the entire data process using SAS and BI tools.
  • Extensively used PROC SQL for column modifications and field population on warehouse tables.
  • Developed distinct OLAP cubes from SAS datasets and exported the results into Excel sheets.
  • Involved in discussions with business users to define metadata for tables to perform ETL process.

