
Sr. Data Scientist/Architect Resume


Dallas, TX

SUMMARY

  • Over 9 years of hands-on experience and comprehensive industry knowledge of Machine Learning, Statistical Modeling, Predictive Modeling, Data Analytics, Data Modeling, Data Architecture, Data Analysis, Data Mining, Text Mining and Natural Language Processing (NLP), Artificial Intelligence algorithms, Business Intelligence, and analytics models (such as Decision Trees and Linear & Logistic Regression), using Hadoop (Hive, Pig, MapReduce), R, Python, Spark, Scala, AWS (EC2, S3, Redshift), MS Excel, SQL, PostgreSQL, and Erwin.
  • Experienced in utilizing analytical applications like R, SPSS, and Python to identify trends and relationships between different pieces of data, draw appropriate conclusions, and translate analytical findings into risk management and marketing strategies that drive value.
  • Experienced in designing Star schemas (identification of facts, measures, and dimensions) and Snowflake schemas for Data Warehouse and ODS architectures, using tools such as Erwin Data Modeler, PowerDesigner, ER/Studio, and Microsoft Visio.
  • Expertise with tools in the Hadoop ecosystem, including Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Kafka, YARN, Oozie, and ZooKeeper.
  • Extensive experience in Text Analytics, developing statistical machine learning and data mining solutions to various business problems, generating data visualizations using R and Python, and creating dashboards using tools like Tableau.
  • Experienced in designing and building a Data Lake using Hadoop and its ecosystem components.
  • Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Expertise in applying data mining and optimization techniques in B2B and B2C industries; proficient in Machine Learning, Data/Text Mining, Statistical Analysis, and Predictive Modeling (an illustrative model-fitting sketch follows this list).
  • Experienced in writing Spark Streaming and Spark batch jobs using Spark MLlib for analytics, with hands-on experience in clustering algorithms such as K-means and K-medoids as well as predictive algorithms.
  • Experienced in writing and optimizing SQL queries in Oracle, SQL Server, DB2, Netezza, and Teradata.
  • Experienced Data Modeler with conceptual, logical, and physical data modeling skills, data profiling, and data quality maintenance on Teradata 15/14; experienced with JAD sessions for requirements gathering, creating data mapping documents, and writing functional specifications and queries.
  • Expertise in Model Development, Data Mining, Predictive Modeling, Data Visualization, Data Cleaning and Management, and Database Management.
  • Proficient in Hadoop, Hive, MapReduce, Pig, and NoSQL databases like MongoDB, HBase, and Cassandra.
  • Excellent experience in SQL*Loader, SQL data modeling, reporting, and SQL database development; loaded data from legacy systems into Oracle databases using control files and used the Oracle External Tables feature to read data from flat files into Oracle staging tables.
  • Excellent knowledge of Machine Learning, Mathematical Modeling, and Operations Research; comfortable with R, Python, SAS, Weka, MATLAB, and relational databases, with a deep understanding of and exposure to the Big Data ecosystem.
  • Experienced in Data Modeling, covering RDBMS concepts, logical and physical data modeling through Third Normal Form (3NF), and multidimensional data modeling schemas (Star schema, Snowflake modeling, facts, and dimensions).
  • Good knowledge of database creation and maintenance of physical data models with Oracle, Teradata, Netezza, DB2, and SQL Server databases.
  • Expertise in Excel macros, pivot tables, VLOOKUPs, and other advanced functions; expert R user with knowledge of the statistical programming language SAS.
  • Experience in Data Analysis, Data Migration, Data Cleansing, Transformation, Integration, Data Import, and Data Export using multiple ETL tools such as Ab Initio and Informatica Power Center.
  • Experience in SQL and good knowledge of PL/SQL programming, including developing stored procedures and triggers; also worked with DataStage, DB2, UNIX, Cognos, MDM, Hadoop, and Pig.
  • Expertise in data acquisition, storage, analysis, integration, predictive modeling, logistic regression, decision trees, data mining methods, forecasting, factor analysis, cluster analysis, and other advanced statistical techniques.
  • Very good knowledge of Data Analysis, Data Validation, Data Cleansing, Data Verification, and identifying data mismatches.
  • Excellent experience with Teradata SQL queries, Teradata indexes, and utilities such as MultiLoad, TPump, FastLoad, and FastExport.
  • Strong experience and knowledge in Data Visualization with Tableau, creating line and scatter plots, bar charts, histograms, pie charts, dot charts, box plots, time series, error bars, multiple chart types, multiple axes, subplots, etc.
  • Experienced in database performance tuning and data access optimization, writing complex SQL queries and PL/SQL blocks such as stored procedures, functions, triggers, cursors, and ETL packages.
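
The following minimal Python sketch illustrates the kind of scikit-learn predictive-modeling workflow referenced above; the data, feature count, and model settings are synthetic and hypothetical, shown only to indicate the general pattern rather than any actual project code.

    # Minimal predictive-modeling sketch (synthetic data; illustrative only)
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 10))              # hypothetical feature matrix
    y = (X[:, 0] + X[:, 1] > 0).astype(int)      # hypothetical binary target

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    print("Holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))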

TECHNICAL SKILLS

Data Modeling Tools: Erwin r9.6, 9.5, 9.1, 8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner.

Programming Languages: Oracle PL/SQL, Python, Scala, SQL, T-SQL, UNIX shell scripting.

Scripting Languages: Python (NumPy, SciPy, Pandas, Gensim, Keras), R (Caret, Weka, ggplot)

Big Data Technologies: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka, Sqoop, Oozie, Spark, and Scala.

Reporting Tools: Crystal reports, Business Intelligence, SSRS, Business Objects, Tableau.

ETL: Informatica PowerCenter, SSIS.

Project Execution Methodologies: Ralph Kimball and Bill Inmon data warehousing methodology, Rational Unified Process (RUP), Rapid Application Development (RAD), Joint Application Development (JAD)

BI Tools: Tableau, Tableau Server, Tableau Reader, SAP BusinessObjects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure

Data Warehouse Tools: MS-Office suite (Word, Excel, MS Project and Outlook), Spark MLlib, Scala NLP, MariaDB, Azure, SAS.

Databases: Oracle, Teradata, Netezza, Microsoft SQL Server, MongoDB, HBase, Cassandra.

Operating Systems: Windows, UNIX, MS DOS, Sun Solaris.

PROFESSIONAL EXPERIENCE

Confidential, DALLAS, TX

SR. DATA SCIENTIST/ARCHITECT

Responsibilities:

  • Provided the architectural leadership in shaping strategic, business technology projects, with an emphasis on application architecture.
  • Utilized domain knowledge and application portfolio knowledge to play a key role in defining the future state of large, business technology programs.
  • Participated in all phases of data mining, including data collection, data cleaning, model development, validation, and visualization, and performed gap analysis.
  • Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS and implemented a Python-based distributed random forest via Python streaming.
  • Created ecosystem models (e.g. conceptual, logical, physical, canonical) that are required for supporting services within the enterprise data architecture (conceptual data model for defining the major subject areas used, ecosystem logical model for defining standard business meaning for entities and fields, and an ecosystem canonical model for defining the standard messages and formats to be used in data integration services throughout the ecosystem).
  • Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python to develop various machine learning models, applying algorithms such as linear regression, multivariate regression, naive Bayes, random forests, K-means, and KNN for data analysis.
  • Involved in analyzing data coming from various sources and creating meta-files and control files to ingest the data into the Data Lake, and in configuring batch jobs to perform ingestion of the source files into the Data Lake.
  • Supported data analysis projects using Elastic MapReduce (EMR) on the Amazon Web Services (AWS) cloud and performed export and import of data into S3 (an S3 transfer sketch follows this list).
  • Worked on monitoring and troubleshooting the Kafka-Storm-HDFS data pipeline for real-time data ingestion into the Data Lake on HDFS.
  • Conducted studies and rapid plotting, using advanced data mining and statistical modeling techniques to build solutions that optimize the quality and performance of data.
  • Developed multiple POCs using PySpark and deployed them on the YARN cluster, compared the performance of Spark with Hive and SQL, and was involved in end-to-end implementation of ETL logic.
  • Demonstrated experience in the design and implementation of statistical models, predictive models, enterprise data models, metadata solutions, and data life cycle management in both RDBMS and Big Data environments.
  • Responsible for developing a data pipeline with AWS to extract data from weblogs and store it in HDFS.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala, and Python (a minimal PySpark conversion sketch follows this list).
  • Analyzed large data sets, applied machine learning techniques, and developed and enhanced predictive and statistical models by leveraging best-in-class modeling techniques.
  • Worked on database design, relational integrity constraints, OLAP, OLTP, cubes, normalization (3NF), and denormalization of databases.
  • Created data models for AWS Redshift and Hive from dimensional data models, and worked on data modeling and advanced SQL with columnar databases on AWS.
  • Developed, deployed, and managed several MongoDB clusters while implementing robustness and scalability via sharding and replication, including automating tasks with custom scripts and open-source tools for performance tuning and system monitoring.
  • Implemented data consolidation using Spark and Hive to generate data in the required formats, applying various ETL tasks for data repair, massaging data to identify sources for audit purposes, and data filtering, then storing the results back to HDFS.
  • Worked with various Teradata 15 tools and utilities such as Teradata Viewpoint, MultiLoad, ARC, Teradata Administrator, BTEQ, and other Teradata utilities.
  • Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, and Python, along with a broad variety of machine learning methods including classification, regression, and dimensionality reduction (a streaming-ingest sketch follows this list).
  • Developed Linux shell scripts using the NZSQL/NZLOAD utilities to load data from flat files into the Netezza database.
  • Designed and implemented system architecture for Amazon EC2 based cloud-hosted solution for client.
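
As a hedged illustration of the S3 export/import work above, the sketch below uses boto3; the bucket, key, and file names are placeholders rather than values from any actual project.

    # Hypothetical S3 transfer sketch with boto3 (bucket/key names are placeholders)
    import boto3

    s3 = boto3.client("s3")
    # Export a local extract to S3
    s3.upload_file("daily_extract.csv", "example-analytics-bucket", "exports/daily_extract.csv")
    # Import it back for downstream analysis
    s3.download_file("example-analytics-bucket", "exports/daily_extract.csv", "local_copy.csv")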
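
The following minimal PySpark sketch shows the general pattern of converting a Hive/SQL aggregation into DataFrame transformations; the table and column names are hypothetical, not taken from the engagement.

    # Hypothetical Hive/SQL-to-Spark conversion sketch (table/columns are placeholders)
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("hive-to-spark-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Original HiveQL-style aggregation, issued through Spark SQL
    sql_result = spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

    # Equivalent DataFrame transformation
    df_result = (spark.table("sales")
                 .groupBy("region")
                 .agg(F.sum("amount").alias("total")))

    df_result.show()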
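
For the Spark Streaming and Kafka work mentioned above, a minimal PySpark Structured Streaming sketch of that ingest pattern is shown below; the broker, topic, and HDFS paths are placeholders, and the spark-sql-kafka connector is assumed to be available on the cluster.

    # Hypothetical Kafka-to-HDFS streaming ingest sketch (all names are placeholders)
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events")
              .load()
              .selectExpr("CAST(value AS STRING) AS raw_event"))

    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/lake/events")
             .option("checkpointLocation", "hdfs:///checkpoints/events")
             .start())
    query.awaitTermination()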

ENVIRONMENT: Erwin r9.6, Python, SQL, Oracle 12c, Netezza, SQL Server, Informatica, Java, SSRS, PL/SQL, T-SQL, Tableau, MLlib, regression, cluster analysis, Scala NLP, Spark, Kafka, MongoDB, logistic regression, Hadoop, Hive, Teradata, random forest, OLAP, Azure, MariaDB, SAP CRM, HDFS, ODS, NLTK, SVM, JSON, XML, Cassandra, MapReduce, AWS.

Confidential, CHICAGO, IL

SR. DATA SCIENTIST/ARCHITECT

Responsibilities:

  • Involved in the entire data science project life cycle and actively involved in all the phases, including data extraction, data cleaning, statistical modeling, and data visualization with large datasets of structured and unstructured data.
  • Worked with data compliance and data governance teams to maintain data models, metadata, and data dictionaries, and to define source fields and their definitions.
  • Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.
  • Applied a breadth of knowledge in programming (Python, R); descriptive, inferential, and experimental design statistics; advanced mathematics; and database functionality (SQL, Hadoop).
  • Worked with machine learning algorithms such as logistic regression, random forest, XGBoost, KNN, SVM, neural networks, linear regression, lasso regression, and K-means.
  • Transformed the logical data model into the physical data model in Erwin, ensuring primary key and foreign key relationships in the PDM, consistency of data attribute definitions, and primary index considerations.
  • Worked on real-time data processing using Spark/Storm and Kafka with Scala; wrote Scala programs using Spark on YARN for analyzing data and Spark/Spark SQL for performing aggregations; and developed web services in the Play framework using Scala to build a stream data platform.
  • Developed data science content involving data manipulation and visualization, web scraping, machine learning, Python programming, SQL, Git, and ETL for data extraction.
  • Converted Informatica ETL logic by rewriting it in Spark and Scala, using the Spark DataFrames API for data transformations and ETL jobs, and Spark SQL for processing data to meet BI aggregation and reporting needs.
  • Designed and developed architecture for a data services ecosystem spanning relational, NoSQL, and Big Data technologies.
  • Used Netezza SQL, stored procedures, and NZLoad utilities as part of the DWH appliance framework, and worked with the UNIX team to install the Tidal job scheduler on the QA and production Netezza environments.
  • Developed enhancements to the MongoDB architecture to improve performance and scalability, and worked on MongoDB database concepts such as locking, transactions, indexes, sharding, replication, and schema design.
  • Utilized Ansible and AWS Lambda, Kinesis, ElastiCache, and CloudWatch Logs to automate the creation of a log aggregation pipeline with the Elasticsearch, Logstash, and Kibana (ELK) stack, sending all of the team's logs coming into CloudWatch to be processed and forwarded to Elasticsearch.
  • Developed scripts in Python (Pandas, NumPy) for data ingestion, analysis, and data cleaning (an illustrative cleaning sketch follows this list).
  • Created Hive queries that helped analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics, processing the data using HQL (a SQL-like language) on top of MapReduce.
  • Developed Python scripts to automate and provide control flow to Pig scripts for extracting data and loading it into HDFS (a control-flow sketch follows this list).
  • Designed the ETL process to extract, transform, and load data from the OLTP Oracle database system to the Teradata data warehouse, and worked with BTEQ to submit SQL statements, import and export data, and generate reports in Teradata.
  • Used different machine learning algorithms such as linear and logistic regression, ANOVA/ANCOVA, decision trees, support vector machines, KNN, random forest, deep learning neural networks, and XGBoost.
  • Analyzed web log data using HiveQL to extract the number of unique visitors per day, page views, visit duration, and the most purchased products on the website (a query sketch follows this list), and managed and reviewed Hadoop log files.
  • Used Erwin for effective model management, sharing, dividing, and reusing model information and designs to improve productivity.
  • Responsible for developing efficient MapReduce programs on the AWS cloud, such as processing claim data to detect and separate fraudulent claims.
  • Designed and developed user interfaces and report customizations using Tableau and OBIEE, and designed cubes for data visualization and mobile/web presentation with parameterization and cascading.
  • Developed and implemented SSIS, SSRS, and SSAS application solutions for various business units across the organization, and created SSIS packages using Pivot transformations, Execute SQL tasks, Data Flow tasks, etc., to import data into the data warehouse.
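
A minimal sketch of the kind of Pandas/NumPy ingestion-and-cleaning script referenced above; the file, column names, and cleaning rules are hypothetical and only illustrate the pattern.

    # Hypothetical ingestion/cleaning sketch (file and column names are placeholders)
    import numpy as np
    import pandas as pd

    df = pd.read_csv("raw_extract.csv")
    df = df.drop_duplicates()
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    df["amount"] = df["amount"].fillna(df["amount"].median())
    df["event_date"] = pd.to_datetime(df["event_date"], errors="coerce")
    # Drop rows more than 3 standard deviations from the mean
    df = df[np.abs(df["amount"] - df["amount"].mean()) <= 3 * df["amount"].std()]
    df.to_csv("clean_extract.csv", index=False)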
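
The Pig control-flow bullet above could be realized with a small Python wrapper like the sketch below; the script names are placeholders and the pig CLI is assumed to be on the PATH.

    # Hypothetical control-flow wrapper around Pig scripts (script names are placeholders)
    import subprocess
    import sys

    PIG_SCRIPTS = ["extract_events.pig", "load_to_hdfs.pig"]

    for script in PIG_SCRIPTS:
        result = subprocess.run(["pig", "-f", script], capture_output=True, text=True)
        if result.returncode != 0:
            print(f"{script} failed:\n{result.stderr}", file=sys.stderr)
            sys.exit(1)  # stop the chain on the first failure
        print(f"{script} completed")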
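
For the web-log analysis bullet, the sketch below issues a HiveQL aggregation from Python; the PyHive client, host, database, table, and column names are all assumptions made for illustration, not details from the actual project.

    # Hypothetical weblog trend query via PyHive (host/table/columns are placeholders)
    import pandas as pd
    from pyhive import hive

    conn = hive.Connection(host="hive-gateway", port=10000, database="weblogs")
    query = """
        SELECT to_date(event_time)        AS day,
               COUNT(DISTINCT visitor_id) AS unique_visitors,
               COUNT(*)                   AS page_views
        FROM page_events
        GROUP BY to_date(event_time)
    """
    daily = pd.read_sql(query, conn)
    print(daily.head())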

ENVIRONMENT: Erwin 9.x, Python, Spark, Scala, Teradata, Oracle 11g, Hadoop, HDFS, Pig, Hive, MapReduce, PL/SQL, UNIX, Informatica PowerCenter, MDM, SQL Server, Netezza, DB2, Tableau, Aginity, SAS/Graph, SAS/SQL, SAS/Connect and SAS/Access, HBase, MongoDB, Kafka, Sqoop, AWS S3, EMR, EC2, Redshift.
