Big Data Engineer/Data Scientist Resume

Nashville, TN

SUMMARY

  • Around 9 years of experience working with data and 3+ years of experience developing big data machine learning models to drive business planning. Expertise in the big data ecosystem for building efficient and scalable analytical solutions, spanning the Spark and Hadoop ecosystems: HDFS, Pig, Hive, Sqoop, Oozie, HBase, Flume, ZooKeeper, Cassandra, and others.
  • Proficient in identifying, exploring, and cleaning the data required for modeling: ANOVA, t-tests, RFM analysis, clustering, classification, decision trees, neural networks, regression analysis, random forests, and others. Adept in predictive modeling, including variable selection, data imputation, collinearity diagnostics, factor analysis, and variable interaction analysis.
  • Experience architecting end-to-end ETL flows in big data environments and building real-time data pipelines and visualizations using Apache Spark, Spark Streaming, Kafka, and Cassandra.
  • Experience building data lakes in Microsoft Azure using Data Factory and other ingestion tools. Hands-on experience with SequenceFile, Avro, Parquet, and JSON formats, and with combiners, counters, dynamic partitions, and bucketing for best practices and performance improvement. Experience with data analysis, cleansing, validation, verification, conversion, migration, and mining. Proficient in building data acquisition systems and modeling data warehouses using ETL tools such as IBM DataStage, IBM QualityStage, Microsoft SSIS, and Microsoft SSRS.
  • Experience in distributed processing using Spark on AWS, Microsoft Azure, and Google Cloud. Good knowledge of AWS infrastructure services such as Amazon S3, EMR, and EC2. Experience using Azure Machine Learning Studio and Microsoft Data Science VMs.
  • Deep understanding of Statistical Modeling, Multivariate Analysis, and Standard Procedures. Familiar with model testing, problem analysis, model comparison, and solution validation.
  • Experience mining time-series data and building models to predict stock market prices and assess risk. Experience working on NLP models using NLTK and word embeddings for sentiment analysis and classification of documents and agreements (a brief illustrative sketch follows this list).
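The following is a minimal, illustrative sketch of the kind of NLP classification work described in the last bullet. The labeled documents are hypothetical placeholders, preprocessing uses NLTK stop words, and a TF-IDF plus Logistic Regression pipeline stands in for the word-embedding models used in practice.

```python
# Minimal sketch of document sentiment classification. The labeled documents are
# hypothetical placeholders; NLTK supplies the stop-word list, and a TF-IDF +
# Logistic Regression pipeline stands in for the word-embedding models used in
# the actual work.
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def preprocess(text):
    """Lowercase the text and drop NLTK English stop words."""
    return " ".join(t for t in text.lower().split()
                    if t.isalpha() and t not in STOP_WORDS)

# Hypothetical labeled documents: 1 = positive sentiment, 0 = negative sentiment.
docs = [
    "The agreement terms are favorable and clearly written",
    "The service was poor and the contract language is confusing",
]
labels = [1, 0]

model = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=preprocess)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(docs, labels)
print(model.predict(["The terms look clear and favorable"]))
```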

TECHNICAL SKILLS

Big Data Ecosystem: NiFi, Kafka, Spark, Hive, Pig, Sqoop, Flume, Oozie, Hadoop, and ZooKeeper

Data Analysis Tools: Spark SQL, NumPy, SciPy, scikit-learn, pandas, R, and RStudio

ETL Tools: IBM Information Server Suite (versions 11.3, 9.1, 8.7), Informatica PowerCenter, and Pentaho 5.1

Machine Learning Tools: Spark ML, Spark MLlib, H2O, TensorFlow, Keras, and Theano

Cloud Computing: Amazon AWS, Microsoft Azure

Statistical Methods: Clustering, Classification, Regression, Random Forests, Neural Networks, SVM, Deep Learning, Hypothesis Testing, Principal Component Analysis, and Dimensionality Reduction.

Visualizations: Matplotlib, D3.js, ggplot, and Power BI

Reports/Dashboards: Tableau and Cognos

NoSQL Databases: Cassandra, HBase and MongoDB

Programming Languages: C, C++, Java, C#, Scala, Python, R, and PL/SQL

Relational Databases: Oracle, DB2, SQL Server and SQLite

Operating Systems: Unix, Linux, Windows and MacOS

PROFESSIONAL EXPERIENCE

Big Data Engineer/Data Scientist

Confidential, Nashville, TN

Responsibilities:

  • Built real-time data pipelines using Apache NiFi, Kafka, and Spark (an illustrative sketch follows this list).
  • Used Spark Streaming for real-time data and Spark Core for data-quality transformations.
  • Utilized Spark ML/MLlib for machine learning modeling.
  • Participated in all phases of data mining, data collection, data cleaning, model development, validation, and visualization.
  • Developed real-time visualizations and dashboards using D3.js and Node.js.
  • Analyzed complex datasets, designed machine learning/statistical models, and formulated metrics and dashboards to provide recommendations to management on advanced analytics projects.
  • Used sales training data in a supervised classification algorithm to predict customer churn.
  • Performance-tuned Hive tables using techniques such as clustering, skewing, partitioning, and SQL rewrites.
  • Worked with Hive file formats such as ORC and Parquet, boosted performance using the Tez engine, and tuned configuration parameters.
  • Supported the sales forecasting and planning team by improving time-series and principal component analyses.
  • Utilized machine learning techniques for predictions and forecasting based on the sales training data.
  • Executed overall data aggregation/alignment and process-improvement reporting within the sales department.
  • Used Hive to perform transformations, joins, filtering and aggregations before loading the data onto HDFS.
  • Responsible for building scalable distributed data solutions using Spark.
  • Developed job-processing scripts using Oozie workflows.
  • Configured, deployed and maintained multi-node Dev and Test Kafka clusters.
  • Managed data quality and integrity using skills in data warehousing, databases, and ETL.
  • Monitored and maintained high levels of data analytic quality, accuracy, and process consistency.
  • Assisted sales management in data modeling.
  • Ensured on-time execution and implementation of sales planning analysis and reporting objectives.
  • Worked with sales management team to refine predictive methods & sales planning analytical process.
  • Executed and monitored the accuracy and efficiency for sales forecasts & reporting.
  • Supported consistent implementation of company reporting and sales process initiatives.
  • Applied machine learning to identify fraud cases in the data and to support real-time fraud prevention.
  • Analyzed and interpreted large amounts of data using Hadoop, Hive, and Sqoop, and articulated business insights from the analysis.
  • Built source-to-reference data matching (fuzzy matching) algorithms using TF-IDF, Solr/Lucene, and Elasticsearch.
  • Worked with Engineering, BI, Data Science, Legal, and Business teams to ensure timely delivery of high-quality features.
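Below is a minimal sketch of the kind of real-time pipeline described in the first bullet, assuming Spark Structured Streaming with the Kafka connector on the classpath; the broker address, topic name, event schema, and HDFS paths are hypothetical.

```python
# Sketch: consume a Kafka topic with Spark Structured Streaming, apply basic
# data-quality rules, and land the result as Parquet on HDFS.
# Broker address, topic, schema, and output paths are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

event_schema = StructType([
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
       .option("subscribe", "sales_events")                # hypothetical topic
       .load())

events = (raw.select(F.from_json(F.col("value").cast("string"),
                                 event_schema).alias("e"))
          .select("e.*")
          # Basic data-quality transformation: drop malformed or negative records.
          .filter(F.col("customer_id").isNotNull() & (F.col("amount") >= 0)))

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/sales_events")           # hypothetical path
         .option("checkpointLocation", "hdfs:///chk/sales_events")
         .outputMode("append")
         .start())
query.awaitTermination()
```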

Environment: Python, R, Kafka, ZooKeeper, Spark, Spark Streaming, Spark SQL, Elasticsearch, Lucene, Solr, Cognos, Tableau, Hive, Pig, Oozie, Sqoop, SQL Server, Oracle 10g & 11g, MS Office, Netezza, Teradata, ER Studio, TOAD, PL/SQL, and DB2.

Big Data Engineer/Data Scientist

Confidential, Chicago, IL

Responsibilities:

  • Responsible for predictive analysis of credit scoring to predict whether credit extended to a new or existing applicant is likely to result in a profit or a loss.
  • Consulted with project stakeholders & business leaders to identify data problems, risks and opportunities.
  • Reviewed and determined risk profiles of data based on metadata and underlying data elements.
  • Determined data and system classification based on policy.
  • Implemented a clustering algorithm on the customer database to improve personalized marketing and ran a classification algorithm to determine whether a transaction was fraudulent.
  • Developed a predictive customer churn model to identify customers likely to discontinue the bank's products, using a Random Forest after first evaluating Neural Network and Logistic Regression models (an illustrative sketch follows this list).
  • Configured Spark Streaming to consume ongoing data from Kafka and stored the streamed data in HDFS.
  • Used various Spark Transformations and Actions for cleansing the input data.
  • Developed shell scripts to generate Hive CREATE statements from the data and load the data into the tables.
  • Wrote MapReduce jobs using Pig Latin.
  • Optimized HiveQL/Pig scripts using execution engines such as Spark.
  • Developed a Linear Regression model in Spark to predict a continuous measurement and improve insight into the data.
  • Worked extensively with Spark and MLlib to develop a regression model for logistic information.
  • Extracted a real-time feed using Kafka and Spark Streaming, converted it to an RDD, processed the data as DataFrames, and saved it in Parquet format in HDFS.
  • Used Spark and Spark SQL to read the Parquet data and create tables in Hive.
  • Monitored Spark jobs through the Spark application master and captured their logs.
  • Enabled predictive analytics to complement the Yield Management and Recommendation systems.
  • Participated in all phases of data mining, data collection, data cleaning, model development, validation, and visualization.
  • Applied Principal Component Analysis and developed an automated data-imputation process for handling missing values.
  • Computed Credit Risk parameters such as Probability of Default, Loss Given Default, and Exposure at Default.
  • Used Logistic Regression, Clustering, and Multivariate modeling to provide valuable analytical insights.
  • Used k-fold cross-validation to avoid overfitting.
  • Used the Kolmogorov-Smirnov test to measure the quality of the models.
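A condensed sketch of the churn-modeling approach described above: Spark ML's RandomForestClassifier tuned with k-fold cross-validation. The Hive table name, feature columns, and label column are hypothetical.

```python
# Sketch: churn classification with Spark ML, using a Random Forest with k-fold
# cross-validation. The table name, feature columns, and label are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = (SparkSession.builder.appName("churn-model")
         .enableHiveSupport().getOrCreate())
df = spark.table("analytics.customer_features")  # hypothetical Hive table

features = ["tenure_months", "num_products", "avg_balance", "num_complaints"]
assembler = VectorAssembler(inputCols=features, outputCol="features")
rf = RandomForestClassifier(labelCol="churned", featuresCol="features")

grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [50, 100])
        .addGrid(rf.maxDepth, [5, 10])
        .build())
evaluator = BinaryClassificationEvaluator(labelCol="churned",
                                          metricName="areaUnderROC")

# 5-fold cross-validation to guard against overfitting.
cv = CrossValidator(estimator=Pipeline(stages=[assembler, rf]),
                    estimatorParamMaps=grid,
                    evaluator=evaluator,
                    numFolds=5)

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = cv.fit(train)
print("Test AUC:", evaluator.evaluate(model.transform(test)))
```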

Environment: Python, R, Kafka, ZooKeeper, Spark, Spark Streaming, Spark SQL, Cognos, Tableau, Hive, Pig, Oozie, Sqoop, SQL Server, Oracle 10g & 11g, MS Office, Netezza, Teradata, ER Studio, TOAD, PL/SQL, and DB2.

Big Data Engineer/Data Analyst

Confidential, Atlanta, GA

Responsibilities:

  • Executed appropriate Statistical tests to evaluate raw data and interpret results into long-term and local economic impacts.
  • Captured customer-centric data and analyzed with BI solution including purchase types, personal information, customer feedback, customer sentiment, and frequency of customer spend.
  • Analyzed and reported on data sets such as sales, point of sale data, supply chain logistics, merchandizing, demographics, promotional marketing, and customer experience.
  • Utilized Excel VLOOKUP and Data Validation to extract data across tables and create query functionality.
  • Installed Hadoop, MapReduce, and HDFS, and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
  • Analyzed Hadoop clusters using analytical tools such as Hive and Pig and databases such as HBase and AWS RDS.
  • Worked on analyzing/transforming the data with Hive and Pig.
  • Developed Spark scripts by using Scala shell commands as per the requirement.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Developed Spark scripts and UDFs using both DataFrames/SQL and RDDs/MapReduce in Spark 1.5 for data aggregation and queries, writing data back into the OLTP system directly or through Sqoop (an illustrative sketch follows this list).
  • Developed Spark scripts and Spark SQL/Streaming jobs for faster testing and processing of data.
  • Imported data from sources such as HDFS, HBase, and Cassandra (NoSQL) into Spark RDDs using Spark Streaming.
  • Developed a data pipeline using Kafka and Spark Streaming to store data in HDFS.
  • Performed real time analysis on the incoming data.
  • Performed transformations, cleaning and filtering on imported data using Hive and Map Reduce, and loaded final data into HDFS.
  • Loaded the data into Spark RDD and performed in memory data computation to generate the output response.
  • Involved in importing real-time data into Hadoop using Kafka, and implemented Oozie jobs for daily imports.
  • Involved in loading data from the Linux file system into HDFS.
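An illustrative sketch of the DataFrame/UDF work described above: aggregate Hive data with a simple Python UDF and write the result back to the OLTP system over JDBC. Table names, the UDF logic, and the JDBC connection are hypothetical, and the modern SparkSession API is used for brevity even though the original jobs ran on Spark 1.5.

```python
# Sketch: DataFrame aggregation with a simple UDF over Hive data, then writing
# the result back to an OLTP system over JDBC (Sqoop export is the alternative
# path). Table names, UDF logic, and the JDBC URL are hypothetical; assumes the
# Oracle JDBC driver is on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = (SparkSession.builder.appName("pos-aggregation")
         .enableHiveSupport().getOrCreate())

# Hypothetical UDF: bucket transaction amounts into coarse spend bands.
@F.udf(returnType=StringType())
def spend_band(amount):
    if amount is None:
        return "unknown"
    return "high" if amount >= 100 else "low"

sales = spark.table("retail.pos_transactions")  # hypothetical Hive table
summary = (sales
           .withColumn("band", spend_band(F.col("amount")))
           .groupBy("store_id", "band")
           .agg(F.count("*").alias("txn_count"),
                F.sum("amount").alias("total_amount")))

# Write the aggregate back to the OLTP database.
(summary.write
 .format("jdbc")
 .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")  # hypothetical connection
 .option("dbtable", "SALES_SUMMARY")
 .option("user", "etl_user").option("password", "****")
 .mode("append")
 .save())
```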

Environment: Cloudera CDH5, Spark, Spark Streaming, Spark SQL, Hive, Pig, Oozie, Sqoop, SQL Server, Oracle 10g & 11g, MS Office, Netezza, Teradata, ER Studio, TOAD, PL/SQL, and DB2.

ETL Developer/Data Analyst

Confidential, Basking Ridge, NJ

Responsibilities:

  • Worked on a data acquisition project that extracted data from different source systems, processed the data, generated files, and transferred those files to target systems.
  • Involved in defining system requirements, designing, prototyping, developing, testing, training, and implementation of the applications.
  • Defined and documented the technical architecture of the data warehouse, including the physical components and their functionality.
  • Designed Star Schema with dimensional modeling of the data warehouse/ OLAP applications by applying required facts and dimensions.
  • Created mappings for the source systems to target systems, after analyzing the data.
  • Worked on data mapping, data cleansing, program development for loads, and verification of converted data against legacy data.
  • Worked on several change requests created in response to production incidents and requirement changes to code in the production environment.
  • Responsible for using different types of stages such as ODBC Connector, Oracle Connector, DB2 Connector, Teradata Connector, Transformer, Join, and Sequential File to develop different jobs.
  • Developed DataStage Parallel and Sequence jobs.
  • Developed common jobs, shared containers, and server routines that were used across the project in most of the interfaces.
  • Created UNIX shell scripts and Autosys jobs for end-to-end automation; the shell scripts trigger DataStage jobs, transfer the output files, and perform basic file validations (an illustrative sketch follows this list).
  • Extensively used SQL tuning techniques to improve the performance of DataStage jobs.
  • Tuned DataStage transformations and jobs to enhance performance.
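The trigger-and-validate automation described above is sketched below in Python for consistency with the other examples; the production automation used UNIX shell scripts scheduled through Autosys. It assumes the standard IBM DataStage `dsjob` command-line client is on the PATH, and the project, job, and file names are hypothetical.

```python
# Sketch of the trigger-and-validate automation, in Python rather than the
# original UNIX shell. Assumes the IBM DataStage `dsjob` CLI is on PATH;
# project, job, and file names are hypothetical.
import os
import subprocess
import sys

PROJECT = "DW_PROJECT"       # hypothetical DataStage project
JOB = "seq_load_customers"   # hypothetical sequence job
OUTPUT_FILE = "/data/out/customers.dat"

def run_datastage_job(project, job):
    """Run a DataStage job in wait mode and return the dsjob exit code."""
    result = subprocess.run(
        ["dsjob", "-run", "-wait", "-jobstatus", project, job],
        capture_output=True, text=True)
    print(result.stdout)
    return result.returncode

def validate_output(path):
    """Basic file validation: the output file exists and is non-empty."""
    return os.path.isfile(path) and os.path.getsize(path) > 0

if __name__ == "__main__":
    rc = run_datastage_job(PROJECT, JOB)
    if rc != 0 or not validate_output(OUTPUT_FILE):
        sys.exit("DataStage job failed or produced an empty file")
    print("Job finished and output validated; ready for file transfer.")
```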

Environment: SQL Server, Oracle 10g & 11g, MS Office, Netezza, Teradata, ER Studio, TOAD, PL/SQL, IBM DataStage 9.1 (Director, Designer, Administrator), IBM DB2, and Mainframes.

ETL Developer/Data Analyst

Confidential

Responsibilities:

  • Designed ETL jobs incorporating complex transformation methodologies in DataStage, resulting in efficient interfaces between source and target systems.
  • Performed data profiling on the source systems required for the data marts (an illustrative sketch follows this list).
  • Manipulated, cleansed and processed data using Excel, Access and SQL.
  • Responsible for loading, extracting and validation of client data.
  • Analyzed raw data, drawing conclusions and developing recommendations.
  • Wrote SQL scripts to manipulate data for data loads and extracts.
  • Developed data analytical databases from complex data sources. Performed daily system checks, data entry, data auditing, creation of data reports, and monitoring of all data for accuracy.
  • Monitored the automated loading processes. Involved in defining the source to target data mappings, business rules, and business and data definitions.
  • Responsible for defining the key identifiers for each mapping/interface.
  • Developed ETL jobs to load data from Oracle and DB2 databases, XML, flat files, and CSV files into high-volume target databases on mainframes.
  • Worked with stages like Complex Flat File, Transformer, Aggregator, Sort, Join, Lookup, and Data Masking pack.
  • Participated in requirements gathering and created source to target mappings for development.
  • Extensively designed, developed and implemented Parallel Extender jobs using Parallel Processing (pipeline and partition) techniques to improve job performance while working with bulk data sources.
  • Created and used DataStage shared containers and local containers for DataStage jobs.
  • Worked extensively on job sequences to control job-flow execution using various triggers (conditional and unconditional) and activities such as Job Activity, Email Notification, Sequencer, Routine Activity, and Exec Command Activity.
  • Tuned the jobs for optimum performance.
  • Used DataStage Director to validate, run, and monitor DataStage jobs.
  • Generated and interpreted mapping documentation and translated it into detailed design specifications and ETL code.
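A small sketch of the data-profiling step called out above, done here with pandas for illustration (the original profiling used SQL, Excel, and Access against the source systems); the extract file and the 5% null-rate threshold are hypothetical.

```python
# Sketch of source-system data profiling with pandas: row counts, null counts,
# distinct counts, and value ranges per column. The extract file name and the
# null-rate threshold are hypothetical.
import pandas as pd

df = pd.read_csv("source_extract.csv")  # hypothetical source-system extract

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_count": df.isna().sum(),
    "null_pct": (df.isna().mean() * 100).round(2),
    "distinct_count": df.nunique(),
    "min": df.min(numeric_only=True),
    "max": df.max(numeric_only=True),
})

print(f"Rows: {len(df)}")
print(profile)

# Columns whose null rate exceeds a tolerance get flagged for the mapping review.
flagged = profile[profile["null_pct"] > 5.0].index.tolist()
print("Columns needing follow-up:", flagged)
```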

Environment: SQL Server, Oracle 10g & 11g, MS Office, Netezza, Teradata, ER Studio, TOAD, PL/SQL, IBM DataStage 8.7 (Director, Designer, Administrator), IBM DB2, and Mainframes.

Software Engineer

Confidential

Responsibilities:

  • Designed and implemented ETL processes that scan and read data from more than 20 relational database tables into staging tables, applying the required transformations.
  • Wrote stored procedures, functions, and packages in various schemas per business requirements and was involved in query tuning, performance optimization, and code standardization.
  • Designed and developed several views, including materialized views, for report data retrieval.
  • Implemented nightly load processes to read changed benefit and geographic data sent from the mainframe as flat files, covering transformations, scanning tables for missing fields, error reporting, and more.
  • Designed, built, and maintained data abstraction/virtualization layer to facilitate data blending across disparate data sources using Cisco Composite System.
  • Worked on procedures that load data into Star Schema from staging tables.
  • Involved in performance fine-tuning of queries and reports using Explain Plan, TKPROF, PL/SQL, and SQL*Plus.
  • Wrote scripts to create tables, indexes, grants, and synonyms in different schemas, and modified existing tables per business logic.
  • Implemented utilities that use DBMS scheduling to scan relational database tables for updates at specified intervals (an illustrative sketch follows this list).
  • Used database triggers to maintain a history of inserts, updates, and deletes for auditing.
  • Extensively used Oracle packages, stored procedures, sequences, REF CURSORs, dynamic queries, UTL_FILE, bulk inserts, exception handling, DB links, and more.
  • Ported existing reports based on the relational database to the star schema, which involved defining queries over fact and dimension tables.
  • Gathered requirements, built data models, and provided data sources for BI Reporting teams.
  • Supported the ETL team in critical data movements and created source-to-target mapping documents.
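The interval-scan utilities mentioned above were implemented in PL/SQL with DBMS scheduling; purely as an illustration, the same polling idea is sketched below in Python with cx_Oracle. The connection string, table, and column names are hypothetical.

```python
# Illustrative sketch of the "scan tables for updates at specified intervals"
# utility, rewritten in Python with cx_Oracle; the production version was PL/SQL
# driven by DBMS scheduling. Connection details, table, and column names are
# hypothetical.
import time
import cx_Oracle

POLL_SECONDS = 300  # scan every five minutes

def scan_for_updates(conn, last_checked):
    """Return rows from the staging table modified since the previous scan."""
    cur = conn.cursor()
    cur.execute(
        "SELECT benefit_id, last_update_ts "
        "FROM stg_benefits WHERE last_update_ts > :since",
        since=last_checked)
    rows = cur.fetchall()
    cur.close()
    return rows

def main():
    conn = cx_Oracle.connect("etl_user", "****", "dbhost/ORCL")  # hypothetical DSN
    cur = conn.cursor()
    cur.execute("SELECT SYSDATE FROM dual")
    last_checked = cur.fetchone()[0]
    cur.close()
    while True:
        time.sleep(POLL_SECONDS)
        rows = scan_for_updates(conn, last_checked)
        if rows:
            print(f"{len(rows)} changed rows found; triggering the downstream load")
            last_checked = max(row[1] for row in rows)

if __name__ == "__main__":
    main()
```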

Environment: TOAD, PL/SQL Developer, Oracle 10g and 9i, Windows XP, PL/SQL, and shell scripts.
