
Data Scientist/Data Engineer Resume

Indianapolis, IN


  • 8+ years of experience in Analysis, Design, Development and Implementation as a Data Engineer.
  • Expert in providing ETL solutions for any type of business model.
  • Provided and constructed solutions for complex data issues.
  • Experience in the design and development of scalable systems using Hadoop technologies in various environments. Extensive experience in analyzing data using the Hadoop ecosystem, including HDFS, MapReduce, Hive and Pig.
  • Experience in understanding the security requirements for Hadoop.
  • Extensive experience in working with Informatica PowerCenter.
  • Implemented Integration solutions for cloud platforms with Informatica Cloud.
  • Worked with Java based ETL tool, Talend.
  • Proficient in SQL, PL/SQL and Python coding.
  • Experience developing on-premise and real-time processes.
  • Excellent understanding of best practices of Enterprise Data Warehouse and involved in Full life cycle development of Data Warehousing.
  • Expertise in DBMS concepts.
  • Involved in building Data Models and Dimensional Modeling with 3NF, Star and Snowflake schemas for OLAP and Operational data store (ODS) applications.
  • Skilled in designing and implementing ETL Architecture for cost effective and efficient environment.
  • Optimized and tuned ETL processes & SQL Queries for better performance.
  • Performed complex data analysis and provided critical reports to support various departments.
  • Worked with Business Intelligence tools like Business Objects and Data Visualization tools like Tableau.
  • Extensive Shell/Python scripting experience for Scheduling and Process Automation.
  • Good exposure to Development, Testing, Implementation, Documentation and Production support.
  • Develop effective working relationships with client teams to understand and support requirements, develop tactical and strategic plans to implement technology solutions, and effectively manage client expectations.
  • Solid knowledge and experience in Deep Learning techniques including Feedforward Neural Network, Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN).
  • Experience with statistical and optimization methods: Hypothesis Testing, T-Test, Z-Test, ANOVA test, Chi-square test, Gradient Descent and Newton's Method. Libraries: NumPy, Pandas, Matplotlib, Scikit-learn, NLTK, Plotly, Seaborn, Scikit-Image, OpenCV.
  • Actively contributed to all phases of the project life cycle including Data Acquisition (Web Scraping), Data Cleaning, Data Engineering (Dimensionality Reduction (PCA & LDA), normalization, weight of evidence, information value), Feature Selection, Feature Scaling & Feature Engineering, Statistical Modeling (decision trees, regression models, neural networks, SVM, clustering), Testing and Validation (ROC plot, k-fold cross validation) and Data Visualization.
  • Implemented Bayes nets, the Viterbi algorithm, and image processing using Gaussian noise.
  • Worked with various text analytics and word embedding libraries and techniques like Word2Vec, CountVectorizer, GloVe and LDA.
  • Skilled in Advanced Regression Modeling, Time Series Analysis, Statistical Testing, Correlation, Multivariate Analysis, Forecasting, Model Building, Business Intelligence tools and application of Statistical Concepts.
  • Worked on several python packages like NumPy, Pandas, Matplotlib, SciPy, Seaborn and Scikit-learn.
  • Experience in using cloud services on AWS, Azure and GCP, including EC2, S3, AWS Lambda and EMR.
  • Experience working with statistical and regression analysis, multi-objective optimization.
  • Good knowledge of performance metrics for evaluating algorithms.
  • Worked with clients to identify analytical needs and documented them for further use.
  • Performed outlier analysis with various methods like Z-Score value analysis, Linear regression, DBSCAN (Density Based Spatial Clustering of Applications with Noise) and Isolation Forest.
  • Worked on Gradient Boosting decision trees with XGBoost to improve performance and accuracy in solving problems. Also worked with other boosting methodologies like AdaBoost and Gradient Boosting.
  • Worked and extracted data from various database sources like Oracle, SQL Server, DB2, MongoDB and Teradata.


Languages: R, SQL, Python, Shell scripting, Java, Scala, C++.

IDE: RStudio, Jupyter Notebook, PyCharm, Atom.

Databases: Oracle 11g, SQL Server, MS Access, MySQL, MongoDB, Cassandra, PL/SQL, ETL.

Ecosystems: Hadoop, MapReduce, HDFS, HBase, Hive, Pig, Impala, Kafka, Spark MLlib, PySpark, Sqoop.

Systems: Windows XP/7/8/10, Ubuntu, Unix, Linux

Packages: ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, Wordcloud, Kernlab, Neuralnet, twitteR, NLP, Reshape2, rjson, plyr, pandas, NumPy, seaborn, SciPy, matplotlib, scikit-learn, Beautiful Soup, Rpy2, TensorFlow, PyTorch, CNN, RNN, XGBoost

Technologies: HTML, CSS, PHP, JavaScript

Tools: R console, Python (NumPy, pandas, scikit-learn, SciPy), SPSS.

Visualization: Tableau, SSAS, SSRS, QlikView, Business Objects, Power BI, and Cognos.

Data Warehousing: Informatica PowerCenter 9.x/8.x/7.x, Informatica Cloud, Talend Open Studio

Version Control: Git, SVN

Cloud: Google Cloud, Azure, AWS


Confidential, Indianapolis, IN

Data Scientist/Data Engineer


  • Analyzed and cleansed raw data using HiveQL.
  • Performed data transformations using MapReduce and Hive for different file formats.
  • Involved in converting Hive/SQL queries into transformations using Python.
  • Performed complex joins on tables in Hive with various optimization techniques.
  • Created internal and external Hive tables as per requirements, defined with appropriate static and dynamic partitions for efficiency.
  • Worked extensively with Hive DDL and Hive Query Language (HQL).
  • Involved in loading data from edge node to HDFS using shell scripting.
  • Analyzed and managed Hadoop log files.
  • Managed Hadoop infrastructure with Cloudera Manager.
  • Created and maintained technical documentation for launching Hadoop cluster and for executing Hive queries.
  • Built integrations between applications, primarily Salesforce.
  • Worked extensively in Informatica Cloud.
  • Expertise in Informatica Cloud apps: Data Synchronization (DS), Data Replication (DR), Task Flows, Mapping Configurations, and Real Time apps like Process Designer and Process Developer.
  • Worked extensively with flat files, loading them into on-premise applications and retrieving data from applications into files.
  • Worked with WSDL and SoapUI for APIs.
  • Wrote SOQL queries and created test data in Salesforce for unit testing of Informatica Cloud mappings.
  • Prepared TDDs and Test Case documents after each process was developed.
  • Identified and validated data between source and target applications.
  • Verified data consistency between systems.
  • Responsible for supervising data cleansing, validation, data classification and data modelling activities.
  • Developed algorithms in Python such as K-Means, Random Forest, linear regression, XGBoost and SVM as part of data analysis.
  • Built a streaming pipeline with Confluent on AWS using Python to support CI/CD.

Environment: Python, Big Data ecosystems, Hadoop, HDFS, Hive, Pig, Cloudera, MapReduce, Informatica Cloud Services, Salesforce, Unix scripts, Flat Files, XML files, and AWS.

Confidential, Austin, TX

Data Scientist/Data Engineer


  • Designed a data workflow model to create a data lake in the Hadoop ecosystem so that reporting tools like Tableau can plug in to generate the necessary reports.
  • Created Source to Target Mappings (STM) for the required tables by understanding the business requirements for the reports
  • Developed PySpark and Spark SQL code to process the data in Apache Spark on Amazon EMR and perform the necessary transformations based on the STMs developed.
  • Created Hive tables on HDFS to store the data processed by Apache Spark on the Cloudera Hadoop Cluster in Parquet format.
  • Wrote multiple MapReduce programs in Java for data extraction, transformation and aggregation from multiple file formats including XML, JSON, CSV and other compressed file formats.
  • Loaded log data directly into HDFS using Flume.
  • Leveraged AWS S3 as storage layer for HDFS.
  • Encoded and decoded JSON objects using PySpark to create and modify DataFrames in Apache Spark.
  • Used Bitbucket as the code repository and frequently used Git commands such as clone, push and pull against the Git repository.
  • Used the Hadoop Resource Manager to monitor the jobs run on the Hadoop cluster.
  • Used Confluence to store the design documents and the STMs
  • Met with business and engineering teams on a regular basis to keep the requirements in sync and deliver on them.
  • Used Jira as an agile tool to keep track of the stories that were worked on using the Agile methodology
  • Involved in building various regression and classification models using sklearn algorithms such as Linear Regression, Decision Trees, and Random Forest.
  • Involved in creating Machine Learning models for hyper tuning test content which is useful for making better decisions regarding the products.

Environment: Spark, Hive, Pig, Flume, IntelliJ IDE, AWS CLI, AWS EMR, AWS S3, REST API, shell scripting, Git, PySpark, Spark SQL, Spyder IDE, Tableau.


Python Developer/Data Analyst


  • Developed workflows triggered by events from other systems.
  • Developed easy-to-use documentation for the frameworks and tools developed, for adoption by other teams.
  • Developed Hive UDFs and Pig UDFs using Python in Microsoft HDInsight environment.
  • Implemented end-to-end systems for Data Analytics, Data Automation and customized visualization tools using Python, R, Hadoop and MongoDB.
  • Used Pandas, NumPy, seaborn, SciPy, matplotlib, scikit-learn, Keras, TensorFlow, OpenCV and PyTorch in Python for developing various machine learning algorithms.
  • Performed data profiling to merge the data from multiple data sources.
  • Worked with different file types, including CSV, JSON and Excel, for data cleaning and data analysis.
  • Used Python for statistical operations on the data and ggplot2 for visualizing the data.
  • Worked with several use cases like campaign sales analysis, forecasting sales, KPI analysis.
  • Managed offshore projects and coordinated work for a 24-hour productivity cycle.
  • Designed and developed horizontally scalable APIs using Python Flask.
  • Experience in developing entire frontend and backend modules using Python on Django and Flask Web Frameworks.
  • Worked on development of SQL and stored procedures on MySQL, using SQLAlchemy.

Environment: Python, JavaScript, Django Framework 1.3, Flask, HTML, CSS, SQL, MySQL, LAMP, jQuery, Apache web server, SQLAlchemy.


ETL/Informatica Developer


  • Analyzed requirements from business users.
  • Performed data analysis for each requirement and provided source to target mapping rule documents.
  • Performed data validation and profiling by writing complex SQL queries joining several tables.
  • Identified the source to target mapping attributes across different source systems.
  • Designed data models to support users' business requirements.
  • Designed and developed complex aggregate, joiner and lookup transformation rules (business rules) to generate consolidated (fact/summary) data identified by dimensions using Informatica PowerCenter.
  • Used the Slowly Changing Dimensions wizard (type 2) to update the data in the target dimension tables.
  • Created sessions, database connections and batches using Informatica Server Manager/Workflow Manager.
  • Optimized mappings, sessions/tasks, source, and target databases as part of the performance tuning.
  • Configured the server and email variables using Informatica Server Manager/Workflow Manager.
  • Used all types of caches like dynamic, static and persistent caches while creating sessions/tasks.
  • Used Metadata Reporter to run reports against the repository.
  • Designed the physical structures necessary to support the logical database design.
  • Designed processes to extract, transform, and load data to the Data Mart.
  • Involved in Informatica mapping development using PowerCenter Designer and Server Manager/Workflow Manager to create the sessions, and performed extensive testing and data cleansing.

Environment: Informatica PowerCenter 8.x (Repository Manager, Designer, Workflow Monitor, Workflow Manager), SQL Server, Netezza 4.2, SQL, PL/SQL, UNIX.
