Data Scientist/Data Engineer Resume
Indianapolis, IN
SUMMARY:
- 8+ years of experience in Analysis, Design, Development and Implementation as a Data Engineer.
- Expert in providing ETL solutions for any type of business model.
- Designed and delivered solutions for complex data issues.
- Experience in the design and development of scalable systems using Hadoop technologies in various environments. Extensive experience analyzing data with the Hadoop ecosystem, including HDFS, MapReduce, Hive, and Pig.
- Experience in understanding the security requirements for Hadoop.
- Extensive experience working with Informatica PowerCenter.
- Implemented Integration solutions for cloud platforms with Informatica Cloud.
- Worked with Talend, a Java-based ETL tool.
- Proficient in SQL, PL/SQL and Python coding.
- Experience developing on-premises and real-time processes.
- Excellent understanding of Enterprise Data Warehouse best practices; involved in full life-cycle development of data warehousing.
- Expertise in DBMS concepts.
- Involved in building Data Models and Dimensional Modeling with 3NF, Star and Snowflake schemas for OLAP and Operational data store (ODS) applications.
- Skilled in designing and implementing ETL Architecture for cost effective and efficient environment.
- Optimized and tuned ETL processes & SQL Queries for better performance.
- Performed complex data analysis and provided critical reports to support various departments.
- Worked with Business Intelligence tools like Business Objects and Data Visualization tools like Tableau.
- Extensive Shell/Python scripting experience for Scheduling and Process Automation.
- Good exposure to Development, Testing, Implementation, Documentation and Production support.
- Developed effective working relationships with client teams to understand and support requirements, developed tactical and strategic plans to implement technology solutions, and effectively managed client expectations.
- Solid knowledge and experience in Deep Learning techniques including Feedforward Neural Networks, Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN).
- Statistical methods: Hypothesis Testing, T-Test, Z-Test, Gradient Descent, Newton's Method, ANOVA test, Chi-square test. Libraries: NumPy, Pandas, Matplotlib, Scikit-learn, NLTK, Plotly, Seaborn, Scikit-Image, OpenCV.
- Actively contributed to all phases of the project life cycle, including Data Acquisition (web scraping), Data Cleaning, Data Engineering (dimensionality reduction with PCA & LDA, normalization, weight of evidence, information value), Feature Selection, Feature Scaling & Feature Engineering, Statistical Modeling (decision trees, regression models, neural networks, SVM, clustering), Testing and Validation (ROC plots, k-fold cross-validation), and Data Visualization.
- Implemented Bayes nets, the Viterbi algorithm, and image processing using Gaussian noise.
- Worked with various text analytics and word embedding approaches such as Word2Vec, CountVectorizer, GloVe, and LDA.
- Skilled in Advanced Regression Modeling, Time Series Analysis, Statistical Testing, Correlation, Multivariate Analysis, Forecasting, Model Building, Business Intelligence tools and application of Statistical Concepts.
- Worked on several python packages like NumPy, Pandas, Matplotlib, SciPy, Seaborn and Scikit-learn.
- Experience using cloud services on AWS, Azure, and GCP, including EC2, S3, AWS Lambda, and EMR.
- Experience working with statistical and regression analysis, multi-objective optimization.
- Good knowledge of the performance metrics used to evaluate algorithms.
- Worked with clients to identify analytical needs and documented them for further use.
- Worked on outlier analysis with various methods such as Z-score analysis, linear regression, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), and Isolation Forest (see the sketch after this list).
- Worked on gradient-boosted decision trees with XGBoost to improve performance and accuracy, along with other boosting methods such as AdaBoost.
- Worked and extracted data from various database sources like Oracle, SQL Server, DB2, MongoDB and Teradata.
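A minimal sketch of the outlier-analysis approach mentioned above, combining Z-score filtering with an Isolation Forest. The synthetic data, thresholds, and contamination rate are illustrative assumptions, not values from any client project.

```python
# Illustrative outlier detection: Z-score filtering plus Isolation Forest.
# All data and thresholds here are assumptions for the example.
import numpy as np
from scipy import stats
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = rng.normal(loc=0.0, scale=1.0, size=(1000, 3))
X[:10] += 8.0  # inject a few obvious outliers

# Z-score: flag rows where any feature is more than 3 standard deviations from the mean
z_scores = np.abs(stats.zscore(X))
z_outliers = (z_scores > 3).any(axis=1)

# Isolation Forest: flag roughly 1% of rows as anomalies
iso = IsolationForest(contamination=0.01, random_state=42)
iso_outliers = iso.fit_predict(X) == -1

print(f"Z-score outliers: {z_outliers.sum()}, Isolation Forest outliers: {iso_outliers.sum()}")
```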
SKILLS:
Languages: R, SQL, Python, Shell scripting, Java, Scala, C++.
IDE: R Studio, Jupyter Notebook, PyCharm, Atom.
Databases: Oracle 11g, SQL Server, MS Access, MySQL, MongoDB, Cassandra, PL/SQL, ETL.
Ecosystems: Hadoop, MapReduce, HDFS, HBase, Hive, Pig, Impala, Kafka, Spark MLlib, PySpark, Sqoop.
Systems: Windows XP/7/8/10, Ubuntu, Unix, Linux
Packages: ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, Wordcloud, Kernlab, Neuralnet, twitteR, NLP, Reshape2, rjson, plyr, pandas, NumPy, seaborn, SciPy, matplotlib, scikit-learn, Beautiful Soup, Rpy2, TensorFlow, PyTorch, CNN, RNN, XGBoost
Technologies: HTML, CSS, PHP, JavaScript
Tools: R console, Python (NumPy, pandas, SciKit-learn, SciPy), SPSS.
Visualization: Tableau, SSAS, SSRS, QlikView, Business Objects, Power BI, and Cognos.
Data Warehousing: Informatica PowerCenter 9.x/8.x/7.x, Informatica Cloud, Talend Open Studio
Version Controls: GIT, SVN
Cloud: Google Cloud, Azure, AWS
WORK EXPERIENCE:
Confidential, Indianapolis, IN
Data Scientist/Data Engineer
Responsibilities:
- Analyzed and cleansed raw data using HiveQL.
- Performed data transformations using MapReduce and Hive for different file formats.
- Involved in converting Hive/SQL queries into transformations using Python.
- Performed complex joins on tables in hive with various optimization techniques
- Created Hive tables as per requirements, defining internal or external tables with appropriate static and dynamic partitions for efficiency.
- Worked extensively with Hive DDL and Hive Query Language (HQL).
- Involved in loading data from edge node to HDFS using shell scripting.
- Understand and manage Hadoop Log Files.
- Manage Hadoop infrastructure with Cloudera Manager.
- Created and maintained technical documentation for launching Hadoop cluster and for executing Hive queries.
- Built integrations between applications, primarily Salesforce.
- Extensive work in Informatica Cloud.
- Expertise in Informatica Cloud apps: Data Synchronization, Data Replication, Task Flows, Mapping Configurations, and real-time apps such as Process Designer and Process Developer.
- Worked extensively with flat files, loading them into on-premises applications and retrieving data from applications to files.
- Worked with WSDL and SoapUI for APIs.
- Wrote SOQL queries and created test data in Salesforce for unit testing of Informatica Cloud mappings.
- Prepared TDDs and test case documents after each process was developed.
- Identify and validate data between source and target applications.
- Verify data consistency between systems.
- Responsible for supervising data cleansing, validation, data classification, and data modelling activities.
- Developed algorithms in Python such as K-Means, Random Forest, linear regression, XGBoost, and SVM as part of data analysis (see the sketch after this list).
- Built a streaming pipeline with Confluent on AWS using Python to support CI/CD.
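A minimal sketch of the kind of Python modelling described above, pairing a supervised Random Forest with unsupervised K-Means clustering. The dataset, features, and parameters are placeholders for illustration, not the client's data.

```python
# Illustrative supervised and unsupervised modelling in scikit-learn;
# the dataset and parameters are placeholders, not client data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Supervised: Random Forest classifier
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
print("Random Forest accuracy:", rf.score(X_test, y_test))

# Unsupervised: K-Means clustering on the same features
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)
print("K-Means cluster sizes:", [int((labels == k).sum()) for k in range(3)])
```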
Environment: Python, big data ecosystems, Hadoop, HDFS, Hive, Pig, Cloudera, MapReduce, Informatica Cloud Services, Salesforce, Unix scripts, flat files, XML files, and AWS.
Confidential, Austin, TX
Data Scientist/ Data Engineer
Responsibilities:
- Designed a data workflow model to create a data lake in the Hadoop ecosystem so that reporting tools like Tableau can plug in to generate the necessary reports.
- Created Source to Target Mappings (STM) for the required tables by understanding the business requirements for the reports
- Developed PySpark and Spark SQL code to process the data in Apache Spark on Amazon EMR and perform the necessary transformations based on the STMs developed.
- Created Hive tables on HDFS in Parquet format to store the data processed by Apache Spark on the Cloudera Hadoop cluster.
- Wrote multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file formats, including XML, JSON, CSV, and other compressed formats.
- Loaded log data directly into HDFS using Flume.
- Leveraged AWS S3 as storage layer for HDFS.
- Encoded and decoded JSON objects using PySpark to create and modify DataFrames in Apache Spark (see the sketch after this list).
- Used Bitbucket as the code repository and frequently used Git commands (clone, push, and pull, among others) against the repository.
- Used the Hadoop Resource Manager to monitor the jobs run on the Hadoop cluster.
- Used Confluence to store the design documents and the STMs
- Met with business and engineering teams on a regular basis to keep requirements in sync and deliver on them.
- Used Jira to track the stories worked on under the Agile methodology.
- Involved in creating various regression and classification models using scikit-learn estimators such as Linear Regression, Decision Trees, and Random Forest.
- Involved in creating machine learning models with hyperparameter tuning on test content, useful for making better decisions regarding the products.
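A condensed sketch of the PySpark and Spark SQL flow described above: decode raw JSON from S3, apply STM-style transformations, and persist a partitioned Parquet Hive table. The S3 path, column names, and target table name are hypothetical placeholders.

```python
# Condensed PySpark sketch: JSON in, STM-style transforms, partitioned Parquet Hive table out.
# The S3 path, column names, and table name are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("stm-transform")
         .enableHiveSupport()
         .getOrCreate())

# Decode raw JSON objects from S3 into a DataFrame
raw_df = spark.read.json("s3://example-bucket/landing/events/")

# Apply STM-style transformations (illustrative column logic only)
clean_df = (raw_df
            .withColumn("event_date", F.to_date("event_ts"))
            .withColumn("amount", F.col("amount").cast("double"))
            .dropDuplicates(["event_id"]))

# Persist as a partitioned Parquet Hive table
(clean_df.write
 .mode("overwrite")
 .format("parquet")
 .partitionBy("event_date")
 .saveAsTable("analytics.events_curated"))
```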
Environment: Spark, Hive, Pig, Flume, IntelliJ IDE, AWS CLI, AWS EMR, AWS S3, REST API, shell scripting, Git, PySpark, Spark SQL, Spyder IDE, Tableau.
Confidential
Python Developer/Data Analyst
Responsibilities:
- Developed workflows triggered by events from other systems.
- Developed easy-to-use documentation for the frameworks and tools developed, for adoption by other teams.
- Developed Hive UDFs and Pig UDFs using Python in Microsoft HDInsight environment.
- Implemented end-to-end systems for Data Analytics, Data Automation and customized visualization tools using Python, R, Hadoop and MongoDB.
- Used Pandas, NumPy, seaborn, SciPy, matplotlib, scikit-learn, Keras, TensorFlow, OpenCV, and PyTorch in Python for developing various machine learning algorithms.
- Performed data profiling to merge the data from multiple data sources.
- Worked on different file types (CSV, JSON, Excel) for data cleaning and data analysis.
- Used Python for statistical operations on the data and ggplot2 for visualizing it.
- Worked with several use cases like campaign sales analysis, forecasting sales, KPI analysis.
- Managed offshore projects and coordinated work for 24-hour productivity cycle
- Designed and developed horizontally scalable APIs using Python Flask (see the sketch after this list).
- Experience developing entire frontend and backend modules using Python on the Django and Flask web frameworks.
- Worked on development of SQL and stored procedures on MySQL, and with SQLAlchemy.
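A minimal sketch of a horizontally scalable Flask API as described above. The endpoint paths, payload fields, and port are illustrative assumptions; persistence via SQLAlchemy is only noted in a comment.

```python
# Minimal stateless Flask API sketch; routes and payload fields are assumptions.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/v1/health", methods=["GET"])
def health():
    # Stateless health check so any instance behind a load balancer can answer
    return jsonify(status="ok")

@app.route("/api/v1/scores", methods=["POST"])
def create_score():
    payload = request.get_json(force=True)
    # In a real service this would persist via SQLAlchemy; echoed back here
    return jsonify(received=payload), 201

if __name__ == "__main__":
    # Each instance is stateless, so scaling out is a matter of adding workers
    app.run(host="0.0.0.0", port=5000)
```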
Environment: Python, JavaScript, Django Framework 1.3, Flask, HTML, CSS, SQL, MySQL, LAMP, jQuery, Apache web server, SQLAlchemy.
Confidential
ETL/Informatica Developer
Responsibilities:
- Analyzed requirements from business users.
- Performed data analysis for each requirement and provided source-to-target mapping rule documents.
- Performed data validation/profiling by writing complex SQL queries joining several tables.
- Identified the source-to-target mapping attributes across different source systems.
- Designed data models to support user's business requirements.
- Designed and developed complex aggregate, joiner, and lookup transformation rules (business rules) to generate consolidated (fact/summary) data identified by dimensions using Informatica PowerCenter ETL.
- Used the Slowly Changing Dimensions wizard (Type 2) to update the data in the target dimension tables (see the sketch after this list).
- Created sessions, database connections and batches using Informatica Server Manager/Workflow Manager.
- Optimized mappings, sessions/tasks, source, and target databases as part of the performance tuning.
- Configured the server and email variables using Informatica Server Manager/Workflow Manager.
- Used all types of caches like dynamic, static and persistent caches while creating sessions/tasks.
- Used Metadata Reporter to run reports against the repository.
- Designed the physical structures necessary to support the logical database design.
- Designed processes to extract, transform, and load data to the Data Mart.
- Involved in developing Informatica mappings using PowerCenter Designer and Server Manager/Workflow Manager to create the sessions, and performed extensive testing and data cleansing.
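The Type 2 slowly-changing-dimension pattern applied by the wizard above can be illustrated with a small pandas sketch. The actual work was done in Informatica PowerCenter; the table and column names below are hypothetical.

```python
# Illustrative pandas sketch of Type 2 SCD logic (expire changed rows, append new versions);
# the real implementation used the Informatica SCD wizard, and these columns are hypothetical.
import pandas as pd

dim = pd.DataFrame({
    "customer_id": [1, 2],
    "city": ["Austin", "Dallas"],
    "current_flag": ["Y", "Y"],
    "effective_date": ["2020-01-01", "2020-01-01"],
    "end_date": [None, None],
})
incoming = pd.DataFrame({"customer_id": [1], "city": ["Houston"]})
load_date = "2021-06-15"

# Find dimension rows whose tracked attribute changed
merged = dim.merge(incoming, on="customer_id", how="inner", suffixes=("", "_new"))
changed_ids = merged.loc[merged["city"] != merged["city_new"], "customer_id"]

# Expire the current versions of the changed rows ...
expire_mask = dim["customer_id"].isin(changed_ids) & (dim["current_flag"] == "Y")
dim.loc[expire_mask, ["current_flag", "end_date"]] = ["N", load_date]

# ... and append the new versions as the current rows
new_rows = incoming[incoming["customer_id"].isin(changed_ids)].assign(
    current_flag="Y", effective_date=load_date, end_date=None
)
dim = pd.concat([dim, new_rows], ignore_index=True)
print(dim)
```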
Environment: Informatica PowerCenter 8.x (Repository Manager, Designer, Workflow Monitor, Workflow Manager), SQL Server, Netezza 4.2, SQL, PL/SQL, UNIX.