Big Data Engineer Resume
Purchase, NY
SUMMARY
- 8+ years of experience in Big Data analytics, Machine Learning, and Data Mining with large datasets of structured and unstructured data, plus Data Validation, Predictive Modelling, and Data Visualization.
- Good knowledge of Data Governance/Data classification and reporting tools.
- Advanced experience with Python and its libraries such as NumPy, Pandas, Scikit-learn, Keras, Matplotlib, TensorFlow, SciPy.
- Experience in Natural Language Processing (NLP), Forecasting using RNNs and LSTM, developing different Statistical Machine Learning solutions to various business problems.
- Predictive Modelling Algorithms: Logistic Regression, Linear Regression, Decision Trees, Bootstrap Aggregation (Bagging), Naive Bayes Classifier, Random Forests, Support Vector Machines.
- Good knowledge of NoSQL databases such as Apache Cassandra and of the Hadoop ecosystem: Hive, PySpark, Sqoop, Airflow, Kafka.
- Excellent Knowledge of RDBMS, Data Warehouse/OLAP concepts, and methodologies.
- Strong experience in model validation and model tuning, including model selection, K-fold cross-validation, hold-out schemes, and hyperparameter tuning by grid search.
- Experience in designing Data marts, Star Schema, Snowflake Schema for Data Warehouse concepts like ODS, MDM architecture.
- Developed predictive data models using Decision Tree, Random Forest, Naïve Bayes, Logistic Regression, Social Network Analysis, Cluster Analysis, Neural Networks and Reinforcement Learning.
- Proficient in writing complex SQL queries, stored procedures, normalization, database design, creating indexes, functions, triggers, and sub-queries.
- Experience in troubleshooting test scripts, SQL queries, ETL jobs, data warehouse/data mart/data store models.
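The model-tuning bullet above (K-fold cross-validation with hyperparameter tuning by grid search) can be sketched as follows; the dataset, model, and parameter grid here are purely illustrative, not drawn from any project described in this resume:

```python
# Minimal sketch: 5-fold cross-validation + grid search with scikit-learn.
# Synthetic data and a small C grid stand in for real project choices.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Exhaustive search over the grid, scored by 5-fold CV on the training data
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)  # hyperparameters chosen by cross-validation
```

The hold-out scheme mentioned above would simply reserve a test split before the search and score `search.best_estimator_` on it.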
TECHNICAL SKILLS
Python Libraries: NumPy, SciPy, Pandas, Matplotlib, Plotly, Scikit-Learn, Keras, TensorFlow
Big Data Tools: Hadoop ecosystem, Apache Spark, Oozie, Hive, HDFS, Sqoop
Statistics: Hypothesis Testing, ANOVA, Confidence Intervals, Bayes' Theorem, Principal Component Analysis (PCA), Cross-Validation, Correlation.
Cloud Computing Tools: Azure Databricks
Databases: Microsoft SQL Server 2008, MySQL, Oracle, DB2
NoSQL Databases: Apache Cassandra
Database Tools: SQL Server Data Tools, SQL Server Management Studio, Query Analyzer, Enterprise Manager.
Data Modelling Tools: Erwin, Rational Rose, ER/Studio, MS Visio, Oracle Designer.
PROFESSIONAL EXPERIENCE
Big Data Engineer
Confidential, Purchase, NY
Responsibilities:
- Developed a solution for optimizing the Data Validation process for the Mastercard Intelligence team by designing a software application in Python and creating process workflows in Master Data Management.
- Architected and implemented a cost-effective, scalable Big Data solution using Apache Spark, Hive, ZooKeeper, and Sqoop, and migrated the existing Oracle-based data warehouse to the Spark-based solution.
- Set up Spark on a multi-node cluster and configured the nodes using a configuration manager.
- Implemented Spark scripts in Scala for batch processing to handle about 850 million transaction records per day from various data sources and persisted data in Delta Lake and HDFS in parquet columnar format.
- Scheduled jobs using Apache Oozie and Apache Hue.
- Developed pipelines to transfer structured and unstructured data into the HDFS Data Lake.
- Created complex data models from Mastercard transactional data in the Analyticserver workbench to help clients derive insights.
- Generated reports with insights on portfolio performance across spend, fraud, and more, compared against custom benchmarks.
Senior BI Developer
Confidential, Boston, MA
Responsibilities:
- Developed a Neural Network-based predictive model and risk analytics using Azure Machine Learning Studio, utilizing large amounts of structured and unstructured data such as industry sentiment, stock movements, and correlations in economic factors.
- Reduced time required for stock screening by 75%.
- Used Pandas, scikit-learn, and NLTK in Python for natural language processing of news articles, and built time-series econometrics models.
- Implemented Spark and Hive batch jobs to process massive volumes of data from various sources (Bloomberg, government publications, unstructured news articles, etc.) and persisted the data in HDFS.
- Configured a CI/CD pipeline using Docker Swarm.
- Developed a Tableau dashboard for analysing business-cycle economic indicators for major economies to identify macroeconomic investment opportunities and reduce systemic risk.
- Developed PySpark and Hive scripts in Databricks to filter, map, and aggregate data.
- Performed natural language processing (NLP) using NLTK to identify overall market trends, flagging stocks affected by poor sentiment or exiting companies becoming inflated by sentiment, with 80% accuracy.
- Stored news articles in MongoDB (NoSQL).
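The sentiment-flagging idea above can be illustrated with a minimal lexicon-based scorer. The actual pipeline used NLTK; the word lists, threshold, and function names below are purely illustrative stand-ins:

```python
# Hedged sketch: lexicon-based sentiment scoring of news articles, used to
# flag tickers with poor average sentiment. Illustrative lexicon only.
POSITIVE = {"growth", "beat", "strong", "upgrade", "record"}
NEGATIVE = {"fraud", "miss", "weak", "downgrade", "lawsuit"}

def sentiment_score(article: str) -> float:
    """Return a score in [-1, 1]; negative values indicate poor sentiment."""
    words = article.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

def flag_stocks(articles_by_ticker: dict, threshold: float = -0.2) -> list:
    """Flag tickers whose average article sentiment falls below threshold."""
    flagged = []
    for ticker, articles in articles_by_ticker.items():
        avg = sum(sentiment_score(a) for a in articles) / len(articles)
        if avg < threshold:
            flagged.append(ticker)
    return flagged
```

A production version would replace the hand-built lexicon with NLTK tooling (e.g. tokenization plus a trained sentiment model) and read articles from the MongoDB store mentioned above.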
Senior Data Analyst
Confidential, Seattle, WA
Responsibilities:
- Worked with business users, business analysts, program managers, project managers, and system analysts to review business requirements for the Salesforce Marketing Cloud team.
- Collaborated with the manager and insurance agents to create and execute a marketing strategy using Salesforce Marketing Cloud, focused on acquiring and retaining customers.
- Developed a range of machine learning solutions, including customer segmentation, Support Vector Machines, and XGBoost, to improve insurance agents' engagement with customers, and built a marketing mix model to effectively reach prospective customers, improving ROI by 20%.
- Trained different classification models, such as Random Forest and Support Vector Machines, to classify marketing messages using natural language processing.
- Performed data cleaning, pre-processing, imputation, transformation, scaling, feature engineering, data aggregation, data-frame merges, descriptive statistics, data visualization, and score-assessment mapping, with reporting on Tableau dashboards.
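The message-classification bullet above (SVM models over NLP features) can be sketched with a TF-IDF plus linear-SVM pipeline in scikit-learn; the messages and labels below are made up for illustration:

```python
# Hedged sketch: classifying marketing messages with TF-IDF + linear SVM.
# The tiny labeled corpus here is illustrative, not real marketing data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

messages = [
    "exclusive offer on auto insurance renew today",
    "your policy renewal discount expires soon",
    "claim status update for your recent filing",
    "we received your claim documents",
]
labels = ["promo", "promo", "service", "service"]

# TF-IDF turns each message into a sparse term-weight vector;
# LinearSVC learns a linear decision boundary over those vectors.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(messages, labels)

print(clf.predict(["renewal discount offer"]))
```

Swapping `LinearSVC` for `RandomForestClassifier` gives the Random Forest variant mentioned in the same bullet without changing the rest of the pipeline.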
Data Analyst/Data Engineer
Confidential
Responsibilities:
- In the pre-processing phase, used the Python Pandas and NumPy libraries to handle missing data, cast datatypes, and merge or group tables for the EDA process.
- Helped the analytics team select an appropriate combination of features for generating customer profiles, segmenting customers in Python by engagement level, geographic region, age, socioeconomic status, and other dimensions, and identifying customers with a higher chance of defaulting on credit cards.
- Created predictive analytics models to find the optimal strategy for credit card customer rewards and loyalty programs for each customer segment, focusing on increasing financial transactions.
- In the data exploration stage, used correlation analysis and graphical techniques in Matplotlib to gain insights into customer transactions and monthly payments.
- Performed complex pattern recognition on financial time-series data and forecast returns using ARMA and ARIMA models and exponential smoothing for multivariate time series.
- Led the implementation of a Python automation process for identifying and verifying anomalies in set-top box remote log files; eliminated approximately 150 hours of manual labor quarterly.
- Conceptualized and led the development of a framework to analyze large data sets to provide Business intelligence (BI) to IT leadership enabling better decisions on issues such as Hardware asset management.
- Created and populated analytic databases, built ETL data pipelines, tuned databases and queries for fast analysis, and transformed unstructured raw data into standardized formats.
- Produced reports for analysis using system reporting tool (SSRS) and created specifications for reports based on business needs and identified critical features in the data.
- Also responsible for database migration from SQL 2000 to SQL 2008, SQL development and maintenance, SSIS package development, and report automation.
- Developed scripts in Python for raw log-file processing to monitor instantaneous server traffic, and created a Django-based web platform to monitor server health, event tables, and visual process flows.
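The exponential-smoothing forecasting mentioned above can be sketched in pure NumPy. In practice a statsmodels-style implementation (alongside the ARMA/ARIMA models) would be used; the smoothing factor below is a hypothetical choice:

```python
# Minimal simple-exponential-smoothing sketch:
#   s_t = alpha * x_t + (1 - alpha) * s_{t-1}
# A pure-NumPy stand-in for the library models used in practice.
import numpy as np

def exp_smooth(series, alpha=0.3):
    """Return the smoothed level at each step of the input series."""
    x = np.asarray(series, dtype=float)
    s = np.empty(len(x))
    s[0] = x[0]  # initialize the level at the first observation
    for t in range(1, len(x)):
        s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    return s

def forecast_next(series, alpha=0.3):
    """One-step-ahead forecast: the last smoothed level."""
    return exp_smooth(series, alpha)[-1]
```

A constant series is left unchanged by smoothing, while a sudden jump is only partially absorbed (weighted by `alpha`), which is what damps noise in return forecasts.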