Big Data Engineer Resume

Needham, MA

SUMMARY

  • Experience in delivering end-to-end advanced analytics projects using statistical and machine learning techniques.
  • Experience processing massive amounts of structured and unstructured data using Spark/SQL/Hive.
  • Proficient in implementing various statistical models and data mining algorithms (A/B testing, time-series, clustering, logistic regression, decision trees, and neural networks) using NumPy, pandas, scikit-learn, Keras, TensorFlow, and Spark MLlib.
  • Proficient in various data visualization tools and libraries such as Salesforce Einstein Analytics, Tableau, Matplotlib, Seaborn, Plotly, and D3.js to create visually powerful and actionable interactive reports and dashboards.
  • Excellent knowledge of and experience with Hadoop architecture and other components of its ecosystem, such as HDFS, YARN, MapReduce, Hive, Sqoop, Oozie, Airflow, and Spark.
  • Experienced in performing in-memory data processing for batch, real-time, and advanced analytics using Apache Spark (Spark SQL & Spark-Shell).
  • Experienced in addressing complex POCs according to business requirements from the technical end.
  • Proficient in database development in RDBMSs: Oracle, PostgreSQL, MySQL, and MS SQL Server.
  • Excellent understanding of the Data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables.
  • Excellent knowledge in creating databases, tables, stored procedures, DDL/DML triggers, views, user-defined data types, functions, cursors, and indexes.
  • Proficient in Spark with Scala for loading data from local file systems, HDFS, Amazon S3, and relational and NoSQL databases using Spark SQL, and importing data into RDDs (see the data-loading sketch at the end of this list).
  • Experienced in configuring Oozie workflows to run multiple Hive and Spark jobs that run independently based on time and data availability.
  • Comprehensive knowledge in provisioning virtual clusters under AWS cloud which includes services like EC2, S3, and EMR, and other cloud technologies such as Microsoft Azure.
  • Good understanding of all aspects of Testing such as Unit, Regression, Agile, White-box, Black-box.
  • Solid understanding of EXALEAD CloudView components - CloudView Connectivity, CloudView Semantic Factory, CloudView Index and CloudView Mashup Builder.
  • Knowledge of microservices and hands-on experience in getting Docker containers and Kubernetes up and running.
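
A minimal PySpark sketch of the data-loading pattern referenced above (the work itself also used Scala). The paths, bucket name, and view name are placeholders rather than actual project values:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("load-sketch").getOrCreate()

    # Load from HDFS and Amazon S3 into DataFrames (RDDs are available via df.rdd).
    hdfs_df = spark.read.option("header", "true").csv("hdfs:///data/raw/events.csv")
    s3_df = spark.read.json("s3a://example-bucket/raw/events/")

    # Register a temp view and query it with Spark SQL.
    hdfs_df.createOrReplaceTempView("events")
    spark.sql("SELECT COUNT(*) AS n FROM events").show()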

TECHNICAL SKILLS

Big Data Ecosystems: Hadoop 3.0, Spark 2.3, HBase 1.2, Hive 2.3, Sqoop 1.4, Oozie 4.3

Cloud Management: AWS/EMR/EC2/S3

OLAP Tools: Tableau 2019, Alteryx 2019, Salesforce Einstein Discovery, SSAS, Power BI

Programming: Python 3, PySpark 2.4, Unix shell scripting (familiarity with Java 9 and Scala)

Databases: Oracle 12c/11g, MS SQL Server 2016/2014, Postgres 11.5

Tools: PyCharm, Jupyter Notebook, Eclipse, Apache Zeppelin, Git, Salesforce Einstein Discovery, EXALEAD CLOUDVIEW

Analytics: NumPy, pandas, scikit-learn, Keras, Matplotlib, Seaborn, Plotly, TensorFlow, Spark MLlib, MXNet

Probability, Statistics, and Machine Learning: Random Variables, Dummy Variables, Probability Distributions (Bernoulli, Binomial, Poisson, Geometric, and Gaussian), Expectation and Variance, Covariance and Correlation, Linear Regression, Logistic Regression, ARIMA, Decision Trees, Random Forest, k-Nearest Neighbors, Support Vector Machines, K-means, Neural Networks, Deep Learning, Markov Chains, Feature Engineering, Dimensionality Reduction, PCA, Model Evaluation, Model Selection

Web: HTML, CSS, jQuery, JavaScript

PROFESSIONAL EXPERIENCE

Confidential

Advanced Analytics - Data Science

Responsibilities:

  • Collaborated with the product lead, product owners, and other stakeholders to gather business/functional requirements, framed clear problem statements, and developed roadmaps/timelines for delivery.
  • Participated in sprint review/retrospective meetings and daily Scrum meetings to give daily updates on work.
  • Extracted data from data sources such as SFDC Core to the edge node using the Bulk API and wrote a Unix function that automatically removes "unwanted files" from the edge node.
  • Created Hive external tables on the loaded data and wrote Hive queries in Hue for ad-hoc analysis.
  • Extracted appropriate features from the data sets to handle bad, null, partial records using Spark SQL.
  • Implemented various performance tunings to improve data accessibility, e.g. converted DataFrames into Parquet files, stored them in HDFS, and then ran SQL queries against the Parquet files (see the Parquet sketch after this list).
  • Implemented Spark SQL to access Hive tables in Spark for faster data processing.
  • Performed univariate analysis to check the distribution of each variable and bivariate analysis such as violin plots and bar graphs to compare the distributions of different variables using a combination of the Matplotlib and Seaborn libraries.
  • Derived features such as aggregations, dummy variables, target variables, and many more using the Python libraries NumPy and pandas.
  • Applied imputation methods to missing data, such as mean imputation, last value carried forward, and using information from related observations, to deal with MCAR, MAR, and MNAR.
  • Removed irrelevant features using methods such as chi-square and ANOVA F-value tests, the correlation matrix, and thresholding Bernoulli feature variance (see the feature-selection sketch after this list).
  • Used k-fold cross-validation along with stratified sampling so that the test set generated using stratified sampling has target category proportions almost identical to those in the full dataset (see the cross-validation sketch after this list).
  • Applied logistic regression, linear regression, the ordinary least squares method, mean-variance analysis, the law of large numbers, the Poisson distribution, Bayes, Naive Bayes, fitting functions, and many more to data using the NumPy, pandas, scikit-learn, and Spark MLlib libraries.
  • Applied clustering algorithms to task-related data to study the underlying data patterns using a variety of techniques, e.g. PCA, factor analysis, and K-means, using the NumPy, pandas, and scikit-learn libraries.
  • Used model evaluation methods - Precision, Recall, F1 Score, Confusion Matrix, R squared, MSE, RMSE, Log loss, Gini Index, and AUC - for cross-validation metrics.
  • Designed, configured, monitored and scheduled complex data flows in Salesforce Einstein.
  • Applied various computeExpression Transformations and computeRelative Transformation. Additionally, used many other transformations such as delta, dim2mea, filter, and flatten using SAQL.
  • Visualized data with charts such as heat map, bar chart, line chart, geo map, dual axis graphs and custom charts such as Pyramid, Scatter and Timeline charts using Einstein Analytics.
  • Built interactive analytics dashboards - creating templates, adding and managing widgets, making them interactive, testing, validating, and optimizing.
  • Created technical reports to summarize data insights for the stakeholders and management.
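
A minimal sketch of the Parquet tuning step described above: persist a DataFrame as Parquet in HDFS and run Spark SQL directly against the Parquet files. The database, table, and path names are placeholders, not actual project values:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-sketch").enableHiveSupport().getOrCreate()

    # Read a Hive table into Spark, then persist it as Parquet in HDFS.
    cases_df = spark.table("analytics_db.support_cases")
    cases_df.write.mode("overwrite").parquet("hdfs:///warehouse/support_cases_parquet")

    # Query the Parquet files directly with Spark SQL.
    spark.sql(
        "SELECT status, COUNT(*) AS n "
        "FROM parquet.`hdfs:///warehouse/support_cases_parquet` "
        "GROUP BY status"
    ).show()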
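
A minimal scikit-learn sketch of the feature-filtering step above (variance thresholding on binary features plus ANOVA F-value selection); the data here is synthetic and the thresholds are illustrative:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

    X, y = make_classification(n_samples=500, n_features=30, random_state=0)
    X_bin = (X > 0).astype(int)  # pretend some features are Bernoulli indicators

    # Drop binary features whose variance p(1 - p) falls below that of an 80/20 split.
    X_var = VarianceThreshold(threshold=0.8 * (1 - 0.8)).fit_transform(X_bin)

    # Keep the 10 features with the highest ANOVA F-value against the target.
    X_best = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)
    print(X_var.shape, X_best.shape)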
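
A minimal scikit-learn sketch of stratified k-fold cross-validation, so each fold keeps target class proportions close to those of the full dataset; the data and model here are placeholders:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Synthetic, imbalanced binary-classification data.
    X, y = make_classification(n_samples=1000, n_features=20, weights=[0.8, 0.2], random_state=42)

    # Each fold preserves the 80/20 target proportions.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")
    print("F1 per fold:", np.round(scores, 3), "mean:", scores.mean().round(3))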

Environment: PyCharm, Python 3.6, PySpark 2.2, Git, Unix, Hadoop 3.0, HDFS, Apache Hive 2.3, Hue, Apache Spark 2.3, Docker 18.02, Salesforce Einstein Discovery.

Confidential, Needham, MA

Big Data Engineer

Responsibilities:

  • Involved in creating database objects such as master tables, views, procedures, triggers, and functions using PostgreSQL to provide definition and structure and to maintain data efficiently.
  • Extracted data from Freshdesk in JSON format using REST APIs and then loaded the data into HDFS (see the extraction sketch after this list).
  • Performed data cleaning and data transformation (e.g. incorrect data, empty rows, derived variables, and more) using PySpark (see the cleaning sketch after this list).
  • Created Hive external tables on the loaded data using Hue that were consumed by the BI and ML teams.
  • Built out operational dashboards in Tableau that helped executives monitor and optimize the SLA.
  • Built and published customized interactive reports and dashboards and scheduled them using Tableau Server.
  • Applied various graphs, e.g. multi-axis line charts, dual-axis line and bar charts, geo mapping, and more.
  • Created action filters, parameters and calculated sets for preparing dashboards and worksheets in Tableau.
  • Built a real-time streaming model using Spark Streaming as part of a POC on the Databricks platform: loaded streaming text data from Amazon S3 into Spark RDDs, wrote a program to consume the live stream of an internal collaboration platform, built a machine learning model using Spark MLlib TF-IDF in Scala, and invoked it from SQL to build a dashboard with Apache Zeppelin that applies the model and filters items for users.
  • Assisted senior data scientists by performing data cleansing and data wrangling on a vast amount of marketing attribution data using NumPy, pandas, and scikit-learn.
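
A minimal sketch of the REST extraction step above: pull a page of JSON from an API and land the raw file in HDFS. The URL, API key, page size, and paths are placeholders and not real Freshdesk account settings:

    import json
    import subprocess
    import requests

    resp = requests.get(
        "https://example.freshdesk.com/api/v2/tickets",   # illustrative endpoint
        auth=("API_KEY_PLACEHOLDER", "X"),                 # placeholder credentials
        params={"page": 1, "per_page": 100},
        timeout=30,
    )
    resp.raise_for_status()

    # Save the page locally, then push the raw JSON into HDFS for PySpark processing.
    local_path = "/tmp/tickets_page_1.json"
    with open(local_path, "w") as f:
        json.dump(resp.json(), f)
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, "/data/raw/freshdesk/"], check=True)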
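
A minimal PySpark sketch of the cleaning/transformation step above: drop empty rows, patch bad values, and derive a new column. All column names and paths are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cleaning-sketch").getOrCreate()
    raw = spark.read.json("hdfs:///data/raw/freshdesk/")

    cleaned = (
        raw.dropna(how="all")                                    # remove empty rows
           .filter(F.col("status").isNotNull())                  # drop incomplete records
           .withColumn("priority", F.coalesce(F.col("priority"), F.lit(1)))  # default bad values
           .withColumn(                                           # derived variable
               "resolution_hours",
               (F.unix_timestamp("resolved_at") - F.unix_timestamp("created_at")) / 3600.0,
           )
    )
    cleaned.write.mode("overwrite").parquet("hdfs:///data/clean/freshdesk_tickets")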

Environment: Tableau 2018, Postgres 10, NumPy, pandas, scikit-learn, Jupyter Notebook, Python 3, Scala 2.12, AWS S3, Apache Spark 2.x, Apache Zeppelin, Databricks

Confidential, FL

Data Engineer/Scientist

Responsibilities:

  • Worked with product lead to understand business/functional requirements and data fields.
  • Performed data wrangling such as reshaping, summarizing, fuzzy matching, and transposing fields.
  • Performed univariate and bivariate analysis such as checking contingency tables and creating violin plots, histograms, and scatterplots using Matplotlib, Seaborn, and Alteryx's Data Investigation tab.
  • Joined automatic meter reading (AMR) data with weather data.
  • Created a time-series model to forecast water consumption - used ARIMA and ETS models and then joined the outputs with the Union tool to bring both datasets together (see the ARIMA sketch after this list).
  • Designed ETL packages dealing with different data sources (SQL Server, flat files, XMLs, etc.), loaded the data into target data sources by performing different kinds of transformations using T-SQL in Azure Data Lake Analytics, and then created data visualization reports using Power BI.
  • Fed the output from the Union tool into the TS Compare tool to compare the error statistics (i.e. RMSE, MAE, MPE, MAPE, and Precision Measure (PM)) of the models.
  • Performed feature engineering such as deriving indicator variables using the Cross Tab tool and the target variable using the Formula tool.
  • Used the Create Samples tool to create estimation and validation samples and then created logistic regression and random forest models using the Logistic Regression and Forest Model tools.
  • Applied feature scaling - min-max scaling and standardization - to numerical variables such as meter readings using scikit-learn's MinMaxScaler and StandardScaler transformers (see the preprocessing sketch after this list).
  • Encoded nominal categorical data using the one-hot encoding method and performed feature engineering around date-time and location fields.
  • Used 5-fold cross validation along with stratified sampling so that the test set generated using stratified sampling has target category proportions almost identical to those in the full dataset.
  • Applied logistic regression classifier, SVM and Random Forest using pandas and scikit-learn and then used confusion matrix for model evaluation.
  • Used performance evaluation metrics: Precision, Recall, F1 Score, Confusion Matrix, and Log loss.
  • Fine-tuned the model using scikit-learn's GridSearchCV, searching for the best combination of hyperparameter values for the RandomForestRegressor (see the grid-search sketch after this list).
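
A rough Python equivalent of the ARIMA forecasting step above, using statsmodels (the project itself used Alteryx's ARIMA/ETS tools); the monthly consumption series is synthetic:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Synthetic monthly water-consumption series with a mild trend.
    np.random.seed(0)
    idx = pd.date_range("2015-01-01", periods=48, freq="MS")
    consumption = pd.Series(100 + np.arange(48) * 0.5 + np.random.normal(0, 2, 48), index=idx)

    # Fit an ARIMA(1,1,1) model and forecast 12 periods ahead.
    model = ARIMA(consumption, order=(1, 1, 1)).fit()
    print(model.forecast(steps=12).head())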
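
A minimal scikit-learn sketch of the preprocessing described above: scale numerical meter readings and one-hot encode a nominal field. Column names and values are placeholders:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

    df = pd.DataFrame({
        "meter_reading": [120.5, 98.0, 143.2, 110.7],
        "prior_reading": [115.0, 97.5, 140.0, 108.2],
        "region": ["north", "south", "north", "east"],
    })

    # Min-max scale one numeric column, standardize another, one-hot encode the nominal column.
    preprocess = ColumnTransformer([
        ("minmax", MinMaxScaler(), ["meter_reading"]),
        ("standard", StandardScaler(), ["prior_reading"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ])
    print(preprocess.fit_transform(df))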
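
A minimal sketch of the grid search mentioned above: GridSearchCV over a small RandomForestRegressor grid. The data is synthetic and the parameter grid is illustrative:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    X, y = make_regression(n_samples=500, n_features=10, noise=0.2, random_state=0)

    # Search a small hyperparameter grid with 5-fold cross-validation.
    param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10], "max_features": ["sqrt", 1.0]}
    search = GridSearchCV(
        RandomForestRegressor(random_state=0),
        param_grid,
        cv=5,
        scoring="neg_root_mean_squared_error",
    )
    search.fit(X, y)
    print(search.best_params_, -search.best_score_)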

Environment: Python 3.5, NumPy, pandas, Matplotlib, Seaborn, scikit-learn, Alteryx 2018, Tableau 2018

Confidential, NJ

Data Scientist, IOT

Responsibilities:

  • Involved in creating Shell scripts to simplify the execution of all other scripts (Pig, Hive, Sqoop, Impala and MapReduce) and move the data inside and outside of HDFS.
  • Designed ETL packages dealing with different data sources (SQL Server, flat files, XMLs, etc.), loaded the data into target data sources by performing different kinds of transformations using T-SQL in Azure Data Lake Analytics, and then created data visualization reports using Power BI.
  • Implemented Spark SQL to access Hive tables in Spark for faster processing of data.
  • Applied feature engineering techniques such as counts, binning, time deltas, and log transformations to derive relevant features from the dataset using NumPy and pandas (see the feature-engineering sketch after this list).
  • Applied linear regression, multiple regression, the ordinary least squares method, mean-variance analysis, the law of large numbers, logistic regression, dummy variables, residuals, the Poisson distribution, Bayes, Naive Bayes, fitting functions, and many more to data using the Spark MLlib libraries.
  • Applied dimensionality reduction methods to remove irrelevant features from the data using techniques like chi-square, ANOVA F-value, the correlation matrix, and thresholding binary feature variance.
  • Applied clustering algorithms to market data to study the underlying data patterns using a variety of techniques, e.g. PCA, factor analysis, hierarchical clustering, and K-means, using scikit-learn.
  • Used model evaluation methods - Precision, Recall, F1 Score, Confusion Matrix, R squared, MSE, RMSE, Log loss, Gini Index, and AUC - for cross-validation metrics (see the metrics sketch after this list).
  • Prepared process flow/activity diagrams for the existing system using MS Visio and re-engineered the design based on business requirements.
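
A minimal pandas/NumPy sketch of the feature engineering described above: counts, binning, time deltas, and a log transform. Column names and values are placeholders:

    import numpy as np
    import pandas as pd

    events = pd.DataFrame({
        "device_id": ["a", "a", "b", "b", "b"],
        "reading": [10.0, 250.0, 5.0, 80.0, 900.0],
        "timestamp": pd.to_datetime(
            ["2019-01-01", "2019-01-03", "2019-01-01", "2019-01-02", "2019-01-10"]
        ),
    })

    events["event_count"] = events.groupby("device_id")["reading"].transform("count")    # count feature
    events["reading_bin"] = pd.cut(events["reading"], bins=[0, 50, 500, np.inf],
                                   labels=["low", "mid", "high"])                         # binning
    events["days_since_prev"] = events.groupby("device_id")["timestamp"].diff().dt.days   # time delta
    events["log_reading"] = np.log1p(events["reading"])                                   # log transform
    print(events)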
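
A minimal sketch of the evaluation metrics listed above, computed with scikit-learn on placeholder labels and scores:

    from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                                 f1_score, log_loss, roc_auc_score)

    # Placeholder ground truth, hard predictions, and predicted probabilities.
    y_true = [0, 0, 1, 1, 1, 0, 1, 0]
    y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
    y_prob = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]

    print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall:", recall_score(y_true, y_pred))
    print("F1:", f1_score(y_true, y_pred))
    print("log loss:", log_loss(y_true, y_prob))
    print("AUC:", roc_auc_score(y_true, y_prob))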

Environment: Python 3, Microsoft Azure, HDInsight, Power BI, Unix, SQL Server, Jupyter notebook, Hadoop, MapReduce, Hive, Hue, Spark 2, Pig 0.16, Galaxy 2021

Confidential, Boston, MA

Data Analyst

Responsibilities:

  • Analyzed business requirements, system requirements, data mapping requirement specifications, and responsible for documenting functional requirements and supplementary requirements.
  • Created reports from OLAP sources, subreports, bar charts, and matrix reports using SSRS.
  • Designed and developed cubes using SQL Server Analysis Services (SSAS) in Microsoft Visual Studio.
  • Used VLOOKUPs, pivot tables, and macros in Excel to develop ad-hoc and scheduled reports (bi-weekly, monthly, quarterly/yearly) and recommended solutions to drive business decision-making.
  • Created BI dashboards using Tableau for visual spend analytics and regression analysis to identify opportunities to reduce cost, track contract compliances, and measure supplier performance.
  • Updated and maintained custodian database in PeopleSoft Finance for asset management and performed QA for data duplicates, null records and more.

Environment: SQL Server, MS-Excel, VLOOKUP, Tableau, SSRS, SSIS, OLAP, Power BI, PeopleSoft
