Big Data Engineer/Spark Engineer Resume
Boston, MA
SUMMARY
- 6+ years of experience as a Big Data Engineer/Data Engineer working with Hadoop and the Spark framework, covering analysis, design, development, documentation, deployment, and integration using SQL and Big Data technologies.
- Experience in implementing various Big Data Analytical, Cloud Data engineering, Data Warehouse/ Data Mart, Data Visualization, Reporting, Data Quality, and Data virtualization solutions.
- Good knowledge of Hadoop architecture and its components like HDFS, MapReduce, Job Tracker, Task Tracker, Name Node and Data Node.
- Proven track record of working as a Data Engineer on Amazon cloud services, Big Data/Hadoop applications, and product development.
- Experience in integrating various relational and non-relational sources such as DB2, Oracle, SQL Server, NoSQL (MongoDB), XML, and flat files into data warehouses.
- Solid understanding of statistical modeling and supervised/unsupervised/reinforcement machine learning techniques, with a keen interest in applying these techniques to predictive analytics using Python and R.
- Good knowledge of Apache Hadoop technologies like Pig, MapReduce, Hive, Sqoop, Spark, Flume, Oozie, and HBase.
- End to end experience in designing and deploying Analytical Dashboards and data visualizations using Tableau Desktop, Tableau Server, Tableau Reader and Tableau Public.
- Extensive experience in various phases of software development, including requirements gathering, data analysis, and design, with expertise in documentation.
- Extensive experience writing SQL queries, along with good experience developing T-SQL and Oracle PL/SQL scripts, stored procedures, and triggers to implement business logic.
- Experience in data modeling for data mart/data warehouse development, including conceptual, logical, and physical model design, developing Entity Relationship Diagrams (ERDs), and reverse/forward engineering ERDs with CA Erwin Data Modeler.
- Experience in migrating data from Excel, flat files, and Oracle to MS SQL Server using SQL Server Integration Services (SSIS).
- Good working experience in developing ETL mappings for data loads from various sources such as Oracle, flat files, DB2, and SQL Server.
- Experience with process improvement, normalization/de-normalization, data extraction, cleansing, and manipulation.
- Experience designing and developing data models for OLTP databases, Operational Data Stores (ODS), OLAP data warehouses, and federated databases to support client enterprise information management strategies.
- Experience transforming and loading data from heterogeneous data sources into SQL Server using SSIS.
- Strong understanding of data warehouse concepts, ETL, star schema, and snowflake schema, with data modeling experience covering normalization, business process analysis, dimensional data modeling, fact and dimension tables, and physical and logical data modeling.
- Hands-on experience modeling with Erwin, developing Entity-Relationship models, modeling transactional databases and data warehouses, and dimensional data modeling for data marts with fact and dimension tables.
- Experience in designing star schemas and snowflake schemas for data warehouses using data modeling tools like Power Designer and Embarcadero ER Studio.
- Experience in big data analysis and developing data models using Hive, Pig, MapReduce, and SQL, with strong data architecture skills for designing data-centric solutions.
- Integration Architect and Data Scientist experience across analytics, Big Data, ETL, and cloud technologies.
- Extensive experience in text analytics, developing statistical machine learning and data mining solutions to various business problems, and producing data visualizations using Python and R.
- Experience in manipulating large datasets with R packages like tidyr, tidyverse, dplyr, reshape, and lubridate, and visualizing the data using the lattice and ggplot2 packages.
- Extensive experience using various Python libraries like Beautiful Soup, NumPy, SciPy, matplotlib, python-twitter, pandas, seaborn, NLTK, Keras, TensorFlow, scikit-learn, and MySQLdb for database connectivity.
- Experience in Data Science, Data Analysis, Data Profiling, Data Integration and Migration.
- Exploring opportunities in data science, including deep machine learning, natural language processing, and artificial intelligence (AI).
- Experience in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across large volumes of structured and unstructured data.
- Good understanding of Deep learning using CNN, RNN and ANN.
- Experience in Data Analysis, Data Modeling, Data Architecture, Designing, developing, and implementing data models for enterprise-level applications and systems.
TECHNICAL SKILLS
Big Data Tools: Hadoop, MapReduce, HDFS, Hive, Pig, HBase, Sqoop, Spark, Kafka
OLAP & ETL Tools: Tableau, Spyder, Spark, SSIS, Informatica Power Center
Data Modelling Tools: Microsoft Visio, ER Studio, Erwin
Python and R Libraries: R - tidyr, tidyverse, dplyr, reshape, lubridate; Python - Beautiful Soup, NumPy, SciPy, matplotlib, python-twitter, pandas, scikit-learn, Keras, TensorFlow, NLTK
Languages: SQL, Python, R, Scala
Data Warehouse schemas: Star Schema, Snowflake schema
Database: MySQL, Hive, Teradata, MS Access, SQL Server, Oracle, Mongo DB, PostgreSQL
Reporting Tools: MS Excel, Tableau, Power BI, QlikView, Qlik Sense, D3, SSRS, SSIS
Cloud Computing Tools: Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)
Machine Learning: Regression, Clustering, MLlib, Linear Regression, Logistic Regression, Decision Tree, SVM, Naive Bayes, KNN, K-Means, Random Forest, and Gradient Boost & Adaboost, Neural Networks and Time Series Analysis.
Data Science Tools: Machine Learning, Deep Learning, Data Warehouse, Data Mining, Data Analysis, Big data, Visualizing, Data Munging, Data Modelling
Operating Systems: Windows, Linux, Mac OS
PROFESSIONAL EXPERIENCE
Confidential, Boston, MA
Big Data Engineer/Spark Engineer
Responsibilities:
- Gathered, documented, and implemented business requirements for analysis or as part of long-term document/report generation.
- Performed big data analysis using Hadoop, MapReduce, MongoDB, Pig/Hive, Spark/Shark, MLlib, Scala, NumPy, SciPy, pandas, and scikit-learn.
- Created DDLs for tables and executed them to create tables in the warehouse for ETL data loads.
- Designed and implemented a Big Data analytics architecture, transferring data from Oracle.
- Worked to research and develop statistical learning models for data analysis using Python and R. Collaborated with product management and engineering departments.
- Optimized existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and Pair RDDs (a minimal PySpark sketch appears at the end of this section).
- Designed, implemented, and maintained database schemas, entity relationship diagrams, data models, tables, stored procedures, functions, triggers, constraints, clustered and non-clustered indexes, table partitioning, views, rules, defaults, and complex SQL statements to meet business requirements and enhance performance.
- Developed data pipelines using Flume, Sqoop, Pig, Java MapReduce, and Spark to ingest customer behavioral data and purchase histories into HDFS for analysis.
- Designed data marts following star schema and snowflake schema methodologies, using industry-leading data modeling tools like ER Studio.
- Extracted, transformed, and loaded (ETL) data into a PostgreSQL database using Python scripts (a minimal sketch appears at the end of this section).
- Worked on Hadoop, using MapReduce, Hive, and Pig to store, process, and analyze huge volumes of unstructured data.
- Used Big Data technologies like Hadoop, Hive, Pig, and Spark for developing and executing models.
- Used query languages like SQL, Hive, and Pig, and worked with NoSQL databases like MongoDB, Cassandra, and HBase.
- Built and maintained SQL scripts, indexes, and complex queries for data analysis and extraction.
- Developed an ETL framework using Spark and Hive (including daily runs, error handling, and logging) to turn raw data into useful datasets.
- Designed programs on Amazon Web Services (AWS) for customer analytics and predictions.
- Used Kafka as a message broker to collect large volumes of data and analyze the collected data in a distributed system.
- Used the scikit-learn, pandas, and statsmodels Python libraries to build predictive forecasting models (see the forecasting sketch at the end of this section).
- Performed data cleaning using R and filtered input variables using the correlation matrix, stepwise regression, and Random Forest.
- Used Tableau as a Business Intelligence tool to visually analyze the data and show trends, variations, and density in the form of graphs and charts.
- Formulated procedures for integrating R programs with data sources and delivery systems, and used R for prediction.
- Built advanced analytics solutions deployed into production for prediction, forecasting, and optimization, including: Data mining, Statistical analysis, Modeling, Machine learning, Visualization using Python, R, and Tableau.
- Worked with both unstructured and structured data, applying machine learning algorithms such as Linear and Logistic Regression, Decision Trees, Random Forests, Support Vector Machines, Neural Networks, KNN, and time series analysis.
- Involved in Data Analysis for analyzing client business needs, managing large data, storing and extracting data.
- Coordinated with the Data Science team to design and implement advanced analytical models on a Hadoop cluster over large datasets.
- Worked on Numerical optimization, Anomaly Detection and estimation, A/B testing, Statistics, and Maple.
Environment: Scala, Spark, SQL, R, Python, Big Data, AWS, Kafka, Tableau, Hadoop, MapReduce, HBase, Snowflake, MongoDB, Pig/Hive, Spark/Shark, MLlib, NumPy, SciPy, pandas, scikit-learn, S3, EC2, RDS.
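The Spark optimization bullet above refers to moving Pair RDD logic onto DataFrames and Spark SQL. A minimal PySpark sketch of that pattern follows; the file path, table name, and column names (purchases, customer_id, amount) are illustrative assumptions, not the actual production schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("purchase-aggregation").getOrCreate()

# Pair RDD style: per-customer purchase totals computed with reduceByKey.
lines = spark.sparkContext.textFile("hdfs:///data/purchases.csv")
header = lines.first()
rdd_totals = (
    lines.filter(lambda line: line != header)          # drop the header row
         .map(lambda line: line.split(","))
         .map(lambda cols: (cols[0], float(cols[2])))  # (customer_id, amount)
         .reduceByKey(lambda a, b: a + b)
)

# DataFrame / Spark SQL style: the same aggregation expressed declaratively,
# so the Catalyst optimizer can plan the job.
df = spark.read.csv("hdfs:///data/purchases.csv", header=True, inferSchema=True)
df_totals = df.groupBy("customer_id").agg(F.sum("amount").alias("total_spend"))

df.createOrReplaceTempView("purchases")
sql_totals = spark.sql(
    "SELECT customer_id, SUM(amount) AS total_spend FROM purchases GROUP BY customer_id"
)
```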
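The PostgreSQL ETL bullet above describes loading data with Python scripts. Below is a minimal sketch of such an extract-transform-load step, assuming pandas and SQLAlchemy; the connection string, file path, table, and column names are placeholders.

```python
import pandas as pd
from sqlalchemy import create_engine

# Extract: read a raw flat-file export (path is a placeholder).
raw = pd.read_csv("/data/exports/customers_raw.csv")

# Transform: basic cleansing before loading.
clean = (
    raw.dropna(subset=["customer_id"])
       .drop_duplicates(subset=["customer_id"])
       .assign(signup_date=lambda d: pd.to_datetime(d["signup_date"], errors="coerce"))
)

# Load: append the cleaned frame into a PostgreSQL staging table.
engine = create_engine("postgresql+psycopg2://etl_user:etl_pass@localhost:5432/analytics")
clean.to_sql("stg_customers", engine, if_exists="append", index=False)
```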
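The forecasting bullet above mentions scikit-learn, pandas, and statsmodels. A condensed sketch of a pandas/statsmodels time-series forecast is shown below; the input file, column names, and ARIMA order are assumptions for illustration.

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Load a daily revenue series (file and column names are illustrative).
sales = (
    pd.read_csv("daily_sales.csv", parse_dates=["date"], index_col="date")["revenue"]
    .asfreq("D")
    .ffill()
)

# Fit a simple ARIMA model and forecast the next 30 days.
fitted = ARIMA(sales, order=(1, 1, 1)).fit()
forecast = fitted.forecast(steps=30)
print(forecast.head())
```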
Confidential, Boston, MA
Big Data/Spark Engineer
Responsibilities:
- Data Warehouse: designed and programmed ETL and aggregation of data in the target database, working with staging, de-normalized, and star schemas and dimensional reporting.
- Wrote several Teradata SQL Queries using Teradata SQL Assistant for Ad Hoc Data Pull request.
- Implemented various machine learning algorithms (Linear Regression, Logistic Regression, Decision Tree, SVM, Naive Bayes, KNN, K-Means, Random Forest, Gradient Boosting, and AdaBoost) on datasets from the UCI Machine Learning Repository (see the classifier sketch at the end of this section).
- Built a data pipeline framework using Python for data extraction, data wrangling, and data loading into Oracle SQL and Apache HDFS using Pig and Hive.
- Developed star and snowflake schema based dimensional models to grow the data warehouse.
- Integrated new tools and developed technology frameworks/prototypes to accelerate the data integration process and enable deployment of predictive analytics by developing Spark Scala modules with Python.
- Involved in running MapReduce jobs for processing millions of records.
- Involved in data modeling in HBase per requirements, and in managing and scheduling jobs on a Hadoop cluster using Oozie.
- Used Pig as an ETL tool to perform transformations, event joins, filters, and some pre-aggregations before storing the data onto HDFS.
- Designed and Implemented Sharding and Indexing Strategies for MongoDB servers.
- Involved in relational and dimensional data modeling to create logical and physical database designs and ER diagrams, with all related entities and relationships defined per the rules provided by the business manager, using ER Studio.
- Worked on Spark SQL and DataFrames for faster execution of Hive queries using the Spark SQL Context.
- Designed and developed ETL processes using the Informatica ETL tool for dimension and fact file creation.
- Responsible for business case analysis, requirements gathering, use case documentation, prioritization, product/portfolio strategic roadmap planning, high level design and data model.
- Developed and implemented predictive analyses using R and Python for management and business users to support the decision-making process.
- Performed statistical data analysis and data visualization using R and R Shiny.
- Worked on creating filters, parameters and calculated sets for preparing dashboards and worksheets in Tableau.
- Implemented data refreshes on Tableau Server for biweekly and monthly increments based on business change to ensure that the views and dashboards were displaying the changed data accurately.
- Involved in data analysis for data conversion, including data mapping from source to target database schemas, specification, and writing data extraction scripts/programs for data conversion in test and production environments.
- Developed business predictive/historic analysis, Data Mining/Text Mining using R.
- Used the K-Means clustering technique to identify outliers and classify unlabeled data (see the K-Means sketch at the end of this section).
- Responsible for end-to-end solution delivery, sprint planning and execution, change management, and UAT.
Environment: Linear Regression, Logistic Regression, Decision Tree, SVM, Naive Bayes, Star Schema, Snowflake, KNN, K-Means, Random Forest, and Gradient Boost & Adaboost, R, Python, ETL, Oracle, SQL, HDFS, Pig, Hive, R Studio, Spark, Scala, Linux.
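The machine learning bullet above lists the classifiers applied to UCI Machine Learning Repository datasets. The sketch below shows one way such a comparison can be run with scikit-learn; the breast-cancer dataset bundled with scikit-learn stands in for the actual datasets, and all models use default hyperparameters.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Stand-in UCI dataset shipped with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=5000),
    "DecisionTree": DecisionTreeClassifier(),
    "SVM": SVC(),
    "NaiveBayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "RandomForest": RandomForestClassifier(),
    "GradientBoosting": GradientBoostingClassifier(),
    "AdaBoost": AdaBoostClassifier(),
}

# Fit each model and report hold-out accuracy.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```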
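The K-Means bullet above covers using clustering to flag outliers in unlabeled data. A minimal scikit-learn sketch of one common approach (distance to the assigned centroid) follows; the input file, feature set, cluster count, and percentile threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Unlabeled numeric features (file and columns are placeholders).
features = pd.read_csv("customer_features.csv").select_dtypes(include="number")
X = StandardScaler().fit_transform(features)

# Cluster the records, then flag points far from their assigned centroid as outliers.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
distances = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)
threshold = np.percentile(distances, 99)
outliers = features[distances > threshold]
print(f"Flagged {len(outliers)} potential outliers out of {len(features)} records")
```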