
Senior Big Data Engineer Resume


St Minneapolis, MN

SUMMARY

  • Overall, 8+ years of technical IT experience in all phases of Software Development Life Cycle (SDLC) with skills in data analysis, design, development, testing and deployment of software systems.
  • 6+ years of industry experience in Big Data analytics and data manipulation using Hadoop ecosystem tools: MapReduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Spark, Kafka, Flume, Sqoop, Oozie, Avro, AWS, Spring Boot, Spark integration with Cassandra, Solr, and Zookeeper.
  • Experience in developing data pipelines using AWS services including EC2, S3, Redshift, Glue, Lambda functions, Step functions, CloudWatch, SNS, DynamoDB, SQS.
  • Proficiency in multiple databases like MongoDB, Cassandra, MySQL, Oracle, and MS SQL Server. Worked on different file formats like delimited files, Avro, JSON, and Parquet. Docker container orchestration using ECS, ALB, and Lambda.
  • Extensive knowledge of QlikView Enterprise Management Console (QEMC), QlikView Publisher, and QlikView Web Server.
  • Implemented a batch process for loading high-volume data using the Apache NiFi dataflow framework, following an Agile development methodology.
  • Worked as team JIRA administrator providing access, working assigned tickets, and teaming with project developers to test product requirements/bugs/new improvements.
  • Created Snowflake schemas by normalizing the dimension tables as appropriate and creating a sub-dimension named Demographic as a subset of the Customer dimension.
  • Experienced in Pivotal Cloud Foundry (PCF) on Azure VMs to manage the containers created by PCF.
  • Hands-on experience in test-driven development (TDD), behavior-driven development (BDD), and acceptance test-driven development (ATDD) approaches.
  • Managed databases and Azure Data Platform services (Azure Data Lake (ADLS), Data Factory (ADF), Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB), SQL Server, Oracle, data warehouses, etc. Built multiple data lakes.
  • Extensive experience in text analytics, generating data visualizations using R and Python, and creating dashboards using tools like Power BI.
  • Worked with Google Cloud Dataflow and BigQuery to manage and move data within a 200-petabyte cloud data lake for GDPR compliance. Also designed a star schema in BigQuery.
  • Extensive programming expertise in designing and developing web-based applications using Spring Boot, Spring MVC, Java servlets, JSP, JTS, JTA, JDBC and JNDI.
  • Experience in MVC and microservices architecture with Spring Boot and Docker Swarm.
  • Expertise in Java programming with a good understanding of OOP, I/O, Collections, Exception Handling, Lambda Expressions, and Annotations.
  • Provided full life cycle support to logical/physical database design, schema management and deployment. Adept at database deployment phase with strict configuration management and controlled coordination with different teams.
  • Experience in Spring Frameworks like Spring Boot, Spring LDAP, Spring JDBC, Spring Data JPA, Spring Data REST
  • Experience in writing code in R and Python to manipulate data for data loads, extracts, statistical analysis, modeling, and data munging.
  • Familiar with latest software development practices such as Agile Software Development, Scrum, Test Driven Development (TDD) and Continuous Integration (CI).
  • Utilized Kubernetes and Docker as the runtime environment for the CI/CD system to build, test, and deploy. Experience in creating and running Docker images with multiple microservices.
  • Utilized analytical applications like R, SPSS, Rattle and Python to identify trends and relationships between different pieces of data, draw appropriate conclusions and translate analytical findings into risk management and marketing strategies that drive value.
  • Extensive hands-on experience with distributed computing architectures such as AWS products (e.g. EC2, Redshift, EMR, and Elasticsearch), Hadoop, Python, and Spark, and effective use of Azure SQL Database, MapReduce, Hive, SQL, and PySpark to solve big data problems.
  • Strong experience in Microsoft Azure Machine Learning Studio for data import, export, data preparation, exploratory data analysis, summary statistics, feature engineering, Machine learning model development and machine learning model deployment into Server system.
  • Proficient in statistical methodologies including hypothesis testing, ANOVA, time series, principal component analysis, factor analysis, cluster analysis, and discriminant analysis.
  • Expertise in transforming business resources and requirements into manageable data formats and analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across a massive volume of structured and unstructured data.

TECHNICAL SKILLS

Big Data Ecosystem: HDFS, MapReduce, HBase, Pig, Hive, Sqoop, Kafka, Flume, Cassandra, Impala, Oozie, Zookeeper, MapR, Amazon Web Services (AWS), EMR

Machine Learning Classification Algorithms: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbor (KNN), Gradient Boosting Classifier, Extreme Gradient Boosting Classifier, Support Vector Machine (SVM), Artificial Neural Networks (ANN), Naïve Bayes Classifier, Extra Trees Classifier, Stochastic Gradient Descent, etc.

Cloud Technologies: AWS, Azure, Google cloud platform (GCP)

IDEs: IntelliJ, Eclipse, Spyder, Jupyter

Ensemble and Stacking: Averaged Ensembles, Weighted Averaging, Base Learning, Meta Learning, Majority Voting, Stacked Ensemble, AutoML - Scikit-Learn, MLjar, etc.

Databases: Oracle 11g/10g/9i, MySQL, DB2, MS-SQL Server, HBASE

Programming / Query Languages: Java, SQL, Python Programming (Pandas, NumPy, SciPy, Scikit-Learn, Seaborn, Matplotlib, NLTK), NoSQL, PySpark, PySpark SQL, SAS, R Programming (Caret, Glmnet, XGBoost, rpart, ggplot2, sqldf), RStudio, PL/SQL, Linux shell scripts, Scala.

Data Engineer / Big Data Tools / Cloud / Visualization / Other Tools: Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, MapReduce, Spring Boot, Flume, YARN, Hortonworks, Cloudera, Mahout, MLlib, Oozie, Zookeeper, etc. AWS, Azure Databricks, Azure Data Explorer, Azure HDInsight, Salesforce, NiFi, GCP, Google Shell, Linux, BigQuery, Bash Shell, Unix, Tableau, Power BI, SAS, Web Intelligence, Crystal Reports, Dashboard Design.

PROFESSIONAL EXPERIENCE

Confidential, St. Minneapolis, MN

Senior Big Data Engineer

Responsibilities:

  • Performed data analysis and developed analytic solutions. Investigated data to discover correlations and trends and explained them.
  • Worked with Data Engineers and Data Architects to define back-end requirements for data products (aggregations, materialized views, tables, visualization).
  • Developed frameworks and processes to analyze unstructured information. Assisted in Azure Power BI architecture design.
  • Experienced with machine learning algorithms such as logistic regression, random forest, XGBoost, KNN, SVM, neural networks, linear regression, lasso regression, and k-means.
  • Implemented statistical and deep learning models (Logistic Regression, XGBoost, Random Forest, SVM, RNN, CNN).
  • Designed and developed Oracle PL/SQL and shell scripts for data import/export, data conversion, and data cleansing.
  • Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics).
  • Performed data analysis and statistical analysis, and generated reports, listings, and graphs using SAS tools: SAS/GRAPH, SAS/SQL, SAS/CONNECT, and SAS/ACCESS.
  • Developed Spark applications using Scala and Spark SQL for data extraction, transformation, and aggregation from multiple file formats. Used Kafka and integrated it with Spark Streaming. Developed data analysis tools using SQL and Python code.
  • Authored Python (PySpark) scripts for custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks. Migrated data from on-premises to AWS storage buckets.
  • Followed Agile methodology, including test-driven development and pair programming.
  • Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing.
  • Involved in installing QlikView 12.0 SR5 and NPrinting 16/17 on both Publisher and Server.
  • Involved in testing QlikView 11.2 dashboards for migration to QlikView 12.1. Extensive experience with the Extraction, Transformation, Loading (ETL) process using Ascential Data Stage EE/8.0/
  • Developed a Python script to call REST APIs and to extract and transfer data from on-premises systems to AWS S3. Implemented a microservices-based cloud architecture using Spring Boot.
  • Ingested data through cleansing and transformation steps, leveraging AWS Lambda, AWS Glue, and Step Functions.
  • Created YAML files for each data source, including Glue table stack creation. Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
  • Developed Lambda functions and assigned IAM roles to run Python scripts along with various triggers (SQS, EventBridge, SNS).
  • Wrote UNIX shell scripts to automate jobs and scheduled them as cron jobs using crontab. Created a Lambda deployment function and configured it to receive events from S3 buckets.
  • Built machine learning models, including SVM, random forest, and XGBoost, to score and identify potential new business cases with Python scikit-learn.
  • Experience in converting existing AWS infrastructure to serverless architecture (AWS Lambda, Kinesis), deploying via Terraform and AWS CloudFormation templates.
  • Worked on Docker container snapshots, attaching to a running container, removing images, managing directory structures, and managing containers.
  • Experienced in day-to-day DBA activities including schema management, user management (creating users, synonyms, privileges, roles, quotas, tables, indexes, sequences), space management (tablespace, rollback segment), monitoring (alert log, memory, disk I/O, CPU, database connectivity), scheduling jobs, and UNIX shell scripting.
  • Analyzed existing Data Model and accommodated changes according to the business requirements.
  • Developed complex Talend ETL jobs to migrate data from flat files to databases. Pulled files from the mainframe into the Talend execution server using multiple FTP components.
  • Developed Talend ESB services and deployed them on ESB servers on different instances.

Environment: Hadoop, MapReduce, HDFS, Hive, NiFi, Spring Boot, Cassandra, Swarm, Data Lake, Sqoop, Oozie, SQL, Kafka, Spark, Scala, Java, AWS, GitHub, Talend Big Data Integration, Solr, Impala.

Confidential, Rochester, MN

Sr. Data Engineer

Responsibilities:

  • Transformed business problems into Big Data solutions and defined Big Data strategy and roadmap. Installed, configured, and maintained data pipelines.
  • Developed the features, scenarios, and step definitions for BDD (Behavior Driven Development) and TDD (Test Driven Development) using Cucumber, Gherkin, and Ruby.
  • Designed the business requirement collection approach based on the project scope and SDLC methodology.
  • Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and a write-back tool, in both directions.
  • Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis. Worked with Data Governance and Data Quality to design various models and processes.
  • Deployed Spring Boot microservices to Pivotal Cloud Foundry (PCF) using buildpacks, with Jenkins for continuous integration; handled deployments and binding of services in PCF; and installed PCF on Azure to manage the containers created by PCF.
  • Analyzed clickstream data from Google Analytics with BigQuery. Designed APIs to load data from Omniture, Google Analytics, and Google BigQuery.
  • Maintained JIRA team and program management review dashboards and maintained COP account and JIRA team sprint metrics reportable to customer and SAIC division management
  • Maintained JIRA team Confluence System Engineering pages that included: Process Flow Management, Team Requirements, Roles and Responsibilities, and COP User Metrics.
  • Involved in all steps and scope of the project's reference data approach to MDM; created a data dictionary and a source-to-target mapping in the MDM data model.
  • Developed complex Tableau dashboard reports by gathering requirements through direct interactions with Business/Operations teams.
  • Created and managed Tableau sites, projects, workbooks, groups, data views, data sources, and data connections.
  • Established Tableau security, back-up, and restore processes.
  • Maintained and scheduled Tableau Data Extracts using Tableau Server and the Tableau Command Utility.
  • Experience managing Azure Data Lakes (ADLS) and Data Lake Analytics, with an understanding of how to integrate with other Azure services. Knowledge of U-SQL.
  • Responsible for working with various teams on a project to develop analytics-based solution to target customer subscribers specifically.
  • Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing. Created Lambda jobs and configured roles using the AWS CLI.
  • Built a new CI pipeline. Automated testing and deployment with Docker Swarm, Jenkins, and Puppet. Utilized continuous integration and automated deployments with Jenkins and Docker.
  • Data visualization with Pentaho, Tableau, and D3. Knowledge of numerical optimization, anomaly detection and estimation, A/B testing, statistics, and Maple. Performed big data analysis using Hadoop, MapReduce, NoSQL, Pig/Hive, Spark/Shark, MLlib, and Scala, along with NumPy, SciPy, pandas, and scikit-learn.
  • Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, and Python, and used the resulting engine to increase user lifetime by 45% and triple user conversations for target categories.
  • Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, and designed and developed POCs using Scala, Spark SQL, and MLlib libraries.
  • Data integration: ingested, transformed, and integrated structured data and delivered it to a scalable data warehouse platform, using traditional ETL (Extract, Transform, Load) tools and methodologies to collect data from various sources into a single data warehouse.
  • Applied various machine learning algorithms and statistical modeling techniques such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering to identify volume, using the scikit-learn package in Python, R, and Matlab. Collaborated with Data Engineers and Software Developers to develop experiments and deploy solutions to production.
  • Created User manual on using Atlassian Products (Jira/Confluence) and trained end users project wise.
  • Implemented the Atlassian Stash application as the SCM tool of choice for central repository management
  • Designed and developed architecture for data services ecosystem spanning Relational, NoSQL, and Big Data technologies.
  • Used SQL Server Integrations Services (SSIS) for extraction, transformation, and loading data into target system from multiple sources
  • Involved in unit testing the code and provided feedback to the developers. Performed unit testing of the application using NUnit.
  • Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake Schemas.
  • Created and maintained SQL Server scheduled jobs, executing stored procedures for the purpose of extracting data from Oracle into SQL Server. Extensively used Tableau for customer marketing data visualization
  • Optimized the algorithm with stochastic gradient descent. Fine-tuned the algorithm parameters with manual tuning and automated tuning such as Bayesian optimization.
  • Strong understanding of enterprise data warehouse architecture and big data. Responsible for data model design using ERwin/Power Designer/Cast.
  • Built strategic relationships with vendors and reduced customization and implementation cost by 50%.
  • Communicated with CxOs to align business strategy and increased customer base.
  • Architected data processes and reduced latency to close to real time. Processing time was reduced from 50 minutes to 10 seconds.
  • Developed data mapping, data governance, transformation, and cleansing rules for the Master Data Management architecture involving OLTP, ODS, and OLAP.
  • Migrated databases from SQL databases (Oracle and SQL Server) to NoSQL databases (Cassandra/MongoDB).
  • Studied the existing OLTP systems (3NF models) and created facts and dimensions in the data mart. Worked with different cloud-based data warehouses like SQL and Redshift.

Environment: Hadoop, Kafka, Spark, Sqoop, Docker Swarm, BigQuery, Spark SQL, TDD, Spark Streaming, Hive, Scala, Pig, NoSQL, Impala, Oozie, HBase, Data Lake, Zookeeper.

Confidential, Tampa, FL

Data Scientist/ Python Developer

Responsibilities:

  • Gathered business requirements, defined and designed the data sourcing, and worked with the data warehouse architect on the development of logical data models.
  • Created sophisticated visualizations, calculated columns, and custom expressions, and developed Map Chart, Cross Table, Bar Chart, Tree Map, and complex reports involving Property Controls and Custom Expressions.
  • Investigated market sizing, competitive analysis and positioning for product feasibility. Worked on Business forecasting, segmentation analysis and Data mining.
  • Automated Diagnosis of Blood Loss during Emergencies and developed Machine Learning algorithm to diagnose blood loss.
  • Extensively used Agile methodology as the organization standard to implement the data models. Used a microservice architecture with Spring Boot-based services interacting through a combination of REST and Apache Kafka message brokers.
  • Created several types of data visualizations using Python and Tableau. Extracted Mega Data from AWS using SQL Queries to create reports.
  • Performed reverse engineering using Erwin to redefine entities, attributes, and relationships in the existing database.
  • Analyzed functional and non-functional business requirements, translated them into technical data requirements, and created or updated logical and physical data models. Developed a data pipeline using Kafka to store data in HDFS.
  • Performed regression testing for Golden Test Cases from State (end-to-end test cases) and automated the process using Python scripts.
  • Developed Spark jobs using Scala for faster real-time analytics and used Spark SQL for querying.
  • Generated graphs and reports using the ggplot package in RStudio for analytical models. Developed and implemented an R Shiny application that showcases machine learning for business forecasting.
  • Developed predictive models using Decision Tree, Random Forest, and Naïve Bayes.
  • Used pandas, NumPy, seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python for developing various machine learning algorithms. Expertise in R, Matlab, Python, and their respective libraries.
  • Researched reinforcement learning and control (TensorFlow, Torch) and machine learning models (scikit-learn).
  • Hands-on experience in implementing Naive Bayes and skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, and Principal Component Analysis.
  • Performed K-means clustering, regression, and decision trees in R. Worked on data cleaning and reshaping, and generated segmented subsets using NumPy and pandas in Python.
  • Implemented various statistical techniques to manipulate the data, such as missing data imputation, principal component analysis, and sampling.
  • Worked on R packages to interface with Caffe Deep Learning Framework. Perform validation on machine learning output from R.
  • Applied different dimensionality reduction techniques like principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) on the feature matrix.
  • Performed univariate and multivariate analysis on the data to identify any underlying pattern in the data and associations between the variables.
  • Responsible for design and development of Python programs/scripts to prepare, transform, and harmonize data sets in preparation for modeling.
  • Worked with Market Mix Modeling to strategize the advertisement investments to better balance the ROI on advertisements.
  • Implemented clustering techniques like DBSCAN, K-means, K-means++ and Hierarchical clustering for customer profiling to design insurance plans according to their behavior pattern.
  • Used Grid Search to evaluate the best hyper-parameters for my model and K-fold cross validation technique to train my model for best results.
  • Worked with Customer Churn Models including Random forest regression, lasso regression along with pre-processing of the data.
  • Used Python 3.X (NumPy, SciPy, pandas, scikit-learn, seaborn) and Spark 2.0 (PySpark, MLlib) to develop a variety of models and algorithms for analytic purposes.

Environment: Spark, YARN, Hive, Pig, Scala, Mahout, NiFi, TDD, Python, Spring Boot, Hadoop, Azure, DynamoDB, Kibana, NoSQL, Sqoop, MySQL.

Confidential, CA

Big Data Developer

Responsibilities:

  • Experience in Big Data Analytics and design in Hadoop ecosystem using MapReduce Programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, Kafka
  • Built an Oozie pipeline that performs several actions: moving files, using Sqoop to bring data from the source Teradata or SQL databases into Hive staging tables, performing aggregations per business requirements, and loading the results into the main tables.
  • Ran Apache Hadoop, CDH, and MapR distributions, dubbed Elastic MapReduce (EMR), on EC2.
  • Used forking actions wherever parallel processing was possible to reduce data latency.
  • Worked on different data formats such as JSON and XML and implemented machine learning algorithms in Python.
  • Wrote a Pig script that reads data from one HDFS path, performs aggregations, and loads the results into another path, from which they are later populated into another domain table. Converted this script into a JAR and passed it as a parameter in the Oozie script.
  • Hands-on experience with Git Bash commands: git pull to pull code from the source and develop it per requirements, git add to add files, git commit after the code build, and git push to the pre-prod environment for code review. Later used screwdriver.yaml, which builds the code and generates artifacts that are released into production.
  • Created a logical data model from the conceptual model and converted it into the physical database design using Erwin. Involved in transferring data from legacy tables to HDFS and HBase tables using Sqoop.
  • Connected to AWS Redshift through Tableau to extract live data for real time analysis.
  • Developed Data mapping, Transformation and Cleansing rules for the Data Management involving OLTP and OLAP.
  • Created UNIX shell scripts. Performed defragmentation of tables, partitioning, compression, and indexing for improved performance and efficiency.
  • Developed a Python script to run SQL queries in parallel for the initial data load into the target table. Involved in loading data from the edge node to HDFS using shell scripting and assisted in designing the overall ETL strategy.
  • Developed reusable objects like PL/SQL program units and libraries, database procedures and functions, database triggers to be used by the team and satisfying the business rules.
  • Used SQL Server Integrations Services (SSIS) for extraction, transformation, and loading data into target system from multiple sources
  • Developed and implemented an R Shiny application that showcases machine learning for business forecasting. Developed predictive models using Python and R to predict customer churn and classify customers.
  • Worked with applications like R, SPSS and Python to develop neural network algorithms, cluster analysis, ggplot2 and shiny in R to understand data and developing applications.
  • Partnered with technical and non-technical resources across the business to leverage their support and integrate our efforts.
  • Partnered with infrastructure and platform teams to configure and tune tools, automate tasks, and guide the evolution of the internal big data ecosystem; served as a bridge between data scientists and infrastructure/platform teams.

Environment: Hadoop, Spark, MapReduce, Sqoop, HBase, Oozie, Impala, Kafka, YARN, Hive, Pig, Scala, Mahout, NiFi, TDD, Python, NoSQL, MySQL, Spring Boot, Azure, DynamoDB, Kibana.

Confidential

Data Analyst

Responsibilities:

  • Documented the complete process flow to describe program development, logic, testing, implementation, application integration, and coding.
  • Recommended structural changes and enhancements to systems and databases.
  • Conducted Design reviews and Technical reviews with other project stakeholders.
  • Was a part of the complete life cycle of the project from the requirements to the production support.
  • Created test plan documents for all back-end database modules.
  • Used MS Excel, MS Access, and SQL to write and run various queries.
  • Worked extensively on creating tables, views, and SQL queries in MS SQL Server.
  • Worked with internal architects and assisted in the development of current and target state data architectures.
  • Coordinated with business users to provide appropriate, effective, and efficient designs for new reporting needs based on the existing functionality.
  • Wrote Python scripts to parse JSON documents and load the data into a database.
  • Generated various graphical capacity planning reports using Python packages like NumPy and Matplotlib.
  • Analyzed generated logs and forecast the next occurrence of events using various Python libraries.
  • Performed Exploratory Data Analysis, trying to find trends and clusters.
  • Built models using techniques like Regression, Tree based ensemble methods, Time Series forecasting, KNN, Clustering and Isolation Forest methods.
  • Worked on data that was a combination of unstructured and structured data from multiple sources and automated the cleaning using Python scripts.
