Sr Big Data/Data Engineer Resume

Malvern, PA

SUMMARY

  • Overall 7+ years of technical IT experience in Design, Development and Maintenance of enterprise analytical solutions using big data technologies. Proven data and business expertise in Retail, Finance and Healthcare domains.
  • Proficient in evaluating and choosing the right technologies for building data pipelines, from ingestion through curation to consumption, for both batch and streaming use cases in cloud and on-prem environments.
  • Result-oriented professional with experience in creating data mapping documents, writing functional specifications and queries, and normalizing data from 1NF to 3NF/4NF, along with requirements gathering, system and data analysis, data architecture, database design and modeling, and the development, implementation and maintenance of OLTP and OLAP databases.
  • Vast experience in designing, creating, testing, and maintaining the complete data management cycle from data ingestion and data curation to data provision, with in-depth knowledge of Spark APIs (Spark SQL, DSL, Streaming), of file formats such as Parquet and JSON, and of performance tuning of Spark applications.
  • Extensive programming expertise in designing and developing web-based applications using Spring Boot, Spring MVC, Java servlets, JSP, JTS, JTA, JDBC and JNDI.
  • Experience in MVC and Microservices Architecture with Spring Boot, Docker and Swarm.
  • Expertise in Java programming with a good understanding of OOP concepts, I/O, Collections, Exception Handling, Lambda Expressions and Annotations.
  • Provided full life cycle support to logical/physical database design, schema management and deployment. Adept at database deployment phase with strict configuration management and controlled coordination with different teams.
  • Experience in Spring Frameworks like Spring Boot, Spring LDAP, Spring JDBC, Spring Data JPA, Spring Data REST.
  • Experience in Amazon Web Services (AWS) cloud services such as S3, EC2 and EMR, and in Microsoft Azure.
  • Expertise in Azure infrastructure management (Azure Web Roles, Worker Roles, SQL Azure, Azure Storage, Azure AD Licenses).
  • Experience in writing code in R and Python to manipulate data for data loads, extracts, statistical analysis, modeling, and data munging.
  • Familiar with the latest software development practices such as Agile Software Development, Scrum, Test Driven Development (TDD) and Continuous Integration (CI).
  • Utilized Kubernetes and Docker for the runtime environment for the CI/CD system to build, test, and deploy.
  • Experience in creating and running Docker images with multiple microservices.
  • Utilized analytical applications like R, SPSS, Rattle and Python to identify trends and relationships between different pieces of data, draw appropriate conclusions and translate analytical findings into risk management and marketing strategies that drive value.
  • Extensive hands-on experience in using distributed computing architectures such as AWS products (e.g. EC2, Redshift, EMR, and Elasticsearch), Hadoop, Python and Spark, and effective use of Azure SQL Database, MapReduce, Hive, SQL and PySpark to solve big data problems.
  • Strong experience in Microsoft Azure Machine Learning Studio for data import, export, data preparation.
  • Proficient in Statistical Methodologies including Hypothesis Testing, ANOVA, Time Series, Principal.
  • Good understanding of Data Modeling (Dimensional and Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension Tables.
  • Strong Experience in working with Linux/Unix environments, writing Shell Scripts.
  • Good at conceptualizing and building solutions quickly; recently developed a Data Lake using a pub-sub architecture.
  • Developed a pipeline using Scala and Kafka to load data from a server to Hive with automatic ingestions and quality audits of the data to the RAW layer of the Data Lake.
  • Installed both Cloudera (CDH4) and Hortonworks (HDP1.3-2.1) Hadoop clusters on EC2, Ubuntu 12.04, CentOS 6.5 on platforms ranging from 10-100 nodes.
  • Architected complete, scalable data pipelines and data warehouses for optimized data ingestion.
  • Collaborated with data scientists and architects on several projects to create data marts as required.
  • Conducted complex data analysis and reported the results.
  • Constructed data staging layers and fast real-time systems to feed BI applications and machine learning algorithms.

TECHNICAL SKILLS

Big Data Ecosystem: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper, Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Impala

Programming languages: Python, Java, R

Hadoop Distribution: Cloudera CDH, Hortonworks HDP, Apache, AWS

Machine Learning Classification Algorithms: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbor (KNN), Principal Component Analysis

Languages: Shell scripting, SQL, PL/SQL, Python, R, PySpark, Pig, Hive QL, Scala, Regular Expressions

Web Technologies: HTML, JavaScript, RESTful, SOAP

Operating Systems: Windows (98/2000/XP/7/8/10), Mac OS, UNIX, Linux (Ubuntu, CentOS)

Version Control: Git, GitHub

IDE & Tools, Design: Eclipse, Visual Studio, NetBeans, JUnit, CI/CD, SQL Developer, MySQL, Workbench, Tableau

Databases: Oracle, SQL Server, MySQL, DynamoDB, Cassandra, Teradata, PostgreSQL, MS Access, Snowflake, NoSQL databases (HBase, MongoDB)

Cloud Technologies: MS Azure, Amazon Web Services (AWS), Google Cloud

Data Engineer/Big Data Tools / Cloud / Visualization / Other Tools: Databricks, Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, MapReduce, Spring Boot, Flume, YARN, Hortonworks, Cloudera, MLlib, Oozie, Zookeeper, AWS, Azure Databricks, Azure Data Explorer, Linux, Bash Shell, Unix, Tableau, Power BI, SAS, Crystal Reports, Dashboard Design.

Utilities/Tools: Eclipse, Tomcat, NetBeans, JUnit, SQL, SVN, Log4j, SOAP UI, ANT, Maven, Alteryx, Visio.

PROFESSIONAL EXPERIENCE

Sr Big Data/Data Engineer

Confidential, Malvern, PA

Responsibilities:

  • Developed ETL data pipelines using Spark, Spark streaming and Scala.
  • Loaded data from RDBMS to Hadoop using Sqoop.
  • Worked collaboratively to manage build outs of large data clusters and real time streaming with Spark.
  • Responsible for loading Data pipelines from web servers using Sqoop, Kafka and Spark Streaming API.
  • Used Spark for interactive queries, processing of streaming data and integration with popular NoSQL database for huge volume of data.
  • Developed the batch scripts to fetch the data from AWS S3 storage and do required transformations in Scala using Spark framework.
  • Implemented Spark using Scala and SparkSQL for faster testing and processing of data.
  • Processed data using MapReduce and YARN. Worked on Kafka as a proof of concept for log processing.
  • Designed and developed Apache NiFi jobs to get files from transaction systems into the data lake raw zone.
  • Tested Apache Tez, an extensible framework for building high performance batch and interactive data processing applications, on Pig and Hive jobs.
  • Monitored the Hive Metastore and the cluster nodes with the help of Hue.
  • Created AWS EC2 instances and used JIT servers.
  • Developed various UDFs in Map-Reduce and Python for Pig and Hive.
  • Handled data integrity checks using Hive queries, Hadoop, and Spark.
  • Performed transformations and actions on RDDs and Spark Streaming data with Scala.
  • Implemented machine learning algorithms using Spark with Python.
  • Defined job flows and developed simple to complex Map Reduce jobs as per the requirement.
  • Optimized MapReduce jobs to use HDFS efficiently by using various compression mechanisms.
  • Developed PIG UDFs for manipulating the data according to Business Requirements and worked on developing custom PIG Loaders.
  • Responsible for handling streaming data from web server console logs.
  • Installed Oozie workflow engine to run multiple Hive and Pig Jobs.
  • Developed PIG Latin scripts for the analysis of semi structured data.
  • Created Hive tables, loaded data into them, and wrote Hive UDFs.
  • Used Sqoop to import data into HDFS and Hive from other data systems.
  • Installed and configured Apache Hadoop to test the maintenance of log files in Hadoop cluster.
  • Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
  • Worked on developing ETL processes (Data Stage Open Studio) to load data from multiple data sources to HDFS using FLUME and SQOOP, and performed structural modifications using Map Reduce, HIVE.
  • Involved in NoSQL database design, integration, and implementation.
  • Loaded data into NoSQL database HBase.
  • Developed Kafka producer and consumers, HBase clients, Spark, and Hadoop MapReduce jobs along with components on HDFS, Hive.
  • Applied partitioning and bucketing concepts in Hive and designed both managed and external tables to optimize performance.

Environment: Spark, Spark Streaming, Apache Kafka, Apache NiFi, Hive, Tez, AWS, ETL, Pig, UNIX, Linux, Tableau, Teradata, Sqoop, Hue, Oozie, Java, Scala, Python, Git.
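
To illustrate the Hive partitioning and bucketing work listed in the responsibilities above, here is a minimal sketch in plain Python rather than Hive itself. It shows how rows would be grouped into one directory per partition value and a fixed number of bucket files per partition. The table columns, the bucket count and the helper names are hypothetical, and the bucket function treats an integer key as its own hash, which is a simplification of what Hive does internally.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for the example


def bucket_for(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Pick a bucket for an integer key (simplified: the key is its own hash)."""
    return (key & 0x7FFFFFFF) % num_buckets


def layout(rows, partition_col, bucket_col):
    """Group rows the way a partitioned, bucketed table groups its files."""
    files = defaultdict(list)
    for row in rows:
        path = (f"{partition_col}={row[partition_col]}", bucket_for(row[bucket_col]))
        files[path].append(row)
    return dict(files)


# Hypothetical sample rows: partition by order_date, bucket by customer_id.
rows = [
    {"order_id": 1, "customer_id": 10, "order_date": "2020-01-01"},
    {"order_id": 2, "customer_id": 11, "order_date": "2020-01-01"},
    {"order_id": 3, "customer_id": 14, "order_date": "2020-01-02"},
]
table_files = layout(rows, "order_date", "customer_id")
for (partition, bucket), members in sorted(table_files.items()):
    print(partition, f"bucket_{bucket}", [r["order_id"] for r in members])
```

A query that filters on the partition column only has to read the matching directory, and a join on the bucket column can match bucket files pairwise, which is where the performance benefit comes from.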

Sr Data Engineer

Confidential, Bothell, WA

Responsibilities:

  • Installed, configured and maintained data pipelines.
  • Transformed business problems into Big Data solutions and defined the Big Data strategy and roadmap.
  • Designed the business requirement collection approach based on the project scope and SDLC methodology.
  • Extracted files from Hadoop and dropped them into S3 on a daily/hourly basis.
  • Authored Python (PySpark) scripts for custom UDFs for row/column manipulations, merges, aggregations, stacking, data labelling and all cleaning and conforming tasks.
  • Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
  • Developed solutions to leverage ETL tools and identified opportunities for process improvements using Informatica and Python.
  • Conducted performance analysis and optimized data processes. Made recommendations for continuous improvement of the data processing environment.
  • Developed a data platform from scratch and took part in the requirement gathering and analysis phase of the project, documenting the business requirements.
  • Designed and implemented multiple ETL solutions with various data sources using extensive SQL scripting, ETL tools, Python, shell scripting and scheduling tools. Performed data profiling and data wrangling of XML, web feeds and files using Python, Unix and SQL.
  • Loaded data from different sources into a data warehouse to perform data aggregations for business intelligence using Python.
  • Used Sqoop to channel data from different sources of HDFS and RDBMS.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation and aggregation from multiple file formats.
  • Used SSIS to build automated multi-dimensional cubes.
  • Used Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS and in NoSQL databases such as HBase and Cassandra using Python.
  • Collected data from an AWS S3 bucket in near-real-time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Used Apache NiFi to copy data from local file system to HDP.
  • Used SQL Server Management Tool to check the data in the database as compared to the requirement given.
  • Validated the test data in DB2 tables on Mainframes and on Teradata using SQL queries.
  • Identified and documented Functional/Non-Functional and other related business decisions for implementing Actimize-SAM to comply with AML Regulations.
  • Described the end-to-end development of Actimize models for the project bank's trading compliance solutions.
  • Automated and scheduled recurring reporting processes using UNIX shell scripting and Teradata utilities such as MLOAD, BTEQ and Fast Load.
  • Implemented Actimize Anti-Money Laundering (AML) system to monitor suspicious transactions and enhance regulatory compliance.
  • Worked on Dimensional and Relational Data Modelling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modelling using Erwin.
  • Used Oozie to automate data loading into the Hadoop Distributed File System.
  • Developed automated regression scripts in Python for validation of the ETL process between multiple databases such as AWS Redshift, Oracle, MongoDB, T-SQL, and SQL Server.

Environment: Cloudera Manager (CDH5), Hadoop, PySpark, HDFS, NiFi, Pig, Hive, S3, Kafka, Scrum, Git, Sqoop, Oozie, Informatica, Tableau, OLTP, OLAP, HBase, Cassandra, SQL Server, Python, Shell Scripting, XML, Unix.
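
The Python-based ETL validation between databases mentioned in the responsibilities above can be sketched roughly as follows. This is an illustration under stated assumptions, not the scripts used on the project: two in-memory SQLite databases stand in for the real source and target systems, and the `payments` table, its `amount` column and the helper names are hypothetical. The check compares a simple fingerprint (row count and column sum) on both sides.

```python
import sqlite3


def table_fingerprint(conn, table):
    """Return (row_count, sum_of_amounts); the table name is trusted input here."""
    count, total = conn.execute(
        f"SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM {table}"
    ).fetchone()
    return count, total


def validate_load(source, target, table):
    """Compare source and target fingerprints and report whether they match."""
    src = table_fingerprint(source, table)
    tgt = table_fingerprint(target, table)
    return {"table": table, "source": src, "target": tgt, "match": src == tgt}


# In-memory SQLite databases stand in for the real source and target systems.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for conn in (source, target):
    conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount INTEGER)")
source.executemany("INSERT INTO payments VALUES (?, ?)", [(1, 100), (2, 250), (3, 75)])
# The target is deliberately missing one row so the check has something to catch.
target.executemany("INSERT INTO payments VALUES (?, ?)", [(1, 100), (2, 250)])

report = validate_load(source, target, "payments")
print(report)
```

Counts and sums are cheap to compute on large tables; a stricter regression check could add per-column checksums or a keyed row-by-row comparison.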

Sr Data Engineer

Confidential

Responsibilities:

  • Designed robust, reusable, and scalable data-driven solutions and data pipeline frameworks in Python to automate the ingestion, processing and delivery of structured and unstructured data, both batch and real-time streaming.
  • Built data warehouse structures, creating fact, dimension and aggregate tables through dimensional modeling with Star and Snowflake schemas.
  • Applied transformations to the data loaded into Spark DataFrames and performed in-memory computation to generate the output response.
  • Good knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
  • Used Spark Data Frames API over platforms to perform analytics on Hive data and used Spark Data Frame operations to perform required validations in the data.
  • Built end-to-end ETL models to sort vast amounts of customer feedback and derive actionable insights and tangible business solutions.
  • Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
  • Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, used the Spark engine and Spark SQL for data analysis, and provided the results to the data scientists for further analysis.
  • Prepared scripts to automate the ingestion process using PySpark and Scala as needed from various sources such as APIs, AWS S3, Teradata and Snowflake.
  • Created a business category mapping system that automatically maps customers' business category information to any source website's category system. Category platforms include Google, Facebook, Yelp, Bing etc.
  • Developed a data quality control model to monitor changes in business information over time. The model flags outdated customer information using different APIs for validation and updates it with correct data.
  • Responsible for monitoring the sentiment prediction model for customer reviews and ensuring a high-performance ETL process.
  • Data cleaning, pre-processing and modelling using Spark and Python.
  • Implemented real-time, data-driven, secured REST APIs for data consumption using AWS (Lambda, API Gateway, Route 53, Certificate Manager, CloudWatch, Kinesis), Swagger, Okta and Snowflake.
  • Developed automation scripts to transfer data from on-premise clusters to Google Cloud Platform (GCP).
  • Loaded file data from the ADLS server to Google Cloud Platform buckets and created Hive tables for the end users.
  • Involved in performance tuning and optimization of long running spark jobs and queries (Hive/SQL).
  • Implemented Real-time streaming of AWS CloudWatch Logs to Splunk using Kinesis Firehose.
  • Developed a dashboard, using object-oriented methodology, to monitor all network access points and network performance metrics with Django, Python, MongoDB and JSON.
  • Developed an application for monitoring, root cause analysis and management of WLAN data by parsing logs with Python, Django and MongoDB and producing JSON output.

Environment: Hive, Spark SQL, PySpark, EMR, Tableau, Sqoop, AWS, Presto, Python, Snowflake, Teradata, Azure AAS & SSAS, Kafka.
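
The Spark Streaming item above, dividing streaming data into batches that are then processed by batch code, can be illustrated with a small plain-Python sketch. It does not use Spark: the events, the 5-second interval and the function names are hypothetical, and the sketch groups events by their own timestamps rather than by arrival time, which is a simplification.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, payload) events into fixed-length batch intervals."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        window_start = (timestamp // batch_seconds) * batch_seconds
        batches[window_start].append(payload)
    # Return batches in time order, as they would be handed to batch code.
    return [(start, batches[start]) for start in sorted(batches)]


def count_by_key(batch):
    """A stand-in batch job: count occurrences of each payload in one batch."""
    counts = defaultdict(int)
    for item in batch:
        counts[item] += 1
    return dict(counts)


# Hypothetical events as (seconds since start, event name).
events = [(0, "login"), (3, "click"), (4, "click"), (11, "login"), (12, "logout")]
for start, batch in micro_batches(events, batch_seconds=5):
    print(start, count_by_key(batch))
```

The point of the design is that the per-batch function is ordinary batch logic, so the same transformation code can serve both the streaming and the batch path.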

Data Engineer

Confidential

Responsibilities:

  • Extensive knowledge of and hands-on experience in architecting and designing data warehouses and databases, data modelling, and building SQL objects such as tables, views, user-defined and table-valued functions, stored procedures, triggers and indexes.
  • Created HBase tables from Hive and wrote HiveQL statements to access HBase table data.
  • Developed complex Hive scripts for processing the data and created dynamic partitions and bucketing in Hive to improve query performance.
  • Developed MapReduce applications using the Hadoop MapReduce programming framework for processing and used compression techniques to optimize MapReduce jobs.
  • Developed Pig UDFs to know the customer behavior and Pig Latin scripts for processing the data in Hadoop.
  • Used Struts tag libraries and custom tag libraries extensively while coding JSP pages.
  • Scheduled automated tasks with Oozie for loading data into HDFS through Sqoop and pre-processing the data with Pig and Hive.
  • Developed Oozie actions such as Hive, shell, and Java to submit and schedule applications to run in the Hadoop cluster.
  • Experienced in building data warehouses on the Azure platform using Azure Databricks and Azure Data Factory.
  • Worked with production support team to provide necessary support for issues with CDH cluster and the data ingestion.
  • Worked in Azure environment for development and deployment of Custom Hadoop Applications.
  • Designed and implemented scalable Cloud Data and Analytical solution for various public and private cloud platforms using Azure.
  • Involved in migrating Spark Jobs from Qubole to Databricks.
  • Designed and implemented distributed data processing pipelines using Apache Spark, Hive, Python, Airflow DAGs and other tools and languages in Hadoop Ecosystem.

Environment: Python, MySQL, PostgreSQL, Hadoop (Hive), AWS (S3, EMR), Tableau, Docker, Kafka.

Data Engineer

Confidential 

Responsibilities:

  • Used Spark Data Frames API over platforms to perform analytics on Hive data and used Spark Data Frame operations to perform required validations in the data.
  • Built end-to-end ETL models to sort vast amounts of customer feedback and derive actionable insights and tangible business solutions.
  • Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
  • Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, used the Spark engine and Spark SQL for data analysis, and provided the results to the data scientists for further analysis.
  • Prepared scripts to automate the ingestion process using PySpark and Scala as needed from various sources such as APIs, AWS S3, Teradata and Snowflake.
  • Created a business category mapping system that automatically maps customers' business category information to any source website's category system. Category platforms include Google, Facebook, Yelp, Bing etc.
  • Developed a data quality control model to monitor changes in business information over time. The model flags outdated customer information using different APIs for validation and updates it with correct data.
  • Responsible for monitoring the sentiment prediction model for customer reviews and ensuring a high-performance ETL process.
  • Data cleaning, pre-processing and modelling using Spark and Python.
  • Implemented real-time, data-driven, secured REST APIs for data consumption using AWS (Lambda, API Gateway, Route 53, Certificate Manager, CloudWatch, Kinesis), Swagger, Okta and Snowflake.
  • Developed automation scripts to transfer data from on-premise clusters to Google Cloud Platform (GCP).
  • Loaded file data from the ADLS server to Google Cloud Platform buckets and created Hive tables for the end users.
  • Involved in performance tuning and optimization of long running spark jobs and queries (Hive/SQL).
  • Implemented Real-time streaming of AWS CloudWatch Logs to Splunk using Kinesis Firehose.
  • Developed a dashboard, using object-oriented methodology, to monitor all network access points and network performance metrics with Django, Python, MongoDB and JSON.
  • Developed an application for monitoring, root cause analysis and management of WLAN data by parsing logs with Python, Django and MongoDB and producing JSON output.

Environment: Python, Scala, SQL, Maven, AWS (Redshift, S3, EC2), MongoDB, MySQL.
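
As a rough illustration of the data quality control model described above, which flags outdated customer information and updates it with validated data, the following plain-Python sketch may help. The 180-day threshold, the record fields and the function names are all hypothetical, and the dictionary of fresh values stands in for whatever an external validation API would return.

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=180)  # hypothetical freshness threshold


def flag_outdated(records, today, max_age=MAX_AGE):
    """Return the ids of records whose last verification is older than max_age."""
    return [r["id"] for r in records if today - r["last_verified"] > max_age]


def apply_updates(records, fresh_values, today):
    """Overwrite records with validated values and reset their verification date."""
    for r in records:
        if r["id"] in fresh_values:
            r.update(fresh_values[r["id"]])
            r["last_verified"] = today
    return records


today = date(2021, 6, 1)
records = [
    {"id": "a", "phone": "555-0100", "last_verified": date(2021, 5, 1)},
    {"id": "b", "phone": "555-0101", "last_verified": date(2020, 1, 15)},
]
stale = flag_outdated(records, today)
# In practice the fresh values would come from an external validation source.
apply_updates(records, {"b": {"phone": "555-0199"}}, today)
print(stale, records[1]["phone"])
```

Keeping the flagging step separate from the update step means the list of stale records can be reviewed or logged before anything is overwritten.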
