
Big Data Engineer Resume


Rhode Island

SUMMARY

  • 6+ years of professional IT experience across the complete Software Development Life Cycle (SDLC), including business requirements gathering, system analysis and design, development, testing, and implementation of projects. Highly skilled in designing complex data processing and machine learning modules for effective data mining and modeling. Adept at Hadoop cluster management and capacity planning for end-to-end data management and performance optimization.
  • Expertise in using Python and SQL for Data Engineering and Data Modeling
  • Experience in building reliable ETL processes and data pipelines for batch and real-time streaming using SQL, Python, Spark, Spark Streaming, Databricks, Sqoop, Hive, AWS, Azure, NiFi, Luigi, Oozie, and Kafka.
  • Experience working with NoSQL database technologies, like Cassandra, MongoDB and HBase.
  • Experience with Snowflake, Hadoop distributions such as Cloudera and Hortonworks, AWS (EC2, EMR, RDS, Redshift, DynamoDB, Snowball), and Databricks (Data Factory, notebooks, etc.).
  • Built Spark jobs using PySpark to perform ETL on data in an S3 data lake (see the sketch after this list).
  • Experience in creating reports and dashboards in visualization tools like Tableau, Spotfire and PowerBI.
  • Experience with Software development tools and platforms like Jenkins, CI/CD, GIT, JIRA.
  • Good knowledge in OLAP, OLTP and Data warehousing concepts with emphasis on ETL, Reporting and Analytics.
  • Responsible for creating, debugging, scheduling, and monitoring jobs using Apache NiFi, Oozie, Luigi, and Airflow.
  • Expertise in Data Migration, Data Profiling, Data Cleansing, Transformation, Integration, Data Import
  • Migrated ETL processes from Relational databases to Hive for easy data manipulation on distributed systems.
  • Involved in database development by creating SQL Server T-SQL functions, procedures, and triggers.
  • Extensive experience in both distributed (ELT: Big Data, Hadoop, Spark (PySpark), Hive, HBase, Sqoop, Flume, Python, Kafka) and traditional (ETL: Informatica PowerCenter, PowerExchange, SQL, Teradata, Oracle PL/SQL, PostgreSQL, Linux and Unix shell scripting) data processing systems.
  • Experience in troubleshooting errors in HBase Shell/API, Pig, Hive, Sqoop, Flume, Spark, and MapReduce
  • Hands-on experience in the design, development, and implementation of performant ELT pipelines using Apache Spark.
  • Practical understanding of data modeling (dimensional and relational) concepts such as star-schema modeling, snowflake-schema modeling, and fact and dimension tables.
  • Strong knowledge of Software Development Life Cycle (SDLC) and expertise in detailed design documentation in Agile, DevOps methodologies.
  • Used Informatica PowerCenter for ETL (extraction, transformation, and loading) of data from heterogeneous source systems into target databases.
  • Worked at the conceptual, logical, and physical data model levels using Erwin according to requirements.
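
As an illustration of the PySpark ETL work over an S3 data lake noted above, here is a minimal sketch; the bucket paths, column names, and application name are hypothetical placeholders rather than actual project values.

```python
# Minimal PySpark ETL sketch for an S3 data lake.
# NOTE: bucket paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("s3-datalake-etl").getOrCreate()

# Extract: read raw JSON events from the landing zone.
raw = spark.read.json("s3a://example-datalake/raw/events/")

# Transform: basic cleansing and derivation of a partition column.
cleaned = (
    raw.dropDuplicates(["event_id"])
       .filter(F.col("event_ts").isNotNull())
       .withColumn("event_date", F.to_date("event_ts"))
)

# Load: write curated data back to S3 as partitioned Parquet.
(cleaned.write
        .mode("overwrite")
        .partitionBy("event_date")
        .parquet("s3a://example-datalake/curated/events/"))

spark.stop()
```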

TECHNICAL SKILLS

Operating Systems: Linux (Ubuntu, CentOS), Windows, Mac OS

Hadoop Ecosystem: Hadoop, MapReduce, Yarn, HDFS, Pig, Oozie, Zookeeper

Big Data Ecosystem: Spark, Spark SQL, Spark Streaming, Spark MLlib, Hive

Cloud Ecosystem: Azure, AWS, GCP, Snowflake cloud data warehouse

Data Ingestion: Sqoop, Flume, NiFi, Kafka

NoSQL Databases: HBase, Cassandra, MongoDB, CouchDB

Programming Languages: Python, C, C++, Scala, Core Java, J2EE

Scripting Languages: UNIX shell, Python, R

Tools: SBT, PuTTY, WinSCP, Maven, Git, JasperReports, Jenkins, Tableau, Mahout, UC4, Pentaho Data Integration, Toad

Methodologies: SDLC, Agile, Scrum, Waterfall Model

PROFESSIONAL EXPERIENCE

Confidential, Rhode Island

Big Data Engineer

Responsibilities:

  • Involved in migrating objects using a custom ingestion framework from a variety of sources such as Oracle, SAP HANA, MongoDB, and Teradata.
  • Planned and designed the data warehouse in a star schema; designed table structures and documented them.
  • Handled importing of data from various sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and moved data between MySQL and HDFS in both directions using Sqoop.
  • Designed and implemented an end-to-end big data platform on a Teradata appliance.
  • Performed ETL from multiple sources such as Kafka, NiFi, Teradata, and DB2 using Spark on Hadoop.
  • Worked on Apache Spark, utilizing the Spark Core, Spark SQL, and Spark Streaming components to support intraday and real-time data processing.
  • Shared sample data with customers by granting access for UAT/BAT.
  • Ingested data into Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and handled cloud migration, processing the data in Azure Databricks.
  • Moved data from Teradata to a Hadoop cluster using TDCH/FastExport and Apache NiFi.
  • Developed Python, PySpark, and Bash scripts to transform and load data across on-premises and cloud platforms.
  • Worked extensively with AWS components such as Elastic MapReduce (EMR).
  • Created data pipelines to ingest, aggregate, and load consumer-response data from AWS S3 buckets into Hive external tables in HDFS to serve as a feed for Tableau dashboards.
  • Created monitors, alarms, and notifications for EC2 hosts using CloudWatch, CloudTrail, and SNS.
  • Experience with data ingestion techniques for batch and stream processing using AWS Batch, AWS Kinesis, and AWS Data Pipeline.
  • Built ETL data pipelines for data movement to S3 and then into Redshift.
  • Wrote AWS Lambda code in Python to process nested JSON files (converting, comparing, sorting, etc.); see the sketch after this list.
  • Installed and configured Apache Airflow for workflow management and created workflows in Python.
  • Wrote UDFs in PySpark on Hadoop to perform transformations and loads.
  • Wrote TDCH scripts and used Apache NiFi to load data from mainframe DB2 into the Hadoop cluster.
  • Worked with ORC, Avro, JSON, and Parquet file formats, and created external tables and queries on top of these files using BigQuery.
  • Performed source analysis, tracing data back to its sources and finding its roots through Teradata, DB2, etc.
  • Identified the jobs that load the source tables and documented them.
  • Implemented a Continuous Integration and Continuous Delivery process using GitLab along with Python and shell scripts to automate routine jobs, including synchronizing installers, configuration modules, packages, and requirements for the applications.
  • Worked with Informatica PowerCenter tools: Designer, Repository Manager, Workflow Manager, and Workflow Monitor.
  • Deployed the big data Hadoop application using Talend on AWS (Amazon Web Services) and on Microsoft Azure.
  • Created Snowpipe for continuous data loading from staged data residing on cloud gateway servers.
  • Developed automated processes for code builds and deployments using Jenkins, Ant, Maven, Sonatype, and shell scripts.
  • Installed and configured applications such as Docker and Kubernetes for orchestration.
  • Developed an automation system using PowerShell scripts and JSON templates to remediate Azure services.
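
A hedged sketch of the AWS Lambda work on nested JSON mentioned in this list follows; the S3 event layout is the standard put notification, while the bucket contents, output prefix, and record shape are hypothetical assumptions.

```python
# Hypothetical AWS Lambda handler that flattens nested JSON records from S3.
# Bucket names, key prefixes, and record fields are illustrative placeholders.
import json
import boto3

s3 = boto3.client("s3")

def flatten(obj, parent_key="", sep="."):
    """Recursively flatten a nested dict into dot-separated keys."""
    items = {}
    for key, value in obj.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep))
        else:
            items[new_key] = value
    return items

def lambda_handler(event, context):
    # Triggered by an S3 put notification; read the newly landed object.
    s3_info = event["Records"][0]["s3"]
    bucket = s3_info["bucket"]["name"]
    key = s3_info["object"]["key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    # Flatten every nested record and sort keys for stable, comparable output.
    rows = [flatten(rec) for rec in json.loads(body)]
    flattened = json.dumps(rows, sort_keys=True)

    # Write the converted output under a processed/ prefix.
    s3.put_object(Bucket=bucket, Key=f"processed/{key}", Body=flattened)
    return {"statusCode": 200, "records": len(rows)}
```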

Environment: Snowflake Web UI, SnowSQL, Hadoop MapR 5.2, Hive, Hue, Toad 12.9, SharePoint, Control-M, Tidal, ServiceNow, Teradata Studio, Oracle 12c, Tableau, Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Spark MLlib

Confidential, Chicago

Data Engineer

Responsibilities:

  • Developed highly complex Python and Scala code that is maintainable, easy to use, and satisfies application requirements for data processing and analytics using built-in libraries.
  • Responsible for creating, debugging, scheduling, and monitoring jobs using Apache NiFi.
  • Performed data mapping between source and target systems, logical data modeling, created class diagrams and ER diagrams, and used SQL queries to filter data.
  • Implemented data loading and aggregation frameworks and jobs able to handle hundreds of GBs of JSON files using Spark and Airflow.
  • Worked on data pre-processing and cleaning to perform feature engineering, and applied imputation techniques for missing values in the dataset using Python.
  • Worked on importing and exporting data between Snowflake, Oracle, and DB2 and HDFS/Hive using Sqoop for analysis, visualization, and report generation.
  • Developed Spark applications using Python for regular expression (regex) project in the Hadoop/Hive environment with Linux/Windows for big data resources.
  • Built best-practice ETLs with PySpark to load and transform raw data into easy-to-use dimensional data for self-service reporting (see the sketch after this list).
  • Created scalable Tableau reports, visualizations, and dashboards for the Marketing and Finance teams.
  • Designed the HBase schema based on data access patterns. Used Phoenix to add a SQL layer on top of HBase tables.
  • Created indexes on Phoenix tables for optimization.
  • Worked on the Kafka REST API to collect and load data onto the Hadoop file system and used Sqoop to load data from relational databases.
  • Integrated Spark with the Drools engine to implement a business rule management system.
  • Analyzed Stored Procedures to convert business logic into Hadoop jobs.
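
To illustrate the PySpark dimensional ETL described in this list, here is a minimal sketch under assumed names; the Hive databases, tables, and columns (raw_db.orders, mart_db.*, order_ts, product_id, amount) are hypothetical, not actual project objects.

```python
# Minimal PySpark sketch of reshaping raw data into dimensional tables for
# self-service reporting. Database, table, and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("dimensional-etl")
         .enableHiveSupport()
         .getOrCreate())

# Raw orders as landed by the ingestion pipeline.
orders = spark.table("raw_db.orders")

# Date dimension derived from the distinct order dates.
dim_date = (
    orders.select(F.to_date("order_ts").alias("date_key")).distinct()
          .withColumn("year", F.year("date_key"))
          .withColumn("month", F.month("date_key"))
)

# Fact table aggregated at date / product grain.
fact_sales = (
    orders.withColumn("date_key", F.to_date("order_ts"))
          .groupBy("date_key", "product_id")
          .agg(F.sum("amount").alias("total_amount"),
               F.count("order_id").alias("order_count"))
)

# Persist both tables for downstream self-service BI tools.
dim_date.write.mode("overwrite").saveAsTable("mart_db.dim_date")
fact_sales.write.mode("overwrite").saveAsTable("mart_db.fact_sales")
```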

Environment: Hadoop, HBase, Jenkins, Tableau, Databricks, Snowflake, Airflow 1.10.11, S3, AWS, EMR, SQL, ETL, Python, Scala, PySQL, EC2, UNIX, and SQL Server

Confidential, New Jersey

Associate Data Engineer

Responsibilities:

  • Worked extensively with Spark data frames for ingesting data from flat files to resilient distributed datasets to convert unstructured data to structured data.
  • Worked on Snowflake, the Apache Spark API, and Spark Streaming, and created queries for querying MapReduce output files.
  • Collected, aggregated, and moved data from servers to HDFS using Apache Spark and Spark Streaming.
  • Ingested data into HDFS from SQL Server and Postgres using Sqoop, loaded it into Hive tables, and transformed and analyzed large datasets using Hive and Apache Spark.
  • Developed Hive scripts to compute aggregates and store them in HBase for low latency applications.
  • Extensively worked on importing metadata into Hive and migrating existing schemas and applications to work on Hive and Spark.
  • Created statistical models using distributed and standalone approaches to build various diagnostic, predictive, and prescriptive solutions.
  • Developed Python and Bash scripts to automate jobs and provide control flow.
  • Created SQL views and procedures to fetch required data into Tableau.
  • Analyzed trends in complex data sets with linear regression, logistic regression, and decision trees (see the sketch after this list).
  • Used Jira for bug tracking, GIT and BitBucket for check-in and check-out of code changes.
  • Used Python libraries such as NumPy, scikit-learn, and Matplotlib for data visualization, interpretation, reporting, and developing strategic insights.
  • Performed data analysis, data migration, data cleansing, transformation, integration, data import, and data export using Python.
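
As a hedged illustration of the regression and decision-tree analysis mentioned in this list, the sketch below compares two scikit-learn classifiers on a churn-style dataset; the file path and feature columns are hypothetical assumptions, not project data.

```python
# Illustrative scikit-learn sketch of the logistic regression / decision tree
# analysis described above; file path and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the cleaned dataset produced by the preprocessing steps.
df = pd.read_csv("customer_activity.csv")
X = df[["tenure_months", "monthly_spend", "support_tickets"]]
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit two candidate models and compare held-out accuracy.
for model in (LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(max_depth=5)):
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(type(model).__name__, round(acc, 3))
```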

Environment: ETL, Hadoop (HDFS, PySpark, Hive, Spark SQL, Sqoop), AWS, SQL, PostgreSQL, Tableau, UNIX, and SQL Server.

Confidential

Data Analyst/ Data Modeler

Responsibilities:

  • Extracted large volumes of employee activity data from the company's in-house tool into Power BI using SQL queries.
  • Created weekly KPIs and SLAs for senior management to measure employee productivity and client satisfaction rates against targets using a Power BI dashboard, resulting in time savings of up to 10%.
  • Extracted data from SAP to MS Excel using SQL queries to create balance sheets and income statements for more than 10 clients.
  • Worked on building the data model using Erwin as per the requirements. Designed the grain of facts depending on reporting requirements.
  • Designed both 3NF data models for ODS and OLTP systems and dimensional data models using star and snowflake schemas.
  • Gathered sales analysis report prototypes from business analysts in different business units; participated in JAD sessions involving the discussion of various reporting needs.
  • Created Power BI dashboards for senior management by importing monthly Excel financial statements to track YoY and MoM changes using key profitability and liquidity metrics (see the sketch after this list).
  • Created SQL queries using joins, GROUP BY, and customer filters for more than 150 business units within the organization.
  • Created, extended, and maintained logical entity-relationship models using contemporary data modeling tools and technologies such as CA Erwin.
  • Created and maintained the Data Model repository as per company standards.
  • Performed diagnostic and descriptive analysis for a client on its reconciliation accounts to identify issues with open AR items.
  • Generated SQL queries to extract consumer and sales data for a key client operating in different geographies with more than 140 product offerings.
  • Created consumer and market trend analyses using MS Excel and Tableau to examine geographical, temporal, and behavioral patterns in revenue and volumes.
  • Identified key issues around customer churn rate, purchasing behavior, and seasonal impact on product margins in various geographies.
  • Created final recommendations and visualizations using Tableau to improve YoY revenue growth by 10%.
  • Performed ad-hoc financial analysis, provided year-wise performance reports, and presented them to the client for decision-making.
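
A minimal pandas sketch of the MoM/YoY calculation behind the Power BI dashboard described above follows; the workbook name and metric columns are hypothetical placeholders.

```python
# Hypothetical pandas sketch of computing MoM and YoY changes from monthly
# financial statements before loading them into a Power BI dashboard.
import pandas as pd

# Monthly statements with one row per month; workbook and columns are assumed.
df = pd.read_excel("monthly_financials.xlsx", parse_dates=["month"])
df = df.sort_values("month").set_index("month")

metrics = df[["revenue", "gross_margin", "operating_cash"]]

# Month-over-month and year-over-year percentage changes per metric.
mom = metrics.pct_change(periods=1) * 100
yoy = metrics.pct_change(periods=12) * 100

summary = (metrics
           .join(mom.add_suffix("_mom_pct"))
           .join(yoy.add_suffix("_yoy_pct")))
summary.to_csv("kpi_summary.csv")  # feed for the Power BI dataset
```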
