Sr. Big Data Engineer Resume
Boston, MA
SUMMARY
- Over 10 years of solid work experience in the Data Engineering field, with skills in analysis, design, development, testing, and deployment of various software applications.
- Over 4 years of working experience in ETL development.
- Excellent working experience in Scrum/Agile framework and Waterfall project execution methodologies.
- Highly skilled in integrating Kafka with Spark Streaming for high-speed data processing.
- Hands-on working experience with cloud technologies such as AWS and MS Azure (Azure Synapse, ADF, Blob Storage, Azure Databricks).
- Good understanding and hands-on experience with AWS S3, EC2, and Redshift.
- Experience in data management and implementation of Big Data applications using Spark and Hadoop frameworks.
- Excellent working experience and sound knowledge of the Informatica and Talend ETL tools. Expertise in reusability, parameterization, workflow design, and designing and developing ETL mappings and scripts.
- Good at understanding ETL specifications and building ETL applications such as mappings on a daily basis.
- Expertise in UNIX shell scripting
- Extensive Knowledge of RDBMS concepts, PL/SQL, Stored Procedure and Normal Forms.
- Strong experience and knowledge of NoSQL databases such as MongoDB, HBase, Azure SQL DB and Cassandra.
- Experience in migrating data using Sqoop from HDFS and Hive to relational database systems and vice versa, according to client requirements.
- Experience with RDBMS like SQL Server, MySQL, Oracle and data warehouses like Teradata and Netezza.
- Expertise in setting up load strategy and dynamically passing parameters to mappings and workflows in Informatica, and to workflows and data flows in the SAP BusinessObjects Data Services integration tools.
- Demonstrated ability to lead projects from planning through completion under fast paced and time sensitive environments.
- Excellent knowledge of planning, estimation, project coordination and leadership in managing large scale projects.
TECHNICAL SKILLS
Big Data & Hadoop Ecosystem: MapReduce, Spark 2.3, HBase 1.2, Hive 2.3, Flume 1.8, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hadoop 3.0
ETL Tools: Informatica 10.1/9.6.1 (PowerCenter/PowerMart) (Designer, Workflow Manager, Workflow Monitor, Server Manager, Power Connect), Talend, IDQ, TOS, TIS.
NoSQL DB: HBase, Azure SQL DB, Cassandra 3.11, Big Table
Reporting Tools: Power BI, Tableau and Crystal Reports 9
Cloud Platforms: AWS (EC2, S3, Redshift), MS Azure (Azure Synapse, ADF, Blob Storage, Azure Databricks), GCP (BigQuery, Google SDK).
Programming Languages: PySpark, Python, SQL, PL/SQL, UNIX shell Scripting, AWK
RDBMS: Microsoft SQL Server 2017, Teradata 15.0, Oracle 12c, and MS Access
Operating Systems: Microsoft Windows 8 and 10, UNIX and Linux.
Methodologies: Agile, RAD, JAD, RUP, UML, System Development Life Cycle (SDLC), Waterfall Model.
PROFESSIONAL EXPERIENCE
Confidential - Boston, MA
Sr. Big Data Engineer
Responsibilities:
- Working as a Big Data Engineer, collaborating with other Product Engineering team members to develop, test, and support data-related initiatives.
- Assisted in leading the plan, build, and run states within the Enterprise Analytics Team.
- Led the estimation, reviewed the estimates, identified the complexities, and communicated them to all the stakeholders.
- Engaged in solving and supporting real business issues with Hadoop Distributed File System and open-source framework knowledge.
- Responsible for data governance rules and standards to maintain the consistency of business element names in the different data layers.
- Built the data pipelines that enable faster, better, data-informed decision-making within the business.
- Identified data within different data stores, such as tables, files, folders, and documents to create a dataset in pipeline using Azure HDInsight.
- Performed detailed analysis of business problems and technical environments and used this data in designing the solution and maintaining data architecture.
- Migrated the on-premises environment to the cloud using MS Azure.
- Worked on migration of data from On-prem SQL server to Cloud databases (Azure Synapse Analytics (DW) & Azure SQL DB).
- Performed data flow transformations using the Data Flow activity.
- Performed ongoing monitoring, automation, and refinement of data engineering solutions.
- Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
- Developed mapping document to map columns from source to target.
- Created Azure Data Factory (ADF) pipelines using Azure PolyBase and Azure Blob Storage.
- Performed ETL using Azure Databricks.
- Wrote UNIX shell scripts to support and automate the ETL process.
- Worked on Python scripting to automate the generation of scripts; data curation was done using Azure Databricks.
- Used data integration to manage data with speed and scalability using the Apache Spark engine in Azure Databricks.
- Involved in various phases of development; analyzed and developed the system following the Agile Scrum methodology.
- Designed efficient and robust Hadoop solutions for performance improvement and end-user experiences.
- Worked in a Hadoop ecosystem implementation/administration, installing software patches along with system upgrades and configuration.
- Performed Data transformations in Hive and used partitions, buckets for performance improvements.
- Continuously monitor and manage data pipeline (CI/CD) performance alongside applications from a single console with Azure Monitor.
- Ingested data into HDFS using Sqoop and scheduled an incremental load to HDFS.
- Worked with Hadoop infrastructure to store data in HDFS and used Hive SQL to migrate the underlying SQL codebase to Azure.
- Extensively involved in writing PL/SQL, stored procedures, functions and packages.
- Created partitioned tables in Hive, designed a data warehouse using Hive external tables, and created Hive queries for analysis.
- Performed data scrubbing and processing with Apache NiFi for workflow automation and coordination.
- Developed Simple to complex streaming jobs using Python and Hive.
- Optimized Hive queries to extract the customer information from HDFS.
- Involved in scheduling the Oozie workflow engine to run multiple Hive jobs.
- Analyzed the partitioned and bucketed data using Hive and computed various metrics for reporting.
- Built Azure Data Warehouse Table Data sets for Power BI Reports.
- Working on BI reporting with AtScale OLAP for Big Data.
- Developed customized classes for serialization and De-serialization in Hadoop.
- Analyzed large amounts of data sets to determine optimal way to aggregate and report on it.
Environment: Hadoop, Spark, Kafka, Azure Databricks, ADF, Python, PySpark, HDFS, ETL, Agile & Scrum meetings
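The source-to-target mapping work described in this section can be illustrated with a small validation helper. This is a minimal sketch with hypothetical column names, not the project's actual mapping document:

```python
# Sketch of a source-to-target column mapping check, the kind of validation
# done alongside a mapping document. All column names are hypothetical.

def validate_mapping(mapping, source_columns, target_columns):
    """Return a list of problems found in a source->target column mapping."""
    problems = []
    for src, tgt in mapping.items():
        if src not in source_columns:
            problems.append(f"unknown source column: {src}")
        if tgt not in target_columns:
            problems.append(f"unknown target column: {tgt}")
    # Flag target columns that no source column ever populates.
    unmapped = set(target_columns) - set(mapping.values())
    for col in sorted(unmapped):
        problems.append(f"target column never populated: {col}")
    return problems
```

Running such a check before deploying a pipeline catches mapping-document drift early, e.g. a renamed source column that would otherwise surface only as a load failure.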
Confidential - New York, NY
Big Data Engineer
Responsibilities:
- As a Big Data Engineer, involved in Agile Scrum meetings to help manage and organize a team of developers, with regular code review sessions.
- Participated in Code Reviews, Enhancement discussion, maintenance of existing pipelines & systems, testing and bug-fix activities on-going basis.
- Worked closely with the business analysts to convert the business requirements into technical requirements, and prepared low- and high-level documentation.
- Used AWS Cloud with Infrastructure Provisioning / Configuration.
- Worked on Spark, improving the performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, Data Frames, and Pair RDDs.
- Developed ETL Processes in AWS Glue to migrate data from external sources like S3, ORC/Parquet/Text Files into AWS Redshift.
- Worked on Ingesting data by going through cleansing and transformations and leveraging AWS Lambda, AWS Glue and Step Functions.
- Used Spark for interactive queries, processing of streaming data and integration with popular NoSQL database for huge volume of data.
- Involved in daily Scrum meetings to discuss the development progress and was active in making Scrum meetings more productive.
- Seamlessly worked in Python to build data pipelines after the data got loaded from Kafka.
- Used Kafka Streams to configure Spark Streaming to get information and then store it in HDFS.
- Worked on loading data into Spark RDDs and performed advanced procedures like text analytics, using the in-memory data computation capabilities of Spark to generate the output response.
- Implemented usage of Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Created AWS Lambda functions and assigned IAM roles to schedule Python scripts using CloudWatch triggers to support the infrastructure needs (SQS, EventBridge, SNS).
- Involved in converting MapReduce programs into Spark transformations using Spark RDDs in Scala and Python.
- Integrated Kafka-Spark streaming for high efficiency throughput and reliability.
- Developed a Python script to hit REST APIs and extract data to AWS S3.
- Conducted ETL data integration, cleansing, and transformations using AWS Glue Spark scripts.
- Developed Oozie workflows for scheduling and orchestrating the ETL process.
- Developed ETL mappings using different transform components.
- Worked on functions in Lambda that aggregate the data from incoming events and then store the result data in Amazon DynamoDB.
- Deployed the project on Amazon EMR with S3 connectivity for setting up backup storage.
- Designed and developed ETL jobs to extract data from Oracle and load it into a data mart in Redshift.
- Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
- Used JSON schema to define table and column mapping from S3 data to Redshift
- Connected Redshift to Tableau for creating dynamic dashboard for analytics team
- Used JIRA to track issues and Change Management
- Involved in creating Jenkins jobs for CI/CD using GIT, Maven and Bash scripting.
- Coordinated in all testing phases and worked closely with the performance testing team to create a baseline for the new application.
- Assisting application development teams during application design and development for highly complex and critical data projects.
Environment: Spark 3.3, AWS S3, Redshift, Glue, EMR, IAM, EC2, Tableau, Jenkins, Jira, Python, Kafka, Agile.
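One bullet in this section describes Lambda functions that aggregate incoming events before storing results in DynamoDB. Below is a minimal sketch of that pattern with hypothetical event fields; the DynamoDB write is indicated only in a comment so the aggregation logic stays runnable without AWS credentials:

```python
# Hypothetical sketch of a Lambda handler that aggregates incoming event
# records before persistence. Field names ("key", "amount") are examples.
from collections import defaultdict

def aggregate_events(records):
    """Sum the 'amount' field of incoming records per 'key'."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["key"]] += rec["amount"]
    return dict(totals)

def handler(event, context=None):
    totals = aggregate_events(event.get("Records", []))
    # In the real function, each total would then be written to DynamoDB, e.g.
    # boto3.resource("dynamodb").Table("totals").put_item(Item={...})
    return totals
```

Aggregating in the function before writing keeps DynamoDB write throughput proportional to the number of distinct keys rather than the number of raw events.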
Confidential - Richmond, VA
Data Engineer
Responsibilities:
- As a Data Engineer, I was responsible for building a data lake as a cloud-based solution in AWS using Apache Spark and Hadoop.
- Involved in Agile methodologies, daily Scrum meetings, Sprint planning.
- Installed and configured Hadoop and responsible for maintaining cluster and managing and reviewing Hadoop log files.
- Used AWS Cloud and On-Premise environments with Infrastructure Provisioning/ Configuration.
- Used EMR (Elastic MapReduce) to perform big data operations in AWS.
- Used the Agile Scrum methodology to build the different phases of the software development life cycle.
- Worked on AWS Redshift and RDS for implementing models and data on RDS and Redshift and designed and implemented Near Real Time ETL and Analytics using Redshift.
- Designed and customized data models for a data warehouse supporting data from multiple sources in real time.
- Designed ETL strategies for load balance, exception handling and design processes that can satisfy high data volumes.
- Worked on ETL migration services by developing and deploying AWS Lambda functions to generate a serverless data pipeline that can be written to the Glue Catalog and queried from Athena.
- Contributed to the development of key data integration and advanced analytics solutions leveraging Apache Hadoop.
- Wrote complex Hive queries to extract data from heterogeneous sources (data lake) and persist the data into HDFS.
- Developed Big Data solutions focused on pattern matching and predictive modeling.
- Developed the code for importing and exporting data into HDFS and Hive using Sqoop.
- Developed a data pipeline using Kafka, HBase, Spark, and Hive to ingest, transform, and analyze customer behavioral data.
- Developed Spark jobs and Hive Jobs to summarize and transform data.
- Developed a reconciliation process to make sure the Elasticsearch index document count matches the source records.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
- Implemented Sqoop to transfer the data from Oracle to Hadoop and load it back in Parquet format.
- Developed incremental- and complete-load Python processes to ingest data into Elasticsearch from an Oracle database.
- Used Hive to analyze data ingested into HBase via Hive-HBase integration and computed various metrics for reporting on the dashboard.
- Created Hive external tables to stage data and then moved the data from staging to the main tables.
- Pulled the data from the data lake (HDFS) and massaged the data with various RDD transformations.
- Loaded the data through HBase into Spark RDDs and implemented in-memory data computation to generate the output response.
- Continuously tuned Hive UDFs for faster queries by employing partitioning and bucketing.
Environment: Hadoop 2.7, Spark 2.7, Hive, Sqoop 1.4.6, AWS, HBase, Kafka 2.6.2, Python 3.6, HDFS, Elastic Search & Agile Methodology
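The Sqoop work in this section (Oracle to Hadoop in Parquet format, with incremental loads) can be sketched as a small command builder. The connection string, table, and check column below are placeholders, not values from the actual project:

```python
# Sketch of assembling a Sqoop incremental import command of the kind used
# to move Oracle tables into HDFS as Parquet. All identifiers are examples.

def sqoop_import_cmd(jdbc_url, table, target_dir, check_column, last_value):
    """Build a 'sqoop import' command for an incremental append load."""
    parts = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--as-parquetfile",          # write the target files in Parquet format
        "--incremental", "append",   # only import rows newer than last_value
        "--check-column", check_column,
        "--last-value", str(last_value),
    ]
    return " ".join(parts)
```

In practice the scheduler records the new high-water mark after each run and feeds it back as `--last-value` for the next incremental load.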
Confidential - Arlington, VA
ETL/Informatica Developer
Responsibilities:
- Designed for analytics queries rather than transaction processing.
- Part of the SDLC (Software Development Life Cycle): requirements, analysis, design, testing, and deployment of Informatica PowerCenter.
- Part of Informatica cloud integration with Amazon Redshift.
- Implemented ETL as a commit-intensive process, having a separate queue with a small number of slots to mitigate this issue.
- Involved in the development of Informatica mappings, which were also tuned for better performance.
- Worked with various transformations such as Expression, Aggregator, Update Strategy, Look Up, Filter, Router, Joiner and Sequence generator in Informatica for new requirement.
- Created a queue dedicated to ETL processes.
- Configured this queue with a small number of slots (5 or fewer) using Amazon Redshift.
- Incident resolution using the ALM system, and production support: handling production failures and fixing them within SLA.
- Created/modified Informatica workflows and mappings (PowerCenter and Cloud); also involved in unit testing, internal quality analysis procedures, and reviews.
- Validated and fine-tuned the ETL logic coded into existing PowerCenter mappings, leading to improved performance.
- Loaded data in bulk ETL using AWS Redshift.
- Wrote basic UNIX shell scripts and PL/SQL packages and procedures.
- Involved in performance tuning of the mappings, sessions, and SQL queries.
- Creating/modifying Informatica Workflows and Mappings.
- Used different control flow elements like the For Each Loop container, Sequence container, Execute SQL task, and Send Mail task.
- Created joblets in Talend for the processes that can be used in most of the jobs in a project, such as Start Job and Commit Job.
- Used UNLOAD to extract large result sets.
- Used event handling to send e-mail on error events at the time of transformation.
- Used the logging feature for analysis purposes.
- Database and Log Backup, Restoration, Backup Strategies, Scheduling Backups.
- Improved the performance of SQL Server queries using query plans, covering indexes, indexed views, and by rebuilding and reorganizing the indexes.
- Performed tuning of SQL queries and stored procedures using SQL Profiler and Index Tuning Wizard.
- Used Amazon Redshift Spectrum for ad hoc ETL processing.
- Troubleshooting performance issues and fine-tuning queries and stored procedures.
- Defined Indexes, Views, Constraints and Triggers to implement business rules.
- Involved in writing complex T-SQL queries.
- Backing up master & system databases and restoring them.
- Developed Stored Procedures and Functions to implement necessary business logic for interface and reports.
- Involved in testing and debugging stored procedures.
- Wrote the DAX statements for the cube.
Environment: Informatica Power center 10, PL/SQL, UNIX shell scripting, SQL Server, Visual Studio, SSIS, SSRS, Talend, AWS
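The dedicated Redshift ETL queue described above (a commit-intensive workload isolated in a queue with 5 or fewer slots) can be sketched as a WLM (workload management) configuration fragment. The user group name is a placeholder:

```python
# Sketch of a Redshift WLM configuration that isolates ETL in a small queue,
# per the "5 or fewer slots" strategy above. "etl_users" is hypothetical.
import json

def etl_wlm_config(slots=5):
    """Return a WLM JSON string with a dedicated low-concurrency ETL queue."""
    if slots > 5:
        raise ValueError("ETL queue was kept at 5 or fewer slots")
    return json.dumps([
        # Queue 1: routes queries from the ETL user group; few slots so that
        # commit-heavy loads do not starve interactive queries of memory.
        {"user_group": ["etl_users"], "query_concurrency": slots},
        # Default queue for everything else.
        {"query_concurrency": 15},
    ])
```

Keeping ETL concurrency low gives each load more memory per slot and limits contention on Redshift's serialized commit queue.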
Confidential
Informatica Developer
Responsibilities:
- Actively participated in understanding business requirements, analysis and designing ETL process.
- Effectively applied all the business requirements and transformed the business rules into mappings.
- Developed Mappings between source systems and Warehouse components.
- Used Informatica designer to create complex mappings using different transformations to move data to a Data Warehouse.
- Developed extract logic mappings and configured sessions.
- Extensively used the Filter and Expression transformations on the source database to filter out invalid data.
- Extensively used ETL to load data from flat files, involving both fixed-width and delimited files, as well as from the relational database, which was Oracle.
- Worked on debugging, troubleshooting, and documentation of the data warehouse.
- Created reusable transformations and Mapplets to use in multiple mappings.
- Handled the performance tuning of Informatica mappings.
- Developed Shell Scripts as per requirement.
- Prepared PL/SQL scripts for data loading into Warehouse and Mart.
- Fixed SQL errors within teh deadline.
- Made appropriate changes to schedules when some jobs were delayed.
- Self-reviewed unit test cases and integration test cases of all the assigned modules.
Environment: Informatica Power Center 8.6, Windows XP, Oracle 10g, UNIX/LINUX, SQL Server.
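The flat-file loads in this section covered both fixed-width and delimited layouts. A fixed-width layout can be sketched as a small parser; the field names and widths below are illustrative only, not the project's actual file spec:

```python
# Minimal sketch of parsing one line of a fixed-width flat file, the kind of
# source loaded into the warehouse here. Layout values are hypothetical.

def parse_fixed_width(line, layout):
    """layout: list of (field_name, width) tuples, in file order."""
    record, pos = {}, 0
    for name, width in layout:
        record[name] = line[pos:pos + width].strip()
        pos += width
    return record
```

Delimited files need only `line.split(delimiter)` instead; keeping the layout as data (rather than hard-coded slices) lets one loader handle many file specs.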