
Sr. Big Data Engineer Resume


PROFESSIONAL SUMMARY:

  • 6+ years of technical experience in analysis, design, and development with Big Data technologies such as Spark, Hive, Kafka, and HDFS, using programming languages including Python, Scala, and Java.
  • Strong experience in the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies.
  • Data engineering professional with solid foundational skills and a proven track record of implementations across a variety of data platforms.
  • Strong experience in writing scripts using the Python, PySpark, and Spark APIs for analyzing data.
  • Extensively used Python libraries including PySpark, pytest, PyMongo, NumPy, and Beautiful Soup.
  • Expertise in Python and Scala, including writing user-defined functions (UDFs) for Hive and Pig in Python.
  • Experienced in creating shell scripts to push data loads from various sources from the edge nodes onto HDFS.
  • Good experience in implementing and orchestrating data pipelines using Oozie and Airflow.
  • Experience working with NoSQL databases like Cassandra, HBase and MongoDB.
  • Good working knowledge of Google Cloud Platform (GCP) which includes services like Data Flow, Data Proc, Big Query, Pub/Sub, Data Studio, Data Fusion.
  • Worked with Cloudera and Hortonworks distributions.
  • Extensive experience working on Spark, performing ETL using Spark SQL and Spark Core and real-time data processing using Spark Streaming from Kafka (a brief PySpark sketch follows this list).
  • Strong experience working with various file formats such as Avro, Parquet, ORC, JSON, and CSV.
  • Experienced in building automated regression scripts for validation of ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB using Python.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Worked extensively on Sqoop for performing both batch loads as well as incremental loads from relational databases.
  • Hands-on expertise in writing RDD (Resilient Distributed Dataset) transformations and actions using Scala and Python.
  • Proficient in SQL, with experience in data extraction and in developing queries for a wide range of applications.
  • Experience working with GitHub, Jenkins, and Maven.
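
As referenced above, the following is a minimal, illustrative PySpark sketch of the kind of Spark SQL ETL described in this summary; the paths, table, and column names are placeholders, not taken from any actual engagement.

    from pyspark.sql import SparkSession

    # Start a Spark session with Hive support so results can land in Hive-managed tables.
    spark = SparkSession.builder.appName("etl-sketch").enableHiveSupport().getOrCreate()

    # Read a raw Parquet dataset (path is a placeholder).
    orders = spark.read.parquet("/data/raw/orders")
    orders.createOrReplaceTempView("orders")

    # Aggregate with Spark SQL, filtering out null amounts as a simple cleaning step.
    daily_totals = spark.sql("""
        SELECT order_date, SUM(amount) AS total_amount
        FROM orders
        WHERE amount IS NOT NULL
        GROUP BY order_date
    """)

    # Write the curated result back out as Parquet (path is a placeholder).
    daily_totals.write.mode("overwrite").parquet("/data/curated/daily_totals")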

TECHNICAL SKILLS:

Big Data Ecosystem: Hive, Spark, MapReduce, Hadoop, Yarn, HDFS, Hue, Impala, HBase, Oozie, Sqoop, Pig, Flume, Airflow, PySpark

Hadoop Distribution: Cloudera, Hortonworks, AWS EMR, Databricks, Azure

Programming Languages: Python, Scala, Java, Shell Scripting

Methodologies: Agile/Scrum, Waterfall, RAD

Build and CICD: Maven, Jenkins, GitLab

Cloud Management: GCP Data Flow, Data Proc, Big Query, Pub/Sub, Data Studio, Data Fusion, Azure Data Lake Analytics, Azure SQL Database, ADF, Databricks, Azure SQL Data Warehouse, SageMaker

Messaging Platforms: Kafka, Pub/Sub, SQS, SNS

Data Storage: MySQL, Oracle, Teradata, Hive, Cassandra, MongoDB and HBase

IDE and ETL Tools: IntelliJ, Eclipse

PROFESSIONAL EXPERIENCE:

Confidential

Sr. Big Data Engineer

Responsibilities:

  • Responsible for creating an orchestration layer across the core business areas on Azure Analytics Services; designed and developed scalable, reliable, and performant big data pipelines to consume, integrate, and analyze various datasets.
  • Developed PySpark-based Spark applications for data cleaning, event enrichment, data aggregation, de-normalization, and data preparation for consumption by machine learning and reporting teams.
  • Developed generic, automated Python code for a set of data quality rules and evaluations (e.g., data type casting issues).
  • Implemented a generic framework for scheduling daily/weekly/monthly data loads from the curated to the consumption layer (Synapse).
  • Worked on troubleshooting Spark applications to make them more fault tolerant.
  • Worked on fine-tuning Spark applications to improve overall processing time for the pipelines.
  • Experienced in handling large datasets using Spark's in-memory capabilities, broadcast variables, effective and efficient joins, transformations, and other features.
  • Good experience with continuous integration using Git.
  • Developed, constructed, and tested an end-to-end incremental load architecture.
  • Worked extensively with ADF for importing data from Teradata to Azure Data Lake.
  • Worked with the Databricks platform, following best practices for securing network access to cloud applications.
  • Using Azure Databricks, created Spark clusters and configured high concurrency clusters to speed up the preparation of high-quality data.
  • Used Azure Databricks as a fast, easy, and collaborative Spark-based platform on Azure.
  • Involved in creating Delta tables and loading and analyzing data using Python scripts.
  • Implemented partitioning for Delta tables (see the sketch after this list).
  • Analyzed existing SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Documented operational problems following standards and procedures using JIRA.
  • Experience in developing Spark applications using Spark SQL in Databricks for data extraction.
  • Developed and operationalized batch and real-time models in SageMaker with the data science team.
  • Used reporting tools such as Power BI to connect to Azure Analysis Services (AAS) via Visual Studio for generating daily reports.
  • Collaborated with the infrastructure, network, database, application, and BA teams to ensure data quality and availability.
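
As referenced above, a minimal sketch of creating a partitioned Delta table on Azure Databricks; the path, database, table, and partition column names are assumptions for illustration only.

    from pyspark.sql import SparkSession

    # On Databricks a SparkSession is already provided; this line is only for a standalone sketch.
    spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

    # Read curated-layer data (path is a placeholder).
    events = spark.read.parquet("/mnt/curated/events")

    # Write a Delta table partitioned by date so downstream queries can prune partitions.
    (events.write
           .format("delta")
           .mode("overwrite")
           .partitionBy("event_date")          # assumed partition column
           .saveAsTable("consumption.events")) # assumed database.table name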

Environment: Spark, Python, PySpark, Azure Data Lake Analytics, Azure SQL Database, Databricks, Azure SQL Data Warehouse, Teradata, Snowflake, JIRA, Azure, Shell Scripting.

Confidential

Big Data Engineer

Responsibilities:

  • Responsible for the execution of big data analytics, predictive analytics, and machine learning initiatives.
  • Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation, queries, and writing back into S3 buckets.
  • Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
  • Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
  • Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, and used the Spark engine and Spark SQL for data analysis, providing results to the data scientists for further analysis.
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL Activity.
  • Implemented schema extraction for Parquet and Avro file formats in Hive.
  • Implemented Spark RDD transformations to map business analysis and applied actions on top of the transformations.
  • Worked on data migration to Hadoop and Hive query optimization.
  • Utilized Apache Spark with Python to develop and execute big data analytics and machine learning applications; executed machine learning use cases using Spark ML and MLlib.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Automated resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production (an illustrative Airflow DAG follows this list).
  • Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, and to write data back.
  • Primarily involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
  • Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
  • Developed data pipelines using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer data.
  • Designed and built data pipelines to load data into the GCP platform.
  • Profiled structured, unstructured, and semi-structured data across various sources to identify patterns, and implemented data quality metrics using the necessary queries or Python scripts depending on the source.
  • Worked with PowerShell and UNIX scripts for file transfer, emailing and other file related tasks.
  • Involved in using Sqoop for importing and exporting data between RDBMS and HDFS.
  • Used the Cloud Shell SDK in GCP to configure the Data Proc, Storage, and Big Query services.
  • Deployed code to EMR via CI/CD using Jenkins.
  • Extensively used Code Cloud for code check-ins and checkouts for version control.
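
As noted above, a hedged sketch of an Airflow DAG (Airflow 2.x style) that schedules a nightly spark-submit run; the DAG id, owner, schedule, and script path are illustrative assumptions, not details from the project.

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    default_args = {
        "owner": "data-engineering",            # assumed owner
        "retries": 1,
        "retry_delay": timedelta(minutes=10),
    }

    with DAG(
        dag_id="daily_ingest_pipeline",         # assumed DAG id
        default_args=default_args,
        start_date=datetime(2023, 1, 1),
        schedule_interval="0 2 * * *",          # run daily at 02:00
        catchup=False,
    ) as dag:
        run_spark_job = BashOperator(
            task_id="run_spark_job",
            bash_command="spark-submit /opt/jobs/ingest_job.py",  # assumed script path
        )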

Environment: Python, Data Virtualization, Data Warehouse, Hive, HBase, Airflow, Azure, SQL Server, Sqoop, GCP, NoSQL, UNIX, HDFS, Oozie, SSIS.

Confidential | Confidential

Data Engineer

Environment: ADLS (Azure Data Lake Storage) Gen2, ADF, Databricks (PySpark), Synapse (SQL Data Warehouse)

Responsibilities:

  • Built a generic, easy-to-implement PySpark-based framework (IDF) that solves complex data platform problems.
  • The IDF code is performance efficient and can handle any type of big data operation.
  • Makes it easy to build big data/ETL pipelines with minimal development effort.
  • Provides pre-built data cleansing rules such as null, character, and length checks (see the sketch after this list).
  • The framework covers all big data transformations and actions.
  • Accepts multiple source file formats such as CSV and Parquet; the target can be a file or a table of any SCD type.
  • Implemented over 250 pre-built rules in Python on Azure and handled all SCD implementation types, including SCD1 and SCD2.
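
A minimal sketch of what two of the pre-built data-quality rules mentioned above (a null check and a length check) could look like in PySpark; the function, column, and DataFrame names are hypothetical and not the actual IDF implementation.

    from pyspark.sql import DataFrame
    from pyspark.sql import functions as F

    def null_check(df: DataFrame, column: str) -> DataFrame:
        # Flag rows where the given column is null.
        return df.withColumn(f"{column}_is_null", F.col(column).isNull())

    def length_check(df: DataFrame, column: str, max_len: int) -> DataFrame:
        # Flag rows where the given column exceeds the expected length.
        return df.withColumn(f"{column}_too_long", F.length(F.col(column)) > max_len)

    # Example usage against an existing DataFrame named `customers` (hypothetical):
    # customers = length_check(null_check(customers, "email"), "zip_code", 10)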

Environment: Azure SQL, Data Lake, Data Factory, Data Lake Analytics, NoSQL DB, HDInsight, Tableau, SQL Server, Databricks, IDF Tool, Python, SQL, Synapse

Confidential

Data Engineer

Responsibilities:

  • Worked on migrating data from an Oracle database to HDFS to enable faster and more accurate analytics for improving the service.
  • Migrated business logic written in thousands of Oracle stored procedures to Spark SQL in an optimized way.
  • Dumped existing data from Oracle into Hive tables using Sqoop and computed different types of data using Spark SQL.
  • Worked collaboratively to manage build-outs of large data clusters and real-time streaming with Spark; this involved implementing a recommendation engine ingested into the raw storage layer through Hadoop data pipelines built using the Cloudera CDH stack.
  • CDR data is ingested and stored using Kafka and Spark Streaming to accommodate the frequency and volume of data files from the switches.
  • Kafka producers stream the CDR data as it arrives from the switches, and Kafka consumers consume it for data wrangling and enrichment before storage (a streaming sketch follows this list).
  • Developed ETL data pipelines using Spark, Spark streaming and Scala.
  • Responsible for loading Data pipelines from web servers using Sqoop, Kafka and Spark Streaming API.
  • Analyzed the Oracle stored procedures based on the business documents.
  • Mapped the functional logic for all the stored procedures defined.
  • Converted the Oracle stored procedure logic into Spark SQL using Java features.
  • Implemented the generic tool for file level validation according to the business logic.
  • Unit testing using JUnit.
  • Used Sonar and Emma for code coverage of JUnit tests.
  • Code quality checks using Jenkins and peer reviews in an agile methodology.
  • Optimized Spark SQL code.
  • Created Oozie workflows to automate and productionize the data pipelines.
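
As referenced above, a sketch of consuming CDR-style records from Kafka. The original pipeline was built with Spark Streaming in Scala/Java, so this PySpark Structured Streaming version is only an illustration; the broker, topic, and paths are placeholders, and it assumes the spark-sql-kafka connector is on the classpath.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cdr-stream-sketch").getOrCreate()

    # Subscribe to the Kafka topic carrying CDR records (broker and topic are placeholders).
    raw = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "cdr-events")
                .load())

    # Kafka delivers key/value as binary; cast the value to a string before enrichment.
    records = raw.select(F.col("value").cast("string").alias("cdr_line"))

    # Land the raw stream as Parquet with checkpointing for reliable file output.
    query = (records.writeStream
                    .format("parquet")
                    .option("path", "/data/raw/cdr")
                    .option("checkpointLocation", "/checkpoints/cdr")
                    .start())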

Environment: Apache Spark, Java, Hive, Sqoop, HDFS, Maven, Oozie, CI/CD (Jenkins), JIRA, Kafka, and UNIX Shell Scripting

Confidential

Data Engineer

Responsibilities:

  • Worked on a USA-based United Airlines project as a data engineer, scheduling pipelines from code repositories using PySpark as an interface and SQL to process large datasets.
  • Built implementation logic for business-exposed layer tables using Palantir Foundry.
  • Used tools such as Monocle, Contour, and Slate for data analysis, lineage, and visualization, and fixed issues so the data matched the expected target system for migration.

Environment: Palantir Foundry (end-to-end flow), Spark, Python, Teradata, and SQL
