
Data Engineer Resume


TX

SUMMARY

  • 8 years of application development experience across all phases of the Software Development Life Cycle (SDLC), including user interaction, business analysis/modeling, development, integration, planning, and management of builds.
  • Good understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors, and tasks.
  • Experience developing Spark applications using RDDs, Spark SQL, DataFrames, and Spark Streaming in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Good understanding of Hadoop architecture and experience with Big Data using Hadoop and related technologies such as HDFS, MapReduce, Hive, Impala, Pig, Flume, HBase, Oozie, Sqoop, and ZooKeeper.
  • Experience in writing MapReduce programs and using the Apache Hadoop API to analyze data.
  • Good experience working with Azure Cloud, Azure DevOps, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Analysis Services, Azure Cosmos DB (NoSQL), Azure HDInsight Big Data technologies (Hadoop and Apache Spark), and Databricks.
  • Extracted, transformed, and loaded data from various source systems to Blob Storage, Cosmos DB, Azure SQL, and Azure Synapse (DW) using a combination of Azure Data Factory, Confidential -SQL, and U-SQL.
  • Deep familiarity with Azure security services: Azure Active Directory, RBAC, Key Vault, and ADFS.
  • Good knowledge of cloud-based technologies such as Amazon Web Services (AWS), including EC2 and S3.
  • Experience with Azure Blob and Data Lake storage and loading data into Azure Synapse Analytics (SQL DW).
  • Designed, built, and analyzed data solutions using Azure and Azure Databricks services; provided data visualization using Azure Databricks dashboards and Power BI, and delivered data insights to the business.
  • Designed and built ETL pipelines using Azure Databricks, ingested data from different sources such as Snowflake, Teradata, Vertica, and Oracle into Azure Data Lake Storage, and monitored ongoing Databricks jobs.
  • Participated in migrating on-prem data from the existing data lake to Azure Data Lake Storage using the Azure Databricks platform.
  • Hands-on experience in data mining, implementing complex business logic, optimizing queries using HiveQL, and controlling data distribution through partitioning and bucketing techniques to enhance performance.
  • Hands-on experience building data validation and reconciliation from source to target and maintaining data quality in the data lake.
  • Experience working with Hive data and extending the Hive library with custom UDFs to query data in non-standard formats.
  • Experience in SQL Server, MySQL, and Oracle databases.
  • Highly motivated, with strong communication skills, the ability to interact with team members, developers, and users, and a zeal to learn new technologies.
  • Good understanding and experience with Software Development methodologies.
  • Good interpersonal and communication skills, with maximum contribution toward attaining team goals.

TECHNICAL SKILLS

Big Data / Hadoop: HDFS, Hive, Apache Spark (PySpark and Spark Scala), Sqoop, Kafka, Spark Streaming, Pig, Apache NiFi

Data Warehouse and Cloud Analytics: Snowflake

Programming Languages: Java, Scala, Python, Shell Scripting

Cloud: Azure, Azure Databricks, Cosmos DB, Event Hub, Logic Apps, Azure Synapse, AWS S3

Platform: Azure Databricks

Database: Oracle, SQL, PL/SQL

Markup languages: XML, HTML

Platforms: Unix, Linux

Version Control: SVN, Code Cloud, Git

PROFESSIONAL EXPERIENCE

Confidential, TX

Data Engineer

Responsibilities:

  • Designed, built, and analyzed data solutions using Azure and Azure Databricks services; provided data visualization using Azure Databricks dashboards and Power BI, and delivered data insights to the business.
  • Designed and built ETL pipelines using Azure Databricks, ingested data from different sources such as Snowflake, Teradata, Vertica, and Oracle into Azure Data Lake Storage, and monitored ongoing Databricks jobs.
  • Participated in migrating on-prem data from the existing data lake to Azure Data Lake Storage using Azure Databricks.
  • Developed an Apache Spark ETL application using PySpark to extract data from different sources such as Snowflake, Teradata, Vertica, and SQL Server and write it to Azure Data Lake Storage (a minimal sketch follows this list).
  • Provided an Azure Databricks multi-tenant environment (PROD, STG, DEV) for data ingestion teams to extract data from on-prem systems to Azure Data Lake Storage, and assisted the ingestion teams in testing on-prem code on Azure Databricks.
  • Used Spark SQL to clean, transform, and aggregate data with appropriate file and compression formats per requirements before writing the data to Azure Data Lake Storage.
  • Developed UDFs in Scala, PySpark, and Hive to meet business requirements for data ingestion, and developed SQL scripts (see the UDF sketch after the environment line below).
  • To optimize Azure Log Analytics cost, developed and automated a PySpark application that extracts diagnostic logs from Azure storage account containers and uses Spark SQL to analyze the logs and provide usage insights, running on Azure Databricks with the Databricks job scheduler.
  • Developed an init shell script to install the libraries Databricks multi-tenant clusters need to connect to different data sources such as Vertica, Snowflake, Teradata, and SQL Server, and configured Databricks cluster policies according to the connection requirements.
  • Developed an ETL pipeline to process blob inventory data on Azure storage accounts to track container size trends and surface data-growth insights.
  • Worked with Databricks Overwatch historical and real-time data to derive insights on Databricks cluster usage.
  • Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and SFTP servers.
  • Ingested data from multiple source systems extensively using Azure Data Factory.
  • Automated jobs using different triggers (event, scheduled, and tumbling window) in ADF (Azure Data Factory), and created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, Confidential -SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
  • Worked on migration of data from on-prem SQL Server to cloud databases (Azure Synapse Analytics (DW) and Azure SQL DB).
  • Developed stored procedures and tables with optimal data distribution and indexing in SQL Database and Synapse; fetched data into Power BI from Azure Synapse Analytics and created dashboards and reports in Power BI for visualization.
  • Utilized PySpark to write data to SQL Server and create tables directly from Synapse notebooks.
  • Worked on validating and testing ETL code for multiple on-prem applications before migration to Azure Databricks.
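
A minimal PySpark sketch of the kind of JDBC-extraction-and-landing job described above; the connection details, table name, and storage path are placeholders rather than actual project values.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("jdbc-to-adls").getOrCreate()

    # Pull a table from a relational source over JDBC (all connection values are placeholders)
    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:sqlserver://example-host:1433;databaseName=sales")
              .option("dbtable", "dbo.orders")
              .option("user", "etl_user")
              .option("password", "****")
              .load())

    # Light cleanup before landing: drop rows without a key and stamp the load date
    cleaned = (orders
               .filter(F.col("order_id").isNotNull())
               .withColumn("load_date", F.current_date()))

    # Land the data in ADLS as Snappy-compressed Parquet, partitioned by load date
    (cleaned.write
     .mode("overwrite")
     .option("compression", "snappy")
     .partitionBy("load_date")
     .parquet("abfss://raw@examplestorage.dfs.core.windows.net/sales/orders/"))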

Environment: Azure, Azure Databricks, Databricks Multi-Tenant, Azure Synapse, Azure Data Factory, Spark Scala, PySpark, Spark SQL, Snowflake, Teradata, Vertica, SQL Server, Confidential -SQL, U-SQL, Shell Script, SQL, Hive, Hadoop Environment, Scala, Python
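
A short sketch of registering a PySpark UDF for use from both the DataFrame API and Spark SQL, in the spirit of the ingestion UDFs mentioned above; the normalization rule and column names are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-example").getOrCreate()

    # Hypothetical cleanup rule: normalize free-text region codes during ingestion
    def normalize_region(value):
        return value.strip().upper() if value else None

    normalize_region_udf = udf(normalize_region, StringType())              # DataFrame API
    spark.udf.register("normalize_region", normalize_region, StringType())  # Spark SQL

    df = spark.createDataFrame([(" tx ",), ("ca",), (None,)], ["region"])
    df.withColumn("region_clean", normalize_region_udf("region")).show()

    # The same function is now callable from Spark SQL
    df.createOrReplaceTempView("accounts")
    spark.sql("SELECT region, normalize_region(region) AS region_clean FROM accounts").show()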

Confidential, CO

Data Engineer

Responsibilities:

  • Analyzed, designed, developed, maintained, and fixed bugs for the yield management applications.
  • Responsible for supporting 3,000+ of Confidential's DIFA (Data Ingestion Framework Automation), OI (Open Ingest), and DPL (Data Processing Library) jobs in production, testing, and development.
  • Involved in resolving ingestion job failures in DEV, STG, and PROD: finding the cause of failure, investigating, and delivering the fix.
  • Provided support to application teams using DIFA, OI, and DPL, which are tools internal to Confidential, to build, run, and migrate the appropriate jobs to production.
  • Developed Spark programs using the Scala API for high-performance data extraction in Confidential ETL tools such as OI (Open Ingest) and DPL (Data Processing Library).
  • Implemented Apache Spark using Scala and Spark SQL for faster testing and processing of data.
  • Designed and created Hive external tables using the Hive metastore (instead of Derby) with static and dynamic partitioning and bucketing (see the table-definition sketch after the environment line below).
  • Implemented Apache Pig scripts to load data into and out of Hive tables.
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
  • Used the JSON and XML SerDes for serialization and deserialization to load JSON and XML data into Hive tables.
  • Imported data from different sources such as AWS S3 and streaming datasets into Spark DataFrames (a sketch follows this list).
  • As an admin for the DIFA, OI, and DPL tools, responsible for granting user access and helping new users understand the ingestion tools; involved in DEV, STG, and PROD deployments (manual and automated) through Pivotal Cloud Foundry (PCF) for OI and through Hadoop edge nodes for DIFA and DPL.
  • Worked with various HDFS file formats such as Avro and SequenceFile and various compression formats such as Snappy.
  • Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables; handled structured data using Spark SQL.
  • Developed Spark/MapReduce jobs to parse JSON and XML data.
  • Worked on Pig scripts to clean up the imported data and created partitions for daily loads.
  • Worked on converting Hive/SQL queries into Spark transformations using Spark RDDs with Spark Scala.
  • Worked on converting MapReduce programs into Spark transformations using Spark Scala.
  • Used Avro, Parquet, and ORC data formats to store data in HDFS.
  • Developed data validation and reconciliation scripts to ensure data quality.
  • Knowledge of MongoDB NoSQL data modeling, tuning, disaster recovery, and backup.
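
A compact PySpark sketch of reading JSON from S3 into a DataFrame, shaping it with Spark SQL, and persisting it to a partitioned Hive table, as referenced above; the bucket, columns, and table names are illustrative, and the original work here used the Scala API.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("s3-json-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    # Read raw JSON events from an example S3 prefix into a DataFrame
    events = spark.read.json("s3a://example-bucket/raw/events/")

    # Shape the data with Spark SQL
    events.createOrReplaceTempView("events_raw")
    daily = spark.sql("""
        SELECT event_type,
               to_date(event_ts) AS event_date,
               COUNT(*)          AS event_count
        FROM events_raw
        GROUP BY event_type, to_date(event_ts)
    """)

    # Persist as an ORC-backed Hive table, partitioned by event date
    (daily.write
     .mode("overwrite")
     .format("orc")
     .partitionBy("event_date")
     .saveAsTable("analytics.daily_event_counts"))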

Environment: Hortonworks, Apache Spark, Hadoop, MapReduce, HDFS, Hive, AWS, Pig, Scala, Spark SQL, SQL, Spark Streaming, Shell Script, JSON, Avro, Parquet, ORC
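
For the external-table work above, a minimal sketch of a partitioned Hive external table, issued through spark.sql to stay consistent with the other examples; the database, columns, and HDFS location are placeholders.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-external-table")
             .enableHiveSupport()
             .getOrCreate())

    # Define an external, partitioned, ORC-backed table over data already landed in HDFS
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS analytics.viewership (
            device_id      STRING,
            program_id     STRING,
            watch_seconds  BIGINT
        )
        PARTITIONED BY (event_date STRING)
        STORED AS ORC
        LOCATION '/data/analytics/viewership'
    """)

    # Register partitions that were written directly to the table location
    spark.sql("MSCK REPAIR TABLE analytics.viewership")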

Confidential, Dallas, TX

Data Engineer/Hadoop Developer

Responsibilities:

  • Worked with 100+ data sources to consume data into the Confidential & Confidential data lake platform, including set-top box customer usage, viewership event data, account/profile-level master data sets, Confidential & Confidential streaming app (DTVNOW and WTCH TV) clickstreams, app diagnostics, video session quality details, and voice commands from Alexa devices; addressed SPI requirements by applying FPE encryption and hashing to sensitive attributes (see the hashing sketch after the environment line below) and maintained audit requirements for daily trends.
  • Developed and supported 10+ ETL jobs on the Spark framework that generate large extracts from 100+ data sources totaling ~30 PB and publish them to AWS S3 buckets using the AWS SDK for business clients such as BI and data science teams.
  • Participated in designing the Blueprint ad-platform application based on historical customer viewership and billing data, which enables the Confidential & Confidential business to allocate advertisement slots on DirecTV set-top boxes for companies that want to advertise their products.
  • Confidential & Confidential DTVNOW/OTT data analysis: developed a viewership platform on the core data sources available in the data lake to generate hourly/daily reports for the Confidential & Confidential business, enabling them to build an ad platform and better marketing products and to train a machine learning model for a better customer experience.
  • Reduced the time to ingest ~300 TB of data into HDFS by 75% by creating a multi-threaded application in Scala.
  • Increased the performance of existing ETL jobs by 40% by migrating them from Apache Pig to the Spark framework, which released dependent jobs on time.
  • Implemented Spark Streaming to consume data from Kafka topics, enabling real-time data availability for data lake users and retiring the end-of-day Sqoop jobs that ingested data from traditional RDBMS sources (a streaming sketch follows this list).
  • Designed and developed a processing pipeline in Spark and Java to handle billions of streaming-app viewership records arriving in file formats such as ORC, Snappy, Parquet, Avro, JSON, CSV, and TXT, for the content supplier payment management team, which generates invoices for studios such as Warner Bros, Disney, and HBO.
  • Experience in reconciling data by comparing source and target datasets during data migration.
  • Performed root cause analysis, traced issues back to their origin, and tested the entire data reconciliation pipeline.
  • Collaborated with the data science team to consume new data sources for marketing analysis, advertisements, and mobile app crashes; assisted the operations team and supported deployed projects in production.
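
A minimal Structured Streaming sketch of the Kafka-to-data-lake flow described above; the original jobs used the Spark Streaming APIs of the time, the broker addresses, topic, and paths below are placeholders, and the Kafka connector package is assumed to be on the classpath.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("kafka-viewership-stream").getOrCreate()

    # Subscribe to an example Kafka topic (broker and topic names are placeholders)
    stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
              .option("subscribe", "viewership-events")
              .option("startingOffsets", "latest")
              .load())

    # Kafka delivers the payload as binary; cast it to a string for downstream parsing
    events = stream.select(F.col("value").cast("string").alias("payload"),
                           F.col("timestamp"))

    # Continuously land micro-batches to HDFS as Parquet, with a checkpoint for recovery
    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/streaming/viewership/")
             .option("checkpointLocation", "hdfs:///checkpoints/viewership/")
             .outputMode("append")
             .start())

    query.awaitTermination()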

Environment: Hortonworks, Apache Spark, Hadoop, MapReduce, HDFS, Hive, AWS, Pig, Scala, Spark SQL, SQL, Kafka, Shell Script, JSON, Avro, Parquet, CSV, TXT
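
The SPI handling mentioned above combined FPE encryption with hashing; the sketch below illustrates only the hashing side, using Spark's built-in sha2 on a hypothetical sensitive column (FPE itself needs a dedicated library and is not shown).

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("spi-hashing").getOrCreate()

    # Hypothetical viewership records carrying a sensitive account identifier
    df = spark.createDataFrame(
        [("acct-001", "prog-42", 1800), ("acct-002", "prog-77", 900)],
        ["account_id", "program_id", "watch_seconds"])

    # Replace the raw identifier with a salted SHA-256 digest before publishing downstream
    salt = F.lit("example-salt")   # in practice the salt/keys would come from a secret store
    masked = (df
              .withColumn("account_id_hash", F.sha2(F.concat(F.col("account_id"), salt), 256))
              .drop("account_id"))

    masked.show(truncate=False)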

Confidential, Dallas, TX

Hadoop Developer

Responsibilities:

  • Collaborated with internal and client BAs to understand requirements and architect a data flow system.
  • Developed complete end-to-end Big Data processing in the Hadoop ecosystem; optimized Hive scripts to use HDFS efficiently through various compression mechanisms.
  • Developed Spark code using Scala and Spark SQL/Streaming for faster processing of data; used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs/MapReduce in Spark for data aggregation and queries, and wrote data back into the RDBMS through Sqoop (a sketch follows this list).
  • Loaded the data into Spark RDDs and performed in-memory computation to generate the output response.
  • Migrated complex MapReduce programs and Hive scripts into Spark RDD transformations and actions.
  • Wrote UDF/MapReduce jobs depending on the specific requirement.
  • Created Java algorithms to find mortgage risk and credit risk factors, and created algorithms for all complex map and reduce functionality of the MapReduce programs.
  • Tested all month-end changes in the DEV, SIT, and UAT environments and obtained business approvals to perform the same in production.
  • Successfully migrated a Hadoop cluster of 120 edge nodes to another shared cluster (HaaS, Hadoop as a Service) and set up the environments (DEV, SIT, and UAT) from scratch.
  • Wrote shell scripts to schedule Hadoop processes in Autosys by creating JIL files.
  • Wrote Spark SQL scripts to optimize query performance; imported data from Netezza onto our HDFS cluster.
  • Transferred data from AWS S3 to AWS Redshift using Informatica.
  • Worked extensively on code reviews and code remediation to meet coding standards.
  • Wrote Sqoop scripts to import and export data across various RDBMS systems; wrote Pig scripts to process unstructured data and make it available for processing in Hive.
  • Created Hive schemas using performance techniques such as partitioning and bucketing.
  • Served on the production support team, fixing bugs reported in production.
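
A short PySpark sketch of the kind of conversion described above: a HiveQL-style aggregation re-expressed as DataFrame transformations, with the result written back to a relational store. The write-back is shown over JDBC rather than Sqoop for brevity, the original scripts were in Scala, and the table and connection values are placeholders.

    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("hive-to-spark-aggregation")
             .enableHiveSupport()
             .getOrCreate())

    # Equivalent of a HiveQL GROUP BY report, expressed as DataFrame transformations
    loans = spark.table("finance.loans")
    risk_summary = (loans
                    .filter(F.col("status") == "active")
                    .groupBy("branch_id")
                    .agg(F.avg("risk_score").alias("avg_risk_score"),
                         F.count("*").alias("loan_count")))

    # Write the summary back to a relational store over JDBC
    (risk_summary.write.format("jdbc")
     .option("url", "jdbc:netezza://example-host:5480/reporting")
     .option("dbtable", "loan_risk_summary")
     .option("user", "report_user")
     .option("password", "****")
     .mode("overwrite")
     .save())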

Environment: Hortonworks, Apache Spark, Hadoop, MapReduce, HDFS, Hive, AWS, Pig, Scala, Spark SQL, Shell Script, Netezza, SQL

Confidential, San Jose, CA

Hadoop Developer

Responsibilities:

  • Involved in various phases of Software Development Life Cycle (SDLC).
  • The project was developed following Agile and Scrum methodologies.
  • Developed MapReduce jobs in Java for data cleansing and preprocessing (a mapper sketch follows this list).
  • Moved data between DB2/Oracle Exadata and HDFS using Sqoop.
  • Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
  • Worked with different file formats and compression techniques to determine standards
  • Developed data pipelines using Pig and Hive from Teradata and DB2 data sources; these pipelines included custom UDFs to extend the ETL functionality.
  • Developed Hive queries and UDFs to analyze and transform the data in HDFS.
  • Developed Hive scripts to implement control-table logic in HDFS.
  • Designed and implemented partitioning (static and dynamic) and bucketing in Hive.
  • Developed Pig scripts and UDFs per the business logic.
  • Developed user-defined functions in Pig.
  • Analyzed and transformed data with Hive and Pig.
  • Worked on developing, monitoring, and scheduling jobs using UNIX shell scripting.
  • Developed Oozie workflows, which are run monthly through a scheduler.
  • Designed and developed read lock capability in HDFS.
  • Implemented a Hadoop float type equivalent to the DB2 decimal.
  • Involved in End-to-End implementation of ETL logic.
  • Effectively coordinated with the offshore team and managed project deliverables on time.
  • Worked on QA support activities, test data creation, and unit testing in an Agile environment.
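
The cleansing MapReduce jobs here were written in Java; to keep the examples in one language, below is a minimal Hadoop Streaming mapper in Python that performs the same kind of record cleansing (the field count, delimiter, and rules are hypothetical).

    #!/usr/bin/env python
    # Hadoop Streaming mapper: drop malformed records and trim fields.
    # Example run: hadoop jar hadoop-streaming.jar -mapper cleanse_mapper.py -reducer NONE \
    #              -input /raw/records -output /clean/records
    import sys

    EXPECTED_FIELDS = 5      # hypothetical record width
    DELIMITER = "|"

    for line in sys.stdin:
        fields = line.rstrip("\n").split(DELIMITER)
        if len(fields) != EXPECTED_FIELDS:
            continue                         # skip malformed records
        cleaned = [f.strip() for f in fields]
        if not cleaned[0]:                   # require a non-empty key field
            continue
        print(DELIMITER.join(cleaned))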

Environment: Hadoop, HDFS, MapReduce, Hive, Pig, Shell Scripting, Teradata, SQL
