
Data Engineer Resume


Phoenix, AZ

SUMMARY

  • Dynamic and motivated IT professional with 6+ years of experience as a Big Data Engineer, with expertise in designing data-intensive applications using the Hadoop ecosystem, Big Data analytics, cloud data engineering, and data warehousing.
  • Proficient in working with large-scale datasets and distributed computing environments, leveraging techniques such as data partitioning and shuffling to minimize network overhead and improve job performance.
  • Good understanding of cloud technologies such as AWS, Azure, Google Cloud, and experienced in deploying Spark applications on cloud platforms.
  • Strong knowledge of the Hadoop ecosystem and related technologies such as HDFS, MapReduce, YARN, and Spark; expertise in designing and implementing data models and schemas for Hive tables.
  • Strong understanding of SQL and programming languages such as Python/Scala and Java.
  • Proficient in using Python libraries such as Pandas, NumPy, Requests, SQLAlchemy, and SciPy for data analysis and manipulation, and Airflow for workflow orchestration.
  • Experience in developing Spark applications using PySpark and Scala in Databricks for data extraction, transformation, and aggregation from multiple file formats, using map, filter, reduceByKey, and window operations (a minimal sketch follows this summary).
  • Used Spark SQL to query and analyze the processed data in real-time, allowing for faster decision-making and insights.
  • Extensive experience in developing batch processing pipelines using Apache Spark.
  • Built a real-time data processing and analysis pipeline using Spark Streaming to process data from various sources including Apache Kafka, Flume, and S3 buckets in AWS.
  • Optimized Spark Streaming job performance by tuning the batch interval, reducing data shuffling, and scaling worker nodes and memory allocation.
  • Utilized AWS services such as EC2, S3 and EMR to provision and manage the infrastructure required for the pipeline.
  • Good working knowledge of the Amazon Web Services (AWS) Cloud Platform, including services like EC2, S3, Glue, VPC, IAM, DynamoDB, Redshift, Lambda, EventBridge, CloudWatch, Auto Scaling, Security Groups, CloudFormation, Kinesis, SQS, and SNS.
  • Experience in migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Cloud Composer.
  • Leveraged Snowflake as the data warehouse to store and query the processed data, providing a scalable and reliable solution for data storage and analysis.
  • Troubleshot issues with the pipeline, such as data quality issues and system failures, using Spark UI, logs, and monitoring tools.
  • Experience in designing, partitioning, clustering, loading, and performance tuning for Hive tables in different file formats like Text, Avro, Parquet, and ORC.
  • Knowledge of job workflow scheduling and monitoring tools like Oozie and cluster coordination services like ZooKeeper. Proficient in designing and implementing big data workflows using Apache Airflow.
  • Experience in developing Unix/Linux shell scripts and Python scripts to schedule jobs using Airflow.
  • Continuously improved pipeline performance, scalability, and maintainability by adopting new technologies and best practices, such as using Delta Lake for data versioning and incremental processing.
  • Experience in writing Spark/Scala programs using DataFrames, Datasets, and RDDs to transform transactional database data and load it into data warehousing platforms such as Redshift tables.
  • Experience and understanding of implementing large-scale data warehousing programs and E2E data integration Confidential on Snowflake, AWS Redshift, BigQuery, ADLS, and Informatica Intelligent Cloud.
  • Good exposure to the usage of NoSQL databases like HBase, Cassandra, and MongoDB.
  • Strong hands-on knowledge of data warehousing platforms and databases like MS SQL Server, Oracle, MySQL, DB2, PostgreSQL, and Teradata, and writing ad hoc queries, stored procedures, functions, joins, and triggers for different data models.
  • Familiarity with data visualization and reporting tools such as Tableau and Power BI.
  • Good knowledge of microservices architecture and experience developing backend APIs.
  • Experience in developing RESTful and SOAP APIs documented with Swagger and tested with Postman.
  • Proficient in using containerization technologies such as Docker and Kubernetes to deploy Airflow in a distributed environment.
  • Excellent working experience in Agile/Scrum and Waterfall project execution methodologies.
  • Worked with a team of developers, data analysts and data scientists to build a scalable and maintainable data processing pipeline, ensuring high-quality code and adherence to coding standards.
  • Highly dedicated and motivated graduate student with strong analytical and communication skills.
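
A minimal PySpark sketch of the filter, aggregate, and window pattern referenced in the summary above; the S3 paths, table, and column names are hypothetical:

```python
# Minimal PySpark sketch (hypothetical paths and column names) of the
# filter -> aggregate -> window pattern referenced in the summary.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("claims-aggregation").getOrCreate()

# Read multi-format source data (Parquet here; Avro/ORC/CSV work the same way).
claims = spark.read.parquet("s3://example-bucket/claims/")  # hypothetical path

# Filter and aggregate daily claim amounts per customer.
daily = (claims
         .filter(F.col("status") == "APPROVED")
         .groupBy("customer_id", "claim_date")
         .agg(F.sum("amount").alias("daily_amount")))

# Rank each customer's days by total amount and keep the top three.
w = Window.partitionBy("customer_id").orderBy(F.col("daily_amount").desc())
top_days = (daily
            .withColumn("day_rank", F.row_number().over(w))
            .filter(F.col("day_rank") <= 3))

top_days.write.mode("overwrite").parquet("s3://example-bucket/claims_top_days/")
```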

TECHNICAL SKILLS

Big Data: HDFS, MapReduce, Hive, Sqoop, Spark, Oozie, Airflow, Autosys, Kafka, Terraform, Tableau, Power BI.

ETL/Data Warehousing: Informatica PowerCenter, Data Cleansing, OLAP, OLTP, Snowflake, BigQuery, FACT & Dimension Tables, Physical & Logical Data Modeling.

Databases: Oracle, Postgres, Teradata, SQL Server, MySQL, MS-Access

Languages: Java, SQL, HiveQL, Unix/Linux shell scripting, Scala, Python

NoSQL Databases: HBase

Cloud: AWS, GCP, Azure

Methodologies: Agile, Waterfall

PROFESSIONAL EXPERIENCE

Confidential, Phoenix, AZ

Data Engineer

Responsibilities:

  • Responsible for developing efficient Spark jobs on AWS cloud programs to detect and separate fraudulent claims from the claims data.
  • Designed and implemented data processing pipelines using Java/Python and related technologies, such as Apache Spark, Hadoop, and Spring.
  • Hands-on experience in Spark and Spark Streaming, creating RDDs and applying operations, transformations, and actions.
  • Extracted high-volume real-time data using Kafka and Spark Structured Streaming, creating streams, converting them into RDDs, processing them, and storing the results in Cassandra.
  • Knowledgeable in using Kafka's messaging system to efficiently process and transport data in distributed environments.
  • Designed and implemented ETL pipelines to load API or Kafka data in JSON file format into Snowflake database using Python.
  • Experienced in using Databricks for collaborative data engineering using Spark and machine learning libraries.
  • Proficient in designing and implementing complex serverless workflows using AWS Step Functions and other AWS services such as Lambda, S3, EMR, Redshift, and DynamoDB.
  • Implemented serverless data processing using AWS Lambda, S3, and Glue to improve scalability and reduce costs. Developed Lambda scripts in Python to launch EMR clusters on demand (a minimal sketch follows this list).
  • Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon S3 and Amazon DynamoDB.
  • Designed and developed ETL processes in AWS Glue to migrate data using crawlers in different file formats (Avro/ORC/JSON/Parquet/CSV/Text files) from external sources like S3 into AWS Redshift.
  • Responsible for creating on-demand tables on S3 files using Lambda Functions and AWS Glue using Python and PySpark. Proficient in writing complex Hive queries, optimizing them for performance and debugging issues.
  • Implemented security and access control measures using AWS IAM, KMS, and VPC to ensure the pipeline's confidentiality, integrity, and availability.
  • Wrote Spark jobs applying transformations, business rules, and data munging on S3 data, joining datasets using DataFrames, Datasets, and RDDs to transform transactional data and load it into Redshift tables. Wrote shell scripts for submitting Spark jobs.
  • Created Hive tables to store data in HDFS, loaded data, and wrote Hive queries that run internally as MapReduce jobs, using Hive, AWS Athena, and Redshift.
  • Created data pipelines for ingestion and aggregation of different events, loading consumer response data from an AWS S3 bucket into Hive external tables in HDFS to serve as a feed for Tableau dashboards.
  • Designed and implemented DAGs for ETL pipelines using Airflow's DAG structure.
  • Migrated existing Oozie workflows to Airflow and improved the overall performance of the pipelines.
  • Worked with data integration and processing tools such as Apache Airflow and Apache NiFi.
  • Proficient in Azure HDInsight for processing and analyzing large-scale datasets using Hadoop, Spark, and other big data technologies.
  • Experienced in developing and implementing data pipelines using Azure Data Factory for ingestion, transformation, and storage of big data.
  • Proficient in designing and developing data models using Azure Data Lake Store and Azure Data Lake Analytics. Skilled in using Azure Stream Analytics for real-time data processing and analysis.
  • Experienced in working with Azure Cosmos DB for NoSQL database management and data analytics.
  • Knowledge of Azure Databricks, Azure Synapse Analytics for integrating and analyzing big data with Power BI and other visualization tools.
  • Built centralized logging to enable better debugging using AWS CloudWatch, Azure Monitor, Elasticsearch, Logstash, and Kibana, and efficiently handled periodic exporting of SQL data into Elasticsearch.
  • Worked on GitHub and Jenkins continuous integration tool for deployment of project packages.
  • Used Git for version control and collaboration, ensuring that the codebase is well-maintained and up to date.
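
A minimal sketch of the on-demand EMR launch pattern mentioned above, written as a Python Lambda handler using boto3; the cluster name, S3 paths, and IAM role names are hypothetical:

```python
# Minimal sketch (hypothetical names and paths) of a Lambda handler that
# launches an EMR cluster on demand and submits a Spark step.
import boto3

emr = boto3.client("emr")

def lambda_handler(event, context):
    response = emr.run_job_flow(
        Name="claims-fraud-detection",                    # hypothetical cluster name
        ReleaseLabel="emr-6.10.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,         # terminate when the step finishes
        },
        Steps=[{
            "Name": "spark-claims-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-bucket/jobs/claims_job.py"],  # hypothetical
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        LogUri="s3://example-bucket/emr-logs/",
    )
    return {"JobFlowId": response["JobFlowId"]}
```

In practice a handler like this would be triggered by an EventBridge rule or an S3 event, matching the event-driven serverless setup described above.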

Environment: Hadoop, HDFS, Pig, Hive, MapReduce, Sqoop, Oozie, Java, Scala, Docker, Splunk, AWS services (EMR, Lambda, EventBridge, Glue, S3, DynamoDB, Redshift, Athena, SNS, SQS, Kinesis, QuickSight), GCP (Dataproc, BigQuery, Bigtable, Cloud Storage, Composer, Dataflow & Data Fusion), Azure (ADF, Databricks, Azure DevOps, Azure Data Explorer (Kusto), Azure Cosmos DB, Azure Blob, Azure Synapse), SQL, NoSQL, Kafka, Snowflake, Cassandra, Python, UDF, Jira, Git, Bitbucket, Jenkins, HBase, Flume, Spark, Solr, ZooKeeper, ETL.

Confidential, New York, NJ

Role: Big Data Developer

Responsibilities:

  • Ingested data into the Big Data environment from RDBMS databases.
  • Developed and managed technology Confidential portfolios and identified new technology approaches to solve business problems with a strong focus on leveraging enterprise data.
  • Created Hive tables per requirements and optimized queries.
  • Integrated UDFs into existing data processing frameworks such as Spark, Hadoop, Python, and SQL to enable scalable and distributed processing. Maintained a library of reusable UDFs to support efficient and consistent data processing across multiple projects.
  • Worked on data ingestion pipeline from various sources into GCP cloud storage.
  • Implemented partitions and buckets based on state for further processing using bucket-based Hive joins.
  • Involved in integrating Hive queries into the Spark environment using Spark SQL and PySpark.
  • Used the Spark Streaming API to perform necessary transformations and actions on the fly to build the common learner data model, which gets data from Kafka in near real time and persists it into S3.
  • Developed PySpark and SparkSQL code to process the data in Apache Spark on Amazon EMR to perform the necessary transformations based on the STMs developed.
  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources like S3 (ORC/Parquet/Text files) into AWS Redshift.
  • Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
  • Migrated a couple of applications to the AWS cloud in a hybrid environment.
  • Provisioned highly available EC2 instances using Terraform and CloudFormation and wrote new plugins to support new functionality in Terraform.
  • Migrated Big Data applications to the GCP cloud.
  • Hands-on experience in migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Cloud Composer.
  • Wrote Python DAGs in Airflow that orchestrate end-to-end data pipelines for multiple applications (a minimal sketch follows this list).
  • Built data pipelines in Airflow on GCP for ETL jobs using different Airflow operators.
  • Created custom data connectors and adapters in Java to integrate with various data sources, such as relational databases, NoSQL databases, and data warehouses.
  • Designed and created a private network in the cloud, managed communication between resources through network peering, and organized accounts and policies.
  • Interacted with offshore team members and business users in daily calls for project updates.
  • Responsible for data integrity checks and duplicate checks to ensure the data is not corrupted.
  • Developed various scripting functionality using shell scripts and Python.
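
A minimal Airflow DAG sketch of the kind of end-to-end orchestration described above; the DAG id, task names, and script paths are hypothetical:

```python
# Minimal Airflow DAG sketch (hypothetical DAG id, task names, and commands)
# of an extract -> transform -> load pipeline as described above.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_etl_pipeline",        # hypothetical
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract_to_gcs",
        bash_command="python /opt/jobs/extract_to_gcs.py",        # hypothetical script
    )
    transform = BashOperator(
        task_id="transform_with_dataproc",
        bash_command="python /opt/jobs/submit_dataproc_job.py",   # hypothetical script
    )
    load = BashOperator(
        task_id="load_to_bigquery",
        bash_command="python /opt/jobs/load_to_bigquery.py",      # hypothetical script
    )

    # Task dependencies define the pipeline order.
    extract >> transform >> load
```

With Cloud Composer, a DAG file like this is deployed by placing it in the environment's DAGs bucket; provider-specific operators (Dataproc, BigQuery) could replace the BashOperator tasks.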

Environment: HDFS, MapReduce, Hive, Sqoop, PySpark, Java, Spark, Scala, Python, Linux, MySQL, GCP (Dataproc, BigQuery, Bigtable, Pub/Sub, Cloud Storage, Dataflow & Data Fusion), AWS (S3, Redshift, EMR, Glue, CloudWatch, Lambda), Airflow, REST API, Jira, Git.

Confidential, Dublin, OH

Role: Hadoop Developer

Responsibilities:

  • Worked on analyzing Hadoop cluster and different big data analytic tools including Pig, HBase database and Sqoop.
  • Responsible for building scalable distributed data Confidential using Hadoop.
  • Implemented a nine-node CDH3 Hadoop cluster on Red Hat Linux. Involved in loading data from the Linux file system to HDFS.
  • Worked on installing cluster, commissioning & decommissioning of data node, name node recovery, capacity planning, and slots configuration. Created HBase tables to store variable data formats of PII data coming from different portfolios.
  • Was responsible for creating on-demand tables on S3 files using Lambda Functions and AWS Glue using Python and PySpark.
  • Created monitors, alarms, notifications and logs for Lambda functions, Glue Jobs, EC2 hosts using CloudWatch.
  • Used AWS Glue for data transformation, validation, and cleansing (a minimal sketch follows this list).
  • Worked in GCP-based Big Data deployments (batch/real-time) leveraging BigQuery, Bigtable, Cloud Storage, Pub/Sub, Data Fusion, Dataflow, Dataproc, etc.
  • Worked on building dashboards in Tableau with ODBC connections from different sources like BigQuery and the Presto SQL engine.
  • Implemented a script to transmit information from Oracle to HBase using Sqoop.
  • Implemented best income logic using Pig scripts and UDFs.
  • Implemented test scripts to support test driven development and continuous integration.
  • Responsible for managing data coming from different sources. Involved in loading data from the UNIX file system to HDFS.
  • Loaded and transformed large sets of structured, semi-structured, and unstructured data.
  • Cluster coordination services through Zookeeper. Experience in managing and reviewing Hadoop log files.
  • Installed Oozie workflow engine to run multiple Hive and pig jobs.
  • Analyzed large amounts of data sets to determine optimal way to aggregate and report on it.
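
A minimal AWS Glue (PySpark) job sketch of the transform-and-cleanse pattern described above; the catalog database, table, and S3 path are hypothetical:

```python
# Minimal AWS Glue job sketch (hypothetical database/table/path names) of the
# catalog-read -> map/cleanse -> S3-write pattern described above.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping, DropNullFields
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the Glue Data Catalog (hypothetical database and table).
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="claims_raw")

# Rename/cast columns and drop null-only fields as a simple cleansing step.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("claim_id", "string", "claim_id", "string"),
              ("amount", "double", "claim_amount", "double")])
cleaned = DropNullFields.apply(frame=mapped)

# Write the cleansed data back to S3 as Parquet (hypothetical path).
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/claims_clean/"},
    format="parquet")

job.commit()
```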

Environment: Hadoop, HDFS, Pig, Hive, Sqoop, HBase, Shell Scripting, Ubuntu, Linux Red Hat, AWS (S3, Redshift, EMR, Glue, CloudWatch, Lambda), GCP (Dataproc, BigQuery, Bigtable, Cloud Storage, Dataflow & Data Fusion).

Confidential

Role: Hadoop and Spark Developer

Responsibilities:

  • Involved in configuring Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and NiFi for data cleaning and preprocessing.
  • Imported and exported data into HDFS from Oracle database and vice versa using Sqoop.
  • Used Hive, Pig, and Talend as ETL tools for event joins, filters, transformations, and pre-aggregations.
  • Created partitions and buckets across state in Hive to handle structured data using Elasticsearch.
  • Developed a workflow in Oozie to orchestrate a series of Pig scripts to cleanse data, such as removing unwanted information or merging many small files into a handful of very large, compressed files, using Pig pipelines in the data preparation stage.
  • Involved in moving all log files generated from various sources to HDFS for further processing through Elasticsearch, Kafka, Flume, and Talend, and processed the files using Piggybank.
  • Extensively used Pig to communicate with Hive using HCatalog and with HBase using handlers.
  • Used the Spark SQL Scala and Python interfaces, which automatically convert RDDs of case classes to schema RDDs.
  • Built and maintained ETL (Extract, Transform, Load) jobs and workflows using Java frameworks and tools, such as Apache NiFi.
  • Used Spark SQL to read and write tables stored in Hive and Amazon EMR (a minimal sketch follows this list).
  • Performed Sqooping for various file transfers through the HBase tables for processing data into several NoSQL databases: Cassandra and MongoDB.
  • Created tables, secondary indexes, join indexes, and views in the Teradata development environment for testing.
  • Captured data logs from the web server and Elasticsearch into HDFS using Flume for analysis. Managed and reviewed Hadoop log files.
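
A minimal PySpark sketch of reading and writing Hive tables through Spark SQL, as described above; the database and table names are hypothetical:

```python
# Minimal sketch (hypothetical database/table names) of using Spark SQL to
# read and write Hive-managed tables, as described above.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-read-write")
         .enableHiveSupport()   # lets Spark SQL see the Hive metastore
         .getOrCreate())

# Read an existing Hive table and run a simple aggregation.
events = spark.sql("SELECT user_id, event_type FROM analytics_db.web_events")
counts = events.groupBy("event_type").count()

# Persist the result back to Hive as a new table.
counts.write.mode("overwrite").saveAsTable("analytics_db.event_type_counts")
```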

Environment: Hive, Pig, MapReduce, Apache NiFi, Sqoop, Oozie, Flume, Kafka, Talend, EMR, Storm, HBase, Unix, Linux, Python, Spark, SQL, Hadoop 1.x, HDFS, GitHub, Python Scripting.
