
Sr. Big Data Engineer Resume


Loveland, CO

SUMMARY

  • Sr. Data Engineer with around 7 years of experience across all phases of the software development process, including analysis, development, testing and deployment.
  • Deep technical knowledge and practical understanding of Big Data, Cloud and DevOps tooling, along with experience in programming languages such as Java, Scala and Python.
  • Strong experience building both batch and real-time data engineering pipelines using Big Data and Hadoop technologies.
  • Good experience working with tools and frameworks across the Big Data ecosystem, such as Spark, MapReduce, Hive, Sqoop, Oozie, Kafka, YARN, Impala and HBase.
  • Strong experience working with both on-prem Hadoop clusters (Cloudera and Hortonworks) and AWS EMR clusters.
  • Strong knowledge of Java, Scala and Python; used Scala and Python primarily for building Spark applications.
  • Good experience working with AWS Cloud services such as S3, EMR, Redshift, Athena and Glue Metastore.
  • Hands-on experience with Azure Cloud services such as Blob Storage, Azure SQL Database and SQL Data Warehouse, Azure Data Factory and Azure Databricks (Spark).
  • Used Spark extensively to perform data transformations, data validations and data aggregations; a minimal sketch of this pattern follows this summary list.
  • Strong experience troubleshooting and optimizing Spark applications.
  • Extensive experience working with the Spark DataFrame API, Spark SQL, Spark Streaming and Spark ML.
  • Hands-on experience with data ingestion tools such as Apache Sqoop for importing data from and exporting data to relational database management systems (RDBMS).
  • Worked on real-time data integration using Kafka, Spark Streaming and HBase.
  • Experience developing Kafka producers and consumers for streaming millions of events.
  • In-depth understanding of Hadoop architecture and its components, such as the Resource Manager, Node Manager, Application Master, NameNode and DataNode.
  • Job workflow scheduling and monitoring using tools like Oozie.
  • Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from different sources like Hive and Spark.
  • Created Hive External and Managed Tables.
  • Implemented Partitioning and Bucketing on Hive tables for Hive Query Optimization.
  • Experienced in writing Oozie workflows and coordinator jobs to schedule data pipelines.
  • Experience in Implementing Continuous Delivery pipeline with Maven, Ant, Jenkins and AWS.
  • Experience writing shell scripts on Linux and integrating them with other solutions.
  • Strong Experience in working with Databases like Oracle 10g, DB2, SQL Server 2008 and MySQL and proficiency in writing complex SQL queries.
  • Experience in using PL/SQL to write Stored Procedures, Functions and Triggers.
  • Excellent communication, interpersonal and analytical skills and a highly motivated team player with the ability to work independently.
  • Designed and developed REST-based microservices using the Spring Boot framework.
  • Experience using build tools like Maven and SBT, version control tools like GitHub, CI/CD tools like Jenkins, and Agile project management tools like JIRA.
  • Responsible for creating shell scripts to automate routine tasks such as log file management, archival and purging.
  • Experience working with containerization engines like Docker and Kubernetes.
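
The bullet on Spark transformations above references the following minimal sketch. It is a Scala outline of a typical batch job of this kind, assuming hypothetical input paths, table names and columns: read raw data, validate it, aggregate it, and write it back as a partitioned table for downstream querying.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object DailyOrderAggregation {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("daily-order-aggregation")
          .enableHiveSupport()
          .getOrCreate()

        // Read raw orders landed in S3 (hypothetical path and schema).
        val orders = spark.read.parquet("s3://example-bucket/raw/orders/")

        // Basic validation: drop records missing a key or an amount.
        val valid = orders.filter(col("order_id").isNotNull && col("amount").isNotNull)

        // Aggregate revenue and order counts per customer per day.
        val daily = valid
          .groupBy(col("customer_id"), to_date(col("order_ts")).as("order_date"))
          .agg(sum("amount").as("revenue"), count("order_id").as("orders"))

        // Write back as a partitioned table (hypothetical name) for downstream Hive/Athena queries.
        daily.write
          .mode("overwrite")
          .partitionBy("order_date")
          .saveAsTable("analytics.daily_customer_orders")

        spark.stop()
      }
    }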

TECHNICAL SKILLS

Big Data Ecosystem: Spark, Hive, MapReduce, YARN, Sqoop, Kafka

Hadoop Distributions: Cloudera, Hortonworks and AWS EMR

Cloud Ecosystem: AWS S3, EMR, Step Functions, Redshift, CloudWatch, Athena, Glue Metastore, Lambda; Azure Blob Storage, Azure SQL Database and SQL Data Warehouse, Azure Data Factory, Azure Databricks (Spark)

Languages: Java, Python, Scala

Databases: Teradata, MySQL, PostgreSQL

NoSQL Databases: DynamoDB, HBase

Build and Other Tools: Maven, SBT, Jenkins, GitHub

PROFESSIONAL EXPERIENCE

Confidential, Loveland, CO

Sr. Big Data Engineer

Responsibilities:

  • Worked on building centralized Data lake on AWS Cloud utilizing primary services like S3, EMR, Redshift, Athena and Glue.
  • Migrated datasets and ETL workloads from on-prem systems to AWS Cloud services.
  • Built a series of Spark applications and Hive scripts to produce the analytical datasets needed by digital marketing teams.
  • Worked extensively on building and automating data ingestion pipelines and moving terabytes of data from existing data warehouses to the cloud.
  • Worked extensively on fine-tuning Spark applications and providing production support for various pipelines running in production.
  • Worked closely with business teams and data science teams to ensure all requirements were translated accurately into our data pipelines.
  • Developed Spark-based pipelines using DataFrame operations to load data into the enterprise data lake (EDL), with EMR for job execution and AWS S3 as the storage layer.
  • Worked on full spectrum of data engineering pipelines: data ingestion, data transformations and data analysis/consumption.
  • Developed AWS Lambda functions, orchestrated with Step Functions, to drive data pipelines.
  • Worked on automating infrastructure setup, including launching and terminating EMR clusters.
  • Created Hive external tables on top of datasets loaded in S3 buckets and wrote various Hive scripts to produce a series of aggregated datasets for downstream analysis.
  • Built a real-time streaming pipeline utilizing Kafka, Spark Streaming and Redshift.
  • Created Kafka producers using the Kafka Java Producer API to connect to an external REST live-stream application and publish messages to a Kafka topic (a minimal producer sketch follows this section).
  • Implemented a Continuous Delivery pipeline with Bitbucket and AWS AMIs.
  • Documented operational problems following standards and procedures, using Jira.

Environment: AWS S3, EMR, Lambda, Redshift, Athena, Glue, Spark, Hive, Kafka, Scala, Python, Java, Bitbucket, Jira.
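
The Kafka producer bullet above references the following minimal sketch. It is written in Scala against the Kafka Java Producer API; the broker addresses, REST endpoint and topic name are hypothetical placeholders, not details from the original project.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object LiveStreamProducer {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092,broker2:9092") // hypothetical brokers
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("acks", "all")

        val producer = new KafkaProducer[String, String](props)
        try {
          // Read events from an external REST live stream (hypothetical endpoint)
          // and publish each line as a message to the topic.
          val events = scala.io.Source.fromURL("https://example.com/live-events").getLines()
          events.foreach { event =>
            producer.send(new ProducerRecord[String, String]("live-events-topic", event))
          }
        } finally {
          producer.flush()
          producer.close()
        }
      }
    }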

Confidential, Dallas, TX

Data Engineer

Responsibilities:

  • Designed and developed IT solutions using Big Data tools.
  • Created pipelines in Azure Data Factory (ADF) using Linked Services, Datasets and Pipelines to extract, transform and load data to and from sources such as Azure SQL Database, Blob Storage and Azure SQL Data Warehouse.
  • Responsible for estimating the cluster size, monitoring and troubleshooting of the Hadoop cluster.
  • Used Zeppelin, Jupyter notebooks and spark-shell to develop, test and analyze Spark jobs before scheduling customized Spark jobs.
  • Performed data analysis and collaborated with the downstream analytics team to shape the data according to their requirements.
  • Performance-tuned Spark applications by setting the right batch interval, the correct level of parallelism and appropriate memory settings.
  • Wrote UDFs in Scala and stored procedures to meet specific business requirements (a minimal UDF sketch follows this section).
  • Replaced existing MapReduce programs and Hive queries with Spark applications written in Scala.
  • Deployed and tested code through CI/CD using Visual Studio Team Services (VSTS).
  • Conducted code reviews for team members to ensure proper test coverage and consistent code standards.
  • Responsible for documenting the process and cleanup of unwanted data
  • Responsible for ingestion of data from Blob Storage to Kusto and for maintaining the PPE and PROD pipelines.
  • Created HDInsight clusters and Storage Accounts with an end-to-end environment for running jobs.
  • Developed JSON scripts to deploy pipelines in Azure Data Factory (ADF) that process data using the Cosmos activity.
  • Developed PowerShell scripts for automation purposes.
  • Created builds and releases for multiple projects (modules) in the production environment using Visual Studio Team Services (VSTS).
  • Used the ScalaTest FunSuite framework to develop unit test cases and integration tests.
  • Worked on Spark SQL queries and DataFrames: imported data from data sources, performed transformations and read/write operations, and saved results to output directories in HDFS.
  • Ran Cosmos scripts in Visual Studio 2017/2015 to check diagnostics.
  • Worked in an Agile development environment with two-week sprint cycles, dividing and organizing tasks.

Environment: Azure Cloud Services, Databricks, Blob Storage, ADF, Azure SQL Server, HDFS, Pig, Hive, Spark, Kafka, IntelliJ, Cosmos, SBT, Zeppelin, YARN, Scala, SQL, Git
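
The Scala UDF bullet above references the following minimal sketch: defining a UDF for a hypothetical business rule, registering it for use from Spark SQL, and applying it to a small DataFrame. The rule, column names and values are illustrative assumptions only.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    object RegionUdfExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("scala-udf-example").getOrCreate()
        import spark.implicits._

        // Hypothetical business rule: normalize free-text region codes.
        val normalizeRegion = udf { (raw: String) =>
          Option(raw).map(_.trim.toUpperCase) match {
            case Some("N") | Some("NORTH") => "NORTH"
            case Some("S") | Some("SOUTH") => "SOUTH"
            case Some(other)               => other
            case None                      => "UNKNOWN"
          }
        }

        // Register the same rule so it can also be called from Spark SQL.
        spark.udf.register("normalize_region", normalizeRegion)

        val sales = Seq(("1001", "north"), ("1002", null)).toDF("order_id", "region")
        sales.withColumn("region_std", normalizeRegion($"region")).show()

        spark.stop()
      }
    }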

Confidential, Rockville, MD

Hadoop/Spark Developer

Responsibilities:

  • Developing and maintaining a Data Lake containing regulatory data for federal reporting with big data technologies such as Hadoop Distributed File System (HDFS), Apache Impala, Apache Hive and Cloudera distribution.
  • Developing ETL jobs to extract data from sources such as Oracle and Microsoft SQL Server, transform the extracted data using Hive Query Language (HQL), and load it into the Hadoop Distributed File System (HDFS).
  • Involved in importing data from different sources into HDFS using Sqoop, applying transformations using Hive and Spark, and then loading the data into Hive tables (a minimal sketch of this load pattern follows this section).
  • Fixing data-related issues within the Data Lake.
  • Primarily involved in the data migration process on AWS, integrating with GitHub repositories and Jenkins.
  • Designed, developed and maintained data integration programs in a Hadoop and RDBMS environment, with both traditional and non-traditional source systems as well as RDBMS and NoSQL data stores for data access and analysis.
  • Used the in-memory computing capabilities of Spark to perform procedures such as text analysis and processing, using Scala.
  • Primarily responsible for designing, implementing, testing and maintaining database solutions on AWS.
  • Worked with Spark Streaming, dividing incoming data into batches for processing through the Spark engine.
  • Implementing new functionality in the Data Lake using big data technologies such as Hadoop Distributed File System (HDFS), Apache Impala and Apache Hive based on the requirements provided by the client.
  • Communicating regularly with the business teams along with the project manager to ensure that any gaps between the client’s requirements and project’s technical requirements are resolved.
  • Developing Python scripts that use HDFS APIs to generate curl commands for migrating data and for preparing different environments within the project.
  • Monitoring production jobs using Control-M on a daily basis.
  • Coordinating the Production releases with the change management team using Remedy tool.
  • Communicating effectively with team members and conducting code reviews.

Environment: Hadoop, Data Lake, AWS, Python, Spark, Hive, Cassandra, ETL Informatica, Cloudera, Oracle 10g, Microsoft SQL Server, Control-M, Linux
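
The Sqoop-and-Hive bullet above references the following minimal sketch. It shows, in Scala, one way a Sqoop-landed staging directory could be read, cleansed and loaded into a curated Hive table with Spark; the paths, schema and table name are hypothetical assumptions, not details from the original project.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object SqoopLandingToHive {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("sqoop-landing-to-hive")
          .enableHiveSupport()
          .getOrCreate()

        // Sqoop lands tab-delimited extracts from Oracle/SQL Server into an HDFS staging area.
        val staged = spark.read
          .option("header", "false")
          .option("delimiter", "\t")
          .csv("hdfs:///data/staging/accounts/")
          .toDF("account_id", "branch", "balance", "updated_at")

        // Cleanse and cast before loading into the curated Hive layer.
        val curated = staged
          .filter(col("account_id").isNotNull)
          .withColumn("balance", col("balance").cast("decimal(18,2)"))
          .withColumn("updated_at", to_timestamp(col("updated_at")))

        // Append into an existing Hive table (hypothetical name).
        curated.write.mode("append").insertInto("curated.accounts")

        spark.stop()
      }
    }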

Confidential

Big Data Developer

Responsibilities:

  • Involved in importing and exporting data between Hadoop Data Lake and Relational Systems like Oracle, MySQL using Sqoop.
  • Involved in developing Spark applications to perform ELT-style operations on the data.
  • Converted existing MapReduce jobs to Spark transformations and actions using Spark RDDs, DataFrames and the Spark SQL API (a minimal conversion sketch follows this section).
  • Utilized Hive partitioning and bucketing, and performed various kinds of joins on Hive tables.
  • Involved in creating Hive external tables to perform ETL on data produced on a daily basis.
  • Validated the data being ingested into Hive for further filtering and cleansing.
  • Developed Sqoop jobs for performing incremental loads from RDBMS into HDFS and further applied Spark transformations
  • Loaded data into Hive tables from Spark using the Parquet columnar format.
  • Created Oozie workflows to automate and productionize the data pipelines.
  • Migrated MapReduce code to Spark transformations using Spark and Scala.
  • Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
  • Did a PoC on GCP cloud services to assess the feasibility of migrating the on-prem setup to GCP, utilizing services such as Dataproc, BigQuery and Cloud Storage.
  • Documented operational problems following standards and procedures, using JIRA.

Environment: CDH, Hadoop, Hive, Impala, Oracle, Spark, Pig, Sqoop, Oozie, Map Reduce, GIT, Confluence, Jenkins.
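
The MapReduce-to-Spark bullet above references the following minimal sketch: a classic map/reduce log-counting job re-expressed as Spark RDD transformations in Scala, with the result saved as Parquet. The log path, token position and output location are hypothetical assumptions.

    import org.apache.spark.sql.SparkSession

    object ErrorCountsRddExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("mapreduce-to-spark").getOrCreate()
        val sc = spark.sparkContext
        import spark.implicits._

        // Equivalent of a MapReduce job: map each error line to (errorCode, 1), then reduce by key.
        val logs = sc.textFile("hdfs:///data/logs/app/*.log") // hypothetical input path
        val errorCounts = logs
          .filter(_.contains("ERROR"))
          .map(line => (line.split("\\s+")(2), 1)) // assumes the error code is the third token
          .reduceByKey(_ + _)

        // Persist the counts as Parquet so Hive/Impala can query them directly.
        errorCounts.toDF("error_code", "occurrences")
          .write.mode("overwrite")
          .parquet("hdfs:///data/curated/error_counts/")

        spark.stop()
      }
    }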

Confidential

Java Developer

Responsibilities:

  • Involved in the implementation of service layer and DAO.
  • Developed classes in the Business Layer and in the Data Access Layer in C#.
  • Deployed Web services for online transactions using C# and exposed them through SOAP and HTTP.
  • Responsible for designing JSPs as per the requirements.
  • Developed Java applications with various features like multi-threading and queues.
  • Worked on JDBC, the Collections API and generics, multithreading, and file handling.
  • Experience building various components in microservice applications
  • Wrote SQL queries to create databases and tables and to load data.
  • Involved in developing and deploying the Server-Side components.
  • Consumed JSON RESTful Web Services and sent responses with Spring MVC.
  • Developed core Java classes for utility classes, business logic, and test cases.
  • Supported legacy projects built using Teradata and Informatica.
  • Developed Maven and Jenkins scripts for building and deployment of the artifacts.
