Big Data Engineer Resume

Alpharetta, GA

SUMMARY

  • Over 7 years of IT experience in software development and Big Data technologies, including the data lifecycle, data pipelines, data analytics and machine learning, with a particular focus on high-volume real-time data (hundreds of billions of messages per day).
  • Extensive experience in ETL development and in working with large-scale data in real time.
  • Hands-on experience with Big Data streaming and analytics technologies such as Kafka, Spark, Pulsar and NiFi.
  • Experience operationalizing NiFi on Kubernetes and AWS to transfer high volume real-time data between different sources and destination systems. Experience in implementing production NiFi clusters on AWS with SSL and Kerberos.
  • Expertise in building and operating high scale systems with a focus on reliability, scalability and solving business needs.
  • Proven track record of strong hands-on technical leadership, with experience building big data applications and platforms from conception to production.
  • Experience working with remote and global teams and in cross-team collaboration.
  • Expertise in SQL (relational databases), key-value datastores, and document stores.
  • Experience with high-velocity high-volume stream processing: Apache Kafka, Apache NiFi, Apache Pulsar
  • Experience configuring, deploying and administering NiFi installations
  • Experience in building data ingestion workflows/pipeline flows using NiFi, NiFi registry and creating custom NiFi processors
  • Experienced in handling different file formats such as text, Avro, delimited, SequenceFile, XML and JSON.
  • Designed and implemented data engineering solutions (e.g., locating and extracting data from a variety of sources for use in reporting, analytics and data science modeling to drive continuous improvement and development).
  • Strong experience implementing software solutions in enterprise Linux and Unix environments.
  • Worked closely with IT application teams, Enterprise architecture, infrastructure and information security to translate business and technical requirements into data-driven solutions.
  • End-to-End software product development and management skills
  • Strong technical background with knowledge of DevOps and cloud-native practices.
  • Experience working with data lake architecture in the data domain.
  • Collaborated with data science, analytics and reporting teams to productize new methods of extracting information from structured and unstructured data.
  • Worked with geographically distributed teams to deliver high-volume data supporting multiple application development efforts.
  • Expertise in ETL, data analysis and designing data warehouse strategies.
  • Expertise in using Apache NiFi to automate data movement between different Hadoop systems.
  • Developed and implemented robust and scalable data pipelines using Python, SQL and multiple processing frameworks.
  • Worked closely with infrastructure, network, database, business intelligence and application teams to ensure data is delivered per application requirements.
  • Strong experience analyzing large data sets by writing PySpark scripts and Hive queries.
  • Experience importing and exporting data with Sqoop between HDFS and relational databases.
  • Experience with Oozie Workflow Engine in running workflow jobs with actions that run Hadoop MapReduce, Hive, Spark jobs.
  • Experience in implementing Kerberos authentication protocol in Hadoop for data security.
  • Hands on experience in using Amazon Web Services like EC2, EMR, RedShift, DynamoDB and S3.
  • Hands-on experience using Apache Kafka to track data ingestion into the Hadoop cluster and implementing custom Kafka encoders to load custom input formats into Kafka partitions.
  • Experience using Spark Streaming to ingest data from multiple data sources into HDFS (a minimal sketch follows this list).
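
As a minimal illustration of the Spark Streaming bullet above, the sketch below reads JSON events from a Kafka topic with Spark Structured Streaming and appends them to HDFS as Parquet. The broker address, topic name, schema and paths are assumptions for illustration, not details of a specific project.

```python
# Minimal sketch: Kafka -> Structured Streaming -> HDFS Parquet.
# Requires the spark-sql-kafka connector on the classpath (e.g. spark-submit --packages).
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Assumed message shape: {"device_id": "...", "metric": 1.0, "ts": "..."}
schema = StructType([
    StructField("device_id", StringType()),
    StructField("metric", DoubleType()),
    StructField("ts", TimestampType()),
])

# Read the raw Kafka stream; "value" arrives as bytes and is parsed as JSON.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # assumed broker address
       .option("subscribe", "events")                       # assumed topic name
       .load())

parsed = raw.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")

# Continuously append the parsed records to HDFS as Parquet.
query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events")                 # assumed output path
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .start())
query.awaitTermination()
```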

TECHNICAL SKILLS

Databases: Oracle, Teradata, MongoDB, Prometheus, MySQL, PostgreSQL.

Platforms: Windows, Ubuntu, CentOS, Red Hat Enterprise Linux (RHEL)

Management Tools: Terraform, Jenkins, Git, Argo, Flux, Airflow

Containers & Orchestration: Docker, Kubernetes and EKS.

Monitoring: CloudWatch, Elasticsearch, Prometheus, Grafana, Splunk

Cloud Infrastructure: Amazon Web Services, Azure, Google Cloud Platform

Programming Languages: Java, PL/SQL, Python, HiveQL, Scala, SQL

Big Data Ecosystem: NiFi, Kafka, Pulsar, HDFS, Hive, Pig, HBase, Spark

PROFESSIONAL EXPERIENCE

Big Data Engineer

Confidential, Alpharetta, GA

Responsibilities:

  • As a Big Data Engineer, managed and administered the dataflow architecture and infrastructure for Network Performance Intelligence at Confidential.
  • Operationalized NiFi on Kubernetes and AWS to transfer high-volume real-time data between different source and destination systems.
  • Managed data ingestion with modern ETL compute and orchestration frameworks, including Apache Pulsar, Apache NiFi and Apache Kafka.
  • Worked extensively with Apache NiFi, Apache Kafka, Apache Pulsar and Hadoop HDFS for large-scale data routing, modeling, extraction, transformation, loading and warehousing
  • Developed a secure data streaming solution using Apache NiFi, Apache Pulsar and Apache Kafka to deliver highly sensitive data with low latency to multiple teams across Confidential (a minimal producer sketch follows this list).
  • Managed resources and designed dataflows to migrate high-volume data to the Confidential corporate grid.
  • Coordinated with overseas application development, analytics implementation and data science teams to deliver data per business requirements for building application solutions.
  • Designed, built and maintained infrastructure to support large-scale data routing, streaming, historical processing and data warehousing.
  • Built efficient, secure, real-time data pipelines for ingesting and processing high volume and highly sensitive data to multiple data teams.
  • Administered NiFi applications in development, user acceptance tests and production environments.
  • Maintained the full software development life cycle of the Network Performance Intelligence project, including deploying production code and managing weekly releases using Git/GitLab.
  • Created dataflows for hundreds of feeds collected across Confidential devices and services and delivered them to Splunk, Tableau and analytics applications.
  • Automated the process from the point of data generation to delivery of structured data to Confidential core-business teams.
  • Delivered data by transforming unstructured data collected from multiple data sources across the Confidential network with Apache Spark Streaming.
  • Implemented mapping of data sources, data movement, interfaces, and analytics, with the goal of ensuring data quality.
  • Collaborated with project managers and business unit leaders for all projects involving enterprise data delivery.
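
To illustrate the delivery side of the streaming bullet above, here is a minimal Apache Pulsar producer sketch using the pulsar-client Python library. The service URL, topic name and payload are assumptions, and TLS/authentication configuration is omitted.

```python
# Minimal sketch: publish a JSON record to a Pulsar topic.
import json
import pulsar

client = pulsar.Client("pulsar://localhost:6650")   # assumed broker service URL
producer = client.create_producer("persistent://public/default/network-perf")  # assumed topic

record = {"device": "router-01", "latency_ms": 12.4}     # hypothetical payload
producer.send(json.dumps(record).encode("utf-8"))        # Pulsar payloads are bytes

producer.close()
client.close()
```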

Environment: Apache NiFi, Apache Kafka, Apache Pulsar, Hadoop, Spark, Hive, Pig, HBase, Oozie, Linux, Python, Bash/Shell Scripting, SQL, Storm, Unix, Git, Jenkins, GitLab

Sr. Hadoop Developer

Confidential, Dallas, TX

Responsibilities:

  • Involved heavily in setting up the CI/CD pipeline using Jenkins, Maven, Nexus, GitHub, Chef, Terraform and AWS.
  • Created big data streaming pipelines with Spark using the SMACK stack on DC/OS and automated deployments for Spark Structured Streaming on AWS EMR.
  • Implemented Spark jobs using Python and Spark SQL for faster data processing and for real-time analysis algorithms in Spark.
  • Used Spark for interactive queries, processing of streaming data and integration with popular NoSQL databases for huge volumes of data.
  • Developed Kafka producers and consumers in Java, integrated them with Apache Storm, and ingested data into HDFS and HBase by implementing the business rules in Storm.
  • Developed efficient MapReduce programs in Python to run batch processes on huge unstructured datasets (a Hadoop Streaming sketch follows this list).
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.
  • Imported data from different sources into HDFS using Sqoop, then performed transformations using Hive and MapReduce before loading the processed data into HDFS.
  • Exported the analyzed data to the relational databases using Sqoop, to further visualize and generate reports for the BI team.
  • Analyzed the data by performing Hive queries (Hive QL) and running Pig scripts (Pig Latin) to study customer behavior.
  • Created HBase tables and column families to store the user event data and wrote automated HBase test cases for data quality checks using HBase command line tools.
  • Involved in moving all log files generated from various sources to HDFS for further processing through Flume.
  • Scheduled and executed workflows in Oozie to run Hive and Pig jobs and created UDFs to store specialized data structures in HBase and Cassandra.
  • Developed a NiFi workflow to pick up multiple retail files from an FTP location and move them to HDFS on a daily basis.
  • Worked with developer teams on a NiFi workflow to pick up data from a REST API server, the data lake and an SFTP server and send it to the Kafka broker.
  • Evaluated Hortonworks DataFlow (HDF 2.0) and recommended a NiFi-based solution to ingest data from multiple sources, including Linux servers, into HDFS and Hive.
  • Developed product profiles using Pig and custom UDFs, and developed HiveQL scripts to de-normalize and aggregate the data.
  • Integrated Spark Streaming applications with Apache Kafka for on-prem and Amazon Kinesis for cloud deployments.
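
As a sketch of the Python MapReduce bullet above, the mapper and reducer below follow the Hadoop Streaming convention of reading lines on stdin and writing tab-separated key/count pairs on stdout. The field layout, delimiter and counting logic are illustrative assumptions, not a specific production job.

```python
#!/usr/bin/env python
# mapper.py -- emit one (key, 1) pair per input line; the keyed field and the
# tab delimiter are assumptions about the incoming data.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if fields and fields[0]:
        print("%s\t1" % fields[0])
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts per key; Hadoop Streaming delivers mapper output
# sorted by key, so a running total per key is enough.
# Submitted with the hadoop-streaming jar, e.g.:
#   hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py \
#       -mapper mapper.py -reducer reducer.py -input /data/raw -output /data/counts
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, total))
        current_key, total = key, 0
    total += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, total))
```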

Environment: Hadoop, MapReduce, YARN, Spark, Hive, Pig, Kafka, HBase, Oozie, Sqoop, Python, Bash/Shell Scripting, Flume, Cassandra, Oracle 11g, Core Java, Storm, HDFS, Unix, Teradata, NiFi, Eclipse.

Hadoop Developer

Confidential, Chicago, IL

Responsibilities:

  • Developed data pipeline using Flume, Sqoop, Pig and Java MapReduce and Spark to ingest customer behavioral data and purchase histories into HDFS for analysis.
  • Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDDs/MapReduce in Spark for data aggregation, queries, and writing data back into the OLTP system through Sqoop.
  • Along with the infrastructure team, designed and developed a Kafka- and Storm-based data pipeline that also uses Amazon Web Services EMR, S3 and RDS.
  • Developed and ran MapReduce jobs on multi-petabyte YARN/Hadoop clusters that process billions of events every day, generating daily and monthly reports per user needs.
  • Used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in an Amazon S3 bucket (a minimal PySpark cleaning step is sketched after this list).
  • Migrated data between different database platforms on AWS, such as SQL Server to Amazon Aurora, using RDS tooling.
  • Used Storm to consume events coming through Kafka and generate sessions and publish them back to Kafka.
  • Worked with multi-node cluster tooling that offers several commands for reporting HBase usage.
  • Developed end-to-end data processing pipelines, from receiving data through the distributed messaging system Kafka to persisting the data in HBase.
  • Optimized MapReduce code and Pig scripts and performed user interface analysis, performance tuning and analysis.
  • Analyzed the SQL scripts and designed the solution to implement using Scala.
  • Worked on loading the clickstream data into a structured format using Spark/Scala on top of HDFS.
  • Worked on migrating MapReduce programs into Spark transformations using Scala.
  • Involved in creating Hive tables, working on them using HiveQL, and performing data analysis using Hive and Pig.
  • Developed workflow in Oozie to manage and schedule jobs on Hadoop cluster to trigger daily, weekly and monthly batch cycles.
  • Used MongoDB as part of a POC and migrated a few of the stored procedures from SQL to MongoDB.
  • Wrote Unix shell scripts for business processes and for loading data from different interfaces into HDFS.
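
To illustrate the EMR log-processing bullet above, the sketch below parses Combined Log Format web server lines from S3 with PySpark and writes the cleaned records back as Parquet. The bucket names, paths and exact log format are assumptions for illustration.

```python
# Minimal sketch of an EMR log-cleaning step: raw S3 text logs -> parsed Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, col

spark = SparkSession.builder.appName("clean-web-logs").getOrCreate()

# Assumed Combined Log Format: host ident user [timestamp] "METHOD path proto" status size ...
LOG_PATTERN = r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d{3}) (\S+)'

logs = spark.read.text("s3://example-bucket/raw-logs/")      # assumed input location

parsed = logs.select(
    regexp_extract("value", LOG_PATTERN, 1).alias("client_ip"),
    regexp_extract("value", LOG_PATTERN, 2).alias("timestamp"),
    regexp_extract("value", LOG_PATTERN, 3).alias("method"),
    regexp_extract("value", LOG_PATTERN, 4).alias("path"),
    regexp_extract("value", LOG_PATTERN, 5).cast("int").alias("status"),
)

# Drop lines the pattern could not parse, then write cleaned output back to S3.
cleaned = parsed.filter(col("status").isNotNull())
cleaned.write.mode("overwrite").parquet("s3://example-bucket/clean-logs/")
```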

Environment: Hadoop, YARN, Cloudera Manager, Scala, Splunk, Red Hat Linux, Bash/Shell Scripting, Unix, AWS, EMR, Git, HBase, MongoDB, CentOS, Storm, Java, NoSQL, Kafka, Perl, Cloudera Navigator.

Data Analytics Engineer/Hadoop

Confidential 

Responsibilities:

  • Performed data cleaning on patient data obtained from client/hospital data feeds and reviewed mappings to the Confidential standard data warehouse.
  • Used the K-Means clustering technique to identify outliers in the data and to classify unlabeled data (a minimal sketch combining this with the PCA step below follows this list).
  • Used Principal Component Analysis in feature engineering to analyze high-dimensional data.
  • Implemented a rule-based expert system from the results of exploratory analysis and information gathered from people in different departments.
  • Evaluated models for feature selection and interacted with other departments to understand and identify data needs and requirements.
  • Worked on analyzing data and writing Hadoop MapReduce jobs using the MapReduce API, Pig and Hive.
  • Gathered the business requirements from the Business Partners and Subject Matter Experts.
  • Involved in installing Hadoop Ecosystem components under Cloudera distribution.
  • Responsible for managing data collected from different sources.
  • Installed and configured Pig and wrote Pig Latin scripts.
  • Built Spark from source and ran the Pig scripts on Spark rather than as MapReduce jobs for better performance.
  • Used Sqoop to import data from MySQL into HDFS on a regular basis.
  • Developed scripts and batch jobs to schedule various Hadoop programs.
  • Wrote Hive queries for data analysis to meet the business requirements.
  • Created Hive tables and worked on them using HiveQL.
  • Used Storm as an automatic mechanism for retrying attempts to download and manipulate the data.
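
As a minimal illustration of the K-Means and PCA bullets above, the sketch below standardizes a numeric feature matrix, reduces it with PCA, clusters it with K-Means, and flags points far from their centroid as outliers. The file name, column contents, number of components/clusters and the distance threshold are illustrative assumptions.

```python
# Minimal sketch: PCA + K-Means distance-based outlier flagging with scikit-learn.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

df = pd.read_csv("patient_features.csv")          # hypothetical cleaned, numeric feed
X = StandardScaler().fit_transform(df.values)

# Reduce the high-dimensional features before clustering (PCA step from the bullet above).
X_reduced = PCA(n_components=10).fit_transform(X)

kmeans = KMeans(n_clusters=5, random_state=42, n_init=10).fit(X_reduced)

# Flag points far from their assigned centroid as potential outliers.
distances = np.linalg.norm(X_reduced - kmeans.cluster_centers_[kmeans.labels_], axis=1)
threshold = distances.mean() + 3 * distances.std()   # illustrative cutoff
outliers = df[distances > threshold]
print("Flagged %d potential outliers" % len(outliers))
```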

Environment: Java, MapReduce, Spark, HDFS, Hive, Pig, Linux, XML, MySQL, MySQL Workbench, Java 6, Eclipse, PL/SQL, SQL connector, Subversion.
