
Data Engineer Resume


Atlanta, GA

SUMMARY:

  • 7+ years of IT experience working as a Data Engineer and Developer, with expertise in Data Engineering, Big Data, and SQL. Analytically minded self-starter with years of experience collaborating with cross-functional teams and ensuring the accuracy and integrity of data and actionable insights. Prepared to collaborate with teams on ETL tools and Hadoop architectures combined with Spark to extract, transform, and load data.

PROFESSIONAL EXPERIENCE:

Data Engineer

Confidential

Responsibilities:

  • Worked with big data (20M+ observations of text data) using SQL and SQLite.
  • Wrote SQL queries to merge data from multiple tables and obtain relationships between questionnaire data, participant response data, and participant withdrawal data (SQL, SQLite).
  • Worked within an Agile / Scrum framework. Took on the role of Scrum Master for several sprints.
  • Worked in a Git version-control environment using Git and Python.
  • Extracted data from several sources to increase agility and accuracy with a centralized system.
  • Analyzed the impact of varying numbers of simultaneous users on server performance for the applications, including CPU and memory usage.
  • Used Spark Streaming APIs to perform the necessary transformations and actions on the fly for the common learner data model, which consumes data from Kafka in near real time and persists it into Cassandra (a minimal sketch of this pattern follows this list).
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
  • Managed and reviewed Hadoop log files.
  • Created different staging tables like ingestion tables and preparation tables in the Hive environment.
  • Worked on Sequence files, Map side joins, Bucketing, Static and Dynamic Partitioning for Hive performance enhancement and storage improvement.
  • Ingested data from RDBMS to Hive to perform data transformations, and then export the transformed data to Cassandra for data access and analysis.
  • Set up job management using the Fair Scheduler and developed job-processing scripts using Oozie workflows.
  • Implemented Spark scripts using Scala and Spark SQL to load Hive tables into Spark for faster data processing.
  • Configured Spark Streaming to consume Kafka streams and store the data in HDFS.
  • Handled large datasets during the ingestion process itself using Spark partitions, broadcast variables, and effective and efficient joins and transformations.
  • Wrote Sqoop scripts for importing and exporting data between RDBMS and HDFS.
  • Executed many performance tests using the Cassandra-stress tool to measure and improve the read and write performance of the cluster.
  • Developed design documents for the code developed.
  • Configured various workflows to run on top of Hadoop using Oozie; these workflows comprise heterogeneous jobs such as Pig, Hive, Sqoop, and MapReduce.
  • Wrote Spark applications in Scala to interact with the MySQL database using the Spark SQL context and accessed Hive tables using the Hive context.
  • Developed a Spark batch job to automate creation and metadata updates of external Hive tables built on top of datasets residing in HDFS.
  • Performed the migration of Hive and MapReduce jobs from on-premises MapR to the AWS cloud using EMR.
  • Optimized Hive queries and used Hive on top of Spark engine.
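
Below is a minimal PySpark sketch of the Kafka-to-Spark-Streaming-to-Cassandra pattern referenced in this list. The production jobs described above were written in Scala; the broker address, topic, keyspace, table, and event schema used here (learner_events, learner_ks, learner_model) are hypothetical placeholders, not the actual project names.

```python
# Sketch: consume events from Kafka, transform them, and persist into Cassandra.
# Requires the spark-sql-kafka and spark-cassandra-connector packages on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = (SparkSession.builder
         .appName("learner-data-model")
         .config("spark.cassandra.connection.host", "cassandra-host")  # hypothetical host
         .getOrCreate())

# Expected JSON payload of each Kafka message (hypothetical fields).
event_schema = StructType([
    StructField("learner_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

# Read the Kafka topic as a streaming DataFrame.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
       .option("subscribe", "learner_events")
       .load())

# Parse the message value and keep only the modeled columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

# Persist each micro-batch into Cassandra via the DataStax connector.
def write_to_cassandra(batch_df, batch_id):
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="learner_ks", table="learner_model")
     .mode("append")
     .save())

query = (events.writeStream
         .foreachBatch(write_to_cassandra)
         .option("checkpointLocation", "/tmp/checkpoints/learner_model")
         .start())
query.awaitTermination()
```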

Big Data Engineer

Confidential - Atlanta, GA (Remote)

Responsibilities:

  • Involved in complete project life cycle starting from design discussion to production deployment
  • Worked closely with the business team to gather their requirements and new support features
  • Involved in running POCs on different use cases of the application and maintained a standard document for best coding practices
  • Designed the data lake on a 200-node cluster using the Hortonworks distribution
  • Responsible for building scalable distributed data solutions using Hadoop
  • Installed, configured and implemented high availability Hadoop Clusters with required services (HDFS, Hive, HBase, Spark, Zookeeper)
  • Implemented Kerberos for authenticating all the services in Hadoop Cluster
  • Responsible for installation and configuration of Hive, Pig, HBase, and Sqoop on the Hadoop cluster, and created Hive tables to store the processed results in a tabular format.
  • Configured Spark Streaming to receive real-time data from Apache Kafka and store the streamed data to HDFS using Scala.
  • Developed solutions to process data into HDFS and analyzed it using MapReduce, Pig, and Hive to produce summary results from Hadoop for downstream systems.
  • Built servers on AWS: importing volumes, launching EC2 instances, and creating security groups, auto scaling, load balancers, Route 53, SES, and SNS in the defined virtual private cloud.
  • Wrote MapReduce code to process and parse data from various sources, storing the parsed data in HBase and Hive using HBase-Hive integration.
  • Streamed AWS log groups into a Lambda function to create ServiceNow incidents.
  • Involved in loading and transforming large sets of Structured, Semi-Structured and Unstructured data and analyzed them by running Hive queries and Pig scripts.
  • Created Managed tables and External tables in Hive and loaded data from HDFS.
  • Developed Spark code by using Scala and Spark-SQL for faster processing and testing and performed complex HiveQL queries on Hive tables.
  • Scheduled several time-based runs of the Oozie workflow by developing Python scripts.
  • Developed Pig Latin scripts using operators such as LOAD, STORE, DUMP, FILTER, DISTINCT, FOREACH, GENERATE, GROUP, COGROUP, ORDER, LIMIT, UNION, SPLIT to extract data from data files to load into HDFS.
  • Exported data to RDBMS servers using Sqoop and processed that data for ETL operations.
  • Worked on S3 buckets on AWS to store Cloud Formation Templates and worked on AWS to create EC2 instances.
  • Designed the ETL data pipeline flow to ingest data from RDBMS sources into Hadoop using shell scripts, Sqoop, and MySQL.
  • Delivered end-to-end architecture and implementation of client-server systems using Scala, Akka, Java, JavaScript, and related technologies on Linux.
  • Optimized Hive tables using techniques such as partitioning and bucketing to provide better performance.
  • Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs, such as Java MapReduce, Hive, Pig, and Sqoop.
  • Implemented Hadoop on AWS EC2 using a few instances to gather and analyze data log files.
  • Worked with Spark and Spark Streaming, creating RDDs and applying transformation and action operations.
  • Created partitioned tables and loaded data using both static and dynamic partitioning (see the sketch after this list).
  • Developed custom Apache Spark programs in Scala to analyze and transform unstructured data.
  • Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from Oracle into HDFS using Sqoop
  • Used Kafka for publish-subscribe messaging as a distributed commit log and gained experience with its speed, scalability, and durability.
  • Followed a Test-Driven Development (TDD) process and gained extensive experience with the Agile and Scrum methodologies.
  • Implemented a POC to migrate MapReduce jobs into Spark RDD transformations using Scala
  • Scheduled MapReduce jobs in the production environment using the Oozie scheduler.
  • Involved in cluster maintenance, monitoring, and troubleshooting; managed and reviewed data backups and log files.
  • Designed and implemented MapReduce jobs to support distributed processing using Java, Hive, and Apache Pig
  • Analyzed the Hadoop cluster and different big data analytics tools, including Pig, Hive, HBase, and Sqoop.
  • Improved performance by tuning Hive and MapReduce jobs.
  • Researched, evaluated, and utilized modern technologies, tools, and frameworks around the Hadoop ecosystem.
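
As an illustration of the external-table and partitioning work described in this list, here is a minimal PySpark sketch. The database, table, column, and HDFS path names (sales_db, sales_raw, sales_by_day, /data/raw/sales) are hypothetical placeholders; the actual jobs and schemas differed.

```python
# Sketch: external table over raw HDFS files, plus a partitioned managed table
# loaded with dynamic partitioning for faster, pruned Hive queries.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning")
         .enableHiveSupport()  # talk to the Hive metastore
         .config("hive.exec.dynamic.partition", "true")
         .config("hive.exec.dynamic.partition.mode", "nonstrict")
         .getOrCreate())

# External table over raw files landed in HDFS (e.g. by Sqoop).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_db.sales_raw (
        order_id   STRING,
        amount     DOUBLE,
        order_date STRING
    )
    STORED AS PARQUET
    LOCATION '/data/raw/sales'
""")

# Managed table partitioned by day.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_db.sales_by_day (
        order_id STRING,
        amount   DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS ORC
""")

# Dynamic partitioning: the partition value is derived per row, so the
# partition column must come last in the SELECT list.
spark.sql("""
    INSERT OVERWRITE TABLE sales_db.sales_by_day PARTITION (order_date)
    SELECT order_id, amount, order_date
    FROM sales_db.sales_raw
""")
```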

Big Data Developer

Confidential - Atlanta, GA (Remote)

Responsibilities:

  • Administered, provisioned, patched, and maintained Cloudera Hadoop clusters on Linux
  • Worked with Spark Streaming, creating RDDs and applying transformation and action operations
  • Developed Spark applications using Python for easy Hadoop transitions.
  • Used Spark and Spark SQL through the Python API to read Parquet data and create tables in Hive (see the sketch after this list)
  • Developed Spark code using Python and Spark-SQL for faster processing and testing.
  • Implemented Spark programs using PySpark and analyzed the SQL scripts and designed the solutions
  • Loaded data pipelines from web servers and Teradata using Sqoop with Kafka and Spark Streaming API.
  • Developed Kafka pub-sub and Cassandra clients, along with Spark components on HDFS and Hive
  • Populated HDFS and HBase with huge amounts of data using Apache Kafka.
  • Experienced with different scripting languages like Python and shell scripts.
  • Developed various Python scripts to find vulnerabilities with SQL Queries by doing SQL injection, permission checks and performance analysis.
  • Tested Apache TEZ, an extensible framework for building high performance batch and interactive data processing applications, on Pig and Hive jobs.
  • Designed and implemented incremental imports into Hive tables and wrote Hive queries to run on Tez
  • Wrote shell scripts that run multiple Hive jobs to incrementally update the different Hive tables used to generate Tableau reports for business use.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.
  • Worked on Spark SQL, creating DataFrames by loading data from Hive tables, preparing data, and storing it in AWS S3 (see the sketch after this list).
  • Used Spark Streaming APIs to perform transformations and actions on the fly for the common learner data model, which consumes data from Kafka in near real time and persists it into Cassandra.
  • Created custom UDFs for Pig and Hive to consolidate Python methods and functionality into Pig Latin and HQL (HiveQL)
  • Extensively worked on Text, ORC, Avro and Parquet file formats and compression techniques like Snappy, Gzip and Zlib.
  • Experience in NoSQL Column-Oriented Databases like Cassandra and its Integration with Hadoop cluster.
  • Created, managed, and utilized policies for S3 buckets and Glacier for storage and backup on AWS.
  • Wrote ETL jobs to read from web APIs using REST and HTTP calls and loaded into HDFS using Java and Talend.
  • Transformed raw data into MySQL with a custom-made ETL application to prepare unruly data for machine learning
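
A minimal PySpark sketch of the Parquet-to-Hive and prep-to-S3 flow described in this list. The HDFS path, database/table names, column names, and S3 bucket (events_parquet, analytics.events, event_ts, my-prep-bucket) are hypothetical placeholders for illustration only.

```python
# Sketch: read Parquet from HDFS, expose it as a Hive table, and stage a
# prepared extract in S3 for downstream consumers.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = (SparkSession.builder
         .appName("parquet-to-hive")
         .enableHiveSupport()
         .getOrCreate())

# Read raw Parquet files from HDFS.
events = spark.read.parquet("hdfs:///data/events_parquet")

# Persist as a Hive table so analysts can query it with HiveQL.
events.write.mode("overwrite").saveAsTable("analytics.events")

# Prepare a smaller, typed dataset (hypothetical columns) and stage it in S3.
prep = (events
        .withColumn("event_date", to_date(col("event_ts")))
        .filter(col("event_type").isNotNull())
        .select("event_id", "event_type", "event_date"))

prep.write.mode("overwrite").parquet("s3a://my-prep-bucket/prep/events/")
```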
