Data Engineer Resume

SUMMARY

Big Data professional with 6+ years of combined experience in the fields of Data Applications, Big Data implementations and Java/J2EE technologies.
4 years of experience in Big Data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, Kafka, YARN, Oozie, and ZooKeeper.
Extensive exposure to Big Data technologies and the Hadoop ecosystem, with in-depth understanding of MapReduce and Hadoop infrastructure.
Expertise in writing end-to-end data processing jobs to analyze data using MapReduce, Spark and Hive.
Experience with the Apache Spark ecosystem using Spark Core, Spark SQL, DataFrames and RDDs, and knowledge of Spark MLlib.
Extensive knowledge of developing Spark Streaming jobs built on RDDs (Resilient Distributed Datasets) using Scala, PySpark and spark-shell.
Experienced in data manipulation using Python for loading and extraction, as well as with Python libraries such as NumPy, SciPy and Pandas for data analysis and numerical computations.
Experienced in using Pig scripts to do transformations, event joins, filters and pre-aggregations before storing the data into HDFS.
Strong knowledge of Hive analytical functions and of extending Hive functionality by writing custom UDFs.
Expertise in writing MapReduce jobs in Python to process large structured, semi-structured and unstructured data sets and store the results in HDFS.
Good understanding of data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables.
Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instances.
Hands-on experience working with Amazon Web Services (AWS), using Elastic MapReduce (EMR), Redshift, and EC2 for data processing.
Hands-on experience with SQL and NoSQL databases such as Snowflake, HBase, Cassandra and MongoDB.
Hands-on experience in setting up workflows using Apache Airflow and the Oozie workflow engine for managing and scheduling Hadoop jobs.
Strong experience working in UNIX/Linux environments and writing shell scripts.
Excellent knowledge of J2EE architecture, design patterns and object modeling using various J2EE technologies and frameworks, with comprehensive experience in web-based applications using J2EE frameworks like Spring, Hibernate, Struts and JMS.
Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro files, JSON files and XML files.
Experienced in working in SDLC, Agile and Waterfall methodologies.
Strong analytical, presentation, communication and problem-solving skills, with the ability to work independently as well as in a team and to follow the best practices and principles defined for the team.

TECHNICAL SKILLS

Hadoop/Spark Ecosystem: Hadoop, MapReduce, Pig, Hive/Impala, YARN, Kafka, Flume, Sqoop, Oozie, ZooKeeper, Spark, Airflow, MongoDB, Cassandra, HBase, and Storm.

Hadoop Distribution: Cloudera and Hortonworks.

Programming Languages and Frameworks: Java, Scala, Spring, Hibernate, JDBC, JSON, HTML, CSS

Scripting Languages: JavaScript, jQuery, Python, Shell Script (bash, sh)

Databases: Oracle, MySQL, SQL Server, PostgreSQL, HBase, Snowflake, Cassandra, MongoDB

Operating Systems: Linux, Windows, Ubuntu, Unix

Web/Application Servers: Apache Tomcat, WebLogic, WebSphere

IDE: IntelliJ, Eclipse and NetBeans

Version Control and Build Tools: Git, Maven, SBT, CBT

PROFESSIONAL EXPERIENCE

Confidential

Data Engineer

Responsibilities:

Responsible for the execution of big data analytics, predictive analytics and machine learning initiatives.
Implemented a proof of concept deploying the product on AWS S3 and Snowflake.
Utilized AWS services with a focus on big data architecture, analytics, enterprise data warehouse and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability and performance, and to provide meaningful and valuable information for better decision-making.
Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation, queries and writing results back to an S3 bucket.
Experience in data cleansing and data mining.
Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, and used the Spark engine and Spark SQL for data analysis, providing the results to the data scientists for further analysis.
Prepared scripts in Python and Scala, as needed, to automate ingestion from various sources such as APIs, AWS S3, Teradata and Snowflake.
Designed and developed Spark workflows using Scala to pull data from AWS S3 buckets and Snowflake and apply transformations to it.
Implemented Spark RDD transformations to map business analysis logic and applied actions on top of the transformations.
Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.
Created Python scripts to read CSV, JSON and Parquet files from S3 buckets and load them into AWS S3, DynamoDB and Snowflake.
Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3 buckets, or to HTTP requests using Amazon API Gateway.
Migrated data from AWS S3 buckets to Snowflake by writing a custom read/write Snowflake utility function in Scala.
Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines using Snowpipe and Matillion from the data lake Confidential AWS S3 bucket.
Profiled structured, unstructured, and semi-structured data across various sources to identify patterns in the data, and implemented data quality metrics using the necessary queries or Python scripts, depending on the source.
Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse and created DAGs to run in Airflow.
Created DAGs using the EmailOperator, BashOperator and Spark Livy operator to execute on an EC2 instance.
Deployed the code to EMR via CI/CD using Jenkins.
Extensively used Codecloud for code check-ins and checkouts for version control.

Environment: Agile Scrum, MapReduce, Snowflake, Pig, Spark, Scala, Hive, Kafka, Python, Airflow, JSON, Parquet, CSV, Codecloud, AWS

Confidential

Big Data Engineer

Responsibilities:

Extensively involved in the installation and configuration of the Cloudera Hadoop distribution.
Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
Developed Spark applications for performing large-scale transformations and denormalization of relational datasets.
Worked hands-on with Kafka and Storm on the HDP 2.2 platform for real-time analysis.
Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
Created reports for the BI team using Sqoop to import data into HDFS and Hive.
Analyzed unused user navigation data by loading it into HDFS and writing MapReduce jobs. The analysis provided input to the new APM front-end developers and the Lucent team.
Loaded data from multiple data sources (SQL, DB2, and Oracle) into HDFS using Sqoop and then into Hive tables.
Created Hive queries to process large sets of structured, semi-structured and unstructured data and stored the results in managed and external tables.
Developed complex HiveQL queries using the JSON SerDe.
Created HBase tables to load large sets of structured data.
Involved in importing real-time data into Hadoop using Kafka and implemented an Oozie job for daily imports.
Performed real-time event processing of data from multiple servers in the organization using Apache Storm integrated with Apache Kafka.
Managed and reviewed Hadoop log files.
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
Worked on PySpark APIs for data transformations.
Ingested data into Hadoop via Sqoop imports and performed validations and consolidations on the imported data.
Extended Hive and Pig core functionality by writing custom UDFs for data analysis.
Upgraded the current Linux version to RHEL 5.6.
Expertise in hardening Linux servers and in compiling, building and installing Apache Server from source with a minimal set of modules.
Worked with JSON, Parquet and Hadoop file formats.
Worked with Java technologies such as Hibernate, Spring, JSP and Servlets, and developed both server-side and client-side code for our web application.
Used GitHub for continuous integration services.

Environment: Agile Scrum, MapReduce, Hive, Pig, Sqoop, Spark, Scala, Oozie, Flume, Java, HBase, Kafka, Python, Storm, JSON, Parquet, Git, JSON SerDe, Cloudera.
