Data Engineer Resume
SUMMARY
- Big Data Engineer with 10 years of IT experience, including 9 years in Big Data technologies. Expertise in Hadoop/Spark development, automation tools, and the software design process. Outstanding communication skills, dedicated to maintaining up-to-date IT skills
- Skilled in managing data analytics, data processing, database, and data-driven projects
- Skilled in Architecture of Big Data Systems, ETL Pipelines, and Analytics Systems for diverse end-users
- Skilled in Database systems and administration
- Proficient in writing technical reports and documentation
- Adept with various distributions and platforms such as Cloudera Hadoop, Hortonworks, MapR, Elastic Cloud, and Elasticsearch
- Expert in bucketing and partitioning
- Expert in Performance Optimization
TECHNICAL SKILLS
APACHE: Apache Ant, Apache Flume, Apache Hadoop, Apache YARN, Apache Hive, Apache Kafka, Apache Maven, Apache Oozie, Apache Spark, Apache Tez, Apache Zookeeper, Cloudera Impala, HDFS, Hortonworks, MapR, MapReduce
SCRIPTING: HiveQL, MapReduce, XML, FTP, Python, UNIX/Linux shell scripting
OPERATING SYSTEMS: Unix/Linux, Windows 10, Ubuntu
FILE FORMATS: Parquet, Avro, JSON, ORC, text, CSV
DISTRIBUTIONS: Cloudera CDH 4/5, Hortonworks HDP 2.5/2.6, Amazon Web Services (AWS), Elastic (ELK)
DATA PROCESSING (COMPUTE) ENGINES: Apache Spark, Spark Streaming, Apache Flink, Apache Storm
DATA VISUALIZATION TOOLS: QlikView, Tableau, Power BI, Matplotlib
DATABASE: Microsoft SQL Server (2005, 2008 R2, 2012), Apache Cassandra, Amazon Redshift, Amazon DynamoDB, Apache HBase, Apache Hive, MongoDB; database design & data structures
SOFTWARE: Microsoft Project, VMware, Microsoft Word, Excel, Outlook, PowerPoint; technical documentation skills
PROFESSIONAL EXPERIENCE
Confidential
DATA ENGINEER
Responsibilities:
- Design and build data processing pipelines using tools and frameworks in the Hadoop ecosystem
- Implemented PySpark Streaming to receive real-time data from Kafka (see the streaming sketch after this list)
- Created Hive tables, loaded them with data, and wrote Hive queries to process the data
- Split JSON files at the RDD level so they could be processed in parallel for better performance and fault tolerance
- Designed Hive queries to perform data analysis, data transfer, and table design
- Collected data via a REST API: built an HTTPS connection with the client server, sent GET requests, and collected the responses in a Kafka producer
- Wrote a Spark program that used the Spark context to parse out the needed data, select the columns with the target information, and assign column names
- Configured Zookeeper to coordinate the servers in the cluster, maintain data consistency, and monitor services
- Design and build ETL pipelines to automate the ingestion of structured and unstructured data
- Design and Build pipelines to facilitate data analysis
- Implement and configure big data technologies as well as tune processes for performance at scale
- Working closely with the stakeholders & solution architect.
- Ensuring architecture meets the business requirements.
- Building highly scalable, robust & fault-tolerant systems.
- Finding ways & methods to find the value out of existing data. Proficiency and knowledge of best practices with the Hadoop (YARN, HDFS, MapReduce)
- AWS EMR to process big data across Hadoop clusters of virtual servers on Amazon Simple Storage Service (S3)
- Automated AWS components like EC2 instances, Security groups, ELB, RDS, Lambda and IAM through AWS Cloud Formation templates
- Installed, Configured and Managed AWS Tools such as ELK, Cloud Watch for Resource Monitoring
- Work with engineering team members to explore and create interesting solutions while sharing knowledge within the team
- Work across product teams to help solve customer-facing issues
- Demonstrable experience designing technological solutions to complex data problems, developing & testing modular, reusable, efficient and scalable code to implement those solutions
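Streaming sketch: a minimal PySpark Structured Streaming variant of the Kafka ingestion and column-selection work above. The broker address, topic name, JSON schema, and output paths are assumptions, not the project's actual values.

```python
# Minimal PySpark Structured Streaming sketch: read JSON events from Kafka,
# keep only the target columns, and land them on HDFS as Parquet.
# Requires the spark-sql-kafka connector package on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

# Assumed schema for the incoming JSON payload.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
       .option("subscribe", "events")                      # placeholder topic
       .load())

# Parse the JSON value and select/rename the columns of interest.
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("e"))
          .select(col("e.event_id").alias("id"),
                  col("e.user_id"),
                  col("e.amount")))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events")            # placeholder output path
         .option("checkpointLocation", "hdfs:///chk/events")
         .start())
query.awaitTermination()
```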
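CloudFormation sketch: a hedged boto3 example of driving the AWS automation described above. The template file, stack name, region, and parameters are illustrative placeholders, not the project's actual configuration.

```python
# Create a CloudFormation stack (EC2, security groups, ELB, RDS, Lambda, IAM)
# from a template file using boto3.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

with open("pipeline-stack.yaml") as f:          # hypothetical template file
    template_body = f.read()

cfn.create_stack(
    StackName="data-pipeline",                  # placeholder stack name
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],      # needed when the template creates IAM roles
    Parameters=[{"ParameterKey": "Environment", "ParameterValue": "dev"}],
)

# Block until stack creation completes.
cfn.get_waiter("stack_create_complete").wait(StackName="data-pipeline")
```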
Confidential
BIG DATA ENGINEER
Responsibilities:
- Support, maintain and document Hadoop and MySQL data warehouse
- Iterate and improve existing features in the pipeline as well as add new ones
- Design, develop, document, and test new requirements in the data pipeline using BASH, FLUME, HDFS and SPARK in the Hadoop ecosystem
- Provide full operational support - analyze code to identify root causes of production issues and provide solutions or workarounds and lead it to resolution
- Participate in full development life cycle including requirements analysis, design, development, deployment, and operations support
- Created and managed cloud VMs with the AWS EC2 command-line interface and the AWS Management Console.
- Used the Spark DataFrame API over the Cloudera platform to perform analytics on Hive data (see the DataFrame sketch after this list).
- Added support for Amazon AWS S3 and RDS to host static/media files and the database into Amazon Cloud.
- Used an Ansible Python script to generate inventory and push deployments to AWS instances.
- Executed Hadoop/Spark jobs on AWS EMR against data stored in S3 buckets.
- Implemented Amazon EMR to process Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2), with data stored in Amazon Simple Storage Service (S3) and results loaded into AWS Redshift
- Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3 (see the Lambda sketch after this list).
- Populated AWS Redshift database tables via Amazon Kinesis Firehose.
- Automated the installation of the ELK agent (Filebeat) with an Ansible playbook. Developed a Kafka queue system to collect log data without data loss and publish it to multiple consumers.
- Used AWS CloudFormation templates alongside Terraform with existing plugins.
- Developed AWS CloudFormation templates to create a custom infrastructure for our pipeline
- Implemented AWS IAM user roles and policies to authenticate and control access
- Specified nodes and performed data analysis queries on Amazon Redshift clusters on AWS
- Processed multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and loaded the results into AWS Redshift
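DataFrame sketch: a small example of the Spark DataFrame work over Hive described above. The database, table, and column names are assumptions used only for illustration.

```python
# Read a Hive table through the metastore and compute a simple aggregate.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-analytics")
         .enableHiveSupport()                    # read Hive metastore tables
         .getOrCreate())

orders = spark.table("sales_db.orders")          # hypothetical Hive table

# Example analytic: daily revenue per region.
daily_revenue = (orders
                 .groupBy("region", F.to_date("order_ts").alias("order_date"))
                 .agg(F.sum("amount").alias("revenue"))
                 .orderBy("order_date"))

daily_revenue.write.mode("overwrite").saveAsTable("sales_db.daily_revenue")
```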
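Lambda sketch: a hedged example of an event-driven Lambda handler of the kind described above, shown here for S3 object-created events (a DynamoDB Streams trigger would deliver a similar Records payload). Bucket names and the processing step are illustrative.

```python
# Lambda handler triggered by S3 events: fetch each new object and process it.
import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Run a small script in response to each new S3 object."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        obj = s3.get_object(Bucket=bucket, Key=key)
        payload = obj["Body"].read()
        # Placeholder processing step; the real function forwarded data onward.
        print(f"processed {len(payload)} bytes from s3://{bucket}/{key}")
    return {"statusCode": 200, "body": json.dumps("ok")}
```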
Confidential
BIG DATA ENGINEER
Responsibilities:
- Work in a fast-paced agile development environment to quickly analyze, develop, and test potential use cases for the business
- Develop and build frameworks/prototypes that integrate Big Data and advanced analytics to make business decisions
- Assist application development teams during application design and development for highly complex and critical data projects
- Worked on AWS Kinesis for processing huge amounts of real-time data
- Developed multiple Spark Streaming and batch Spark jobs using Java, Scala, and Python on AWS
- Configured RDS, CloudFormation, AWS IAM, and Security Groups in public and private subnets within a VPC
- Worked with AWS Lambda functions for event-driven processing to various AWS resources
- Assisted in the installation and configuration of Hive, Sqoop, Flume, and Oozie on the Hadoop cluster with the latest patches
- Created Hive queries to spot emerging trends by comparing Hadoop data with historical metrics
- Loaded ingested data into Hive managed and external tables.
- Wrote custom user-defined functions (UDFs) for complex Hive queries (HQL) (see the UDF sketch after this list)
- Performed upgrades, patches and bug fixes in Hadoop in a cluster environment
- Wrote shell scripts to automate workflows to pull data from various databases into Hadoop framework for users to access the data through Hive based views
- Writing Hive Queries for analyzing data in Hive warehouse using Hive Query Language
- Built Hive views on top of the source data tables and set up secured data provisioning
- Used Cloudera Manager for installation and management of single-node and multi-node Hadoop cluster
- Wrote shell scripts for automating the process of data loading
- Work closely with development, test, documentation, and product management teams to deliver high-quality products and services in a fast-paced environment
- Algorithm development on high-performance systems
- Create data management policies, procedures, and standards
- Working with the end-user to make sure the analytics transform data to knowledge in very focused and meaningful ways
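UDF sketch: the Hive UDF bullet above refers to custom functions used in HQL; as a hedged Python analogue, this registers a Spark SQL UDF and calls it in an HQL-style query over a Hive table. The table and column names are assumptions.

```python
# Register a Python UDF and use it in an HQL-style query over a Hive table.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .appName("udf-demo")
         .enableHiveSupport()
         .getOrCreate())

def normalize_phone(raw):
    """Keep digits only, e.g. '(555) 123-4567' -> '5551234567'."""
    return "".join(ch for ch in raw if ch.isdigit()) if raw else None

spark.udf.register("normalize_phone", normalize_phone, StringType())

result = spark.sql("""
    SELECT customer_id, normalize_phone(phone) AS phone_clean
    FROM crm_db.customers              -- hypothetical Hive table
    WHERE phone IS NOT NULL
""")
result.show()
```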
International Paper, Data Administrator
Confidential, GA
Responsibilities:
- Executes moderately complex functional work tracks for the team
- Work in an agile environment and continuously improve the agile processes
- Maintain existing ETL workflows, data management, and data query components
- Wrote database objects such as stored procedures and triggers for Oracle, MS SQL Server, and Hive
- Good knowledge of PL/SQL and HQL; hands-on experience writing intermediate-level SQL queries
- Good knowledge of Impala, Spark (Scala), and Storm
- Expertise in preparing test cases, documenting, and performing unit and integration testing
- Installed and configured Hive and wrote Hive UDFs
- Experience importing and exporting data into HDFS and Hive using Sqoop.
- Developed Sqoop jobs to populate Hive external tables using incremental loads (see the Sqoop sketch after this list)
- Installed Oozie workflow engine to run multiple Hive jobs
- Used Spark modules to store the data on HDFS
- Develop automation and data collection frameworks
- Develops innovative solutions to Big Data issues and challenges within the team
- Known for being a smart, analytical thinker who approaches their work with logic and enthusiasm
- Drive the optimization, testing and tooling to improve data quality
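Sqoop sketch: a hedged example of the incremental-load jobs mentioned above, driven from Python via subprocess. The JDBC URL, credentials, table, check column, and HDFS paths are placeholders; in practice the last value is tracked by a saved Sqoop job rather than hard-coded.

```python
# Run a Sqoop incremental import into the HDFS directory backing a Hive external table.
import subprocess

sqoop_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host:3306/sales",   # placeholder JDBC URL
    "--username", "etl_user",
    "--password-file", "/user/etl/.db_password",
    "--table", "orders",
    "--target-dir", "/warehouse/external/orders",     # HDFS dir behind the Hive external table
    "--incremental", "append",
    "--check-column", "order_id",
    "--last-value", "1000000",                        # placeholder; a saved job tracks this
    "--as-parquetfile",
    "-m", "4",
]

subprocess.run(sqoop_cmd, check=True)
```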