
Big Data Engineer Resume


SUMMARY

  • 6+ years of IT experience as a Data Engineer and Big Data Developer, with expertise in data engineering and big data development.
  • Analytically minded self-starter with 6 years of experience collaborating with cross-functional teams and ensuring the accuracy and integrity of data and the actionable insights derived from it.
  • Prepared to collaborate with teams on predictive modeling and insight reporting to improve business efficiency, advance strategic goals, and increase profit.

TECHNICAL SKILLS

  • Leading Software Development Teams
  • Database Design and Management
  • Database Development
  • Hardware and Software Updates
  • SQL Server Database
  • AWS Redshift

PROFESSIONAL EXPERIENCE

Big Data Engineer

Confidential

Responsibilities:

  • Worked with MongoDB using CRUD operations (Create, Read, Update, Delete), indexing, replication, and sharding
  • Designed HBase row keys to store text and JSON as key values and structured the keys so that gets and scans return rows in sorted order
  • Integrated Oozie with the Hadoop stack (MapReduce, Hive, Sqoop, and Pig) and with Unix shell scripts
  • Wrote HiveQL against Hive tables and designed and implemented static and dynamic partitioning and bucketing (see the sketch after this list)
  • Applied caching, persistence, and checkpointing in Spark Streaming applications that consumed near-real-time data from Kinesis Streams and persisted it into Cassandra to build common learning data models
  • Conducted cluster coordination services with Zookeeper and monitored the workload, capacity planning and job performance through Cloudera Manager
  • Built applications by utilizing Maven and integrated with Continuous Integration servers such as Jenkins to build jobs
  • Deployed, configured, and maintained multi-node Dev and Test Kinesis Streams clusters and implemented data ingestion for real-time processing in Kinesis Streams
  • Created cubes in Talend for various aggregation types of data from PostgreSQL and MS SQL server to visualize data
  • Monitored NameNode health and the number of running DataNodes and TaskTrackers, and automated jobs that pulled data from various MySQL sources and pushed result sets to HDFS
  • Created storytelling dashboards in Tableau Desktop, published them to Tableau Server, and used GitHub for version control across projects
  • Deployed Spark applications in Python and utilized Datasets and DataFrames in Spark SQL for faster data processing
  • Loaded transactional data with Sqoop from Teradata, created managed and external tables in Hive, and worked with semi-structured and structured data of 5 Petabytes in size
  • Constructed MapReduce jobs to validate, clean, and access data and worked with Sqoop jobs with incremental load to populate and load into Hive External tables
  • Designed strategies to optimize distribution of weblog data over clusters, in addition to exporting and importing stored web log data into Hive and HDFS through Sqoop
  • Built scalable, distributed data solutions on Hadoop and Cloudera and designed and developed automated test scripts in Python
  • Integrated Apache Storm with Kinesis Streams to perform web analytics and to move clickstream data from Kinesis Streams to HDFS
  • Developed SQL scripts and designed solutions to implement Spark with Hive Generic UDFs for incorporating business logic within Hive queries
  • Developed data pipelines in AWS using S3, EMR, Redshift to extract data from weblogs to store into HDFS
  • Transmitted streaming data from Kinesis Streams to HBase, Hive, and HDFS by integrating Apache Storm and wrote Pig scripts for transforming raw data from various data sources to form baseline data
  • Participated in Agile ceremonies for the Ford Credit Customer Data domain, including daily scrum meetings and sprint planning
  • Expanded and optimized data pipelines and architecture as well as optimized data flow and collection
  • Created pipelines from scratch using PySpark and scheduled jobs using Airflow
  • Stream-processed data in Kinesis Streams and wrote Producer, Consumer, Connector, and Streams API code to handle streams of records, subscribe to topics, consume input, and build reusable producers and consumers
  • Worked with unstructured datasets such as IoT sources, sensor data, XML and JSON document sources
  • Scaled up the architecture with Google Kubernetes Engine and set up load balancing
  • Worked on various optimization techniques in Spark such as cache and persist, accumulators, bucketing and partitioning, garbage collection tuning, data serialization, windowing functions, and broadcast variables (see the sketch after this list)
  • Used numerous Spark transformations such as groupByKey()/reduceByKey(), flatMap(), filter(), sample(), union(), etc
  • Utilized Dataproc to spin up clusters, Dataprep for data analysis, and Dataflow for streaming data using Apache Beam
  • Dealt with VPC controls on Google Cloud Platform and CMEK encryption for data security
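
The partitioning bullet above points here. A minimal PySpark sketch of static and dynamic partition inserts into a bucketed Hive table follows; the table names, columns, and partition values are illustrative assumptions, not details from the project.

from pyspark.sql import SparkSession

# A minimal sketch, assuming Hive support is available to the Spark session.
spark = (SparkSession.builder
         .appName("hive-partitioning-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Dynamic partition inserts need these settings; static partitions do not.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Partitioned and bucketed table (illustrative schema).
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_part (
        order_id BIGINT,
        amount   DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    CLUSTERED BY (order_id) INTO 16 BUCKETS
    STORED AS PARQUET
""")

# Static partition: the partition value is fixed in the statement.
spark.sql("""
    INSERT OVERWRITE TABLE sales_part PARTITION (order_date='2017-01-01')
    SELECT order_id, amount FROM sales_raw WHERE order_date = '2017-01-01'
""")

# Dynamic partition: Hive derives the partition value from the last SELECT column.
spark.sql("""
    INSERT OVERWRITE TABLE sales_part PARTITION (order_date)
    SELECT order_id, amount, order_date FROM sales_raw
""")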
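
The Spark optimization bullet above points to this second sketch, which shows the broadcast-variable, reduceByKey, and cache/persist patterns together; the input path and lookup values are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-tuning-sketch").getOrCreate()
sc = spark.sparkContext

# Broadcast a small lookup table once instead of shipping it with every task.
country_lookup = sc.broadcast({"US": "United States", "DE": "Germany"})

lines = sc.textFile("hdfs:///data/weblogs/*.log")  # hypothetical path

# reduceByKey pre-aggregates on the map side, unlike groupByKey,
# which shuffles every record before aggregating.
counts = (lines
          .map(lambda line: line.split("\t"))
          .filter(lambda cols: len(cols) > 2)
          .map(lambda cols: (country_lookup.value.get(cols[1], "Other"), 1))
          .reduceByKey(lambda a, b: a + b))

# Cache because the result is reused by more than one action.
counts.cache()
print(counts.take(10))
print(counts.count())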

Environment: Hadoop, HDFS, HBase, Spark, Scala, Hive, Kinesis Streams, MapReduce, Sqoop, ETL, Java, Python, PostgreSQL, SQL Server, Teradata, Unix/Linux.

Big Data Engineer

Confidential

Responsibilities:

  • Performed query tuning in HiveQL as well as performance tuning of transformations in PySpark using Spark RDDs and Python
  • Used AWS Lambda functions to create a serverless data intake pipeline on AWS (see the sketch after this list)
  • Utilized Python to construct a Spark Streaming pipeline that received real-time data from Kinesis Streams and stored it in DynamoDB
  • Implemented Apache Spark data processing module to handle data from multiple RDBMS and Streaming sources, then compiled Apache Spark applications using Scala and Python
  • Extensive experience designing and scheduling multiple Spark Streaming / batch Jobs in Python (pyspark) and Scala
  • Achieved high-throughput, scalable, fault-tolerant stream processing of live data streams using Apache Spark Streaming
  • Created and saved DataFrames using various Python modules with PySpark
  • Ingested data from relational databases with Sqoop and ran Hive queries to analyze historical data
  • Experienced with Elastic MapReduce (EMR) and with setting up environments on Amazon EC2 instances for pipelines in AWS
  • Expertise in handling Hive queries through Spark SQL, including window functions and aggregations (see the sketch after this list)
  • Ran Spark applications on Docker using EMR and used AWS Glue data catalog as the metastore in Spark SQL
  • Configured different file formats such as Avro and Parquet for Hive querying and processing based on business logic
  • Utilized Sequence files, RC files, Map side joins, bucketing, partitioning for Hive performance enhancement and storage improvement
  • Implemented Hive UDF to implement business logic and performed extensive data validation using Hive
  • Involved in loading the structured and semi structured data into spark clusters using Spark SQL and Data Frames API
  • Utilized AWS CloudWatch to monitor the performance environment instances for operational and performance metrics during load testing
  • Scripting Hadoop package installation and configuration to support fully automated deployments
  • Involved in Chef infrastructure maintenance, including backups and security fixes on the Chef Server
  • Deployed application updates using Jenkins
  • Installed, configured, and managed Jenkins
  • Triggered the client's SIT environment build remotely through Jenkins
  • Deployed and configured Git repositories with branching, forks, tagging, and notifications
  • Worked on MongoDB database concepts such as locking, transactions, indexes, sharding, replication, and schema design
  • Reviewed selected issues through the SonarQube web interface
  • Developed a fully functional login page for the company's user facing website with complete UI and validations
  • Installed, configured, and utilized AppDynamics (an application performance management tool) across the whole JBoss environment (Prod and Non-Prod)
  • Installed and configured Hive in a Hadoop cluster and assisted business users/application teams in fine-tuning their HiveQL for optimal performance and efficient use of cluster resources
  • Utilized Oozie workflow for ETL Process for critical data feeds across the platform
  • Configured Ethernet bonding for all Nodes to double the network bandwidth
  • Configured Kerberos Security Authentication protocol for existing clusters
  • Set up ZooKeeper Failover Controller (ZKFC) and Quorum Journal Nodes for high availability on major production clusters and enabled automatic failover
  • Installed and deployed multiple Apache Hadoop nodes on AWS EC2, developed Pig Latin scripts to replace the legacy process with Hadoop, and fed data to AWS S3
  • Experience with AWS CloudFront, including the creation and management of distributions that provide access to an S3 bucket or an HTTP server running on EC2 instances
  • Developed Python scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark 1.6 for data aggregation, queries, and writing data back into the OLTP system through Sqoop; also developed an enterprise application using Python
  • Performed Spark application performance optimization, including determining the appropriate batch interval, parallelism level, and memory tuning
  • Worked with on-prem clusters as well as clusters in the cloud and used GCP BigQuery, Data Fusion, Dataflow, Dataproc, and Bigtable
  • Experience and hands-on knowledge in Akka and LIFT Framework
  • Used PostgreSQL and NoSQL databases and integrated them with Hadoop to develop datasets on HDFS
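
The serverless intake bullet above points here. A minimal sketch of an AWS Lambda handler triggered by an S3 put event that forwards each line to a Kinesis stream; the stream name, bucket layout, and record format are assumptions, not details from the project.

import json
import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

STREAM_NAME = "intake-stream"  # assumed name, not from the resume

def lambda_handler(event, context):
    # Each record describes one S3 object that triggered the function.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Forward each line as one Kinesis record (illustrative format).
        for line in body.splitlines():
            kinesis.put_record(
                StreamName=STREAM_NAME,
                Data=json.dumps({"source_key": key, "payload": line}),
                PartitionKey=key,
            )
    return {"status": "ok", "records": len(event["Records"])}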
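
The Spark SQL bullet above points to this sketch of window functions alongside a plain aggregation; the DataFrame and its columns are made up for illustration.

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("window-sketch").getOrCreate()

# Illustrative data: (user_id, event_date, amount).
events = spark.createDataFrame(
    [("u1", "2019-01-01", 10.0), ("u1", "2019-01-02", 20.0), ("u2", "2019-01-01", 5.0)],
    ["user_id", "event_date", "amount"],
)

# Window per user, ordered by date: running total and row rank.
w = Window.partitionBy("user_id").orderBy("event_date")
result = (events
          .withColumn("running_total", F.sum("amount").over(w))
          .withColumn("rank", F.row_number().over(w)))

# Plain aggregation alongside the window columns.
totals = events.groupBy("user_id").agg(F.sum("amount").alias("total_amount"))

result.show()
totals.show()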

Environment: HDFS, Map Reduce, Hive 1.1.0, Kinesis Streams, Hue 3.9.0, Pig, Flume, Oozie, Sqoop, Apache Hadoop 2.6, Spark, SOLR, Storm, Cloudera Manager, Red Hat, MySQL, Prometheus, Docker, Puppet, YARN, Spark-SQL, Python, Amazon AWS, Elastic Search, Tableau, Linux, GCP Big Query, Data Fusion, DataFlow, DataProc, BigTable.

Big Data Developer

Confidential

Responsibilities:

  • Developed NiFi workflows to automate the data movement between different Hadoop systems
  • Configured, deployed, and maintained multi-node Dev and Test Kinesis Streams clusters
  • Developed Spark scripts by using Scala shell commands as per the requirement
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive
  • Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark 1.6 for data aggregation, queries, and writing data back into the OLTP system through Sqoop
  • Implemented Spark using Scala and SparkSQL for faster testing and processing of data
  • Imported large datasets from DB2 to Hive Table using Sqoop
  • Implemented Apache Pig scripts to load data from and store data into Hive
  • Partitioned and bucketed Hive tables and used Snappy compression to load data from Avro Hive tables into Parquet Hive tables (see the sketch after this list)
  • Ran Hive scripts through Hive, Impala, and Hive on Spark, and some through Spark SQL
  • Worked and learned a great deal from AWS Cloud services like EC2, S3, EBS, RDS and VPC
  • Responsible for implementing ETL process through Kinesis Streams-Spark-HBase Integration as per the requirements of customer facing API
  • Worked on Batch processing and real-time data processing on Spark Streaming using Lambda architecture
  • Developed and maintained workflow scheduling jobs in Oozie for importing data from RDBMS to Hive
  • Utilized Spark Core, Spark Streaming and Spark SQL API for faster processing of data instead of using MapReduce in Java
  • Responsible for data extraction and data integration from different data sources into Hadoop Data Lake by creating ETL pipelines Using Spark, MapReduce, Pig, and Hive
  • Fetched live stream data from DB2 into HBase tables using Spark Streaming and Kinesis Streams
  • Loaded data into Spark RDDs and performed in-memory computation to generate the output response
  • Used Spark for interactive queries, processing of streaming data and integration with MongoDB
  • Wrote different pig scripts to clean up the ingested data and created partitions for the daily data
  • Developed Spark programs with Scala to process the complex unstructured and structured data sets
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala, and Python (see the sketch after this list)
  • Analyzed the SQL scripts and designed the solution to implement using Spark
  • Involved in converting MapReduce programs into Spark transformations using Spark RDD in Scala
  • Used Oozie workflows to coordinate Pig and Hive scripts
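
The Parquet/Avro bullet above points here. A minimal PySpark sketch of loading a partitioned, Snappy-compressed Parquet Hive table from an existing Avro-backed Hive table; table and column names are illustrative assumptions.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("avro-to-parquet-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Dynamic partition insert plus Snappy compression for the Parquet output
# (Hive-side property; Spark also has spark.sql.parquet.compression.codec).
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("SET parquet.compression=SNAPPY")

spark.sql("""
    CREATE TABLE IF NOT EXISTS events_parquet (
        event_id BIGINT,
        payload  STRING
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
""")

# events_avro is assumed to be an existing Avro-backed Hive table.
spark.sql("""
    INSERT OVERWRITE TABLE events_parquet PARTITION (event_date)
    SELECT event_id, payload, event_date
    FROM events_avro
""")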
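
The Hive-to-Spark conversion bullet above points to this sketch of rewriting a Hive join-plus-aggregation as RDD transformations; the HiveQL, paths, delimiters, and column positions are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-to-rdd-sketch").getOrCreate()
sc = spark.sparkContext

# HiveQL being converted:
#   SELECT c.region, SUM(o.amount)
#   FROM orders o JOIN customers c ON o.cust_id = c.cust_id
#   GROUP BY c.region
orders = (sc.textFile("hdfs:///data/orders")          # cust_id \t amount
          .map(lambda l: l.split("\t"))
          .map(lambda c: (c[0], float(c[1]))))
customers = (sc.textFile("hdfs:///data/customers")    # cust_id \t region
             .map(lambda l: l.split("\t"))
             .map(lambda c: (c[0], c[1])))

region_totals = (orders.join(customers)                 # (cust_id, (amount, region))
                 .map(lambda kv: (kv[1][1], kv[1][0]))  # (region, amount)
                 .reduceByKey(lambda a, b: a + b))      # SUM(...) GROUP BY region

for region, total in region_totals.collect():
    print(region, total)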

Environment: Hadoop, HDFS, Pig, Sqoop, Shell Scripting, Ubuntu, Linux Red Hat, Spark, Scala, Hortonworks, Cloudera Manager, Apache Yarn.
