
Big Data Engineer Resume


SUMMARY

  • 6+ years of IT experience as a Data Engineer and Big Data Developer, with expertise in data engineering and big data development.
  • Analytically minded self-starter with 6 years of experience collaborating with cross-functional teams and ensuring the accuracy and integrity of data and the actionable insights derived from it.
  • Prepared to collaborate with teams on predictive modeling and insight reporting to improve business efficiency, advance strategic goals, and increase profit.

TECHNICAL SKILLS

  • Leading Software Development Teams
  • Database Design and Management
  • Database Development
  • Hardware and Software Updates
  • SQL Server Database
  • AWS Redshift

PROFESSIONAL EXPERIENCE

Big Data Engineer

Confidential

Responsibilities:

  • Worked with MongoDB using CRUD operations (Create, Read, Update, Delete), indexing, replication, and sharding
  • Designed HBase row keys to store text and JSON as key values and structured the keys so that gets and scans return rows in sorted order
  • Integrated Oozie with the Hadoop stack (MapReduce, Hive, Sqoop, and Pig) and with Unix shell scripts
  • Wrote HiveQL against Hive tables and designed and implemented static and dynamic partitioning and bucketing (see the sketch after this list)
  • Applied caching, persistence, and checkpointing in Spark Streaming applications that consumed near-real-time data from Kinesis Streams and persisted it into Cassandra to build common learning data models
  • Conducted cluster coordination services with Zookeeper and monitored the workload, capacity planning and job performance through Cloudera Manager
  • Built applications by utilizing Maven and integrated with Continuous Integration servers such as Jenkins to build jobs
  • Deployed, configured, and maintained multi-node Dev and Test Kinesis Streams clusters and implemented data ingestion for real-time processing in Kinesis Streams
  • Created cubes in Talend for various aggregation types of data from PostgreSQL and MS SQL server to visualize data
  • Monitored NameNode health and the number of running DataNodes and TaskTrackers, and automated jobs that pulled data from various MySQL sources and pushed result sets to HDFS
  • Created storytelling dashboards in Tableau Desktop, published them to Tableau Server, and used GitHub for version control across projects
  • Deployed Spark applications in Python and utilized Datasets and DataFrames in Spark SQL for faster data processing
  • Loaded transactional data with Sqoop from Teradata, created managed and external tables in Hive, and worked with semi-structured and structured data of 5 Petabytes in size
  • Constructed MapReduce jobs to validate, clean, and access data and worked with Sqoop jobs with incremental load to populate and load into Hive External tables
  • Designed strategies to optimize distribution of weblog data over clusters, in addition to exporting and importing stored web log data into Hive and HDFS through Sqoop
  • Built scalable, distributed data solutions on Hadoop and Cloudera and designed and developed automated test scripts in Python
  • Integrated Apache Storm with Kinesis Streams to perform web analytics and to move clickstream data from Kinesis Streams to HDFS
  • Developed SQL scripts and designed solutions to implement Spark with Hive Generic UDFs for incorporating business logic within Hive queries
  • Developed data pipelines in AWS using S3, EMR, Redshift to extract data from weblogs to store into HDFS
  • Transmitted streaming data from Kinesis Streams to HBase, Hive, and HDFS by integrating Apache Storm and wrote Pig scripts for transforming raw data from various data sources to form baseline data
  • Participated in Agile ceremonies for the Ford Credit Customer Data domain, including daily scrum meetings and sprint planning
  • Expanded and optimized data pipelines and architecture as well as optimized data flow and collection
  • Created pipelines from scratch using PySpark and scheduled jobs using Airflow
  • Stream-processed data in Kinesis Streams and wrote Producer, Consumer, Connector, and Streams API code to handle streams of records, subscribe to topics, consume input, and build reusable producers and consumers
  • Worked with unstructured datasets such as IoT sources, sensor data, XML and JSON document sources
  • Scaled up the architecture with Google Kubernetes Engine and set up load balancing
  • Worked on various optimization techniques in Spark such as cache and persist, accumulators, bucketing and partitioning, garbage collection tuning, data serialization, windowing functions, and broadcast variables (see the sketch after this list)
  • Used numerous Spark transformations such as groupByKey()/reduceByKey(), flatMap(), filter(), sample(), union(), etc
  • Utilized Dataproc to spin up clusters, Dataprep for data analysis, and Dataflow for streaming data using Apache Beam
  • Dealt with VPC controls on Google Cloud Platform and CMEK encryption for data security
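
The partitioning bullet above points here. A minimal PySpark sketch of static and dynamic partition inserts into a bucketed Hive table follows; the table names, columns, and partition values are illustrative assumptions, not details from the project.

from pyspark.sql import SparkSession

# A minimal sketch, assuming Hive support is available to the Spark session.
spark = (SparkSession.builder
         .appName("hive-partitioning-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Dynamic partition inserts need these settings; static partitions do not.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Partitioned and bucketed table (illustrative schema).
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_part (
        order_id BIGINT,
        amount   DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    CLUSTERED BY (order_id) INTO 16 BUCKETS
    STORED AS PARQUET
""")

# Static partition: the partition value is fixed in the statement.
spark.sql("""
    INSERT OVERWRITE TABLE sales_part PARTITION (order_date='2017-01-01')
    SELECT order_id, amount FROM sales_raw WHERE order_date = '2017-01-01'
""")

# Dynamic partition: Hive derives the partition value from the last SELECT column.
spark.sql("""
    INSERT OVERWRITE TABLE sales_part PARTITION (order_date)
    SELECT order_id, amount, order_date FROM sales_raw
""")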
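
The Spark optimization bullet above points to this second sketch, which shows the broadcast-variable, reduceByKey, and cache/persist patterns together; the input path and lookup values are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-tuning-sketch").getOrCreate()
sc = spark.sparkContext

# Broadcast a small lookup table once instead of shipping it with every task.
country_lookup = sc.broadcast({"US": "United States", "DE": "Germany"})

lines = sc.textFile("hdfs:///data/weblogs/*.log")  # hypothetical path

# reduceByKey pre-aggregates on the map side, unlike groupByKey,
# which shuffles every record before aggregating.
counts = (lines
          .map(lambda line: line.split("\t"))
          .filter(lambda cols: len(cols) > 2)
          .map(lambda cols: (country_lookup.value.get(cols[1], "Other"), 1))
          .reduceByKey(lambda a, b: a + b))

# Cache because the result is reused by more than one action.
counts.cache()
print(counts.take(10))
print(counts.count())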

Environment: Hadoop, HDFS, HBase, Spark, Scala, Hive, Kinesis Streams, MapReduce, Sqoop, ETL, Java, Python, PostgreSQL, SQL Server, Teradata, Unix/Linux.

Big Data Engineer

Confidential

Responsibilities:

  • Performed query tuning in HiveQL as well as performance tuning of transformations in PySpark using Spark RDDs and Python
  • Used AWS Lambda functions to create a serverless data intake pipeline on AWS (see the sketch after this list)
  • Utilized Python to construct a Spark Streaming pipeline that received real-time data from Kinesis Streams and stored it in DynamoDB
  • Implemented Apache Spark data processing module to handle data from multiple RDBMS and Streaming sources, then compiled Apache Spark applications using Scala and Python
  • Extensive experience designing and scheduling multiple Spark Streaming / batch Jobs in Python (pyspark) and Scala
  • Achieved high-throughput, scalable, fault-tolerant stream processing of live data streams using Apache Spark Streaming
  • Created and saved DataFrames using various Python modules with PySpark
  • Ingested data from relational databases with Sqoop and ran Hive queries to analyze historical data
  • Experienced with Elastic MapReduce (EMR) and with setting up environments on Amazon EC2 instances for pipelines in AWS
  • Expertise in handling Hive queries through Spark SQL, including window functions and aggregations (see the sketch after this list)
  • Ran Spark applications on Docker using EMR and used AWS Glue data catalog as the metastore in Spark SQL
  • Configured different file formats such as Avro and Parquet for Hive querying and processing based on business logic
  • Utilized Sequence files, RC files, Map side joins, bucketing, partitioning for Hive performance enhancement and storage improvement
  • Implemented Hive UDF to implement business logic and performed extensive data validation using Hive
  • Involved in loading the structured and semi structured data into spark clusters using Spark SQL and Data Frames API
  • Utilized AWS CloudWatch to monitor the performance environment instances for operational and performance metrics during load testing
  • Scripting Hadoop package installation and configuration to support fully automated deployments
  • Involved in Chef infrastructure maintenance, including backups and security fixes on the Chef Server
  • Deployed application updates using Jenkins
  • Installed, configured, and managed Jenkins
  • Triggered the client's SIT environment build remotely through Jenkins
  • Deployed and configured Git repositories with branching, forks, tagging, and notifications
  • Worked on MongoDB database concepts such as locking, transactions, indexes, sharding, replication, and schema design
  • Reviewed selected issues through the SonarQube web interface
  • Developed a fully functional login page for the company's user facing website with complete UI and validations
  • Installed, configured, and utilized AppDynamics (an application performance management tool) across the whole JBoss environment (Prod and Non-Prod)
  • Installed and configured Hive in a Hadoop cluster and assisted business users/application teams in fine-tuning their HiveQL for optimal performance and efficient use of cluster resources
  • Utilized Oozie workflow for ETL Process for critical data feeds across the platform
  • Configured Ethernet bonding for all Nodes to double the network bandwidth
  • Configured Kerberos Security Authentication protocol for existing clusters
  • Set up ZooKeeper Failover Controller (ZKFC) and Quorum Journal Nodes for high availability on major production clusters and enabled automatic failover
  • Installed and deployed multiple Apache Hadoop nodes on AWS EC2, developed Pig Latin scripts to replace the legacy process with Hadoop, and fed data to AWS S3
  • Experience with AWS CloudFront, including the creation and management of distributions that provide access to an S3 bucket or an HTTP server running on EC2 instances
  • Developed Python scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark 1.6 for data aggregation, queries, and writing data back into the OLTP system through Sqoop; also developed an enterprise application using Python
  • Performed Spark application performance optimization, including determining the appropriate batch interval, parallelism level, and memory tuning
  • Worked with on-prem clusters as well as clusters in the cloud and used GCP BigQuery, Data Fusion, Dataflow, Dataproc, and Bigtable
  • Experience and hands-on knowledge in Akka and LIFT Framework
  • Used PostgreSQL and NoSQL databases and integrated them with Hadoop to develop datasets on HDFS
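
The serverless intake bullet above points here. A minimal sketch of an AWS Lambda handler triggered by an S3 put event that forwards each line to a Kinesis stream; the stream name, bucket layout, and record format are assumptions, not details from the project.

import json
import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

STREAM_NAME = "intake-stream"  # assumed name, not from the resume

def lambda_handler(event, context):
    # Each record describes one S3 object that triggered the function.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Forward each line as one Kinesis record (illustrative format).
        for line in body.splitlines():
            kinesis.put_record(
                StreamName=STREAM_NAME,
                Data=json.dumps({"source_key": key, "payload": line}),
                PartitionKey=key,
            )
    return {"status": "ok", "records": len(event["Records"])}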
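
The Spark SQL bullet above points to this sketch of window functions alongside a plain aggregation; the DataFrame and its columns are made up for illustration.

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("window-sketch").getOrCreate()

# Illustrative data: (user_id, event_date, amount).
events = spark.createDataFrame(
    [("u1", "2019-01-01", 10.0), ("u1", "2019-01-02", 20.0), ("u2", "2019-01-01", 5.0)],
    ["user_id", "event_date", "amount"],
)

# Window per user, ordered by date: running total and row rank.
w = Window.partitionBy("user_id").orderBy("event_date")
result = (events
          .withColumn("running_total", F.sum("amount").over(w))
          .withColumn("rank", F.row_number().over(w)))

# Plain aggregation alongside the window columns.
totals = events.groupBy("user_id").agg(F.sum("amount").alias("total_amount"))

result.show()
totals.show()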

Environment: HDFS, Map Reduce, Hive 1.1.0, Kinesis Streams, Hue 3.9.0, Pig, Flume, Oozie, Sqoop, Apache Hadoop 2.6, Spark, SOLR, Storm, Cloudera Manager, Red Hat, MySQL, Prometheus, Docker, Puppet, YARN, Spark-SQL, Python, Amazon AWS, Elastic Search, Tableau, Linux, GCP Big Query, Data Fusion, DataFlow, DataProc, BigTable.

Big Data Developer

Confidential

Responsibilities:

  • Developed NiFi workflows to automate the data movement between different Hadoop systems
  • Configured, deployed, and maintained multi-node Dev and Test Kinesis Streams clusters
  • Developed Spark scripts by using Scala shell commands as per the requirement
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive
  • Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark 1.6 for data aggregation, queries, and writing data back into the OLTP system through Sqoop
  • Implemented Spark using Scala and SparkSQL for faster testing and processing of data
  • Imported large datasets from DB2 to Hive Table using Sqoop
  • Implemented Apache Pig scripts to load data from and store data into Hive
  • Partitioned and bucketed Hive tables and used Snappy compression to load data from Avro Hive tables into Parquet Hive tables (see the sketch after this list)
  • Ran Hive scripts through Hive, Impala, and Hive on Spark, and some through Spark SQL
  • Worked and learned a great deal from AWS Cloud services like EC2, S3, EBS, RDS and VPC
  • Responsible for implementing ETL process through Kinesis Streams-Spark-HBase Integration as per the requirements of customer facing API
  • Worked on Batch processing and real-time data processing on Spark Streaming using Lambda architecture
  • Developed and maintained workflow scheduling jobs in Oozie for importing data from RDBMS to Hive
  • Utilized Spark Core, Spark Streaming and Spark SQL API for faster processing of data instead of using MapReduce in Java
  • Responsible for data extraction and data integration from different data sources into Hadoop Data Lake by creating ETL pipelines Using Spark, MapReduce, Pig, and Hive
  • Fetched live stream data from DB2 into HBase tables using Spark Streaming and Kinesis Streams
  • Loaded data into Spark RDDs and performed in-memory computation to generate the output response
  • Used Spark for interactive queries, processing of streaming data and integration with MongoDB
  • Wrote different pig scripts to clean up the ingested data and created partitions for the daily data
  • Developed Spark programs with Scala to process the complex unstructured and structured data sets
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala, and Python (see the sketch after this list)
  • Analyzed the SQL scripts and designed the solution to implement using Spark
  • Involved in converting MapReduce programs into Spark transformations using Spark RDD in Scala
  • Used Oozie workflows to coordinate Pig and Hive scripts
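
The Parquet/Avro bullet above points here. A minimal PySpark sketch of loading a partitioned, Snappy-compressed Parquet Hive table from an existing Avro-backed Hive table; table and column names are illustrative assumptions.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("avro-to-parquet-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Dynamic partition insert plus Snappy compression for the Parquet output
# (Hive-side property; Spark also has spark.sql.parquet.compression.codec).
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("SET parquet.compression=SNAPPY")

spark.sql("""
    CREATE TABLE IF NOT EXISTS events_parquet (
        event_id BIGINT,
        payload  STRING
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
""")

# events_avro is assumed to be an existing Avro-backed Hive table.
spark.sql("""
    INSERT OVERWRITE TABLE events_parquet PARTITION (event_date)
    SELECT event_id, payload, event_date
    FROM events_avro
""")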
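
The Hive-to-Spark conversion bullet above points to this sketch of rewriting a Hive join-plus-aggregation as RDD transformations; the HiveQL, paths, delimiters, and column positions are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-to-rdd-sketch").getOrCreate()
sc = spark.sparkContext

# HiveQL being converted:
#   SELECT c.region, SUM(o.amount)
#   FROM orders o JOIN customers c ON o.cust_id = c.cust_id
#   GROUP BY c.region
orders = (sc.textFile("hdfs:///data/orders")          # cust_id \t amount
          .map(lambda l: l.split("\t"))
          .map(lambda c: (c[0], float(c[1]))))
customers = (sc.textFile("hdfs:///data/customers")    # cust_id \t region
             .map(lambda l: l.split("\t"))
             .map(lambda c: (c[0], c[1])))

region_totals = (orders.join(customers)                 # (cust_id, (amount, region))
                 .map(lambda kv: (kv[1][1], kv[1][0]))  # (region, amount)
                 .reduceByKey(lambda a, b: a + b))      # SUM(...) GROUP BY region

for region, total in region_totals.collect():
    print(region, total)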

Environment: Hadoop, HDFS, Pig, Sqoop, Shell Scripting, Ubuntu, Linux Red Hat, Spark, Scala, Hortonworks, Cloudera Manager, Apache Yarn.
