
Big Data Engineer Resume


New York

SUMMARY

  • Over 5 years of IT experience as a Big Data Developer with cross-platform integration experience using the Hadoop ecosystem.
  • Strong knowledge of Hadoop architecture and daemons such as HDFS, JobTracker, TaskTracker, NameNode, and DataNode, and of MapReduce concepts.
  • Well versed in implementing end-to-end (E2E) big data solutions using the Hadoop framework.
  • Hands-on experience writing MapReduce programs in Java to handle different data sets using map and reduce tasks.
  • Hands-on experience with SequenceFiles, RCFiles, combiners, counters, dynamic partitions, and bucketing for best practices and performance improvement.
  • Worked with join patterns and implemented map-side and reduce-side joins using MapReduce.
  • Developed multiple MapReduce jobs to perform data cleaning and preprocessing.
  • Designed Hive query scripts to perform data analysis, data transfer, and table design.
  • Experienced in developing data pipelines using Kafka to store data in HDFS.
  • Good knowledge of AWS infrastructure services: Amazon Simple Storage Service (Amazon S3), EMR, and Amazon Elastic Compute Cloud (Amazon EC2).
  • Implemented ad-hoc queries using Hive to perform analytics on structured data.
  • Expertise in writing Hive UDFs and generic UDFs to incorporate complex business logic into Hive queries.
  • Experienced in optimizing Hive queries by tuning configuration parameters.
  • Implemented Sqoop for large dataset transfers between Hadoop and RDBMSs.
  • Extensively used Apache Flume to collect logs and error messages across the cluster.
  • Experienced in performing real-time analytics on HDFS using HBase.
  • Used Cassandra CQL with Java APIs to retrieve data from Cassandra tables.
  • Experience in writing shell scripts to dump shared data from MySQL servers to HDFS.
  • Worked on implementing and optimizing Hadoop/MapReduce algorithms for big data analytics.
  • Worked with Oozie and ZooKeeper to manage job flow and coordination in the cluster.
  • Experience in performance tuning and monitoring the Hadoop cluster, gathering and analyzing metrics on the existing infrastructure using Cloudera Manager.
  • Strong experience and knowledge of real-time data analytics using Spark Streaming, Kafka, and Flume.
  • Good experience in writing Spark applications using Python and Scala.
  • Experience processing Avro data files using Avro tools and MapReduce programs.
  • Implemented built-in Spark operators such as map, flatMap, filter, reduceByKey, groupByKey, aggregateByKey, and combineByKey (see the PySpark sketch after this list).
  • Used sbt to develop Scala-based Spark projects and executed them using spark-submit.
  • Added security to the cluster by integrating Kerberos.
  • Worked on multiple PoCs with Apache NiFi and Python.
  • Worked on different file formats (ORC, text) and different compression codecs (gzip, Snappy, LZO).
  • Worked on Talend Open Studio and Talend Integration Suite.
  • Adequate knowledge and working experience with Agile and waterfall methodologies.
  • Good understanding of all aspects of testing, such as unit, regression, Agile, white-box, and black-box testing.
  • Expert in developing applications using Servlets, JPA, JMS, Hibernate, and Spring frameworks.
  • Extensive experience implementing and consuming REST-based web services.
  • Good knowledge of Web/Application Servers like Apache Tomcat, IBM WebSphere and Oracle WebLogic.
  • Ability to work with onsite and offshore team members.
  • Able to work on own initiative; highly proactive, self-motivated, resourceful, and committed to the work.
  • Strong debugging and critical-thinking ability, with a good understanding of evolving frameworks, methodologies, and strategies.
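
A minimal PySpark sketch of the RDD operators listed above (map, flatMap, filter, reduceByKey); the sample data and application name are illustrative only, not taken from any project described in this resume.

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-operator-sketch")

    lines = sc.parallelize(["a b a", "b c"])              # illustrative sample data

    word_counts = (lines
                   .flatMap(lambda line: line.split())    # flatMap: one line -> many words
                   .map(lambda word: (word, 1))           # map: word -> (word, 1) pair
                   .filter(lambda kv: kv[0] != "c")       # filter: drop unwanted keys
                   .reduceByKey(lambda a, b: a + b))      # reduceByKey: sum counts per key

    print(word_counts.collect())                          # e.g. [('a', 2), ('b', 2)]
    sc.stop()

groupByKey, aggregateByKey, and combineByKey follow the same pair-RDD pattern; reduceByKey is usually preferred over groupByKey because values are combined map-side before the shuffle.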

PROFESSIONAL EXPERIENCE

Confidential, New York

Big Data Engineer

Responsibilities:

  • Worked on developing the architecture document and accompanying guidelines.
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Involved in importing the data from various data sources into HDFS using Sqoop and applying various transformations using Hive, Apache Spark and then loading data into Hive tables or AWS S3 buckets.
  • Involved in moving data from various DB2 tables to AWS S3 buckets using Sqoop process.
  • Configured Splunk alerts to capture log files during execution and store them to an S3 bucket location while the cluster is running.
  • Converted Hive/SQL queries into Spark transformations using Spark RDDs and Python (PySpark).
  • Created a serverless data ingestion pipeline on AWS using Lambda functions (a minimal sketch follows this list).
  • Configured Spark Streaming to receive real-time data from Apache Kafka and store the streamed data in DynamoDB using Scala (see the streaming sketch after this list).
  • Wrote Oozie scripts to schedule and automate jobs on the EMR cluster in AWS.
  • Experienced in creating EMR clusters and deploying code, stored in S3 buckets, onto the cluster.
  • Experienced in using NoMachine and PuTTY to SSH into the EMR cluster and run spark-submit.
  • Experience in developing and scheduling various Spark streaming/batch jobs using Python (PySpark) and Scala.
  • Developed Spark code using PySpark, applying various transformations and actions for faster data processing.
  • Achieved high-throughput, scalable, fault-tolerant stream processing of live data streams using Apache Spark Streaming.
  • Used Spark stream processing in Scala to bring data into memory, created RDDs and DataFrames, and applied transformations and actions.
  • Used various Python libraries with PySpark to create DataFrames and store them in Hive.
  • Sqoop jobs and Hive queries were created for data ingestion from relational databases to analyze historical data.
  • Experience in working with Elastic MapReduce (EMR) and setting up environments on AWS EC2 instances.
  • Knowledge of handling Hive queries using Spark SQL, which integrates with the Spark environment.
  • Executed Hadoop/Spark jobs on AWS EMR using programs stored in S3 buckets.
  • Knowledge of creating user-defined functions (UDFs) in Hive.
  • Worked with different file formats such as Avro and Parquet for Hive querying and processing based on business logic.
  • Involved in pulling data from AWS S3 buckets into the data lake, built Hive tables on top of it, and created DataFrames in Spark to perform further analysis.
  • Worked on SequenceFiles, RCFiles, map-side joins, bucketing, and partitioning for Hive performance enhancement and storage improvement.
  • Implemented Hive UDFs to encapsulate business logic and performed extensive data validation using Hive (a UDF sketch follows this list).
  • Involved in loading structured and semi-structured data into Spark clusters using Spark SQL and the DataFrames API.
  • Developed code that generated various DataFrames based on business requirements and created temporary tables in Hive.
  • Utilized AWS CloudWatch to monitor environment instances for operational and performance metrics during load testing.
  • Experience writing build scripts using sbt and setting up continuous integration with tools like Bamboo.
  • Used Jira to create user stories and created branches in the Bitbucket repositories based on each story.
  • Involved in story-driven agile development methodology and actively participated in daily scrum meetings.
  • Used Bitbucket as the code repository and integrated it with Bamboo for continuous integration.
  • Involved in test-driven development, writing unit and integration test cases for the code.
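
A minimal sketch of a serverless ingestion step like the one mentioned above, assuming a Lambda function invoked with a JSON payload and writing to S3 with boto3; the bucket name, key prefix, and event shape are hypothetical placeholders, not details from the project.

    import json
    import boto3

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        # Hypothetical event shape: {"records": [...]} delivered by the trigger.
        records = event.get("records", [])
        key = f"ingest/batch-{context.aws_request_id}.json"   # one object per invocation
        s3.put_object(
            Bucket="example-ingest-bucket",                    # placeholder bucket name
            Key=key,
            Body=json.dumps(records).encode("utf-8"),
        )
        return {"status": "ok", "written": len(records)}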
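
The Kafka-to-DynamoDB streaming job above was written in Scala; the sketch below shows the same idea in PySpark Structured Streaming, writing each micro-batch to DynamoDB with boto3 inside foreachBatch. Broker address, topic, table name, partition key, and checkpoint path are assumptions for illustration, and collecting each micro-batch to the driver is a simplification.

    import json
    import boto3
    from pyspark.sql import SparkSession

    # Requires the spark-sql-kafka connector package on the Spark classpath.
    spark = SparkSession.builder.appName("kafka-to-dynamodb-sketch").getOrCreate()

    # Read the Kafka topic as a streaming DataFrame (broker/topic are placeholders).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "events")
              .load()
              .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value"))

    def write_batch(batch_df, batch_id):
        # Runs once per micro-batch; writes rows to a DynamoDB table (placeholder name).
        table = boto3.resource("dynamodb").Table("events")
        for row in batch_df.toJSON().collect():               # simplification: driver-side write
            item = json.loads(row)
            item["event_id"] = item.get("key") or str(batch_id)   # assumed partition key
            table.put_item(Item=item)

    query = (events.writeStream
             .foreachBatch(write_batch)
             .option("checkpointLocation", "/tmp/checkpoints/kafka-dynamodb")
             .start())
    query.awaitTermination()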
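
The validation and business-logic UDFs above were Hive UDFs (Java); the sketch below expresses the same idea as a Python UDF registered for Spark SQL, which is not the project's implementation. The function, rule, and table/column names are illustrative assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import BooleanType

    spark = (SparkSession.builder
             .appName("udf-sketch")
             .enableHiveSupport()
             .getOrCreate())

    def is_valid_record(amount):
        # Illustrative validation rule: amount must be present and non-negative.
        return amount is not None and amount >= 0

    # Register the function so it can be called from SQL over Hive tables.
    spark.udf.register("is_valid_record", is_valid_record, BooleanType())

    spark.sql("""
        SELECT order_id, amount
        FROM staging.orders              -- placeholder Hive table
        WHERE is_valid_record(amount)
    """).show()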

Environment: Hadoop, Confluent Kafka, Hortonworks HDF, HDP, NiFi, Linux, Splunk, Java, Puppet, Apache YARN, Pig, Spark, Tableau, Machine Learning.

Confidential, San Francisco, California

Hadoop Developer

Responsibilities:

  • Worked on analyzing Hadoop clusters and different big data analytic tools including Pig, Hive and Sqoop.
  • Developed Spark scripts by using Scala shell commands as per the requirement.
  • Created Spark jobs to see trends in data usage by users.
  • Used Spark and Spark SQL to read Parquet data and create the tables in Hive using the Scala API (see the sketch after this list).
  • Loaded data pipelines from web servers and Teradata using Sqoop together with Kafka and the Spark Streaming API.
  • Developed Kafka pub-sub and Cassandra clients and Spark components on top of HDFS and Hive.
  • Populated HDFS and HBase with huge amounts of data using Apache Kafka.
  • Configured, deployed, and maintained multi-node Dev and Test Kafka clusters.
  • Developed Pig UDFs to pre-process the data for analysis.
  • Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig and HiveQL.
  • Created Hive tables to store data and wrote Hive queries.
  • Extracted the data from Teradata into HDFS using Sqoop.
  • Exported the patterns analyzed back to Teradata using Sqoop.
  • Involved in installing and configuring the Hadoop ecosystem and Cloudera Manager using the CDH4 distribution.
  • Developed Spark code using Scala and Spark SQL for faster processing and testing.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala, and Python.
  • Used Spark API over Hadoop YARN as execution engine for data analytics using Hive.
  • Experienced in building data pipelines using Kafka and Akka to handle terabytes of data.
  • Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
  • Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
  • Developed Scala scripts to extract the data from the web server output files to load into HDFS.
  • Designed and implemented MapReduce jobs to support distributed data processing.
  • Processed large data sets utilizing the Hadoop cluster.
  • Designed NoSQL schemas in HBase.
  • Developed MapReduce ETL in Python/Pig.
  • Involved in data validation using Hive.
  • Imported and exported data between HDFS and relational database systems using Sqoop.
  • Involved in weekly walkthroughs and inspection meetings, to verify the status of the testing efforts and the project.
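
The Parquet-to-Hive work above used the Scala API; a roughly equivalent sketch in PySpark is shown below, where the input path, database, and table names are illustrative placeholders.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("parquet-to-hive-sketch")
             .enableHiveSupport()            # needed so saveAsTable writes to the Hive metastore
             .getOrCreate())

    # Read Parquet files from HDFS and register them as a managed Hive table.
    df = spark.read.parquet("hdfs:///data/landing/events/")     # placeholder path
    df.write.mode("overwrite").saveAsTable("analytics.events")  # placeholder db.table

    # The table is now queryable from Hive or Spark SQL.
    spark.sql("SELECT COUNT(*) AS row_count FROM analytics.events").show()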

Environment: Hadoop, HDFS, Pig, Sqoop, Shell Scripting, Ubuntu, Red Hat Linux, Spark, Scala, Hortonworks, Cloudera Manager, Apache YARN, Python, Machine Learning, NLP (Natural Language Processing)

Confidential, New York

Big Data Developer

Responsibilities:

  • Launched Amazon EC2 cloud instances using Amazon Web Services (Linux/Ubuntu/RHEL) and configured the launched instances for specific applications.
  • Installed and configured Flume, Hive, Pig, Sqoop and Oozie on the Hadoop cluster.
  • Collected and aggregated large amounts of web log data from different sources such as web servers, mobile and network devices using Apache Flume and stored the data into HDFS for analysis.
  • Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and processing.
  • Involved in creating Hive tables, loading data, writing Hive queries, and creating partitions and buckets for optimization.
  • Involved in migrating tables from RDBMS into Hive tables using Sqoop and later generating visualizations using Tableau.
  • Analyzed large data sets by running Hive queries and Pig scripts.
  • Created partitions and buckets based on state for further processing using bucket-based Hive joins (see the sketch after this list).
  • Involved in transferring data from mainframe tables to HDFS and HBase tables using Sqoop.
  • Defined the Accumulo tables and loaded data into them for near real-time data reports.
  • Created the Hive external tables using Accumulo connector.
  • Wrote Hive UDFs to sort struct fields and return complex data types.
  • Used different data formats (text and ORC) while loading the data into HDFS.
  • Involved in creating shell scripts to simplify the execution of all other scripts (Pig, Hive, Sqoop, Impala, and MapReduce) and to move data into and out of HDFS.
  • Created files and tuned SQL queries in Hive using Hue.
  • Experience working with Hive for indexing and querying.
  • Created custom Hive partitions to optimize search matching.
  • Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data.
  • Worked on ingesting data into HBase using the HBase shell as well as the HBase client API.
  • Designed the ETL process and created the high-level design document, including the logical data flows, source data extraction process, database staging, job scheduling, and error handling.
  • Created ETL mappings with Talend Integration Suite to pull data from source systems, apply transformations, and load data into the target database.
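
A sketch of the partition-and-bucket pattern described above, using the PySpark DataFrame writer; the staging table, target table, and column names are placeholders rather than the project's actual schema.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("partition-bucket-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Assumed staging table with order_id, amount, and state columns.
    staged = spark.table("staging_sales")

    # Partition by state so state-level queries prune partitions, and bucket by
    # order_id so bucket-based joins on order_id can avoid a full shuffle.
    (staged.write
     .partitionBy("state")
     .bucketBy(16, "order_id")
     .sortBy("order_id")
     .mode("overwrite")
     .saveAsTable("sales_by_state"))

Note that Spark's bucketing metadata differs from Hive's native bucketed-table layout; for Hive-side bucket-based joins the equivalent DDL would typically be run in Hive itself.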

Environment: Apache Hadoop, HDFS, MapReduce, Sqoop, Flume, Hive, HBase, Oozie, Scala, Spark, Linux.
