
Sr. Big Data Engineer Resume


Philadelphia, PA

SUMMARY

  • 10+ years of professional experience in the analysis, design, development, deployment, and maintenance of software and Big Data applications.
  • Experience in Big Data implementations with strong hands-on knowledge of major Hadoop ecosystem components and ingestion tools, including Hadoop, Spark, Hive, Sqoop, Flume, Oozie, and Kafka.
  • Hands-on experience with Hadoop/Spark distributions: Cloudera and Hortonworks.
  • Experience in data cleansing using Spark map and filter functions.
  • Experience in designing and developing applications in Spark using Scala.
  • Experience migrating MapReduce programs to Spark RDD transformations and actions to improve performance.
  • Experience in creating Hive tables and loading data from different file formats.
  • Experience developing and debugging Hive queries.
  • Experience in processing data using HiveQL and Pig Latin scripts for data analytics.
  • Extended Hive core functionality by writing UDFs for data analysis.
  • Experience converting HiveQL/SQL queries into Spark transformations using the Spark RDD and DataFrame APIs in Scala (a brief sketch follows this summary).
  • Used Oozie to manage and schedule Spark jobs on a Hadoop cluster.
  • Used the Hue GUI to implement Oozie schedulers and workflows.
  • Good experience importing data into and exporting data from Hive and HDFS using Sqoop.
  • Experience using the Producer and Consumer APIs of Apache Kafka.
  • Skilled in integrating Kafka with Spark Streaming for faster data processing.
  • Experience in using Spark Streaming programming model for Real-time data processing.
  • Experience working with file formats such as text, SequenceFile, JSON, Parquet, and ORC.
  • Extensively used Apache Kafka to collect the logs and error messages across the cluster.
  • Excellent knowledge and understanding of Distributed Computing and Parallel processing frameworks.
  • Experienced at performing read and write operations on HDFS file system.
  • Experience working with large data sets and making performance improvements.
  • Experience working with EC2 (Elastic Compute Cloud) cluster instances, setting up data buckets on S3 (Simple Storage Service), and setting up EMR (Elastic MapReduce).
  • Good experience working with Tableau and enabling JDBC/ODBC connectivity from Tableau to Hive tables.
  • Experience creating and driving large scale ETL pipelines.
  • Good with version control systems such as Git.
  • Strong knowledge on UNIX/LINUX commands.
  • Adequate knowledge of the Python scripting language.
  • Adequate knowledge of Scrum, Agile and Waterfall methodologies.
  • Highly motivated and committed to the highest levels of professionalism.
  • Strong written and oral communication skills; learn and adapt quickly to emerging technologies and paradigms.
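As a brief illustration of the HiveQL-to-DataFrame conversion noted above, the following is a minimal Scala sketch; the table, columns, and aggregation are hypothetical examples rather than code from any specific engagement.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HiveQlToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hiveql-to-dataframe")
      .enableHiveSupport()
      .getOrCreate()

    // HiveQL form (hypothetical sales.orders table):
    //   SELECT region, COUNT(1) AS orders, SUM(amount) AS revenue
    //   FROM sales.orders WHERE order_date >= '2023-01-01' GROUP BY region
    // Equivalent DataFrame API form:
    val summary = spark.table("sales.orders")
      .filter(col("order_date") >= lit("2023-01-01"))
      .groupBy("region")
      .agg(count(lit(1)).as("orders"), sum("amount").as("revenue"))

    summary.show()
    spark.stop()
  }
}
```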

TECHNICAL SKILLS

Big Data Technologies: Apache Hadoop, Apache Spark, MapReduce, Apache Hive, Apache Pig, Apache Sqoop, Apache Kafka, Apache Flume, Apache Oozie, Apache ZooKeeper, HDFS

Databases: MySQL, Oracle 11g, DB2

Languages: Scala, Java, Python

Operating Systems: Mac OS, Windows 7/10, Linux (CentOS, Red Hat, Ubuntu)

Development Tools: Apache Tomcat, Eclipse, NetBeans, IntelliJ

PROFESSIONAL EXPERIENCE

Confidential, Philadelphia, PA

Sr. Big Data Engineer

Responsibilities:

  • Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
  • Monitored and reviewed Hadoop log files and wrote queries to analyze them.
  • Conducted POCs and mock sessions with the client to understand business requirements, and attended defect triage meetings with the UAT and QA teams to ensure defects were resolved in a timely manner.
  • Worked with Kafka on a proof of concept for log processing on a distributed system.
  • Analyzed the existing enterprise data warehouse setup and provided design and architecture suggestions for converting it to Hadoop using MRv2, Hive, Sqoop, and Pig Latin.
  • Loaded data from different data sources (Teradata and DB2) into HDFS using Sqoop and Flume, and loaded it into partitioned Hive tables.
  • Developed Hive queries, mappings, tables, and external tables for analysis across different banners, and worked on partitioning, optimization, compilation, and execution.
  • Wrote complex queries to get data into HBase and was responsible for executing Hive queries using the Hive command line and Hue.
  • Designed and implemented proprietary data solutions by correlating data from SQL and NoSQL databases using Kafka.
  • Used Pig as an ETL tool to perform transformations and pre-aggregations before storing the analyzed data in HDFS.
  • Developed PySpark code to save data in Avro and Parquet formats and build Hive tables on top of them.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
  • Automated workflows using shell scripts to pull data from various databases into Hadoop.
  • Developed Bash scripts to pull log files from an FTP server and process them for loading into Hive tables; all Bash scripts were scheduled using the Resource Manager scheduler.
  • Developed Oozie workflows for daily incremental loads, which pull data from Teradata and import it into Hive tables.
  • Developed Spark programs using Scala, created Spark SQL queries, and developed Oozie workflows for Spark jobs (a representative sketch follows this list).
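The Scala Spark jobs noted in the last bullet were scheduled through Oozie; the following is a minimal sketch of such a job under assumed names: a hypothetical partitioned staging table loaded by Sqoop and a hypothetical Parquet-backed target table.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Minimal sketch: read a partitioned Hive staging table, run a Spark SQL
// aggregation, and write the result back to Hive as Parquet.
// Table, column, and database names are hypothetical.
object DailyAggregationJob {
  def main(args: Array[String]): Unit = {
    val runDate = args.headOption.getOrElse("2023-01-01") // passed in by the Oozie workflow

    val spark = SparkSession.builder()
      .appName(s"daily-aggregation-$runDate")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql(s"""
      SELECT account_id, SUM(amount) AS total_amount, COUNT(1) AS txn_count
      FROM staging.transactions
      WHERE load_date = '$runDate'
      GROUP BY account_id
    """)
      .write
      .mode(SaveMode.Overwrite)
      .format("parquet")
      .saveAsTable("analytics.daily_account_summary")

    spark.stop()
  }
}
```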

Environment: HDFS, Hadoop 2.x YARN, Teradata, NoSQL, PySpark, MapReduce, Pig, Hive, Sqoop, Spark 2.3, Scala, Oozie, Java, Python, MongoDB, Shell and Bash scripting.

Confidential, San Francisco, CA

Sr. Spark/Hadoop Developer

Responsibilities:

  • Worked with the Cloudera distribution, CDH 5.13.
  • Worked with Sqoop to fetch data from RDBMS sources.
  • Transformed the ingested data and stored it in DataFrames using Spark SQL.
  • Created Hive tables to load the transformed data.
  • Performed partitioning and bucketing in Hive for easier data classification.
  • Worked on performance tuning and optimization of Hive.
  • Converted HiveQL/SQL queries into Spark transformations using the Spark RDD and DataFrame APIs in Scala.
  • Exported Spark SQL DataFrames into Hive tables stored as Parquet files.
  • Ingested real-time log data from various producers using Kafka.
  • Used Spark Streaming to subscribe to the desired Kafka topics for real-time processing (see the sketch after this list).
  • Transformed the DStreams into DataFrames using the Spark engine.
  • Tuned Spark applications by setting the right batch interval, level of parallelism, and memory configuration for optimal efficiency.
  • Responsible for performing sorts, joins, aggregations, filters, and other transformations on the data.
  • Appended the DataFrames to pre-existing data in Hive.
  • Performed analysis on the Hive tables based on the business logic.
  • Created a data pipeline using Oozie workflows that runs jobs on a daily basis.
  • Analyzed data by writing queries in HiveQL for faster data processing.
  • Persisted metadata into HDFS for further data processing.
  • Loaded data from Linux file systems to HDFS and vice versa using shell commands.
  • Used Git as the version control system.
  • Worked with Jenkins for continuous integration.
  • Strong experience building large, responsive REST web applications using the CherryPy framework and Python.
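The Kafka and Spark Streaming work above can be illustrated with a minimal Scala sketch using the spark-streaming-kafka-0-10 DStream API: subscribe to a topic, convert each micro-batch into a DataFrame, and append it to a Hive table. Broker, topic, and table names are hypothetical.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object KafkaLogStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-log-stream")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // 10-second batch interval, tuned per the batch-interval/parallelism notes above
    val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",      // hypothetical broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "log-consumers",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("app-logs"), kafkaParams)
    )

    // Convert each micro-batch of the DStream into a DataFrame and append it
    // to a pre-existing Hive table (table name hypothetical).
    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        rdd.map(record => record.value)
          .toDF("raw_log")
          .write
          .mode(SaveMode.Append)
          .insertInto("logs.raw_app_logs")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```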

Environment: CDH 5.1, HDFS, Hadoop 3.0, Spark 2.3, Scala, Hive 3.0, Pig, Hue, Oozie, Sqoop, Kafka, Linux shell, Git, Jenkins, Agile.

Confidential, Charlotte, NC

Sr. Spark/Hadoop Developer

Responsibilities:

  • Worked with the Hortonworks enterprise distribution (HDP).
  • Worked on large sets of structured and semi-structured historical data.
  • Worked with Sqoop to import data from RDBMS sources into Hive.
  • Created Hive tables to load the data, stored as ORC files for processing.
  • Implemented Hive partitioning and bucketing for further classification of data.
  • Worked on performance tuning and optimization of Hive.
  • Involved in cleansing and transforming the data.
  • Used Spark SQL to sort, join, and filter the data.
  • Copied the ORC files to Amazon S3 buckets using Sqoop for further processing in Amazon EMR.
  • Wrote custom UDFs in Spark SQL using Scala (see the sketch after this list).
  • Performed data aggregation operations using Spark SQL queries.
  • Copied output data back from Amazon S3 buckets to Hive using Sqoop once the output desired by the business was obtained.
  • Set up Kafka to subscribe to topics (sensors) and load data directly into Hive tables.
  • Automated filter and join operations to join new data with the respective Hive tables using daily Oozie workflows.
  • Used Oozie and Oozie coordinators to deploy end-to-end data processing pipelines and schedule workflows.
  • Compared the sensor data against a persisted table over a 24-hour period to check whether the machines were operating at optimal conditions, and used Kafka as a messaging system to notify the data producer and the maintenance department when maintenance was required.
  • Used Git as the version control system.
  • Worked with Jenkins for continuous integration.
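A minimal sketch of the kind of custom Spark SQL UDF referenced above, written in Scala; the UDF logic, table, and column names are hypothetical examples.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object SensorUdfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sensor-udf-example")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical UDF that normalizes raw sensor identifiers (trim + uppercase)
    val normalize = (id: String) => Option(id).map(_.trim.toUpperCase).orNull
    // Register it for use from Spark SQL / HiveQL query strings as well
    spark.udf.register("normalize_id", normalize)
    val normalizeId = udf(normalize)

    spark.table("sensors.readings")                      // hypothetical ORC-backed table
      .withColumn("sensor_id", normalizeId(col("sensor_id")))
      .write
      .mode("overwrite")
      .format("orc")
      .saveAsTable("sensors.readings_clean")

    spark.stop()
  }
}
```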

Environment: HDP 2.5, HDFS, Hadoop 2.7, Spark 2.1, Kafka, Amazon S3, EMR, Sqoop, Oozie, Hive 2.1, Pig, Hue, Linux shell, Git, Jenkins, Agile.

Confidential, Jersey City, NJ

Hadoop Developer

Responsibilities:

  • Worked with the Cloudera distribution.
  • Responsible for building scalable distributed data solutions using Hadoop; developed simple to complex MapReduce jobs.
  • Created and populated bucketed tables in Hive to allow for faster map-side joins, more efficient jobs, and more efficient sampling.
  • Used Spark SQL to sort, join, and filter the data (a brief sketch follows this list).
  • Also performed partitioning of data to optimize Hive queries.
  • Handled importing of data from Oracle 11g into Hive tables using Sqoop, Oozie, and Scala on a regular basis, and later performed join operations on the data in Hive.
  • Developed user-defined functions in Hive to work on multiple input rows and provide an aggregated result based on the business requirement.
  • Wrote custom user-defined counters for MapReduce jobs to gain further insight and for debugging purposes.
  • Developed a MapReduce job to perform lookups of all entries for a given key from a collection of MapFiles created from the data.
  • Performed data aggregation operations using Spark SQL queries.
  • Performed side-data distribution using the distributed cache to make read-only data available to jobs processing the main dataset.
  • Used CombineFileInputFormat to ensure mappers had sufficient data to process when dealing with large numbers of small files; also packaged collections of small files into SequenceFiles used as input to MapReduce jobs.
  • Implemented LZO compression of map output to reduce I/O between mapper and reducer nodes.
  • Continuously monitored and managed the Hadoop cluster using the web console.
  • Developed Pig Latin scripts to extract data from the output files and load it into HDFS.
  • Installed the Oozie workflow engine to run multiple Hive and Pig jobs.
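A minimal Scala sketch of the Spark SQL sort/join/filter/aggregation work described above, written against the Spark 1.6-era HiveContext API listed in this project's environment; table and column names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.hive.HiveContext

object OrdersAnalysis {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("orders-analysis"))
    val hiveContext = new HiveContext(sc)

    val orders    = hiveContext.table("warehouse.orders")      // hypothetical tables
    val customers = hiveContext.table("warehouse.customers")   // bucketed in Hive

    orders
      .filter(col("status") === "SHIPPED")        // filter
      .join(customers, "customer_id")             // join on the shared key
      .groupBy("customer_id", "customer_name")    // aggregate per customer
      .agg(sum("order_total").as("total_spent"), count(lit(1)).as("order_count"))
      .orderBy(col("total_spent").desc)           // sort
      .write
      .mode("overwrite")
      .saveAsTable("warehouse.customer_order_summary")

    sc.stop()
  }
}
```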

Environment: CDH 5.0, HDFS, Hadoop 2.7, MapReduce, Spark 1.6, Hive 1.2, Pig, Hue, Oozie, Sqoop, Scala, Oracle 12c, YARN, Linux shell, Git, Jenkins, Agile.
