Sr. Big Data Engineer Resume
Philadelphia, PA
SUMMARY
- Having 10+ years of professional experience in the fields of software analysis, design, development, deployment, and maintenance of software and Big Data applications.
- Experience in Big Data implementations with strong, hands-on knowledge of major Hadoop ecosystem components and ingestion tools such as Spark, Hive, Sqoop, Flume, Oozie, and Kafka.
- Hands-on experience with Hadoop/Spark distributions: Cloudera and Hortonworks.
- Experience in data cleansing using Spark map and filter functions.
- Experience in designing and developing applications in Spark using Scala.
- Experience migrating MapReduce programs to Spark RDD transformations and actions to improve performance.
- Experience in creating Hive tables and loading data from different file formats.
- Experience in developing and debugging Hive queries.
- Experience in processing data using HiveQL and Pig Latin scripts for data analytics.
- Extending Hive core functionality by writing UDFs for data analysis.
- Experience converting HiveQL/SQL queries into Spark transformations through the Spark RDD and DataFrame APIs in Scala (see the sketch after this summary).
- Used Oozie to manage and schedule Spark jobs on a Hadoop cluster.
- Used the HUE GUI to implement Oozie schedulers and workflows.
- Good experience importing data into and exporting data from Hive and HDFS with Sqoop.
- Experience in using the Producer and Consumer APIs of Apache Kafka.
- Skilled in integrating Kafka with Spark streaming for faster data processing.
- Experience in using Spark Streaming programming model for Real-time data processing.
- Experience working with file formats such as text files, Sequence files, JSON, Parquet, and ORC.
- Extensively used Apache Kafka to collect the logs and error messages across the cluster.
- Excellent knowledge and understanding of Distributed Computing and Parallel processing frameworks.
- Experienced at performing read and write operations on the HDFS file system.
- Experience working with large data sets and making performance improvements.
- Experience working with EC2 (Elastic Compute Cloud) cluster instances, setting up data buckets on S3 (Simple Storage Service), and setting up EMR (Elastic MapReduce).
- Good experience working with Tableau and enabling JDBC/ODBC data connectivity from it to Hive tables.
- Experience creating and driving large scale ETL pipelines.
- Good with version control systems like Git.
- Strong knowledge of UNIX/Linux commands.
- Adequate knowledge of the Python scripting language.
- Adequate knowledge of Scrum, Agile and Waterfall methodologies.
- Highly motivated and committed to the highest levels of professionalism.
- Exhibited strong written and oral communication skills; learn and adapt quickly to emerging technologies and paradigms.
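As a concrete illustration of the HiveQL-to-Spark conversion work referenced above, the sketch below shows how a simple aggregation query maps onto DataFrame transformations in Scala. The table and column names (sales, region, amount) are hypothetical placeholders, not taken from any project listed here.

```scala
// A minimal sketch, not project code: converting a HiveQL aggregation into
// DataFrame transformations. "sales", "region", and "amount" are placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object HiveQlToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hiveql-to-dataframe")
      .enableHiveSupport()          // read the table from the Hive metastore
      .getOrCreate()

    // HiveQL: SELECT region, SUM(amount) AS total FROM sales GROUP BY region
    val totals = spark.table("sales")
      .groupBy("region")
      .agg(sum("amount").alias("total"))

    totals.show()
    spark.stop()
  }
}
```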
TECHNICAL SKILLS
Big Data Technologies: Apache Hadoop, Apache Spark, MapReduce, Apache Hive, Apache Pig, Apache Sqoop, Apache Kafka, Apache Flume, Apache Oozie, Apache ZooKeeper, HDFS
Databases: MySQL, Oracle 11g, DB2
Languages: Scala, Java, Python
Operating Systems: Mac OS, Windows 7/10, Linux (CentOS, Red Hat, Ubuntu).
Development Tools: Apache Tomcat, Eclipse, NetBeans, IntelliJ.
PROFESSIONAL EXPERIENCE
Confidential, Philadelphia, PA
Sr. Big Data Engineer
Responsibilities:
- Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
- Monitored and reviewed Hadoop log files and wrote queries to analyze them.
- Conducted POCs and mock sessions with the client to understand business requirements, and attended defect triage meetings with the UAT and QA teams to ensure defects were resolved in a timely manner.
- Worked with Kafka on a proof of concept for carrying out log processing on a distributed system.
- Analyzed the existing enterprise data warehouse setup and provided design and architecture suggestions for converting it to Hadoop using MRv2, Hive, Sqoop, and Pig Latin.
- Loaded data from different data sources (Teradata and DB2) into HDFS using Sqoop and Flume and loaded it into partitioned Hive tables.
- Developed Hive queries, mappings, tables, and external tables in Hive for analysis across different banners, and worked on partitioning, optimization, compilation, and execution.
- Wrote complex queries to get data into HBase and was responsible for executing Hive queries using the Hive command line and HUE.
- Designed and implemented proprietary data solutions by correlating data from SQL and NoSQL databases using Kafka.
- Used Pig as an ETL tool to perform transformations and some pre-aggregations before storing the analyzed data in HDFS.
- Developed PySpark code to save data in Avro and Parquet formats and to build Hive tables on top of them (a Scala sketch of the Parquet-to-Hive pattern follows this section).
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
- Automated workflows using shell scripts to pull data from various databases into Hadoop.
- Developed Bash scripts to pull log files from an FTP server and process them for loading into Hive tables; all Bash scripts were scheduled using the Resource Manager scheduler.
- Developed Oozie workflows for daily incremental loads, which pull data from Teradata and import it into Hive tables.
- Developed Spark programs using Scala, created Spark SQL queries, and developed Oozie workflows for Spark jobs.
Environment: HDFS, Hadoop 2.x YARN, Teradata, NoSQL, PySpark, MapReduce, Pig, Hive, Sqoop, Spark 2.3, Scala, Oozie, Java, Python, MongoDB, shell and Bash scripting.
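The bullet above on saving data as Parquet and building Hive tables on top describes PySpark work; the following is a minimal Scala sketch of the same pattern, assuming a hypothetical staging path, partition column, and table name (none are from the project).

```scala
// A minimal Scala sketch (the project bullet used PySpark) of writing staged
// data as partitioned Parquet and exposing it as a Hive table. The staging
// path, column, and table names are hypothetical.
import org.apache.spark.sql.SparkSession

object SaveAsParquetHiveTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("save-parquet-hive")
      .enableHiveSupport()
      .getOrCreate()

    // Staged CSV data, e.g. landed in HDFS by Sqoop/Flume (path is illustrative)
    val staged = spark.read.option("header", "true").csv("/data/staging/orders")

    // Write as Parquet partitioned by a date column and register a Hive table
    staged.write
      .mode("overwrite")
      .partitionBy("order_date")    // assumes the staged data has this column
      .format("parquet")
      .saveAsTable("analytics.orders_parquet")

    spark.stop()
  }
}
```

Hive queries can then read analytics.orders_parquet directly, and filters on order_date benefit from partition pruning.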
Confidential, San Francisco, CA
Sr. Spark/Hadoop Developer
Responsibilities:
- Worked on the Cloudera distribution, CDH 5.13.
- Worked with Sqoop to fetch data from RDBMS sources.
- Transformed the ingested data and stored it in DataFrames using Spark SQL.
- Created Hive tables to load the transformed data.
- Performed partitioning and bucketing in Hive for easier data classification.
- Worked on performance tuning and optimization of Hive.
- Converted HiveQL/SQL queries into Spark transformations through the Spark RDD and DataFrame APIs in Scala.
- Exported Spark SQL DataFrames into Hive tables stored as Parquet files.
- Ingested real-time log data from various producers using Kafka.
- Used Spark Streaming to subscribe to the desired topics for real-time processing (see the streaming sketch after this section).
- Transformed the DStreams into DataFrames using the Spark engine.
- Performed Spark application performance tuning, setting the right batch interval, level of parallelism, and memory configuration for optimal efficiency.
- Responsible for performing sort, join, aggregation, filter, and other transformations on the data.
- Appended the DataFrames to pre-existing data in Hive.
- Performed analysis on the Hive tables based on the business logic.
- Created a data pipeline using Oozie workflows that performs jobs on a daily basis.
- Analyzed data by writing queries in HiveQL for faster data processing.
- Persisted metadata into HDFS for further data processing.
- Loaded data from Linux file systems to HDFS and vice versa using shell commands.
- Used GIT as Version Control System.
- Worked with Jenkins for continuous integration.
- Strong experience in building large, responsive REST web applications using the CherryPy framework and Python.
Environment: CDH 5.1, HDFS, Hadoop 3.0, Spark 2.3, Scala, Hive 3.0, Pig, Hue, Oozie, Sqoop, Kafka, Linux shell, Git, Jenkins, Agile.
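The Kafka and Spark Streaming bullets above describe subscribing to log topics and turning DStreams into DataFrames; below is a minimal sketch of that pattern, assuming Spark 2.x with the spark-streaming-kafka-0-10 integration on the classpath. The broker address, topic, group id, and Hive table name are illustrative placeholders, not values from the project.

```scala
// A minimal sketch of consuming Kafka log data with Spark Streaming and
// appending each micro-batch to a Hive table. All names are placeholders.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object KafkaLogStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-log-stream")
    val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second batch interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "log-consumers",
      "auto.offset.reset"  -> "latest"
    )

    // Subscribe to the desired topic(s) for real-time processing
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("app-logs"), kafkaParams))

    // Turn each micro-batch into a DataFrame and append it to a Hive table
    stream.map(_.value).foreachRDD { rdd =>
      val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
      import spark.implicits._
      rdd.toDF("raw_line").write.mode("append").saveAsTable("logs_raw")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```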
Confidential, Charlotte, NC
Sr. Spark/Hadoop Developer
Responsibilities:
- Worked on the Hortonworks enterprise distribution (HDP).
- Worked on large sets of structured and semi-structured historical data.
- Worked with Sqoop to import data from RDBMS sources into Hive.
- Created Hive tables to load the data, stored as ORC files for processing.
- Implemented Hive partitioning and bucketing for further classification of the data.
- Worked on performance tuning and optimization of Hive.
- Involved in cleansing and transforming the data.
- Used Spark SQL to sort, join, and filter the data.
- Copied the ORC files to Amazon S3 buckets using Sqoop for further processing in Amazon EMR.
- Wrote custom UDFs in Spark SQL using Scala (see the UDF sketch after this section).
- Performed data Aggregation operations using Spark SQL queries.
- Copied output data back to Hive from Amazon S3 buckets using Sqoop after getting the output desired by the business.
- Set up Kafka to subscribe to topics (sensors) and load data directly into Hive tables.
- Automated daily filter and join operations that merge new data into the respective Hive tables using Oozie workflows.
- Used Oozie and Oozie coordinators to deploy end-to-end data processing pipelines and to schedule workflows.
- Compared the sensor data against a persisted table over a 24-hour period to check whether the machines were operating at optimal conditions, and used Kafka as a messaging system to notify the producer of that data and the maintenance department when maintenance was required.
- Used Git as Version Control System.
- Worked with Jenkins for continuous integration.
Environment: HDP 2.5, HDFS, Hadoop 2.7, Spark 2.1, Kafka, Amazon S3, EMR, Sqoop, Oozie, Hive 2.1, Pig, Hue, Linux shell, Git, Jenkins, Agile.
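A minimal sketch of the custom Spark SQL UDF work mentioned above, written in Scala. The sensor table, column names, and the 10-90 operating range are hypothetical placeholders used purely for illustration.

```scala
// A minimal sketch of a custom Spark SQL UDF in Scala. The sensor table,
// column names, and the operating range are hypothetical placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object SensorRangeUdf {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-udf")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Flags readings outside an assumed normal operating range
    val isOutOfRange   = (reading: Double) => reading < 10.0 || reading > 90.0
    val outOfRangeUdf  = udf(isOutOfRange)
    spark.udf.register("out_of_range", isOutOfRange)   // makes it usable in SQL

    // DataFrame API usage
    spark.table("sensor_readings")
      .withColumn("alert", outOfRangeUdf($"reading"))
      .show()

    // Spark SQL usage
    spark.sql("SELECT sensor_id, out_of_range(reading) AS alert FROM sensor_readings").show()

    spark.stop()
  }
}
```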
Confidential, Jersey City, NJ
Hadoop Developer
Responsibilities:
- Worked on the Cloudera distribution.
- Responsible for building scalable distributed data solutions using Hadoop; developed simple to complex MapReduce jobs.
- Created and populated bucketed tables in Hive to allow for faster map-side joins, more efficient jobs, and more efficient sampling.
- Used Spark SQL to sort, join, and filter the data (see the sketch after this section).
- Also performed partitioning of the data to optimize Hive queries.
- Handled importing of data from Oracle 11g into Hive tables using Sqoop, Oozie, and Scala on a regular basis, and later performed join operations on the data in Hive.
- Developed user-defined functions in Hive to work on multiple input rows and provide an aggregated result based on the business requirement.
- Wrote custom user-defined counters for the MapReduce jobs to gain further insight and for debugging purposes.
- Developed a MapReduce job to perform lookups of all entries based on a given key from a collection of MapFiles created from the data.
- Performed data Aggregation operations using Spark SQL queries.
- Performed side-data distribution using the distributed cache to make read-only data available to the job while processing the main dataset.
- Used CombineFileInputFormat to make sure mappers had sufficient data to process when there were large numbers of small files, and packaged collections of small files into SequenceFiles used as input to the MapReduce jobs.
- Implemented LZO compression of map output to reduce I/O between mapper and reducer nodes.
- Continuously monitored and managed the Hadoop cluster using the web console.
- Developed Pig Latin scripts to extract the data from the output files to load into HDFS.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs.
Environment: CDH 5.0, HDFS, Hadoop 2.7, MapReduce, Spark 1.6, Hive 1.2, Pig, Hue, Oozie, Sqoop, Scala, Oracle 12c, YARN, Linux shell, Git, Jenkins, Agile.
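A minimal sketch of the Spark SQL sort/join/filter pattern referenced above, in Scala. The orders and customers tables, their columns, and the join key are hypothetical placeholders.

```scala
// A minimal sketch of filtering, joining, and sorting with Spark SQL in Scala.
// Table names, columns, and the join key are placeholders.
import org.apache.spark.sql.SparkSession

object SortJoinFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sort-join-filter")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    val orders    = spark.table("orders")
    val customers = spark.table("customers")

    val result = orders
      .filter($"status" === "OPEN")            // filter
      .join(customers, Seq("customer_id"))     // join on a shared key
      .orderBy($"order_date".desc)             // sort

    result.show()
    spark.stop()
  }
}
```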