Sr. Big Data Engineer Resume
Philadelphia, PA
SUMMARY
- Having 10+ years of professional experience in the fields of software analysis, design, development, deployment, and maintenance of software and Big Data applications.
- Experience in Big Data implementations with strong, hands-on knowledge of major Hadoop ecosystem components and ingestion tools such as Spark, Hive, Sqoop, Flume, Oozie, and Kafka.
- Hands-on experience with Hadoop/Spark distributions: Cloudera and Hortonworks.
- Experience in data cleansing using Spark map and filter functions.
- Experience in designing and developing applications in Spark using Scala.
- Experience migrating MapReduce programs to Spark RDD transformations and actions to improve performance.
- Experience in creating Hive tables and loading data from different file formats.
- Experience in developing and debugging Hive queries.
- Experience in processing data using HiveQL and Pig Latin scripts for data analytics.
- Extending Hive core functionality by writing UDFs for data analysis.
- Experience converting HiveQL/SQL queries into Spark transformations through the Spark RDD and DataFrame APIs in Scala (see the sketch after this summary).
- Used Oozie to manage and schedule Spark jobs on a Hadoop cluster.
- Used the HUE GUI to implement Oozie schedulers and workflows.
- Good experience importing data into and exporting data from Hive and HDFS with Sqoop.
- Experience in using the Producer and Consumer APIs of Apache Kafka.
- Skilled in integrating Kafka with Spark streaming for faster data processing.
- Experience in using Spark Streaming programming model for Real-time data processing.
- Experience working with file formats such as text files, Sequence files, JSON, Parquet, and ORC.
- Extensively used Apache Kafka to collect the logs and error messages across the cluster.
- Excellent knowledge and understanding of Distributed Computing and Parallel processing frameworks.
- Experienced at performing read and write operations on the HDFS file system.
- Experience working with large data sets and making performance improvements.
- Experience working with EC2 (Elastic Compute Cloud) cluster instances, setting up data buckets on S3 (Simple Storage Service), and setting up EMR (Elastic MapReduce).
- Good experience working with Tableau and enabling JDBC/ODBC data connectivity from it to Hive tables.
- Experience creating and driving large scale ETL pipelines.
- Good with version control systems like Git.
- Strong knowledge of UNIX/Linux commands.
- Adequate knowledge of the Python scripting language.
- Adequate knowledge of Scrum, Agile and Waterfall methodologies.
- Highly motivated and committed to the highest levels of professionalism.
- Exhibited strong written and oral communication skills; learn and adapt quickly to emerging technologies and paradigms.
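As a concrete illustration of the HiveQL-to-Spark conversion work referenced above, the sketch below shows how a simple aggregation query maps onto DataFrame transformations in Scala. The table and column names (sales, region, amount) are hypothetical placeholders, not taken from any project listed here.

```scala
// A minimal sketch, not project code: converting a HiveQL aggregation into
// DataFrame transformations. "sales", "region", and "amount" are placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object HiveQlToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hiveql-to-dataframe")
      .enableHiveSupport()          // read the table from the Hive metastore
      .getOrCreate()

    // HiveQL: SELECT region, SUM(amount) AS total FROM sales GROUP BY region
    val totals = spark.table("sales")
      .groupBy("region")
      .agg(sum("amount").alias("total"))

    totals.show()
    spark.stop()
  }
}
```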
TECHNICAL SKILLS
Big Data Technologies: Apache Hadoop, Apache Spark, MapReduce, Apache Hive, Apache Pig, Apache Sqoop, Apache Kafka, Apache Flume, Apache Oozie, Apache ZooKeeper, HDFS
Databases: MySQL, Oracle 11g, DB2
Languages: Scala, Java, Python
Operating Systems: Mac OS, Windows 7/10, Linux (CentOS, Red Hat, Ubuntu).
Development Tools: Apache Tomcat, Eclipse, NetBeans, IntelliJ.
PROFESSIONAL EXPERIENCE
Confidential, Philadelphia, PA
Sr. Big Data Engineer
Responsibilities:
- Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
- Monitored and reviewed Hadoop log files and wrote queries to analyze them.
- Conducted POCs and mock sessions with the client to understand business requirements, and attended defect triage meetings with the UAT and QA teams to ensure defects were resolved in a timely manner.
- Worked with Kafka on a proof of concept for carrying out log processing on a distributed system.
- Analyzed the existing enterprise data warehouse setup and provided design and architecture suggestions for converting it to Hadoop using MRv2, Hive, Sqoop, and Pig Latin.
- Loaded data from different data sources (Teradata and DB2) into HDFS using Sqoop and Flume and loaded it into partitioned Hive tables.
- Developed Hive queries, mappings, tables, and external tables in Hive for analysis across different banners, and worked on partitioning, optimization, compilation, and execution.
- Wrote complex queries to get data into HBase and was responsible for executing Hive queries using the Hive command line and HUE.
- Designed and implemented proprietary data solutions by correlating data from SQL and NoSQL databases using Kafka.
- Used Pig as an ETL tool to perform transformations and some pre-aggregations before storing the analyzed data in HDFS.
- Developed PySpark code to save data in Avro and Parquet formats and to build Hive tables on top of them (a Scala sketch of the Parquet-to-Hive pattern follows this section).
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
- Automated workflows using shell scripts to pull data from various databases into Hadoop.
- Developed Bash scripts to pull log files from an FTP server and process them for loading into Hive tables; all Bash scripts were scheduled using the Resource Manager scheduler.
- Developed Oozie workflows for daily incremental loads, which pull data from Teradata and import it into Hive tables.
- Developed Spark programs using Scala, created Spark SQL queries, and developed Oozie workflows for Spark jobs.
Environment: HDFS, Hadoop 2.x YARN, Teradata, NoSQL, PySpark, MapReduce, Pig, Hive, Sqoop, Spark 2.3, Scala, Oozie, Java, Python, MongoDB, shell and Bash scripting.
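The bullet above on saving data as Parquet and building Hive tables on top describes PySpark work; the following is a minimal Scala sketch of the same pattern, assuming a hypothetical staging path, partition column, and table name (none are from the project).

```scala
// A minimal Scala sketch (the project bullet used PySpark) of writing staged
// data as partitioned Parquet and exposing it as a Hive table. The staging
// path, column, and table names are hypothetical.
import org.apache.spark.sql.SparkSession

object SaveAsParquetHiveTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("save-parquet-hive")
      .enableHiveSupport()
      .getOrCreate()

    // Staged CSV data, e.g. landed in HDFS by Sqoop/Flume (path is illustrative)
    val staged = spark.read.option("header", "true").csv("/data/staging/orders")

    // Write as Parquet partitioned by a date column and register a Hive table
    staged.write
      .mode("overwrite")
      .partitionBy("order_date")    // assumes the staged data has this column
      .format("parquet")
      .saveAsTable("analytics.orders_parquet")

    spark.stop()
  }
}
```

Hive queries can then read analytics.orders_parquet directly, and filters on order_date benefit from partition pruning.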
Confidential, San Francisco, CA
Sr. Spark/Hadoop Developer
Responsibilities:
- Worked on the Cloudera distribution, CDH 5.13.
- Worked with Sqoop to fetch data from RDBMS sources.
- Transformed the ingested data and stored it in DataFrames using Spark SQL.
- Created Hive tables to load the transformed data.
- Performed partitioning and bucketing in Hive for easier data classification.
- Worked on performance tuning and optimization of Hive.
- Converted HiveQL/SQL queries into Spark transformations through the Spark RDD and DataFrame APIs in Scala.
- Exported Spark SQL DataFrames into Hive tables stored as Parquet files.
- Ingested real-time log data from various producers using Kafka.
- Used Spark Streaming to subscribe to the desired topics for real-time processing (see the streaming sketch after this section).
- Transformed the DStreams into DataFrames using the Spark engine.
- Performed Spark application performance tuning, setting the right batch interval, level of parallelism, and memory configuration for optimal efficiency.
- Responsible for performing sort, join, aggregation, filter, and other transformations on the data.
- Appended the DataFrames to pre-existing data in Hive.
- Performed analysis on the Hive tables based on the business logic.
- Created a data pipeline using Oozie workflows that performs jobs on a daily basis.
- Analyzed data by writing queries in HiveQL for faster data processing.
- Persisted metadata into HDFS for further data processing.
- Loaded data from Linux file systems to HDFS and vice versa using shell commands.
- Used GIT as Version Control System.
- Worked with Jenkins for continuous integration.
- Strong experience in building large, responsive REST web applications using the CherryPy framework and Python.
Environment: CDH 5.1, HDFS, Hadoop 3.0, Spark 2.3, Scala, Hive 3.0, Pig, Hue, Oozie, Sqoop, Kafka, Linux shell, Git, Jenkins, Agile.
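The Kafka and Spark Streaming bullets above describe subscribing to log topics and turning DStreams into DataFrames; below is a minimal sketch of that pattern, assuming Spark 2.x with the spark-streaming-kafka-0-10 integration on the classpath. The broker address, topic, group id, and Hive table name are illustrative placeholders, not values from the project.

```scala
// A minimal sketch of consuming Kafka log data with Spark Streaming and
// appending each micro-batch to a Hive table. All names are placeholders.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object KafkaLogStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-log-stream")
    val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second batch interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "log-consumers",
      "auto.offset.reset"  -> "latest"
    )

    // Subscribe to the desired topic(s) for real-time processing
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("app-logs"), kafkaParams))

    // Turn each micro-batch into a DataFrame and append it to a Hive table
    stream.map(_.value).foreachRDD { rdd =>
      val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
      import spark.implicits._
      rdd.toDF("raw_line").write.mode("append").saveAsTable("logs_raw")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```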
Confidential, Charlotte, NC
Sr. Spark/Hadoop Developer
Responsibilities:
- Worked on the Hortonworks enterprise distribution (HDP).
- Worked on large sets of structured and semi-structured historical data.
- Worked with Sqoop to import data from RDBMS sources into Hive.
- Created Hive tables to load the data, stored as ORC files for processing.
- Implemented Hive partitioning and bucketing for further classification of the data.
- Worked on performance tuning and optimization of Hive.
- Involved in cleansing and transforming the data.
- Used Spark SQL to sort, join, and filter the data.
- Copied the ORC files to Amazon S3 buckets using Sqoop for further processing in Amazon EMR.
- Wrote custom UDFs in Spark SQL using Scala (see the UDF sketch after this section).
- Performed data Aggregation operations using Spark SQL queries.
- Copied output data back to Hive from Amazon S3 buckets using Sqoop after getting the output desired by the business.
- Set up Kafka to subscribe to topics (sensors) and load data directly into Hive tables.
- Automated daily filter and join operations that merge new data into the respective Hive tables using Oozie workflows.
- Used Oozie and Oozie coordinators to deploy end-to-end data processing pipelines and to schedule workflows.
- Compared the sensor data against a persisted table over a 24-hour period to check whether the machines were operating at optimal conditions, and used Kafka as a messaging system to notify the producer of that data and the maintenance department when maintenance was required.
- Used Git as Version Control System.
- Worked with Jenkins for continuous integration.
Environment: HDP 2.5, HDFS, Hadoop 2.7, Spark 2.1, Kafka, Amazon S3, EMR, Sqoop, Oozie, Hive 2.1, Pig, Hue, Linux shell, Git, Jenkins, Agile.
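A minimal sketch of the custom Spark SQL UDF work mentioned above, written in Scala. The sensor table, column names, and the 10-90 operating range are hypothetical placeholders used purely for illustration.

```scala
// A minimal sketch of a custom Spark SQL UDF in Scala. The sensor table,
// column names, and the operating range are hypothetical placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object SensorRangeUdf {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-udf")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Flags readings outside an assumed normal operating range
    val isOutOfRange   = (reading: Double) => reading < 10.0 || reading > 90.0
    val outOfRangeUdf  = udf(isOutOfRange)
    spark.udf.register("out_of_range", isOutOfRange)   // makes it usable in SQL

    // DataFrame API usage
    spark.table("sensor_readings")
      .withColumn("alert", outOfRangeUdf($"reading"))
      .show()

    // Spark SQL usage
    spark.sql("SELECT sensor_id, out_of_range(reading) AS alert FROM sensor_readings").show()

    spark.stop()
  }
}
```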
Confidential, Jersey City, NJ
Hadoop Developer
Responsibilities:
- Worked on the Cloudera distribution.
- Responsible for building scalable distributed data solutions using Hadoop; developed simple to complex MapReduce jobs.
- Created and populated bucketed tables in Hive to allow for faster map-side joins, more efficient jobs, and more efficient sampling.
- Used Spark SQL to sort, join, and filter the data (see the sketch after this section).
- Also performed partitioning of the data to optimize Hive queries.
- Handled importing of data from Oracle 11g into Hive tables using Sqoop, Oozie, and Scala on a regular basis, and later performed join operations on the data in Hive.
- Developed user-defined functions in Hive to work on multiple input rows and provide an aggregated result based on the business requirement.
- Wrote custom user-defined counters for the MapReduce jobs to gain further insight and for debugging purposes.
- Developed a MapReduce job to perform lookups of all entries based on a given key from a collection of MapFiles created from the data.
- Performed data Aggregation operations using Spark SQL queries.
- Performed side-data distribution using the distributed cache to make read-only data available to the job while processing the main dataset.
- Used CombineFileInputFormat to make sure mappers had sufficient data to process when there were large numbers of small files, and packaged collections of small files into SequenceFiles used as input to the MapReduce jobs.
- Implemented LZO compression of map output to reduce I/O between mapper and reducer nodes.
- Continuously monitored and managed the Hadoop cluster using the web console.
- Developed Pig Latin scripts to extract the data from the output files to load into HDFS.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs.
Environment: CDH 5.0, HDFS, Hadoop 2.7, MapReduce, Spark 1.6, Hive 1.2, Pig, Hue, Oozie, Sqoop, Scala, Oracle 12c, YARN, Linux shell, Git, Jenkins, Agile.
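A minimal sketch of the Spark SQL sort/join/filter pattern referenced above, in Scala. The orders and customers tables, their columns, and the join key are hypothetical placeholders.

```scala
// A minimal sketch of filtering, joining, and sorting with Spark SQL in Scala.
// Table names, columns, and the join key are placeholders.
import org.apache.spark.sql.SparkSession

object SortJoinFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sort-join-filter")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    val orders    = spark.table("orders")
    val customers = spark.table("customers")

    val result = orders
      .filter($"status" === "OPEN")            // filter
      .join(customers, Seq("customer_id"))     // join on a shared key
      .orderBy($"order_date".desc)             // sort

    result.show()
    spark.stop()
  }
}
```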