Senior Hadoop/Spark Developer Resume
New York
PROFESSIONAL SUMMARY:
- More than 11 years of IT experience, including more than 4.5 years in the Hadoop ecosystem and Spark architecture with Core Java and Scala, and 6 years in ETL, database, and SQL development.
- Excellent understanding and knowledge of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
- Expertise with tools across the Hadoop environment, including HDFS, MapReduce, Hive, HBase, Spark, Spark Streaming, Spark SQL, Sqoop, Kafka, Flume, Oozie, and ZooKeeper.
- Worked with Apache Spark, a fast and general engine for large-scale data processing, integrated with the functional programming language Scala.
- In-depth understanding of Spark architecture, including Spark Core, RDDs, Spark SQL, Spark Streaming, DataFrames, and Datasets.
- Experience in moving streaming data into clusters through Kafka and Spark Streaming.
- Experience in building real-time streaming data pipelines that collect data from different sources using Kafka and store it in HDFS.
- Involved in creating tables, partitioning and bucketing tables, and creating UDFs in Hive, as well as implementing security mechanisms for Hive data.
- Created and ran Sqoop jobs with incremental load to populate Hive external and managed tables.
- In-depth understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance (a brief sketch follows this list).
- Created Hive scripts to build the foundation tables by joining multiple tables.
- Experience in developing Scala scripts to implement Spark jobs that analyze and validate ingested data.
- Experience in using Cloudera Manager for installation and management of single-node and multi-node Hadoop clusters.
- Experience in importing/exporting data with Sqoop between the Hadoop Distributed File System and relational database systems.
- Hands on experience in configuring and working with Flume to load the data from multiple sources directly into HDFS.
- Extensive experience in Core Java application modules using threading, collections, and OOP concepts such as abstraction, inheritance, and polymorphism.
- Experienced in designing, building, and deploying a multitude of applications utilizing the AWS stack (including EC2, S3, and EMR), focusing on high availability, fault tolerance, and auto-scaling.
- Extensively used the JDBC Statement, PreparedStatement, CallableStatement, and ResultSet interfaces for database interaction with the RDBMS backend.
- Extensive work in ETL processes consisting of data sourcing, mapping, transformation, and conversion.
- Created Pig UDFs in Java for data enrichment and custom Hive SerDes in Java.
- Expertise in writing HiveQL and Pig scripts to validate and cleanse data in HDFS obtained from heterogeneous data sources, making it suitable for analysis.
- Experience in configuring, deploying, and managing different Hadoop distributions such as Cloudera and Hortonworks.
- Imported data from source HDFS into Spark RDDs for in-memory computation to generate the output response.
- Strong expertise in using the ETL tool DataStage, Workflow Manager, Repository Manager, Data Quality, and ETL concepts.
- Experience with Hadoop shell commands, writing MapReduce programs, and verifying, managing, and reviewing Hadoop log files.
- Experienced in handling various file formats such as Avro, Sequence, text, XML, and Parquet.
- Worked on Talend jobs using various Big Data components such as tHdfs, tHive, tFileInput, tAggregate, tConvert, tSort, tFilter, tMap, tJoin, tReplace, and different database connections.
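Illustrative sketch of the Hive partitioned external-table pattern referenced above, written against Spark SQL with Hive support; the database, table, column names, and HDFS path are hypothetical placeholders rather than values from the actual projects.

```scala
import org.apache.spark.sql.SparkSession

object HivePartitionedTableSketch {
  def main(args: Array[String]): Unit = {
    // Spark session with Hive support so DDL/DML goes through the Hive metastore
    val spark = SparkSession.builder()
      .appName("hive-partitioned-table-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // External table: the data lives under an explicit HDFS location,
    // so dropping the table leaves the files in place.
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS sales_db.orders_ext (
        |  order_id STRING,
        |  amount   DOUBLE
        |)
        |PARTITIONED BY (order_date STRING)
        |STORED AS PARQUET
        |LOCATION 'hdfs:///data/warehouse/orders_ext'""".stripMargin)

    // Dynamic partitioning: each distinct order_date lands in its own partition
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql(
      """INSERT OVERWRITE TABLE sales_db.orders_ext PARTITION (order_date)
        |SELECT order_id, amount, order_date FROM sales_db.orders_staging""".stripMargin)

    spark.stop()
  }
}
```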
TECHNICAL SKILLS:
Big Data Ecosystem: Apache HDFS, Apache Spark (Spark SQL, Spark Streaming), MapReduce, HBase, Kafka, ZooKeeper, Hive, Pig, Sqoop, Oozie, Flume, NiFi, Ambari, Storm, Impala
Hadoop Distributions: Cloudera (CDH), Hortonworks (HDP)
Languages: Core Java, Scala, SQL
Databases: Oracle, SQL Server, HBase, MongoDB
SDLC Methodologies: Waterfall, Agile Scrum, V-Model
Operating System: Windows, Unix, Linux
PROFESSIONAL EXPERIENCE:
Confidential, New York
Senior Hadoop/Spark Developer
Responsibilities:
- Designed and deployed the Hadoop cluster and various Big Data analytic tools, including Hive, Spark Streaming, Spark SQL, Talend, HBase, Pig, Oozie, Sqoop, Kafka, and Impala, on the Cloudera distribution.
- Implemented Spark applications in Scala, using the DataFrame and Spark SQL APIs for faster testing and processing of data.
- Worked collaboratively to manage build-outs of large data clusters and real-time streaming with Spark.
- Created Hive tables and wrote Hive queries for data analysis to meet business requirements; used Sqoop to import and export data from Oracle and MySQL.
- Used Spark for interactive queries, processing of streaming data, and integration with NoSQL databases handling huge volumes of data.
- Worked with the Spark ecosystem using Scala, Spark SQL, and Hive queries on different data formats such as text, CSV, and Parquet files.
- Performance-tuned Spark jobs by changing configuration properties and using broadcast variables.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and YARN.
- Involved in extracting appropriate features from data sets in order to handle bad, null, and partial records using Spark SQL.
- Collected data from an AWS S3 bucket in near real time using Spark Streaming and performed the necessary transformations.
- Responsible for managing data ingested from different sources.
- Responsible for handling streaming data from web server console logs.
- Involved in converting Hive/SQL queries into Spark transformations using Spark SQL and Scala.
- Worked on HBase to load and retrieve data for real-time processing using a REST API.
- Imported different log files into HDFS using Apache Kafka and performed data analytics using Apache Spark.
- Involved in importing data from various data sources into HDFS using Sqoop, applying transformations using Hive and Apache Spark, and then loading the data into Hive tables.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build a common learner data model that gets data from Kafka in near real time and persists it to HBase (a brief sketch follows this list).
- Collected logs from the physical machines and the OpenStack controller and integrated them into HDFS using Kafka.
- Solved performance issues in Hive and Pig scripts with an understanding of joins, grouping, and aggregation.
- Extensively worked on Jenkins for continuous integration and end-to-end automation of all builds and deployments.
- Worked with cross-functional consulting teams within the data science and analytics team to design, develop, and execute solutions that derive business insights and solve clients' operational and strategic problems.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Involved in loading and transforming large sets of structured, semi-structured, and unstructured data, and analyzed them by running Hive queries.
- Continuously monitored and managed the Hadoop/Spark cluster using Cloudera Manager.
- Created data pipelines per business requirements and scheduled them using Oozie coordinators; created Oozie jobs for workflows of Spark, Sqoop, and shell scripts.
- Loaded and transformed large sets of structured data into the Hadoop cluster using Talend Big Data Studio.
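A minimal sketch of the Kafka-to-HBase streaming flow described in the bullets above, assuming the Spark Streaming Kafka 0-10 direct API and the standard HBase client; the broker address, consumer group, topic, table, and column-family names are placeholders.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object KafkaToHBaseSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-to-hbase"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",           // placeholder broker list
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "learner-model-consumer", // placeholder consumer group
      "auto.offset.reset"  -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("learner-events"), kafkaParams))

    stream.map(record => (record.key, record.value)).foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // Open one HBase connection per partition, not per record
        val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("learner_model")) // placeholder table
        records.foreach { case (key, value) =>
          val rowKey = if (key != null) key else value.hashCode.toString
          val put = new Put(Bytes.toBytes(rowKey))
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes(value))
          table.put(put)
        }
        table.close()
        conn.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```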
Environment: Hadoop (Cloudera CDH), Scala, HDFS, Hue, Hive/HQL, Spark SQL, Spark Streaming, Kafka, Flume, HBase, ZooKeeper, Pig, Sqoop, Oozie, Talend (ETL tool), SQL Server 2012, Java/J2EE, Jenkins, Maven
Confidential, Jersey City, NJ
Big Data/Hadoop Developer
Responsibilities:
- Involved in the complete Big Data flow of the application, from data ingestion from upstream into HDFS to processing and analyzing the data in HDFS.
- Involved in importing data from various data sources into HDFS using Sqoop, applying transformations using Hive and Spark, and then loading the data into Hive tables.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build a common learner data model that gets data from Kafka in near real time and persists it to Cassandra.
- Collected logs from the physical machines and the OpenStack controller and integrated them into HDFS using Kafka.
- Experience in developing Kafka consumers and producers by extending the low-level and high-level consumer and producer APIs.
- Implemented partitioning, dynamic partitioning, and bucketing in Hive using internal and external tables for more efficient data access.
- Used Hive queries to aggregate data and mine information, sorted by volume and grouped by vendor and product.
- Developed Spark applications in Java and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Working knowledge of the Spark Streaming API, which enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
- Used Spark Streaming to bring data into memory, implemented RDD transformations, and performed actions.
- Developed various Kafka producers and consumers for importing transaction logs (a brief sketch follows this list).
- Used ZooKeeper to store the offsets of messages consumed for a specific topic and partition by a specific consumer group in Kafka.
- Used various HBase commands, generated different datasets as per requirements, and controlled access to the data using GRANT and REVOKE.
- Created Sqoop jobs and Pig and Hive scripts for data ingestion from relational databases to compare with historical data.
- Experienced in migrating HiveQL to Impala to minimize query response time, and in handling Hive queries using Spark SQL, which integrates with the Spark environment.
- Worked with different file formats such as text, Avro, and ORC for Hive querying and processing based on business logic.
- Experience in pulling data from an AWS S3 bucket into the data lake, building Hive tables on top of it, and creating DataFrames in Spark for further analysis.
- Involved in loading structured and semi-structured data into Spark clusters using the Spark SQL and DataFrame APIs.
- Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
- Scripted in multiple languages on UNIX, Linux, and Windows, including Batch and shell scripts.
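A hedged sketch of the transaction-log producer pattern mentioned above, using the standard Kafka clients API from Scala; the broker address, topic name, and record contents are illustrative assumptions.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object TransactionLogProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // placeholder broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("acks", "all") // wait for full acknowledgement before treating a send as complete

    val producer = new KafkaProducer[String, String](props)
    try {
      // In the real pipeline, each line of a transaction log would be published like this
      val line = "txn-id=123|amount=42.50|status=OK"
      producer.send(new ProducerRecord[String, String]("transaction-logs", "txn-123", line))
    } finally {
      producer.flush()
      producer.close()
    }
  }
}
```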
Environment: Hadoop (Hortonworks HDP), Java/J2EE, Spark SQL, Spark Streaming, HDFS, YARN, Hive, Sqoop, Pig, Flume, Oozie, HBase, Kafka, Talend (ETL tool), Eclipse, Oracle, PL/SQL, UNIX shell scripting
Confidential, New York, NY
Big Data (Hadoop) Developer
Responsibilities:
- Developed Map/Reduce jobs using Java for data transformations.
- Used Sqoop to export data back to relational databases for business reporting.
- Performed partitioning and bucketing on the log file data to differentiate data on a daily basis and aggregate it based on business requirements.
- Responsible for developing a data pipeline using Sqoop, MapReduce, and Hive to extract data from weblogs and store the results for downstream consumption.
- Developed internal and external tables; used Hive DDL to create, alter, and drop tables.
- Stored non-relational data in MongoDB; wrote services to store and retrieve user data from MongoDB for the application on devices (a brief sketch follows this list).
- Worked intensively on project documentation; maintained technical documentation for the Hive queries and Pig scripts we created.
- Created and developed UNIX shell scripts for generating reports from Hive data.
- Developed Apache Pig scripts to process HDFS data and apply business transformations.
- Developed Sqoop scripts to enable interaction between Pig and the MySQL database.
- Used ZooKeeper and Oozie for coordinating the cluster and scheduling workflows.
- Worked with Flume to load log data from multiple sources directly into HDFS.
- Experienced with file manipulation and advanced research to resolve various problems and correct data integrity for critical Big Data issues in the NoSQL/Hadoop HDFS data stores.
- Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
- Created Hive external tables, loaded data into the tables, and queried the data using HQL.
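For the MongoDB store/retrieve services noted above, a minimal sketch assuming the MongoDB Java sync driver called from Scala; the connection string, database, collection, and field names are hypothetical.

```scala
import com.mongodb.client.MongoClients
import com.mongodb.client.model.Filters
import org.bson.Document

object UserProfileStoreSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder connection string, database, and collection names
    val client     = MongoClients.create("mongodb://localhost:27017")
    val collection = client.getDatabase("appdb").getCollection("user_profiles")

    // Store a non-relational user document
    val profile = new Document("userId", "u-1001")
      .append("device", "tablet")
      .append("preferences", new Document("theme", "dark"))
    collection.insertOne(profile)

    // Retrieve it back for the device-facing service
    val found = collection.find(Filters.eq("userId", "u-1001")).first()
    println(found.toJson)

    client.close()
  }
}
```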
Environment: Hadoop (Cloudera), Java/J2EE, MapReduce, HDFS, Hive, Pig, SQL, Cloudera Manager, Sqoop, Flume, Oozie, MongoDB, Eclipse, Oracle, and Unix/Linux.
Confidential
SSIS ETL Developer
Responsibilities:
- Wrote several SQL scripts, such as finding tables that have identity columns and tables that do not have a primary key.
- Used joins and correlated and non-correlated sub-queries for complex business queries involving multiple tables from different databases, and implemented triggers and stored procedures to enforce business rules via checks and constraints.
- Successfully reduced T-SQL overhead by avoiding unnecessary use of the UNION statement and by using the TOP operator to limit the SELECT statement in certain queries.
- Created views to restrict access to data in a table for security.
- Worked on querying data and creating on-demand reports using Report Builder in SSRS, and sent the reports via email.
- Extracted data using the SSIS 2012 (ETL) tool for the migration of data from legacy systems to modern databases, giving others data in a usable form to make decisions and plan better.
- Implemented performance tuning and debugging of T-SQL stored procedures by analyzing the indexes and filters used in queries for the customer feedback process, which was transferred back overnight by SSIS packages, minimizing downtime and increasing efficiency.
Environment: SSIS, SSRS, SSAS, MS SQL Server 2008/2012, C#, Visual Studio 2012, MS Excel, MS Office 2007.
Confidential, Boston, MA
ETL (DataStage) Developer
Responsibilities:
- Extensively used DataStage as the data migration/transformation tool for the Claims Data Warehouse application.
- Analyzed source systems and created mappings to the target database schema.
- Extracted data from DB2 and Teradata, applied transformations, and loaded the data into an Oracle database.
- Involved in the design, development, and deployment of DataStage Server and PX jobs; used stages such as Sort, Aggregator, Transformer, Link Collector, Link Partitioner, XML Input, XML Output, Pivot, and FTP.
- Involved in extracting sequential files and flat files from different sources.
- Involved in modifying existing DataStage jobs for better performance.
- Redesigned certain stages in existing jobs for optimization in accordance with the framework.
- Developed shell scripts for database backup and recovery; performed physical and logical backups.
Environment: IBM InfoSphere DataStage 8.5, Oracle 10g EE, SQL*Loader, SQL Server 2008, Windows Server 2008, Windows, Confidential - Confidential, Tesseract, Sentinel, Trademapper, Oracle, FIX Protocol
Confidential
ETL (DataStage) Developer
Responsibilities:
- Created several DataStage jobs to populate data into dimension and fact tables; developed jobs in DataStage to load data from various sources using stages such as Transformer, Lookup, Join, Funnel, Aggregator, Copy, Merge, and Switch.
- Modified existing DataStage jobs for performance tuning; replaced the additional hash-file jobs with datasets; utilized the Resource Estimation option in DataStage to check partition information and CPU requirements.
- Conducted performance analysis on DataStage jobs to check job timeline, record throughput, and CPU utilization.
- Modified the configuration file to increase the number of nodes and resource allocation.
- Replaced the Transformer stage with the Copy stage.
- Removed unnecessary stages in the job design and combined their functionality into a single stage.
- Utilized Runtime Column Propagation in jobs where the same column mapping applies from source to target stages.
- Involved in testing of jobs and creating test cases to ensure proper functionality of production interfaces.
- Involved in changing existing DataStage jobs to improve performance in the production environment and ensure data integrity.
- Analyzed requirements and prepared design documents.
- Involved in designing, coding, and unit testing.
Environment: IBM DataStage 8.0.1/8.5, SQL Server 2008, Java, J2EE, HTML, CSS, DHTML, Tomcat Server, Confidential (Content Management Tool), Web services, Oracle, Xidocs, SOAP UI, Unix