Hadoop Developer Resume
New York, New York
SUMMARY:
- 8+ years of IT experience in analysis, design, development and implementation of large-scale applications using Big Data and Java/J2EE technologies such as Apache Spark, Hadoop, Hive, Pig, Sqoop, Oozie, HBase, ZooKeeper, Python & Scala.
- Strong experience writing Spark Core, Spark SQL, Spark Streaming, Java MapReduce and Spark-on-Java applications.
- Experienced in the analytical functions of Apache Spark, Hive and Pig, and in extending their functionality by writing custom UDFs and hooking UDFs into larger Spark applications to be used as in-line functions.
- Experience with installation, backup, recovery, configuration and development on multiple Hadoop distribution platforms (Cloudera and Hortonworks), including the cloud platforms Amazon AWS and Google Cloud.
- Highly skilled in optimizing and moving large-scale pipeline applications from on-premises clusters to the AWS Cloud.
- Working knowledge of spinning up, configuring and maintaining long-running Amazon EMR clusters manually as well as through CloudFormation scripts on Amazon AWS.
- Experienced in building frameworks for Large scale streaming applications in Apache Spark.
- Worked on migrating Hadoop MapReduce programs to Apache Spark on Scala.
- Extensive hands-on knowledge of working on the Amazon AWS and Google Cloud Architecture.
- Highly skilled in integrating Amazon Kinesis streams with Spark Streaming applications to build long running real-time applications.
- Experienced in configuring Kinesis shards for optimal throughput in Kinesis Streams for Spark Streaming applications on AWS.
- Solid understanding of RDD operations in Apache Spark i.e., Transformations & Actions, Persistence (Caching), Accumulators, Broadcast Variables, Optimizing Broadcasts.
- In-depth knowledge of handling large amounts of data utilizing the Spark DataFrames/Datasets API and case classes.
- Experienced in running queries using Impala and using BI tools to run ad-hoc queries directly on Hadoop.
- Working knowledge of utilizing Hadoop file formats such as Sequence, ORC, Avro, Parquet as well as open source Text/CSV and JSON formatted files.
- In-depth knowledge of the Big Data architecture along with its various components of Hadoop 1.X and 2.X such as HDFS, Job Tracker, Task Tracker, Data Node, Name Node and YARN concepts such as Resource Manager, Node Manager.
- Hands-on experience with AWS cloud services (VPC, EC2, S3, RDS, Glue, Redshift, Data Pipeline, EMR, DynamoDB, WorkSpaces, Lambda, Kinesis, SNS, SQS).
- Experience writing HiveQL and Pig Latin scripts, leading to a good understanding of MapReduce design patterns and data analysis using Hive and Pig.
- Great knowledge of working with Apache Spark Streaming API on Big Data Distributions in an active cluster environment.
- Very capable at using AWS utilities such as EMR, S3 and CloudWatch to run and monitor Hadoop/Spark jobs on AWS.
- Very well versed in writing and deploying Oozie workflows and coordinators, and in scheduling, monitoring and troubleshooting them through the Hue UI.
- Proficient in importing and exporting data from Relational Database Systems to HDFS and vice versa, using Sqoop.
- Good understanding of NoSQL databases such as HBase and Cassandra (column-family) and MongoDB (document-oriented) in enterprise use cases.
- Very capable in processing of large sets of structured, semi-structured and unstructured data and supporting system application architecture in Hadoop, Spark and SQL databases such as Teradata, MySQL, DB2.
- Working experience in Impala, Mahout, Spark, Storm, Avro, Kafka, Hue and AWS.
- Experience with installing, backup, recovery, configuration and development on multiple Hadoop distribution platforms like Hortonworks Distribution Platform (HDP), Cloudera Distribution for Hadoop (CDH).
- Experienced in version control and source code management tools like Git, SVN, and Bitbucket.
- Software development experience in Java applications and client/server applications, implementing application environments using MVC, J2EE, JDBC, JSP, XML technologies (XML, XSL, XSD), Web Services, relational databases and NoSQL databases.
- Hands-on experience in application development using Java, RDBMS, Linux shell scripting and Perl.
- Hands-on experience working with IDEs such as Eclipse, IntelliJ, NetBeans and Visual Studio, and with tools such as Git and Maven; experienced in writing cohesive E2E applications on Apache Zeppelin.
- Experience working in Waterfall and Agile/Scrum methodologies.
- Ability to adapt to evolving technologies, a strong sense of responsibility and accomplishment.
PROFESSIONAL EXPERIENCE:
Confidential, New York, New York
Hadoop Developer
Responsibilities:
- Worked on developing the architecture document and proper guidelines.
- Responsible for installation and configuration of Hadoop ecosystem components using the CDH 5.2 distribution.
- Responsible for managing data coming from different sources and involved in HDFS maintenance and loading of structured and unstructured data.
- Processed multiple data source inputs in the same reducer using GenericWritable and MultipleInputs.
- Worked on big data processing of clinical and non-clinical data using MapReduce.
- Visualized HDFS data for customers using a BI tool with the help of the Hive ODBC driver.
- Customized a BI tool for the management team to perform query analytics using HiveQL.
- Used Sqoop to import data from MySQL into HDFS on a regular basis.
- Created partitions and buckets based on state for further processing using bucket-based Hive joins.
- Created Hive generic UDFs to process business logic that varies based on policy.
- Moved relational database data into Hive dynamic-partition tables using Sqoop and staging tables.
- Monitored the cluster using Cloudera Manager.
- Involved in discussions with business users to gather the required knowledge.
- Capable of creating real time data streaming solutions and batch style large scale distributed computing applications using Apache Spark, Spark Streaming, Kafka and Flume.
- Analyzed the requirements to develop the framework.
- Designed and developed architecture for data services ecosystem spanning Relational, NoSQL and Big Data technologies.
- Loaded and transformed large sets of structured, semi structured and unstructured data using Hadoop/Big Data concepts.
- Developed Java Spark Streaming scripts to load raw files and corresponding processed metadata files into AWS S3 and an Elasticsearch cluster.
- Developed Python Scripts to get the recent S3 keys from Elasticsearch.
- Developed Python scripts to fetch S3 files using the Boto3 module.
- Implemented PySpark logic to transform and process various formats of data like XLSX, XLS, JSON, TXT.
- Built scripts to load PySpark-processed files into Redshift and used diverse PySpark logic.
- Developed scripts to monitor and capture the state of each file as it is being processed.
- Developed Map Reduce programs to cleanse the data in HDFS obtained from heterogeneous data sources.
- Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs and used Oozie Operational Services for batch processing and scheduling workflows dynamically.
- Migrated existing applications and developed new applications using AWS cloud services.
- Worked with data investigation, discovery and mapping tools to scan every single data record from many sources.
- Implemented shell scripts to automate the whole process.
- Integrated Apache Storm with Kafka to perform web analytics.
- Uploaded click stream data from Kafka to HDFS, HBase, and Hive by integrating with Storm.
- Extracted data from SQL Server to create automated visualization reports and dashboards on Tableau.
- Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, and managing and reviewing data backups and log files.
Environment: AWS S3, Java, Maven, Python, Spark, Kafka, Elasticsearch, MapR Cluster, Amazon Redshift DB, Shell script, pandas, PySpark, Pig, Hive, Oozie, JSON.
Confidential, New York, New York
Hadoop Developer
Responsibilities:
- Involved in the complete project life cycle, from design discussion to production deployment.
- Worked closely with the business team to gather their requirements and new support features.
- Involved in running POCs on different use cases of the application and maintained a standard document for best coding practices.
- Built a 200-node cluster while designing the data lake with the Hortonworks distribution.
- Responsible for building scalable distributed data solutions using Hadoop.
- Installed, configured and implemented high-availability Hadoop clusters with the required services (HDFS, Hive, HBase, Spark, ZooKeeper).
- Implemented Kerberos for authenticating all the services in the Hadoop cluster.
- Responsible for installation and configuration of Hive, Pig, HBase and Sqoop on the Hadoop cluster and created Hive tables to store the processed results in a tabular format.
- Configured Spark Streaming to receive real-time data from Apache Kafka and store the stream data in HDFS using Scala.
- Developed Sqoop scripts to move data between Hive and the Vertica database.
- Processed data into HDFS by developing solutions and analyzed the data using MapReduce, Pig and Hive to produce summary results from Hadoop for downstream systems.
- Built servers using AWS: importing volumes, launching EC2 instances, creating security groups, auto-scaling, load balancers, Route 53, SES and SNS in the defined virtual private cloud.
- Wrote MapReduce code to process and parse data from various sources and store the parsed data in HBase and Hive using HBase-Hive integration.
- Streamed an AWS log group into a Lambda function to create ServiceNow incidents.
- Involved in loading and transforming large sets of Structured, Semi-Structured and Unstructured data and analyzed them by running Hive queries and Pig scripts.
- Created Managed tables and External tables in Hive and loaded data from HDFS.
- Developed Spark code by using Scala and Spark-SQL for faster processing and testing and performed complex HiveQL queries on Hive tables.
- Scheduled several time-based Oozie workflows by developing Python scripts.
- Developed Pig Latin scripts using operators such as LOAD, STORE, DUMP, FILTER, DISTINCT, FOREACH, GENERATE, GROUP, COGROUP, ORDER, LIMIT, UNION, SPLIT to extract data from data files to load into HDFS.
- Exported data to RDBMS servers using Sqoop and processed that data for ETL operations.
- Worked on S3 buckets on AWS to store CloudFormation templates and worked on AWS to create EC2 instances.
- Designed an ETL data pipeline flow to ingest data from RDBMS sources into Hadoop using shell scripts, Sqoop, package and MySQL.
- Delivered end-to-end architecture and implementation of client-server systems on Linux using Scala, Akka, Java, JavaScript and related technologies.
- Optimized the Hive tables using optimization techniques like partitioning and bucketing to provide better performance.
- Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig and Sqoop.
- Implemented Hadoop on AWS EC2 using a few instances to gather and analyze data log files.
- Worked with Spark and Spark Streaming, creating RDDs and applying operations (transformations and actions).
- Created partitioned tables and loaded data using both static partition and dynamic partition method.
- Developed custom Apache Spark programs in Scala to analyze and transform unstructured data.
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS and extracted data from Oracle into HDFS using Sqoop.
- Used Kafka for publish-subscribe messaging as a distributed commit log, gaining experience with its speed, scalability and durability.
- Followed a Test-Driven Development (TDD) process, with extensive experience in Agile and Scrum methodologies.
- Implemented a POC to migrate MapReduce jobs into Spark RDD transformations using Scala.
- Scheduled MapReduce jobs in the production environment using the Oozie scheduler.
- Involved in cluster maintenance, cluster monitoring and troubleshooting, and managing and reviewing data backups and log files.
- Designed and implemented MapReduce jobs to support distributed processing using Java, Hive and Apache Pig.
- Analyzed the Hadoop cluster and different Big Data analytics tools including Pig, Hive, HBase and Sqoop.
- Improved performance by tuning Hive and MapReduce.
- Researched, evaluated and utilized modern technologies, tools and frameworks around the Hadoop ecosystem.
Environment: HDFS, MapReduce, Hive, Sqoop, Pig, Flume, Vertica, Oozie Scheduler, Java, Shell Scripts, Teradata, Oracle, HBase, MongoDB, Cassandra, Cloudera, AWS, JavaScript, JSP, Kafka, Spark, Scala, ETL, Python.
Confidential, Overland Park, KS
Hadoop Developer
Responsibilities:
- Provided application demo to the client by designing and developing a search engine, report analysis trends, application administration prototype screens using AngularJS, and Bootstrap JS.
- Took ownership of the complete application design for the Java part and the Hadoop integration.
- Apart from the normal requirement gathering, participated in business meetings with the client to gather security requirements.
- Assisted the architect in analyzing the existing system and the future system.
- Prepared design blueprints and application flow documentation.
- Experienced in managing and reviewing Hadoop log files.
- Loaded and transformed large sets of structured, semi-structured and unstructured data.
- Responsible for managing data coming from different sources and applications.
- Supported MapReduce programs running on the cluster.
- Responsible for working with message broker systems such as Kafka.
- Extracted data from mainframes, fed it to Kafka and ingested it into HBase to perform analytics.
- Wrote an event-driven link-tracking system to capture user events and feed them to Kafka for pushing to HBase.
- Created MapReduce jobs to extract content from HBase and configured them in Oozie workflows to generate analytical reports.
- Worked on setting up Kafka for streaming data and on monitoring the Kafka cluster.
- Responsible for importing log files from various sources into HDFS using Flume.
- Participated in Solr schema design and ingested data into Solr for indexing.
- Wrote MapReduce programs to organize the data and convert it into a client-specified format suitable for analytics.
- Hands-on experience in writing Python scripts to optimize performance.
- Implemented Storm builder topologies to perform cleansing operations before moving data into Cassandra.
- Extracted files from Cassandra through Sqoop, placed them in HDFS and processed them.
- Implemented Bloom filters in Cassandra using keyspace creation.
- Involved in writing Cassandra CQL statements.
- Good hands-on experience in developing concurrency using Spark and Cassandra together.
- Involved in writing Spark applications using Scala.
- Hands-on experience in creating RDDs, transformations and actions while implementing Spark applications.
- Good knowledge of creating DataFrames using Spark SQL.
- Involved in loading data into the Cassandra NoSQL database.
- Implemented record-level atomicity on writes using Cassandra.
- Wrote Pig scripts to query and process the datasets to identify trend patterns by applying client-specific criteria, and configured Oozie workflows to run the jobs along with the MR jobs.
- Stored the results derived from the analysis in HBase and made them available for ingestion into Solr for indexing.
- Involved in integration of the Java search UI, Solr and HDFS.
- Involved in code deployments using the continuous integration tool Jenkins.
- Documented all the challenges and issues involved in dealing with the security system and implemented best practices.
- Created project structures and configurations according to the project architecture and made them available to junior developers to continue their work.
- Handled the onsite coordinator role to deliver work to offshore.
- Involved in code reviews and application lead support activities.
- Implemented Spark RDD transformations to map business analysis and applied actions on top of the transformations.
- The objective of this project was to build a data lake as a cloud-based solution in AWS using Apache Spark.
- Involved in creating Hive tables, loading them with data and writing Hive queries, which run internally as MapReduce jobs.
Environment: Cassandra, Spring 3.2, Spring Data, Pig, Hive, Apache Avro, MapReduce, Sqoop, ZooKeeper, SVN, Jenkins, Spark, HBase.
Confidential, New York, New York
SDET
Responsibilities:
- Responsible for setting up, implementing and administering Hadoop infrastructure.
- Analyzed technical and functional requirements documents and designed and developed QA test plans, test cases and test scenarios, maintaining the E2E process flow.
- Developed test scripts for an internal brokerage application used by branch and financial market representatives to recommend and manage customer portfolios, including international and capital markets.
- Designed and developed smoke and regression automation scripts and automated the functional testing framework for all modules using Selenium WebDriver.
- Created data-driven scripts for adding multiple customers, checking online accounts, user interface validations and report validations.
- Performed cross verification of trade entry between mainframe system, its web application and downstream system.
- Extensively used Selenium WebDriver API (XPath and CSS locators) to test the web application.
- Configured Selenium WebDriver, TestNG, Maven, Cucumber and the BDD framework and created Selenium automation scripts in Java using TestNG.
- Performed Data-Driven testing by developing Java based library to read test data from Excel & Properties files.
- Extensively performed DB2 database testing to validate the trade entry from mainframe to backend system.
- Developed a data-driven framework with Java, Selenium WebDriver and Apache POI, which is used for multiple trade order entry.
- Developed an internal application using AngularJS and Node.js connecting to Oracle on the backend.
- Debugged issues in the front end of the web-based application, which was developed using HTML5, CSS3, AngularJS, Node.js and Java.
- Developed a smoke automation test suite for the regression test suite.
- Applied various testing techniques in test cases to cover all business scenarios for quality coverage.
- Interacted with the development team to understand the design flow, review code and discuss the unit test plan.
- Executed system, integration and regression tests in the testing environment.
- Conducted defect triage meetings and defect root cause analysis, tracked defects in HP ALM Quality Center, managed defects by following up on open items, and retested defects with regression testing.
- Provided QA/UAT sign-off after closely reviewing all the test cases in Quality Center, along with receiving the policy sign-off for the project.
Environment: HP ALM, Selenium WebDriver, JUnit, Cucumber, AngularJS, Node.js, Jenkins, GitHub, Windows, UNIX, Agile, MS SQL, IBM DB2, PuTTY, WinSCP, FTP Server, Notepad++, C#, DB Visualizer.
