Hadoop Developer Resume
New York, NY
PROFESSIONAL SUMMARY:
- 8+ years of technology experience, including extensive work in Big Data and the Hadoop ecosystem. In-depth knowledge and hands-on experience with Apache Hadoop components such as HDFS, MapReduce, JobTracker, TaskTracker, NameNode, DataNode, Secondary NameNode, HiveQL, HBase, Pig, Hive, Sqoop, Oozie, Cassandra, Flume and Spark.
- Extensively worked on the MRv1 and MRv2 Hadoop architectures and wrote MapReduce programs and Pig and Hive scripts.
- Designed and created Hive external tables using a shared metastore instead of Derby, with dynamic partitioning and bucketing.
- Experienced in importing and exporting data between Relational Database Systems and HDFS using Sqoop.
- Extensively used Kafka to load the log data from multiple sources directly into HDFS.
- Knowledge of RabbitMQ. Loaded streaming log data from various web servers into HDFS using Flume.
- Experienced in building Pig scripts to extract, transform and load data into HDFS for processing.
- Excellent knowledge of data mapping and of extracting, transforming and loading data from different data sources.
- Experience in writing HiveQL queries to store processed data into Hive tables for analysis.
- Extended Hive and Pig core functionality by writing custom UDFs (see the UDF sketch following this summary).
- Excellent understanding and knowledge of NoSQL databases like HBase and Cassandra.
- Designed databases; created and managed schemas; wrote stored procedures, functions, DDL, DML and SQL queries; and performed data modeling.
- Extensive experience in ETL architecture, development, enhancement, maintenance, production support, data modeling, data profiling and reporting, including business and system requirements gathering.
- Hands-on experience in shell scripting. Knowledge of cloud services including Amazon Web Services (AWS) and Azure.
- Proficient in applying RDBMS concepts with Oracle, SQL Server and MySQL.
- Experienced in project life cycle (design, development, testing and implementation) of Client Server and Web applications.
- Experience in processing different file formats such as XML, JSON and sequence files.
- Good knowledge of AWS services such as EMR and EC2 for fast and efficient processing of Big Data.
- Good Experience in creating Business Intelligence solutions and designing ETL workflows using Tableau.
- Designed, deployed, maintained and led the implementation of cloud solutions using MS Azure and its underlying technologies. Implemented HA deployment models with Azure Classic and Azure Resource Manager, configured Azure Active Directory, and managed users and groups. Worked on Continuous Integration (CI)/Continuous Delivery (CD) pipelines for Azure Cloud Services using Chef. Migrated services from on-premises to Azure cloud environments.
- Collaborated with development and QA teams to maintain high-quality deployments.
- Designed client/server telemetry adopting the latest monitoring techniques.
- Configured Azure Traffic Manager to build routing for user traffic.
- Infrastructure migrations: drove operational efforts to migrate all legacy services to a fully virtualized infrastructure.
- Exhibited strong written and oral communication skills. Learn and adapt quickly to emerging technologies and paradigms.
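A minimal sketch of the kind of custom Hive UDF mentioned above; the class name and the trim/lower-case normalization logic are hypothetical, chosen only to illustrate how Hive's built-in functions can be extended:

```java
// Hypothetical Hive UDF that trims and lower-cases a string column.
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class NormalizeText extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null; // preserve NULL semantics
        }
        return new Text(input.toString().trim().toLowerCase());
    }
}
```

Such a UDF is typically packaged as a JAR, added to the Hive session with ADD JAR, and registered with CREATE TEMPORARY FUNCTION before being used in HiveQL.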
PROFESSIONAL EXPERIENCE:
Confidential, New York, NY
Hadoop Developer
Responsibilities:
- Worked on developing architecture documents and proper guidelines.
- Involved in all phases of the Software Development Life Cycle (SDLC) and worked on all activities related to the development, implementation and support of Hadoop.
- Installed and configured Apache Hadoop clusters for application development, along with Hadoop tools like Hive, Pig, HBase, ZooKeeper and Sqoop.
- Added new components and removed existing ones through Cloudera Manager.
- Played a key role in installation and configuration of the various Hadoop ecosystem tools such as Solr, Kafka, Pig, HBase and Cassandra.
- Implemented multiple MapReduce jobs in Java for data cleansing and pre-processing (see the mapper sketch following this list).
- Wrote complex Hive queries and UDFs in Java and Python.
- Involved in implementing HDInsight version 3.3 clusters, which are based on Spark version 1.5.1.
- Responsible for data extraction and ingestion from different data sources into the Hadoop Data Lake by creating ETL pipelines using Pig and Hive.
- Worked with the systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters; experienced in converting MapReduce applications to Spark.
- Job duties involved the design and development of various modules in the Hadoop Big Data platform and processing data using MapReduce, Hive, Pig, Sqoop and Oozie.
- Installed the Oozie workflow engine to run multiple MapReduce, HiveQL and Pig jobs.
- Ingested large volumes of data into HDFS using Apache Kafka (see the producer sketch following this list).
- Collected log data from web servers and integrated it into HDFS using Flume.
- Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts; experienced in managing and reviewing Hadoop log files.
- Worked on Spring RESTful services with dependency injection.
- Developed and retrieved NoSQL data in MongoDB using DAOs.
- Implemented test scripts to support test driven development and continuous integration.
- Performed data analytics and loaded data to Amazon S3 / the data lake / the Spark cluster.
- Wrote and built Azkaban workflow jobs to automate the process.
- Developed Spark SQL tables and queries to perform ad-hoc data analytics for the analyst team.
- Deployed components using the Maven build system and Docker images.
- Involved in deploying multi-module Azkaban applications using Maven.
- Played an important role in migrating jobs from Spark 0.9 to 1.4 and then to 1.6.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Developed multiple POCs using Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL/Teradata.
- Analyzed the SQL scripts and designed the solution to implement them using Scala.
- Developed analytical components using Scala, Spark and Spark Streaming.
- Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
- Involved in the migration from Livelink to SharePoint using Scala through RESTful web services.
- Extensively involved in developing RESTful APIs using the JSON library of the Play framework.
- Used the Scala collection framework to store and process complex consumer information.
- Used Scala functional programming concepts to develop business logic.
- Designed and implemented Apache Spark Application (Cloudera).
- Worked with cloud services like Amazon Web Services (AWS) and was involved in ETL, data integration and migration.
- Analyzed large amounts of data sets to determine optimal way to aggregate and report on it.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala and Python.
- Implemented best income logic using Pig scripts and UDFs.
- Moderated and contributed to the support forums (specific to Azure Networking, Azure Virtual Machines, Azure Active Directory and Azure Storage) for the Microsoft Developer Network, including Partners and MVPs.
- Built a prototype Azure Data Lake application that accesses 3rd party data services via Web Services. The solution dynamically scales, automatically adding/removing cloud-based compute, storage and network resources based upon changing workloads.
- Worked with Azure ExpressRoute to create private connections between Azure datacenters and on-premises and colocation infrastructure.
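A minimal sketch of a map-only data-cleansing job of the kind referenced in the MapReduce bullet above; the comma-separated record layout and field count are assumptions made only for illustration:

```java
// Hypothetical map-only cleansing mapper: drops malformed records and
// trims whitespace before writing cleaned lines back to HDFS.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CleansingMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private static final int EXPECTED_FIELDS = 5; // assumed record width

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",", -1);
        if (fields.length != EXPECTED_FIELDS) {
            return; // skip malformed records
        }
        StringBuilder cleaned = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                cleaned.append(',');
            }
            cleaned.append(fields[i].trim());
        }
        context.write(NullWritable.get(), new Text(cleaned.toString()));
    }
}
```

The driver would set the number of reducers to zero so the cleaned output is written directly by the mappers.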
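A sketch of a simple Kafka producer of the kind used for the log-ingestion bullets above; the broker address, topic name and file-based input are placeholders:

```java
// Hypothetical producer that publishes web-server log lines to a Kafka
// topic; a downstream consumer or connector lands them in HDFS.
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (String line : Files.readAllLines(Paths.get(args[0]))) {
                producer.send(new ProducerRecord<>("weblogs", line)); // hypothetical topic
            }
        }
    }
}
```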
Environment: Spark, Shark, Kafka, Cloudera, AWS, HDFS, ZooKeeper, Hive, Pig, Oozie, Core Java, Eclipse, HBase, Sqoop
Confidential, New York, NY
Hadoop Developer
Responsibilities:
- Extensively involved in the installation and configuration of the Cloudera Distribution of Hadoop (CDH) platform.
- Extracted, transformed and loaded (ETL) data from multiple federated data sources (JSON, relational databases, etc.) with DataFrames in Spark (see the DataFrame sketch following this list).
- Utilized Spark SQL to extract and process data by parsing it with Datasets or RDDs in HiveContext, applying transformations and actions (map, flatMap, filter, reduce, reduceByKey).
- Extended the capabilities of DataFrames using user-defined functions in Python and Scala.
- Resolved missing fields in DataFrame rows using filtering and imputation.
- Involved in Agile methodologies, daily scrum meetings and sprint planning.
- Integrated visualizations into a Spark application using Databricks and popular visualization libraries (ggplot, matplotlib).
- Implemented discretization and binning, and performed data wrangling: cleaning, transforming, merging and reshaping data frames using Python. Processed incoming files in near real time, within a few seconds.
- Created EC2 instances and implemented large multi-node Hadoop clusters in the AWS cloud from scratch.
- Configured AWS IAM and Security Groups.
- Responsible for implementing Kerberos: creating service principals, user accounts and keytabs, and syncing with AD.
- Developed Terraform templates to deploy Cloudera Manager on AWS.
- Configured various notifications on AWS services.
- Hands on experience in managing and monitoring the Hadoop cluster using Cloudera Manager.
- Installed and configured the Hadoop cluster using Puppet.
- Wrote an MR2 batch job to fetch the required data from the DB and store it in CSV (a static file).
- Automated workflows using shell scripting to schedule Spark jobs via crontab.
- Developed data pipeline using Flume, Sqoop to ingest customer behavioral data and purchase histories into HDFS for analysis.
- Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
- Used HiveQL for data analysis, such as creating tables and importing structured data into the specified tables for reporting.
- Used Pig to perform data validation on the data ingested using Sqoop and Flume, and pushed the cleansed data set into HBase.
- Participated in the development/implementation of the Cloudera Hadoop environment.
- Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
- Worked with Zookeeper, Oozie, and Data Pipeline Operational Services for coordinating the cluster and scheduling workflows.
- Designed and built the Reporting Application, which uses the Spark SQL to fetch and generate reports on HBase table data.
- Developed Spark code and Spark SQL/Streaming for faster testing and processing of data using the Lambda Architecture.
- Developed a data pipeline using Kafka and Storm to store data in HDFS.
- Performed real time analysis on the incoming data.
- Configured, deployed and maintained multi-node Dev and Test Kafka clusters.
- Performed transformations, cleaning and filtering on imported data using Hive and MapReduce, and loaded the final data into HDFS.
- Loaded data into Spark RDDs and performed in-memory data computation to generate the output response.
- Loaded data into HBase using both bulk and non-bulk loads (see the HBase sketch following this list).
- Created HBase column families to store various data types coming from various sources.
- Loaded data into the cluster from dynamically generated files.
- Assisted in upgrading, configuring and maintaining various Hadoop infrastructures.
- Created common audit and error-logging processes, along with a job monitoring and reporting mechanism.
- Troubleshot performance issues through ETL/SQL tuning.
- Installed the Oozie workflow engine to run multiple MapReduce, HiveQL and Pig jobs.
- Ingested large volumes of data into HDFS using Apache Kafka.
- Collected log data from web servers and integrated it into HDFS using Flume.
- Worked with Azure Migrate, Azure Site Recovery, Azure Database Migration Service, App Services (Web Apps), Virtual Machines, Virtual Networks, Network Security Groups (NSGs) and Active Directory.
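A minimal sketch, using Spark's Java API, of the DataFrame ETL and user-defined-function work described above (the resume's UDFs were written in Python and Scala; the input path, column names and normalization logic here are hypothetical):

```java
// Hypothetical DataFrame ETL: read JSON, register a UDF, filter rows with
// missing keys, and write the curated result as Parquet.
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class CustomerEtl {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("CustomerEtl").getOrCreate();

        // UDF that normalizes a country-code column (illustrative logic only).
        spark.udf().register("normalize_country",
                (UDF1<String, String>) c -> c == null ? "UNKNOWN" : c.trim().toUpperCase(),
                DataTypes.StringType);

        Dataset<Row> raw = spark.read().json("/data/raw/customers"); // placeholder path

        Dataset<Row> cleaned = raw
                .filter(col("customer_id").isNotNull()) // drop rows missing the key
                .withColumn("country", callUDF("normalize_country", col("country")));

        cleaned.write().mode("overwrite").parquet("/data/curated/customers");
        spark.stop();
    }
}
```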
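A sketch of the non-bulk (client API) path for the HBase loading bullet above; the table name, column family and qualifiers are hypothetical, and a true bulk load would instead generate HFiles and hand them to the bulk-load tool:

```java
// Hypothetical client-API load of one row into an HBase table with a
// single column family named "cf".
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLoader {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("customer_events"))) { // hypothetical table
            Put put = new Put(Bytes.toBytes("row-0001"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("event_type"), Bytes.toBytes("purchase"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("amount"), Bytes.toBytes("42.50"));
            table.put(put);
        }
    }
}
```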
Environment: Hadoop, Microsoft Azure, HDFS, Pig, Sqoop, Shell Scripting, Ubuntu, Linux Red Hat, Spark, Scala, Hortonworks, Cloudera Manager, Apache Yarn, Python, Azure
Confidential, Boston, Colorado
Hadoop Developer
Responsibilities:
- Launched Amazon EC2 cloud instances using Amazon Web Services (Linux/Ubuntu/RHEL).
- Designed and deployed Hadoop cluster that can scale to petabytes of data.
- Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and Spark.
- Loaded data into Spark RDDs and performed in-memory data computation to get faster output responses; implemented Spark SQL queries on data formats such as text, CSV and XML files (see the Spark SQL sketch following this list).
- Created SparkR DataFrames and performed DataFrame operations.
- Installed RStudio Server on top of the master node and integrated RStudio with an existing Cloudera cluster.
- Deployed data using Standalone, YARN and Cloudera clusters.
- Ran R modules via built-in functions on the Spark cluster to load particular modules for computation.
- Provided database interface using JDBC drivers with back-end as Oracle DB.
- Worked on analyzing Hadoop cluster and different big data analytic tools including HBase database and Sqoop.
- Developed Spark jobs and Hive Jobs to summarize and transform data.
- Developed Spark scripts by writing custom RDDs in Python for data transformations and performed actions on RDDs.
- Worked on the Oozie workflow engine for job scheduling; imported and exported data into MapReduce and Hive using Sqoop.
- Developed Sqoop scripts to import and export data from relational sources, and handled incremental loading of customer data by date.
- Experience in importing real-time data into Hadoop using Kafka and implementing the Oozie job for daily imports.
- Helped with the sizing and performance tuning of the Cassandra cluster.
- Developed Hive queries to process the data and generate the results in a tabular format.
- Handled importing of data from multiple data sources using Sqoop, performed transformations using Hive, MapReduce and loaded data into HDFS.
- Involved in loading and transforming large sets of structured, semi-structured and unstructured data from relational databases into HDFS using Sqoop imports. Experienced in migrating HiveQL to Impala to minimize query response time.
- Worked with different teams to install operating system, Hadoop updates, patches, version upgrades of Hortonworks as required.
- Involved in importing the real-time data to Hadoop using Kafka and implemented the Oozie job for daily imports.
- Responsible for migrating the code base from the Cloudera platform to Amazon EMR and evaluated Amazon ecosystem components like Redshift.
- Work experience as a member of the AWS team.
- Worked with management frameworks and cloud administration tools.
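A minimal sketch, using Spark's Java API, of loading a CSV file into a DataFrame and running a Spark SQL query over it, as described in the bullet above; the file path, column names and query are hypothetical:

```java
// Hypothetical Spark SQL job: load a CSV into a DataFrame, register a
// temporary view and run an aggregation over it.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SalesReport {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("SalesReport").getOrCreate();

        Dataset<Row> sales = spark.read()
                .option("header", "true")     // first line holds column names
                .option("inferSchema", "true")
                .csv("/data/sales.csv");      // placeholder path

        sales.createOrReplaceTempView("sales");

        Dataset<Row> byRegion = spark.sql(
                "SELECT region, SUM(amount) AS total FROM sales GROUP BY region");
        byRegion.show();

        spark.stop();
    }
}
```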
Environment: Linux, Hadoop 2, Python, CDH 5.12.1, SQL, Sqoop, HBase, Hive, Spark, Oozie, Cloudera Manager, Oracle 11.2.0.3, Windows, YARN, Spring, Shell Scripting, Sentry, AWS, Cassandra, Cloud Technologies
Confidential, New York, NY
Java Developer
Responsibilities:
- Responsible for the implementation and ongoing administration of the Hadoop infrastructure, including its initial setup.
- Involved in the requirements phase to understand the application impact and assisted system analysts in gathering inputs for the preparation of the Functional Specification Document.
- Responsible for creating and maintaining the architecture for RESTful APIs using Spring Boot (see the controller sketch following this list).
- Worked extensively on Spring Boot for building web services and integrated Apache Camel (ESB) with Spring Boot.
- Implemented RESTful web services using Spring Boot.
- Documented how Spring Batch is useful for the current project.
- Worked with Java, J2EE, Struts, web services and Hibernate in a fast-paced development environment.
- Followed agile methodology: interacted directly with the client to provide/take feedback on features, suggest/implement optimal solutions, and tailor the application to customer needs.
- Rich experience in database design and hands-on experience with large database systems in Oracle 11g.
- Involved in design and implementation of web tier using Servlets and JSP.
- Developed the user interface using JSP and JavaScript, and used the Gherkin language to write features and scenarios.
- Set up and supported automated Continuous Integration utilizing tools like Jenkins, shell scripts and the AWS CLI/API.
- Identified improvements to enhance CI/CD.
- Worked on Amazon Web Services (AWS) infrastructure with automation and configuration management tools.
- Good knowledge of Amazon Web Services (AWS) and Amazon cloud services like EC2.
- Designed roles and groups for users and resources using AWS Identity and Access Management (IAM).
- Developed the company's internal CI system, providing a comprehensive API for CI/CD.
- Designed and developed Data Access Objects (DAO) to access the database.
- Coded Java Server Pages for dynamic front-end content that uses Servlets and EJBs.
- Coded HTML pages using CSS for static content generation, with JavaScript for validations.
- Used the JDBC API to connect to the database and carry out database operations.
- Involved in writing test scripts using Java and executed them through Selenium and Cucumber.
- Triggered automation jobs using Jenkins to get the Cucumber JSON reports.
- Performed code reviews, unit testing, system testing and integration testing.
- Created test cases using element locators and Selenium WebDriver methods.
- Deployed the application on Dev and Prod servers; led the team of developers through the construction, development and testing phases.
- Analyzed and made code changes to Mainframe components COBOL, JCL and DB2 applications.
- Actively participated in client meetings and interpreted requirements for offshore team members.
- Led and guided the teams to make sure all deliverables were on target from an inventory perspective.
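A minimal sketch of a Spring Boot RESTful endpoint of the kind described above; the resource name, URL path and payload fields are hypothetical:

```java
// Hypothetical Spring Boot application exposing a read-only customer endpoint.
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@SpringBootApplication
public class CustomerServiceApplication {
    public static void main(String[] args) {
        SpringApplication.run(CustomerServiceApplication.class, args);
    }
}

@RestController
@RequestMapping("/api/customers")
class CustomerController {

    // GET /api/customers/{id} returns a simple JSON payload.
    @GetMapping("/{id}")
    public Customer getCustomer(@PathVariable String id) {
        return new Customer(id, "Sample Customer"); // placeholder data
    }
}

class Customer {
    private final String id;
    private final String name;

    Customer(String id, String name) {
        this.id = id;
        this.name = name;
    }

    public String getId() { return id; }
    public String getName() { return name; }
}
```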
Environment: HP ALM, Selenium WebDriver, JUnit, Cucumber, AngularJS, Node.js, Jenkins, GitHub, Windows, UNIX, Agile, MS SQL, IBM DB2, PuTTY, WinSCP, FTP Server, Notepad++, C#, DB Visualizer