
Hadoop Developer Resume

Charlotte, NC

SUMMARY

  • 6+ years of overall IT industry and software development experience, including more than 5 years of Hadoop development
  • Worked on components of CDH and HDP including HDFS, MapReduce, JobTracker, TaskTracker, Sqoop, ZooKeeper, YARN, Oozie, Hive, Hue, Flume, HBase, Spark and Kafka
  • Deployed Hadoop clusters on public and private cloud environments like AWS and OpenStack
  • Good experience working with various data analytics and big data services in the AWS Cloud such as EC2, EMR, Redshift, S3, Athena and Glue
  • Worked extensively with Spark framework for building highly scalable and reliable data transformations on complex data pipelines
  • Experienced in developing production-ready Spark applications using the Spark RDD API, DataFrames, Spark SQL and the Spark Streaming API
  • Worked extensively on fine-tuning Spark applications to improve performance and on troubleshooting Spark application failures
  • Strong experience using Spark Streaming, Spark SQL and other Spark features such as accumulators, broadcast variables, the different caching levels and optimization techniques for Spark jobs
  • Experienced in automating the provisioning processes and system resources using Puppet
  • Imported and exported data with Sqoop between HDFS and RDBMSs such as MySQL, Oracle and Teradata, used fast loaders and connectors, and processed data for commercial analytics
  • Built an ingestion framework using Flume for streaming logs and aggregating the data into HDFS
  • Experienced in upgrading SQL Server software, patches and service packs
  • Practiced Agile scrum to provide operational support, installation updates, patches and version upgrades
  • Experienced in collaborative platforms including Jira, Rally, SharePoint and Discovery
  • Installed, monitored and performance tuned standalone multi-node clusters of Kafka
  • Successfully loaded files into Hive and HDFS from MongoDB, Cassandra and HBase
  • Experienced in understanding and managing Hadoop log files with Flume

PROFESSIONAL EXPERIENCE

Hadoop Developer

Confidential, Charlotte, NC

Responsibilities:

  • Created a process to pull the data from existing applications and land the data on Hadoop.
  • Worked in an agile environment, participating in sprint planning, grooming and daily standup meetings.
  • Responsible for meeting with application owners for defining/planning of Sqooping the data from source systems.
  • Used Sqoop to pull the data from source databases such as Teradata, DB2, and MS SQL server.
  • Created the Hive tables on top of the data extracted from Source system.
  • Partitioned the Hive tables depending on the load type.
  • Created objects in Scala that handle dynamic creation of tables in Hive and Oracle and can alter the tables on the fly.
  • Involved in creating Hive tables, loading them with data and writing Hive queries that run internally in MapReduce.
  • Created the Hive tables as internal or external per requirements, defined with appropriate static or dynamic partitions and bucketing for efficiency.
  • Loaded and transformed large sets of structured and semi-structured data using Hive.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames and saved it in Parquet format in HDFS.
  • Used Spark and Spark SQL to read the Parquet data and create the Hive tables using the Scala API.
  • Imported and exported data into HDFS and Hive using Sqoop.
  • Experienced in defining job flows.
  • Created AutoSys box jobs for new data pipelines, collecting and populating the insert job, box name, command, description and profile attributes for jobs that run at different time intervals.
  • Knowledge in performance troubleshooting and tuning Hadoop clusters.
  • Experienced in managing and reviewing Hadoop log files.
  • Participated in development/implementation of Cloudera Hadoop environment.
  • Loaded and transformed large sets of structured, semi-structured and unstructured data.
  • Responsible for managing data coming from different sources.
  • Gained good experience with NoSQL databases.
  • Involved in loading data from the UNIX file system to HDFS.
  • Installed and configured Hive, and wrote Hive UDFs.
  • Worked on analyzing the Hadoop cluster and different big data analytical and processing tools including Hive, Sqoop, Python, Spark with Scala and Java, and Spark Streaming
  • Wrote Spark Streaming applications to consume data from Kafka topics, wrote the processed streams to HBase and streamed data using Spark with Kafka
  • Used the Spark API over Hadoop YARN as the execution engine for data analytics using Hive.
  • Worked on a large-scale EMR cluster for distributed data processing and analysis using Spark and Hive
  • Loaded datasets into Scality S3 from HDFS and NAS locations using Python Boto3
  • Ingested data into Scality S3 from AWS S3 on EC2 instances using Java Spring Boot and Python Boto3.
  • Created and organized buckets in Scality and AWS S3
  • Experienced with data warehousing systems using AWS Redshift and Oracle PL/SQL; wrote Lambda functions to read and mine the data in S3 using Scala
  • Involved in creating data-lake by extracting customer's data from various data sources to HDFS which include data from Excel, databases, and log data from servers
  • Developed Apache Spark applications by using Scala and Spark for data processing from various streaming sources
  • Used DRL files to write the business logic that validates the business rules, executing these files in Spark using SQLite
  • Used Scala to convert Hive/SQL queries into RDD transformations in Apache Spark
  • Implemented Spark solutions to generate reports, fetch and load data in Hive
  • Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system.
  • Used EC2, Auto Scaling and VPC to build secure, highly scalable and flexible systems that handled expected and unexpected load bursts
  • Set up and configured Cassandra, Spark and other relevant architecture components.
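The AutoSys box jobs above were defined by collecting a handful of attributes (insert job, box name, command, description, profile) per pipeline. A minimal sketch of rendering such a JIL definition in Python; every job name, command and path here is a hypothetical placeholder, not taken from the actual pipelines:

```python
# Sketch of generating an AutoSys JIL job definition from collected
# attributes. All job names, commands and profile paths below are
# hypothetical placeholders.

def make_jil(job_name, box_name, command, description, profile, start_times):
    """Render a JIL 'insert_job' block for a command job inside a box."""
    lines = [
        f"insert_job: {job_name}",
        "job_type: cmd",
        f"box_name: {box_name}",
        f"command: {command}",
        f'description: "{description}"',
        f"profile: {profile}",
        f'start_times: "{start_times}"',
    ]
    return "\n".join(lines)

jil = make_jil(
    job_name="ingest_customer_daily",        # hypothetical
    box_name="ingest_daily_box",             # hypothetical
    command="/opt/pipelines/run_ingest.sh",  # hypothetical
    description="Daily customer data ingest",
    profile="/opt/pipelines/env.profile",    # hypothetical
    start_times="02:00",
)
print(jil)
```

Generating the text from one function keeps the attribute set consistent across the many jobs a new pipeline needs.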

Environment: UNIX Shell, Pig, Hive, MapReduce, YARN, Spark 1.4.1, Eclipse, Core Java, JDK1.7, Oozie Workflows, AWS, S3, Redshift, DynamoDB, Athena, HBASE, SQOOP, Scala, Kafka, Python, Cassandra, maven.

Hadoop Developer

Confidential, New York, NY

Responsibilities:

  • Worked on installing the cluster, commissioning and decommissioning DataNodes, NameNode recovery, capacity planning and slot configuration.
  • Implemented test scripts to support test driven development and continuous integration.
  • Worked on tuning the performance of MapReduce jobs.
  • Implemented large-scale data ecosystems including data management, governance and the integration of structured and unstructured data to generate insights, leveraging cloud-based platforms
  • Implemented Spark RDD transformations and actions to implement business analysis.
  • Designed a data quality framework to perform schema validation and data profiling on PySpark
  • Leveraged PySpark to manipulate unstructured data and apply text mining to users' table-utilization data
  • Responsible for managing data coming from different sources.
  • Loaded and transformed large sets of structured, semi-structured and unstructured data
  • Experience in managing and reviewing Hadoop log files.
  • Submitted Spark jobs through microservices by creating a Spark JSON payload on the fly and providing it as input to Apache Livy.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Used predefined Pig functions to convert fixed-width files to delimited files.
  • Worked on tuning Hive and Pig to improve performance and solve performance issues in Hive and Pig scripts, with a good understanding of joins, grouping and aggregation and how they translate into MapReduce jobs
  • Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs
  • Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, manage and review data backups, manage and review Hadoop log files.
  • Created Oozie workflows to run multiple MR, Hive and pig jobs.
  • Supported in setting up QA environment and updating configurations for implementing scripts with Pig and Sqoop.
  • Developed Spark code using Scala and Spark SQL for faster testing and data processing
  • Involved in developing a Spark Streaming application for one of the data sources using Scala and Spark, applying the necessary transformations.
  • Imported data from different sources such as HDFS and HBase into Spark RDDs.
  • Experienced with SparkContext, Spark SQL, DataFrames, pair RDDs and Spark on YARN.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Involved in gathering requirements, design, development and testing in Scala.
  • Developed traits, case classes and similar constructs in Scala; experienced with integrating multiple concurrent data sources, including relational databases and flat files.
  • Experienced in designing, deploying, scheduling and supporting Informatica packages.
  • Experienced working with Informatica Repository, Designer, Workflow, Scheduling and Monitor.
  • Strong experience in Extraction, Transformation and Loading (ETL) of data from various sources into Data Warehouses and Data Marts using Informatica PowerCenter (Repository Manager, Designer, Workflow Manager, Workflow Monitor, Metadata Manager), PowerExchange and PowerConnect as ETL tools on Oracle, DB2 and SQL Server databases
  • Experience in resolving on-going maintenance issues and bug fixes; monitoring Informatica sessions as well as performance tuning of mappings and sessions.
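The fixed-width-to-delimited conversion above was done with predefined Pig functions; the core slicing logic behind that task can be sketched in plain Python. The column layout (10-char name, 2-char state, 5-char zip) is a hypothetical example, not the actual feed schema:

```python
# Sketch of converting a fixed-width record into a delimited one, the task
# the Pig functions above performed. The field widths are hypothetical.

def fixed_width_to_delimited(record, widths, delimiter="|"):
    """Slice a fixed-width record into fields and join them with a delimiter."""
    fields, pos = [], 0
    for w in widths:
        fields.append(record[pos:pos + w].strip())  # trim the pad spaces
        pos += w
    return delimiter.join(fields)

# Hypothetical layout: 10-char name, 2-char state, 5-char zip
line = "JOHN SMITH" + "NC" + "28202"
print(fixed_width_to_delimited(line, [10, 2, 5]))  # → JOHN SMITH|NC|28202
```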

Environment: Hadoop, HDFS, Pig, Sqoop, HBase, Shell Scripting, Ubuntu, Linux Red Hat, Spark, Scala

HADOOP DEVELOPER

Confidential, New York, NY

Responsibilities:

  • Developed, reviewed and updated architecture and process documentation, server diagrams, requisition documents and other technical documents
  • Involved in Agile methodology, attended daily scrum meetings and sprint planning meetings
  • Integrated visualizations into a Spark application using Databricks and visualization libraries (ggplot, Matplotlib)
  • Implemented different analytical algorithms using MapReduce programs to apply on top of HDFS data
  • Responsible for cluster maintenance, commissioning and decommissioning data nodes, cluster monitoring, troubleshooting, and managing and reviewing data backups and Hadoop log files
  • Skilled in Tableau Desktop for various types of data visualization, reporting and analysis including Cross Map, Scatter Plots, Geographic Map, Pie Charts and Bar Charts, Page Trails and Density Chart
  • Created HBase tables to store various data formats of data coming from MySQL, Oracle, Teradata
  • Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster
  • Configured Spark Streaming to receive ongoing data from Kafka and stored the stream data in HDFS
  • Developed data pipeline using Flume, Sqoop to ingest business data and purchase histories into HDFS for analysis
  • Utilized Spark SQL to extract and process data by parsing it with Datasets or RDDs in HiveContext, using transformations and actions (map, flatMap, filter, reduce, reduceByKey)
  • Extended the capabilities of Data Frames using User Defined Functions in Python and Scala
  • Resolved missing fields in Data Frame rows using filtering and imputation
  • Experience with data wrangling and creating workable datasets
  • Monitored systems and services through the Ambari dashboard to keep the clusters available for the business
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive
  • Developed job processing scripts using Oozie workflow
  • Installed GitLab as the code repository
  • Developed build and deployment scripts using ANT and MAVEN as build tools in Jenkins to move from one environment to other environments.
  • Installed and configured Nagios to constantly monitor network bandwidth, memory usage, and hard drive status.
  • Created and Automated ingest mechanism using shell script for various data sources with validation
  • Created Parquet tables and compressed using Snappy compression
  • Responsible for installation and configuration of Jenkins to support Java builds and Jenkins Plugins to automate continuous builds and publishing Docker images to Docker Repository.
  • Automated the CI/CD pipeline using Jenkins and Groovy scripts and set up email notifications on build status.
  • Extensively supported environment issues that occur on a day-to-day basis.
  • Responsible for maintaining Continuous Integration (CI) environments with build automation tools like Jenkins, with automated master-slave configuration so that whenever builds are triggered, slaves are picked automatically for the builds.
  • Implemented the Parquet tables with Snappy compression for the settlement feed, which saved space and improved query performance.
  • Developed and reviewed test scripts based on test cases, documented results and communicated them to stakeholders.
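The automated ingest mechanism above validated each data source before loading. A minimal sketch of that pre-ingest validation idea in Python; the control convention of an expected record count supplied alongside the feed is an assumed detail, not taken from the original shell scripts:

```python
# Sketch of pre-ingest validation as described above: reject a feed that is
# empty, has a record-count mismatch, or contains malformed rows, before it
# is loaded into HDFS. The expected-count convention is an assumption.

def validate_feed(lines, expected_count, min_fields, delimiter="|"):
    """Return (ok, reason) for a list of delimited records."""
    if not lines:
        return False, "empty feed"
    if len(lines) != expected_count:
        return False, f"expected {expected_count} records, got {len(lines)}"
    for i, line in enumerate(lines, 1):
        if len(line.split(delimiter)) < min_fields:
            return False, f"record {i} has too few fields"
    return True, "ok"

rows = ["1|alice|NY", "2|bob|NC"]
print(validate_feed(rows, expected_count=2, min_fields=3))     # → (True, 'ok')
print(validate_feed(rows, expected_count=3, min_fields=3)[0])  # → False
```

Running checks like these before the load keeps bad feeds out of the cluster instead of surfacing as downstream query errors.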

Environment: Apache Hadoop 3.0, Hive 0.10, Sqoop 1.4.3, Flume, MapReduce, HDFS, LINUX, Oozie, Cassandra, Tez, Hue, HCatalog, Java.

Hadoop Developer

Confidential, New York, NY

Responsibilities:

  • Developed MapReduce, Pig and Hive scripts to cleanse, validate and transform data
  • Implemented MapReduce programs to handle semi-structured and unstructured data such as XML, JSON, Avro, Parquet and sequence files for log data
  • Implemented Spark RDD transformations and actions to implement business analysis.
  • Designed a data quality framework to perform schema validation and data profiling on PySpark
  • Leveraged PySpark to manipulate unstructured data and apply text mining to users' table-utilization data
  • Worked on creating and optimizing Hive scripts for data analysts based on business requirements
  • Created Hive UDFs to encapsulate complex and reusable logic for the end users
  • Developed predictive analytics using the Apache Spark Scala API
  • Experienced in migrating HiveQL to Impala to minimize query response time
  • Experienced with different kinds of compression techniques such as LZO, Snappy, Bzip2 and Gzip to save space and optimize data transfer over the network, using the Avro, Parquet and ORC file formats
  • Configured, deployed and maintained multi-node Dev and Test Kafka Clusters
  • Implemented data ingestion systems by creating Kafka brokers, Java producers, consumers and custom encoders.
  • Implemented Partitioning, Dynamic Partitions and Bucketing in Hive for efficient data access.
  • Developed Spark code using Scala and Spark-SQL Streaming for faster testing and processing of data.
  • Developed a data pipeline using Kafka and Storm to store data into HDFS
  • Developed some utility helper classes to get data from HBase tables
  • Knowledge of Spark Core, Streaming, DataFrames and SQL, MLlib and GraphX
  • Implemented caching for Spark transformations and actions so results could be reused
  • Extracted files from Cassandra through Sqoop, placed them in HDFS and processed them
  • Used Maven to build and deploy the Jars for MapReduce, Pig and Hive UDFs
  • Extensively used the Hue browser for interacting with Hadoop components
  • Monitored workload, job performance and capacity planning using Cloudera Manager.
  • Provided cluster coordination services through ZooKeeper
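The partitioning and bucketing above make Hive data access efficient by routing each row to a bucket file via a hash of the clustering column modulo the bucket count, so key lookups and joins touch only one bucket. The idea in miniature, noting that Hive's actual hash function differs from this sketch (integer keys are used here so the result is stable):

```python
# Miniature illustration of Hive-style bucketing: rows are assigned to a
# fixed number of buckets by hash(cluster_key) % num_buckets. Hive's real
# hash implementation differs; integer keys keep this sketch deterministic.

from collections import defaultdict

def bucket_rows(rows, key_index, num_buckets):
    """Group rows into buckets by hashing the clustering column."""
    buckets = defaultdict(list)
    for row in rows:
        # For integers the hash is the value itself, so this is key % buckets.
        buckets[row[key_index] % num_buckets].append(row)
    return dict(buckets)

rows = [(101, "a"), (102, "b"), (103, "c"), (105, "d")]
buckets = bucket_rows(rows, key_index=0, num_buckets=4)
print(sorted(buckets))  # → [1, 2, 3]
```

Keys 101 and 105 land in the same bucket, which is exactly why bucketed map-side joins on the clustering key can skip the other bucket files.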

Environment: Linux (CentOS, RedHat), UNIX Shell, Pig, Hive, MapReduce, YARN, Spark 1.4.1, Eclipse, Core Java, JDK1.7, Oozie Workflows, AWS, S3, EMR, Cloudera, HBASE, SQOOP, Scala, Kafka, Python, Cassandra, maven, Cloudera Manager

SDET

Confidential, New York, NY

Responsibilities:

  • Launched Amazon EC2 Instances using AWS (Linux/ Ubuntu/RHEL) and configured instances
  • Conducted functional testing and regression testing using Java-Selenium WebDriver, with data-driven and keyword-driven frameworks built on the Page Factory and Page Object models
  • Experience in Selenium Grid for cross-platform, cross-browser and parallel tests using TestNG and Maven
  • Used Jenkins to execute the test scripts periodically on Selenium Grid for different platforms
  • Wrote test cases and conducted sanity, regression, integration, unit test, black-box and white-box tests
  • Integrated Jenkins with Git version control to schedule automatic builds using predefined maven commands
  • Developed BDD framework from scratch using Cucumber and defined steps, scenarios and features
  • Utilized the Apache POI jar to read test data from Excel spreadsheets and load it into test cases
  • Administered and Engineered Jenkins for managing weekly Build, Test, and Deploy chain, SVN/GIT with Dev/Test/Prod Branching Model for weekly releases
  • Handled Selenium Synchronization problems using Explicit & Implicit waits during regression testing
  • Experienced in writing complex and dynamic XPaths
  • Experienced in SOAP UI and REST with Postman
  • Used runner classes in Cucumber to generate step definitions and used tags to run different kinds of test suites such as smoke, health check and regression.
  • Created profiles in maven to launch specific TestNG suite from Jenkins job
  • Used the Groovy language to verify web services through SoapUI.
  • Experienced in testing on the Sauce Labs cloud platform
  • Practiced Agile Scrum and shared daily status reports with the team, team leads, managers and other stakeholders
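The Selenium synchronization fixes above rely on explicit waits, which poll a condition until a timeout expires instead of sleeping a fixed interval. The pattern behind Selenium's WebDriverWait can be sketched without a browser; the simulated "element" here is purely illustrative:

```python
# Sketch of the explicit-wait pattern used above to fix synchronization
# problems: poll a condition until it returns a truthy value or a timeout
# expires. Selenium's WebDriverWait works on the same principle; the
# element-presence condition below is a stand-in, not real Selenium code.

import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` until truthy; raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError(f"condition not met within {timeout}s")

# Usage: simulate an element that "appears" on the third poll.
state = {"calls": 0}
def element_present():
    state["calls"] += 1
    return "element" if state["calls"] >= 3 else None

found = wait_until(element_present, timeout=5, poll=0.01)
print(found)  # → element
```

Compared with an implicit wait or a fixed sleep, this returns as soon as the condition holds, which keeps regression suites both stable and fast.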

Environment: Selenium IDE, Groovy, Selenium RC, WebDriver, Cucumber, HPQC, MyEclipse, JIRA, MySQL, Oracle, Java, JavaScript, .NET, Python, Microservices, RESTful API Testing, JMeter, VBScript, JUnit, TestNG, Firebug, XPath, Windows
