Senior Hadoop Developer Resume
Columbus, OH
SUMMARY:
- Around 9 years of experience in Hadoop/Big Data technologies such as Pig, Hive, HBase, Oozie, ZooKeeper, Sqoop, Storm, Flink, Flume, Impala, Tez, Kafka, and Spark, with hands-on experience writing MapReduce/YARN and Spark/Scala jobs.
- Solid IT experience with special emphasis on analysis, design, development, and testing of ETL methodologies across all phases of data warehousing.
- Expertise in OLTP/OLAP system study, analysis, and E-R modeling, developing database schemas such as star and snowflake schemas used in relational and dimensional modeling.
- Experience in optimizing and performance-tuning mappings and implementing complex business rules by creating reusable transformations, mapplets, and tasks.
- Worked on creating projections such as query-specific projections, pre-join projections, and live aggregate projections.
- Responsible for developing data pipelines using Flume, Sqoop, and Pig to extract data from weblogs and store it in HDFS.
- Queried Vertica and SQL Server for data validation and developed validation worksheets in Excel to validate Tableau dashboards.
- Used various versions of Hive across multiple projects; implemented UDFs and UDAFs in addition to standard queries, and migrated Hive tables and underlying data from Cloudera CDH to Hortonworks HDP.
- Developed Spark code using Scala and Spark-SQL for faster testing and data processing.
- Extensively used SQL and PL/SQL for development of Procedures, Functions, Packages and Triggers.
- Experienced with Tableau Desktop and Tableau Server, with a good understanding of Tableau architecture.
- Experienced in integrating Kafka with Spark Streaming for high-speed data processing (see the sketch following this summary).
- Experience implementing AWS solutions using EC2 and S3, as well as Azure Storage.
- Experienced in developing business reports by writing complex SQL queries using views, macros, volatile and global temporary tables.
- Worked with the AWS team in testing an Apache Spark ETL application on EMR/EC2 using S3.
- Experience in designing both time driven and data driven automated workflows using Oozie.
- Experienced with workflow schedulers and data architecture, including data ingestion pipeline design and data modeling.
- Configured Elasticsearch on Amazon Web Services with static IP authentication security features.
- Experience with the AWS cloud platform and its features, including EC2, AMI, EBS, CloudWatch, AWS Config, Auto Scaling, IAM user management, and AWS S3.
- Managed AWS EC2 instances utilizing Auto Scaling, Elastic Load Balancing and Glacier for our QA and UAT environments as well as infrastructure servers for GI.
- Solved performance issues in Hive and Pig scripts with an understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.
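As an illustration of the Kafka and Spark Streaming integration noted above, here is a minimal sketch of a PySpark Structured Streaming job that consumes a Kafka topic. The broker address, topic name, and console sink are illustrative assumptions rather than details of any specific engagement, and the spark-sql-kafka connector package is assumed to be available at submit time.

```python
# Minimal sketch of Kafka + Spark Structured Streaming integration (PySpark).
# The broker address, topic name, and sink below are placeholder assumptions;
# the spark-sql-kafka connector package must be on the classpath at submit time.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder
    .appName("kafka-stream-sketch")
    .getOrCreate()
)

# Subscribe to a Kafka topic; Kafka delivers key/value as binary.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "events")                         # placeholder topic
    .load()
)

# Cast the payload to a string so downstream stages can parse it.
events = raw.select(col("value").cast("string").alias("payload"))

# Write to the console sink for demonstration; a real pipeline would land
# the stream in HDFS, Hive, or HBase instead.
query = (
    events.writeStream
    .format("console")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```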
TECHNICAL SKILLS:
Big Data Technologies: Hadoop, Spark, Kafka, Flume, HDFS, Hive, Impala, MapReduce, Sqoop, Oozie.
Distribution: Cloudera, Hortonworks
Programming Languages: Python, Scala and Java
Web Technologies: HTML, J2EE, CSS, JavaScript, Servlets, JSP, XML, AWS, EC2, S3
Databases: DB2, MySQL, HBase, Cassandra
DB Languages: SQL, PL/SQL.
Operating Systems: Linux, UNIX, Windows
IDE/Testing Tools: Eclipse, IntelliJ, PyCharm
PROFESSIONAL EXPERIENCE:
Senior Hadoop Developer
Confidential, Columbus, OH
Responsibilities:
- Designed data ingestion and integration processes using Sqoop, shell scripts, and Pig, together with Hive.
- Added and decommissioned Hadoop cluster nodes, including balancing HDFS block data.
- Implemented the Fair Scheduler on the ResourceManager to share cluster resources among users' MRv2 jobs.
- Worked with the systems engineering team to propose and deploy new hardware and software environments required for Hadoop and to expand existing environments.
- Performed investigation and migration from MRv1 to MRv2.
- Developed PySpark code to read data from Hive, group the fields, and generate XML files; enhanced the code to write the generated XML files to a directory and zip them into CDAs (see the sketch following this project entry).
- Worked with Big Data Analysts, Designers and Scientists in troubleshooting MRv1/MRv2 job failures and issues with Hive, Pig, Flume, and Apache Spark.
- Utilized Apache Spark for Interactive Data Mining and Data Processing.
- Buffered incoming load with Apache Kafka, a fast, scalable, fault-tolerant system, before the data was analyzed.
- Analyzed SQL scripts and designed the solution to be implemented using PySpark.
- Configured Sqoop to import and export data between HDFS and RDBMS.
- Handled data exchange between HDFS, web applications, and databases using Flume and Sqoop.
- Used Hive, created Hive tables, and was involved in data loading.
- Extensively involved in querying using Hive and Pig.
- Developed an open-source Impala/Hive Liquibase plug-in for schema migration in CI/CD pipelines.
- Involved in writing custom UDFs to extend Pig core functionality.
- Involved in writing custom MapReduce jobs using the Java API.
- Familiarity with NoSQL databases including HBase and Cassandra.
- Implemented the Cassandra connection with Resilient Distributed Datasets (RDDs).
- Designed and developed a Java API (Commerce API) that provides functionality to connect to Cassandra through Java services.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Set up automated processes to analyze system and Hadoop log files for predefined errors and send alerts to the appropriate groups.
- Set up automated processes to archive/clean unwanted data on the cluster, on the NameNode and Standby node.
- Created Gradle and Maven builds to build and deploy Spring Boot microservices to the internal enterprise Docker registry.
- Involved in analyzing system failures, identifying root causes, and recommending courses of action.
- Documented system processes and procedures for future reference.
- Supported technical team members in management and review of Hadoop log files and data backups.
- Designed target tables per the reporting team's requirements and designed Extraction, Transformation, and Loading (ETL) processes using Talend.
- Implemented File Transfer Protocol operations using Talend Studio to transfer files in between network folders.
- Participated in development and execution of system and disaster recovery processes.
- Experience with AWS cloud services such as EC2, ELB, RDS, ElastiCache, Route 53, and EMR.
- Hands-on experience in cloud configuration for Amazon Web Services (AWS) and with container technologies such as Docker, embedding containers in existing CI/CD pipelines.
- Set up an independent testing lifecycle for CI/CD scripts with Vagrant and VirtualBox.
Environment: Hadoop, MapReduce2, Hive, Pig, HDFS, Sqoop, Oozie, Microservices, Talend, PySpark, CDH, Flume, Kafka, Spark, HBase, ZooKeeper, Impala, LDAP, NoSQL, MySQL, Infobright, Linux, AWS, Ansible, Puppet, Chef.
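A simplified sketch of the PySpark Hive-to-XML step described in this project follows. The database, table, and column names and the output directory are hypothetical placeholders, and the XML element layout is illustrative rather than the actual CDA format.

```python
# Simplified sketch of the PySpark Hive-to-XML step described above.
# Table name, grouping column, and output directory are hypothetical; the
# element names are illustrative and not the actual CDA layout.
import os
import xml.etree.ElementTree as ET
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-to-xml-sketch")
    .enableHiveSupport()   # required so spark.table() can see the Hive metastore
    .getOrCreate()
)

OUTPUT_DIR = "/tmp/xml_out"   # placeholder output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Read a Hive table and group rows by a key column (both names are placeholders).
df = spark.table("claims_db.claims")
grouped = df.groupBy("member_id").agg({"claim_amount": "sum"})

# Build one small XML document per group on the driver; fine for modest result
# sizes, while very large outputs would need a distributed XML writer instead.
for row in grouped.collect():
    root = ET.Element("member")
    ET.SubElement(root, "id").text = str(row["member_id"])
    ET.SubElement(root, "totalClaimAmount").text = str(row["sum(claim_amount)"])
    ET.ElementTree(root).write(os.path.join(OUTPUT_DIR, f"{row['member_id']}.xml"))

spark.stop()
```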
Hadoop/Spark/Big Data Consultant
Confidential, Des Moines, IA
Responsibilities:
- Performed performance optimizations on Spark/Scala jobs.
- Used Spark as an ETL tool.
- Implemented best offer logic using Pig scripts and Pig UDFs.
- Analyzed large volumes of structured data using Spark SQL.
- Responsible for loading data from external systems and parsing and cleaning data for data scientists.
- Created Docker images for Spark and Postgres.
- Worked on analyzing the Hadoop cluster and various big data analytics tools, including Hive, Spark, Python, Sqoop, Flume, and Oozie.
- Avoided MapReduce by using PySpark, boosting performance roughly 3x (see the sketch following this project entry).
- Used Spark SQL to access Hive tables for analytics and fast data processing.
- Imported data from a Postgres database into Hive using Sqoop with optimized techniques.
- Developed a Cassandra application to ingest log and time-series data.
- Developed a Spark Streaming application to process real-time events.
- Researched customer needs and developed applications accordingly.
- Developed microservices using Spring Boot to interact with MongoDB for storing analytical configurations.
- Built a Cassandra cluster in the AWS environment.
- Developed scripts and batch jobs to schedule various Hadoop programs.
- Fine-tuned Cassandra and Spark clusters and Hive queries.
- Traveled to customer sites and identified current drilling issues.
- Responsible for creating and maintaining the microservices, Postgres, and RabbitMQ services in cloud environments (GE Predix, AWS, and Azure).
- Migrated the required data from Oracle and MySQL into HDFS using Sqoop and imported flat files of various formats into HDFS.
- Developed Spark scripts by using Scala as per the requirement.
- Involved in business requirement gathering, analysis and preparing design documents.
- Developed MapReduce jobs to ingest data into HBase and index it into Solr.
- Involved in Solr collection preparation and schema creation.
- Developed Spark jobs using Scala for processing locomotive events.
- Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze data.
- Involved in debugging and fine-tuning the Solr cluster and queries.
- Designed and developed applications using the SolrJ API to index and search documents.
- Involved in importing document data from external systems into HDFS.
- Developed Spark Streaming applications to ingest emails and instant messages into HBase and Elasticsearch.
- Involved in troubleshooting performance issues and tuning the Hadoop cluster.
- Wrote code to interact with HBase using the HBase Java client API.
- Managed and allocated tasks for onsite and offshore resources.
- Involved in setting up Kerberos and authenticating from the web application.
- Involved in developing Spark code using Scala and Spark-SQL for faster testing and data processing, and explored optimizations using SparkContext, Spark-SQL, pair RDDs, and Spark on YARN.
Environment: HDP 2.6, AWS, Azure, Cassandra, RabbitMQ, Postgres, Spark, Hive, Elasticsearch, Hadoop, HDFS, Docker, Sqoop, MongoDB, Spring Boot, Swagger
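As a minimal illustration of replacing a MapReduce-style aggregation with Spark SQL over Hive tables in PySpark, the sketch below expresses the whole map/shuffle/reduce flow as one declarative aggregation; the database, table, and column names are hypothetical.

```python
# Minimal sketch of replacing a MapReduce-style aggregation with Spark SQL
# over Hive tables (PySpark). Table and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("spark-sql-over-hive-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# What a multi-stage MapReduce job would express as map + shuffle + reduce
# becomes a single declarative aggregation pushed through Spark's optimizer.
daily_events = (
    spark.table("logs_db.events")                    # placeholder Hive table
    .where(F.col("event_date") >= "2018-01-01")      # partition-friendly filter
    .groupBy("event_date", "event_type")
    .agg(
        F.count("*").alias("event_count"),
        F.countDistinct("user_id").alias("unique_users"),
    )
)

# Persist the result back to Hive for downstream reporting (placeholder table).
daily_events.write.mode("overwrite").saveAsTable("logs_db.daily_event_summary")

spark.stop()
```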
Sr. Hadoop Developer
Confidential, Tampa, FL
Responsibilities:
- Experience with the complete SDLC process: staging, code reviews, source code management, and the build process
- Implemented Big Data platforms as data storage, retrieval and processing systems
- Developed data pipeline using Kafka, Sqoop, Hive and Java MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis
- Wrote Sqoop scripts for importing and exporting data into HDFS and Hive
- Wrote MapReduce jobs to discover trends in data usage by the users
- Loaded and transformed large sets of structured, semi-structured, and unstructured data using Pig
- Experienced working with Pig to perform transformations, event joins, filtering, and some pre-aggregations before storing the data in HDFS
- Involved in developing Hive UDFs for functionality not available out of the box in Hive
- Created sub-queries for filtering and faster query execution
- Experienced in migrating HiveQL to Impala to minimize query response time
- Used HCatalog to access Hive table metadata from MapReduce and Pig scripts
- Experience loading and transforming large amounts of structured and unstructured data into HBase, with exposure to handling automatic failover in HBase
- Ran POCs in Spark to benchmark the implementation
- Developed Spark jobs using Scala in test environment for faster data processing and querying
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala
- Used Python for pattern matching in build logs to format warnings and errors (see the sketch following this project entry)
- Configured big data workflows to run on top of Hadoop using Oozie; these workflows comprise heterogeneous jobs such as Pig, Hive, and Sqoop, with cluster coordination services through ZooKeeper
- Hands-on experience in Tableau for data visualization and analysis of large data sets, drawing various conclusions
- Involved in developing a test framework for data profiling and validation using interactive queries, collecting all test results into audit tables for comparing results over time
- Documented all requirements, code, and implementation methodologies for review and analysis purposes
- Extensively used GitHub as the code repository and Phabricator for managing the day-to-day development process and tracking issues
Environment: Java, Scala, Hadoop, Spark, HDFS, MapReduce, YARN, Hive, Pig, Impala, Oozie, Sqoop, Flume, Kafka, Teradata, SQL, GitHub, Phabricator, Amazon Web Services
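As a small illustration of the Python log-scanning step mentioned above, the sketch below matches warning and error lines in a build log and reformats them into a uniform report. The log path and message patterns are assumptions, since real build tools vary in their output format.

```python
# Small sketch of scanning build logs for warnings and errors with Python
# regular expressions. The log path and message formats are assumptions;
# real build tools will need their own patterns.
import re
import sys
from collections import Counter

# Match lines like "[ERROR] something broke" or "warning: deprecated API".
PATTERNS = {
    "error": re.compile(r"\[?ERROR\]?[:\s]+(?P<msg>.+)", re.IGNORECASE),
    "warning": re.compile(r"\[?WARN(?:ING)?\]?[:\s]+(?P<msg>.+)", re.IGNORECASE),
}

def scan_log(path):
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as handle:
        for lineno, line in enumerate(handle, start=1):
            for level, pattern in PATTERNS.items():
                match = pattern.search(line)
                if match:
                    counts[level] += 1
                    # Reformat the hit into a compact, uniform report line.
                    print(f"{level.upper():7} line {lineno}: {match.group('msg').strip()}")
                    break
    return counts

if __name__ == "__main__":
    log_path = sys.argv[1] if len(sys.argv) > 1 else "build.log"  # placeholder default
    totals = scan_log(log_path)
    print(f"summary: {totals['error']} errors, {totals['warning']} warnings")
```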