Hadoop Developer Resume
Charlotte, NC
SUMMARY
- 8 years of overall experience with strong emphasis on Development, Design, Implementation and Testing of software applications in Hadoop, HDFS, MapReduce, Spark, Pig, Hive, Sqoop, Kafka, Oozie and Zookeeper
- Experience in writing Spark applications using Python (PySpark) and Scala
- Data extraction, transformation and loading (ETL) using Pig, Hive, Sqoop and HBase
- Acumen in data migration from RDBMS to the Hadoop platform using Sqoop, as well as in designing and developing applications
- Hands on experience with AWS services (VPC, EC2, S3, RDS, Redshift, Data Pipeline, EMR, DynamoDB, Lambda, SNS, SQS)
- Experience in writing custom UDFs for Hive to incorporate Java methods and functionality into HiveQL (HQL)
- Strong experience in core Java, J2EE, SQL
- Experience in migrating data from Hadoop/Hive/HBase to DynamoDB using Java automation
- Expertise in streaming data ingestion and processing
- Acumen in choosing the right Hadoop ecosystem components and providing efficient solutions for Big Data problems
- Well versed in designing and developing Big data systems
- Hands on experience in configuring Zookeeper to coordinate servers in clusters and maintain data consistency
- Experience in configuring and working with Flume to direct data from multiple sources directly into Hadoop
- Expertise in migrating ETL transformations using Pig Latin scripts and join operators
- Hands on experience in handling relational databases such as DB2, MySQL and SQL Server
- Knowledge in all phases of the SDLC, including Analysis, Design, Development, Implementation, Debugging and Software Testing in a client-server environment
- Basic experience in the AWS Big Data stack - EMR, S3, Glue, etc.
- Experienced in implementing projects using Agile and Waterfall methodologies and applying Design Patterns
- Well versed with Sprint ceremonies carried out in the Agile methodology
- Imported data from AWS S3 into Spark RDD
- Good knowledge on Machine Learning algorithms
- Strong Analytical skills and problem-solving capabilities with good communication and interpersonal skills
- Good team player with high motivation
TECHNICAL SKILLS
Big Data Ecosystem: Hadoop, MapReduce, Pig, HBase, Spark, YARN, Kafka, Hive, Flume, Sqoop, Oozie and Zookeeper
Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, Amazon EMR
Languages: Python, Java, SQL, Scala, C/C++ and Linux shell scripting
ETL Tools: Talend, Informatica
Methodology: Agile, Waterfall and Design Patterns
Web Design Tools: HTML, XML, JavaScript, CSS, JSON
Development / Build Tools: Eclipse, Maven, IntelliJ
DB Languages: MySQL, PL/SQL, PostgreSQL and Oracle
RDBMS: Teradata, Oracle 9i/10g/11g, MS SQL Server, MySQL and IBM DB2
Operating systems: UNIX, Linux, Mac OS and Windows variants
Data analytical tools: R, Pandas, NumPy, MATLAB, IBM SPSS
NoSQL Databases: HBase, Cassandra
Cloud Technologies: Amazon Web Services - EC2, RDS, S3, EMR, Glue, Lambda
Machine Learning/Data Science: Logistic Regression, Linear Regression, SVM, KNN, Decision Trees, Random Forests, K-Means, Dimensionality Reduction
Data Visualization Tools: Tableau, Power BI, Excel, RStudio (ggplot, ggplot2)
PROFESSIONAL EXPERIENCE
Confidential, Charlotte, NC
Hadoop Developer
Responsibilities:
- Developing MapReduce jobs to parse raw data, populate tables and store the processed data in partitioned tables in the Enterprise Data Warehouse
- Scripting Hive queries for ad hoc data analysis before promoting the data into the ongoing database
- Implementing partitioning and bucketing in Hive to manage external tables and optimize performance (a minimal sketch appears after this list)
- Generating a real-time feed using Kafka and Spark Streaming and transforming it into Parquet-formatted DataFrames in HDFS (a streaming sketch also follows this list)
- Deployed the application to GCP using Spinnaker (rpm based)
- Launched a multi-node Kubernetes cluster in Google Kubernetes Engine (GKE) and migrated the Dockerized application from AWS to GCP
- Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators
- Experience in GCP Dataproc, GCS, Cloud Functions and BigQuery
- Experience in moving data between GCP and Azure using Azure Data Factory.
- Experience in building Power BI reports on Azure Analysis Services for better performance
- Used the Cloud Shell SDK in GCP to configure the Dataproc, Storage and BigQuery services
- Developed microservices based on RESTful web services using Akka Actors and the Akka HTTP framework in Scala to handle high concurrency and high traffic volumes
- Developed REST based Scala service to pull data from ElasticSearch/Lucene dashboard, Splunk and Atlassian Jira
- Installed, configured, monitored and maintained Hadoop cluster on Big Data platform.
- Implemented solutions for ingesting data from various sources and processing the data utilizing Big Data technologies such as Hive, Pig, Sqoop, HBase, MapReduce, etc.
- Worked on Big Data integration and analytics based on Hadoop, SOLR, Spark, Kafka, Storm and webMethods
- Worked on Big Data streaming analytics for building predictive Machine Learning models using Scala, Python and R
- Implemented server-to-server (S2S) token-based authentication using Java to access remote REST APIs
- Developed microservices using Spring Boot and core Java/J2EE, hosted on AWS, to be called by the Confidential Fios mobile app
- Developed a native Scala/Java library using JSch to remotely execute Auto Logs Perl scripts
- Created and implemented a custom grid layout using the CSS Grid system and the jQuery JavaScript library
- Developed complex JIRA automation, including project workflows, screen schemes, permission schemes, notification schemes and Jira Event Listener API triggers, using the Atlassian Jira Plugin API in core Java and Adaptavist ScriptRunner Groovy scripts
- Developed Spark applications using Scala and Java and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources
- Developed Spark programs using the Scala and Java APIs and performed transformations and actions on RDDs
- Highly knowledgeable in Hadoop administration, with extensive experience building, configuring and administering large data clusters in Big Data environments using the Apache distribution
- Experienced in processing Big data on the Apache Hadoop framework using MapReduce programs.
- Experience in working with Windows, UNIX/LINUX platform with different technologies such as Big Data, SQL, XML, HTML, Core Java, Shell Scripting etc.
- Migrated existing MapReduce programs to Spark using Scala and Python
- Extensively worked on Python and built a custom ingest framework
- Worked on REST APIs using Python
- Worked closely with the Architect; enhanced and optimized product Spark and Python code to aggregate, group and run data mining tasks using the Spark framework
- Developed testing scripts in Python, prepared test procedures, analyzed test result data and suggested improvements to the system and software
- Performed SQL queries on AWS with Athena and Redshift
- Worked on ETL migration services by developing and deploying AWS Lambda functions to build a serverless data pipeline that writes to the Glue Catalog and can be queried from Athena
- Migrated HBase data from an in-house data center to AWS DynamoDB using the Java API
- Responsible for API design and implementation for exposing data to/from DynamoDB
- Used Enterprise JavaBeans (EJB) session beans in developing business-layer APIs
- Used Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive and Sqoop as well as system specific jobs.
- Experience in developing Spark Application using Spark SQL and Python for faster and accurate data processing
- Processing events and records from Kafka by writing Spark applications in Scala
- Used Spark as the execution engine for performing data analytics in the Hive environment
- Solved performance issues in Pig scripts with an understanding of joins, groups and aggregations, and converted them to run as MapReduce jobs
- Usage of HiveQL instead of the embedded Derby database to work in a shared Hive environment when the timeline is critical
- Experience in using SequenceFile, RCFile and Avro formats during the refresh stage
- Developing Oozie workflows for initiating and scheduling the ETL process
- Creating Oozie workflows for regular incremental loads from Teradata and importing the data into Hive tables
- Experience in spinning up AWS EC2-Classic and EC2-VPC instances using CloudFormation templates
- Developed Schedulers to communicate with AWS to retrieve data
- Managing and maintaining schedules for ETL pipelines on Glue
- Developing Bash scripts to direct log files from the FTP server into Hive tables
- Imported metadata into Hive and moved the existing Hive tables and ongoing applications to Amazon AWS cloud services for development
- Developing MapReduce jobs that adhere to the SLA protocols
- Moving data from HDFS to Cassandra using the BulkOutputFormat class in the MapReduce job
- Very good experience with both MapReduce 1 (JobTracker) and MapReduce 2 (YARN) setups
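The Hive partitioning and bucketing work called out above can be illustrated with a short PySpark/Hive sketch. It is a minimal, hypothetical example, not the production code: the table, columns, staging source and HDFS path are placeholders, and bucketing is noted only in a comment.

```python
from pyspark.sql import SparkSession

# Minimal, hypothetical sketch: table, column and path names are placeholders.
spark = (SparkSession.builder
         .appName("hive-partitioning-sketch")
         .enableHiveSupport()        # required so Spark SQL talks to the Hive metastore
         .getOrCreate())

# External table partitioned by load date. Bucketing would be added in the
# Hive DDL with CLUSTERED BY (customer_id) INTO n BUCKETS; it is omitted here
# because Spark's support for Hive bucketed tables varies by version.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS edw.transactions (
        txn_id      STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (load_dt STRING)
    STORED AS PARQUET
    LOCATION '/data/edw/transactions'
""")

# Enable dynamic partitioning so each day's data lands in its own partition.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Load from a (hypothetical) staging table; the partition column goes last.
spark.sql("""
    INSERT OVERWRITE TABLE edw.transactions PARTITION (load_dt)
    SELECT txn_id, customer_id, amount, load_dt
    FROM edw.transactions_staging
""")
```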
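The Kafka-to-Parquet feed can be sketched with Spark Structured Streaming in PySpark. Broker addresses, topic name, schema and paths are hypothetical, and the original pipeline may well have used the older DStream-based Spark Streaming API rather than this form.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Requires the spark-sql-kafka-0-10 connector on the classpath.
spark = SparkSession.builder.appName("kafka-to-parquet-sketch").getOrCreate()

# Hypothetical event schema for the JSON payload on the topic.
event_schema = StructType([
    StructField("txn_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
])

# Subscribe to the (hypothetical) Kafka topic.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
       .option("subscribe", "transactions")
       .load())

# Kafka delivers raw bytes; decode the value column and parse the JSON payload.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
             .select(from_json(col("json"), event_schema).alias("e"))
             .select("e.*"))

# Continuously append the parsed records to HDFS as Parquet.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/landing/transactions")
         .option("checkpointLocation", "hdfs:///checkpoints/transactions")
         .outputMode("append")
         .start())

query.awaitTermination()
```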
Confidential, Denver, CO
Hadoop Developer
Responsibilities:
- Analysis, troubleshooting and development on the Hadoop cluster using various Big Data analytical tools such as Spark, Pig, Hive, Scala, Tez and Kafka
- Construction of scalable distributed data solutions using Hadoop
- Performing data analytics on Hive with the help of the Spark API over Hortonworks Hadoop YARN
- Faster testing and processing of data through Spark code using Scala and Spark SQL/Streaming
- Implemented text analytics and processing with in-memory capabilities of Apache Spark written in Python
- Importing and exporting Teradata data using Sqoop, from HDFS to the RDBMS and vice versa
- Extraction, Transformation and Loading (ETL) of data from multiple sources like Databases, XML files and Flat Files
- Imported data from AWS S3 into Spark RDD
- Worked closely with the Architect; enhanced and optimized product Spark and Python code to aggregate, group and run data mining tasks using the Spark framework
- Developed Spark programs using the Scala and Java APIs and performed transformations and actions on RDDs
- Developing UDFs in Java for Hive and Pig and working on reading multiple data formats on HDFS using Scala
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Used Scala collection framework to store and process the complex consumer information.
- Used Scala functional programming concepts to develop business logic.
- Developed programs in Java and Scala-Spark for data reformatting after extraction from HDFS for analysis
- Developed testing scripts in Python, prepared test procedures, analyzed test result data and suggested improvements to the system and software
- Performed SQL queries on AWS with Athena and Redshift
- Worked on ETL migration services by developing and deploying AWS Lambda functions to build a serverless data pipeline that writes to the Glue Catalog and can be queried from Athena
- Directing data from an AWS S3 bucket through Spark Streaming to perform real-time transformations and aggregations, build the data model and send it to HDFS (a sketch follows this list)
- Performing incremental imports by creating Sqoop metastore jobs
- Managing and Reviewing HBase log files
- Writing MapReduce jobs to run on EMR clusters and managing the workflow for running other jobs
- Extracting analytics reports using EMR jobs run on an Amazon VPC cluster
- Designed and implemented Hive queries to perform filtering, evaluation, loading and storing of data
- Involved in performance tuning of the ETL process by addressing various performance issues at the extraction and transformation stages
- Migrated HBase data from an in-house data center to AWS DynamoDB using the Java API
- Responsible for API design and implementation for exposing data to/from DynamoDB
- Used Enterprise JavaBeans (EJB) session beans in developing business-layer APIs
- Used Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive and Sqoop as well as system specific jobs.
- Exported analyzed data to local databases using Sqoop to create visualization dashboards and generate reports for the managers
- Data warehousing using Hive and managing hive tables
- Working with Spark, which provides a fast, general engine for processing Big Data, integrated with Python programming
- Created and managed technical documentation for launching Hadoop clusters and constructing visualization dashboard templates for quarterly analysis
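The S3 ingestion and aggregation work described in this list can be illustrated with a minimal PySpark sketch. The bucket, columns and output path are hypothetical, and the production job used Spark Streaming rather than this simplified batch form.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, sum as sum_

# Assumes the cluster's Hadoop/S3A configuration supplies AWS credentials.
spark = SparkSession.builder.appName("s3-to-hdfs-sketch").getOrCreate()

# Read raw CSV from a hypothetical S3 bucket; the same path also works with
# spark.sparkContext.textFile() when a plain RDD is preferred over a DataFrame.
orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("s3a://example-bucket/raw/orders/"))

# Group and aggregate to build the data model.
daily = (orders.groupBy("order_date", "region")
               .agg(count("*").alias("order_count"),
                    sum_("amount").alias("total_amount")))

# Persist the aggregated model to HDFS as Parquet.
daily.write.mode("overwrite").parquet("hdfs:///data/model/daily_orders")
```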
Confidential
SQL Developer
Responsibilities:
- Developed SQL queries to generate and store data reports.
- Configured SQL jobs and maintenance plans to ensure database stability and integrity
- Coded and tested SQL queries against the tables using inline views, filters, merge statements and dynamic SQL statements, and monitored indexes to bring down processing time
- Ensuring several functions such as filters, randomizations and stratifications are active in the database
- SQL performance tuning by modifying indexes, setting transaction isolation levels and restructuring queries with inline calculations to replace sub-query-based functions
- Generating data from multiple database servers such as Oracle, DB2 and Access and connecting them using the SSIS tool
- Building an ETL process to move data from source database servers to the destination using the SSIS package, VBA and the Export/Import Wizard
- Used SSIS control flow components such as the Execute SQL Task, Foreach Loop Container, Script Task and File System Task to perform ETL functions
- Used SSRS to generate formatted reports with stored procedures and expressions.
- Designed SSRS reports providing visualization of the data
- Developing a series of automations using SQL, SSRS and Report Manager to generate production-formatted reports
Confidential
Data Analyst Intern
Responsibilities:
- Data extraction, compilation and tracking to generate reports post analysis
- Standardized SQL, SAS and MicroStrategy based data management infrastructure to support Market advantage
- Planning and coordinating the administration of PostgreSQL databases to ensure data accuracy and effective use of data within the database, covering the definition and structure of the build and operational guidelines
- Performing SQL queries to maintain and manage the data on a weekly to monthly basis, depending on SLA terms
- Predictive modeling using RStudio and performing time series analysis and time-to-event analysis to record changes in the market
- Data visualization using R packages such as ggplot and ggplot2 to generate dashboards for team meetings
- Developed optimized data and qualifying procedures
- Analyzed data using Excel and MicroStrategy to generate suggestions for business decisions
- Continuously engage with senior application analysts to understand procedures & functional data reconciliation requirements to design and develop changes within the tool
- Utilized Microsoft Excel to categorize reports into a detailed pivot table to develop improved insight deriving strategy
- Redesigned the data mart using extraction, transformation and loading (ETL) from various platforms to produce deeper insights and support effective decision making, and documented the whole process for future use
- Took part in automating data imports from the external environment using the SSIS package, which decreased time consumption from days to hours