Sr. Hadoop Developer Resume
Austin, Texas
PROFESSIONAL SUMMARY:
- Over 7 years of experience in the full Software Development Life Cycle (SDLC) and Agile methodology, covering analysis, design, development, testing, implementation and maintenance in Hadoop, Data Warehousing, Linux and Java.
- 4 years of experience providing Big Data solutions using Hadoop 2.x, HDFS, MR2, YARN, Kafka, Pig, Hive, Sqoop, HBase, Cloudera Manager, ZooKeeper, Oozie, Hue, CDH5 and HDP 2.x.
- Experienced in Big Data, Hadoop, NoSQL and components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode and the MapReduce 2 / YARN programming paradigm.
- Good understanding of distributed systems, HDFS architecture and the internal workings of the MapReduce and Spark processing frameworks; worked on debugging and performance tuning of Hive jobs.
- Migrated Hive tables from RCFile to ORC and worked with other customized file formats during data ingestion.
- Extended Pig and Hive core functionality by writing custom User Defined Functions for data analysis and file processing, invoked from Pig Latin scripts.
- Experienced in creating Hive internal/external tables using a shared metastore.
- Wrote Sqoop jobs to import data into Hadoop from Teradata and SQL Server.
- Extensive experience working with real-time streaming applications and batch-style, large-scale distributed computing applications; worked on integrating Kafka with NiFi and Spark.
- Good conceptual understanding of and experience with cloud computing applications using Amazon EC2, S3, EMR, SQS, SNS, RDS and Glue.
- In-depth knowledge of statistics, machine learning and data mining.
- Experienced with supervised learning techniques such as multi-linear regression, nonlinear regression, logistic regression, artificial neural networks, support vector machines, decision trees and random forests; experienced with the main unsupervised learning techniques.
- Proficient in applying performance tuning concepts to SQL queries, Informatica mappings, session and workflow properties, and databases.
- Worked on the Kafka Producer API and created a custom partitioner to publish data to a Kafka topic. Worked on a POC for streaming data using Kafka and Spark Streaming.
- Implemented a Kafka consumer with Spark Streaming and Spark SQL using Scala; validated the DStream, generated a new derived DStream and saved the data in HDFS (a minimal sketch appears at the end of this summary).
- Involved in importing real-time data into Hadoop using Kafka and implemented a daily Oozie job. Involved in developing Hive DDLs to create, alter and drop Hive tables, and worked with Storm and Kafka.
- Extensive knowledge of RDBMSs such as Oracle, Microsoft SQL Server and MySQL.
- Good understanding of NoSQL databases such as HBase, Cassandra and MongoDB.
- Supported MapReduce Programs running on the cluster and wrote custom MapReduce Scripts for Data Processing in Java.
- Experience with operating ETL processes and data pipelines to build large, complex data sets.
- Experienced in developing Spark jobs using Scala in a test environment for faster data processing, and used Spark SQL for querying.
- Good understanding and working knowledge of Data Structures, Design Patterns, Algorithms and Object-Oriented design.
- Capable of loading data from disparate data sets; familiar with data loading tools like Flume and Sqoop.
- Experience in Web Services using XML, HTML and SOAP.
- Diverse experience using Java tools in business, web and client-server environments, including the Java platform, J2EE, EJB, JSP, Servlets, Struts, Spring, JDBC and Hibernate, and application servers such as WebSphere and WebLogic.
- Familiar with popular frameworks like Struts, Hibernate, Spring MVC and AJAX.
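Below is a minimal sketch of the Kafka consumer / Spark Streaming pattern described above, written in Scala against the spark-streaming-kafka-0-10 direct stream API. The broker address, topic name, consumer group and HDFS output path are illustrative assumptions, not details from the projects themselves.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaToHdfsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-to-hdfs-sketch")
    val ssc = new StreamingContext(conf, Seconds(30)) // 30-second micro-batches

    // Assumed Kafka connection settings
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "resume-sketch-consumer",
      "auto.offset.reset" -> "latest"
    )

    // Subscribe to an assumed topic and keep only each record's value
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))
    val values = stream.map(record => record.value)

    // Persist every non-empty micro-batch to HDFS as text files
    values.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile(s"hdfs:///data/raw/events/batch-${time.milliseconds}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```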
TECHNICAL SKILLS:
Big data/Hadoop Ecosystem: HDFS, MR2, Hive, Pig, HBase, Sqoop, Flume, Oozie, Storm, Airflow and Avro
Java / J2EE Technologies: Core Java, Servlets, JSP, JDBC, XML, REST, SOAP, WSDL
Programming Languages: C, C++, Java, Scala, Python, SQL, PL/SQL, Linux shell scripts.
NoSQL Databases: MongoDB, Cassandra, HBase
Database: Oracle 11g/10g, DB2, MS-SQL Server, MySQL, Teradata.
Web Technologies: HTML, XML, JDBC, JSP, JavaScript, AJAX, SOAP
Frameworks: MVC, Struts 2/1, Hibernate 3, Spring 3/2.5/2.
Tools: Eclipse, IntelliJ, Git, PuTTY, WinSCP
Operating System: Ubuntu (Linux), Win 95/98/2000/XP, Mac OS, RedHat
ETL Tools: Informatica, Pentaho, Talend
Testing: Hadoop Testing, Hive Testing, Quality Center (QC)
Monitoring and Reporting tools: Ganglia, Nagios, Custom Shell scripts.
PROFESSIONAL EXPERIENCE:
Confidential, Austin, Texas
Sr. Hadoop Developer
Responsibilities:
- Worked on analyzing the Hadoop cluster using different big data analytic tools including Hive and MapReduce.
- Worked on configuring jobs in the Automic (UC4) scheduler and creating workflows.
- Implemented test scripts to support test driven development and continuous integration.
- Worked on POCs with Apache Spark using Scala to implement the Spark project.
- Consumed the data from Kafka using Apache Spark.
- Load and transform large sets of structured, semi structured and unstructured data.
- Used YAML files for configuring the pipeline and creating the Hive external and managed tables.
- Involved in loading data from the Linux file system, mainframe files and Oracle into HDFS.
- Imported and exported data into HDFS and Hive using Sqoop and an internal framework named AORTA.
- Data Validations performed on the data at all stages using an internal Data Quality framework which does the schema validations and all the basic checks on the data.
- Implemented partitioning, dynamic partitions and buckets in Hive.
- Worked on data modeling and flattening tables for the BI team using CA Erwin.
- Worked in creating HBase tables to load large sets of semi structured data coming from various sources.
- Extended Hive and Pig core functionality with custom User Defined Functions (UDFs), User Defined Table-Generating Functions (UDTFs) and User Defined Aggregate Functions (UDAFs) written in Java.
- Experienced in running Hadoop streaming jobs to process terabytes of XML-format data.
- Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.
- Responsible for loading data files from various external sources like Oracle and MySQL into a staging area in MySQL databases.
- Executed Hive queries on Parquet tables stored in Hive to perform data analysis to meet the business requirements.
- Actively involved in code review and bug fixing for improving the performance.
- Created RDDs and DataFrames for the required input data and performed data transformations and actions using Spark with Scala to implement the business analysis (see the sketch after this list).
- Experienced with batch processing of data sources using Apache Spark and Elastic search.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Used Spark Scala code to find duplicate products and delete them.
- Pulled categorical data into a Postgres database and performed business analysis.
- Good experience with Informatica open studio for designing ETL jobs for data processing.
- Created Talend mappings to populate data into dimension and fact tables.
- Broad design, development and testing experience with Talend Integration Suite and Talend MDM, with knowledge of performance tuning of mappings.
- Created data pipelines for ingestion and aggregation events and loaded consumer response data from the PIE McQueen bucket into Hive external tables at an HDFS location to serve as the feed for Tableau dashboards.
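The bullet on RDDs and DataFrames above references the following minimal sketch: reading an already-ingested staging table, dropping duplicate products and writing the result into a partitioned Hive table with dynamic partitioning enabled, from Spark with Scala. The database, table and column names (staging.products, curated.products, product_id, load_date) are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

object HivePartitionedLoadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-partitioned-load-sketch")
      .enableHiveSupport() // use the shared Hive metastore
      .getOrCreate()

    // Allow the partition column values to drive the target partitions
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    // Assumed staging table populated earlier by Sqoop / the ingestion framework
    val products = spark.table("staging.products")

    // Remove duplicate products before loading the curated layer
    val deduped = products.dropDuplicates("product_id")

    // Write to an assumed ORC table partitioned by load_date
    deduped.write
      .mode("overwrite")
      .format("orc")
      .partitionBy("load_date")
      .saveAsTable("curated.products")

    spark.stop()
  }
}
```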
Environment: Hadoop, Java, HDFS, MapReduce, Hive, Schema Registry, Kafka, Spark, Scala, Automic, HBase, SQL scripting, ETL tools, Teradata, Splunk, Linux shell scripting, Talend, Git, IntelliJ.
Confidential, Raleigh, North Carolina
Sr. Hadoop Developer
Responsibilities:
- Tuned Spark applications to improve performance. Worked collaboratively to manage build-outs of large data clusters and real-time streaming with Spark.
- Used Spark for interactive queries, processing of streaming data and integration with a popular NoSQL database for huge volumes of data.
- Implemented UDFs to consume complex Cassandra UDTs.
- Used Kafka to load data into HDFS and move data into a NoSQL database (Cassandra).
- Developed multiple POCs using Scala and deployed them on the YARN cluster; compared the performance of Spark with Cassandra and SQL.
- Expertise in designing column families in Cassandra and writing CQL queries to analyze data from Cassandra tables.
- Experience using the DataStax Spark-Cassandra connector to get data from Cassandra tables and process it using Apache Spark (see the sketch after this list). Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs and YARN.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Responsible for gathering the business requirements for the Initial POCs to load the enterprise data warehouse data to Greenplum databases.
- Fixed a bug in the TLS-checking code that prevented certificates with long chains from being used in Go.
- Worked on implementing a log producer in Scala that watches application logs, transforms incremental logs and sends them to a Kafka- and ZooKeeper-based log collection platform.
- Designed and improved the internal search engine using Big Data and Solr/Fusion.
- Migrated data from various data sources to Solr in stages according to the requirements.
- Extensively worked on Jenkins for continuous integration and for End to End automation for all build and deployments.
- Work with cross functional consulting teams within the data science and analytics team to design, develop, and execute solutions to derive business insights and solve clients' operational and strategic problems.
- Developed Python code to gather data from HBase and designed the solution for implementation using PySpark.
- Used PySpark to perform transformations and data processing on models.
- Worked on HBase to perform real-time analytics and experienced in CQL to extract data from Cassandra tables.
- Worked on Apache NiFi to uncompress and move JSON files from local storage to HDFS.
- Involved in the requirements and design phase to implement a streaming Lambda Architecture for real-time streaming using Spark and Kafka.
- Involved in loading and transforming large sets of Structured, Semi-Structured and Unstructured data and analyzed them by running Hive queries.
- Worked with architecture and development teams to understand the usage patterns and workload requirements of new projects, ensuring the Hadoop platform could effectively meet the performance requirements and service levels of applications.
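A minimal sketch of the DataStax Spark-Cassandra connector usage referenced above: reading a Cassandra table into a DataFrame and querying it with Spark SQL from Scala. The connection host, keyspace, table and column names are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

object CassandraReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cassandra-read-sketch")
      .config("spark.cassandra.connection.host", "cassandra-host") // assumed host
      .getOrCreate()

    // Load an assumed Cassandra table through the connector's DataFrame source
    val events = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "analytics", "table" -> "user_events"))
      .load()

    // Register the table and aggregate it with Spark SQL
    events.createOrReplaceTempView("user_events")
    spark.sql(
      """SELECT event_type, count(*) AS cnt
        |FROM user_events
        |GROUP BY event_type""".stripMargin).show()

    spark.stop()
  }
}
```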
Environment: Hadoop, AWS, Java, HDFS, MapReduce, Spark, Pig, Hive, Impala, Sqoop, Flume, Kafka, HBase, Oozie, SQL scripting, PySpark, Cassandra, Linux shell scripting, Eclipse and Cloudera.
Confidential, San Luis Obispo, CA
Sr. Hadoop Developer
Responsibilities:
- Responsible for developing efficient MapReduce programs on the AWS cloud for more than 20 years' worth of claims data to detect and separate fraudulent claims.
- Worked with the advanced analytics team to design fraud detection algorithms and then developed MapReduce programs to efficiently run the algorithm on the huge datasets.
- Ran data formatting scripts in Python and created terabyte-scale CSV files to be consumed by Hadoop MapReduce jobs.
- Created ETL mappings in Informatica for loading data into the data warehouse.
- Performed data analysis, feature selection and feature extraction using Apache Spark machine learning and streaming libraries in Python.
- Involved in administration, installing, upgrading and managing CDH3, Pig, Hive & HBase.
- Played a key role in setting up a 50-node Hadoop cluster utilizing Apache Spark by working closely with the Hadoop administration team.
- Created Hive tables to store data in HDFS, loaded the data and wrote Hive queries that run internally as MapReduce jobs.
- Hands-on experience with Spark and Spark Streaming, creating RDDs and applying transformations and actions. Extracted real-time data using Kafka and Spark Streaming by creating DStreams, converting them into RDDs, processing the data and storing it in Cassandra.
- Uploaded and processed terabytes of data from various structured and unstructured sources into HDFS (AWS cloud) using Sqoop and Flume.
- Involved in Cluster coordination services through Zookeeper.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Played a key role in installation and configuration of the various Hadoop ecosystem tools such as Solr, Kafka, Pig, HBase and Cassandra.
- Implemented various Hive optimization techniques like dynamic partitions, buckets, map joins and parallel execution.
- Monitored data lake connectivity, security, performance and file system management.
- Conducted day-to-day administration and maintenance work on the data lake environment.
- Scheduled and executed workflows in Oozie to run Hive and Spark jobs.
- Worked with Airflow (which replaced Oozie for this work).
- Built centralized logging to enable better debugging using Elasticsearch, Logstash and Kibana.
- Efficiently handled periodic exporting of SQL data into Elasticsearch.
- Worked on GitHub and Jenkins continuous integration tool for deployment of project packages.
- Parsed JSON files through Spark Core to extract the schema for production data using Spark SQL and Scala (see the sketch after this list).
- Experienced with AWS services to smoothly manage applications in the cloud and to create or modify instances.
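A minimal sketch of the JSON parsing step referenced above: Spark reads the uncompressed JSON files from HDFS, infers and prints the schema, and exposes the records to Spark SQL. The HDFS path and column names (claim_id, claim_amount) are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

object JsonSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-schema-sketch")
      .getOrCreate()

    // Assumed HDFS landing path for the uncompressed JSON files
    val claims = spark.read.json("hdfs:///data/landing/claims/*.json")

    // Inspect the schema Spark inferred from the production data
    claims.printSchema()

    // Query the parsed records with Spark SQL
    claims.createOrReplaceTempView("claims")
    spark.sql("SELECT claim_id, claim_amount FROM claims WHERE claim_amount > 10000").show()

    spark.stop()
  }
}
```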
Environment: Hadoop, HDFS, Pig, Hive, MapReduce, Sqoop, Kafka, CDH3, Cassandra, Python, Oozie, Java collections, Scala, AWS cloud, SQL, NoSQL, Bitbucket, Jenkins, HBase, Flume, Spark, Solr, ZooKeeper, ETL, CentOS, Eclipse.
Confidential, Dublin, OH
Hadoop Developer
Responsibilities:
- Worked on analyzing the Hadoop cluster and different big data analytic tools including Pig, the HBase database and Sqoop.
- Responsible for building scalable distributed data solutions using Hadoop.
- Implemented a nine-node CDH3 Hadoop cluster on Red Hat Linux.
- Involved in loading data from LINUX file system to HDFS.
- Worked on installing the cluster, commissioning and decommissioning DataNodes, NameNode recovery, capacity planning and slot configuration.
- Created HBase tables to store variable data formats of PII data coming from different portfolios (see the sketch after this list).
- Implemented a script to transmit sysprin information from Oracle to HBase using Sqoop.
- Implemented best income logic using Pig scripts and UDFs.
- Implemented test scripts to support test driven development and continuous integration.
- Worked on tuning the performance of Pig queries.
- Worked with application teams to install operating system and Hadoop updates, patches and version upgrades as required.
- Responsible for managing data coming from different sources.
- Involved in loading data from UNIX file system to HDFS.
- Load and transform large sets of structured, semi structured and unstructured data
- Provided cluster coordination services through ZooKeeper.
- Experience in managing and reviewing Hadoop log files.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs.
- Analyzed large data sets to determine the optimal way to aggregate and report on them.
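A minimal sketch of writing to and reading from an HBase table, as referenced in the bullet on HBase tables above, using the standard HBase client API from Scala. The table name, column family, qualifiers and row key are illustrative assumptions.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseLoadSketch {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create() // picks up hbase-site.xml from the classpath
    val connection = ConnectionFactory.createConnection(conf)
    try {
      val table = connection.getTable(TableName.valueOf("pii_portfolio")) // assumed table

      // Store one record keyed by account id, with columns in the "d" family
      val put = new Put(Bytes.toBytes("acct-0001"))
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("sysprin"), Bytes.toBytes("SYS123"))
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("portfolio"), Bytes.toBytes("retail"))
      table.put(put)

      // Read the record back and print one column
      val result = table.get(new Get(Bytes.toBytes("acct-0001")))
      val sysprin = Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("sysprin")))
      println(s"sysprin = $sysprin")

      table.close()
    } finally {
      connection.close()
    }
  }
}
```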
Environment: Hadoop, HDFS, Pig, Hive, Sqoop, HBase, Shell Scripting, Ubuntu, Linux Red Hat.
Confidential
Java Developer
Responsibilities:
- Performed analysis for the client requirements based on the developed detailed design documents.
- Developed Use Cases, Class Diagrams, Sequence Diagrams and Data Models.
- Developed Struts forms and actions for validation of user request data and application functionality.
- Developed JSPs with Struts custom tags and implemented JavaScript validation of data.
- Developed programs for accessing the database using the JDBC thin driver to execute queries, prepared statements and stored procedures, and to manipulate the data in the database.
- Used JavaScript for web page validation and the Struts Validator for server-side validation.
- Designed the database and coded SQL, PL/SQL, triggers and views using IBM DB2.
- Developed Message Driven Beans for asynchronous processing of alerts.
- Used ClearCase for source code control and JUnit for unit testing.
- Involved in peer code reviews and performed integration testing of the modules. Followed coding and documentation standards.
Environment: Java, Struts, JSP, JDBC, XML, Junit, Rational Rose, CVS, DB2, Windows.