Big Data Developer Resume
SUMMARY
- 7+ years of total IT experience in gathering and analyzing system requirements and designing and developing systems.
- 5+ years of big data application development experience with Hadoop ecosystem components such as HDFS, MapReduce, Spark, Hive, Pig, HBase, Sqoop, Flume, Kafka, ZooKeeper, and Oozie.
- 4+ years of experience in Data Warehousing and ETL using Pentaho PDI 8.1 and implementing Spark with Scala applications for faster processing.
- Experienced Hadoop/Java developer with end-to-end experience developing applications in the Hadoop ecosystem.
- Extensive experience with the Spark computation engine, using Spark Core, Spark Streaming, and Spark SQL for data processing.
- Hands-on experience writing HiveQL queries for data cleansing and processing, along with Hive performance optimization using partitioning, bucketing, and parallel execution (a brief DDL sketch follows this summary).
- Experience developing custom UDFs in Java to extend Hive and Pig Latin functionality.
- Excellent understanding and knowledge of NoSQL databases such as HBase and MongoDB.
- Good working experience with file formats such as Parquet, TextFile, Avro, and ORC, and compression codecs such as GZIP, Snappy, and LZO.
- Strong Core Java programming skills, including OOP concepts such as classes, methods, inheritance, encapsulation, loops, and exception handling.
- Hands-on experience in application development using Java, RDBMS (SQL), and Linux/Unix shell scripting.
- Good knowledge of data warehousing, ETL development, distributed computing, and large-scale data processing.
- Knowledge of AWS EC2 and S3, along with services such as EMR and Redshift.
- Experience with Agile/Scrum methodologies to iterate quickly on product changes, developing user stories and working through the backlog.
- Knowledge of setting up workflows with the Apache Oozie workflow engine, including the Oozie Coordinator, for managing and scheduling Hadoop jobs.
- Hands-on experience with build tools such as Maven, SBT, and Ant.
- Experience with version control tools such as SVN and Git, hosting platforms such as GitHub, and JIRA for issue tracking.
- Experience in dimensional data modeling (star schema, snowflake schema, fact and dimension tables) and concepts such as Lambda Architecture, batch processing, and Oozie.
- Quick learner with excellent communication and interpersonal skills and a high degree of self-motivation.
- Responsible, detail-oriented, analytical team player with the ability to coordinate in a team environment and deliver on time.
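A minimal sketch of the Hive partitioning and bucketing approach mentioned above, issued through Spark SQL in Scala; the table and column names (web_logs, user_id, event_date) are illustrative placeholders rather than details of any specific project.

    import org.apache.spark.sql.SparkSession

    object HivePartitioningSketch {
      def main(args: Array[String]): Unit = {
        // Hive support lets Spark SQL create and query Hive-managed tables.
        val spark = SparkSession.builder()
          .appName("HivePartitioningSketch")
          .enableHiveSupport()
          .getOrCreate()

        // Partition by date and bucket by user_id so that common filters prune
        // partitions and bucketed joins/sampling stay cheap.
        spark.sql(
          """CREATE TABLE IF NOT EXISTS web_logs (
            |  user_id STRING,
            |  url     STRING,
            |  ts      TIMESTAMP
            |)
            |PARTITIONED BY (event_date STRING)
            |CLUSTERED BY (user_id) INTO 32 BUCKETS
            |STORED AS ORC""".stripMargin)

        // A query filtering on the partition column reads only matching partitions.
        spark.sql("SELECT count(*) FROM web_logs WHERE event_date = '2018-01-01'").show()

        spark.stop()
      }
    }

The same DDL can also be run directly from the Hive CLI or Beeline; Spark SQL with Hive support is simply one convenient way to issue it from Scala.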
TECHNICAL SKILLS
Big Data Ecosystem: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, ZooKeeper, Spark, Kafka
Languages: Java, Scala, Python, C, C++
RDBMS/Databases: Oracle 10g and MS SQL Server
Operating Systems: Windows 2003, UNIX, Linux
Build Tools: ANT, Maven and SBT
Version Control Tools: SVN and Git
ETL Tool: Pentaho PDI 8.1, iWay, Talend
NoSQL Databases: HBase, Cassandra
Tools: SQL Developer, Toad, WinSCP, Putty, Control-M, Autosys
PROFESSIONAL EXPERIENCE
Confidential
Big Data Developer
Environment: Hortonworks (HDP 2.3), HDFS, MapReduce, Hive, ZooKeeper, Oozie, Pig, Sqoop, Python, Shell Scripting, Spark, Core Java, Jenkins, MySQL, Talend.
Responsibilities:
- Created documents showcasing the design and volume of data residing in the Enterprise Data Lake (EDL).
- Tracked and fixed any inconsistencies of fields/columns between landing, staging, and consumer zones.
- Communicated design changes to downstream consumers and helped them adapt to the transition.
- Created views and queries for downstream consumers to address any gaps in the data they require.
- Created Sqoop scripts to import tables from relational databases into HDFS and export them back.
- Worked on partitions and buckets for analyzing large volumes of data in Hive.
- Conducted POCs using Spark ingestion and Spark SQL (a sketch follows this list).
- Took part in solution architecture meetings to come up with solutions for existing problems.
- Tested out Talend as the ETL tool with the existing infrastructure.
- Developed stored procedures for different consumers across the bank to address their data needs.
- Handled responsibility of bringing teams together to address any critical issues.
- Optimized Tidal jobs to increase performance so that SLA timelines were met.
- Fixed any table level or field level discrepancies across environments (Dev, IST, UAT, Prod) to test any upgrades or new functionality.
- Responsible for managing and analyzing data coming from different sources.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data.
- Used Jenkins for continuous integration and automated builds in the software development process.
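A minimal sketch of what a Spark ingestion and Spark SQL POC of the kind mentioned above might look like, pulling a relational table over JDBC and landing it as a Hive table; the connection details and table names (sales.customers, staging.customers) are hypothetical.

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object SparkIngestionPoc {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("SparkIngestionPoc")
          .enableHiveSupport()
          .getOrCreate()

        // Read a source table over JDBC (hypothetical connection details).
        val customers = spark.read
          .format("jdbc")
          .option("url", "jdbc:mysql://db-host:3306/sales")
          .option("dbtable", "customers")
          .option("user", "etl_user")
          .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
          .load()

        // Land the data as a Hive table (the staging database is assumed to exist)
        // so downstream consumers can query it with Spark SQL or Hive.
        customers.write
          .mode(SaveMode.Overwrite)
          .saveAsTable("staging.customers")

        spark.stop()
      }
    }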
Confidential
Big Data Developer
Environment: Hortonworks (HDP 2.3), HDFS, MapReduce, Hive, ZooKeeper, Oozie, Pig, Sqoop, Python, Shell Scripting, Spark, Core Java, Jenkins, MySQL, Tidal, iWay
Responsibilities:
- Created and optimized ingestion scripts in Python.
- Imported tables/data using Sqoop from different RDBMS databases to HDFS.
- Cleaned up a large number of managed and external tables in Hive.
- Optimized Tidal scheduling jobs to address bottlenecks and improve performance.
- Implemented test scripts for QA teams to test functionality for any new or existing changes.
- Transformed intermediate scripts between the landing, staging, and consumer zones based on requirements.
- Analyzed job log files to identify errors disrupting daily loads in the production environment.
- Developed ETL workflows leveraging iWay and Tidal.
- Worked with file formats such as JSON, Avro, Parquet, ASCII, and EBCDIC.
- Monitored critical jobs and large data dumps to avoid duplication.
- Designed test files, queries, and hive tables for QA teams to test functionality.
- Performed data analytics in Hive and exported the resulting metrics back to an Oracle database using Sqoop.
- Leveraged SVN to push changes from Dev environment to UAT and PROD.
- Altered HQL scripts to add filters and/or improve query performance.
- Automated scripts to apply one operation to hundreds of tables at once (a sketch follows this list).
- Tested Spark ingestion as a POC and compared it to the existing ingestion method.
- Cleaned up unused tables and data to reclaim disk space.
- Loaded data from UNIX file system to HDFS and vice versa.
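A minimal Scala sketch of applying one operation to many Hive tables at once, as referenced above; the repeated operation here is a simple row count, and the database name analytics_db is a placeholder.

    import org.apache.spark.sql.SparkSession

    object BulkTableMaintenance {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("BulkTableMaintenance")
          .enableHiveSupport()
          .getOrCreate()

        // List every table in the target database and apply the same operation to each.
        val tables = spark.catalog.listTables("analytics_db").collect()

        tables.foreach { t =>
          val fqName = s"${t.database}.${t.name}"
          // The repeated operation here is just a row count; in practice it could be
          // adding a filter, recomputing statistics, or rewriting the storage format.
          val rows = spark.sql(s"SELECT count(*) FROM $fqName").first().getLong(0)
          println(s"$fqName -> $rows rows")
        }

        spark.stop()
      }
    }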
Confidential
Hadoop Developer
Environment: Cloudera Distribution (CDH) 5.1, Hive, Impala, Flume, Control-M, Pentaho 7.1, MySQL, Spark 2.2, Scala, HDFS, Unix, HBase, Git, Apache Parquet.
Responsibilities:
- Performed data ingestion, batch processing, data extraction, transformation, loading, and real-time streaming using Hadoop frameworks.
- Read, processed, and parsed CSV source data files with Spark/Scala and ingested them into Hive tables (a CSV-to-Hive sketch follows this list).
- Handled the CSV source data using core Scala constructs: RDDs, DataFrames, Datasets, methods, classes and objects, pattern matching, lists, and other collections.
- Used Spark for interactive queries, processing of streaming data, and integration with HBase for large volumes of data.
- Used different Spark modules, including Spark Core, Spark SQL, Spark Streaming, and the Dataset and DataFrame APIs.
- Performed transformations and actions on RDDs in Spark such as map, flatMap, filter, reduceByKey, groupByKey, combineByKey, union, join, countByKey, and lookup(key) (an RDD word-count sketch follows this list).
- Loaded and transformed large sets of structured, semi-structured, and unstructured data and analyzed it by running Hive queries.
- Created Hive tables and wrote Hive queries for data analysis to meet business requirements, and used Sqoop to import and export data between Oracle and HDFS.
- Used Spark SQL to access Hive tables from Spark for faster data processing.
- Used Spark DataFrame operations to perform required validations and analytics on the Hive tables.
- Extensively worked on Jenkins for continuous integration and for End to End automation for all build and deployments.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.
- Developed Oozie workflow for scheduling and orchestrating the ETL process.
- Designed and developed applications using Agile methodologies.
- Loaded data into the Hadoop ecosystem (Hive and Impala) using the Pentaho ETL tool (Spoon).
- Extensively worked on Hive and Impala tables, along with partitions and buckets, for analyzing large volumes of data.
- Handled version control using GitHub and SourceTree, and maintained documentation in JIRA and Confluence.
- Performed performance tuning and optimization of Pentaho jobs and SQL queries.
- Provided knowledge transfer to end users and junior developers on the Hive queries and the underlying business requirements.
- Proactively involved in ongoing maintenance, support and improvements in Hadoop cluster.
- Used Apache Parquet with Hive to take advantage of its compressed, efficient columnar data representation in the Hadoop ecosystem.
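A minimal sketch of the CSV-to-Hive ingestion flow described above, assuming headered CSV input; the path, table, and column names (orders, order_id, customer_name) are illustrative.

    import org.apache.spark.sql.{SaveMode, SparkSession}
    import org.apache.spark.sql.functions.trim

    object CsvToHive {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("CsvToHive")
          .enableHiveSupport()
          .getOrCreate()
        import spark.implicits._

        // Parse CSV files with a header row, inferring column types.
        val orders = spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///data/landing/orders/*.csv")

        // Basic cleansing: drop rows missing the key column, trim a string column.
        val cleaned = orders
          .filter($"order_id".isNotNull)
          .withColumn("customer_name", trim($"customer_name"))

        // Persist as a Parquet-backed Hive table for downstream analysis.
        cleaned.write
          .mode(SaveMode.Overwrite)
          .format("parquet")
          .saveAsTable("consumer.orders")

        spark.stop()
      }
    }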
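A short sketch of the RDD transformations and actions listed above (map, flatMap, filter, reduceByKey), shown as a word count over a hypothetical log path.

    import org.apache.spark.sql.SparkSession

    object RddTransformations {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("RddTransformations").getOrCreate()
        val sc = spark.sparkContext

        val lines = sc.textFile("hdfs:///data/logs/app.log") // hypothetical path

        val counts = lines
          .flatMap(_.split("\\s+"))  // flatMap: one line -> many tokens
          .filter(_.nonEmpty)        // filter: drop empty tokens
          .map(word => (word, 1))    // map: token -> (key, 1) pair
          .reduceByKey(_ + _)        // reduceByKey: sum the counts per word

        // Actions trigger execution: sample a few results and count distinct words.
        counts.take(10).foreach(println)
        println(s"distinct words: ${counts.count()}")

        spark.stop()
      }
    }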
Confidential
Hadoop Developer
Environment: Hortonworks Hadoop Distribution (HDP 2.4), HDFS, Spark 1.6, Scala 2.1, Kerberos, Unix, Hive, Tableau, Oozie, HBase, Kafka, NoSQL, MySQL, Pentaho Data Integration 7.1
Responsibilities:
- Primary responsibilities included building scalable distributed data solutions using the Hortonworks Hadoop ecosystem.
- Created a Spark POC, written in Scala, to capture user clickstream data and identify the topics each user is interested in (a sketch follows this list).
- Used Apache Avro with Hive to take advantage of its compact, efficient serialized data representation in the Hadoop ecosystem.
- Worked with source code repository and revision control systems such as SVN and Git.
- Extensively worked on Hive tables, partitions and buckets for analyzing large volumes of data.
- Scheduled Hive queries on a daily basis by writing an Oozie workflow and an Oozie Coordinator.
- Developed the application using Agile methodology.
- Worked on importing data from Oracle database to HDFS using Sqoop tool.
- Created and ran Sqoop jobs with incremental loads to populate Hive external tables.
- Wrote HiveQL scripts to transform raw data from several data sources into baseline data.
- Loaded data into the Hadoop ecosystem from various data sources and file systems using the Pentaho ETL tool (Spoon).
- Analyzed the data by writing Hive queries and running Hive scripts to study customer behavior.
- Developed Hive scripts for end user / analyst requirements to perform ad hoc analysis.
- Developed MapReduce programs for pre-processing and cleansing the data in HDFS obtained from heterogeneous data sources to make it suitable for ingestion into Hive schema for analysis.
- Responsible for creating Hive tables, loading the structured data resulting from MapReduce jobs into the tables, and writing Hive queries to further analyze the logs and identify issues and behavioral patterns.
- Implemented Hive generic UDFs to apply business logic around custom data types.
- Implemented Partitions, Buckets in Hive for optimization.
- Involved in creating Hive external/internal tables and views, loading structured data, and writing Hive queries that run internally as MapReduce jobs.
- Created HBase tables to store data in various formats coming from different portfolios.
- Proactively involved in ongoing maintenance, support, and improvements in the Hadoop cluster.
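A minimal sketch of the kind of clickstream POC described above, aggregating topic views per user from a hypothetical JSON clickstream feed; the field names user_id and topic are assumptions about the event schema.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.desc

    object ClickstreamTopics {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("ClickstreamTopics")
          .getOrCreate()

        // Clickstream events as JSON, one record per page view (hypothetical schema).
        val clicks = spark.read.json("hdfs:///data/clickstream/2018/*.json")

        // Rank topics by how often each user viewed them.
        val interests = clicks
          .groupBy("user_id", "topic")
          .count()                 // page views per (user, topic)
          .orderBy(desc("count"))

        interests.show(20, truncate = false)

        spark.stop()
      }
    }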
Confidential
Java Developer Intern
Environment: JDK 1.5, Core Java, J2EE, Linux, HTML, JSP, Spring (IoC and AOP), JSF, WebSphere, Hibernate, JavaScript, Maven, CSS, DB2, XML, UML, XSLT, FTP, HTTP, RSA 7.0, JUnit, Log4j, Apache Velocity, JMS, JDBC, EJB, and Web Services
Responsibilities:
- Designed application UML class diagrams and sequence diagrams using RSA.
- Involved in creating technical design for the project along with core team members.
- Ran the Java application on the JVM (Java Virtual Machine).
- Interacted with the business requirements team and helped develop business processes.
- Developed task utility services necessary for generating documents.
- Developed the application using Core Java.
- Handled the overnight build for this project and was involved in the release process.
- Worked with Windows and Oracle Enterprise Linux, Apache Tomcat, and Oracle WebLogic Server.
- Developed shell scripts, on Linux platform, to process data from upstream systems that were scheduled to be executed at specific times by Autosys.
- Tested the application using automation tools such as QuickTest Professional (QTP) and LoadRunner.
- Developed code to validate the state of business objects (POJOs) using the singleton pattern (a sketch follows).
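A minimal sketch of the singleton-based validation idea from this role; the original work was in Java, but the sketch is rendered in Scala to keep all examples in one language, and the Order type and its rules are hypothetical.

    // Hypothetical business object (the Java original would be a POJO with getters/setters).
    case class Order(id: String, quantity: Int, amount: Double)

    // Singleton validator: a Scala `object` yields a single shared instance,
    // mirroring the classic Java singleton pattern used for stateless validation.
    object OrderValidator {
      def validate(order: Order): Either[String, Order] =
        if (order.id == null || order.id.isEmpty) Left("missing order id")
        else if (order.quantity <= 0) Left("quantity must be positive")
        else if (order.amount < 0) Left("amount cannot be negative")
        else Right(order)
    }

    object ValidationDemo extends App {
      println(OrderValidator.validate(Order("A-100", 2, 49.99))) // Right(Order(A-100,2,49.99))
      println(OrderValidator.validate(Order("", 1, 10.0)))       // Left(missing order id)
    }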