- Over 7 years of IT experience in software development, including hands-on experience in Big Data engineering and analytics and Java application development.
- Expertise with tools in the Hadoop ecosystem, including Spark, Hive, HDFS, MapReduce, Sqoop, Pig, Kafka, YARN, Oozie, and ZooKeeper.
- Strong programming experience using Java, Scala, Python and SQL.
- Strong fundamental understanding of Distributed Systems Architecture and parallel processing frameworks.
- Strong experience designing and implementing end-to-end data pipelines processing terabytes of data.
- Expertise in developing production-ready Spark applications using the Spark Core, DataFrames, Spark SQL, Spark ML, and Spark Streaming APIs.
- Strong experience troubleshooting failures in Spark applications and fine-tuning them for better performance.
- Experience using DStreams in Spark Streaming, accumulators, broadcast variables, multiple caching levels, and other Spark optimization techniques (see the sketch at the end of this summary).
- Strong experience working with the data ingestion tools Sqoop and Kafka.
- Good knowledge of and development experience with the MapReduce framework.
- Hands-on experience writing ad hoc queries to move data from HDFS into Hive and analyzing it using HiveQL.
- Proficient in creating Hive DDLs and writing custom Hive UDFs.
- Knowledge of workflow management and monitoring tools such as Oozie and Rundeck.
- Experience designing, implementing, and managing secure authentication to Hadoop clusters with Kerberos.
- Experience working with NoSQL databases such as HBase, Cassandra, and MongoDB.
- Experience with ETL processes covering data sourcing, mapping, transformation, conversion, and loading.
- Good knowledge of building ETL jobs in Talend to load large volumes of data into the Hadoop ecosystem and relational databases.
- Experience working with the Cloudera, Hortonworks, and Amazon EMR distributions.
- Good experience in developing applications using Java, J2EE, JSP, MVC, EJB, JMS, JSF, Hibernate, AJAX and web-based development tools.
- Strong experience with relational and data-warehouse technologies, including MySQL, Oracle, Snowflake, Redshift, and Teradata.
- Strong expertise in shell scripting, regular expressions, and cron job automation.
- Good knowledge of web services (SOAP, WSDL), XML parsers such as SAX and DOM, and front-end development with AngularJS and responsive design/Bootstrap.
- Experience working with Docker containers and Kubernetes orchestration.
- Experience with version control systems such as CVS, TFS, and SVN.
- Worked with geographically distributed, culturally diverse teams in roles involving direct interaction with clients and team members.
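A minimal Scala sketch of the broadcast-variable and caching patterns referenced above; the input paths, column names, and lookup table are hypothetical placeholders, not details from an actual project.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Small lookup table broadcast to every executor so the enrichment
    // avoids a full shuffle of the large fact data (paths are hypothetical).
    val countryCodes: Map[String, String] =
      spark.read.option("header", "true").csv("/data/lookup/countries.csv")
        .collect()
        .map(r => r.getString(0) -> r.getString(1))
        .toMap
    val codesBc = spark.sparkContext.broadcast(countryCodes)

    val events = spark.read.parquet("/data/events") // large dataset

    val enriched = events.map { row =>
      val code = row.getAs[String]("country_code")
      (row.getAs[String]("user_id"), codesBc.value.getOrElse(code, "UNKNOWN"))
    }.toDF("user_id", "country")

    // Cache at MEMORY_AND_DISK when the result feeds several downstream actions.
    enriched.persist(StorageLevel.MEMORY_AND_DISK)
    enriched.count()
    spark.stop()
  }
}
```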
Big Data/Hadoop Technologies: HDFS, YARN, MapReduce, Hive, Pig, Impala, Sqoop, Flume, Spark (Scala & Python), Kafka and Oozie
NoSQL Databases: HBase, Cassandra, MongoDB
Languages: Java, Scala, Python, SQL
Application Servers: WebLogic, WebSphere, JBoss, Tomcat.
Databases: Microsoft SQL Server, MySQL, Oracle, DB2
Build and Version Tools: Jenkins, Maven, Git
Development Tools: Eclipse, IntelliJ
Development Methodologies: Agile/Scrum, Waterfall
Confidential, Denver, CO
Sr. Hadoop/Spark Developer
- Responsible for ingesting large volumes of user behavioral data and customer profile data into the analytics data store.
- Developed custom multi-threaded Java-based ingestion jobs, as well as Sqoop jobs, for ingesting data from FTP servers and data warehouses.
- Developed Scala-based Spark applications to perform data cleansing, event enrichment, aggregation, denormalization, and data preparation for consumption by machine learning and reporting teams.
- Troubleshot Spark applications to make them more fault tolerant.
- Fine-tuned Spark applications to improve overall pipeline processing times.
- Wrote Kafka producers to stream data from external REST APIs to Kafka topics (see the producer sketch at the end of this section).
- Wrote Spark Streaming applications to consume data from Kafka topics and write the processed streams to HBase (see the streaming sketch at the end of this section).
- Handled large datasets using Spark's in-memory capabilities, broadcast variables, and efficient joins and transformations.
- Worked extensively with Sqoop for importing data from Oracle.
- Worked with EMR clusters in the AWS cloud, along with S3, Redshift, and Snowflake.
- Involved in creating Hive tables and loading and analyzing data using Hive scripts.
- Implemented static and dynamic partitioning and bucketing in Hive.
- Good experience with continuous integration of applications using Bamboo.
- Used reporting tools such as Tableau, connected to Impala, to generate daily data reports.
- Collaborated with the infrastructure, network, database, application and BA teams to ensure data quality and availability.
- Documented and tracked operational problems in JIRA, following standards and procedures.
Environment: Hadoop, Spark, Scala, Python, Hive, Sqoop, Oozie, Kafka, Amazon EMR, YARN, JIRA, Amazon AWS, Shell Scripting, SBT, GitHub, Maven.
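A minimal Scala sketch of the kind of Kafka producer described above; the broker address, topic name, and payload are hypothetical stand-ins for data fetched from an external REST API.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object RestToKafkaProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // hypothetical broker
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    try {
      // In the real job the payload would come from an external REST API;
      // here a placeholder JSON string stands in for that response.
      val payload = """{"userId":"u123","event":"click"}"""
      producer.send(new ProducerRecord[String, String]("user-events", "u123", payload))
      producer.flush()
    } finally {
      producer.close()
    }
  }
}
```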
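A minimal Scala sketch of a Spark Streaming job consuming a Kafka topic and writing to HBase, as described above; the broker, topic, table, and column-family names are illustrative assumptions.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object KafkaToHBaseStream {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("kafka-to-hbase"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092", // hypothetical broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "hbase-writer")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("user-events"), kafkaParams))

    stream.foreachRDD { rdd =>
      // Open one HBase connection per partition, not per record.
      rdd.foreachPartition { records =>
        val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("events"))
        records.foreach { rec =>
          // Record key becomes the HBase row key (assumes keys are non-null).
          val put = new Put(Bytes.toBytes(rec.key()))
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"),
            Bytes.toBytes(rec.value()))
          table.put(put)
        }
        table.close()
        conn.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```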
Confidential, Hartford, CT
- Involved in requirement analysis, design, coding and implementation phases of the project.
- Loaded the data from Teradata to HDFS using Teradata Hadoop connectors.
- Converted existing MapReduce jobs into Spark transformations and actions using Spark RDDs, DataFrames, and Spark SQL APIs.
- Wrote new Spark jobs in Scala to analyze customer and sales-history data.
- Used Kafka to ingest data from many streaming sources into HDFS.
- Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
- Good experience with Hive partitioning, bucketing, and collection types, and with performing different types of joins on Hive tables (see the partitioning sketch at the end of this section).
- Created Hive external tables to perform ETL on data generated on a daily basis.
- Wrote HBase bulk-load jobs that convert processed data to HFiles and load them into HBase tables.
- Performed validation on the data ingested to filter and cleanse the data in Hive.
- Created Sqoop jobs to handle incremental loads from RDBMS into HDFS and applied Spark transformations.
- Loaded data into Hive tables from Spark using the Parquet columnar format.
- Developed Oozie workflows to automate and productionize the data pipelines.
- Developed Sqoop import Scripts for importing data from Netezza.
Environment: Hadoop, HDFS, Hive, Sqoop, Kafka, Spark, Shell Scripting, HBase, Scala, Python, Kerberos, Maven, Ambari, Hortonworks, MySQL.
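A minimal Scala sketch of loading a dynamically partitioned, Parquet-backed Hive table from Spark, as referenced above; the table names, columns, and paths are illustrative assumptions. (Bucketing, also mentioned above, would be declared in the Hive DDL with CLUSTERED BY; it is omitted here because Spark-side writes to bucketed Hive tables are version-dependent.)

```scala
import org.apache.spark.sql.SparkSession

object HivePartitionedLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-partitioned-load")
      .enableHiveSupport()
      .getOrCreate()

    // Dynamic partitioning must be enabled before INSERT ... PARTITION.
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    // External table partitioned by load date; names and paths are illustrative.
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS sales (
        |  customer_id STRING,
        |  amount      DOUBLE
        |)
        |PARTITIONED BY (load_date STRING)
        |STORED AS PARQUET
        |LOCATION '/warehouse/sales'""".stripMargin)

    // The partition column goes last in the SELECT for dynamic partitioning;
    // staging_sales is a hypothetical staging table.
    spark.sql(
      """INSERT OVERWRITE TABLE sales PARTITION (load_date)
        |SELECT customer_id, amount, load_date FROM staging_sales""".stripMargin)

    spark.stop()
  }
}
```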
Confidential, New York City, NY
Big Data/Hadoop Developer
- Actively participated in all phases of the Software Development Life Cycle (SDLC), from implementation to deployment.
- Responsible for building scalable distributed data solutions using Hadoop.
- Responsible for cluster maintenance (adding and removing cluster nodes), cluster monitoring and troubleshooting, and managing and reviewing data backups and log files.
- Responsible for managing test data coming from different sources.
- Analyzed data using Hadoop components Hive and Pig.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data using Hadoop/Big Data techniques.
- Involved in importing and exporting data between RDBMS and HDFS using Sqoop.
- Involved in loading data from the UNIX file system to HDFS.
- Responsible for creating Hive tables, loading data, and writing Hive queries.
- Created Hive external tables, loaded data into them, and queried the data using HiveQL (see the sketch at the end of this section).
- Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded the data into HDFS.
- Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
- Extracted data from Teradata into HDFS using Sqoop.
- Exported the analyzed patterns back to Teradata using Sqoop.
- Monitored system metrics and logs for problems when adding, removing, or updating Hadoop cluster nodes.
- Scheduled the Oozie workflow engine to run multiple Hive and Pig jobs, and used Oozie workflows for batch processing and dynamic workflow scheduling.
- Involved in requirement analysis, design, coding and implementation phases of the project.
Environment: Hadoop, Spark 1.5.2, Scala, MapReduce, HDFS, Hive, Java, SQL, Cloudera Manager, Pig, Sqoop, Oozie, ZooKeeper
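A minimal Scala sketch of creating and querying a Hive external table, as referenced above; the table layout, delimiter, and HDFS location are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object ExternalTableQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("external-table-query")
      .enableHiveSupport()
      .getOrCreate()

    // External table over files already landed in HDFS; dropping the table
    // leaves the underlying data in place.
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
        |  user_id STRING,
        |  url     STRING,
        |  ts      TIMESTAMP
        |)
        |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        |LOCATION '/data/raw/web_logs'""".stripMargin)

    // A typical HiveQL analysis query over the external table.
    spark.sql(
      """SELECT user_id, COUNT(*) AS hits
        |FROM web_logs
        |GROUP BY user_id
        |ORDER BY hits DESC
        |LIMIT 100""".stripMargin).show()

    spark.stop()
  }
}
```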
Confidential, Camp Hill, PA
- Developed complex MapReduce jobs in Java to perform data extraction, aggregation, and transformation.
- Loaded data into HDFS from different data sources such as Oracle and DB2 using Sqoop, and loaded it into Hive tables.
- Analyzed big data sets by running Hive queries and Pig scripts.
- Integrated the Hive warehouse with HBase for information sharing among teams.
- Developed Sqoop scripts to move data between MySQL and HDFS for processing with Pig.
- Worked on Static and Dynamic partitioning and Bucketing in Hive.
- Scripted complex HiveQL queries on Hive tables for analytical functions.
- Developed complex Hive UDFs to work with sequence files (see the UDF sketch at the end of this section).
- Designed and developed Pig Latin scripts and Pig command line transformations for data joins and custom processing of Map Reduce outputs.
- Created dashboards in Tableau to create meaningful metrics for decision making.
- Performed rule checks on multiple file formats such as XML, JSON, and CSV, as well as compressed file formats.
- Monitored system health and logs and responded to any warning or failure conditions.
- Worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
- Used storage formats such as Avro to access data quickly in complex queries.
- Implemented counters for diagnosing problems in queries, for quality control, and for application-level statistics.
- Optimized MapReduce jobs to use HDFS efficiently via various compression mechanisms.
- Implemented Log4j to trace logs and to track information.
- Developed helper classes that abstract the Cassandra cluster connection and act as a core toolkit.
- Installed the Oozie workflow engine and scheduled it to run date/time-dependent Hive and Pig jobs.
- Involved in Agile methodologies, daily Scrum meetings, Sprint planning.
Environment: Hadoop, MapReduce, HDFS, Hive, HBase, Java, SQL, Cloudera Manager, Pig, Sqoop, Oozie, Linux.
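A minimal Scala sketch of a reflection-style Hive UDF of the kind referenced above; the package, class name, and normalization logic are hypothetical.

```scala
package com.example

import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Registered in Hive with (illustrative paths and names):
//   ADD JAR hdfs:///libs/udfs.jar;
//   CREATE TEMPORARY FUNCTION normalize_url AS 'com.example.NormalizeUrlUDF';
class NormalizeUrlUDF extends UDF {
  // Hive resolves and calls evaluate() once per row;
  // Text is Hadoop's writable string type.
  def evaluate(input: Text): Text = {
    if (input == null) return null
    // Strip the query string and lower-case the URL (illustrative logic).
    val url = input.toString.toLowerCase
    val cut = url.indexOf('?')
    new Text(if (cut >= 0) url.substring(0, cut) else url)
  }
}
```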
- Involved in requirement analysis, design, development, and testing of the risk workflow system.
- Involved in implementing the design through the phases of the software development life cycle (SDLC), including development, testing, implementation, and maintenance support.
- Applied OOAD principles to the analysis and design of the system.
- Implemented XML Schemas used with the XQuery query language.
- Applied J2EE design patterns like Singleton, Business Delegate, Service Locator, Data Transfer Object (DTO), Data Access Objects (DAO) and Adapter during the development of components.
- Used RAD for development, testing, and debugging of the application.
- Used WebSphere Application Server to deploy the build.
- Developed front-end screens using Struts, JSP, HTML, AJAX, jQuery, JavaScript, JSON, and CSS.
- Used J2EE for the development of business layer services.
- Developed Struts Action Forms, Action classes and performed action mapping using Struts.
- Performed data validation in Struts Form beans and Action Classes.
- Developed a POJO-based programming model using the Spring Framework.
- Used the Inversion of Control (IoC) pattern and Spring dependency injection for wiring and managing business objects.
- Used Web Services to connect to mainframe for the validation of the data.
- Used SOAP as the protocol for sending requests and responses as XML messages.
- Used JDBC to connect the application to the database.
- Used Eclipse for development, testing, and debugging of the application.
- Used the Log4j framework for logging debug, info, and error data.
- Used the Hibernate framework for object-relational mapping.
- Used Oracle 10g database for data persistence and SQL Developer was used as a database client.
- Extensively worked on Windows and UNIX operating systems.
- Used SecureCRT to transfer files from the local system to UNIX systems.
- Performed Test Driven Development (TDD) using JUnit.
- Used Ant script for build automation.
- Used the PVCS version control system, integrated with the Eclipse IDE, to check in and check out developed artifacts.
- Used Rational ClearQuest for defect logging and issue tracking.