- Over 8 years of diversified IT experience in end-to-end data analytics platforms (ETL, BI, Java), spanning Big Data/Hadoop, Java/J2EE development, and systems analysis.
- Worked for over 5 years with the Big Data/Hadoop ecosystem on Data Lake implementations.
- Hands-on experience with the Hadoop framework and its ecosystem, including the Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, Sqoop, Flume, and Spark.
- Experience across the layers of the Hadoop framework - storage (HDFS), analysis (Pig and Hive), and engineering (jobs and workflows) - extending functionality by writing custom UDFs.
- Extensive experience developing data warehouse applications using Hadoop, Informatica, Oracle, Teradata, and MS SQL Server on UNIX and Windows platforms; created complex mappings using various transformations and developed Extraction, Transformation, and Loading (ETL) strategies with Informatica 9.x/8.x.
- Proficient in Hive Query Language (HiveQL) and experienced in Hive performance optimization using static partitioning, dynamic partitioning, bucketing, and parallel execution.
- Experience in analyzing data using HiveQL, Pig Latin, and custom MapReduce programs and UDFs written in Java.
- Good understanding of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce concepts.
- Experience with the AWS (Amazon Web Services) cloud computing infrastructure.
- Hands-on experience with Amazon EC2, Amazon S3, EMR, Amazon RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS, and other services in the AWS family.
- Created streaming modules to ingest data into the Data Lake using Storm and Spark Streaming.
- Experience in dimensional data modeling (Star Schema, Snowflake Schema, fact and dimension tables) and in concepts such as Lambda Architecture, batch processing, and Oozie.
- Extensively used Informatica client tools: Source Analyzer, Warehouse Designer, Mapping Designer, Mapplet Designer, ETL transformations, Repository Manager, Server Manager, Workflow Manager, and Workflow Monitor.
- Hands-on experience with AWS databases such as RDS (Aurora), Redshift, and DynamoDB.
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data sources.
- Extensively used Python programming for automation and for connecting different ecosystems; hands-on experience with Pig and Hive for data analysis, Sqoop for data ingestion, Oozie for scheduling, and Zookeeper for coordinating cluster resources.
- Worked on a Scala code base for Apache Spark, performing actions and transformations on RDDs, DataFrames, and Datasets using Spark SQL and Spark Streaming contexts.
- Worked closely with Dev and QA teams to review pre- and post-processed data and ensure data accuracy and integrity.
- Experience in Java, J2EE, JDBC, Collections, Servlets, JSP, Struts, Spring, Hibernate, JSON, XML, REST, SOAP web services, Groovy, MVC, Eclipse, and WebLogic, WebSphere, and Apache Tomcat servers.
- Extensive knowledge of Data Modeling, Data Conversions, Data integration and Data Migration with specialization in Informatica Power Center.
- Expertise in extracting, transforming, and loading data from heterogeneous systems such as flat files, Excel, Oracle, Teradata, and MS SQL Server.
- Good working experience with UNIX/Linux commands, scripting, and deploying applications on servers; maintained, tuned, and monitored Hadoop jobs and clusters in a production environment.
- Strong skills in algorithms, data structures, object-oriented design, design patterns, documentation, and QA/testing.
- Excellent domain knowledge in Insurance, Telecom, and Banking.
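As a toy illustration of the map/shuffle/reduce pattern referenced above, here is a minimal word-count sketch in plain Python; the data and function names are illustrative only, not from any project listed here:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) pairs for each word in each input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate the grouped values per key.
    return {key: sum(values) for key, values in grouped.items()}

lines = ["big data big results", "data lake"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'results': 1, 'lake': 1}
```

In real MapReduce the shuffle is performed by the framework across machines; the sketch keeps all three phases in one process.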
Big Data Technologies: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper, Kafka, MongoDB, Apache Spark, Spark Streaming, HBase, Impala
Hadoop Distribution: Cloudera, Hortonworks, Apache, AWS
Languages: Java, SQL, PL/SQL, Pig Latin, HiveQL, Scala, Python, PySpark, Shell Scripting, Regular Expressions
Operating Systems: Windows (XP/7/8/10), UNIX, Linux, Ubuntu, CentOS
Portals/Application servers: WebLogic, WebSphere Application server, WebSphere Portal server, JBOSS
Build Automation tools: SBT, Ant, Maven
Version Control: GIT
IDE & Build Tools, Design: Eclipse, Visual Studio, Net Beans, Rational Application Developer, Junit
Databases: Oracle, SQL Server, MySQL, MS Access, NoSQL Database (HBase, MongoDB), Teradata.
AWS Bigdata Engineer
Confidential, Malvern, PA
- Worked in AWS environment for development and deployment of Custom Hadoop Applications.
- Involved in the end-to-end process of Hadoop jobs that used technologies such as Sqoop, Pig, Hive, MapReduce, Spark, and shell scripts (for scheduling a few jobs); extracted and loaded data into the Data Lake environment (Amazon S3) using Sqoop, where it was accessed by business users and data scientists.
- Designed a data workflow model to create a data lake in the Hadoop ecosystem so that reporting tools like Tableau can plug in to generate the necessary reports.
- Created Source to Target Mappings (STMs) for the required tables based on the business requirements for the reports.
- Developed PySpark and SparkSQL code to process data in Apache Spark on Amazon EMR, performing the necessary transformations based on the STMs.
- Created Hive tables on HDFS to store, in Parquet format, the data processed by Apache Spark on the Hadoop cluster.
- Wrote multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file formats, including XML, JSON, CSV, and other compressed formats.
- Loaded log data directly into HDFS using Flume.
- Leveraged AWS S3 as storage layer for HDFS.
- Encoded and decoded JSON objects using PySpark to create and modify DataFrames in Apache Spark.
- Used Bitbucket as the code repository and frequently used Git commands (clone, push, pull, among others) against the repository.
- Used the Hadoop Resource Manager to monitor jobs run on the Hadoop cluster.
- Used Confluence to store the design documents and the STMs
- Met with business and engineering teams on a regular basis to keep requirements in sync and deliver on them.
- Used Jira to track the stories worked on under the Agile methodology.
Environment: Spark, PySpark, SparkSQL, Hive, Pig, Flume, IntelliJ IDE, AWS CLI, AWS EMR, AWS S3, REST API, shell scripting, Git
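The JSON encode/decode work above can be sketched in a much simplified form, using only the Python standard library rather than PySpark; the field names and records are hypothetical:

```python
import json

# Hypothetical input: one JSON object per line, as commonly landed in S3.
raw_lines = [
    '{"user_id": 1, "event": "login"}',
    '{"user_id": 2, "event": "purchase", "amount": 19.99}',
]

# Decode each line into a dict; missing fields default to None so every
# record carries the same schema before it becomes a DataFrame row.
records = []
for line in raw_lines:
    obj = json.loads(line)
    records.append({
        "user_id": obj.get("user_id"),
        "event": obj.get("event"),
        "amount": obj.get("amount"),
    })

# Re-encode the normalized records for downstream storage.
encoded = json.dumps(records)
print(records[1]["amount"])  # 19.99
```

In PySpark the equivalent normalization is typically done with an explicit schema when reading the JSON, so that ragged records still produce uniform columns.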
Sr. Big Data Developer
Confidential, Bridgeville, PA
- Created and worked on Sqoop jobs with incremental load to populate Hive External tables.
- Designed and developed Hive tables to store staging and historical data.
- Created Hive tables as per requirements; internal and external tables were defined with appropriate static and dynamic partitions for efficiency.
- Experience in using ORC file format with Snappy compression for optimized storage of Hive tables.
- Solved performance issues in Hive and Pig scripts by analyzing joins, grouping, and aggregation, and ran them on the Impala processing engine.
- Developed Spark scripts by using Scala shell commands as per the requirement.
- Created Oozie workflows for Sqoop to migrate data from the source to HDFS and then to the target tables.
- Developed Oozie workflow for scheduling and orchestrating the ETL process.
- Responsible for building scalable distributed data solutions using Hadoop.
- Experience in job management using the Fair Scheduler; developed job-processing scripts using Oozie workflows.
- Involved in migrating MapReduce jobs to Spark jobs and used the Spark SQL and DataFrames APIs to load structured and semi-structured data into Spark clusters.
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
- Worked extensively with Sqoop for importing metadata from Oracle.
- Developed Oozie workflow jobs to execute Hive, Pig, Sqoop and MapReduce actions.
- Configured Flume to transport web server logs into HDFS .
- Worked on Amazon Web Services (AWS) cloud services such as Elastic Compute Cloud (EC2), Simple Storage Service (S3), Elastic MapReduce (EMR), Amazon SimpleDB, and Amazon CloudWatch.
- Implemented Spark using Scala and SparkSQL for faster testing and processing of data.
- Used Apache Kafka for importing real time network log data into HDFS.
- Worked on numerous POCs to determine whether Big Data was the right fit for a business case.
- Worked on data processing like collecting, aggregating, moving from various sources using Apache Flume and Kafka .
- Created a web-based user interface for creating, monitoring, and controlling data flows using Apache NiFi.
Environment : Apache Hadoop, CDH 4.7, HDFS, MapReduce, AWS, Sqoop, Flume, Pig, Hive, HBase, Oozie, Scala, Spark, Spark Streaming, Kafka, Linux
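A minimal sketch of the incremental-load idea behind the Sqoop jobs above, assuming an auto-incrementing id column as the check column; the data and function are illustrative, in pure Python rather than Sqoop itself:

```python
# Hypothetical source rows keyed by an auto-incrementing id.
source_rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]

def incremental_import(source, last_value):
    """Return rows with id greater than last_value, plus the new watermark."""
    new_rows = [row for row in source if row["id"] > last_value]
    new_last = max((row["id"] for row in new_rows), default=last_value)
    return new_rows, new_last

# First run imports everything; a second run picks up only the new row.
batch1, watermark = incremental_import(source_rows, 0)
source_rows.append({"id": 4, "name": "d"})
batch2, watermark = incremental_import(source_rows, watermark)
print(len(batch1), len(batch2))  # 3 1
```

Sqoop's incremental append mode applies the same watermark logic via its check-column and last-value settings, persisting the watermark between runs when configured as a saved job.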
Big Data Developer
Confidential, Dallas, TX
- Worked with the source team to understand the format & delimiters of the data files.
- Responsible for generating actionable insights from complex data to drive significant business results for various application teams.
- Developed and implemented API services using Python in Spark.
- Troubleshot and resolved data quality issues and maintained a high level of data accuracy in the data being reported.
- Extensively implemented POCs on migrating to Spark Streaming to process live data.
- Ingested data from RDBMS, performed data transformations, and exported the transformed data to Cassandra per business requirements.
- Rewrote existing MapReduce jobs to use new features and improvements, achieving faster results.
- Analyzed large data sets to determine the optimal way to aggregate and report on them.
- Performance-tuned slow-running, resource-intensive jobs.
- Worked with data serialization formats, converting complex objects into serialized byte sequences using Avro, Parquet, JSON, and CSV.
- Hands on experience working on in-memory based Apache Spark application for ETL transformations.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.
- Developed multiple POCs using Spark, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL/Teradata.
- Developed Flume configurations to extract log data from different sources and transfer data in different file formats (JSON, XML, and Parquet) to Hive tables.
- Set up Oozie workflow/sub-workflow jobs for Hive, Sqoop, and HDFS actions.
- Experience in accessing Kafka cluster to consume data into Hadoop.
- Involved in importing the real-time data to Hadoop using Kafka and implemented the Oozie job for daily imports.
- Worked with the business and functional requirement-gathering team, updated user comments in Jira, and documented in Confluence.
- Handled tasks like maintaining an accurate roadmap for the project and certain products.
- Monitored sprints and burndown charts and completed monthly reports.
Environment: Hive, SQL, Pig, Flume, Kafka, MapReduce, Sqoop, AWS, Spark, Python, Java, Shell Scripting, Teradata, Oracle, Oozie, Cassandra
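The serialization-format conversions above can be illustrated with a simplified CSV-to-JSON round trip using only the standard library; the sample feed and column names are hypothetical:

```python
import csv
import io
import json

# Hypothetical CSV extract, as might arrive from an upstream feed.
csv_text = "id,city,amount\n1,Dallas,10.5\n2,Austin,7.0\n"

reader = csv.DictReader(io.StringIO(csv_text))
rows = []
for row in reader:
    # Cast string fields to typed values before re-serializing,
    # since CSV carries no type information of its own.
    rows.append({"id": int(row["id"]),
                 "city": row["city"],
                 "amount": float(row["amount"])})

json_text = json.dumps(rows)
print(json_text)
```

Avro and Parquet solve the same typing problem differently: they embed a schema in the file, so the cast step above is unnecessary there.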
Big data Developer
Confidential, Arlington, VA
- Gathered business requirements in meetings for successful implementation and POC (Proof of Concept) of the Hadoop cluster.
- Imported data on a regular basis using Sqoop into Hive partitions and controlled the workflow using Apache Oozie.
- Developed Sqoop jobs both to import data into HDFS from relational database management systems such as Oracle and DB2 and to export data from HDFS to Oracle.
- Developed HQL queries to implement select, insert, and update operations against the database by creating HQL named queries.
- Involved in data extraction, including analyzing, reviewing, and modeling based on requirements using higher-level tools such as Hive and Impala.
- Involved in migrating HiveQL to Impala to minimize query response time.
- Involved in creating Hive tables, loading them with data, and writing Hive queries.
- Developed Pig functions to pre-process the data for analysis.
- Created HBase tables to store all data.
- Deployed the HBase cluster in a cloud (AWS) environment with scalable nodes per business requirements.
- Analyzed identified defects and their root causes and recommended courses of action.
- Loaded data into Hive tables from the Hadoop Distributed File System (HDFS) to provide SQL-like access to Hadoop data.
- Worked on exporting the analyzed data to existing relational databases using Sqoop, making it available for visualization and report generation by the BI team.
- Generated reports and predictions using the BI tool Tableau; integrated data using Talend.
Environment: HDFS, Hive, MapReduce, Sqoop, Impala, Java, Pig, SQL Server, HBase, Oracle, Tableau, AWS.
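The SQL-like access pattern above can be sketched with an in-memory SQLite table standing in for a Hive table; the table and data are illustrative only:

```python
import sqlite3

# In-memory stand-in for a Hive table providing SQL-like access.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, "login"), (1, "purchase"), (2, "login")])

# Aggregate with plain SQL, as a BI team would against the exported tables.
result = conn.execute(
    "SELECT event, COUNT(*) FROM events GROUP BY event ORDER BY event"
).fetchall()
print(result)  # [('login', 2), ('purchase', 1)]
```

The same GROUP BY query runs unchanged in HiveQL; what Hive adds is that the "table" is a schema projected over files in HDFS rather than a local database.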
- Integrated Kafka with Storm for real-time data processing and wrote Storm topologies to store the processed data directly in MongoDB and HDFS.
- Experience in writing Spark SQL scripts.
- Imported data from different sources into Spark RDD for processing.
- Developed custom aggregate functions using Spark SQL and performed interactive querying.
- Involved in loading data from edge node to HDFS using shell scripting.
- Worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode high availability, capacity planning, and slot configuration.
- Completed unit testing of the new Hadoop jobs in standalone mode for the Unit region using MRUnit.
- Developed Spark scripts using Scala and Python shell commands as per requirements.
- Experience in managing and reviewing Hadoop log files.
- Experience in Hive partitioning and bucketing, performing joins on Hive tables, and implementing Hive SerDes such as REGEX, JSON, and Avro.
- Optimized Hive analytics SQL queries, created tables/views, wrote custom UDFs, and implemented Hive-based exception processing.
- Involved in transferring legacy Teradata tables to HDFS and HBase tables using Sqoop, and vice versa.
- Configured Fair Scheduler to provide fair resources to all the applications across the cluster.
Environment: Hortonworks Hadoop, Ambari, Spark, Solr, Kafka, MongoDB, Linux, HDFS, Hive, Pig, Sqoop, Flume, Zookeeper, RDBMS.
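The REGEX SerDe mentioned above splits raw log lines into columns by pattern; a plain-Python sketch of the same idea with a named-group pattern follows (the log format and pattern are assumptions for illustration):

```python
import re

# Hypothetical access-log lines; the named-group pattern plays the role of a
# REGEX SerDe, splitting each line into named columns.
LOG_PATTERN = re.compile(
    r"(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] \"(?P<req>[^\"]+)\"")

lines = [
    '10.0.0.1 - - [01/Jan/2020:00:00:01] "GET /index.html"',
    '10.0.0.2 - - [01/Jan/2020:00:00:02] "POST /api/v1"',
]

records = []
for line in lines:
    match = LOG_PATTERN.match(line)
    if match:  # skip malformed lines rather than failing the whole batch
        records.append(match.groupdict())

print(records[0]["ip"])  # 10.0.0.1
```

In Hive the equivalent is declared once in the table DDL (SerDe class plus the regex as a property), and every query then sees the extracted columns.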
- Designed the functional and UI components.
- Implemented functionality at the I/O level.
- Created Record Sets and BIOs for the database schema.
- Created relationships to enforce data integrity.
- Used Form Slots based on the BIO schema.
- Provided document attachments for work orders/invoices.
- Achieved authentication and authorization by creating users and profiles in the admin page.
- Wrote JSPs and Servlets to add functionality to the web application based on customer requirements.
- Used J2EE design patterns to create the application, including EJBs for business logic.
- Created and executed JUnit test cases for unit testing of the application.
- Wrote SQL queries to retrieve data from the database using JDBC.
- Worked on designing and developing large, transactional enterprise applications.
- Performed CRUD operations against the SQL database.
- Implemented object-permissions at widget, menu, and form levels.
- Developed Form-level extensions to achieve UI-level validations and BIO-level extensions to fulfill functional requirements and validations.
- Involved in the development process, with knowledge of tracking tools like Jira.
- Good knowledge of the Epiphany platform (Open Architecture).
- Extensive hands-on experience with complex PL/SQL programming.
- Designed and coded application components in an Agile/TDD environment utilizing test-driven development and pair programming.
Environment: CRB Studio, WebLogic Server 8.1, LDAP, Java 1.8, SQL Server, SQL, JUnit, EJBs, JSP, HTML, CSS.
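The JDBC data retrieval above relies on the prepared-statement pattern: a `?` placeholder keeps user input out of the SQL text. A minimal analogue, kept in Python with sqlite3 for consistency with the other sketches (table and data are hypothetical):

```python
import sqlite3

# In-memory database standing in for the application's SQL Server backend.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])

def find_user(conn, user_id):
    # Parameter binding instead of string concatenation, as in a JDBC
    # PreparedStatement; this avoids SQL injection and re-parsing costs.
    row = conn.execute("SELECT name FROM users WHERE id = ?",
                       (user_id,)).fetchone()
    return row[0] if row else None

print(find_user(conn, 2))  # bob
```

The Java equivalent binds through `PreparedStatement.setInt` before `executeQuery`; the placeholder semantics are the same.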