- 15+ years of experience in the IT industry designing, developing, and maintaining web-based applications using Big Data technologies such as the Hadoop ecosystem, Spark ecosystem, Scala, ETL, RDBMS, Core Java, and related technologies, with domain exposure in banking, retail, insurance, and management systems.
- 5+ years of strong experience in developing distributed Big Data analytical applications using open-source frameworks such as Apache Spark, Apache Hadoop, Hive, and Kafka.
- Good understanding of Apache Spark features and its advantages over MapReduce systems.
- In-depth understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, and Spark Streaming.
- Expertise in writing Spark RDD transformations and actions, DataFrames, persistence (caching), accumulators, broadcast variables, and case classes for the required input data, and in performing data transformations using Spark Core.
- In-depth understanding of Apache Spark job execution components such as the DAG, lineage graph, DAG Scheduler, Task Scheduler, stages, and tasks.
- Good understanding of the Driver, Executors, and the Spark web UI.
- Experience in submitting Apache Spark jobs to YARN.
- Experience in real-time processing using Apache Spark Streaming, with Kafka as the messaging system.
- Experienced with NoSQL databases like HBase and Cassandra.
- Expertise in developing real-time streaming solutions using Spark Streaming.
- Experienced in developing Spark programs using the Scala and Java APIs.
- Expertise in using Spark-SQL with various data sources like JSON, Parquet and Hive.
- Excellent understanding of Hadoop architecture, including HDFS and daemons such as NameNode, DataNode, JobTracker, and TaskTracker, as well as MapReduce concepts.
- Experienced with Hadoop, with expertise in providing end-to-end solutions for real-time big data problems by implementing distributed processing concepts such as MapReduce on HDFS and other Hadoop ecosystem components.
- Strong experience in Spark SQL UDFs and Spark SQL performance tuning.
- Hands-on experience working with input file formats such as ORC, Parquet, JSON, and Avro.
- Implemented Sqoop for large data transfers from RDBMS to HDFS/HBase/Hive and vice versa.
- Expertise in using Flume for collecting, aggregating, and loading log data from multiple sources into HDFS.
- Experienced in using Pig scripts to perform transformations, event joins, filters, and some pre-aggregations before storing the data in HDFS.
- Scheduled various ETL processes and Hive scripts by developing Oozie workflows.
- Hands on Experience in writing SQL and PL/SQL queries.
- Strong experience in Informatica ETL Tool, Data warehousing and Business intelligence.
- Good understanding of Core Java, Scala, and Java EE technologies such as JDBC, Servlets, and JSP.
- Worked with WebLogic and the Tomcat web server for development and deployment of Java/J2EE applications.
- Good knowledge of Cloudera distributions and of AWS services including Amazon Simple Storage Service (Amazon S3), Amazon EC2, and Amazon EMR.
- Experience in the complete Software Development Life Cycle (SDLC), including requirement analysis, design, coding, testing, and implementation, using Agile (Scrum).
- Worked with operating systems like Linux, UNIX, and Windows.
Languages: Scala, Core Java and J2EE technologies, Servlets, and JSP
Big Data/Hadoop technologies: Apache Hadoop, Spark, Hive, HDFS, Pig, Sqoop, Flume, Kafka, ZooKeeper, and Oozie
RDBMS: Oracle, SQL Server, Teradata, MySQL
NoSQL DBMS: HBase, Cassandra
Scripting Languages: UNIX Shell script
Development Tools: IntelliJ, Eclipse, NetBeans
Servers: WebLogic and Tomcat
Operating Systems: UNIX, Windows, Linux
Cloud: AWS, Azure
Confidential, Bentonville, AR
- Coordinated with the BI team to gather requirements for various data mining projects.
- Configured Spark Streaming to receive ongoing data from Kafka and stored the stream data in HDFS and Cassandra.
- Used Spark Streaming APIs to perform the necessary transformations and actions on data received from Kafka and persisted the results to Cassandra.
- Designed and developed data integration programs in a Hadoop environment, with the NoSQL data store Cassandra for data access and analysis.
- Used various Spark Transformations and Actions for cleansing the input data.
- Used DataStax Spark-Cassandra connector to load data into Cassandra and used CQL to analyse data from Cassandra tables for quick searching, sorting and grouping.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build a common learner data model, which receives data from Kafka in near real time and persists it to Cassandra.
- Processed real-time streaming data using Kafka integrated with the Spark Streaming API.
- Consumed JSON messages from Kafka and processed them using Spark Streaming.
- Wrote real-time processing and core jobs using Spark Streaming, with Kafka as the data pipeline system.
- Configured Spark Streaming to read data from Kafka streams and then store it in HDFS.
- Worked extensively with Sqoop for importing metadata from MySQL and assisted in exporting analysed data to relational databases using Sqoop.
- Created internal and external Hive tables as required, defined with appropriate static and dynamic partitions and buckets for efficiency, and wrote HQL scripts to perform data analysis.
- Worked on Hive optimization techniques using joins and subqueries, and used various functions to improve the performance of long-running jobs.
- Optimized HiveQL by using Spark as the execution engine.
- Migrated HiveQL queries to Spark SQL on the Spark engine to minimize query response time.
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing, in both development and production.
- Handled importing of data from various data sources, performed transformations using Hive, Spark and loaded data into HDFS.
- Developed multiple Spark jobs in Scala for data cleaning and pre-processing.
- Used Sqoop, Hadoop, Spark, and Oozie to build the data pipeline.
- Used ZooKeeper for cluster coordination services.
- Automated all jobs for pulling data from the FTP server and loading it into Hive tables using Oozie workflows.
Environment: Apache Spark, Apache Kafka, Scala, Cassandra, Hive, Sqoop, Hadoop, HDFS, Oozie, MySQL.
Confidential, Salt Lake City, Utah
- Worked with Hadoop Ecosystem components like Cassandra, Sqoop, Flume, Oozie, Hive and Pig.
- Managed data coming from different sources; involved in HDFS maintenance and in loading structured and unstructured data.
- Developed Pig and Hive UDFs in Java to extend Pig and Hive, and wrote Pig scripts for sorting, joining, filtering, and grouping data.
- Developed Spark programs, based on application requirements, for faster data processing than standard MapReduce programs.
- Developed Spark programs using Scala, created Spark SQL queries, and developed Oozie workflows for Spark jobs.
- Developed the Oozie workflows with Sqoop actions to migrate the data from relational databases like Oracle, Teradata to HDFS.
- Used Hadoop FS actions to move the data from upstream location to local data locations.
- Wrote extensive Hive queries to transform data for use by downstream models.
- Developed Hive queries to do analysis of the data and to generate the end reports to be used by business users.
- Implemented Spark applications in Scala, using DataFrames and the Spark SQL API for faster data processing.
- Developed Spark code and Spark-SQL/Streaming for faster testing and processing of data.
- Used Spark for interactive queries, processing of streaming data, and integration with a popular NoSQL database for large volumes of data.
- Developed a data pipeline using Kafka, Cassandra, and Hive to ingest, transform, and analyse customer behavioural data.
- Worked extensively with Hive joins and used HQL for querying the databases, eventually developing complex Hive UDFs.
- Migrated iterative MapReduce programs to Spark transformations using Scala.
- Used Scala to write the code for all the use cases in Spark and Spark SQL.
- Implemented Spark and Scala applications using higher-order functions for both batch and interactive analysis requirements, and implemented Spark batch jobs.
- Worked with the Spark Core, Spark Streaming, and Spark SQL modules.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Explored various Spark modules and worked with DataFrames, RDDs, and SparkContext.
- Developed a data pipeline using Spark and Hive to ingest, transform, and analyse data.
- Performed visualizations per business requirements using a custom visualization tool.
- Developed Spark scripts by using Scala shell commands as per the requirement.
- Tuned the performance of Spark applications by setting the right batch interval, the correct level of parallelism, and appropriate memory settings.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark.
- Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, efficient joins, transformations, and other techniques during the ingestion process itself.
- Analysed the SQL scripts and designed the solution to implement using Scala.
- Developed a data pipeline on AWS to extract data from weblogs and store it in HDFS.
- Involved in creating Hive tables and in loading and analysing data using Hive queries.
- Implemented schema extraction for Parquet and Avro file Formats in Hive.
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Worked with AWS cloud services such as EC2, S3, EMR, and RDS.
- Migrated an existing on-premises application to AWS, using services such as EC2 and S3 for processing and storing small data sets; maintained the Hadoop cluster on AWS EMR.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Used Reporting tools like Tableau to connect with Hive for generating daily reports of data.
Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Scala, Kafka, Hive, Cassandra, Sqoop, Amazon AWS, Tableau, Oozie, Cloudera, Oracle 12c, Linux.
- Interacted with business community and gathered requirements based on changing needs. Incorporated identified factors into Informatica mappings to build the DataMart.
- Developed a standard ETL framework to enable the reusability of similar logic across the board. Involved in System Documentation of Dataflow and methodology.
- Assisted in designing Logical/Physical Data Models, forward/reverse engineering using Erwin 7.0.
- Developed mappings to extract data from SQL Server, Oracle, Teradata, and flat files and load it into the DataMart using PowerCenter.
- Developed common routine mappings. Made use of mapping variables, mapping parameters and variable functions.
- Used Informatica Designer to create complex mappings using different transformations like Filter, Router, Connected & Unconnected lookups, Stored Procedure, Joiner, Update Strategy, Expressions and Aggregator transformations to pipeline data to DataMart.
- Developed Type 3 Slowly Changing Dimension (SCD) mappings.
- Used mapplets in mappings, saving valuable design time and effort.
- Used Informatica Workflow Manager to create, schedule, execute and monitor sessions, Worklets and workflows.
- Troubleshot databases, workflows, mappings, sources, and targets to identify bottlenecks and improve performance.
- Created indexes and primary keys and reviewed other performance tuning at the database level.
- Implemented various Performance Tuning techniques on Sources, Targets, Mappings, Workflows and database tuning.
- Involved in generating reports from Data Marts using Cognos.
- Tracked, reviewed, and analysed defects.
- Used Source Analyzer and Warehouse designer to import the source and target database schemas, and the Mapping Designer to map the sources to the target.
- Performed configuration management to migrate Informatica mappings, sessions, and workflows from the development environment to test and then to production.
Environment: Informatica PowerCenter, MS SQL Server 2012/2008R2, Oracle 10g, MS Windows, Shell Scripts, Teradata, SQL, and PL/SQL.
- Responsible for understanding the scope of the project and requirement gathering.
- Developed the web tier using JSP, Struts MVC to show account details and summary.
- Created and maintained the configuration of the Spring Application Framework.
- Implemented various design patterns - Singleton, Business Delegate, Value Object and Spring DAO.
- Used Spring JDBC to write some DAO classes which interact with the database to access account information.
- Mapped business objects to database using Hibernate.
- Involved in writing Spring configuration XML files that contain object declarations and the declarations of their dependent objects.
- Used the Tomcat web server for development purposes.
- Involved in creation of Test Cases for Unit Testing.
- Used Oracle as the database and Toad for query execution; involved in writing SQL scripts and PL/SQL code for procedures and functions.
- Used CVS and Perforce as configuration management tools for code versioning and release.
- Developed the application using Eclipse and used Maven as the build and deployment tool.
- Used Log4j to write debug, warning, and info log messages to the server console.
- Used JDBC connection pooling for accessing embedded and legacy data sources.
- Developed front-end user interface screens and server-side scripts using JSP, HTML, JavaScript, Servlets, custom tags, and XML.
- Used XML Spy for creating and validating XML files and for generating XSL style sheets.
- Designed and Implemented Server Objects using Java Servlets, EJB, JDBC.
Environment: Java, J2EE, JSON, Linux, XML, XSL, CSS, JavaScript, Eclipse