- 8+ years of IT experience in software design, development, implementation, and support of business applications for the telecom, health, and insurance industries
- Experience with Big Data Hadoop and Hadoop ecosystem components such as MapReduce, Sqoop, Flume, Kafka, Pig, Hive, Spark, Storm, HBase, Airflow, Oozie, and Zookeeper
- Worked extensively on installing and configuring the Hadoop ecosystem components Hive, Sqoop, HBase, Zookeeper, and Flume
- Good knowledge of writing Spark applications in Python (PySpark)
- Hands-on experience with build and development tools such as Maven, Ant, Log4j, and JUnit
- Worked on data extraction, transformation, and loading using Hive, Pig, and HBase
- Hands-on experience in designing and developing Spark applications in Scala to compare the performance of Spark with Hive and SQL/Oracle.
- Experience in understanding the security requirements for Hadoop and integrating with Kerberos authentication and authorization infrastructure.
- Implemented ETL operations on a Big Data platform
- Hands-on experience with streaming data ingestion and processing
- Experienced in designing time-driven and data-driven automated workflows using Airflow.
- Used Spark MLlib for predictive intelligence, customer segmentation, and smooth maintenance in Spark Streaming.
- Strong acumen in choosing efficient Hadoop ecosystem components and providing the best solutions to Big Data problems.
- Well versed in design and architecture principles for implementing Big Data systems.
- Experience in configuring Zookeeper to coordinate servers in clusters and maintain data consistency
- Experienced in data migration from relational databases to the Hadoop platform using Sqoop.
- Experienced in migrating ETL transformations using Pig Latin scripts, including transformations and join operations.
- Good understanding of MPP databases such as HP Vertica and Impala.
- Hands-on experience in configuring and working with Flume to load data from multiple sources directly into HDFS
- Expertise in relational databases such as Oracle, MySQL, and SQL Server.
- Experienced in implementing projects in Agile and Waterfall methodologies.
- Well versed with Sprint ceremonies that are practiced in Agile methodology.
- Highly involved in all phases of the SDLC, including analysis, design, development, integration, implementation, debugging, and testing of software applications in client-server environments, object-oriented technology, and web-based applications.
- Strong analytical and problem-solving skills; highly motivated, good team player with very good communication and interpersonal skills.
Big Data Ecosystem: Hadoop, MapReduce, Pig, Hive, YARN, Kafka, Flume, Sqoop, Airflow, Oozie, Zookeeper, Spark, Ambari, Storm.
Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, Amazon EMR
NoSQL Databases: HBase
Methodology: Agile, Waterfall
Development / Build Tools: Eclipse, Ant, Maven, IntelliJ, JUnit and Log4j.
App/Web servers: JBoss and Tomcat
DB Languages: MySQL, PL/SQL, PostgreSQL and Oracle
RDBMS: Teradata, Oracle 9i, 10g, 11i, MS SQL Server, MySQL and DB2
Operating systems: UNIX, Linux, Mac OS and Windows variants
Data analytical tools: R and MATLAB
ETL Tools: Talend, Informatica
Confidential, Houston, TX
Sr. Hadoop/Spark Developer
- Expertise in designing and deploying Hadoop clusters and various Big Data analytics tools, including Pig, Hive, HBase, Oozie, Sqoop, Flume, Spark, and Impala.
- Ingested data from relational databases into HDFS using Sqoop
- Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Python
- Implemented Spark using Python and Spark SQL for faster testing and processing of data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs in Scala.
- Worked with Spark to create structured data from the pool of unstructured data received.
- Implemented intermediate functionality such as event or record counts from Flume sinks or Kafka topics by writing Spark programs in Java and Python.
- Documented the requirements, including the available code to be implemented using Spark, Hive, and HDFS.
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS.
- Experienced in transferring streaming data and data from different sources into HDFS and NoSQL databases
- Created ETL Mapping with Talend Integration Suite to pull data from Source, apply transformations, and load data into target database.
- Well versed with the Database and Data Warehouse concepts like OLTP, OLAP, Star Schema
- Developed PySpark and SparkSQL code to process the data in Apache Spark on Amazon EMR to perform the necessary transformations based on the STMs developed
- Used the secure global infrastructure of AWS and its range of security features to secure data in the cloud
- Worked with Amazon Web Services (AWS) cloud services such as EC2, S3, EBS, RDS, and VPC.
- Encoded and decoded JSON objects using PySpark to create and modify DataFrames in Apache Spark
- Ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra as per the business requirements.
- Developed multiple Kafka producers and consumers from scratch as per the software requirement specifications.
- Performed advanced procedures such as text analytics and processing using the in-memory computing capabilities of Spark with Python.
- Worked with Apache Spark, a fast, general-purpose engine for large-scale data processing, using Python.
- Implemented Spark applications in Scala, using DataFrames and the Spark SQL API for faster data processing.
- Streamed data in real time using Spark with Kafka.
- Experience designing and executing time-driven and data-driven Airflow workflows.
- Experience processing Avro data files using Avro tools and MapReduce programs.
- Worked on different file formats (ORCFILE, TEXTFILE) and different Compression Codecs (GZIP, SNAPPY, LZO).
- Processed input from multiple data sources in the same Reducer using GenericWritable and MultipleInputs.
- Involved in performance tuning of Spark jobs using caching and taking full advantage of the cluster environment.
Environment: Hadoop, Hive, Flume, MapReduce, Sqoop, Kafka, Spark, YARN, Pig, Cassandra, Oozie, Shell Scripting, Scala, Maven, Java, JUnit, Agile methodologies, MySQL
Confidential, Stamford, CT
- Worked extensively on Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN, Spark, and MapReduce programming
- Converted the existing relational database model to the Hadoop ecosystem.
- Worked with Linux systems and RDBMS database on a regular basis in order to ingest data using Sqoop.
- Developed schedulers that communicated with cloud-based services (AWS) to retrieve data.
- Strong experience in working with Elastic MapReduce (EMR) and setting up environments on AWS EC2 instances.
- Able to spin up different AWS instances, including EC2-Classic and EC2-VPC, using CloudFormation templates.
- Collected data from AWS S3 buckets in near real time using Spark Streaming, performed the necessary transformations and aggregations to build the data model, and persisted the data in HDFS
- Imported data from different sources such as AWS S3 and LFS into Spark RDDs.
- Experienced in working with Amazon Web Services (AWS) EC2 and S3 with Spark RDDs
- Managed and reviewed Hadoop and HBase log files.
- Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and AWS cloud.
- Designed and implemented Hive queries and functions for evaluation, filtering, loading, and storing of data.
- Analyzed table data and implemented compression techniques such as Teradata multi-value compression
- Involved in ETL process from design, development, testing and migration to production environments.
- Involved in writing the ETL test scripts and guided the testing team in executing the test scripts.
- Involved in performance tuning of the ETL process by addressing various performance issues at the extraction and transformation stages.
- Provided guidance to the development team working on PySpark as an ETL platform.
- Wrote Hadoop MapReduce jobs to run on Amazon EMR clusters and created workflows for running jobs
- Generated analytics reports on probe data by writing EMR (Elastic MapReduce) jobs to run on an Amazon VPC cluster, using AWS Data Pipeline for automation.
- Worked with Elastic MapReduce (EMR) on Amazon Web Services (AWS).
- Good understanding of Teradata MPP architecture, including partitioning and primary indexes
- Good knowledge of Teradata Unity, Teradata Data Mover, OS PDE kernel internals, and backup and recovery
- Created HBase tables to store data in variable formats coming from different portfolios.
- Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing
- Created partitions and buckets based on state for further processing using bucket-based Hive joins.
- Involved in transferring data from mainframe tables to HDFS and HBase tables using Sqoop
- Created Hive tables and worked with them using HiveQL.
- Created and truncated HBase tables in Hue and took backups of submitter ID
- Developed a data pipeline using Kafka to store data in HDFS.
- Used the Spark API over Hadoop YARN as the execution engine for data analytics using Hive.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
- Involved in review of functional and non-functional requirements.
- Developed ETL processes using Hive and HBase.
- Worked as an ETL Architect/ETL Technical Lead and provided the ETL framework Solution for the Delta process, Hierarchy Build and XML generation.
- Prepared the Technical Specification document for the ETL job development.
- Responsible for managing data coming from different sources.
- Loaded CDRs into the Hadoop cluster from relational databases using Sqoop and from other sources using Flume.
- Experience in processing large volumes of data and in parallel process execution using Talend functionality.
- Installed and configured Apache Hadoop, Hive and Pig environment.
Environment: Hadoop, HDFS, Pig, Hive, Flume, Sqoop, Oozie, Python, Shell Scripting, SQL, Talend, Spark, HBase, Elasticsearch, Linux (Ubuntu), Kafka.
Confidential, Irvine, TX
- Imported data from various sources and performed transformations using MapReduce and Hive to load data into HDFS
- Responsible for building scalable distributed data solutions using Hadoop.
- Wrote various Hive and Pig scripts
- Created HBase tables to store variable data formats coming from different portfolios
- Performed real-time analytics on HBase using the Java API and REST API.
- Deployed Hadoop clusters in standalone, pseudo-distributed, and fully distributed modes
- Implemented NameNode backup using NFS for high availability
- Integrated NoSQL databases such as HBase with MapReduce to move bulk data into HBase.
- Logical implementation and interaction with HBase
- Assisted in the creation of large HBase tables using large data sets from various portfolios
- Efficiently put and fetched data to/from HBase by writing MapReduce jobs
- Experienced with scripting languages such as Python and shell.
- Experienced in data analytics, web scraping, and data extraction in Python
- Developed various Python scripts to find vulnerabilities with SQL Queries by doing SQL injection, permission checks and performance analysis.
- Designed and implemented database cloning using Python and built backend support for applications using shell scripts
- Worked on the Ansible Python API for controlling nodes and extending various Python events
- Implemented frameworks using Java and Python to automate the ingestion flow.
- Implemented Hive UDFs to validate data against business rules before it is moved into Hive tables.
- Experienced in joining different data sets using Pig join operations and running queries with Pig scripts.
- Experienced with Pig Latin operations and writing Pig UDFs to perform analytics.
- Managed cluster coordination services through Zookeeper
- Implemented Unix shell scripts to perform cluster admin operations.
- Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports
- Worked on the Ingestion of Files into HDFS from remote systems using MFT (Managed File Transfer).
- Experience in monitoring and managing Cassandra cluster.
- Analyzed weblog data using HiveQL and integrated Oozie with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as MapReduce, Pig, Hive, and Sqoop) as well as system-specific jobs (such as Java programs and shell scripts).
- Wrote Pig scripts for sorting, joining, and grouping data
- Experienced in working with Avro data files using the Avro serialization system.
- Solved the small-files problem using SequenceFile processing in MapReduce.
- Developed Shell scripts to automate routine DBA tasks
Environment: HDFS, MapReduce, Pig, Hive, Sqoop, Flume, HBase, Java, Maven, Avro, Cloudera, Eclipse and Shell Scripting
- Designed and implemented the training and reports modules of the application using Servlets, JSP, and AJAX
- Interacted with business users and developed custom reports based on the defined criteria.
- Gathered requirements and collected information, then analyzed it to prepare a detailed work plan and task breakdown structure
- Developed custom JSP tags for the application
- Involved in the phases of SDLC (Software Development Life Cycle) including Requirement collection, Design and analysis of Customer specification, Development and Customization of the application
- Used Quartz schedulers to run jobs sequentially within the given time
- Strong hands-on experience using Teradata utilities (FastExport, MultiLoad, FastLoad, TPump, BTEQ and Queryman).
- Familiar with creating secondary indexes and join indexes in Teradata.
- Implemented physical modeling on Teradata, including creation of tables, indexes, and views, and normalization.
- Implemented the reports module using JasperReports for business intelligence
- Deployed the business application on a Tomcat server at the client location
- Performed end-to-end system development, with unit integration and system integration testing
- Coordinated with onshore and offshore teams of 10+ members
- Responsible for effort estimation and timely production deliveries
- Created and executed half-yearly and yearly load jobs that update new rates, discounts, etc. for claim calculations in the database and files
- Extensively used Java multi-threading to implement batch jobs with JDK 1.5 features
- Configured the project on WebLogic 10.3 application servers
- Implemented the online application using Core Java, JDBC, JSP, Servlets, Spring, Hibernate, Web Services, SOAP, and WSDL
- Involved in the analysis, design, implementation, and testing of the project.
- Implemented the Singleton, Factory, and DAO design patterns based on the application requirements
- Used SAX and DOM parsers to parse the raw XML documents
- Used RAD as Development IDE for web applications
- Developed mappings in Informatica to load data, including facts and dimensions, from various sources into the data warehouse, using transformations such as Source Qualifier, Java, Expression, Lookup, Aggregate, Update Strategy, and Joiner.
- Scheduled sessions to extract, transform, and load data into the warehouse database based on business requirements.
- Extensively used Java multi-threading to implement batch jobs with JDK 1.5 features
- Deployed applications on JBoss 4.0 server
- Extensively configured the build files for different environments
- Developed Session Beans that encapsulate the workflow logic
- Used Entity Beans to persist data to the database
- Used JMS to establish message communication
- Responsible for the performance of PL/SQL procedures and SQL queries
- Used CVS for concurrent development within the team and as the code repository
- Developed web components using JSP, Servlets and JDBC.
Environment: Java, J2EE, Servlets, Struts 1.1, Spring, JSP, JMS, JBoss 4.0, SQL Server 2000, Ant, CVS, PL/SQL, Hibernate, Eclipse, Linux