- Expertise in all components of the Hadoop/Spark ecosystem: Spark, Hive, Pig, Flume, Sqoop, HBase, Kafka, Oozie, Impala, StreamSets, Apache NiFi, Hue, and AWS.
- 3+ years of programming experience in Scala and Python.
- Extensive knowledge of data serialization formats such as Avro, SequenceFile, Parquet, JSON, and ORC.
- Knowledge of job workflow scheduling and monitoring tools such as Oozie and ZooKeeper.
- In-depth knowledge of Spark architecture and real-time streaming with Spark.
- Hands-on experience with Spark Core, Spark SQL, and the DataFrame/Dataset/RDD APIs.
- Good knowledge of Amazon Web Services (AWS) cloud services such as EC2, S3, EMR, and VPC.
- Experienced in data ingestion, data processing, data integration, data aggregation, and visualization in a Spark environment.
- Hands-on experience working with large volumes of structured and unstructured data.
- Experienced in migrating code components from SVN repositories to Bitbucket repositories.
- Experienced in building Jenkins pipelines for continuous integration of code from GitHub onto Linux machines.
- Experience in object-oriented analysis and design (OOAD) and development.
- Good understanding of end-to-end web applications and design patterns.
- Hands-on experience in application development using Java, RDBMS, and Linux shell scripting.
- Well versed in software development methodologies, including Agile and Waterfall, with experience implementing projects using Agile.
- Experienced in handling databases: Netezza, Oracle and Teradata.
- Strong team player with good communication, analytical, presentation, and interpersonal skills.
Big Data Technologies: HDFS, MapReduce, Pig, Hive, Sqoop, Oozie, ZooKeeper, Scala, Spark, Kafka, Flume, Ambari, Hue
Hadoop Frameworks: Cloudera CDH, Hortonworks HDP, MapR
Database: Oracle 10g/11g, PL/SQL, MySQL, MS SQL Server 2012, DB2
Language: C, C++, Java, Scala, Python
AWS Components: IAM, S3, EMR, EC2, Lambda, Route 53, CloudWatch, SNS
Methodologies: Agile, Waterfall
Build Tools: Maven, Gradle, Jenkins
NoSQL Databases: HBase, Cassandra, MongoDB, DynamoDB
IDE Tools: Eclipse, NetBeans, IntelliJ
Modelling Tools: Rational Rose, StarUML, Visual Paradigm for UML
Other Tools: Tableau, Datameer, AutoSys
Operating System: Windows 7/8/10, Vista, UNIX, Linux, Ubuntu, Mac OS X
Confidential - Huntington Beach, CA
- Designed and developed data migration from legacy systems to the Hadoop environment.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data, using Sqoop to move data between HDFS and relational database systems in both directions.
- Used CloudFormation templates and CI/CD tools such as Concourse to automate the data pipeline.
- Hands-on experience with container technologies such as Docker; embedded containers in existing CI/CD pipelines.
- Monitored and troubleshot Cloudera cluster services including HDFS, MapReduce, YARN, Hive, Sqoop, Oozie, Sentry, and ZooKeeper.
- Worked with the Datameer vendor and maintained a healthy relationship, working on both product strategy and issue resolution.
- Worked on Datameer license strategy and a Hadoop sizing exercise to plan capacity for running Datameer jobs using YARN pools.
- Developed high-level designs to meet customer requirements across DC and hybrid cloud environments.
- Used Storm and Kafka services to push data to HBase and Hive tables.
- Installed Kafka cluster with separate nodes for brokers.
- Integrated Kafka with Flume in a sandbox environment using the Kafka source and Kafka sink.
- Set up an independent testing lifecycle for CI/CD scripts with Vagrant and VirtualBox.
- Assisted in data analysis, star schema data modeling, and design specific to data warehousing and business intelligence environments.
- Worked with NoSQL databases including HBase, MongoDB, and Cassandra; wrote a Storm topology to accept events from a Kafka producer and emit them into Cassandra.
- Experienced with Hadoop ecosystem tools including Pig, Hive, HDFS, Sqoop, Spark, YARN, Oozie, and ZooKeeper.
- Transferred data to and from Mainframe DB2, VSAM files, MS SQL Server, Azure Data Lake, and Blob storage using Sqoop, creating files and storing them in HDFS.
- Created new Hive tables and loaded and unloaded data from the data lake or flat files.
- Developed on Linux systems and wrote UNIX shell scripts per business requirements.
- Created shell scripts to automate jobs.
- Performed in-memory data processing and real-time streaming analytics using Apache Spark with Scala and Java.
Environment: Hadoop, Hive, Datameer, AutoSys, COTS, Pig, Sqoop, Kafka, Azure Data Lake, Azure Data Factory, Azure Databricks, Spark, Oozie, Python, Hybrid Cloud, Scala, UNIX, Shell scripting, ZooKeeper, Oracle PL/SQL, RDBMS, AWS, Oracle GoldenGate, Kyvos, Tableau/Qlik.
Confidential - Boston, MA
- Worked on the Hortonworks HDP 2.5 distribution.
- Involved in review of functional and non-functional requirements.
- Responsible for designing and implementing the data pipeline using big data tools including Hive, Spark, Scala, and StreamSets.
- Proficient in hybrid cloud-based file hosting services.
- Used Azure Logic Apps for email notification of ETL data-driven workflows.
- Developed, deployed, and scheduled cubes in Azure Analysis Services (cube development using Visual Studio 2017).
- Performed streaming data analysis using Kafka and Spark Streaming.
- Built a real-time event processing system based on Kafka and Spark Streaming to handle real-time trading information.
- Designed a Hive repository with external tables, internal tables, buckets, partitions, ACID properties, UDFs, and ORC compression for incremental loads of parsed data for analytical and operational dashboards.
- Experience using Apache Storm, Spark Streaming, Apache Spark, Apache NiFi, ZooKeeper, Kafka, and Flume to create data streaming solutions.
- Developed and implemented Apache NiFi across various environments; wrote QA scripts in Python for tracking files.
- Involved in importing data from Microsoft SQL Server, MySQL, and Teradata into HDFS using Sqoop.
- Good knowledge of using Apache NiFi to automate data movement.
- Developed Sqoop scripts to import data from relational sources and handled incremental loading.
- Extensively used StreamSets Data Collector to create ETL pipelines for pulling data from RDBMS systems into HDFS.
- Implemented the data processing framework using Scala and Spark SQL.
- Implemented performance optimization methods to reduce data processing time.
- Created shell scripts to automate jobs.
- Worked extensively with DataFrames and Datasets using Spark and Spark SQL.
- Defined the data flow within the Hadoop ecosystem, directed the team in implementing it, and exported result sets from Hive to MySQL using shell scripts.
- Worked on Kafka streaming using StreamSets to continuously integrate data from Oracle systems into Hive tables.
- Developed a generic utility in Spark for pulling data from RDBMS systems using multiple parallel connections.
- Integrated existing HiveQL logic into the Spark application for data processing.
- Extensively used Hive/Spark optimization techniques such as partitioning, bucketing, map joins, parallel execution, broadcast joins, and repartitioning.
Environment: Spark, Python, Scala, Hybrid Cloud, COTS, Hive, Azure Data Lake, Azure Data Factory, Azure Databricks, Hue, UNIX Scripting, Spark SQL, StreamSets, ZooKeeper, Kafka, Impala, Beeline, Git, Tidal.
Confidential, Oak Brook, IL
- Worked on Cloudera CDH distribution.
- Designed and implemented historical and incremental data ingestion from multiple external systems using Hive, Pig, and Sqoop.
- Developed and implemented data orchestration using Azure Data Factory with various input and output data sources.
- Performed ETL processing using Datameer.
- Gave a demo to business users on using Datameer for analytics.
- Analyzed data using Datameer.
- Designed physical data models for structured and semi-structured data to validate raw data loaded into HDFS.
- Designed MapReduce logic and Hive queries for generating aggregated metrics.
- Involved in the design, implementation, development, and testing phases of the project.
- Monitored jobs in the production cluster and traced error logs when jobs failed.
- Designed and developed data migration logic for exporting data from MySQL to Hive.
- Designed and developed complex Oozie workflows for recurring job execution.
- Used SSRS reporting tool for the generation of data analysis reports.
Environment: Hadoop, Datameer, Confluent Kafka, ZooKeeper, Hortonworks HDF, HDP, Azure Data Lake, Azure Data Factory, Azure Databricks, NiFi, Linux, Splunk, YARN, Cloudera 5.13, Spark, Tableau.
Confidential, San Antonio, TX
- Worked on Cloudera CDH distribution.
- Hands-on experience with cloud services such as Amazon Web Services (AWS).
- Created data pipelines for different events to load data from DynamoDB into an AWS S3 bucket and then into HDFS.
- Involved in complete SDLC - Requirement Analysis, Development, Testing and Deployment into Cluster.
- Worked hand-in-hand with the Architect; enhanced and optimized product Spark code to aggregate, group and run data mining tasks using Spark framework.
- Extracted data from various SQL database sources into HDFS using Sqoop and ran Hive scripts on large volumes of data.
- Implemented a prototype covering the complete requirements using Splunk, Python, and machine learning concepts.
- Designed and implemented MapReduce logic for natural language processing of free-form text.
- Deployed the project on Amazon EMR with S3 Connectivity.
- Used Amazon EMR to process big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Loaded the data into Simple Storage Service (S3) in the AWS Cloud.
- Good knowledge of using Amazon load balancers and Auto Scaling with EC2 servers.
- Implemented Spark scripts to migrate MapReduce jobs to Spark RDD transformations and to stream data using Apache Kafka.
- Implemented Spark SQL queries that intermix Hive queries with the programmatic data manipulations supported by RDDs and DataFrames in Scala and Python.
- Involved in deploying code logic and UDFs across the cluster.
- Communicated deliverable status to users, stakeholders, and the client, and drove periodic review meetings.
- Processed data in HDFS using Hive queries and wrote shell scripts to wrap the HQL scripts.
- Developed and Deployed Oozie Workflows for recurring operations on Clusters.
- Experienced in performance tuning of Hadoop jobs, including setting the right batch interval, choosing the correct level of parallelism, and tuning memory.
- Worked extensively with Sqoop for importing metadata from Oracle.
- Used Tableau reporting tool to generate reports from the outputs stored in HDFS.
Environment: Hadoop, Spark, HDFS, Hive, MapReduce, Sqoop, Oozie, Tableau.
- Responsible for building scalable distributed data solutions using Hadoop.
- Installed Hadoop and configured multiple nodes using the Cloudera platform.
- Installed and configured Hortonworks HDP 2.2 using Ambari and manually through the command line. Performed cluster maintenance, including adding and removing nodes, using Ambari, Cloudera Manager Enterprise, and other tools.
- Handling the installation and configuration of a Hadoop cluster.
- Building and maintaining scalable data pipelines using the Hadoop ecosystem and other open source components like Hive and HBase.
- Involved in developer activities, including installing and configuring Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Importing and exporting data into HDFS and Hive using Sqoop.
- Involved in cluster-level security: perimeter security (authentication: Cloudera Manager, Active Directory, and Kerberos), access (authorization and permissions: Sentry), visibility (audit and lineage: Navigator), and data (encryption at rest).
- Handling the data exchange between HDFS and different web sources using Flume and Sqoop.
- Monitoring data streaming between web sources and HDFS, and its functioning, through monitoring tools.
- Close monitoring and analysis of the MapReduce job executions on cluster at task level.
- Provided input to development on efficient use of resources such as memory and CPU, based on the running statistics of map and reduce tasks.
- Installed the OS and administered the Hadoop stack with the CDH5 (with YARN) Cloudera distribution, including configuration management, monitoring, debugging, and performance tuning.
- Scripted Hadoop package installation and configuration to support fully automated deployments.
- Day-to-day operational support of our Cloudera Hadoop clusters in lab and production, at multi-petabyte scale.
- Changed cluster configuration properties based on the volume of data being processed by the cluster.
- Created a Spark cluster in HDInsight by provisioning Azure compute resources with Spark installed and configured.
- Set up automated processes to analyze system and Hadoop log files for predefined errors and send alerts to the appropriate groups.
- Excellent working knowledge of SQL with databases.
- Commissioned and decommissioned DataNodes in the cluster in case of problems.
- Set up automated processes to archive or clean unwanted data on the cluster, in particular on the NameNode and Secondary NameNode.
- Set up and managed HA NameNode to avoid single points of failure in large clusters.
- Discussions with other technical teams on regular basis regarding upgrades, process changes, any special processing and feedback.
- Analyzed system failures, identified root causes, and recommended courses of action. Documented system processes and procedures for future reference.
- Worked with systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters.
- Administered and maintained Cloudera Hadoop clusters; provisioned, patched, and maintained physical Linux systems.
Environment: Hadoop, Confluent Kafka, Hortonworks HDF, HDP, NIFI, Linux, Splunk, Yarn, Cloudera 5.13, Spark, Tableau.
- Involved in complete SDLC - Requirement Analysis, Development, Testing and Deployments.
- Involved in resolving critical errors.
- Responsible for successfully deploying sprint deliverables.
- Captured the client's requirements and enhancements for the application, documented them, and communicated them to the associated teams.
- Designed and implemented RESTful services and WSDL in Vordel.
- Implemented complex SQL queries to produce analysis reports.
- Created desktop applications using J2EE and Swing.
- Involved in developing applications using Java, JSP, Servlets, and Swing.
- Developed the UI using HTML, CSS, Ajax, and jQuery, and developed business logic and interfacing components using Business Objects, XML, and JDBC.
- Created applications and connection pools, and deployed JSPs and Servlets.
- Used Oracle and MySQL databases for storing user information.
- Developed the back end for web applications using PHP.
- Experienced with the Agile Methodologies.
Environment: SOAP, REST, HTML, WSDL, Vordel, SQL Developer