- 8 years of IT operations experience, including 4+ years in Hadoop administration and 2+ years in software development.
- Excellent understanding of Distributed Systems and Parallel Processing architecture.
- Proven organizational, time management and multitasking skills, with the ability to work independently, learn new technology quickly and adapt to new environments.
- Developed various reusable objects used across the organization to leverage application logic.
- Created automations using Blue Prism to improve user's ability creation time.
- Proficient in Python, with experience building and productionizing end-to-end systems.
- Knowledge of information extraction and NLP algorithms coupled with deep learning.
- Experience with file systems, server architectures, databases, SQL, and data movement (ETL).
- Experience with Supervised or Unsupervised Machine Learning algorithms.
- Hands-on experience working with Hadoop ecosystem components including Hive, HBase and Spark.
- Excellent understanding of Hadoop architecture and its components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN and Spark, and of the MapReduce programming paradigm.
- Experience in importing and exporting data using Sqoop from Relational Database to HDFS and from HDFS to Relational Database.
- Worked on Oozie for workflow management, with separate workflows for each layer, such as the Staging, Transformation and Archive layers.
- Worked on MapReduce programs for parallel processing of data and for custom input formats.
- Worked on Pig for ETL Transformations and optimized Hive Queries.
- Used a backlog management tool (JIRA) to track progress on tasks and stories and to post updates in a timely manner.
- Worked effectively with teammates and our business partners to ensure all viable test scenarios are identified and documented in the TSM (Test Scenario Matrix).
- Expertise in Control Room Resource, Session Management, Blue Prism Calculation functions and debugging of Process Solutions.
- Working knowledge of Blue Prism concepts such as debugging a workflow, managing packages, and creating VBOs, arguments and control flows.
- Hands-on experience creating RDDs and DataFrames for the required input data and performing data transformations using Spark and Scala.
- Familiar with data architecture including data ingestion pipeline design, Hadoop information architecture, data modeling and data mining, machine learning and advanced data processing. Experience optimizing ETL workflows.
- Loaded data from different sources (Teradata and Oracle) into HDFS using Sqoop and then into partitioned Hive tables.
- Good knowledge of NoSQL databases such as Cassandra and MongoDB.
- Worked in Agile environment. Participated in scrum meetings/standups.
Confidential, New York, New York
- Developed the architecture document and guidelines.
- Responsible for installation and configuration of Hive, Pig, HBase and Sqoop on the Hadoop cluster and created Hive tables to store the processed results in a tabular format.
- Configured Spark Streaming to receive real-time data from Apache Kafka and store the stream data in HDFS using Scala.
- Developed Sqoop scripts to move data between Hive and the Vertica database.
- Processed data into HDFS by developing solutions and analyzed the data using MapReduce, Pig and Hive to deliver summary results from Hadoop to downstream systems.
- Built servers using AWS: importing volumes, launching EC2 instances, and creating security groups, auto-scaling, load balancers, Route 53, SES and SNS in the defined virtual private cloud (VPC).
- Wrote MapReduce code to process and parse data from various sources and store the parsed data in HBase and Hive using HBase-Hive integration.
- Streamed an AWS log group into a Lambda function to create ServiceNow incidents.
- Involved in loading and transforming large sets of Structured, Semi-Structured and Unstructured data and analyzed them by running Hive queries and Pig scripts.
- Created Managed tables and External tables in Hive and loaded data from HDFS.
- Developed Spark code using Scala and Spark SQL for faster processing and testing, and ran complex HiveQL queries on Hive tables.
- Scheduled several time-based Oozie workflows by developing Python scripts.
- Developed Pig Latin scripts using operators such as LOAD, STORE, DUMP, FILTER, DISTINCT, FOREACH, GENERATE, GROUP, COGROUP, ORDER, LIMIT, UNION, SPLIT to extract data from data files to load into HDFS.
- Exported data using Sqoop to RDBMS servers and processed that data for ETL operations.
- Worked with S3 buckets on AWS to store CloudFormation templates and created EC2 instances on AWS.
- Designed an ETL data pipeline flow to ingest data from RDBMS sources into Hadoop using shell scripts, Sqoop, package and MySQL.
- Optimized the Hive tables using techniques such as partitioning and bucketing.
- Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig and Sqoop.
- Implemented Hadoop on AWS EC2 using a few instances to gather and analyze log files.
- Worked with Spark and Spark Streaming, creating RDDs and applying transformations and actions.
- Created partitioned tables and loaded data using both static and dynamic partitioning.
- Developed custom Apache Spark programs in Scala to analyze and transform unstructured data.
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS and extracted data from Oracle into HDFS using Sqoop.
- Used Kafka for publish-subscribe messaging as a distributed commit log, gaining experience with its speed, scalability and durability.
- Followed a test-driven development (TDD) process, with extensive experience in Agile and Scrum methodologies.
- Implemented a POC to migrate MapReduce jobs into Spark RDD transformations using Scala.
- Scheduled MapReduce jobs in the production environment using the Oozie scheduler.
- Involved in cluster maintenance, monitoring and troubleshooting; managed and reviewed data backups and log files.
- Designed and implemented MapReduce jobs to support distributed processing using Java, Hive and Apache Pig.
- Analyzed the Hadoop cluster and different big data analytics tools including Pig, Hive, HBase and Sqoop.
- Improved performance by tuning Hive and MapReduce.
- Researched, evaluated and used modern technologies, tools and frameworks around the Hadoop ecosystem.
Environment: HDFS, MapReduce, Hive, Sqoop, Pig, Flume, Vertica, Oozie Scheduler, Java, Shell Scripts, Teradata, Oracle, HBase, MongoDB, Cassandra, Cloudera, AWS, Kafka, Spark, Scala, ETL, Python
Confidential, New York, New York
- Analyzed the Hadoop cluster and different big data analytics tools including Pig, Hive and Sqoop.
- Created a POC on Hortonworks and recommended best practices for the HDP and HDF platforms.
- Responsible for building scalable distributed data solutions using Hadoop.
- Loaded data into Spark RDDs and performed in-memory computation for faster response times.
- Developed Spark jobs and Hive jobs to transform data.
- Developed Spark scripts by writing custom RDDs in Python for data transformations and performed actions on RDDs.
- Worked on the Oozie workflow engine for job scheduling.
- Imported and exported data into MapReduce and Hive using Sqoop.
- Developed Sqoop scripts to import and export data from relational sources and handled incremental loading of the data by date.
- Developed Kafka consumer component for Real-Time data processing in Java and Scala.
- Used Impala to query Hive tables for faster query response times.
- Imported real-time data into Hadoop using Kafka and implemented an Oozie job for daily imports.
- Created Partitioned and Bucketed Hive tables in Parquet and Avro File Formats with Snappy compression and then loaded data.
- Wrote Hive queries using Spark SQL, which integrates with the Spark environment.
- Developed MapReduce programs to parse the raw JSON data and store the refined data in tables
- Used Kafka to load data into HDFS and move data into HBase.
- Captured the data logs from web server into HDFS using Flume for analysis.
- Moved data pipelines from a CDH cluster to run on AWS EMR.
- Involved in moving data from HDFS to AWS Simple Storage Service (S3) and extensively worked with S3 bucket in AWS.
- Developed a Spark application to filter JSON source data in an AWS S3 location and store it in HDFS with partitions, and used Spark to extract the schema of the JSON files.
- Responsible for migrating the code base from the Cloudera platform to Amazon EMR and evaluated Amazon ecosystem components such as Redshift.
- Developed Spark code using Scala and Spark SQL for faster testing and data processing.
- Involved in developing a Spark Streaming application for one of the data sources, applying transformations using Scala and Spark.
- Imported data from different sources such as HDFS and MySQL into Spark.
- Experienced with SparkContext, Spark SQL, DataFrames, pair RDDs and Spark on YARN.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
Environment: Linux, Hadoop, Python, Scala, CDH, SQL, Sqoop, HBase, Hive, Spark, Oozie, Cloudera Manager, Oracle, Windows, YARN, Spring, Sentry, AWS, S3
Confidential, San Jose, California
- Launched Amazon EC2 instances (Linux/Ubuntu/RHEL) using Amazon Web Services and configured the launched instances for specific applications.
- Implemented Spark Streaming and micro-batch processing in Scala.
- Used Hive scripts in Spark for data cleaning and transformation.
- Imported data from various sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
- Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team.
- Created a data pipeline process for structuring, processing and transforming data using Kafka and Scala.
- Created Kafka and Spark Streaming data pipelines to consume data from external sources and perform transformations in Scala.
- Contributed to developing a data pipeline to load data from different sources such as web, RDBMS and NoSQL into an Apache Kafka or Spark cluster.
- Extensively used Pig for data cleansing and created partitioned tables in Hive.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS.
- Involved in file movements between HDFS and AWS S3 and extensively worked with S3 bucket in AWS.
- Created custom Python/shell scripts to import data via Sqoop from Oracle databases.
- Monitored and troubleshot Hadoop jobs using the YARN ResourceManager, and EMR job logs using Genie and Kibana.
- Streamed data in real time using Spark with Kafka for faster processing.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS using Scala.
- Involved in converting Cassandra/Hive/SQL queries into Spark transformations using Spark RDDs in Scala.
- Channeled log data collected from the web servers into HDFS using Flume and Spark Streaming.
- Developed Spark jobs using Scala in test environment for faster data processing and used Spark SQL for querying.
- Implemented custom Kafka encoders for a custom input format to load data into Kafka partitions.
- Developed a Spark Streaming script that consumes topics from Kafka, the distributed messaging source, and periodically pushes batches of data to Spark for real-time processing.
- Designed efficient Spark code for loading and transforming data using Python and Spark SQL, which can be forward-engineered by our code generation developers.
- Worked with large sets of structured, semi-structured and unstructured data.
- Created big data workflows to ingest data from various sources into Hadoop using Oozie; these workflows comprise heterogeneous jobs such as Hive, Sqoop and Python scripts.
- Used GitHub for project version management.
Environment: Cloudera, MapReduce, Spark SQL, Spark Streaming, Pig, Hive, Flume, Oozie, Java, Kafka, Eclipse, Zookeeper, Cassandra, HBase, Talend, GitHub.
Confidential, New York, New York
Job Title: SDET
- Responsible for setup, implementation and ongoing administration of Hadoop infrastructure.
- Analyzed technical and functional requirements documents, and designed and developed QA test plans, test cases and test scenarios while maintaining the end-to-end (E2E) process flow.
- Developed test scripts for an internal brokerage application used by branch and financial market representatives to recommend and manage customer portfolios, including international and capital markets.
- Designed and developed smoke and regression automation scripts and automated the functional testing framework for all modules using Selenium and WebDriver.
- Created data-driven scripts for adding multiple customers, checking online accounts, user interface validations and report validations.
- Performed cross-verification of trade entry between the mainframe system, its web application and the downstream system.
- Extensively used Selenium WebDriver API (XPath and CSS locators) to test the web application.
- Configured Selenium WebDriver, TestNG, Maven, Cucumber and the BDD framework, and created Selenium automation scripts in Java using TestNG.
- Performed data-driven testing by developing a Java-based library to read test data from Excel and properties files.
- Performed extensive DB2 database testing to validate trade entry from the mainframe to the backend system.
- Developed a data-driven framework with Java, Selenium WebDriver and Apache POI, used for multiple trade order entry.
- Developed an internal application using AngularJS and Node.js, connecting to Oracle on the backend.
- Expertise in debugging issues in the front end of a web-based application developed using HTML5, CSS3, AngularJS, Node.js and Java.
- Developed a smoke automation test suite for the regression test suite.
- Applied various testing techniques in test cases to cover all business scenarios for quality coverage.
- Interacted with the development team to understand the design flow, review code and discuss the unit test plan.
- Executed system, integration and regression tests in the testing environment.
- Conducted defect triage meetings and defect root cause analysis, tracked defects in HP ALM Quality Center, managed defects by following up on open items, and retested defects with regression testing.
- Provided QA/UAT sign-off after closely reviewing all test cases in Quality Center, along with receiving the policy sign-off for the project.
Environment: HP ALM, Selenium WebDriver, JUnit, Cucumber, AngularJS, Node.js, Jenkins, GitHub, Windows, UNIX, Agile, MS SQL, IBM DB2, PuTTY, WinSCP, FTP Server, Notepad++, C#, DB Visualizer.