- A dynamic professional with 9+ years of diversified experience in Information Technology, with an emphasis on the Big Data/Hadoop ecosystem, SQL/NoSQL databases, and Java/J2EE technologies and tools, using industry-accepted methodologies and procedures.
- Hadoop Development: Extensively worked with Hadoop ecosystem tools including Pig, Hive, Oozie, Sqoop, Spark (DataFrames, Spark Streaming), HBase, and MapReduce programming. Applied partitioning and bucketing concepts in Hive and designed both managed and external tables in Hive to optimize performance.
- Experience in installing, configuring, supporting, and managing Hadoop clusters using Apache and Cloudera (CDH 5.x) distributions and on Amazon Web Services (AWS).
- Experience with Amazon AWS services such as EMR, EC2, S3, CloudFormation, and Redshift for fast and efficient processing of Big Data.
- Developed Spark applications using Scala for easy Hadoop transitions. Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive. Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
- Hadoop Distributions: Worked with Apache Hadoop as well as the enterprise distributions from Cloudera and Hortonworks, with good knowledge of the MapR distribution.
- Data Ingestion into Hadoop (HDFS): Ingested data into Hadoop from various data sources such as Oracle and MySQL using Sqoop. Created Sqoop jobs with incremental load to populate Hive external tables. Involved in importing real-time data into Hadoop using Kafka, and also worked on Flume. Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team.
- File Formats: Involved in running Hadoop streaming jobs to process terabytes of text data and worked with different file formats such as Text, Sequence files, Avro, ORC and Parquet.
- Scripting and Reporting: Created scripts for performing data analysis with Pig, Hive, and Impala, and used Ant scripts for creating and deploying .jar, .ear, and .war files. Generated reports, extracts, and statistics on the distributed data on the Hadoop cluster. Built Java APIs for retrieval and analysis on NoSQL databases such as HBase and Cassandra.
- In-depth understanding of Hadoop architecture and its components, such as HDFS, MapReduce, Hadoop Gen2 Federation, High Availability, and the YARN architecture, with a good understanding of workload management, scalability, and distributed platform architectures.
- Strong knowledge of and experience with the architecture and components of Spark; efficient in working with Spark Core, Spark SQL, and Spark Streaming; implemented Spark Streaming jobs by developing RDDs (Resilient Distributed Datasets) and used PySpark and spark-shell accordingly.
- Good knowledge of using Apache NiFi to automate data movement between different Hadoop systems.
- Good experience with and understanding of Python programming, data mining, and machine learning techniques; experienced in migrating ETL transformations using Pig Latin scripts, transformations, and join operations.
- Strong hands-on experience with PySpark, using Spark libraries via Python scripting for data analysis.
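The Hive partitioning and external-table work described above can be illustrated with a hedged, pure-Python sketch of the underlying idea: each partition maps to its own directory, so a filter on the partition columns skips entire directories instead of scanning the full table. Table and column names here are hypothetical, not taken from any actual project.

```python
# Simplified sketch (not Hive itself): each partition of a table is stored
# as its own directory, so a query filtered on the partition columns can
# prune whole directories instead of scanning the full table.
import os

def partition_path(table: str, year: int, month: int) -> str:
    # Mirrors Hive's on-disk layout: .../table/year=YYYY/month=MM
    return os.path.join("/warehouse", table, f"year={year}", f"month={month:02d}")

all_partitions = [partition_path("sales", y, m)
                  for y in (2022, 2023) for m in (1, 2)]

# A "WHERE year = 2023" predicate only touches the matching directories.
pruned = [p for p in all_partitions if "year=2023" in p]
```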
Big Data Technologies: HDFS, MapReduce, Hive, HBase, Scala, Spark, Pig, Sqoop, Oozie, Kafka, Cassandra, MongoDB, Ambari, Apache NiFi, Apache Flink, PySpark
Databases: MongoDB, HBase, Cassandra, Oracle 10g/11g/12c, PL/SQL, MySQL, MS SQL Server 2016/2012
Cloud: AWS S3, AWS EMR, AWS EC2, Redshift and AWS Glue
SQL Server Tools: Enterprise Manager, SQL Profiler, Query Analyzer, SQL Server 2008, SQL Server Management Studio, DTS, SSIS, SSRS, SSAS
Programming Languages: Python, Java, J2EE, PL/SQL, Pig Latin, Scala and SQL
Java/J2EE Technologies: JDBC, JNDI, JSON, JSTL, RMI, JMS, Java Script, JSP, Servlets, EJB, JSF, JQuery, AngularJS
Development Methodologies: Agile, Waterfall
ETL Tools: SSIS (SQL Server 2012 Integration Services), SQL Server 2000 DTS, Import Export Data, Talend.
IDE Tools: Eclipse, NetBeans
Modelling Tools: Rational Rose, Star UML, Visual Paradigm for UML
Architecture: Relational DBMS, Client-Server Architecture, OLAP, OLTP
Operating System: Windows 7/8/10, Vista, UNIX, Linux, Ubuntu, Mac OS X
Sr. BigData Engineer/Developer
Confidential, Charlotte NC
- Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
- Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, Pair RDDs, and Spark on YARN.
- Managed and reviewed Hadoop log files to identify issues when jobs fail, and used Hue for UI-based Pig script execution and Oozie scheduling.
- Involved in creating a data lake by extracting customer data from various data sources into HDFS, including data from Excel, databases, and server logs.
- Developed Python code to gather data from HBase and designed solutions implemented with PySpark.
- Developed PySpark code to mimic the transformations performed in the on-premise environment; analyzed the SQL scripts and designed solutions to implement them using PySpark.
- Automated workflows using shell scripts to pull data from various databases into Hadoop, and developed scripts to automate the process and generate reports.
- Created detailed AWS security groups, which behaved as virtual firewalls controlling the traffic allowed to reach one or more AWS EC2 instances.
- Designed multiple Python packages that were used within a large ETL process used to load 2TB of data from an existing Oracle database into a new PostgreSQL cluster
- Deployed and configured AWS EC2 for client websites moving off self-hosted services for scalability, and worked with multiple teams to provision AWS infrastructure for development and production environments.
- Designed the number of partitions and the replication factor for Kafka topics based on business requirements, and worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done in Python (PySpark).
- Used various Spark Transformations and Actions for cleansing the input data and involved in using the Spark application master to monitor the Spark jobs and capture the logs for the spark jobs.
- Worked with Amazon EMR to process data directly in S3, and copied data from S3 to the Hadoop Distributed File System (HDFS) on the Amazon EMR cluster, setting up Spark Core for analysis work.
- Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster processing of data, and worked on an extensible framework for building high-performance batch and interactive data processing applications on Hive.
- Involved in configuring and developing the Hadoop environment with AWS cloud services such as EC2, EMR, Redshift, CloudWatch, and Route 53.
- Extracted real-time feeds using Spark Streaming, converted them to RDDs, processed the data into DataFrames, and loaded the data into Cassandra.
- Developed various Python scripts to find vulnerabilities with SQL Queries by doing SQL injection, permission checks and performance analysis.
- Exported event weblogs to HDFS by creating an HDFS sink which directly deposits the weblogs in HDFS, and used Elasticsearch as a distributed RESTful web service, with MVC for parsing and processing XML data.
- Worked on Cloudera distribution for Hadoop ecosystem and installed and configured Flume, Hive, Pig, Sqoop and Oozie on the Hadoop cluster.
- Integrated Oozie with Map-Reduce, Pig, Hive, and Sqoop and developed Oozie workflow for scheduling and orchestrating the ETL process within the Cloudera Hadoop system.
- Solved performance issues in Hive and Pig scripts with an understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.
- Used Impala connectivity from the User Interface (UI) and query the results using ImpalaQL.
Environment: HDFS, Spark 2.3, Python 3.6, PySpark, Pig, Sqoop, Oozie, MapReduce, ETL, HBase, Hive 2.3, Hadoop 3.0, Cassandra 3.11, Kafka 1.1, Zookeeper 3.4, XML, JSON, Unix, Jenkins 2.1, Maven, Java, AWS S3, AWS EMR, AWS Glue and Impala.
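As one illustration of how Hive joins "translate to MapReduce jobs" (per the performance-tuning bullet above), here is a hedged, pure-Python sketch of a classic reduce-side join. The table names and record shapes are hypothetical, and the three phases are simulated in-process rather than distributed.

```python
# Sketch of a reduce-side MapReduce join: mappers tag each record with its
# source table, the shuffle groups records by join key, and the reducer
# emits the cross product of the two tagged groups per key.
from collections import defaultdict

def map_phase(orders, customers):
    # Tag records so the reducer can tell the two inputs apart.
    for cust_id, amount in orders:
        yield cust_id, ("order", amount)
    for cust_id, name in customers:
        yield cust_id, ("customer", name)

def shuffle(pairs):
    # The framework's shuffle: group all values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    joined = []
    for key, values in grouped.items():
        names = [v for tag, v in values if tag == "customer"]
        amounts = [v for tag, v in values if tag == "order"]
        joined.extend((key, n, a) for n in names for a in amounts)
    return joined

orders = [(1, 250), (2, 75), (1, 30)]          # hypothetical fact table
customers = [(1, "Acme"), (2, "Globex")]       # hypothetical dimension table
result = reduce_phase(shuffle(map_phase(orders, customers)))
```

The skew and grouping behavior visible here is exactly why a hot join key slows a Hive query: all of that key's records land on one reducer.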
Sr. BigData Engineer/Developer
Confidential - Houston, TX
- Analyzed and defined the researchers' strategy, determined the system architecture and requirements to achieve goals, and developed multiple Kafka producers and consumers per the software requirement specifications.
- Used Kafka for log aggregation: collecting physical log files off servers and placing them in a central location such as HDFS for processing.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS.
- Implemented Spark using Python and Spark SQL for faster processing of data, and worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done in Python (PySpark).
- Implemented Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Involved in developing the Hadoop system and improving multi-node Hadoop cluster performance; responsible for developing a data pipeline with Amazon AWS to extract data from weblogs and store it in MongoDB.
- Developed real-time data processing applications using Scala and Python, and implemented Apache Spark Streaming from various streaming sources such as Kafka, Flume, and JMS.
- Stored and loaded data from HDFS to Amazon S3, backed up the NameNode namespace data to NFS, and integrated HiveServer2 with Tableau using the Hortonworks Hive ODBC driver for auto-generation of Hive queries for non-technical business users.
- Responsible for building scalable distributed data solutions using a Hadoop cluster environment with the Hortonworks distribution, and ingested streaming data with Apache NiFi into Kafka.
- Wrote MapReduce jobs using the Java API and Pig Latin. Optimized HiveQL/Pig scripts by using execution engines like Tez and Spark.
- Involved in writing custom MapReduce programs using java API for data processing and integrated Maven build and designed workflows to automate the build and deploy process.
- Involved in developing a linear regression model, built using the Spark Scala API, to predict a continuous measurement for improving observations on wind turbine data.
- Worked extensively with Spark MLlib to develop a logistic regression model on operational data; created Hive tables per requirements as internal or external tables, defined with appropriate static/dynamic partitions and bucketing for efficiency.
- Worked in AWS environment for development and deployment of custom Hadoop applications and involved in working with Elastic Map Reduce (EMR) and setting up environments on Amazon AWS EC2 instances.
- Loaded and transformed large sets of structured and semi-structured data using Hive; extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved the data in Parquet format in HDFS.
- Used Spark and Spark-SQL to read the Parquet data and create the tables in Hive using the Scala API.
- Worked on an extensible framework for building high-performance batch and interactive data processing applications, on Pig and Hive jobs, and on a Cassandra implementation using the DataStax Java API.
- Very good understanding of Cassandra cluster mechanisms, including replication strategies, snitches, gossip, consistent hashing, and consistency levels.
- Imported data from various sources into the Cassandra cluster using Java APIs, and configured performance tuning and monitoring for Cassandra read and write processes for fast I/O operations and low latency.
- Used the Java API and Sqoop to export data from RDBMS into a DataStax Cassandra cluster, and worked on retrieving data from Cassandra clusters to run queries.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data files and involved in making code changes for a module in turbine simulation for processing across the cluster using spark-submit.
- Involved in performing analytics and visualization on the log data, estimating the error rate and studying the probability of future errors using regression models.
- Used REST web APIs to make HTTP GET, PUT, POST, and DELETE requests from the web server to perform analytics on the data lake.
- Used Kafka to build a customer activity tracking pipeline as a set of real-time publish-subscribe feeds.
- Designed and created ETL jobs through Talend to load huge volumes of data into MongoDB, the Hadoop ecosystem, and relational databases.
Environment: HDP 2.3.4, Hadoop, Hive, HDFS, HPC, WebHDFS, WebHCat, Spark, Spark SQL, Kafka, Java, Scala, web servers, Maven and SBT builds, AWS S3, AWS EMR, Apache NiFi, MongoDB, Cassandra, Python, PySpark, Talend, ETL and SQL.
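The Kafka topic-sizing work above (choosing partition counts and replication factors) rests on the partitioner's key-to-partition mapping: records with the same key always land on the same partition, so the partition count bounds consumer parallelism while keys preserve per-key ordering. A minimal sketch follows; crc32 stands in for Kafka's actual murmur2 hash, and the key names are hypothetical.

```python
# Simplified stand-in for Kafka's default partitioner: hash the record key
# and take it modulo the topic's partition count. Same key -> same
# partition, which is what preserves per-key ordering.
import zlib

NUM_PARTITIONS = 6  # hypothetical value chosen from throughput/consumer count

def assign_partition(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # Kafka uses murmur2 here; crc32 is a deterministic stand-in.
    return zlib.crc32(key) % num_partitions

# All events for one device stay on one partition, hence stay ordered.
p1 = assign_partition(b"device-42")
p2 = assign_partition(b"device-42")
```

This is also why raising the partition count later reshuffles keys: the modulus changes, so existing key-to-partition assignments no longer hold.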
Sr. BigData/Hadoop Engineer/Developer
Confidential, San Jose, CA
- Used Cloudera distribution for Hadoop ecosystem and installed and configured Flume, Hive, Pig, Sqoop and Oozie on the Hadoop cluster.
- Collected and aggregated large amounts of web log data from different sources such as web servers, mobile and network devices using Apache Flume and stored the data into HDFS for analysis.
- Used Python Pandas and NumPy modules for data analysis, data scraping, and parsing.
- Installed and configured Hadoop MapReduce, HDFS, developed multiple Map Reduce jobs in Java for data cleaning and processing.
- Successfully ran all Hadoop MapReduce programs on Amazon Elastic MapReduce framework by using Amazon S3 for Input and Output.
- Used Spark SQL to handle structured data in Hive, and was involved in creating Hive tables, loading data, writing Hive queries, and creating partitions and buckets for optimization.
- Worked extensively on Redshift database development: copying data from S3, bulk-inserting records, creating schemas, clusters, and tables, and tuning queries for better performance.
- Analyzed substantial data sets by running Hive queries and Pig scripts, created scripts for data modeling and data import/export, and was involved in deploying, managing, and developing MongoDB clusters.
- Created Partitions, Buckets based on State to further process using Bucket based Hive joins and defined the Accumulo tables and loaded data into tables for near real-time data reports.
- Created Hive external tables using the Accumulo connector and wrote Hive UDFs to sort structure fields and return complex data types.
- Used different data formats (Text and ORC) while loading data into HDFS.
- Spun up different AWS instances, including EC2-Classic and EC2-VPC, using CloudFormation templates.
- Developed Java RESTful web services to upload data from local to Amazon S3, listing S3 objects and file manipulation operations.
- Collected data using Spark Streaming from an AWS S3 bucket in near real time, and performed the necessary transformations and actions to build the data model, persisting the data in HDFS.
- Responsible for smooth error-free configuration of DWH-ETL solution and Integration with Hadoop.
- Imported data from different sources such as AWS S3 and LFS into Spark RDDs, and used HCatalog to access Hive table metadata from MapReduce or Pig code.
- Involved in creating Shell scripts to simplify the execution of all other scripts (Pig, Hive, Sqoop, Impala and MapReduce) and move the data inside and outside of HDFS.
- Created files and tuned SQL queries in Hive using Hue, and worked with Apache Solr for indexing and querying.
- Created custom Solr query components to optimize search matching, and worked with the Spark ecosystem using Spark SQL queries on different formats such as text and CSV files.
- Experienced in implementing Spark using Scala and Spark SQL for faster testing and processing of data; responsible for managing data from different sources.
- Worked with Kerberos and integrated it into the Hadoop cluster to make it stronger and more secure against unauthorized access.
- Drove the application from development to production using a Continuous Integration and Continuous Deployment (CI/CD) model with Maven and Jenkins.
- Ingested data into HBase using the HBase shell as well as the HBase client API.
- Designed the ETL process and created the high-level design document including the logical data flows, source data extraction process, the database staging, job scheduling and Error Handling.
- Designed and developed ETL Jobs using Talend Integration Suite in Talend 5.2.2.
Environment: Hadoop, Cloudera, HDFS, MapReduce, YARN, Hive, Pig, Sqoop, HBase, Apache Spark, Accumulo, Oozie Scheduler, Kerberos, AWS, Tableau, Java, Talend, Hue, HCatalog, Flume, Solr, Git, Maven.
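The State-based partitioning and bucket-based Hive joins mentioned above can be sketched as follows: rows are hashed into a fixed number of buckets, and a bucket join only pairs bucket i of one table with bucket i of the other, rather than comparing full tables. This is a hedged, pure-Python illustration; a byte sum stands in for Hive's hash function, and the table contents are hypothetical.

```python
# Simplified sketch of Hive bucketing: both tables are CLUSTERED BY the
# same column into the same number of buckets, so a join only needs to
# pair up matching bucket indices.
N_BUCKETS = 4

def bucket_of(state: str) -> int:
    return sum(state.encode()) % N_BUCKETS  # stand-in for Hive's hash

def bucketize(rows):
    buckets = [[] for _ in range(N_BUCKETS)]
    for row in rows:
        buckets[bucket_of(row[0])].append(row)  # row[0] is the State key
    return buckets

left = bucketize([("NC", "Charlotte"), ("TX", "Houston")])
right = bucketize([("NC", 10_500_000), ("TX", 29_000_000)])

# The join touches only matching bucket pairs instead of the full tables.
joined = [
    (ls, lc, rv)
    for i in range(N_BUCKETS)
    for ls, lc in left[i]
    for rs, rv in right[i]
    if ls == rs
]
```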
Sr. BigData/Hadoop Developer
Confidential - Yonkers, New York
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
- Involved in development and design of a 3 node Hadoop cluster using Apache Hadoop for POC and sample data analysis.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Created Spark Streaming code to take source files as input, and developed simple to complex MapReduce jobs using Hive.
- Used Sqoop extensively to import data from RDBMS sources into HDFS, performed transformations, cleaning, and filtering on the imported data using Hive and MapReduce, and loaded the final data into HDFS.
- Developed Pig UDFs to pre-process data for analysis, provisioned a Cloudera Director AWS instance, and added the Cloudera Manager repository to scale up the Hadoop cluster in AWS.
- Implemented encryption mechanisms using Python, and analyzed the data by performing Hive queries and running Pig scripts to understand user behavior.
- Involved in loading and transforming large sets of structured, semi structured and unstructured data from relational databases into HDFS using Sqoop imports.
- Optimized Map/Reduce jobs to use HDFS efficiently by using various compression mechanisms, and maintained cluster coordination services through ZooKeeper.
- Extensively used Pig for data cleansing; developed Pig Latin scripts to extract data from the web server output files to load into HDFS, and developed Pig UDFs to pre-process the data for analysis.
- Wrote Python migration scripts for web application and MongoDB used as a database to store all data & logs.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS and preprocessing with Pig.
Environment: Hadoop, HDFS, Pig, Hive, MapReduce, Sqoop, Java Eclipse, Python, MongoDB, Cloudera, MySQL, Oracle, ZooKeeper, Spark, Scala, AWS, SQL Server, Shell Scripting.
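The Sqoop imports described above follow the incremental-append pattern: each run pulls only the rows whose check-column value is beyond the saved checkpoint (Sqoop's `--last-value`), then advances the checkpoint. A hedged, pure-Python sketch of that logic; the row shape and values are hypothetical.

```python
# Simplified model of Sqoop incremental append: rows are (id, payload),
# and only ids greater than the saved last_value are imported each run.
def incremental_import(source_rows, last_value):
    """Return (new rows beyond the checkpoint, updated checkpoint)."""
    new_rows = [(rid, payload) for rid, payload in source_rows if rid > last_value]
    new_last = max((rid for rid, _ in new_rows), default=last_value)
    return new_rows, new_last

source = [(1, "a"), (2, "b"), (3, "c")]
batch1, ckpt = incremental_import(source, last_value=0)      # initial full load
source.append((4, "d"))                                      # new row arrives
batch2, ckpt = incremental_import(source, last_value=ckpt)   # only the new row
```

The same idea drives the "Sqoop jobs with incremental load to populate Hive external tables" bullets: the job stores `last_value` so re-runs never re-import old rows.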
Confidential - Silver Spring, MD
- Exported analyzed data to downstream systems using Sqoop-RDBMS for generating end-user reports, Business Analysis reports and payment reports.
- Worked on developing ETL processes to load data from multiple data sources to HDFS using Flume and Sqoop, perform structural modifications using Map-Reduce, analyze data using Hive and visualizing in dashboards.
- Created and worked Sqoop jobs with incremental load to populate Hive External tables for BI reporting.
- Developed scripts and batch jobs to schedule various Hadoop programs, and coordinated with the development team and AWS support on cloud-related issues.
- Processed huge datasets using MapReduce, Hive, and Pig; implemented dynamic/static partitions and bucketing concepts in Hive, and designed both managed and external tables in Hive to optimize performance.
- Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization and user report generation.
- Used Hadoop FS scripts for HDFS (Hadoop File System) data loading and manipulation, controlled Pig workflows, monitored running backend processes, reformatted Pig log files, and parsed log statistics. Extended Hive and Pig core functionality by writing custom UDFs.
- Reconciled the imported data using Python programming, and solved performance issues in Hive and Pig scripts with an understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.
- Integrated Oozie with Map-Reduce, Pig, Hive, and Sqoop and developed Oozie workflow for scheduling and orchestrating the ETL process within the Cloudera Hadoop system
Environment: Python, Java, MapReduce, ETL, Sqoop, Oozie, Cloudera, HBase, MongoDB, NoSQL, HDFS, R, Pig and Hive.
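Custom row-level logic like the Hive and Pig UDFs mentioned above can also be plugged into Hive as a streaming TRANSFORM script: Hive pipes tab-separated rows to the script's stdin and reads transformed rows back. A minimal sketch follows; the (user_id, url) column layout is a hypothetical example, not taken from the actual projects.

```python
# Sketch of a Hive streaming TRANSFORM script: one tab-separated input row
# in, one transformed row out. Here the transform keeps the user id and
# reduces the URL to its host component.
import sys
from urllib.parse import urlparse

def transform_line(line: str) -> str:
    user_id, url = line.rstrip("\n").split("\t")
    return f"{user_id}\t{urlparse(url).netloc}"

def main() -> None:
    # Hive feeds rows on stdin and consumes transformed rows from stdout.
    for line in sys.stdin:
        print(transform_line(line))
```

In HiveQL this would be invoked with `SELECT TRANSFORM (user_id, url) USING 'python script.py' ...`, which is why the per-row function must be pure text-in, text-out.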
Sr. Java Developer
BWI Group - Dayton, OH
- Responsible for backend server development and creation of the RESTful API services
- Developed multiple Map-Reduce jobs for data cleaning and preprocessing. Analyzing/ transforming data with Hive and Pig.
- Worked on development of a web portal using Python and MongoDB, and developed web applications with RDBMS and various NoSQL databases using Java/J2EE technologies, delivered successfully.
- Involved in developing JSP pages using Spring MVC and integrating Spring MVC with jQuery for validation rules.
- Involved in developing the DAO layer, which connects to the database using Hibernate and ORM mapping.
- Developed Python scripts to automate manual tasks for middleware applications, and received praise from users, shareholders, and analysts for developing a highly interactive and intuitive UI using JSP, AJAX, JSF, and jQuery techniques.
- Utilized Python to handle all hits on Django; responsibilities included designing and building the frames based on Java.
- Involved in writing Spring configuration XML files containing bean declarations and other dependent object declarations.
- Implemented Python programming techniques to read and import data to the mail server using the Django framework.
- Wrote SQL scripts and PL/SQL code for procedures; responsible for resolving change requests by performing analysis, preparing approach documents and data models, coding, and writing JUnit test cases.
Confidential - South Orange, NJ
- Gathered requirements from client, analyzing and preparing the Requirement specification document.
- Developed more than 10 web-based software systems; used JSP, Ajax, jQuery, and CSS to enhance functionality and user experience on web pages.
- Extensively used different programming languages, such as Java, PHP, and C#/.NET, and different databases, such as Oracle, SQL Server, and MySQL; also wrote SQL procedures.
- Analyzed MVC architecture, Struts framework in view of the application workflow and application development.
- Designed and developed Servlets, developed multi-threaded projects, and used connection pooling to manage concurrency.
- Used synchronized methods and synchronized variables; performed front-end development using HTML, CSS, and JSP, with client-side validations performed using JavaScript.
- Used CVS for code versioning and was involved in writing SQL stored procedures.
- Used JSF UI components to develop front-end web pages, and worked on creating and updating the Oracle 9i database.
- Developed JUnit Test cases for the system and used Hibernate for persistence management.
- Used both Windows and Linux platforms for developing the application and designed the system based on Struts MVC architecture.
- Developed Servlets, JSP, JS, CSS and XHTML facelets front end layer and used transaction attributes in EJB to handle the transactions by the container.
- Used JavaBeans for developing lightweight business components, developed the user interface using JSP/HTML, and used CSS for styling the web pages.
- Designed XML schema for the system and designed and developed the documentation for the system.
- Used Eclipse in developing J2EE applications and created UML diagrams, forms and services.