- Over 8 years of IT experience in software development, Big Data technologies, and analytical solutions, with 2+ years of hands-on experience in the design and development of Java applications and related frameworks and full-stack web development, and 2+ years' experience in design, architecture, and data modeling as a database developer.
- Over 4 years' experience as a Hadoop developer with good knowledge of the Hadoop framework, the Hadoop Distributed File System, and parallel processing implementation, and of Hadoop ecosystem components: HDFS, MapReduce, Hive, Pig, Python, HBase, Sqoop, Hue, Oozie, Impala, and Spark.
- Built and deployed industrial-scale data lakes on premises and on cloud platforms.
- Excellent understanding of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
- Experienced in handling different file formats: text files, Avro data files, sequence files, XML, and JSON.
- Extensively worked on Spark Core, numeric RDDs, pair RDDs, DataFrames, and caching while developing Spark applications.
- Expertise in deploying Hadoop, YARN, and Spark, and in integrating Spark with Cassandra.
- Experience and expertise in ETL, data analysis, and designing data warehouse strategies.
- Good knowledge of using Apache NiFi to automate data movement between different Hadoop systems.
- Upgraded Hadoop CDH to 5.x and Hortonworks; installed, upgraded, and maintained Cloudera Hadoop-based software, Cloudera clusters, and Cloudera Navigator.
- Industrial experience creating applications in Python, Java, Scala, and JavaScript (AngularJS, Node.js), and with SQL Server 2017.
- Extensive experience writing custom MapReduce programs for data processing and Java UDFs for both Hive and Pig. Extensively worked on the MRv1 and MRv2 Hadoop architectures.
- Strong experience analyzing large data sets by writing PySpark scripts and Hive queries.
- Experience with Python packages such as xlrd, NumPy, pandas, SciPy, and scikit-learn, and with IDEs including PyCharm, Spyder, Anaconda, Jupyter, and IPython.
- Extensive experience working with structured data using HiveQL and join operations, writing custom UDFs, and optimizing Hive queries.
- Extensive experience working with semi-structured and unstructured data by implementing complex MapReduce programs using design patterns.
- Experience importing and exporting data between relational databases and HDFS using Sqoop.
- Experience with Apache Flume for collecting, aggregating, and moving large volumes of data from sources such as web servers and telnet sources.
- Experience with the Oozie workflow engine, running workflow jobs with actions that launch Hadoop MapReduce, Hive, and Spark jobs.
- Involved in moving all log files generated by various sources into HDFS and Spark for further processing.
- Excellent understanding of NoSQL databases such as MongoDB, HBase, and Cassandra.
- Experience implementing the Kerberos authentication protocol in Hadoop for data security.
- Experience in dimensional, logical, and physical data modeling.
- Experienced with code versioning and dependency management systems such as Git, SVN, and Maven.
- Experience testing MapReduce programs with MRUnit and JUnit, using ANT and Maven builds.
- Experienced with scheduling tools such as UC4, Cisco Tidal Enterprise Scheduler, and Autosys.
- Adequate knowledge of and working experience with Agile and Waterfall methodologies.
- Great team player and quick learner with effective communication, motivation, and organizational skills, combined with attention to detail and a focus on business improvement.
JAVA (7 years), APACHE HADOOP HDFS (3 years), APACHE HADOOP SQOOP (3 years), HADOOP (3 years).
Hadoop/Big Data: Hadoop, MapReduce, Hive, YARN, Pig, Flume, Sqoop, Oozie, HBase, Spark
Java Technologies: Core Java, JSP, JDBC, Eclipse
Programming languages: Java, Python, C, C++, R, Linux shell scripts
Databases: MS SQL Server; NoSQL: HBase
Operating Systems: Linux, Unix, Windows 7/8/10
Confidential, San Francisco, CA
- Worked on Spark/Scala programming to create UDFs.
- Created and accessed AWS S3 buckets.
- Connected to AWS EC2 instances over SSH and ran spark-submit jobs.
- Worked in a Cloudera environment.
- Analyzed existing code and made bug fixes wherever required.
- Ran many test cases in Scala.
- Used Java to remove an attribute from a JSON file where Scala did not support creating the required objects, then converted the result back to Scala.
- Worked on master clean-up of data.
- Used accumulators to count results after executing a job on multiple executors.
- Worked in the IntelliJ IDE for development and debugging.
- Wrote a full set of programs in Scala for one of the LOBs and performed unit testing.
- Created many SQL schemas and utilized them throughout the program wherever required.
- Made enhancements to one of the LOBs using Scala programming.
- Ran spark-submit jobs and analyzed the log files.
- Used Maven to build .jar files.
- Used Sqoop to transfer data between relational databases and Hadoop.
- Worked on HDFS to store and access huge datasets within Hadoop.
- Good hands-on experience with Git and GitHub.
- Created a feature branch on GitHub.
- Pushed changes to GitHub and made pull requests.
- Experience with JSON and CFF.
Environment: Cloudera 5.8, Hadoop 2.7.2, HDFS 2.7.2, AWS S3, AWS EC2, Spark SQL 1.6.1, Sqoop 1.4.6, Spark 1.6.3, Scala 2.12, MySQL, shell scripting, Java, GitHub, JSON, CFF.
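The JSON attribute removal mentioned above was done in Java on the project; a minimal sketch of the same idea in Python is shown below for illustration. The key name `legacyAttr` and the sample record are hypothetical placeholders, not from the project.

```python
import json

def remove_attribute(obj, key):
    # Recursively drop every occurrence of `key` from a parsed JSON structure.
    if isinstance(obj, dict):
        return {k: remove_attribute(v, key) for k, v in obj.items() if k != key}
    if isinstance(obj, list):
        return [remove_attribute(item, key) for item in obj]
    return obj

# Hypothetical input record; `legacyAttr` is the attribute being stripped.
raw = '{"id": 1, "legacyAttr": "drop me", "nested": [{"legacyAttr": 2, "keep": 3}]}'
cleaned = remove_attribute(json.loads(raw), "legacyAttr")
print(json.dumps(cleaned))
```

The cleaned structure can then be serialized back out and handed to the Scala pipeline as ordinary JSON.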
Confidential, Stamford, CT
- Researched and recommended a suitable technology stack for Hadoop migration, considering the current enterprise architecture.
- Responsible for building scalable distributed data solutions using Hadoop.
- Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data.
- Developed Spark jobs and Hive jobs to summarize and transform data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames, Scala, and Python.
- Expertise in implementing Spark Scala applications using higher-order functions for both batch and interactive analysis requirements.
- Experienced in developing Spark scripts for data analysis in both Python and Scala.
- Wrote Scala scripts to make Spark Streaming work with Kafka as part of Spark-Kafka integration efforts.
- Built on-premises data pipelines using Kafka and Spark for real-time data analysis.
- Created reports in Tableau to visualize the data sets created, and tested native Drill, Impala, and Spark connectors.
- Implemented complex Hive UDFs to execute business logic within Hive queries.
- Responsible for bulk-loading data into HBase with MapReduce by creating HFiles directly and loading them.
- Developed different kinds of custom filters and handled pre-defined filters on HBase data using the API.
- Evaluated the performance of Spark SQL vs. Impala vs. Drill on offline data as part of a POC.
- Worked on Solr configuration and customizations based on requirements.
- Implemented Spark applications in Scala, utilizing DataFrames and the Spark SQL API for faster data processing.
- Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and then loading the data into HDFS.
- Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
- Collected and aggregated large amounts of log data using Flume, staging the data in HDFS for further analysis.
- Responsible for developing a data pipeline by implementing Kafka producers and consumers and configuring brokers.
- Optimized MapReduce jobs to use HDFS efficiently through various compression mechanisms.
- Developed unit tests for MapReduce programs using the MRUnit testing library.
- Experience managing and reviewing Hadoop log files.
- Scheduled the Oozie workflow engine to run multiple Hive and Pig jobs that run independently based on time and data availability.
- Set up Spark on EMR to process huge data sets stored in Amazon S3.
- Developed Pig UDFs to manipulate data per business requirements and worked on developing custom Pig loaders.
- Used Gradle for building and testing the project.
- Fixed defects during the QA phase, supported QA testing, and troubleshot defects to identify their source.
- Used Mingle and later JIRA for task/bug tracking.
- Used Git for version control.
Environment: Cloudera 5.8, Hadoop 2.7.2, HDFS 2.7.2, AWS, Pig 0.16.0, Hive 2.0, Impala, Drill 1.9, Spark SQL 1.6.1, MapReduce 1.x, Flume 1.7.0, Sqoop 1.4.6, Oozie 4.1, Storm 1.0, Docker 1.12.1, Kafka 0.10, Spark 1.6.3, Scala 2.12, HBase 0.98.19, ZooKeeper 3.4.9, MySQL, Tableau, shell scripting, Java.
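The Oozie scheduling described in this section (Hive and Pig jobs triggered by time and data availability) is typically expressed as a coordinator definition. The fragment below is a hypothetical sketch, not the project's actual configuration; the app name, dates, and HDFS paths are illustrative assumptions.

```xml
<!-- Hypothetical Oozie coordinator: runs the workflow daily, but only once
     the day's input dataset has landed in HDFS (signaled by a _SUCCESS flag). -->
<coordinator-app name="daily-hive-pig" frequency="${coord:days(1)}"
                 start="2016-01-01T00:00Z" end="2016-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <dataset name="logs" frequency="${coord:days(1)}"
             initial-instance="2016-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs:///data/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="logs">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs:///apps/oozie/daily-hive-pig</app-path>
    </workflow>
  </action>
</coordinator-app>
```

The `done-flag` is what makes the trigger data-aware: the workflow fires only after the producing job writes `_SUCCESS` into the day's directory.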
Confidential, Jessup, PA
- Analyzed requirements to set up a cluster.
- Worked on analyzing the Hadoop cluster and different big data analytics tools, including MapReduce, Hive, and Spark.
- Involved in loading data from the Linux file system, servers, and Java web services using Kafka producers and partitions.
- Implemented custom Kafka encoders for a custom input format to load data into Kafka partitions.
- Implemented Storm topologies to pre-process data before moving it into HDFS.
- Implemented Kafka high-level consumers to get data from Kafka partitions and move it into HDFS.
- Implemented a POC to migrate MapReduce programs into Spark transformations using Spark and Scala.
- Migrated complex MapReduce programs into Spark RDD transformations and actions.
- Implemented Spark RDD transformations to map business analysis and applied actions on top of the transformations.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
- Developed MapReduce programs to parse the raw data and store pre-aggregated data in partitioned tables.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data with MapReduce, Hive, and Pig.
- Developed MapReduce programs in Java for parsing the raw data and populating staging tables.
- Experienced in developing custom input formats and data types to parse and process unstructured and semi-structured input data, mapping them into key-value pairs to implement business logic in MapReduce.
- Used HCatalog to access Hive table metadata from MapReduce and Pig code.
- Experience implementing custom serializers, interceptors, sources, and sinks in Flume, as required, to ingest data from multiple sources.
- Experience setting up fan-in flows in Flume, a V-shaped architecture that takes data from many sources and ingests it into a single sink.
- Developed Shell, Perl, and Python scripts to automate and provide control flow for Pig scripts.
- Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
- Evaluated the use of Oozie for workflow orchestration.
- Converted unstructured data to structured data by writing Spark code.
- Indexed documents using Apache Solr.
- Set up SolrCloud for distributed indexing and search.
- Automated all jobs, from pulling data from data sources such as MySQL and pushing the result sets into the Hadoop Distributed File System, to running MapReduce, Pig, and Hive jobs, using Kettle and Oozie for workflow management.
- Worked on NoSQL databases such as Cassandra and MongoDB for POC purposes, storing images and URIs.
- Integrated bulk data into the Cassandra file system using MapReduce programs.
- Used the Talend ETL tool to develop multiple jobs and to set up workflows.
- Created Talend jobs to copy files from one server to another, utilizing Talend FTP components.
- Worked on MongoDB for distributed storage and processing.
- Designed and implemented Cassandra and an associated RESTful web service.
- Implemented row-level updates and real-time analytics using CQL on Cassandra data.
- Used Cassandra CQL with Java APIs to retrieve data from Cassandra tables.
- Worked on analyzing and examining customer behavioral data using Cassandra.
- Created partitioned tables in Hive and mentored analysts and the SQA team in writing Hive queries.
- Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
- Involved in cluster setup, monitoring, and test benchmarks for results.
- Involved in building and deploying applications with Maven, integrated with the Jenkins CI server.
- Followed agile methodologies, including daily scrum meetings and sprint planning.
Environment: Hadoop, Cloudera 5.4, HDFS, Pig 0.15, Hive 1.2.1, Flume 1.6.0, Sqoop 1.4.6, Oozie 0.4, AWS Redshift, Python 3.5.1, Spark 1.5.0, Scala 2.11, MongoDB 3.0, Cassandra 2.0.15, Solr 6.6.1, ZooKeeper 3.4.7, MySQL, Talend 6.2, shell scripting, Red Hat Linux, Java.
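The control-flow scripting described above (Shell/Perl/Python wrappers driving Pig scripts) can be sketched in Python as below. This is a generic sketch, not the project's code: the step names are placeholders, `echo` stands in for the real `pig -f <script>` invocations, and the retry policy is an assumption.

```python
import subprocess
import time

def run_step(cmd, retries=3, delay_s=5):
    # Run one pipeline step (e.g. ["pig", "-f", "extract.pig"]), retrying on failure.
    for attempt in range(1, retries + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        if attempt < retries:
            time.sleep(delay_s)
    raise RuntimeError(f"step {cmd!r} failed after {retries} attempts")

# Control flow: each step runs only if the previous one succeeded.
# `echo` is a stand-in for the real Pig jobs.
for step in (["echo", "extract"], ["echo", "transform"], ["echo", "load"]):
    print(run_step(step).strip())
```

Failing steps raise, which stops the pipeline rather than letting later Pig scripts run on missing intermediate data.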
Confidential, Auburn Hills, MI
- Developed custom data ingestion adapters to extract log data and clickstream data from external systems and load them into HDFS.
- Used Spark as an ETL tool for complex transformations, de-normalization, enrichment, and some pre-aggregations.
- Created Hive tables, loaded data, and wrote Hive queries for building analytical datasets.
- Developed a working prototype for real-time data ingestion and processing using Kafka, Spark Streaming, and HBase.
- Developed a Kafka producer and a Spark Streaming consumer to read the stream of events per business rules.
- Designed and developed job flows using Oozie.
- Developed Sqoop commands to pull data from Teradata.
- Wrote Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying of the log data.
- Experience using the Pentaho Data Integration (PDI) tool for data integration, OLAP analysis, and ETL processes.
- Used Avro and Parquet file formats and Snappy compression throughout the project.
- Collected data from distributed sources into Avro models, applied transformations and standardizations, and loaded the results into HBase for further processing.
Environment: Cloudera CDH 5.x, Pentaho, HDFS, Hadoop 2.2.0 (YARN), Eclipse, Hive, Pig Latin, Sqoop, ZooKeeper, Apache Kafka, Apache Storm, MySQL.
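Collecting distributed source data "into Avro models" means defining an Avro record schema per feed. The schema below is a hypothetical example for a clickstream record of the kind described above; the record name, namespace, and fields are illustrative assumptions, not the project's schema.

```json
{
  "type": "record",
  "name": "ClickEvent",
  "namespace": "com.example.ingest",
  "doc": "Hypothetical Avro model for clickstream records before the HBase load.",
  "fields": [
    {"name": "event_id",  "type": "string"},
    {"name": "user_id",   "type": ["null", "string"], "default": null},
    {"name": "url",       "type": "string"},
    {"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
```

The nullable union on `user_id` lets records from sources without user tracking pass schema validation, and the schema travels with the data, which is what makes Avro convenient for downstream standardization.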
- Responsible for developing various modules and front-end and back-end components, using several design patterns based on the client's business requirements.
- Designed and developed application modules using the Spring and Hibernate frameworks.
- Used Hibernate to develop persistent classes following ORM principles.
- Deployed Spring configuration files such as application context, application resources, and application files.
- Used Java/J2EE patterns such as Model-View-Controller (MVC), Business Delegate, Session Façade, Service Locator, Data Transfer Object, Data Access Object, Singleton, and Factory.
- Used JUnit for testing Java classes.
- Used the Waterfall methodology.
- Worked with Maven for build scripts and set up the Log4j logging framework.
- Involved in integrating the application with other services.
- Involved in unit and integration testing, bug fixing, and testing with test cases.
- Fixed bugs reported in user testing and deployed the changes to the server.
- Managed version control for the deliverables by streamlining and rebasing the development streams in SVN.