- Overall 8+ years of professional IT experience in the analysis, design, development, deployment, and maintenance of critical software and big data applications.
- 4+ years of hands-on experience across the Hadoop ecosystem, including extensive experience with big data technologies such as MapReduce, YARN, HDFS, Apache Cassandra, HBase, Oozie, Hive, Sqoop, Pig, ZooKeeper, and Flume.
- In-depth knowledge of HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce programming.
- Expertise in working with the MRv1 and MRv2 Hadoop architectures.
- Expertise in converting MapReduce programs into Spark transformations using Spark RDDs.
- Expertise in Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, and Spark MLlib.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Scala.
- Experience using Kafka brokers with Spark Streaming to process live streaming data as RDDs.
- Hands-on experience writing Hadoop jobs to analyze data using HiveQL, Pig Latin (a data-flow language), and custom MapReduce programs in Java.
- Experienced in working with structured data using HiveQL: join operations, Hive UDFs, partitioning, bucketing, and internal/external tables.
- Experienced in using Pig scripts for transformations, event joins, filters, and pre-aggregations before storing data in HDFS.
- Created custom UDFs for Pig and Hive to incorporate Python/Java methods and functionality into Pig Latin and HiveQL.
- Good Experience with NoSQL Databases like HBase, MongoDB and Cassandra.
- Experience on using Cassandra CQL with Java APIs to retrieve data from Cassandra tables.
- Hands on experience in querying and analyzing data from Cassandra for quick searching, sorting and grouping through CQL.
- Experience working with MongoDB for distributed storage and processing.
- Experienced in extracting data from MongoDB through Sqoop, placing it in HDFS, and processing it.
- Worked on importing data into HBase using HBase Shell and HBase Client API.
- Experience in designing and developing tables in HBase and storing aggregated data from Hive Table.
- Good knowledge of scheduling jobs in Hadoop using the FIFO, Fair, and Capacity schedulers.
- Experienced in designing both time-driven and data-driven automated workflows using Oozie and ZooKeeper.
- Experience working with Solr to develop search over unstructured data in HDFS.
- Extensively used Solr indexing to enable searches on non-primary-key columns of Cassandra keyspaces.
- Experience working with operating systems like Linux, UNIX, Solaris, and Windows 2000/XP/Vista/7.
- Experience in working with Hadoop in standalone, pseudo-distributed, and fully distributed modes.
- Experience in writing stored procedures and complex SQL queries using relational databases like Oracle, SQL Server, and MySQL.
- Skilled in data management, extraction, manipulation, and validation, and in analyzing huge volumes of data.
- Experience in Extraction, Transformation and Loading (ETL) of data from multiple sources like Flat files, XML files, and Databases.
- Implemented data quality checks in the ETL tool Talend; good knowledge of data warehousing and ETL tools such as IBM DataStage, Informatica, and Talend.
- Good working knowledge of the Eclipse IDE for developing and debugging Java applications.
- Experienced, with in-depth knowledge, in cloud integration with AWS using Elastic MapReduce (EMR), Simple Storage Service (S3), EC2, and Redshift, as well as Microsoft Azure.
- Detailed understanding of Software Development Life Cycle (SDLC) and strong knowledge in project implementation methodologies like Waterfall and Agile.
- A good team player with the ability to solve problems and to organize and prioritize multiple tasks.
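The MapReduce programming model cited throughout this summary can be illustrated with a minimal pure-Python word-count sketch (no Hadoop dependency; the map, shuffle, and reduce phases are simulated locally, and the input lines are invented for illustration):

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big plans", "data pipelines"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

The same three-phase structure underlies the Java MapReduce jobs described above; Hadoop merely distributes the map and reduce calls across the cluster.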
Languages: C, C++, Python, PL/SQL, Java, HiveQL, Pig Latin, Scala, UNIX shell scripting.
Hadoop Ecosystem: HDFS, YARN, Scala, MapReduce, Hive, Pig, ZooKeeper, Sqoop, Oozie, Bedrock, Flume, Kafka, Impala, NiFi, MongoDB, HBase.
Databases: Oracle, MS SQL Server, MySQL, PostgreSQL, NoSQL (HBase, Cassandra, MongoDB), Teradata.
Tools: Eclipse, NetBeans, Informatica, IBM DataStage, Talend, Maven, Jenkins.
Hadoop Platforms: Hortonworks, Cloudera, Azure, Amazon Web Services (AWS).
Operating Systems: Windows XP/2000/NT, Linux, UNIX.
Amazon Web Services: Redshift, EMR, EC2, S3, RDS, CloudSearch, Data Pipeline, Lambda.
Version Control: GitHub, SVN, CVS.
Packages: MS Office Suite, MS Visio, MS Project Professional.
Confidential, Lexington, KY
Sr. Spark Developer
- Actively involved in designing the Hadoop ecosystem pipeline.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Involved in designing and monitoring a multi-data-center Kafka cluster.
- Responsible for pulling real-time data from source systems into the Kafka clusters.
- Applied Spark performance-tuning techniques such as refreshing tables, tuning parallelism, and modifying the Spark default configuration.
- Implemented Spark RDD transformations to map business analysis logic and applied actions on top of those transformations.
- Involved in migrating MapReduce jobs to Spark jobs, using Spark SQL and the DataFrames API to load structured data into Spark clusters.
- Used the Spark API over Hadoop YARN as the execution engine for data analytics with Hive, and submitted the processed Spark SQL results to the BI team for report generation.
- Performed SQL joins among Hive tables to produce input for the Spark batch process.
- Worked with the data science team to build statistical models with Spark MLlib and PySpark.
- Imported data from various sources into the Cassandra cluster using Sqoop.
- Worked on creating data models for Cassandra from the existing Oracle data model.
- Designed column families in Cassandra; ingested data from RDBMS, performed data transformations, and exported the transformed data to Cassandra per business requirements.
- Used Sqoop import functionality to load historical data from RDBMS into HDFS.
- Designed workflows and coordinators in Oozie to automate and parallelize Hive jobs in an Apache Hadoop environment on Hortonworks (HDP 2.2).
- Configured Hive bolts and wrote data to Hive in Hortonworks as part of a POC.
- Implemented the ELK (Elasticsearch, Logstash, Kibana) stack to collect and analyze the logs produced by the Spark cluster.
- Developed Oozie workflows for scheduling and orchestrating the ETL process.
- Created Data Pipelines as per the business requirements and scheduled it using Oozie Coordinators.
- Worked extensively on Apache NiFi, building NiFi flows to replace existing Oozie jobs for incremental loads, full loads, and semi-structured data, pulling data from REST APIs into Hadoop and automating the flows to run incrementally.
- Created NiFi flows to trigger Spark jobs and used PutEmail processors to send notifications on failures.
- Developed shell scripts to periodically perform incremental imports of data from a third-party API to Amazon AWS.
- Worked extensively on importing metadata into Hive using Scala and migrated existing tables and applications to work on Hive and the AWS cloud.
- Developed batch scripts to fetch data from AWS S3 storage and perform the required transformations in Scala using the Spark framework.
- Experienced in using version control tools like GitHub to share code among team members.
- Involved in daily Scrum meetings to discuss development progress and was active in making Scrum meetings more productive.
Environment: Hadoop, HDFS, Hive, Python, Spark, MySQL, Oracle, Linux, Hortonworks, Oozie, MapReduce, Sqoop, Shell Scripting, Apache Kafka, Scala, AWS.
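The Kafka-to-HDFS streaming work above is not reproducible outside a cluster, but the micro-batching idea behind Spark Streaming can be sketched in plain Python (the batch grouping is simulated locally; the record values and the `process_batch` transformation are hypothetical stand-ins):

```python
def micro_batches(records, batch_size):
    # Group an incoming record stream into fixed-size micro-batches,
    # the way Spark Streaming discretizes a Kafka stream into RDDs.
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

def process_batch(batch):
    # Stand-in for the per-batch transformation applied before writing to HDFS.
    return [r.upper() for r in batch]

stream = ["evt1", "evt2", "evt3", "evt4", "evt5"]
results = [process_batch(b) for b in micro_batches(stream, 2)]
```

In the real pipeline the batch boundary is a time interval rather than a count, and each processed batch is persisted to HDFS instead of collected in memory.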
Confidential, Austin, Texas
Sr. Spark Developer
- Developed Spark scripts using Scala shell commands as per requirements.
- Created end-to-end Spark applications in Scala to perform data cleansing and validation, loaded the data into Spark RDDs, and performed in-memory computation to generate the output response.
- Used MongoDB to store big data and applied Match, Sort, and Group aggregation operations in MongoDB.
- Used Spark Streaming APIs to perform on-the-fly transformations and actions to build the common learner data model, which receives data from Kafka in near real time and persists it to MongoDB.
- Worked on the Data ingestion process using Sqoop for MySQL imports and exports, and Flume and Kafka for streaming data ingestion.
- Involved in collecting, aggregating, and moving large amounts of streaming data into HDFS using Flume.
- Maintained multiple copies of data in different database servers using MongoDB Replication concept.
- Developed Collections in Mongo DB and performed aggregations on the collections.
- Implemented automated scripts to back up old records using the MongoDB export command and transferred the backup files to a backup machine using ftplib.
- Wrote Hive queries to structure log data in tabular format and facilitate effective querying for business analytics.
- Migrated HiveQL queries on structured data to Spark SQL to improve performance.
- Involved in running Hive scripts through Hive, Impala, Hive on Spark, and Spark SQL.
- Set up SolrCloud for distributed indexing and search.
- Worked on Solr configuration and customization based on requirements.
- Worked in development/implementation of Cloudera Hadoop environment.
- Performed Cloudera Hadoop Upgrades and Patches and Installation of Ecosystem Products through Cloudera manager along with Cloudera Manager Upgrade.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Worked on a 40-node live Hadoop cluster running on Cloudera CDH4 and CDH5.
- Processed the web server logs by developing Multi-hop flume agents and loaded into MongoDB for further analysis.
- Processed unstructured files such as XML and JSON using a custom-built Java API and pushed them into MongoDB.
- Extensively used ZooKeeper as a job scheduler for Spark jobs.
- Worked with cloud services like Azure and was involved in ETL, data integration, and migration.
- Wrote Lambda functions in Python for Azure which invoke scripts to perform various transformations and analytics on large data sets in EMR clusters.
- Used reporting tools like Tableau, connected to Impala, to generate daily data reports.
- Exported the processed data to the relational databases using Sqoop, to further visualize and generate reports for the BI team.
- Worked in Agile development environment and actively involved in daily Scrum and other design related meetings.
Environment: Hadoop, Python, HDFS, YARN, Scala, Hive, Sqoop, Flume, ZooKeeper, Cloudera, Kafka, MongoDB, Linux Shell Scripting, Azure, Spark SQL, XML, ETL.
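The Match/Sort/Group aggregation pattern applied to MongoDB above can be sketched in plain Python (the documents and field names are invented for illustration; MongoDB's `$match`, `$sort`, and `$group` pipeline stages are simulated with ordinary list operations):

```python
from itertools import groupby

docs = [
    {"user": "a", "score": 5}, {"user": "b", "score": 3},
    {"user": "a", "score": 2}, {"user": "b", "score": 9},
    {"user": "c", "score": 1},
]

# $match: keep only documents that pass the predicate.
matched = [d for d in docs if d["score"] >= 2]
# $sort: order by the grouping key so groupby sees contiguous runs.
matched.sort(key=lambda d: d["user"])
# $group: sum scores per user.
totals = {user: sum(d["score"] for d in grp)
          for user, grp in groupby(matched, key=lambda d: d["user"])}
```

In MongoDB itself the same result comes from a single `aggregate()` call with the three stages in pipeline order.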
Confidential, San Francisco, CA
Big Data Engineer
- Setup and benchmarked Hadoop/HBase clusters for internal use.
- Worked with business teams and created Hive queries for ad hoc access.
- Loaded daily data from websites to Hadoop cluster by using Flume.
- Involved in loading data from UNIX file system to HDFS.
- Designed and modified database tables and used HBase queries to insert and fetch data from tables.
- Created Hive tables and worked on them using HiveQL.
- Created complex Hive tables and executed complex Hive queries on Hive warehouse.
- Installed and configured Hive and written Hive UDFs.
- Developed Hive queries for the analysts.
- Wrote MapReduce code to convert unstructured data to semi structured data.
- Implemented mappers and reducers across 24 nodes and distributed the data among the nodes.
- Imported bulk data into MongoDB using MapReduce programs.
- Used Pig for the extraction, transformation, and loading of semi-structured data.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.
- Used Pig as ETL tool to do transformations, event joins and some pre-aggregations before storing the data onto HDFS.
- Designed the technical solution for real-time analytics using HBase.
- Created HBase tables and used HBase sinks and loaded data into them to perform analytics using Tableau.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
Environment: Hadoop, MapReduce, Yarn, Pig, HBase, Oozie, Sqoop, Flume, Core Java, Cloudera HDFS, Eclipse.
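Real-time analytics on HBase, as in the project above, typically hinges on row-key design; a minimal Python sketch of a salted composite row key (the key layout, the `device_id` field, and the bucket count are illustrative assumptions, not the project's actual schema):

```python
import hashlib

def row_key(device_id, ts, buckets=8):
    # Composite HBase row key: a salt prefix spreads writes across regions,
    # and a reversed timestamp makes the newest rows sort first in a scan.
    salt = int(hashlib.md5(device_id.encode()).hexdigest(), 16) % buckets
    reversed_ts = 2**63 - 1 - ts
    return f"{salt:02d}|{device_id}|{reversed_ts}"

k1 = row_key("sensor-42", 1_700_000_000)
k2 = row_key("sensor-42", 1_700_000_100)
```

Because HBase stores rows in lexicographic key order, the later event (`k2`) sorts before the earlier one, so "latest N readings per device" becomes a short forward scan.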
Confidential, Chicago, IL
- Involved in installing, configuring and managing Hadoop Ecosystem components like HDFS, Hive, Pig, Sqoop and Flume.
- Designed and implemented a MapReduce-based large-scale, parallel relation-learning system.
- Optimized Map/Reduce Jobs to use HDFS efficiently by using various compression mechanisms.
- Worked on Linux shell scripts for business processes and with loading the data from different systems to the HDFS.
- Involved in generating analytics data using MapReduce programs written in core java.
- Used Hive data warehouse tool to analyze the data in HDFS and developed Hive queries.
- Involved in creating Hive internal and external tables, loaded them with data and writing hive queries which requires multiple join scenarios.
- Configured and designed Pig Latin scripts to process the data into a universal data model.
- Involved in developing Pig scripts in the areas where extensive coding needs to be reduced.
- Used Pig as ETL tool to do Transformations and some pre-aggregations before storing the data onto HDFS.
- Developed Shell, Perl and Python scripts to automate and provide Control flow to Pig scripts.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Implemented Frameworks using Java and Python to automate the ingestion flow.
- Managed and reviewed Hadoop log files to identify issues when job fails.
- Created ETL (Informatica) jobs to generate and distribute reports from MySQL database.
Environment: Hadoop, MapReduce, YARN, Hive, Pig, Oozie, Sqoop, Flume, MySQL, Core Java, Python, ETL.
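The Hive partitioning and bucketing used above for reporting metrics can be sketched in plain Python (the table name, partition columns, and hash function are illustrative assumptions; Hive uses its own hash of the clustering column to pick a bucket file):

```python
def hash_like(s):
    # Simple deterministic string hash (a stand-in for Hive's hash function).
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) & 0x7FFFFFFF
    return h

def partition_path(table, dt, country):
    # Hive lays out one HDFS directory per partition key value,
    # so partition-pruned queries only read the matching directories.
    return f"/warehouse/{table}/dt={dt}/country={country}"

def bucket_for(user_id, num_buckets=4):
    # Bucketing assigns each row to one of a fixed number of files
    # by hashing the bucket column, which enables bucketed map joins.
    return hash_like(user_id) % num_buckets

p = partition_path("events", "2017-03-01", "US")
b = bucket_for("user123")
```

Partitioning prunes whole directories at query time, while bucketing gives a stable row-to-file assignment within each partition.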
- Worked closely with Business Analysts in understanding the technical requirements of each project and prepared the use cases for different functionalities and designs.
- Involved in using Agile practices and Test Driven Development techniques to provide reliable, working software early and often.
- Developed Use Case diagrams and Class diagrams.
- Constructed system components using Java, J2SE, Web Services, Spring, and jQuery.
- Responsible for using Maven for build framework and Jenkins for continuous build system.
- Developed batch jobs for various functionalities using Java multi-threading.
- Experienced in working with AngularJS for client-side scripting and synchronizing application views with data.
- Involved in developing server-side programs with Core Java, Servlets, and JSPs.
- Used JDBC and SQL for database management and wrote SQL queries.
- Developed and deployed the application using the Eclipse IDE on Tomcat and web servers.
- Involved in all the phases of SDLC including Requirements Collection, Design & Analysis of the Customer Specifications, Development and Customization of the Application.
- Communicated with Project manager, client, stakeholder and scrum master for better understanding of project requirements and task delivery by using Agile Methodology.
- Involved in implementing all components of the application, including database tables, server-side Java programming, and client-side web programming.
- Designed and developed Web Services to provide services to the various clients using SOAP and WSDL.
- Involved in preparing technical Specifications based on functional requirements.
- Involved in development of new command Objects and enhancement of existing command objects using Servlets and Core java.
- Identified and implemented the user actions (Struts Action Classes) and forms (Struts Forms Classes) as a part of Struts framework.
- Responsible for coding SQL Statements and Stored procedures for back end communication using JDBC.
- Involved in documentation, review, analysis and fixed post production issues.
Environment: Java, J2EE, JDBC, Struts, JSP, jQuery, SOAP, Servlets, SQL, HTML, CSS, JavaScript, DB2.
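The multi-threaded batch jobs above were written in Java, but the underlying worker-pool pattern is language-neutral; a minimal Python sketch (the `process` task is a placeholder for the per-record work such as a database write or API call):

```python
from concurrent.futures import ThreadPoolExecutor

def process(record):
    # Placeholder for per-record batch work (DB write, API call, etc.).
    return record * 2

records = list(range(10))

# A fixed pool of worker threads drains the batch; pool.map preserves
# input order in its results even though execution is concurrent.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, records))
```

The Java equivalent uses an `ExecutorService` with a fixed thread pool; the design choice in both cases is bounding concurrency while keeping results in input order.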