- Around 8+ years of professional IT experience, with expertise in Big Data on the Hadoop framework and in the analysis, design, development, testing, documentation, deployment and integration of solutions using SQL, Big Data ecosystems, and data management and visualization tools.
- Expertise in major Hadoop ecosystem components such as HDFS, YARN, MapReduce, Hive, Sqoop, HBase, Spark, Spark SQL, Oozie, ZooKeeper and Hue.
- Good knowledge of Amazon AWS services such as EMR and EC2, which provide fast and efficient processing.
- Extensive experience in installing, configuring and using Big Data ecosystem components like Hadoop MapReduce, HDFS, Sqoop, Pig, Hive, Impala, Spark and ZooKeeper.
- Expertise in using J2EE application servers such as IBM WebSphere and JBoss, and web servers like Apache Tomcat.
- Experience with different Hadoop distributions: Cloudera (CDH3 and CDH4) and Hortonworks Data Platform (HDP).
- Experience working on Spark and Google Cloud big data platforms. Extensively used Spark, Spark Streaming, Kafka, BigQuery, Spanner, Dataflow, Pub/Sub and Apache Beam.
- Experience in developing web services with XML-based technologies such as SOAP, Axis, UDDI and WSDL.
- Solid understanding of Hadoop MRv1 and MRv2 (YARN) architecture.
- Worked on Google Cloud Platform (GCP) big data tools.
- Good knowledge of API managers such as Azure API Management and Apigee.
- Well versed in writing and deploying Oozie workflows and coordinators.
- Good working experience using Sqoop to import data into HDFS from RDBMS and vice versa.
- Extensive experience in Extraction, Transformation and Loading (ETL) of data from multiple sources into Data Warehouse.
- Exposure to Data Lake Implementation using Apache Spark.
- Hands on experience in configuring and working with Flume to load the data from multiple sources directly into HDFS.
- Good working experience in using Spark SQL to manipulate Data Frames in Python.
- In-depth knowledge of handling large amounts of data utilizing Spark Data Frames/Datasets API and Case Classes.
- Created modules for streaming data into the Data Lake using Storm and Spark.
- Good knowledge in implementing various data processing techniques using Apache HBase for handling the data and formatting it as required.
- Working experience in Impala, Mahout, Spark SQL, Storm, Avro, Kafka and AWS.
- Experience with Java web framework technologies like Apache Camel and Spring Batch.
- Experience in version control and source code management tools like GIT, SVN and Bitbucket.
- Hands on experience working with databases like Oracle, SQL Server and MySQL.
- Strong knowledge of working with the Apache Spark Streaming API on Big Data distributions in an active cluster environment.
- Hands-on experience and strong knowledge of Databricks, designing and delivering solutions on the Azure data analytics platform.
- Extracted, parsed, cleaned and ingested incoming web feed data and server logs into HDInsight and Azure Data Lake Store, handling structured and unstructured data.
- Proficiency in developing secure enterprise Java applications using technologies such as Maven, Hibernate, XML, HTML, CSS and version control systems.
- Developed and implemented Apache NiFi across various environments and wrote QA scripts in Python for tracking files.
- Excellent understanding of Hadoop and underlying framework including storage management.
- Good knowledge of and experience with NoSQL databases like Cassandra and MongoDB.
- Experience using ANT for building and deploying projects to servers, and JUnit and Log4j for testing and debugging.
- Experience in architecting data workload solutions using Cloud based technologies like Azure Data Factory, Databricks, Snowflake etc.
- Excellent communication and interpersonal skills, contributing to timely completion of project deliverables, often ahead of schedule.
Big Data Tools & Technologies: Hortonworks HDP, Hive, Apache Spark, SQL, MapReduce, Pig, HBase, Cassandra, NoSQL, Sqoop, Oozie, YARN, Tableau.
Hadoop Ecosystem: Spark Core, Kafka, Spark SQL, HDFS, YARN, Sqoop, Pig, Hive, Oozie, Flume, MapReduce, Storm
Development and Build Tools: Eclipse, NetBeans, IntelliJ, ANT, Maven, Ivy, TOAD, SQL Developer
Databases: HBase, Cassandra, MongoDB, Oracle 9i/10g/11g, Microsoft SQL Server 2005/2008 R2/2012, MySQL, ODI, SQL/PL-SQL
Operating Systems: LINUX, Ubuntu, Windows
Security & Cluster Management: Ambari, Cloudera Manager, SSL/TLS, Kerberos
Data Modeling Tools: Erwin 7.3/7.1/4.1/4.0
IDEs: Eclipse, IntelliJ, Spark Eclipse
Project Management Tools & DevOps: Rally, JIRA, Jenkins, Bitbucket (GIT)
Frameworks: JUnit, Jest, Spring, Hibernate, Kafka, Flask, Django, Android, Zeplin, Akka, ActiveMQ, WSO2 ESB, WSO2 CEP, ORC
Confidential - Chicago, IL
Sr. Big data and Hadoop Developer
- Experienced in writing Spark Applications in Scala and Python (PySpark).
- Imported Avro files using Apache Kafka and performed analytics using Spark in Scala.
- Extracted real-time data using Kafka and Spark Streaming by creating DStreams, converting them into RDDs, processing them and storing the results in Cassandra.
- Experience in building Real-time Data Pipelines with Kafka Connect and Spark Streaming.
- Configured, deployed and maintained multi-node Dev and Test Kafka Clusters.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time and persists it to Cassandra.
- Developed scripts to load data into Spark DataFrames and perform in-memory computation to generate the output response.
- Worked on Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring and Cloud Deployment Manager.
- Set up alerting and monitoring using Stackdriver in GCP.
- Designed and implemented large-scale distributed solutions in the AWS and GCP clouds.
- Used sbt to build Scala-based Spark projects and executed them using spark-submit.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Developed batch scripts to fetch data from AWS S3 storage and perform the required transformations in Scala using the Spark framework.
- Built Cassandra nodes on AWS and set up the Cassandra cluster using Ansible automation tools.
- Wrote complex Hive queries and UDFs in Java and Python.
- Worked with Amazon Web Services (AWS) cloud services such as EC2, S3, EMR, EBS, RDS and VPC.
- Developed Scala scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, writing data back into the RDBMS through Sqoop.
- Involved in executing various Oozie workflows and automating parallel Hadoop MapReduce jobs.
- Developed Oozie Bundles to Schedule Pig, Sqoop and Hive jobs to create data pipelines.
- Experience in using ORC, Avro, Parquet, RCFile and JSON file formats and developed UDFs using Hive and Pig.
- Experience in implementing data solutions in Azure including Azure SQL, Azure Synapse, Cosmos DB, Databricks, ADLS, Blob Storage, ADF and Azure Stream Analytics.
- Developed Hive queries to do analysis of the data and to generate the end reports to be used by business users.
- Used Spark and Spark SQL to read Parquet data and create Hive tables using the Scala API.
- Migrated existing MapReduce programs to Spark using Scala and Python.
- Designed solutions for various system components using Microsoft Azure.
- Wrote a generic data-quality-check framework, used across the application, based on Impala.
- Worked on Microsoft Azure services such as HDInsight clusters, Blob Storage, ADLS, Data Factory and Logic Apps, and completed a POC on Azure Databricks.
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala; initial versions were done in Python (PySpark).
- Involved in the process of Cassandra data modelling and building efficient data structures.
- Gained an understanding of Kerberos authentication in Oozie workflows for Hive and Cassandra.
- Worked with complex data structures of different types (structured, semi-structured), de-normalizing them for storage in Hadoop.
- Hands on developer for Unix/Linux batch applications.
- Utilized MapReduce, HDFS, Hive, Pig, Spring Batch & MongoDB.
- Worked on NoSQL databases including HBase and MongoDB.
- Responsible for data extraction and data ingestion from different data sources into Hadoop Data Lake by creating ETL pipelines using Pig and Hive.
- Experience with Databricks and using Spark for data processing.
- Wrote Spark SQL and embedded the SQL in Scala files to generate JARs for submission to the Hadoop cluster.
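The generic data-quality-check framework mentioned above can be sketched in plain Python. In the project the checks ran against Impala tables; here rules are applied to in-memory rows so the sketch is self-contained, and all rule and column names are illustrative, not the production API.

```python
def run_checks(rows, rules):
    """Apply each named rule (a predicate) to every row; return, per rule,
    the indices of rows that failed it."""
    failures = {name: [] for name in rules}
    for i, row in enumerate(rows):
        for name, predicate in rules.items():
            if not predicate(row):
                failures[name].append(i)
    return failures

# Hypothetical rules and sample rows for illustration.
rules = {
    "id_not_null": lambda r: r.get("id") is not None,
    "amount_non_negative": lambda r: r.get("amount", 0) >= 0,
}
rows = [
    {"id": 1, "amount": 10.0},
    {"id": None, "amount": 5.0},
    {"id": 3, "amount": -2.0},
]
print(run_checks(rows, rules))  # {'id_not_null': [1], 'amount_non_negative': [2]}
```

Keeping rules as data (a dict of predicates) is what makes such a framework generic: new checks can be added without touching the runner.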
Environment: Big Data 3.0, SDLC, Azure, HDFS, Scala, SQL, Hive 2.3, Spark, Kafka 1.1, Hadoop 3.0, Apache NiFi, ETL, Sqoop 1.4, Flume 1.8, PySpark, GCP, Elasticsearch, Oozie 4.3, Jenkins, XML, MySQL, GitHub, Hortonworks, Cloudera, MongoDB.
Confidential - Austin, TX
Sr. Big data and Hadoop Developer
- Worked on Big Data infrastructure for batch processing as well as real-time processing.
- Responsible for building scalable distributed data solutions using Hadoop.
- Involved with all the phases of Software Development Life Cycle (SDLC) methodologies throughout the project life cycle.
- Developed a JDBC connection to get the data from Azure SQL and feed it to a Spark Job.
- Configured Spark Streaming to receive real-time data from Apache Kafka and store the stream data to HDFS using Scala.
- Developed Sqoop scripts for interaction between Hive and the Vertica database.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, and Scala.
- Deployed the application in Hadoop cluster mode by using spark submit scripts.
- Designed, developed and maintained Cloudera/Google Cloud Platform (GCP) solutions including the Data Lake, Data Factory, automated data pipelines, subject areas and BI applications.
- Moved data into the Data Lake using Sqoop, cleansed it with Pig and loaded it into Hive tables.
- Queried and analyzed data from Cassandra for quick searching, sorting and grouping through CQL.
- Worked on the large-scale Hadoop YARN cluster for distributed data processing and analysis using Spark, Hive.
- Implemented various Hadoop Distribution environments such as Cloudera and Hortonworks.
- Implemented monitoring and established best practices around the usage of Elasticsearch.
- Worked with Apache NiFi as an ETL tool for batch and real-time processing.
- Moved relational database data using Sqoop into Hive dynamic-partition tables via staging tables.
- Built PySpark pipelines for validating tables between Oracle and Hive.
- Upgraded the Hadoop cluster from CDH3 to CDH4, set up a High Availability cluster and integrated Hive with existing applications.
- Captured the data logs from web server into HDFS using Flume for analysis.
- Involved in developing code to write canonical model JSON records from numerous input sources to Kafka Queues.
- Reworked stored procedures to make them work in the data lake.
- Worked with Spark using Scala to improve performance and optimize existing algorithms in Hadoop, using Spark Context, Spark SQL, PySpark, Pair RDDs and Spark on YARN.
- Involved in identifying job dependencies to design workflow for Oozie & YARN resource management.
- Used GitHub as repository for committing code and retrieving it and Jenkins for continuous integration.
- Reviewed the HDFS usage and system design for future scalability and fault-tolerance.
- Loaded and transformed large sets of structured, semi structured and unstructured data in various formats like text, zip, XML and JSON.
- Performed data validation and transformation using Python and Hadoop streaming.
- Used Oozie workflows to run Pig and Hive jobs; extracted files from MongoDB through Sqoop, placed them in HDFS and processed them.
- Worked on importing data from HDFS to a MySQL database and vice versa using Sqoop.
- Developed Python, shell/Perl and PowerShell scripts for automation, and performed component unit testing using the Azure Emulator.
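The Oracle-vs-Hive table validation described above can be reduced, under simplifying assumptions, to a count-plus-checksum comparison. In the real pipeline the rows came from PySpark DataFrames; here plain dicts stand in, and all function and column names are illustrative.

```python
import hashlib

def table_fingerprint(rows, cols):
    """Row count plus an order-independent checksum: XOR the MD5 of each
    row's selected columns, so row ordering does not affect the result."""
    digest = 0
    for row in rows:
        material = "|".join(str(row[c]) for c in cols)
        digest ^= int(hashlib.md5(material.encode()).hexdigest(), 16)
    return len(rows), digest

def tables_match(source_rows, target_rows, cols):
    """True when both sides have the same count and checksum."""
    return table_fingerprint(source_rows, cols) == table_fingerprint(target_rows, cols)

src = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
print(tables_match(src, list(reversed(src)), ["id", "v"]))  # True
```

One caveat of the XOR trick: a pair of identical duplicate rows cancels out, so a production check would also compare counts per key rather than relying on the checksum alone.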
Environment: Hortonworks Hadoop, Cassandra, flat files, Oracle 11g/10g, MySQL, Toad 9.6, Windows NT, Sqoop, Hive, Oozie, Ambari, GCP, SAS, PySpark, SPSS, Unix shell scripts, ZooKeeper, SQL, MapReduce, Pig.
Confidential - Frisco, TX
- Developed Spark scripts by using Scala shell commands as per the requirement.
- Worked with technology and business groups for Hadoop migration strategy.
- Implemented the project using Agile methodology and attended daily Scrum meetings.
- Worked on Amazon AWS services like EMR and EC2 for fast and efficient processing of Big Data.
- Worked on migrating legacy MapReduce programs into Spark transformations using Spark and Scala.
- Configured, deployed and maintained multi-node Dev and Test Kafka Clusters.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
- Developed Pig scripts to help perform analytics on JSON and XML data.
- Worked with Apache Solr to implement indexing and wrote custom Solr query segments to optimize search.
- Implemented MapReduce counters to gather metrics on good and bad records.
- Handled importing of data from machine logs using Flume.
- Used Cloudera distribution for Data transformation and Data preparation.
- Configured Storm to load data from MySQL into HBase using JMS.
- Used Maven extensively for building jar files of MapReduce programs and deployed to Cluster.
- Extracted the data from Teradata into HDFS/Databases/Dashboards using Spark Streaming.
- Installed the Oozie workflow engine to run multiple MapReduce, Hive QL and Pig jobs.
- Collected the log data from web servers and integrated them into HDFS using Flume.
- Explored Spark for improving the performance and optimization of existing algorithms in Hadoop.
- Imported data from different sources such as HDFS and HBase into Spark RDDs.
- Experienced with Spark Context, Spark SQL, DataFrames, Pair RDDs and Spark on YARN.
- Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
- Used Pig to perform data validation on data ingested via Sqoop and Flume, and pushed the cleansed data set into MongoDB.
- Extracted files from CouchDB through Sqoop, placed them in HDFS and processed them.
- Implemented a distributed messaging queue to integrate with Cassandra using Apache Kafka and Zookeeper.
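The good-record/bad-record counters mentioned above are exposed by Hadoop through its counter API; the same bookkeeping can be sketched in plain Python. The record format (three comma-separated fields with a numeric third field) is a hypothetical stand-in for the real schema.

```python
from collections import Counter

def map_records(lines):
    """Toy mapper: tallies good vs. bad records the way MapReduce counters
    do, emitting (key, value) pairs only for well-formed records."""
    counters = Counter()
    output = []
    for line in lines:
        parts = line.split(",")
        # A "good" record here is exactly three fields with a numeric third field.
        if len(parts) == 3 and parts[2].isdigit():
            counters["good_records"] += 1
            output.append((parts[0], int(parts[2])))
        else:
            counters["bad_records"] += 1
    return output, counters

pairs, counters = map_records(["a,b,3", "x,y,notnum", "p,q,7,extra"])
print(pairs, dict(counters))  # [('a', 3)] {'good_records': 1, 'bad_records': 2}
```

In a real job these tallies would be reported through the task context so they aggregate across all mappers and show up in the job's counter summary.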
Environment: Spark, Hadoop 3.0, Agile, AWS, Oozie 4.3, Cassandra, MapReduce, Apache Pig 0.17, Scala, Hive 2.3, HDFS, Apache Flume 1.8, HBase 1.2, Apache, MongoDB, Sqoop 1.4, ZooKeeper.
Confidential - Houston, TX
- Collaborated with the source team to understand the format and delimiters of data files.
- Generated insight for various application teams derived from data to drive business decisions.
- Developed and implemented API services incorporating Scala in Spark.
- Implemented POCs on migrating to Spark Streaming to process live data.
- Ingested data from RDBMS, performed data transformations and exported the transformed data to Cassandra.
- Reprogrammed existing MapReduce jobs to use new features and improvements, achieving faster results.
- Added, decommissioned and rebalanced nodes based on cluster activity.
- Created a POC to store server log data in Elasticsearch to generate system alert metrics.
- Configured monitoring and management tools and client machines.
- Extended the Hadoop ecosystem by installing a Hadoop cluster and integrating it with other systems.
- Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
- Created data pipeline using Kafka, HBase, Spark and Hive to ingest, transform and analyze customer behavior.
- Imported metadata into Hive, migrated existing tables and applications to work on Hive and Spark.
- Developed Spark jobs and Hive jobs to summarize and transform parquet and JSON data.
- Used Spark for interactive queries, data streaming, and integration of a high-volume dataset in NoSQL database.
- Worked under Cloudera Distribution to build familiarity with HDFS.
- Processed terabytes of textual data by running Spark streaming jobs.
- Implemented POCs using Kafka, Storm and HBase for processing streaming data.
- Imported and exported data into HDFS, Hive and Pig using Sqoop.
- Populated Big Data customer marketing data structures.
- Developed Spark scripts by using Python.
- Performed complex joins on tables within Hive with various optimization techniques.
- Implemented lateral views in conjunction with UDFs in Hive.
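Hive's lateral views, mentioned above, pair LATERAL VIEW with a table-generating function such as explode() to produce one output row per array element. A minimal Python sketch of that semantics (the column names are illustrative):

```python
def lateral_view_explode(rows, array_col):
    """Mimics Hive's LATERAL VIEW explode(): emits one output row per
    element of the array column, copying the other columns unchanged."""
    for row in rows:
        for item in row[array_col]:
            out = dict(row)
            out[array_col] = item
            yield out

rows = [{"user": "u1", "tags": ["a", "b"]}, {"user": "u2", "tags": ["c"]}]
print(list(lateral_view_explode(rows, "tags")))
```

In HiveQL the equivalent shape is `SELECT user, tag FROM t LATERAL VIEW explode(tags) x AS tag`, which is where a UDF can then be applied per exploded row.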
Environment: Hadoop, MapReduce, HDFS, HBase, HDP (Hortonworks), Sqoop, Data Processing Layer, Hue, Azure, Erwin, MS Visio, Tableau, SQL, MongoDB, Oozie, UNIX, MySQL, RDBMS, Ambari, SolrCloud, Lily HBase Indexer, Cron.
Big Data Engineer
- Primary responsibilities included building scalable distributed data solutions using the Hadoop ecosystem.
- Developed complex MapReduce streaming jobs in Java, complemented by Hive and Pig implementations.
- Optimized MapReduce Jobs to use HDFS efficiently by using various compression mechanisms.
- Experience writing MapReduce programs in Java for data extraction, transformation and aggregation from multiple file formats including XML, JSON and CSV.
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded the data into HDFS, and extracted data from Teradata into HDFS using Sqoop.
- Loaded datasets from Teradata to HDFS and Hive daily.
- Worked on NoSQL database including HBase.
- Analyzed large amounts of data sets to determine optimal way to aggregate and report on it.
- Used Impala to query the Hadoop data stored in HDFS.
- Imported and Exported Data from Different Relational Data Sources like DB2, SQL Server, Teradata to HDFS using Sqoop.
- Worked on Oozie workflow engine for job scheduling.
- Experienced in managing and reviewing the Hadoop log files using Linux commands and Shell scripting.
- Worked with Avro Data Serialization system to work with JSON data formats.
- Built applications using Maven and integrated them with continuous integration servers like Jenkins.
- Used Enterprise Data Warehouse database to store the information.
- Responsible for preparing technical specification documents, analyzing functional requirements, development and maintenance of code.
- Worked with the Data Science team to gather requirements for various data mining projects.
- Worked in Agile SCRUM Team and responsible for coordinating deployment of code to QA and Production environments.
- Extracted feeds from social media sites such as Facebook and Twitter using Python scripts.
- Developed and implemented Apache NiFi across various environments and wrote QA scripts in Python for tracking files.
- Implemented a prototype for the complete requirements using Splunk, Python and machine learning concepts.
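A QA script for tracking files, of the kind mentioned above, can be reduced to a directory-snapshot diff. This is a simplified sketch, not the production script; the scratch directory and file names are invented for the demo.

```python
import os
import tempfile

def new_files(directory, seen):
    """Return files that appeared since the last snapshot, plus the new snapshot."""
    current = set(os.listdir(directory))
    return sorted(current - seen), current

# Demo against a scratch directory standing in for a NiFi landing zone.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "feed_1.json"), "w").close()
    added_first, snapshot = new_files(d, set())   # first scan sees feed_1
    open(os.path.join(d, "feed_2.json"), "w").close()
    added_next, snapshot = new_files(d, snapshot)  # second scan sees only feed_2

print(added_first, added_next)  # ['feed_1.json'] ['feed_2.json']
```

A real tracker would persist the snapshot between runs (e.g. to a state file) so restarts do not re-report old files.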
Environment: Hadoop, HDFS, Zookeeper, MapReduce, Hive, Pig, Sqoop, JSON, GIT, Teradata, HBase, Linux, Agile.
- Interacted with the business team for detailed specifications on requirements and issue resolution.
- Coded Java servlets to control and maintain session state and handle user requests.
- Used JDBC to connect to the backend database and developed stored procedures.
- Developed code to handle web requests involving Request Handlers, Business Objects, and Data Access Objects.
- Created JSP pages, including the use of JSP custom tags and other JavaBeans methods.
- Implemented Struts MVC Paradigm components such as Action Mapping, Action class, Action Form, Validation Framework, Struts Tiles and Struts Tag Libraries.
- Involved in the development of the front end of the application using Struts framework and interaction with controller java classes.
- Provided development support for system testing, user acceptance testing and production, and deployed the application on the JBoss application server.
- Wrote and executed efficient SQL queries (CRUD operations), JOINs on multiple tables, to create and test sample test data in Oracle Database using Oracle SQL Developer.
- Developed style sheets to provide dynamism to the pages; extensively involved in unit and system testing using JUnit and in critical bug fixing.
- Utilized base UML methodologies and use cases modeled by architects to develop the front-end interface; class, sequence and state diagrams were developed using Visio.
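The multi-table JOIN queries described above can be illustrated against an in-memory SQLite database rather than Oracle; the table names and data are invented for the example, and the shape of the query (LEFT JOIN plus aggregation) is what carries over.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO orders VALUES (10, 1, 99.5), (11, 1, 20.0);
""")

# LEFT JOIN keeps customers with no orders; COALESCE turns the NULL sum into 0.
rows = con.execute("""
    SELECT c.name, COUNT(o.id), COALESCE(SUM(o.total), 0)
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Acme', 2, 119.5), ('Globex', 0, 0)]
```

The same pattern, run through JDBC and Oracle SQL Developer, is how the CRUD and JOIN test data described above would be exercised.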