- 8+ years of IT experience in the analysis, design, development, implementation, maintenance and support of Java-based enterprise applications, with experience in Big Data, Hadoop development and ecosystem analytics.
- Around 5 years of experience with Hadoop ecosystem components HDFS, MapReduce (MRv1, YARN), Pig, Hive, HBase, Sqoop, Flume, Kafka, Impala and Oozie, programming in Spark using Scala, and exposure to Cassandra.
- Good knowledge of Amazon Web Services (AWS) concepts such as the EMR and EC2 web services, which provide fast and efficient processing for Teradata Big Data Analytics.
- Expertise in data development on the Hortonworks HDP platform and Hadoop ecosystem tools such as Hadoop, HDFS, Spark, Zeppelin, Hive, HBase, Sqoop, Flume, Atlas, Solr, Pig, Falcon, Oozie, Hue, Tez, Apache NiFi and Kafka.
- Expertise in developing Spark applications using the Spark Core, Spark SQL and Spark Streaming APIs in Scala and Python, and deploying them on YARN in client and cluster modes using spark-submit.
- Very good experience with Apache Spark, Spark Streaming, Spark SQL and NoSQL databases such as Cassandra and HBase.
- Expert in Amazon EMR, Spark, Kinesis, S3, Boto3, Elastic Beanstalk, ECS, CloudWatch, Lambda, ELB, VPC, ElastiCache, DynamoDB, Redshift, RDS, Athena, Zeppelin and Airflow.
- Strong knowledge of creating and monitoring Hadoop clusters on Amazon EC2, VMs, Hortonworks Data Platform 2.1 & 2.2, CDH3/CDH4 with Cloudera Manager, and Azure HDInsight distributions on Linux (Ubuntu, etc.).
- Experience in implementing OLAP multi-dimensional cube functionality using Azure SQL Data Warehouse.
- In-depth understanding of Spark Architecture including Spark Core, Spark SQL, Data Frames, Spark Streaming, Spark MLlib.
- Expertise in writing Spark RDD transformations, actions, DataFrames and case classes for the required input data, and in performing data transformations using Spark Core.
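The transformation/action distinction behind these bullets can be illustrated without a Spark cluster. The sketch below is plain Python (a hypothetical stand-in, not Spark's actual API): generators defer work the way lazy RDD transformations do, and only an action such as a collect forces evaluation.

```python
# Conceptual sketch of Spark's lazy transformations vs. eager actions,
# using plain Python generators (hypothetical stand-in, not the Spark API).

def transform_map(rdd, fn):
    # Like RDD.map: builds a lazy pipeline; nothing runs yet.
    return (fn(x) for x in rdd)

def transform_filter(rdd, pred):
    # Like RDD.filter: also lazy.
    return (x for x in rdd if pred(x))

def action_collect(rdd):
    # Like RDD.collect: forces evaluation of the whole pipeline.
    return list(rdd)

lines = ["spark core", "spark sql", "hive"]
pipeline = transform_filter(
    transform_map(lines, str.upper),      # map: upper-case each line
    lambda s: s.startswith("SPARK"),      # filter: keep Spark-related lines
)
result = action_collect(pipeline)         # only now does any work happen
print(result)
```

In real Spark the same chaining applies: nothing executes until an action such as `collect()` or `count()` is invoked on the final RDD or DataFrame.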
- Good knowledge of Hadoop Architecture and various components such as YARN, HDFS, NodeManager, ResourceManager, JobTracker, TaskTracker, NameNode, DataNode and MapReduce concepts.
- Strong knowledge of NoSQL databases (the column-oriented HBase and Cassandra, and the document store MongoDB) and their integration with Hadoop clusters.
- Experience in installing, configuring, supporting and managing the Hortonworks/Cloudera Hadoop platform, along with CDH3 and CDH4 clusters.
- Solid SQL skills: can write complex SQL queries, functions, triggers and stored procedures for backend testing, database testing and end-to-end testing.
- Experienced with Hadoop clusters on the Azure HDInsight platform, and deployed data analytics solutions using tools such as Spark and BI reporting tools.
- Very good understanding of SQL, ETL and data warehousing technologies, with sound knowledge of designing data warehousing applications using tools such as Teradata, Oracle and SQL Server.
- Experience in writing build scripts using Maven and working with continuous integration systems such as Jenkins.
- Expertise in using Kafka as a messaging system to implement real-time streaming solutions, and implemented Sqoop for large data transfers from RDBMS to HDFS/HBase/Hive and vice versa.
- Experience in migrating ETL processes into Hadoop, designing Hive data models and writing Pig Latin scripts to load data into Hadoop.
- Good knowledge of Cloudera distributions and of AWS services such as Amazon Simple Storage Service (S3), EC2 and EMR.
- Good knowledge of Object-Oriented Analysis and Design (OOAD) and Java design patterns, with a good level of experience in Core Java and Java EE technologies such as JDBC, Servlets and JSP.
Hadoop/Big Data: HDFS, MapReduce, Hive, Pig, HBase, Sqoop, Impala, Oozie, Kafka, Spark, Zookeeper, Storm, YARN, AWS and Azure.
Java & J2EE Technologies: Core Java, Servlets, JSP, JDBC, JNDI, JavaBeans
IDEs: Eclipse, NetBeans, IntelliJ
Frameworks: MVC, Struts, Hibernate, Spring
Databases: Oracle … MySQL, DB2, Teradata, MS-SQL Server.
NoSQL Databases: HBase, Cassandra, MongoDB
Web Servers: WebLogic, WebSphere, Apache Tomcat
Network Protocols: TCP/IP, UDP, HTTP, DNS, DHCP
ETL Tools: Informatica, Talend
Web Development: HTML, DHTML, XHTML, CSS, JavaScript, AJAX
XML/Web Services: XML, XSD, WSDL, SOAP, Apache Axis, DOM, SAX, JAXP, JAXB, XMLBeans.
Methodologies/Design Patterns: OOAD, OOP, UML, MVC2, DAO, Factory pattern, Session Facade
Operating Systems: Windows, AIX, Sun Solaris, HP-UX.
Confidential, Bloomington, IN
SR. BIGDATA DEVELOPER/ENGINEER
- Developed Pig scripts to help perform analytics on JSON and XML data, and created Hive tables (external and internal) with static and dynamic partitions, bucketing the tables for efficiency.
- Used HiveQL to analyze the partitioned and bucketed data and compute various metrics for reporting, and performed data transformations by writing MapReduce and Pig jobs per business requirements.
- Played a significant role in the development of the Confidential Data Lake and in building the Confidential Data Cube on a Microsoft Azure HDInsight cluster.
- Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for analysis, and configured Spark Streaming to consume from Kafka and store the results in HDFS.
- Designed the architecture of the data pipeline/ingestion, optimized ETL workflows, and developed syllabus/curriculum data pipelines from the Syllabus/Curriculum web services to HBase and Hive tables.
- Performed data analysis, feature selection, feature extraction using Apache Spark Machine Learning streaming libraries in Python.
- Wrote Azure PowerShell scripts to copy or move data from the local file system to HDFS-backed Blob storage.
- Designed and developed a full-text search feature with multi-tenant Elasticsearch after collecting real-time data through Spark Streaming.
- Involved in creating, transforming and actions on RDDs, DataFrames, Datasets using Scala, Python and integrating the applications to Spark framework using SBT and MAVEN build automation tools.
- Enabled and configured Hadoop services such as HDFS, YARN, Hive, Ranger, HBase, Kafka, Sqoop, Zeppelin Notebook and Spark/Spark2, and was involved in analyzing log data to predict errors using Apache Spark.
- Evaluated deep learning algorithms for text summarization using Python, Keras, TensorFlow and Theano on a Cloudera Hadoop system.
- Designed the database schema and created a data model to store real-time tick data in a NoSQL store.
- Extracted real-time data using Kafka and Spark Streaming by creating DStreams, converting them into RDDs, processing them and storing the results in Cassandra.
- Used the DataStax Spark-Cassandra connector to load data into Cassandra, and used CQL to analyze data from Cassandra tables for quick searching, sorting and grouping.
- Used Sqoop to ingest from DBMS and Python to ingest logs from client data centers; developed Python and bash scripts for automation, and implemented MapReduce jobs using the Java API and in Python using Spark.
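A minimal sketch of the kind of log-ingestion automation described above, in stdlib-only Python; the log line format and field names here are hypothetical, not taken from any specific client system.

```python
import re
from collections import Counter

# Hypothetical log format: "<date> <time> <LEVEL> <message>".
LOG_LINE = re.compile(r"^(?P<ts>\S+ \S+) (?P<level>[A-Z]+) (?P<msg>.*)$")

def summarize_levels(lines):
    """Count log records per severity level, skipping malformed lines."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            counts[m.group("level")] += 1
    return counts

sample = [
    "2021-06-01 10:00:01 INFO job started",
    "2021-06-01 10:00:02 ERROR connection refused",
    "2021-06-01 10:00:03 ERROR timeout",
    "not a log line",
]
print(summarize_levels(sample))
```

In a real pipeline, a script like this would run over files pulled from the data-center hosts before the summaries are landed in HDFS.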
- Imported data from RDBMS systems like MySQL into HDFS using Sqoop and developed Sqoop jobs to perform incremental imports into Hive tables.
- Involved in loading and transforming of large sets of structured and semi structured data and created Data Pipelines as per the business requirements and scheduled it using Oozie Coordinators.
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala; initial versions were done in Python (PySpark).
- Integrated MapReduce with HBase to bulk-import data into HBase using MapReduce programs, used Impala and wrote queries to fetch data from Hive tables, and developed several MapReduce jobs using the Java API.
- Worked with Apache SOLR to implement indexing and wrote Custom SOLR query segments to optimize the search.
- Created Kafka/Spark Streaming data pipelines for consuming data from external sources and performing transformations in Scala, and contributed to developing a data pipeline to load data from different sources such as web, RDBMS and NoSQL into Apache Kafka or a Spark cluster.
- Developed multiple POCs using Scala and PySpark, deployed them on the YARN cluster, and compared the performance of Spark and SQL.
- Worked with XML, extracting tag information from compressed blob data types using XPath and Scala XML libraries.
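The XPath-based tag extraction above was done with Scala XML libraries; an equivalent can be sketched with Python's stdlib `xml.etree.ElementTree`, which supports a limited XPath subset. The document shape below is hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative XPath-style extraction with the stdlib ElementTree module
# (the actual work used Scala XML; this XML layout is made up for the demo).
doc = """
<orders>
  <order id="1"><status>shipped</status><total>19.99</total></order>
  <order id="2"><status>pending</status><total>5.00</total></order>
</orders>
"""

root = ET.fromstring(doc)
# ElementTree supports predicates on child text, e.g. [status='shipped']:
shipped_ids = [o.get("id") for o in root.findall("./order[status='shipped']")]
totals = [float(t.text) for t in root.findall("./order/total")]
print(shipped_ids, totals)
```

For compressed blob columns, the bytes would first be decompressed (e.g. with `gzip.decompress`) before being handed to the parser.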
- Developed Pig and Hive UDFs in Java to implement business logic for processing data per requirements, and used UDFs from PiggyBank for sorting and preparing the data.
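The UDFs above were written in Java; Hive can also apply row-level logic through a streaming script invoked with `TRANSFORM`, where rows arrive tab-separated on stdin and results go to stdout. This is a hedged sketch of that pattern (a different mechanism than the Java UDFs, with a hypothetical column layout):

```python
import sys

def process_row(line):
    """Normalize a hypothetical (user_id, amount) row:
    trim and upper-case the id, round the amount to two decimals."""
    user_id, amount = line.rstrip("\n").split("\t")
    return "%s\t%.2f" % (user_id.strip().upper(), float(amount))

if __name__ == "__main__":
    # Hive streams each input row to this script via TRANSFORM.
    for raw in sys.stdin:
        if raw.strip():
            print(process_row(raw))
```

In HiveQL the script would be wired in roughly as `SELECT TRANSFORM(user_id, amount) USING 'python normalize.py' AS (user_id, amount) FROM t` (script name hypothetical).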
- Developed Spark scripts using the Scala IDE as per business requirements.
- Developed Spark jobs using Scala on top of YARN/MRv2 for interactive and batch analysis, was involved in querying data using Spark SQL on top of the Spark engine for faster processing of data sets, and worked on implementing Spark Framework, a Java-based web framework.
- Created Hive tables, loaded data and wrote Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
- Used Docker and Kubernetes to manage microservices in support of continuous integration and continuous delivery.
- Prepared and presented metrics for team utilization and environment status in Power BI, PowerPoint and SQL Azure, migrated databases to the SQL Azure cloud platform, and performed the related performance tuning.
Environment: Hadoop, Hive, HDFS, Pig, Sqoop, Python, Spark SQL, Machine Learning, MongoDB, Azure HDInsight, Azure DW, Azure SQL, Azure ADLS, Azure Storage Blob, Snowflake, Oozie, ETL, Tableau, Spark, Spark Streaming, PySpark, Kafka, Netezza, Apache Solr, Cassandra, Cloudera Distribution, Java, Impala, Web Servers, Maven Build, MySQL, Agile-Scrum.
Confidential, Chicago IL
SR. BIGDATA ENGINEER/DEVELOPER
- Explored Spark for improving the performance and optimization of existing algorithms in Hadoop, using Spark Context, Spark SQL, DataFrames and Spark on YARN.
- Worked on cloud computing infrastructure (e.g. Amazon Web Services EC2) and considerations for scalable, distributed systems
- Worked with GoCD (a CI/CD tool) to deploy applications, and have experience with the Munin framework for Big Data testing.
- Involved in file movements between HDFS and AWS S3 and extensively worked with S3 bucket in AWS and converted all Hadoop jobs to run in EMR by configuring the cluster according to the data size.
- Wrote Spark applications for data validation, cleansing, transformations and custom aggregations; imported data from different sources into Spark RDDs for processing, developed custom aggregate functions using Spark SQL, and performed interactive querying.
- Implemented an AWS data lake leveraging S3, Terraform, EC2 and Lambda for data processing and storage, writing complex SQL queries and analytical and aggregate functions on views in the Snowflake data warehouse to develop near-real-time visualizations using Tableau Desktop and Alteryx.
- Worked on setting up and configuring AWS EMR clusters, and used AWS IAM to grant users fine-grained access to AWS resources.
- Responsible for developing a data pipeline with Amazon AWS to extract data from weblogs and store it in HDFS, and imported data from different sources such as AWS S3 and the local file system into Spark RDDs.
- Worked on data pipeline creation to convert incoming data to a common format, prepare data for analysis and visualization, migrate between databases, share data processing logic across web apps, batch jobs and APIs, and consume large XML, CSV and fixed-width files; created data pipelines in Kafka to replace batch jobs with real-time data.
- Involved in data pipeline using Pig, Sqoop to ingest cargo data and customer histories into HDFS for analysis.
- Worked extensively with Avro and Parquet files, converting data between the two formats; parsed semi-structured JSON data and converted it to Parquet using DataFrames in Spark.
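The actual JSON-to-Parquet conversion above was done with Spark DataFrames; this stdlib-only sketch shows the flattening step that turns nested JSON into the flat, columnar records a Parquet writer expects. The record shape is hypothetical.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted column names,
    e.g. {"user": {"name": ...}} -> {"user.name": ...}."""
    flat = {}
    for key, value in obj.items():
        column = prefix + key if not prefix else "%s.%s" % (prefix, key)
        if isinstance(value, dict):
            flat.update(flatten(value, column))
        else:
            flat[column] = value
    return flat

raw = '{"id": 7, "user": {"name": "ana", "geo": {"city": "Chicago"}}}'
record = flatten(json.loads(raw))
print(record)
```

Spark performs the equivalent schema inference and column projection automatically when reading JSON into a DataFrame and writing it back out as Parquet.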
- Involved in converting Hive/SQL queries into Spark Transformations using Spark RDDs and Scala and involved in using SQOOP for importing and exporting data between RDBMS and HDFS.
- Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Performed data masking and ETL processes using S3, Informatica Cloud, Informatica PowerCenter and Informatica Test Data Management to support the Snowflake data warehousing solution in the cloud.
- Involved in moving data between legacy relational databases and HDFS/HBase tables using Sqoop, in both directions.
- Processed web server logs by developing multi-hop Flume agents using the Avro sink, loaded them into MongoDB for further analysis, and worked on MongoDB NoSQL data modeling, tuning, disaster recovery and backup.
- Developed data pipeline using Spark, Hive and HBase to ingest customer behavioral data and financial histories into Hadoop cluster for analysis.
- Developed a Python script to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
- Worked with different file formats such as JSON, Avro and Parquet and compression techniques such as Snappy, and developed Python code for tasks, dependencies, SLA watchers and time sensors for each job, for workflow management and automation using Airflow.
- Developed shell scripts for dynamic partitions adding to hive stage table, verifying JSON schema change of source files, and verifying duplicate files in source location.
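The duplicate-file and schema-change checks above were shell scripts; the same two ideas can be sketched in stdlib Python. Inputs are in-memory here for illustration; a real job would read files from the source location (file names below are hypothetical).

```python
import hashlib

def find_duplicates(files):
    """Group file names whose contents hash to the same digest."""
    by_hash = {}
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        by_hash.setdefault(digest, []).append(name)
    return [names for names in by_hash.values() if len(names) > 1]

def schema_changed(baseline_record, new_record):
    """Flag a JSON schema change as any difference in top-level keys."""
    return set(baseline_record) != set(new_record)

files = {"a.json": b'{"id": 1}', "b.json": b'{"id": 2}', "c.json": b'{"id": 1}'}
print(find_duplicates(files))                      # a.json and c.json collide
print(schema_changed({"id": 1}, {"id": 1, "ts": 9}))
```

Hashing full contents (rather than comparing names or sizes) is what lets the duplicate check catch re-delivered files under new names.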
- Worked on CI/CD automation using tools such as Jenkins, SaltStack, Git, Vagrant, Docker, Elasticsearch and Grafana.
- Worked with importing metadata into Hive using Python and migrated existing tables and applications to work on AWS cloud (S3).
- Involved in writing scripts against Oracle, SQL Server and Netezza databases to extract data for reporting and analysis, and worked on importing and cleansing high-volume data from various sources such as DB2, Oracle and flat files into SQL Server.
- Managed containers using Docker by writing Dockerfiles, set up automated builds on Docker Hub, and installed and configured Kubernetes.
- Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and AWS cloud and making the data available in Athena and Snowflake.
- Extensively used Stash (Bitbucket) for code control, and worked on AWS components such as Airflow, Elastic MapReduce (EMR), Athena and Snowflake.
Environment: Spark, AWS, EC2, EMR, Hive, SQL Workbench, Genie Logs, Snowflake, Kibana, Sqoop, Spark SQL, Spark Streaming, Scala, Python, Hadoop (Cloudera Stack), Hue, Spark, Netezza, Kafka, HBase, HDFS, Hive, Pig, Sqoop, Oracle, ETL, AWS S3, AWS EMR, GIT, Grafana.
- Worked on loading disparate data sets coming from different sources into the BDPaaS (Hadoop) environment using Spark.
- Developed UNIX scripts to create batch loads for bringing large volumes of data from relational databases to the Big Data platform.
- Delivery experience on major Hadoop ecosystem components such as Pig, Hive, Spark, Kafka, Elasticsearch and HBase, with monitoring via Cloudera Manager.
- Used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in Amazon S3 bucket.
- Worked with the team fetching live stream data from DB2 into an HBase table using Spark Streaming and Apache Kafka.
- Implemented machine learning algorithms using Spark with Python, and worked with Spark, Storm, Apache Apex and Python.
- Involved in analyzing data coming from various sources and creating Meta-files and control files to ingest the data in to the Data Lake.
- Involved in configuring batch jobs to ingest the source files into the Data Lake, and developed Pig queries to load data into HBase.
- Leveraged Hive queries to create ORC tables, and developed Hive scripts to meet analysts' requirements for analysis.
- Implemented Kafka consumers to move data from Kafka partitions into Cassandra for near-real-time analysis, worked extensively on Hive to create, alter and drop tables, and was involved in writing Hive queries.
- Developed a workflow in Oozie to automate the tasks of loading data into HDFS and pre-processing it with Pig, and parsed high-level design specs into simple ETL coding and mapping standards.
- Created and altered HBase tables on top of data residing in Data Lake and Created external Hive tables on the Blobs to showcase the data to the Hive Meta Store.
- Involved in requirement and design phase to implement Streaming Architecture to use real time streaming using Spark and Kafka.
- Used Spark for interactive queries, processing of streaming data and integration with HBase database for huge volume of data.
- Used the Spark API for machine learning: translated a predictive model from SAS code to Spark, and used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Created Reports with different Selection Criteria from Hive Tables on the data residing in Data Lake.
- Worked on Hadoop Architecture and various components such as YARN, HDFS, NodeManager, Resource Manager, JobTracker, TaskTracker, NameNode, DataNode and MapReduce concepts.
- Deployed Hadoop components on the Cluster like Hive, HBase, Spark, Scala and others with respect to the requirement.
- Uploaded and processed terabytes of data from various structured and unstructured sources into HDFS (AWS cloud) using Sqoop.
- Implemented the Business Rules in Spark/ SCALA to get the business logic in place to run the Rating Engine.
- Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
- Used Spark UI to observe the running of a submitted Spark Job at the node level and used Spark to do Property Bag Parsing of the data to get the required fields of data.
- Extensively used ETL methodology to support data extraction, transformation and load processing, using Hadoop.
- Used both the Hive context and the SQL context of Spark for initial testing of Spark jobs, and used WinSCP and FTP to view the data storage structure on the server and to upload the JARs used for spark-submit.
- Developed code from scratch in Spark using SCALA according to the technical requirements.
Environment: Hadoop, MapReduce, YARN, Hive, Pig, HBase, Sqoop, Spark, Scala, MapR, Core Java, R, SQL, Python, Eclipse, Linux, Unix, HDFS, Impala, Cloudera, Kafka, Apache Cassandra, Oozie, Zookeeper, MySQL, PL/SQL