- 8+ years of technical expertise in all phases of the SDLC (Software Development Life Cycle), with a major concentration on Big Data analysis frameworks, various relational databases, NoSQL databases, and Java/J2EE technologies, following recommended software practices.
- Worked on diverse enterprise applications in the Banking, Financial, and Health Care sectors as a Big Data Engineer, with a good understanding of the Hadoop framework and various data analysis tools.
- 4+ years of industrial IT experience in data manipulation using Big Data Hadoop ecosystem components: MapReduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Spark, Kafka, Flume, Sqoop, Oozie, Avro, and AWS, plus Spark integration with Cassandra, Solr, and Zookeeper.
- Extensive experience working with the Cloudera (CDH4 & CDH5) and Hortonworks Hadoop distributions, as well as AWS Amazon EMR, to fully leverage and implement new Hadoop features.
- Hands-on experience with the data ingestion tools Kafka and Flume and the workflow management tool Oozie.
- Strong experience and knowledge of real-time data analytics using Spark Streaming, Kafka, and Flume. Good experience writing Spark applications in Python and Scala.
- Experience developing Spark programs for batch and real-time processing; developed Spark Streaming applications for real-time processing.
- Implemented pre-defined operators in Spark such as map, flatMap, filter, reduceByKey, groupByKey, aggregateByKey, and combineByKey.
- Involved in converting Cassandra/Hive/SQL queries into Spark transformations using RDDs and Scala.
- Experience developing a data pipeline using Kafka to store data in HDFS.
- Knowledge of unifying data platforms using Kafka producers/consumers and implementing pre-processing with Storm topologies.
- Experience processing Avro data files using Avro tools and MapReduce programs.
- Hands-on experience writing MapReduce programs in Java to handle different data sets using Map and Reduce tasks.
- Hands-on experience with Sequence files, RC files, combiners, counters, dynamic partitions, and bucketing for best practices and performance improvement.
- Worked with join patterns and implemented map-side and reduce-side joins using MapReduce.
- Developed multiple MapReduce jobs to perform data cleaning and preprocessing.
- Designed Hive queries and Pig scripts to perform data analysis, data transfer, and table design.
- Implemented ad-hoc queries using Hive to perform analytics on structured data.
- Expertise in writing Hive UDFs and Generic UDFs to incorporate complex business logic into Hive queries.
- Experienced in optimizing Hive queries by tuning configuration parameters.
- Involved in designing the data model in Hive for migrating the ETL process into Hadoop and wrote Pig Scripts to load data into Hadoop environment.
- Compared the performance of Hive and IBM Big SQL for our data warehousing systems.
- Experience running large ad-hoc queries with low latency, reducing response time by implementing IBM Big SQL.
- Wrote and implemented custom UDFs in Pig for data filtering.
- Implemented Sqoop for large dataset transfers between Hadoop and RDBMSs.
- Extensively used Apache Flume to collect the logs and error messages across the cluster.
- Worked on NoSQL databases such as HBase, Cassandra, and MongoDB.
- Experienced in performing real time analytics on HDFS using HBase.
- Used Cassandra CQL with Java APIs to retrieve data from Cassandra tables.
- Extracted and updated data in MongoDB using the mongoimport and mongoexport command-line utilities.
- Experience writing shell scripts to dump shared data from MySQL servers to HDFS.
- Worked on Implementing and optimizing Hadoop/MapReduce algorithms for Big Data analytics.
- Good knowledge of AWS infrastructure services: Amazon Simple Storage Service (Amazon S3), EMR, and Amazon Elastic Compute Cloud (Amazon EC2).
- Extensive knowledge of cloud-based technologies on Amazon Web Services (AWS): VPC, EC2, Route 53, S3, DynamoDB, ElastiCache, Glacier, RRS, CloudWatch, CloudFront, Kinesis, Redshift, SQS, SNS, and RDS.
- Worked with AWS services such as EMR and EC2 for fast and efficient processing of Big Data.
- Hands-on experience developing applications with Java/J2EE: Servlets, JSP, EJB, SOAP web services, JNDI, JMS, JDBC, Hibernate, Struts, Spring, XML, HTML, XSD, XSLT, PL/SQL, and the Oracle 10g and MS SQL Server RDBMSs.
- Worked with Oozie and ZooKeeper to manage job flow and coordination in the cluster.
- Experience in performance tuning, monitoring the Hadoop cluster by gathering and analyzing the existing infrastructure using Cloudera manager.
- Added security to the cluster by integrating Kerberos.
- Worked with different file formats (ORC, text) and compression codecs (Gzip, Snappy, LZO).
- Worked with Business Intelligence (BI) teams to generate reports and design ETL workflows in Tableau.
- Deployed data from various sources into HDFS and built reports using Tableau.
- Experienced in writing and implementing unit test cases using testing frameworks such as JUnit, EasyMock, and Mockito.
- Worked on Talend Open Studio and Talend Integration Suite.
- Adequate knowledge and working experience with Agile and waterfall methodologies.
- Good understanding of all aspects of testing, such as unit, regression, agile, white-box, and black-box testing.
- Extensive experience implementing and consuming REST-based web services.
- Good knowledge of web/application servers such as Apache Tomcat, IBM WebSphere, and Oracle WebLogic.
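As an illustration of the map-side join pattern mentioned above, here is a minimal pure-Python sketch (hypothetical names; not the production MapReduce code, where the small table would typically be shipped to mappers via the distributed cache):

```python
def map_side_join(large_rows, small_table):
    """Join a large stream of (key, value) rows against a small table by
    holding the small table as an in-memory dict on the map side, so no
    shuffle/reduce phase is needed -- the map-side join pattern."""
    lookup = dict(small_table)
    for key, value in large_rows:
        if key in lookup:
            yield key, (value, lookup[key])

# Example: join transaction rows against a small account lookup table.
joined = list(map_side_join([(1, "txn-a"), (2, "txn-b"), (3, "txn-c")],
                            [(1, "acct-x"), (3, "acct-y")]))
```

A reduce-side join, by contrast, tags records from both inputs with their source and relies on the shuffle to co-locate equal keys at a reducer; the map-side variant avoids that shuffle when one input fits in memory.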
Big Data Ecosystems: Hadoop, MapReduce, HDFS, Zookeeper, Hive, Pig, Sqoop, Oozie, Flume, YARN, Spark, NiFi
Database Languages: SQL, PL/SQL
Programming Languages: Java, Scala
Frameworks: Spring, Hibernate, JMS
Web Services: RESTful web services
Databases: Oracle, MySQL, HBase, Cassandra, MongoDB
IDE: Eclipse, IntelliJ
Platforms: Windows, Linux, Unix
Application Servers: Apache Tomcat, WebSphere, WebLogic, JBoss
Methodologies: Agile, Waterfall
ETL Tools: Talend
Confidential, Chicago, IL
Sr. Spark/Scala Developer
- Worked with Spark to improve performance and optimize the existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and Pair RDDs.
- Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala.
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved the data in Parquet format in HDFS.
- Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as a data pipe-line system.
- Configured Spark Streaming to receive ongoing information from Kafka and store the stream data in HDFS.
- Used Spark and Spark SQL to read Parquet data and create Hive tables using the Scala API.
- Experienced in using the Spark application master to monitor Spark jobs and capture their logs.
- Worked on Spark using Python and Spark SQL for faster testing and processing of data.
- Implemented Spark sample programs in Python using PySpark.
- Analyzed SQL scripts and designed solutions to implement them using PySpark.
- Developed PySpark code to mimic the transformations performed in the on-premise environment.
- Developed multiple Kafka Producers and Consumers as per the software requirement specifications.
- Used Spark Streaming APIs to perform transformations and actions on the fly, building a common learner data model that gets data from Kafka in near real time and persists it to Cassandra.
- Experience in building Real-time Data Pipelines with Kafka Connect and Spark Streaming.
- Used Kafka and Kafka brokers: initiated the Spark context, processed live streaming information with RDDs, and used Kafka to load data into HDFS and NoSQL databases.
- Used Zookeeper to store offsets of messages consumed for a specific topic and partition by a specific Consumer Group in Kafka.
- Used Kafka functionality such as distribution, partitioning, and the replicated commit log service for messaging systems by maintaining feeds; created applications that monitor consumer lag within Apache Kafka clusters.
- Developed end to end data processing pipelines that begin with receiving data using distributed messaging systems Kafka for persisting data into Cassandra.
- Involved in Cassandra cluster planning, with a good understanding of Cassandra cluster mechanisms including replication strategies, snitches, gossip, consistent hashing, and consistency levels.
- Responsible for developing a Spark Cassandra connector to load data from flat files into Cassandra for analysis; modified the cassandra.yaml and cassandra-env.sh files to set various configuration properties.
- Used Sqoop to import data into Cassandra tables from relational databases such as Oracle and MySQL; designed column families in Cassandra, performed data transformations, and exported the transformed data to Cassandra per business requirements.
- Worked on Hortonworks-HDP 2.5 distribution.
- Developed efficient MapReduce programs for filtering out the unstructured data and developed multiple MapReduce jobs to perform data cleaning and preprocessing on Hortonworks.
- Used Hortonworks Apache Falcon for data management and pipeline process in the Hadoop cluster.
- Implemented a data interface to get customer information using a REST API, pre-processed the data using MapReduce 2.0, and stored it in HDFS (Hortonworks).
- Maintained the ELK stack (Elasticsearch, Logstash, and Kibana) and wrote Spark scripts using the Scala shell.
- Worked in AWS environment for development and deployment of custom Hadoop applications.
- Strong experience working with Elastic MapReduce (EMR) and setting up environments on Amazon EC2 instances.
- Wrote Oozie workflows to run Sqoop and HQL scripts in Amazon EMR.
- Involved in creating custom UDFs for Pig and Hive to incorporate Python methods and functionality into Pig Latin and HQL (HiveQL).
- Developed shell scripts to generate Hive CREATE statements from the data and load data into the tables.
- Involved in writing custom MapReduce programs using the Java API for data processing.
- Created Hive tables as per requirements, either internal or external, defined with appropriate static or dynamic partitions and bucketing for efficiency.
- Developed Hive queries for analysts by loading and transforming large sets of structured and semi-structured data using Hive.
- Worked on Apache NiFi: executed Spark and Sqoop scripts through NiFi, created scatter-and-gather patterns, ingested data from Postgres to HDFS, fetched Hive metadata and stored it in HDFS, and created a custom NiFi processor for filtering text from flow files.
- Provided cluster coordination services through ZooKeeper.
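The Kafka -> Spark Streaming -> Cassandra flow described in this role can be sketched in plain Python (a simplified micro-batch simulation with hypothetical names; a dict stands in for the Cassandra table, and a list stands in for the Kafka topic):

```python
from collections import defaultdict

def flush(batch, store):
    """Aggregate one micro-batch of keys and upsert counts into the store."""
    counts = defaultdict(int)
    for key in batch:
        counts[key] += 1
    for key, n in counts.items():
        store[key] = store.get(key, 0) + n

def process_micro_batches(stream, batch_size, store):
    """Consume records in fixed-size micro-batches (as Spark Streaming
    does per batch interval) and persist per-key counts into `store`."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            flush(batch, store)
            batch = []
    if batch:  # flush the final partial batch
        flush(batch, store)

store = {}
process_micro_batches(["a", "b", "a", "a", "c"], 2, store)
```

The upsert in `flush` mirrors why idempotent writes matter in such pipelines: replaying a batch after a failure must not corrupt the persisted aggregates.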
Environment: HDP 2.3.4, Hadoop, Hive, HDFS, Spark, Spark SQL, Spark Streaming, Scala, Kafka, AWS, Cassandra, Hortonworks, ELK, Java, and Agile methodologies.
Confidential, Mountain View, CA
- Performed advanced procedures like text analytics and processing using the in-memory computing capabilities of Spark.
- Developed Spark code using Scala and Spark SQL for faster processing and testing.
- Worked on Spark SQL to join multiple Hive tables, write the results to a final Hive table, and store them on S3.
- Implemented Spark RDD transformations to map business analysis and applied actions on top of the transformations.
- Created Spark jobs to run lightning-fast analytics over the Spark cluster.
- Evaluated Spark's performance vs. Impala on transactional data; used Spark transformations and aggregations to compute min, max, and average on transactional data.
- Extracted files from MongoDB through Sqoop, placed them in HDFS, and processed them.
- Worked on MongoDB for distributed storage and processing.
- Used MongoDB to store processed products and commodities data, which can be streamed further downstream into the web application (Green Box/Zoltar).
- Experienced in migrating HiveQL to Impala to minimize query response time.
- Experience using Impala for data processing on top of Hive for better utilization.
- Performed querying of both managed and external Hive tables using Impala.
- Developed Impala scripts for end-user/analyst requirements for ad-hoc analysis.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations to build the data model, and persisted the data in HDFS.
- Responsible for creating Hive tables, loading them with data, and writing Hive queries.
- Worked on user-defined functions in Hive to load data from HDFS and run aggregation functions across multiple rows.
- Optimized HiveQL and Pig scripts by using execution engines such as Tez and Spark.
- Wrote MapReduce jobs using the Java API and Pig Latin.
- Responsible for creating mappings and workflows to extract and load data from relational databases, flat-file sources, and legacy systems using Talend.
- Fetched and generated monthly reports and visualized them using Tableau.
- Used the Oozie workflow engine to run multiple Hive and Pig jobs.
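The min/max/average aggregations over transactional data mentioned in this role amount to the following (a plain-Python sketch with hypothetical names, standing in for the distributed Spark aggregations):

```python
def txn_stats(amounts):
    """Compute the min / max / average aggregations over a list of
    transaction amounts -- the same reductions run via Spark at scale."""
    if not amounts:
        raise ValueError("no transactions to aggregate")
    return {
        "min": min(amounts),
        "max": max(amounts),
        "avg": sum(amounts) / len(amounts),
    }

stats = txn_stats([120.0, 75.5, 310.25, 42.0])
```

In a distributed setting each of these reductions is associative (min, max, sum, count), which is what lets Spark compute them per partition and then merge partial results.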
Environment: Hadoop, Cloudera, Flume, HBase, HDFS, MapReduce, YARN, Hive, Pig, Sqoop, Oozie, Tableau, Java, Solr, JUnit, agile methodologies
Big Data Hadoop Consultant
- Collected and aggregated large amounts of web log data from different sources such as web servers, mobile and network devices using Apache Flume and stored the data into HDFS for analysis.
- Collected data from various Flume agents imported on various servers using multi-hop flows.
- Ingested real-time and near-real-time (NRT) streaming data into HDFS using Flume.
- Worked with NoSQL databases such as HBase, creating HBase tables to load large sets of semi-structured data.
- Involved in transforming data from Mainframe tables to HDFS, and HBase tables using Sqoop.
- Responsible for loading data into HBase using the HBase shell as well as the HBase client API.
- Experienced in handling administration activities using Cloudera Manager.
- Involved in developing Impala scripts for extraction, transformation, loading of data into data warehouse.
- Experience working with Apache Solr for indexing and querying.
- Created custom Solr query segments to optimize search matching.
- Involved in data ingestion into HDFS using Sqoop for full loads and Flume for incremental loads from a variety of sources such as web servers, RDBMSs, and data APIs.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs, which run independently based on time and data availability.
- Implemented the workflows using Apache Oozie framework to automate tasks.
- Involved in migrating tables from RDBMSs into Hive tables using Sqoop and later generating visualizations using Tableau.
- Created and maintained Technical documentation for launching Hadoop Clusters and for executing Pig Scripts.
- Involved in writing optimized Pig scripts, along with developing and testing Pig Latin scripts.
- Designed and implemented incremental imports into Hive tables and wrote Hive queries to run on Tez.
- Created ETL mappings with Talend Integration Suite to pull data from sources, apply transformations, and load data into the target database.
- Coordinated with the Scrum Master to deliver agreed user stories on time every sprint.
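The incremental-import pattern used in this role (as in Sqoop's incremental append mode, which tracks a last-value checkpoint on a check column) can be sketched in plain Python with hypothetical names:

```python
def incremental_import(rows, last_value, check_column="id"):
    """Pull only rows whose check column exceeds the saved last_value,
    and return the new checkpoint alongside them -- the same contract
    as an incremental append import from an RDBMS into Hive/HDFS."""
    new_rows = [r for r in rows if r[check_column] > last_value]
    checkpoint = max((r[check_column] for r in new_rows), default=last_value)
    return new_rows, checkpoint

# Example: only rows added since the last run (checkpoint id=1) are imported.
source = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
new_rows, checkpoint = incremental_import(source, last_value=1)
```

Persisting `checkpoint` between runs is what makes repeated imports pull only the delta rather than a full load.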
Environment: Hadoop, Cloudera, Flume, HBase, HDFS, MapReduce, YARN, Hive, Pig, Sqoop, Oozie, Tableau, Java, Solr.