- 8+ years of professional IT experience, including Hadoop/Spark work in ingestion, storage, querying, processing, and analysis of big data.
- Good experience with the programming languages Scala and Java.
- Exposure to the design and development of database-driven systems.
- Good knowledge of Hadoop architectural components such as the Hadoop Distributed File System, NameNode, DataNode, TaskTracker, JobTracker, and MapReduce programming.
- Experience in developing and deploying applications using Hadoop-based components such as Hadoop MapReduce (MR1), YARN (MR2), HDFS, Hive, Pig, HBase, Flume, Sqoop, Spark (Streaming, Spark SQL), Storm, Kafka, Oozie, ZooKeeper, and Parquet.
- Good experience in general data analytics on distributed computing clusters such as Hadoop using Apache Spark, Impala, and Scala.
- Experience in implementing OLAP multidimensional cube functionality using Azure SQL Data Warehouse.
- Hands-on experience in importing and exporting data into HDFS and Hive using Sqoop.
- Exposure to column-oriented NoSQL databases such as HBase and Cassandra.
- Extensive experience working with structured, semi-structured, and unstructured data by implementing complex MapReduce programs using design patterns.
- Familiar with data architecture, including data ingestion pipeline design, Hadoop information architecture, data modeling and data mining, machine learning, and advanced data processing.
- Hands-on experience with major Big Data components: Apache Kafka, Apache Spark, ZooKeeper, and Avro.
- Experienced in extending Hive and Pig core functionality by writing custom UDFs and MapReduce scripts using Java and Python.
- Owned the design, development, and maintenance of ongoing metrics, reports, analyses, and dashboards using Tableau to drive key business decisions and communicate key concepts to readers.
- Worked cross-functionally with 5 different groups to help drive analytical ad hoc reporting and dashboard creation, and built forecasting models.
- Experience in rendering and delivering reports in desired formats using reporting tools such as Tableau.
- Hands-on experience working with the Amazon Web Services (AWS) cloud and its services, such as EC2, S3, Athena, RDS, VPC, IAM, Elastic Load Balancing, Lambda, Redshift, Auto Scaling, CloudFront, CloudWatch, and other services of the AWS family.
- Experienced in implementing unified data platforms using Kafka producers/consumers and implementing pre-processing using Storm topologies.
- Experience using various Hadoop distributions (Cloudera, Hortonworks, etc.) to fully implement and leverage new Hadoop features.
- Great team player and quick learner with effective communication, motivation, and organizational skills combined with attention to detail and business improvements.
- Experienced in the complete SDLC, including requirements gathering, design, development, testing, and production environments.
Hadoop/Big Data: HDFS, MapReduce, Spark, YARN, Kafka, Pig, Hive, Sqoop, Storm, Flume, Oozie, Impala, HBase, Hue, ZooKeeper.
Programming Languages: Java, PL/SQL, Pig Latin, Python, R, HiveQL, Scala, SQL
Development Tools: Eclipse, SVN, Git, Ant, Maven, SOAP UI
Databases: Greenplum, Oracle 11g/10g/9i, Teradata, MS SQL
NoSQL Databases: Apache HBase, MongoDB
Frameworks: Struts, Hibernate, and Spring MVC.
Distributed platforms: Hortonworks, Cloudera.
Operating Systems: UNIX, Ubuntu Linux and Windows 00/XP/Vista/7/8
Confidential, New Brunswick, NJ
Senior Hadoop/Spark Developer
- Created Spark jobs to see trends in data usage by users.
- Worked on creating real-time data streaming solutions using Apache Spark/Spark Streaming and Kafka.
- Developed Spark scripts using Scala shell commands as per the requirements.
- Involved in developing a MapReduce framework that filters bad and unnecessary records.
- Designed the column families in Cassandra.
- Ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra as per the business requirement.
- Developed Spark code using Scala and Spark SQL for faster processing and testing.
- Used the Spark API over Hadoop YARN as the execution engine for data analytics using Hive.
- Exported the analyzed data to relational databases using Sqoop for further visualization and report generation by the BI team.
- Created various kinds of reports using Power BI and Tableau based on the client's needs.
- Set up Kerberos principals and tested HDFS, Hive, Pig, and MapReduce access for the new users.
- Migrated the computational code from HQL to PySpark.
- Worked with the Spark ecosystem using Scala and Hive queries on different data formats such as text files and Parquet.
- Migrated HiveQL queries to Impala to minimize query response time.
- Responsible for migrating the code base from the Hortonworks platform to Amazon EMR; evaluated Amazon ecosystem components such as Redshift.
- Collected log data from web servers and integrated it into HDFS using Flume.
- Worked with NoSQL databases such as HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
- Developed Python scripts to clean the raw data.
- Developed Hive scripts in HiveQL to denormalize and aggregate the data.
- Implemented MapReduce counters to gather metrics of good records and bad records.
- Developed custom UDFs in Java to extend Hive and Pig functionality.
- Extracted the data from Teradata into HDFS/databases/dashboards using Spark Streaming.
- Loaded the Golden collection into Apache Solr using Morphline code for the business team.
- Worked on different file formats (ORCFILE, Parquet, Avro) and different Compression Codecs (GZIP, SNAPPY, LZO).
- Created applications using Kafka that monitor consumer lag within Apache Kafka clusters.
- Involved in testing the APS data loading, data seeding, and data bridging strategy.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build the common learner data model, which gets data from Kafka in near real time and persists it into Cassandra.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Maintained the Hadoop cluster on AWS EMR. Used AWS services such as EC2 and S3 for processing and storage of small data sets.
- Designed and documented REST/HTTP and SOAP APIs, including JSON data formats and API versioning strategy.
- Used Apache Kafka for collecting, aggregating, and moving large amounts of data from application servers.
- Used the Hibernate ORM framework with the Spring framework for data persistence and transaction management.
- Used the MLlib framework in Spark Streaming for auto-suggestions on predictive intelligence and maintenance.
- Developed Python code to gather the data from HBase (Cornerstone) and designed the solution for implementation using PySpark.
- Analyzed the performance of Spark Streaming and batch jobs using Spark tuning parameters.
- Worked along with the Hadoop Operations team in Hadoop cluster planning, installation, maintenance, monitoring, and upgrades.
- Used microservices for data visualization and to address the functional challenges of planning and implementing solutions.
- Implemented NiFi flow topologies to perform cleansing operations before moving data into HDFS.
- Started using Apache NiFi to copy the data from the local file system to HDP.
- Used the file system check utility (fsck) to check the health of files in HDFS.
- Used Amazon CloudWatch to monitor and track resources on AWS.
- Scheduled the ETL jobs in AWS Glue, developed using Lambda logic (boto3) and S3, to load data into DynamoDB and Redshift.
- Designed a data analysis pipeline in Python using Amazon Web Services such as S3, EC2, and Elastic MapReduce.
- Worked in an Agile development environment in two-week sprint cycles, dividing and organizing tasks. Participated in daily scrum and other design-related meetings.
Environment: Hadoop, Hive, MapReduce, Sqoop, Kafka, Spark, YARN, Pig, PySpark, Cassandra, Oozie, NiFi, Solr, Shell Scripting, HBase, Scala, AWS, Maven, Java, JUnit, Agile methodologies, Hortonworks, SOAP, Python, Teradata, MySQL.
Confidential, Los Angeles, CA
Senior Big Data Developer
- Worked extensively on Scala programming for Spark development.
- Worked extensively on designing and building scalable, flexible data solutions for batch, low-latency, search, and real-time data processing requirements using Spark, Kafka, HBase, Elasticsearch, and the Hadoop ecosystem.
- Worked extensively with the business on requirement gathering, analysis, and high-level design.
- Worked on the design and implementation of real-time streaming ingestion using Flume, Kafka, and Spark Streaming.
- Worked extensively on enrichment/ETL in real-time stream jobs using Spark Streaming and Spark SQL, loading the results into HBase.
- Worked with management teams on log analysis reports and with fellow developers to identify application issues.
- Worked extensively in writing Kafka Producers to ingest data into Kafka topics using Java 8.
- Utilized Apache Hadoop by Hortonworks to monitor and manage the Hadoop cluster.
- Completed data extraction, aggregation, and analysis in HDFS using PySpark and stored the required data in Hive.
- Built an ingestion framework that ingests files from SFTP to HDFS using Apache NiFi and ingests financial data into HDFS.
- Worked with the DBA to design reports for DB replica latency trends, analyzing the transaction logs to find the root cause of issues.
- Processed transactional logs using Spark, applying various ETL tasks to the log data and saving the results in the required formats.
- Worked on data ingestion to Kafka and on processing and storing the data using Spark Streaming.
- Involved in tuning the Cassandra cluster by changing parameters for read operations, compaction, memory cache, and row cache.
- Installed and configured Apache Hadoop and the Hive/Pig ecosystems.
- Created MapReduce jobs using Hive/Pig queries.
- Developed Pig UDFs to pre-process the data for analysis.
- Worked on the NoSQL database HBase for storing computed results.
- Worked extensively on the search engines Elasticsearch and Novus (in-house).
- Worked on workflow scheduling using Oozie.
- Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
- Worked on continuous integration, automation testing, and job scheduling using Jenkins and TFS.
- Analyzed audit logs using Splunk, querying and designing views and dashboards in Splunk.
- Provided production support by reproducing production bugs in lower environments, fixing them, and moving the fixes to the prod environment as hotfixes.
- Designed and created Solr Schemas to create Solr Collections.
- Laid down the guidelines for improving code quality by implementing TDD and developed an integrated test framework using JUnit and Mockito.
- Installed and configured a Hadoop cluster on AWS for POC purposes.
- Implemented a CI/CD pipeline using Maven and Jenkins.
- Worked with CMDB teams on deploying builds to various environments.
Environment: Hadoop (HDFS/Hortonworks), Spark, Spark SQL, Spark Streaming, Scala, Kafka, Java, NiFi, Pig, Hive, Oozie, Storm, HBase, Cloudera, AWS, DataStax Cassandra, Linux, Splunk, Elasticsearch, PySpark, Kibana, TFS, CMDB, Ant, Jenkins.
Confidential, Tampa, FL
- Installed Kafka on a virtual machine and created topics for different users.
- Actively involved in designing the Hadoop ecosystem pipeline.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Involved in designing Kafka for a multi-data-center cluster and monitoring it.
- Responsible for importing real-time data from sources into Kafka clusters.
- Worked with Spark techniques such as refreshing tables, handling parallelism, and modifying the Spark defaults for performance tuning.
- Implemented Spark RDD transformations to map business analysis and applied actions on top of the transformations.
- Involved in migrating MapReduce jobs to Spark jobs and used the Spark SQL and DataFrames APIs to load structured data into Spark clusters.
- Involved in using the Spark API over Hadoop YARN as the execution engine for data analytics using Hive, and submitted the data to the BI team for generating reports after processing and analyzing it in Spark SQL.
- Performed SQL joins among Hive tables to get input for the Spark batch process.
- Worked with the data science team to build a statistical model with Spark MLlib and PySpark.
- Involved in importing data from various sources into the Cassandra cluster using Sqoop.
- Created data models for Cassandra from the existing Oracle data model.
- Designed column families in Cassandra, ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra as per the business requirement.
- Used Sqoop import functionality to load historical data present in RDBMS into HDFS.
- Designed workflows and coordinators in Oozie to automate and parallelize Hive jobs on the Apache Hadoop environment by Hortonworks (HDP 2.2).
- Configured Hive bolts and wrote data to Hive in Hortonworks as part of a POC.
- Implemented the ELK (Elasticsearch, Logstash, Kibana) stack to collect and analyze the logs produced by the Spark cluster.
- Developed a Python script to start and end a job smoothly for a UC4 workflow.
- Developed an Oozie workflow for scheduling and orchestrating the ETL process.
- Created data pipelines as per the business requirements and scheduled them using Oozie coordinators.
- Wrote Python scripts to parse XML documents and load the data into a database.
- Worked extensively on Apache NiFi to build NiFi flows for the existing Oozie jobs to handle incremental loads, full loads, and semi-structured data, to get data from REST APIs into Hadoop, and to automate all the NiFi flows to run incrementally.
- Created NiFi flows to trigger Spark jobs and used PutEmail processors to send notifications if there are any failures.
- Developed shell scripts to periodically perform incremental imports of data from a third-party API to AWS.
- Worked extensively on importing metadata into Hive using Scala and migrated existing tables and applications to work on Hive and the AWS cloud.
- Developed batch scripts to fetch the data from AWS S3 storage and perform the required transformations in Scala using the Spark framework.
- Used version control tools such as GitHub to share code snippets among team members.
- Involved in daily Scrum meetings to discuss development progress and was active in making Scrum meetings more productive.
Environment: Hadoop, HDFS, Hive, Python, HBase, NiFi, Spark, MySQL, Oracle 12c, Linux, Hortonworks, Oozie, MapReduce, Sqoop, Shell Scripting, Apache Kafka, Scala, AWS.
Confidential, Tampa, FL
- Analyzed functional specifications based on project requirements.
- Ingested data from various data sources into Hadoop HDFS/Hive tables using Sqoop, Flume, and Kafka.
- Extended Hive core functionality by writing custom UDFs using Java.
- Developed Hive queries for the user requirements.
- Worked on multiple POCs implementing a data lake for multiple data sources, including Teamcenter, SAP, Workday, and machine logs.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Worked on MS SQL Server PDW migration for the MSBI warehouse.
- Planned, scheduled, and implemented Oracle to MS SQL Server migrations for AMAT in-house applications and tools.
- Worked on the Solr search engine to index incident report data and developed dashboards in the Banana reporting tool.
- Integrated Tableau with a Hadoop data source to build dashboards providing various insights on the organization's sales.
- Worked on Spark in building BI reports using Tableau. Tableau was integrated with Spark using Spark SQL.
- Developed Spark jobs using Scala and Python on top of YARN/MRv2 for interactive and batch analysis.
- Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
- Developed workflows in LiveCompare to analyze SAP data and reporting.
- Worked on Java development and support, and tools support, for in-house applications.
- Participated in daily scrum meetings and iterative development.
- Worked on search functionality for searching through millions of files for logistics groups.
Environment: Hadoop, Hive, Sqoop, Spark, Kafka, Scala, MS SQL Server PDW, TFS, Java.
- Developed a JMS API using the J2EE package.
- Used JavaScript for client-side validation.
- Used the Struts framework for implementing the MVC architecture.
- Wrote various Struts action classes to implement the business logic.
- Involved in the design of the project using UML use case diagrams, sequence diagrams, object diagrams, and class diagrams.
- Understood concepts related to, and wrote code for, advanced topics such as Java I/O, serialization, and multithreading.
- Used display tags in the presentation layer for a better look and feel of the web pages.
- Developed packages to validate data from flat files and insert it into various tables in the Oracle database.
- Provided UNIX scripting to drive automatic generation of static web pages with dynamic news content.
- Participated in requirements analysis to figure out various inputs correlated with their scenarios in Asset Liability Management (ALM).
- Assisted design and development teams in identifying DB objects and their associated fields in creating forms for ALM modules.
- Involved in developing PL/SQL procedures, functions, triggers, and packages to provide backend security and data consistency.
- Responsible for performing code reviews and debugging.
Environment: Java, J2EE, UML, Struts, HTML, XML, CSS, JavaScript, Oracle 9i, SQL*Plus, PL/SQL, MS Access, UNIX Shell Scripting.