- Big Data developer with 8+ years of professional IT experience and expertise in Hadoop ecosystem components for ingestion, data modeling, querying, processing, storage, analysis and data integration, and in implementing enterprise-level Big Data systems.
- A skilled developer with strong problem solving, debugging and analytical capabilities, who actively engages in understanding customer requirements.
- Expertise in Apache Hadoop ecosystem components such as Spark, Hadoop Distributed File System (HDFS), MapReduce, Hive, Sqoop, HBase, ZooKeeper, YARN, Flume, Pig, NiFi, Scala and Oozie.
- Hands-on experience in creating real-time data streaming solutions using Apache Spark Core, Spark SQL and DataFrames, Kafka, Spark Streaming and Apache Storm.
- Excellent knowledge of Hadoop architecture and the daemons of Hadoop clusters, including the NameNode, DataNode, ResourceManager, NodeManager and JobHistory Server.
- Worked on both Cloudera and Hortonworks Hadoop distributions. Experience in managing Hadoop clusters using Cloudera Manager.
- Well versed in the installation, configuration and management of Big Data systems and the underlying Hadoop cluster infrastructure.
- Hands-on experience in coding MapReduce/YARN programs using Java, Scala and Python for analyzing Big Data.
- Exposure to Cloudera development environment and management using Cloudera Manager.
- Worked extensively with Spark using Scala on a cluster for computational analytics; installed Spark on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle.
- Implemented Spark applications in Python using the DataFrame and Spark SQL APIs for faster data processing; imported data from different sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and loaded the results into HDFS.
- Used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data.
- Hands-on experience with Spark MLlib, used for predictive intelligence, customer segmentation and smooth maintenance in Spark Streaming.
- Experience in using Flume to load log files into HDFS and Oozie for workflow design and scheduling.
- Experience in optimizing MapReduce jobs to use HDFS efficiently by using various compression mechanisms.
- Working on data pipelines for the ingestion, aggregation and loading of consumer response data into Hive external tables in HDFS, serving as a feed for Tableau dashboards.
- Hands on experience in using Sqoop to import data into HDFS from RDBMS and vice-versa.
- In-depth Understanding of Oozie to schedule all Hive/Sqoop/HBase jobs.
- Hands on expertise in real time analytics with Apache Spark.
- Experience in converting Hive/SQL queries into RDD transformations using Apache Spark, Scala and Python.
- Extensive experience in working with different ETL tool environments like SSIS, Informatica and reporting tool environments like SQL Server Reporting Services (SSRS).
- Experience with Microsoft cloud and with setting up clusters on Amazon EC2 and S3, including automating the setup and extension of clusters in AWS.
- Worked extensively with Spark using Python on a cluster for computational analytics; installed Spark on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL.
- Strong experience and knowledge of real time data analytics using Spark Streaming, Kafka and Flume.
- Knowledge of installing, configuring, supporting and managing Hadoop clusters using Apache and Cloudera (CDH3, CDH4) distributions and on Amazon Web Services (AWS).
- Experienced in writing ad hoc queries using Cloudera Impala, including Impala analytical functions.
- Experience in creating DataFrames using PySpark and performing operations on them using Python.
- In depth understanding/knowledge of Hadoop Architecture and various components such as HDFS and MapReduce Programming Paradigm, High Availability and YARN architecture.
- Established multiple connections to different Redshift clusters (Bank Prod, Card Prod, SBBDA Cluster) and provided access for pulling the information needed for analysis.
- Generated various kinds of knowledge reports using Power BI based on Business specification.
- Developed interactive Tableau dashboards to provide a clear understanding of industry specific KPIs using quick filters and parameters to handle them more efficiently.
- Experienced in projects using JIRA, testing, and the Maven and Jenkins build tools.
- Experienced in designing, building, deploying and using almost all of the AWS stack (including EC2 and S3), focusing on high availability, fault tolerance and auto-scaling.
- Good experience with use-case development, with Software methodologies like Agile and Waterfall.
- Working knowledge of Amazon's Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as a storage mechanism.
- Good working experience in importing data using Sqoop and SFTP from various sources such as RDBMS, Teradata, mainframes, Oracle and Netezza into HDFS, and performing transformations on it using Hive, Pig and Spark.
- Extensive experience in Text Analytics, developing different Statistical Machine Learning solutions to various business problems and generating data visualizations using Python and R.
- Proficient in NoSQL databases including HBase, Cassandra and MongoDB, and their integration with Hadoop clusters.
- Hands-on experience with Hadoop Big Data technology, working with MapReduce, Pig and Hive as analysis tools and Sqoop and Flume as data import/export tools.
Big Data Ecosystem: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, HBase, Kafka Connect, Impala, StreamSets, Oozie, Spark, ZooKeeper, NiFi, Amazon Web Services.
Hadoop Distributions: Apache Hadoop 1.x/2.x, Cloudera CDP, Hortonworks HDP.
Languages: Python, Scala, Java, R, Pig Latin, HiveQL, Shell Scripting.
Software Methodologies: Agile, SDLC Waterfall.
IDEs: Eclipse, NetBeans, IntelliJ, Spring Tool Suite.
Databases: MySQL, Oracle, DB2, PostgreSQL, DynamoDB, MS SQL Server.
NoSQL: HBase, MongoDB, Cassandra.
ETL/BI: Power BI, Tableau, Talend, Snowflake, Informatica, SSIS, SSRS, SSAS.
Version Control: Git, SVN, Bitbucket.
Operating Systems: Windows (XP/7/8/10), Linux (Unix, Ubuntu), Mac OS.
Cloud Technologies: Amazon Web Services (EC2, S3), Azure Databricks.
Sr. Spark/AWS Developer
- Implemented simple to complex transformations on streaming data and Datasets. Analyzed the Hadoop cluster and different Big Data analytics tools including Hive, Spark, Python, Sqoop, Flume and Oozie.
- Developed Spark Streaming applications that consume static and streaming data from different sources.
- Used Spark Streaming to stream data from external sources through Kafka. Migrated the code base from the Cloudera platform to Amazon EMR and evaluated Amazon ecosystem components such as Redshift and DynamoDB. Good knowledge of NoSQL databases such as DynamoDB, MongoDB and Cassandra. Set up and administered DNS in the AWS cloud using Route 53.
- Developed wrapper shell scripts for calling Informatica workflows using the pmcmd command and created shell scripts to fine-tune the ETL flow of the Informatica workflows.
- Analyzed, designed and developed ETL strategies and processes, wrote ETL specifications, and handled Informatica development and administration.
- Performed configuration, deployment and support of cloud services in Amazon Web Services (AWS).
- Worked on AWS using EMR; performed operations using EC2 instances, S3 storage, RDS and Redshift analytics, and wrote various data normalization jobs for new data ingested into Redshift, building multi-terabyte data frames.
- Designed and built a multi-terabyte, end-to-end data warehouse infrastructure from the ground up on Confidential Redshift for large-scale data, handling millions of records every day.
- Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad hoc queries. This gave a more reliable and faster reporting interface, with sub-second response for basic queries. Designed, developed and tested ETL processes in AWS Glue to migrate campaign data from external sources such as S3 and ORC/Parquet/text files into AWS Redshift.
- Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for data sets processing and storage. Built and configured a virtual data center in the Amazon Web Services cloud to support Enterprise Data Warehouse hosting including Virtual Private Cloud, Security Groups, Elastic Load Balancer.
- Created and configured a Snowflake warehouse strategy to move a terabyte of data from S3 into Snowflake via PUT scripts. Loaded data from an AWS S3 bucket into a Snowflake database using Snowpipe.
- Expert in implementing advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala. Implemented Spark using Scala and Spark SQL for faster testing and processing of data, and managed data from different sources.
- Implemented data ingestion and cluster handling for real-time processing using Kafka.
- Developed Spark programs using the Scala and Java APIs and performed transformations and actions on RDDs.
- Implemented a data interface to retrieve customer information through a REST API, pre-processed the data using MapReduce 2.0 and stored it in HDFS.
- Developed a Spark application to filter JSON source data in AWS S3 and store it in HDFS with partitions, and used Spark to extract the schema of JSON files. Developed Terraform scripts to create AWS resources such as EC2, Auto Scaling groups, ELB, Route 53, S3, SNS and CloudWatch alarms.
- Developed a data pipeline using Kafka and Spark Streaming to store data into HDFS.
- Responsible for data extraction and data integration from different data sources into Hadoop Data Lake by creating ETL pipelines Using Spark, MapReduce, Pig, and Hive.
- Developed Spark programs with Scala and applied principles of functional programming to process the complex unstructured and structured data sets. Processed the data with Spark from Hadoop Distributed File System.
- Developed Spark jobs on Databricks to perform tasks like data cleansing, data validation, standardization, and then applied transformations as per the use cases.
- Experience analyzing data from Azure data storages using Databricks for deriving insights using Spark cluster capabilities.
- Hands-on AWS cloud infrastructure setup: assisted in building the VPC, subnets, security groups, WAF for CloudFormation, and an EMR cluster with auto-scaling; used Terraform to build this infrastructure as code.
- Used sbt to build Scala-based Spark projects and executed them using spark-submit.
- Collaborated with architects to design Spark models for the existing MapReduce models and migrated them to Spark using Scala. Wrote Scala programs using Spark SQL to perform aggregations.
- Developed web services in the Play Framework using Scala while building a streaming data platform.
- Worked with Apache Spark which provides fast engine for large data processing integrated with Scala.
- Experience working with Spark SQL and creating RDDs using PySpark. Extensive experience with ETL of large datasets using PySpark on HDFS. Extracted Adobe data within AWS Glue using PySpark.
- Developed and deployed a Spark application using PySpark to compute a popularity score for all content using an algorithm and load the data into Elasticsearch for the app content management team to consume.
- Developed a data flow to pull data from a REST API using Apache NiFi with context configuration enabled, and developed entire Spark applications in Python (PySpark) in a distributed environment. Implemented a microservices architecture using the Spring Boot framework.
- Developed ETL programs to load data from Oracle to Snowflake using Informatica snowflake.
- Developed ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, and wrote SQL queries against Snowflake. Designed physical and logical data models, both relational (OLTP) and dimensional on a snowflake schema, using Erwin modeler to build an integrated enterprise data warehouse.
- Analyzed and optimized pertinent data stored in Snowflake using PySpark and SparkSQL.
- Experienced in creating data pipelines that integrate Kafka with Spark Streaming applications written in Scala. Created a Kafka broker for Structured Streaming to receive structured data by schema.
- Developed an Elasticsearch connector using the Kafka Connect API with Kafka as the source and Elasticsearch as the sink.
Hadoop / ETL Developer
- Hands on experience in installing, configuring, and using Hadoop ecosystem components like Hadoop MapReduce, HDFS, HBase, Hive, Spark, Sqoop, Pig, Zookeeper and Flume.
- Designed and developed data warehouse and Business Intelligence architecture. Designed the ETL process from various sources into Hadoop/HDFS for analysis and further processing of data modules.
- Responsible for validating target data in the data warehouse that was transformed and loaded using Hadoop Big Data tools.
- Developed Informatica Cloud jobs to migrate data from the legacy Teradata data warehouse to Snowflake Cloud.
- Designed, developed, tested, implemented and supported data warehousing ETL using Talend and Hadoop technologies.
- Extensively worked with MySQL for identifying required tables and views to export into HDFS.
- Involved in designing of HDFS storage to have efficient number of block replicas of Data.
- Involved in troubleshooting and performance tuning of reports and resolving issues within Tableau server and generated reports. Developed Tableau data visualization using Cross tabs, Heat maps, Box and Whisker charts, Scatter Plots, Geographic Map, Pie Charts and Bar Charts and Density Chart.
- Created Workbooks and dashboards for analyzing statistical billing data in Tableau Desktop and published them on to Tableau Server which allowed end users to understand the data on the fly with the usage of quick filters for on demand needed information. Scheduled full/incremental refreshes of dashboards depending on the business requirements for data sources on Tableau Server.
- Responsible for creating Hive tables on top of HDFS and developed Hive Queries to analyze the data.
- Staged data by persisting to Hive and connected Tableau with Spark cluster and developed dashboards.
- Implemented UDFs, UDAFs and UDTFs in Java for Hive to handle processing that Hive's built-in functions cannot perform. Used Hive to analyze the partitioned, bucketed data and compute various metrics for reporting.
- Loaded data into S3 buckets using AWS Glue and PySpark. Involved in filtering data stored in S3 buckets using Elasticsearch and loaded data into Hive external tables.
- Transformed the data using AWS Glue dynamic frames with PySpark, cataloged the transformed data using crawlers, and scheduled the job and crawler using the workflow feature.
- Configured the Hive tables to load the profitability system in the Talend ETL repository and created the Hadoop connection for the HDFS cluster in the Talend ETL repository.
- Supported MapReduce programs running on the cluster and wrote MapReduce jobs using the Java API.
- Wrote Java RESTful API code in an application that uses Kafka publishers and subscribers, i.e. the Kafka Producer and Consumer client APIs, for logging messages to Kafka topics and partitions.
- Design, develop & deliver the REST APIs that are necessary to support new feature development and enhancements in an agile environment.
- Expertise in building PySpark, Spark Java and Scala applications for interactive analysis, batch processing, and stream processing.
- Developed Batch processing solutions with Azure Databricks and Azure Event
- Collected LMS data into the Hadoop cluster using Sqoop. Imported and exported data between relational database systems and HDFS and Hive/Impala tables using Sqoop.
- Imported and exported structured data between different relational databases and HDFS and Hive using Sqoop.
- Involved in importing and exporting data from local and external file systems and RDBMS to HDFS.
- Designed and developed jobs that handle the initial load and the incremental load automatically using Oozie workflows. Implemented workflows using the Apache Oozie framework to automate tasks.
- Distributed Tableau reports to different user communities using techniques such as packaged workbooks and PDFs.
- Involved in creating dashboards and reports in Tableau and maintaining server activities, user activity, and customized views on server analysis. Used Tableau dashboards to communicate results to team members and to other data science, marketing and engineering teams.
- Implemented Kafka for collecting real-time transaction data, which was then processed with Spark Streaming in Python to gather actionable insights. Provided administration and operations of the Kafka platform, including provisioning, access lists, Kerberos and SSL configurations.
- Developed various pipelines for Mastercard and Visa card data integration and continuous streaming of data to databases; for one of the organizations, developed Kafka topics using StreamSets Data Collector.
- Provided expertise in Kafka brokers, ZooKeeper, Kafka Connect, Schema Registry, KSQL, REST Proxy and Kafka Control Center. Used Kafka streaming for data ingestion and cluster handling in real-time processing.
- Deployed instances, provisioned EC2 and S3 buckets, and configured security groups and the Hadoop ecosystem for Cloudera in AWS. Experience in using distributed computing architectures such as AWS products (e.g. EC2, Redshift and EMR), migrating raw data to the Amazon cloud in S3, and performing refined data processing.
- Used Git for version control to commit the developed code, which was then deployed using the build and release tool Jenkins. Developed a CI/CD system with Jenkins in a Kubernetes container environment, using Kubernetes and Docker to build, test and deploy.
- Extensively used Databricks notebooks for interactive analytics using Spark APIs.
- Worked on the implementation and maintenance of a Cloudera Hadoop cluster. Experience in working with Cloudera (CDH4 and CDH5), Hortonworks, Amazon EMR and Azure HDInsight on multi-node clusters.
- Integrated Oozie with Pig, Hive, Sqoop and developed Oozie workflow for scheduling and orchestrating the Extract, Transform, and Load (ETL) process within the Cloudera Hadoop.
- Pulled data from the data lake (HDFS) and reshaped it with various RDD transformations.
- Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
- Configured Flume for collection, aggregation and transformation of huge log data from various sources to HDFS.
- Experienced in handling large datasets during the ingestion process itself using partitions, Spark in-memory capabilities, broadcasts, efficient joins and other transformations.
- Worked with data analysis and visualizing Big Data in Tableau with Spark.
- Developed a Spark Streaming model that takes transactional data from multiple sources as input, creates multiple batches, and later processes them against an already trained fraud detection model and for error records.
- Developed Terraform scripts to create AWS resources such as EC2, Auto Scaling groups, ELB, Route 53, S3, SNS and CloudWatch alarms. Developed scripts for loading application call logs to S3.
- Performed Installation and configuration of multi-node cluster on Cloud using Amazon Web Services on EC2.
- Involved in reviewing business requirements and analyzing data sources from Excel and SQL Server for the design, development, testing and production of reports and analysis projects within Tableau Desktop.
- Migrated data efficiently from various relational data platforms to Hadoop and built a data warehouse on Hadoop ecosystem tools such as Hive, Oozie and Sqoop. Loaded data into the cluster from dynamically generated files using Flume and exported data from the cluster to relational database management systems using Sqoop.
- Involved in creating Hive tables, working on them using HiveQL, and performing data analysis using Hive and Pig.
- Expertise in publishing Power BI reports and dashboards to Power BI server and scheduling datasets to refresh for live data. Experience with Tableau and Power BI in publishing visualizations, dashboards and workbooks from Tableau Desktop to Tableau Server, and in reporting using SSRS.
- Performed Power BI Desktop data modeling, which cleans, transforms and mashes up data from multiple sources.
- Used Pig as ETL tool to do transformations, event joins and pre-aggregations before storing the data onto HDFS.
- Leverage Hadoop ecosystem to design and develop capabilities to deliver our solutions using Spark, Scala, Python.
- Solved performance issues in Hive through an understanding of joins, grouping and aggregation and how they translate to MapReduce jobs; also worked with Control-M and Informatica. Extensive experience in working with structured data using HiveQL, join operations, writing custom UDFs and optimizing Hive queries.
- Developed multiple PySpark scripts to perform cleaning, validation and transformations of Data.
- Migration of Teradata and DB2 to Snowflake database using AWS and AWS resources.
- Involved in moving data from HDFS to AWS Simple Storage Service and worked with S3 bucket in AWS.
- Worked on AWS Elastic load balancing for deploying applications in high availability and AWS Auto Scaling for providing high availability of applications and EC2 instances based on the load of applications.
- Developed Spark jobs for continuous integration of error records (read, write and error counts), pulling logs from a Kafka topic into MySQL server tables in the required order.
- Worked on creating data pipelines with Airflow to schedule PySpark jobs for performing incremental loads and used Flume for weblog server data.
- Developed batch scripts to fetch data from AWS S3 storage and perform the required transformations in Scala using the Spark framework. Used Spark components such as Spark SQL for data warehousing and ETL processes.
- Designed a custom referential integrity framework on NoSQL Cassandra tables to maintain data integrity and relations in the data. Developed multiple Spark batch jobs using Spark SQL, performed transformations using many APIs, and updated master data in the Cassandra database as per business requirements.
- Used Informatica as an ETL tool to create source/target definitions, mappings and sessions to extract, transform and load data into staging tables from various sources.
- Designed and Developed Informatica processes to extract data from internal check issue systems.
- Used Informatica PowerExchange to extract data from one of EIC's operational systems, called Datacom.
- Extensive experience in Building, publishing customized interactive reports and dashboards, report scheduling using Tableau Desktop and Tableau Server.
- Extensive experience in Tableau Administration Tool, Tableau Interactive Dashboards, Tableau suite.
- Developed Tableau visualizations and dashboards using Tableau Desktop and published the same on Tableau Server.
- Worked extensively on creating dashboards using Tableau tools such as Tableau Desktop, Tableau Server and Tableau Reader across Tableau versions 9.0, 8.2 and 8.1. Also involved in Tableau Server administration, including installations, upgrades, user and user group creation, and setting up security features.
- Involved in Installation and upgrade of Tableau server and server performance tuning for optimization.
- Used Informatica PowerCenter for extraction, transformation and loading (ETL) of data from heterogeneous source systems into the target database.
- Involved in migrating data from an on-premises Cloudera cluster to AWS EC2 instances deployed on an EMR cluster; developed an ETL pipeline to extract logs, store them in an AWS S3 data lake and process them further using PySpark.
- Analyzed data stored in S3 buckets using SQL and PySpark, stored the processed data in Redshift and validated data sets by implementing Spark components.
- Installed, upgraded and configured a multi-node Informatica environment and production PowerCenter systems covering 8.x and 9.x versions to provide the user community with the latest features.
- Used debugger in Informatica Designer to resolve the issues regarding data thus reducing project delay.
- Designed high-level view of the current state of dealer operation, leads, and website activity using Tableau.
- Performed various types of joins in Tableau to present integrated data and validated data integrity to examine the feasibility of the discussed visualization designs.
- Leveraged advanced Tableau features such as calculated fields, parameters and sets to support data analysis and data mining.
- Worked as an ETL developer and Tableau developer, widely involved in designing, developing and debugging ETL mappings using the Informatica Designer tool, and created advanced chart types, visualizations and complex calculations to manipulate data using Tableau Desktop.
- Worked extensively on Informatica B2B Data Exchange setup, including endpoint creation, scheduler, partner setup, profile setup, event attribute creation, event status creation, etc.
- Used Informatica to parse XML data into the data mart structures that are further used for reporting needs.
- Used Informatica PowerCenter to cover all phases of the data flow, with source data (Oracle, SQL Server, flat files) analyzed before being extracted for transformation.
- Used Custom SQL feature on Tableau Desktop to create very complex and performance optimized dashboards.
- Connected Tableau to various databases and set up live data connections, automatic query updates on data refresh, etc.