- 5+ years of experience as a Data Engineer in Big Data using Hadoop and the Spark framework, covering analysis, design, development, documentation, deployment and integration using SQL and Big Data technologies.
- 2+ years of experience as a Snowflake Engineer.
- Well versed in configuring and administering the Hadoop Cluster using Cloudera and Hortonworks.
- Experience in creating separate virtual data warehouses with different size classes in Snowflake on AWS.
- Experience with data transformations utilizing SnowSQL and Python in Snowflake.
- Hands-on experience in bulk loading data into and unloading data from Snowflake tables using the COPY command.
- Experience in working with AWS S3 and the Snowflake cloud data warehouse.
- Experience in creating real-time data streaming solutions using Apache Spark, Spark Streaming, Apache Storm, Kafka and Flume
- Currently working on Spark applications extensively using Scala as the main programming platform
- Processing streaming data using the Spark Streaming API with Scala.
- Used Spark DataFrames, Spark SQL and the RDD API of Spark for performing various data transformations and dataset building
- Developed RESTful web services to retrieve, transform and aggregate data from different endpoints into Hadoop (HBase, Solr).
- Created Jenkins Pipeline using Groovy scripts for CI/CD.
- Exposure to data lake implementation; developed data pipelines and applied business logic using Apache Spark
- Involved in converting Cassandra/Hive/SQL queries into Spark transformations using RDDs and Scala.
- Implemented Spark scripts using Scala and Spark SQL to load Hive tables into Spark for faster processing of data.
- Hands-on experience with real-time processing on NoSQL databases such as MongoDB, HBase and Cassandra
- Experience in creating MongoDB clusters and hands on experience with complex MongoDB aggregate functions and mapping
- Experience in using Flume to load log files into HDFS and Oozie for data scrubbing and processing
- Experience in performance tuning of Hive queries and MapReduce programs for scalability and faster execution
- Experienced in handling real time analytics using HBase on top of HDFS data
- Experience in transformations, grouping, aggregations and joins using the Kafka Streams API
- Hands-on experience deploying Kafka Connect in standalone and distributed modes and creating Docker containers
- Created topics and wrote Kafka producers and consumers in Python as required; developed Kafka source/sink connectors to stream new data into topics and from topics into the required databases while performing ETL tasks; also used the Akka toolkit with Scala to perform some builds
- Experienced in collecting metrics for Hadoop clusters using Ambari & Cloudera Manager.
- Knowledge of Storm architecture; experience in using data modeling tools such as Erwin
- Excellent experience in using scheduling tools to automate batch jobs
- Hands-on experience in using Apache Solr/Lucene
- Expertise in SQL Server and SQL, including queries, stored procedures and functions
- Hands-on experience in application development using Hadoop, RDBMS and Linux shell scripting
- Strong experience in Extending Hive and Pig core functionality by writing custom UDFs
- Experience in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across large volumes of structured and unstructured data.
- Extensive experience in text analytics, developing statistical machine learning and data mining solutions to various business problems, and producing data visualizations using Python and R.
- Ability to work both in a team and individually on many cutting-edge technologies, with excellent management skills, business understanding and strong communication skills
Hadoop/Big Data: HDFS, MapReduce, Yarn, HBase, Pig, Hive, Sqoop, Flume, Oozie, Zookeeper, Splunk, Hortonworks, Cloudera
Programming languages: SQL, Python, R, Scala, Spark, Linux shell scripts
Databases: RDBMS (MySQL, DB2, MS-SQL Server, Teradata, PostgreSQL), NoSQL (MongoDB, HBase, Cassandra), Snowflake virtual warehouse
OLAP & ETL Tools: Tableau, Spyder, Spark, SSIS, Informatica Power Center, Pentaho, Talend
Data Modelling Tools: Microsoft Visio, ER Studio, Erwin
Python and R libraries: R - tidyr, tidyverse, dplyr, reshape, lubridate; Python - Beautiful Soup, numpy, scipy, matplotlib, python-twitter, pandas, scikit-learn, keras.
Machine Learning: Regression, Clustering, MLlib, Linear Regression, Logistic Regression, Decision Tree, SVM, Naive Bayes, KNN, K-Means, Random Forest, Gradient Boosting, AdaBoost, Neural Networks and Time Series Analysis.
Data analysis Tools: Machine Learning, Deep Learning, Data Warehouse, Data Mining, Data Analysis, Big data, Visualizing, Data Munging, Data Modelling
Cloud Computing Tools: Snowflake, SnowSQL, Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)
Amazon Web Services: EMR, EC2, S3, RDS, Cloud Search, Redshift, Data Pipeline, Lambda.
Reporting Tools: JIRA, MS Excel, Tableau, Power BI, QlikView, Qlik Sense, D3, SSRS, SSIS
Development Methodologies: Agile, Scrum, Waterfall
Confidential, Boston, MA
- Developed Talend Big Data jobs to load high volumes of data into an S3 data lake and then into Snowflake.
- Developed Snowpipes for continuous ingestion of data using an event handler on an AWS S3 bucket.
- Developed SnowSQL scripts to deploy new objects and apply changes to Snowflake.
- Developed a Python script to integrate DDL changes between the on-prem Talend warehouse and Snowflake.
- Working with the AWS stack: S3, EC2, Snowball, EMR, Athena, Glue, Redshift, DynamoDB, RDS, Aurora, IAM, Firehose, and Lambda.
- Designing and implementing new Hive tables, views and schemas, and storing data optimally.
- Performing Sqoop jobs to land data on HDFS and running validations.
- Configuring Oozie scheduler jobs to run the extract jobs and queries in an automated way.
- Optimizing queries to improve query performance.
- Designing and creating SQL Server tables, views, stored procedures, and functions.
- Performing ETL operations using Apache Spark, running ad hoc queries and implementing machine learning techniques.
- Worked on configuring CI/CD for CaaS deployments (Kubernetes).
- Involved in migrating master data from Hadoop to AWS.
- Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames and pair RDDs.
- Developed a preprocessing job using Spark DataFrames to transform JSON documents into flat files
- Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response
- Processed big data with Amazon EMR across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Worked on Big Data infrastructure for batch processing and real-time processing using Apache Spark
- Developed Apache Spark applications by using Scala for data processing from various streaming sources
- Processed web server logs by developing multi-hop Flume agents using the Avro sink and loaded them into Cassandra for further analysis; extracted files from Cassandra through Flume
- Responsible for design and development of Spark SQL Scripts based on Functional Specifications
- Worked on the large-scale Hadoop YARN cluster for distributed data processing and analysis using Spark, Hive, and Cassandra
- Involved in converting Cassandra/Hive/SQL queries into Spark transformations using RDDs and Scala
- Implemented Spark scripts using Scala and Spark SQL to load Hive tables into Spark for faster processing of data.
- Developed helper classes that abstract the Cassandra cluster connection and act as a core toolkit
- Involved in creating a data lake by extracting customer data from various data sources into HDFS, including data from Excel, databases, and server logs
- Moved data from HDFS to Cassandra using MapReduce and the BulkOutputFormat class.
- Extracted files from Cassandra through Sqoop, placed them in HDFS and processed them using Hive
- Wrote MapReduce (Hadoop) programs to convert text files into Avro and load them into Hive tables
- Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system
- Extended Hive and Pig core functionality with custom User Defined Functions (UDFs), User Defined Table-Generating Functions (UDTFs) and User Defined Aggregating Functions (UDAFs)
- Involved in loading data from REST endpoints to Kafka producers and transferring the data to Kafka brokers
- Used Apache Kafka as a distributed, partitioned, replicated commit log service for messaging
- Partitioned data streams using Kafka. Designed and configured a Kafka cluster to accommodate heavy throughput.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team
- Used Apache Oozie for scheduling and managing multiple Hive Jobs. Knowledge of HCatalog for Hadoop based storage management
- Migrated an existing on-premises application to Amazon Web Services (AWS) and used services such as EC2 and S3 for processing and storing small data sets; maintained the Hadoop cluster on AWS EMR
- Developed solutions to pre-process large sets of structured and semi-structured data in different file formats such as Text, Avro, Sequence, XML, JSON, and Parquet
- Generated various kinds of reports using Pentaho and Tableau based on Client specification
- Gained exposure to new tools such as Jenkins, Chef and RabbitMQ.
- Worked with the Scrum team in delivering agreed user stories on time for every sprint
Environment: Snowflake, SnowSQL, Hadoop, MapReduce, HDFS, Yarn, Hive, Sqoop, Oozie, Spark, Scala, AWS, EC2, S3, EMR, Cassandra, Flume, Kafka, Pig, Linux, Shell Scripting
Confidential, Westport, CT
- Worked on Snowflake Shared Technology Environment for providing stable infrastructure, secured environment, reusable generic frameworks, robust design architecture, technology expertise, best practices and automated SCBD (Secured Database Connections, Code Review, Build Process, Deployment Process) utilities.
- Designed ETL process using Pentaho Tool to load from Sources to Targets with Transformations.
- Worked on Snowflake Schemas and Data Warehousing.
- Developed Pentaho Big Data jobs to load high volumes of data into an S3 data lake and then into the Redshift data warehouse.
- Migrated the data from Redshift data warehouse to Snowflake database.
- Built dimensional models and a data vault architecture on Snowflake.
- Built scalable distributed Hadoop cluster running Hortonworks Data Platform (HDP 2.6)
- Involved in developing Spark code using Scala and Spark SQL for faster testing and processing of data, and explored optimizations using SparkContext, Spark SQL and pair RDDs
- Serializing JSON data and storing the data into tables using Spark SQL
- Used Spark Streaming to collect data from Kafka in near real time, perform the necessary transformations and aggregations to build the common learner data model, and store the data in a NoSQL store (HBase).
- Worked with the Spark framework for both batch and real-time data processing
- Used Spark MLlib for predictive intelligence, customer segmentation and smooth maintenance in Spark Streaming
- Developed Spark Streaming programs that take data from Kafka and push it to different targets
- Loaded data from different data sources (Teradata, DB2, Oracle and flat files) into HDFS using Sqoop and then into partitioned Hive tables.
- Created various Pig scripts and wrapped them as shell commands to provide aliases for common operations in the project's business flow.
- Implemented Partitioning, Bucketing in Hive for better organization of the data.
- Created a few Hive UDFs to hide or abstract complex repetitive rules.
- Developed Oozie workflows for daily incremental loads that get data from Teradata and import it into Hive tables.
- Developed bash scripts to fetch log files from an FTP server and process them for loading into Hive tables.
- Scheduled all bash scripts using the Resource Manager scheduler.
- Developed MapReduce programs for applying business rules to the data.
- Developed a NiFi workflow to pick up data from the data lake as well as from a server and send it to a Kafka broker
- Involved in loading and transforming large sets of structured data from router location to EDW using an Apache NiFi data pipeline flow
- Implemented a Kafka event log producer to publish logs to a Kafka topic, which is consumed by the ELK (Elasticsearch, Logstash, Kibana) stack to analyze the logs produced by the Hadoop cluster
- Implemented Apache Kafka as a replacement for a more traditional message broker (JMS Solace) to reduce licensing costs, decouple processing from data producers and buffer unprocessed messages.
- Implemented the receiver-based approach in Spark Streaming, working with the StreamingContext in Python and handling the closing and waiting stages properly.
- Implemented rack topology scripts for the Hadoop cluster.
- Implemented fixes to resolve issues related to the old Hazelcast EntryProcessor API.
- Used the Akka toolkit with Scala to perform a few builds
- Excellent knowledge of the Talend Administration Console, Talend installation, and the use of context and globalMap variables in Talend
- Used dashboard tools like Tableau
- Used the Talend Administration Console Job Conductor to schedule ETL jobs on a daily and weekly basis
Environment: Hadoop HDP, Linux, MapReduce, HBase, HDFS, Hive, Pig, Tableau, NoSQL, Shell Scripting, Sqoop, Open source technologies Apache Kafka, Apache Spark, Git, Talend.
- Developed highly optimized Spark applications to perform data cleansing, validation, transformation and summarization activities
- Built a data pipeline consisting of Spark, Hive, Sqoop and custom-built input adapters to ingest, transform and analyze operational data.
- Created Spark jobs and Hive Jobs to summarize and transform data.
- Used Spark for interactive queries, processing of streaming data and integration with a popular NoSQL database for large volumes of data.
- Converted Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
- Used different tools for data integration with different databases and Hadoop.
- Built real-time data pipelines by developing Kafka producers and Spark Streaming applications to consume from them.
- Ingested syslog messages, parsed them and streamed the data to Kafka.
- Imported data from different data sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and then loaded the data into HDFS.
- Exported the analyzed data to the relational databases using Sqoop, to further visualize and generate reports for the BI team.
- Collecting and aggregating large amounts of log data using Flume and staging data in HDFS for further analysis
- Analyzed the data by running Hive queries (HiveQL) to study customer behavior.
- Helped DevOps engineers deploy code and debug issues.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Developed Hive scripts in HiveQL to de-normalize and aggregate the data.
- Scheduled and executed workflows in Oozie to run various jobs.
- Implemented business logic in Hive and wrote UDFs to process the data for analysis.
- Addressed issues arising from the large volume of data and transitions.
- Documented operational problems following standards and procedures using JIRA.
Environment: Cloudera (CDH), Spark, Apache NiFi, HDFS, Oracle, HBase, MapReduce, Oozie, Sqoop, JIRA