- Overall 9+ years of professional IT experience, including 5+ years as a Big Data Engineer in industries such as Banking and Health.
- Experience in building data pipelines for data collection, storage and processing.
- Expert skills in HDFS, Kafka, Spark, Hive, Sqoop, MapReduce, YARN, HBase, Oozie and Zookeeper.
- Experience in real-time data streaming using NiFi and Kafka.
- In-depth understanding of Hadoop architecture and its components, such as HDFS and MapReduce.
- Strong knowledge of PySpark and Spark SQL analytical functions, and of extending their functionality by writing custom UDFs.
- Experience designing and implementing fast and efficient data acquisition using Big Data processing techniques and tools.
- Good experience with data visualization tools such as Kibana and Tableau.
- Experience using Amazon Web Services (AWS), including creating EC2 instances and S3 storage.
- Real-time data streaming using Kinesis and Kinesis Data Firehose.
- ETL transformations using AWS Glue, and AWS Lambda to trigger and process events.
- Working knowledge of MLlib in Spark, using linear regression, naive Bayes and other machine learning algorithms.
- Experience in creating REST APIs and CRUD operations such as POST, PUT and GET requests using curl.
- Knowledge in both relational databases (RDBMS) such as MySQL, PostgreSQL and NoSQL databases such as MongoDB, Cassandra.
- Good knowledge of SQL and experienced in building queries.
- Knowledge in SQL database design and development in writing Constraints, Indexes, Views, Stored Procedures and Triggers using MySQL.
- Experience with project management and bug tracking tools such as JIRA and Bugzilla.
- Experience with version control tools such as Git, GitHub and SVN.
- Hands-on experience in Continuous Integration (CI) and Continuous Deployment (CD) using Jenkins, and in job scheduling using Autosys and Airflow, including DAG creation.
- Good experience in Agile development environments and Agile frameworks such as Scrum.
- Ability to handle multiple tasks, working in a team as well as independently; experienced in interacting with Business/Operations/Technology groups.
Web Server: LAMP Server, WAMP Server, XAMPP Server, NGINX
Database: MySQL, SQL Server, MongoDB, PostgreSQL, Cassandra
Hadoop Ecosystem: Spark, Hive, Sqoop, Oozie, MapReduce, Flume, HBase
Cluster Mgmt. & Monitoring / Cloud Platforms: EMR, Cloudera Manager, Hortonworks Ambari, Microsoft Azure, AWS
Visualization Tools: Tableau, Power BI, Kyvos workbook, D3.js and Chart.js
Build Management Tools: Gradle, Maven, Apache Ant
Scheduling Tools: Oozie, Autosys, Airflow, Jenkins
IDE: Eclipse, PyCharm, Atom, IntelliJ, PhpStorm
Web Service: REST, SOAP
Version Control: Git, GitHub, SVN
Confidential, Charlotte, NC
- Processed customer data on a high-availability Hadoop cluster to produce risk models using Kafka and Spark.
- Developed credit risk models for IFRS9 regulatory requirements using PySpark.
- Implemented IFRS9 model execution in PySpark, integrating with Python and proprietary C++ libraries.
- Performance-tuned Spark applications for ad hoc runs, attribution and sensitivity analysis.
- Stored customer details and transaction information in Hive for better business analysis and marketing.
- Maintained log data with Kafka, consumed and processed it using PySpark, and stored the historical data in the Hive data warehouse.
- Transformed data according to business requirements using PySpark and Spark SQL and loaded it into Hive.
- Designed and developed a file sourcing process that greatly reduced the processing time needed to consume data from third-party vendors.
- Split, filtered, mapped, sorted and aggregated data using Python, Spark and SQL, distributed in parallel across the data nodes.
- Defined the SDLC and end-to-end CI with a branching strategy and multi-lane deployments.
- Simplified the existing code base and wrote utilities to help teammates develop in different lanes and environments.
- Designed data ingestion patterns and planned dataflows for other environments using NiFi and Kafka.
- Stored data in S3 for long-term retention and analytics, and used Amazon Athena for ad hoc data queries.
- Orchestrated data workflows using Airflow, managing and scheduling them by creating DAGs in Python.
Environment: Kafka, PySpark, ZooKeeper, Cloudera, EMR, Hive, Spark SQL, YARN, Ansible, Redshift, Linux, Git, Windows 10, JIRA
- Implemented UDFs, UDAFs and UDTFs in Java for Hive to handle processing that Hive's built-in functions cannot perform.
- Effectively used Oozie to develop automated workflows of Sqoop, MapReduce and Hive jobs.
- Performed ETL transformations using PySpark and Spark SQL and stored the data in Hive.
- Wrote shell scripts with 2 logging features to automate jobs and scheduled them with Autosys.
- Deployed and extracted data using Microsoft Azure into Netezza.
- Performed regression testing for integral code releases.
- Worked on Kyvos to build the cubes required for Tableau and Power BI dashboard views.
Confidential, Piscataway, NJ
Big Data/Data Engineer
- Scheduled all jobs in Oozie, calling shell actions.
- Integrated Kafka with Spark Streaming for data extraction and transformation, processing the data with DStreams.
- Loaded all transformed data into HBase for analytics and reporting.
- Built data visualizations using D3.js to display graphs.
- Built CRUD operations and REST APIs using web technologies.
- Prepared Hive SQL scripts, procedures and views to implement the business logic.
- Wrote custom calculation UDFs in Hive.
- Gave demos to business users and incorporated the feedback obtained.
- Discussed new and enhancement requirements with the business analysts.
- Integrated with PHP to generate web-based reports and dashboards.