- Big data developer with experience in Data Pipeline, Data Mining, Database Management, Machine Learning, and Data Visualization.
- AWS solution architecture with a focus on server-based and serverless data pipelines and infrastructure for consolidated delivery. Proficient in Lambda, Redshift, Snowflake, ECS, DynamoDB, and S3.
- Involved in the data engineering project life cycle, including Data Acquisition, Data Cleansing, Data Manipulation, Feature Engineering, and ETL Pipeline.
- Power user of extract, transform, and load (ETL) processes that bring data from various sources into data warehouses; experienced in collecting, aggregating, and moving data from various sources using Apache Flume and Kafka.
- Familiar with the Hadoop ecosystem and the Apache Spark framework, including HDFS, MapReduce, YARN, Hive, Sqoop, Oozie, Spark SQL, PySpark, Spark Streaming, and Kafka.
- Good understanding of distributed systems, HDFS architecture, and the internal workings of the MapReduce and Spark processing frameworks.
- Experienced with various data sources, including SQL Server 13.0.5, Oracle 11g, DB2, AWS, and Impala 5.x.
- Good exposure to performance tuning of Hive queries, MapReduce jobs, and Spark applications.
- Designed, developed, and migrated high-performance, metadata-driven data pipelines with Kafka 2.2.1 and Hive 2.4.3; provided data export capability through an API and UI.
- Worked with non-relational databases such as MongoDB 3.2, Cassandra 2.7, and HBase 1.3.
- Extensive experience collecting log data and JSON into HDFS using Kafka, with Hive for processing.
- Capable of analyzing data using HiveQL, the Spark framework, Pig, and MapReduce in Scala and PySpark.
- Knowledge of data architecture, including data ingestion pipeline design, the Hadoop framework, Spark Datasets, Spark SQL queries, data modeling, and data processing to optimize ETL workflows.
- Developed Spark applications using APIs to handle data from RDBMS (MySQL, Oracle Database, DB2) to Hive or streaming sources (Spark Streaming, Kafka).
- Proficient in machine learning algorithms, A/B testing, and predictive modeling, including Linear Regression, Logistic Regression, Naïve Bayes, Decision Tree, Random Forest, Gradient Boosting, Stacking, SVM, KNN, Neural Networks, and K-means Clustering.
- Utilize data visualization tools such as Tableau 10.5, Python Matplotlib/Seaborn, R ggplot2/Shiny to create visually impactful and actionable interactive reports and dashboards.
- Adept in developing and debugging Stored Procedures, User-defined Functions (UDFs), Triggers, Indexes, Constraints, Transactions and Queries using Transact-SQL (T-SQL).
Confidential, Portland, OR
Big Data Developer
- Designed, created, tested, and maintained data pipelines in AWS and the Hadoop ecosystem to build a robust, fault-tolerant system
- Worked in an Agile framework as an individual contributor; responsibilities included interacting with the business team during story grooming and reviewing stories and acceptance criteria
- Deployed, scaled, and configured various serverless frameworks in AWS and wrote their manifests
- Implemented REST microservices and generated metrics with method-level granularity and persistence using CloudWatch
- Built an interactions data pipeline from the API through AWS Lambda and ECS to Kafka; converted JSON to Avro and verified the data in HBase
- Built a replay process in AWS to prevent data loss and, using Python, created logging to track unique message IDs in CloudWatch
- Applied interactions and opportunities schemas in Kafka and optimized performance through partitioning; developed Scala code to migrate data from Kafka to HBase, ensuring the correct schema and column families were stored; checked the condition of the work cluster in the YARN UI
- Wrote AWS Lambda functions in Python, unit tested them for accuracy, and used Terraform to automate deployments across S3, DynamoDB, SNS, and SQS
- Used Redshift to run online analytical queries to better identify business needs and deliver reports to business teams
- Scheduled Airflow jobs with AWS and monitored pipeline health
- Wrote PySpark code to implement a Spark Streaming job that lands records from Kafka in HBase
- Ran performance tests through GitLab CI to verify data flow volume; deployed to the test environment through Jenkins according to the branching strategy
Confidential, Piscataway, NJ
Big Data Developer
- Designed and developed scalable automated detection systems leveraging Machine Learning, Hadoop, Apache Spark Streaming, and Kafka to identify suspicious activities and possible fraud attempts, and to study historical data to predict anomalies
- Administered and maintained Cloudera Hadoop clusters; provisioned physical Linux systems
- Extensively involved in installing and configuring the Cloudera Hadoop distribution's NameNode, Secondary NameNode, ResourceManager, NodeManager, and DataNodes; performed stress and performance testing and benchmarked the cluster
- Worked extensively on importing metadata into Hive and migrated existing tables and applications to run on Hive and Spark
- Conducted performance tuning of the Hadoop cluster and MapReduce jobs; fixed design flaws
- Developed an ETL pipeline that extracted data from HBase using Sqoop, processed it with Spark, and loaded it into HDFS
- Involved in migrating MapReduce jobs to a Spark cluster using Spark SQL
- Developed Spark code using PySpark and Spark SQL for faster testing and data processing
- Created end-to-end Spark applications in Scala that performed data cleaning and validation, loaded the data into Spark RDDs, and ran in-memory computation to generate the output response
- Involved in developing a Spark Streaming application for one of the data sources, using Scala and Spark to apply transformations
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing
- Created a PySpark script to perform data analytics and load the results to HDFS
- Used pandas, NumPy, scikit-learn, TensorFlow, and PyTorch with PySpark for exploratory data analysis, and implemented various machine learning algorithms with Spark MLlib
- Integrated visualization into Spark applications using Databricks, Tableau, and the Python matplotlib and seaborn packages
Confidential, Philadelphia, PA
Big Data Developer
- Responsible for building scalable distributed data solutions using Hadoop; analyzed and wrote Hadoop MapReduce jobs using the Java API, Pig, and Hive
- Involved in loading data from the edge node to HDFS using shell scripts; created HBase tables to store variable data formats coming from different risk portfolios
- Managed AWS IAM by creating new users, granting them limited access as needed, and assigning roles and policies to specific users
- Maintained services hosted in AWS, including managing EC2 instances; responsible for troubleshooting MapReduce job execution issues by inspecting and reviewing log files in AWS S3
- Architected a Hadoop cluster in pseudo-distributed mode working with ZooKeeper; stored and loaded data from HDFS to S3 and used the backups to create tables in an AWS cluster with S3 storage
- Used MapReduce and Spark to clean and pre-process raw data, extract relevant fields, perform pre-aggregations and joins, and convert text data into suitable file formats
- Worked with NoSQL databases such as HBase, creating HBase tables to load large sets of semi-structured data from client sources
- Collected and aggregated large amounts of log data using Apache Flume and staged it in HDFS for further analysis; used Pig to parse the data and store it in Avro format
- Implemented MapReduce programs to handle semi-structured and unstructured log data such as XML, JSON, and sequence files
- Exported the analyzed data to the relational database using Sqoop for visualization and to generate reports for the BI team; used Sqoop to store the data in HBase and Hive
- Created Hive tables with dynamic partitions and buckets for sampling, and queried them with HiveQL
- Worked on cluster installation, DataNode commissioning and decommissioning, NameNode high availability, capacity planning, and slot configuration
- Developed Spark SQL code using Scala for faster processing of data
- Developed Scala code using DataFrames, Datasets, SQL, and RDD/MapReduce in Spark for data aggregation and queries, and wrote data back to RDBMS (Oracle) through Sqoop
- Implemented all the components following the test-driven development (TDD) methodology and used Scala for unit testing
- Involved in the complete end-to-end code deployment process in production
Confidential, Minneapolis, MN
Big Data Engineer
- Conducted structured portfolio analysis and made recommendations to maximize value creation across card activities and resource behaviors
- Extracted, transformed, and loaded (ETL) data from multiple federated data sources in Spark
- Used Spark SQL to extract and process data, parsing it with Datasets, DataFrames, or RDDs in HiveContext and applying transformation and action methods (map, flatMap, reduceByKey, filter)
- Developed and maintained workflow scheduling jobs in Oozie for importing data from RDBMS to Hive and Spark Streaming
- Created end-to-end Spark applications in Scala to cleanse, validate, and load data into Spark RDDs and manipulate it in memory to generate the output response
- Used the Spark API over Hadoop YARN as the execution engine for data analysis with Hive, extracting features through transformation and action steps; submitted, processed, analyzed, and generated reports for partners in Spark SQL
- Worked extensively with Sqoop for importing metadata from Oracle
- Developed Storm topologies for real-time cardholder profiles, using Kafka as the source of customer activity information and storing the results in HBase
- Used the Spark Streaming API to consume data from Kafka, processed it with core Spark code written in Scala, and stored the resulting data in HBase tables to generate reports
- Experienced in building ETL pipelines that ingest data from a Kafka source into an HDFS sink using Flume
- Implemented an Oracle database and wrote aggregation queries to find valid records for clients
- Optimized queries in Oracle through data sharding, partitioning, and added indexes
- Applied machine learning algorithms to train and test analytics models (linear regression, decision trees, k-nearest neighbors, random forest, neural networks); evaluated the models and computed metrics to identify the best-fitting one
- Linked datasets to visualization dashboards and used visualization libraries (matplotlib, seaborn) and Tableau to present findings
- Designed user-based and item-based collaborative filtering using Pearson correlation between users and items; applied grid search to tune hyperparameters and evaluate the model