Big Data Developer Resume
Piscataway, NJ
SUMMARY
- Extensive experience across a variety of IT technologies, including hands-on experience with Big Data technologies
- Proficient in installing, configuring and using Apache Hadoop ecosystem components such as MapReduce, Hive, Pig, Flume, YARN, HBase, Sqoop, Spark, Storm, Kafka, Oozie, Flink, NiFi, Impala and ZooKeeper
- Strong understanding of NoSQL databases and hands-on experience writing applications on HBase, Cassandra, MongoDB and Elasticsearch
- Hands-on experience developing ETL programs for data extraction, transformation and loading, and designing and implementing data warehouse applications on platforms such as Oracle, Amazon Redshift, Snowflake and SAS
- Hands-on experience in capacity planning, monitoring and performance tuning of Hadoop and Spark clusters
- Expertise in distributed programming with Spark in Java, Scala and Python
- Proficient in collecting, aggregating and moving large amounts of real-time data with Flume and Apache Spark, and in programming Scala to analyze large datasets and process real-time data with Spark Streaming, Kinesis and Kafka
- Strong experience with batch processing and workflow tools such as Airflow, NiFi, Luigi and Azkaban
- Experience writing Pig Latin scripts and HiveQL and Impala queries for preprocessing and analyzing large volumes of data
- Extensive experience designing and implementing large-scale data warehousing and analytics solutions on RDBMS platforms (e.g. Oracle, Teradata, Amazon RDS, PostgreSQL), with a clear understanding of their challenges and limitations
- Extensive experience in importing and exporting data using Sqoop from HDFS/Hive/HBase to Relational Database Systems (RDBMS) and vice versa
- Experience working with Public Cloud platforms like Google Cloud, AWS, and Azure
- Experience creating and managing AWS compute services such as EC2 and Elastic Load Balancing, storage services such as S3 and EBS, and content delivery with Amazon CloudFront
- Hands-on experience on full life cycle implementation using MapReduce, CDH (Cloudera) and HDP (Hortonworks Data Platform)
- Experience using, designing and developing REST APIs, and building web applications with the MERN stack (MongoDB, Express.js, React, Node.js) along with HTML5/HTML, CSS3/CSS, JavaScript, jQuery, Bootstrap, JSON and AJAX
- Experienced with MVC design patterns in web frameworks such as Django and Flask, deploying applications on Heroku and containerizing applications with Docker
- Knowledge of data serialization and familiar with data formats including SequenceFile, Avro, Parquet, XML and JSON
- Strong in core Java, data structures, algorithm design, Object-Oriented Design (OOD) and Object-Oriented Programming (OOP) concepts, and Java components such as Collections, exception handling and the I/O system
- Demonstrated ability to communicate and gather requirements, partner with enterprise architects, business users, analysts and development teams to deliver rapid iteration of complex solutions
- Proficient in business intelligence reporting tools like Tableau, SAP and Looker
- Experience in Agile, Waterfall and Scrum development environments using Git, Docker and JIRA
TECHNICAL SKILLS
Programming Languages: C, C++, Java, Python, Scala, JavaScript
Development Approach: Agile/SCRUM, Waterfall
Big Data Technologies: MapReduce, Spark, Spark SQL, Spark Streaming, Elasticsearch, Kafka, Sqoop, Flume, Azkaban, Hive, Cassandra, Apache NiFi, Oozie, Storm, Flink, ZooKeeper, Pig, YARN, Airflow, Impala
AWS: EC2, SNS, SQS, VPC, Lambda, DynamoDB, RDS, Kinesis, Redshift, S3, ELB, CloudFront, EBS, EMR, Glue
Web Technologies: HTML5, CSS, JavaScript, XML, Angular, React, Node.js, Express.js, RESTful Services, Bootstrap, jQuery, JSON, Flask, Django, Spring
NoSQL Databases: MongoDB, Cassandra, Redis, HBase, Neo4j, Oracle NoSQL, Amazon DynamoDB, Couchbase
RDBMS Databases: MySQL, Oracle, Microsoft SQL Server, PostgreSQL, IBM DB2, Teradata, SQLite
Tools: GitHub, SVN, Microsoft Office, Eclipse, Jupyter, Hue, Docker, Heroku, Looker, Tableau, IntelliJ
PROFESSIONAL EXPERIENCE
Confidential, Piscataway, NJ
Big Data Developer
Responsibilities:
- Designed and implemented a data pipeline for an audience-targeting data management platform (DMP) using Kafka, Flume, Spark and Hive, and ingested the data into Elasticsearch, HBase and Redis
- Worked with Nginx and a Flume NG cluster to import real-time bidding data from the server-to-server integrations between the DMP and a Demand-Side Platform (DSP) into Kafka for real-time processing and HDFS for batch processing
- Converted raw data to the columnar Parquet format to reduce data processing time and improve data transfer efficiency across the network
- Used Spark Streaming with Kafka to perform real-time statistical analysis of business indicators and stored the computed results in Redis (sketched below)
- Developed Spark programs, creating RDDs and DataFrames and applying Spark transformations, actions and broadcasts with Scala and Spark SQL to process offline data from a variety of sources for data segmentation and profile building
- Worked closely with the data science team, using Spark GraphX to identify the same user across multiple devices and a GeoHash algorithm to solve the user geolocation identification problem
- Used Spark transformations and actions to merge newly collected user data with the unified user data and stored the merged user segmentation data in HBase so it can be expanded dynamically
- Used ECharts and the ELK stack (Elasticsearch, Logstash and Kibana) to visualize the data in HBase and generated audience profile reports based on the analysis for future research
- Worked in an Agile environment and communicated effectively with different levels of management
Environment: Apache Hadoop 2.5, YARN, Spark 2.3.2, Kafka 0.10.0.1, Flume 1.8.0, Hive 3.1.0, HBase 1.3.3, Elasticsearch 6.5.3, Redis 5.0, Logstash 6.5.3, Kibana 6.5.3, ECharts 4.1.0
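A minimal Scala sketch of the Spark Streaming plus Kafka to Redis flow described above. The broker address, topic name, Redis endpoint, CSV field layout and Redis key names are illustrative assumptions, not the production configuration.

```scala
// Illustrative sketch: Spark Streaming reads bid events from Kafka, aggregates a
// per-batch indicator per campaign and increments counters in Redis.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.kafka.common.serialization.StringDeserializer
import redis.clients.jedis.Jedis

object IndicatorStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dmp-indicator-streaming")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka:9092",               // assumed broker address
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "dmp-indicators",
      "auto.offset.reset"  -> "latest"
    )

    // "bid-events" is a hypothetical topic standing in for the real DMP topic.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("bid-events"), kafkaParams))

    // Count events per campaign id (assumed to be the first CSV field) in each batch.
    stream.map(_.value.split(",")(0))
      .countByValue()
      .foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          val jedis = new Jedis("redis-host", 6379)        // assumed Redis endpoint
          partition.foreach { case (campaign, count) =>
            jedis.hincrBy("indicator:impressions", campaign, count)
          }
          jedis.close()
        }
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Writing inside foreachPartition keeps one Redis connection per partition rather than one per record, which is the usual pattern for external sinks in Spark Streaming.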
Confidential, Riverside, CA
Data Engineer
Responsibilities:
- Worked closely with the web development team to implement a web application that allows users to create accounts, view nearby available bicycles on Google Maps, unlock the bicycles they want to use and send HTTP requests to the servers
- Used NGINX to receive HTTP requests from the user-facing iOS application and forward the data to the servers with load-balancing strategies
- Worked closely with the web development team to implement a microservice module in Java Spring Boot that collects user account information behind NGINX and stores it in a MongoDB cluster and a MySQL database
- Used Flume to monitor and collect real-time user behavior data such as location, riding duration and distance behind the NGINX HTTP load balancer, sinking the log data into Kafka message queues and HDFS
- Processed data from Kafka in real time with Spark Streaming, performing the necessary transformations and aggregations and storing the results in a MySQL database
- Wrote UDFs, UDAFs and UDTFs for ETL processes in Spark SQL and Spark Core, covering data processing and storage, to transform offline unstructured data in HDFS into structured data (sketched below)
- Utilized Sqoop to transfer data from HBase to HDFS
- Deployed Elasticsearch and Kibana in Docker on AWS to perform data indexing and visualization, helping the system owner make better business decisions
- Actively participated and provided constructive feedback during daily stand-up meetings and weekly iteration review meetings under Scrum development
Environment: Apache Hadoop 2.5, Apache Spark 2.1.3, Kafka 0.10.0.1, Sqoop 1.4.7, NGINX Plus R8, Flume 1.7.0, AWS, Zookeeper, iOS 11.4.1, MongoDB 3.4, Elasticsearch 6.2.1, Spring Boot 1.5.17
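A minimal Scala sketch of the kind of Spark SQL ETL referenced above: a UDF parses raw ride-log lines from HDFS into structured fields, the result is aggregated with Spark SQL and written to MySQL over JDBC. The log layout, HDFS path, table name and database credentials are hypothetical placeholders.

```scala
// Illustrative sketch: Spark SQL UDFs structure raw ride logs and the aggregate is
// persisted to MySQL via the JDBC data source.
import org.apache.spark.sql.SparkSession

object RideLogEtl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ride-log-etl").getOrCreate()

    // Hypothetical log layout: "userId|lat,lon|durationSeconds|distanceMeters"
    val raw = spark.read.textFile("hdfs:///data/ride-logs/")   // assumed HDFS path

    // UDFs that pull the user id and riding duration out of a raw log line.
    spark.udf.register("rideUser",     (line: String) => line.split("\\|")(0))
    spark.udf.register("rideDuration", (line: String) => line.split("\\|")(2).toLong)

    raw.toDF("line").createOrReplaceTempView("raw_rides")

    // Aggregate total riding time per user with Spark SQL.
    val summary = spark.sql(
      """SELECT rideUser(line) AS user_id,
        |       SUM(rideDuration(line)) AS total_duration_sec
        |FROM raw_rides
        |GROUP BY rideUser(line)""".stripMargin)

    // Assumed MySQL endpoint, table and credentials; replace with the real ones.
    summary.write.mode("append").format("jdbc")
      .option("url", "jdbc:mysql://mysql-host:3306/bikes")
      .option("dbtable", "ride_summary")
      .option("user", "etl").option("password", "secret")
      .save()
  }
}
```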
Confidential, Riverside, CA
Big Data Developer
Responsibilities:
- Implemented a real-time credit card fraud detection and analysis pipeline with Kafka, Spark Streaming, batch processing, Spark SQL, Cassandra, HBase and Airflow
- Developed Scala scripts and UDFs, using both DataFrames/Spark SQL and RDDs in Spark, for data aggregation, queries and writes, transforming large sets of semi-structured credit card transaction data (card information, merchant details and a fraud label) into structured data and loading it into Cassandra
- Worked with Cassandra, creating keyspaces and tables to load the cleaned transaction data
- Worked with the Spark ML pipeline API to process the structured transaction data in Cassandra and trained a machine learning model with the random forest algorithm (sketched below)
- Delivered real time credit card transaction data from multiple sources into Kafka messaging system
- Responsible for collecting incoming real-time credit card transaction data from Kafka, processing it with Spark Streaming and detecting fraud using the Spark ML library with the deployed model
- Stored the fraud transaction detection result data into HBase
- Developed a scalable web application with Java Spring Boot, jQuery and a Bootstrap dashboard to automate monitoring of HBase and raise alerts on the dashboard when fraud is detected from the real-time and batch data
- Automated the whole pipeline with Airflow scheduling, decreasing the pipeline run time by 49.5% and reducing data storage size by 99.7% by substituting Parquet files for the intermediate database
- Deployed the Spark code on EMR
- Worked in an Agile environment and communicated effectively with different levels of management
Environment: Apache Hadoop 2.5, Apache Spark 1.6.0/2.1.3, Kafka 0.10.0.1, Cassandra 2.2, EMR, HBase, Airflow 1.7.1.2, Spring Boot 1.3.0, Zookeeper, Bootstrap 3.3.7
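A minimal Scala sketch of training a random forest fraud model with the Spark ML pipeline API on transactions read from Cassandra. The keyspace, table, column names (including a 0/1 `is_fraud` label), feature list and model path are assumptions for illustration; the real schema and connector configuration may differ.

```scala
// Illustrative sketch: read structured transactions from Cassandra, assemble features
// and train a RandomForestClassifier inside a Spark ML Pipeline.
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler

object FraudModelTraining {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("fraud-rf-training").getOrCreate()

    // Assumed keyspace/table; requires the Spark Cassandra connector on the classpath.
    val txns = spark.read.format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "fraud", "table" -> "transactions"))
      .load()

    // Assumed numeric feature columns.
    val assembler = new VectorAssembler()
      .setInputCols(Array("amount", "merchant_risk", "hour_of_day"))
      .setOutputCol("features")

    val rf = new RandomForestClassifier()
      .setLabelCol("is_fraud")        // assumed 0/1 double label column
      .setFeaturesCol("features")
      .setNumTrees(100)

    val Array(train, test) = txns.randomSplit(Array(0.8, 0.2), seed = 42)
    val model = new Pipeline().setStages(Array(assembler, rf)).fit(train)

    val auc = new BinaryClassificationEvaluator()
      .setLabelCol("is_fraud")
      .evaluate(model.transform(test))
    println(s"Test AUC: $auc")

    model.write.overwrite().save("hdfs:///models/fraud-rf")   // assumed model path
  }
}
```

The saved PipelineModel can then be loaded in the Spark Streaming job so the same feature assembly and classifier are applied to live Kafka transactions.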
Confidential
Big Data Developer
Responsibilities:
- Implemented a pipeline to process and store streaming data from Twitter with Flume, Pig, Hive, Spark SQL, HBase and Oozie, covering both streaming ingestion and offline processing
- Created an application on the Twitter API developer portal and generated the corresponding access keys
- Created Flume agents to handle streaming data from Twitter and loaded the data into the Hadoop cluster
- Developed Scala scripts and UDFs, using both DataFrames/Spark SQL and RDDs in Spark, for data aggregation, queries and writes, transforming large sets of semi-structured tweet data into structured form (sketched below)
- Used Pig as an ETL tool to perform transformations, event joins and pre-aggregations before storing the data on HDFS
- Used Spark MLlib utilities such as classification, regression, clustering and collaborative filtering on the tweet data to analyze, identify and remove non-job-ad tweets
- Created Hive tables, loaded data and implemented various Hive optimization techniques such as dynamic partitions, buckets, map joins and parallel execution on the job-tweet data
- Dumped the data from HDFS to MySQL database and vice-versa using Sqoop
- Used the Oozie engine to create workflow and coordinator jobs that schedule and execute various Hadoop jobs such as Pig, Hive and Spark jobs and to automate Sqoop jobs
- Configured Oozie workflows to run multiple Hive and Spark jobs that trigger independently based on time and data availability
- Unit tested against a sample of raw data, improved performance and turned the pipeline over to production
Environment: Apache Hadoop 2.5, Apache Spark 1.1.1, Flume 1.5.2, Pig 0.14.0, Hive 0.14.0, HBase 0.94.22, Oozie 4.1.0, Sqoop 1.4.5, MySQL 5.7.5, Twitter API
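A minimal Scala sketch of the kind of tweet flattening described above, using the DataFrame API: raw JSON landed by the Flume Twitter source is reduced to the fields needed downstream and written to a Hive table. The landing path, selected fields and Hive table name are assumptions; the field names follow the standard Twitter JSON payload.

```scala
// Illustrative sketch: flatten raw tweet JSON from HDFS into a structured Hive table.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, to_timestamp}

object TweetFlatten {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("tweet-flatten")
      .enableHiveSupport()
      .getOrCreate()

    // Assumed landing directory written by the Flume HDFS sink.
    val raw = spark.read.json("hdfs:///flume/twitter/raw/")

    // Keep only the fields needed for job-ad analysis; one row per (tweet, hashtag).
    val tweets = raw.select(
      col("id_str").as("tweet_id"),
      col("user.screen_name").as("user"),
      col("text"),
      col("lang"),
      to_timestamp(col("created_at"), "EEE MMM dd HH:mm:ss Z yyyy").as("created_at"),
      explode(col("entities.hashtags.text")).as("hashtag"))

    // Assumed Hive database/table, partitioned by language.
    tweets.write.mode("overwrite")
      .partitionBy("lang")
      .saveAsTable("social.tweets_flat")
  }
}
```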
Confidential
Data Analyst
Responsibilities:
- Worked on large volumes of flight data, performing ETL processing with big data analytic tools including Spark, Hadoop, Hive, Pig and Impala
- Applied Scala scripts and UDFs, using both DataFrames/Spark SQL and RDDs in Spark, for batch processing of the airline data
- Developed Pig Latin scripts to transform the log data files and load them into HDFS
- Analyzed large data sets on HDFS with Impala queries and created views for business processing
- Created Hive tables, analyzed data with Hive queries, wrote Hive UDFs and worked on performance optimizations such as partitioning, bucketing, clustering, sampling, data compression and query tuning with Hive and Impala
- Converted raw data to the columnar Parquet format to reduce data processing time and improve data transfer efficiency across the network (sketched below)
- Connected Hive tables to Tableau and performed data visualization for reporting
- Used Git for version control and Jenkins for continuous integration
Environment: Apache Hadoop 2.2.0, Apache Spark 0.8.0, Hive 0.12.0, Pig 0.12.0, Impala 1.1.1, Tableau 8.0.5, Jenkins 2.4.1
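A minimal Scala sketch of the raw-to-Parquet conversion mentioned above, so downstream Hive and Impala queries scan a compressed, columnar format. The input path, header/schema inference and the year/month partition columns are illustrative assumptions.

```scala
// Illustrative sketch: convert raw CSV flight records to Snappy-compressed Parquet,
// partitioned for partition pruning in Hive/Impala.
import org.apache.spark.sql.SparkSession

object FlightsToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("flights-to-parquet").getOrCreate()

    // Assumed input path; a production job would declare an explicit schema
    // instead of inferring it.
    val flights = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/flights/raw/")

    // "year" and "month" columns are assumed to exist in the source data.
    flights.write.mode("overwrite")
      .option("compression", "snappy")
      .partitionBy("year", "month")
      .parquet("hdfs:///data/flights/parquet/")
  }
}
```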
Confidential
Responsibilities:
- Built a fully functional, scalable and secure web application for a book catalog using the Flask framework.
- Utilized a PostgreSQL database to allow users to register, log in, log out and perform CRUD operations.
- Designed and styled the web application using Bootstrap.
- Deployed the application on Heroku and restored the PostgreSQL database into Heroku using Amazon S3.
Environment: Flask 0.10, PostgreSQL 9.3, Bootstrap 3.1.0, Heroku, Amazon S3