Big Data Engineer Resume
Edison, NJ
SUMMARY
- Over 7 years of diversified and progressively challenging experience working as a software developer building Object-Oriented applications and Web-based enterprise applications.
- Experienced as a Full Stack Developer in the Big Data space, working in Hadoop, Scala, Spark and cloud environments.
- Strong experience in Core Java with a solid understanding of Object-Oriented concepts like Collections, Multi-Threading, Exception Handling and Generics.
- Hands-on experience in installing, configuring and using Hadoop ecosystem components like HDFS, MapReduce programming, Hive, YARN, HBase, ZooKeeper, Kafka and Spark.
- Experience in data pipeline and data warehouse solutions in AWS using ElasticSearch, Redshift, Lambda, DynamoDB, Kinesis, EMR, S3, RDS.
- Proficient in Scala programming with Spark including SparkSQL, Spark Streaming, MLlib, GraphX.
- Experience in data pipeline and data warehouse solutions in Azure using Data Factory, Stream Analytics, Synapse, Cosmos DB, SQL, Blob storage.
- Proficient in PySpark with different components including SparkSQL, Structured Streaming and MLlib.
- Excellent experience in designing and implementing Web, Client/Server and N-Tier distributed, cross-platform systems using Java/JavaEE technologies with Agile/SCRUM and TDD (Test-Driven Development).
- Expertise in various JavaEE APIs such as Servlet, EJB, Validation, Persistence, Transaction, Security, JMS, Batch, Resource, JAX-WS and JAX-RS.
- Expertise in various open-source frameworks like Struts2 and Spring, and ORM technologies like Hibernate and MyBatis.
- Expertise in various components in Spring like MVC, IOC/DI, AOP, Boot, Batch, Rest and Security.
- Experience in NoSQL databases including HBase (with Phoenix), Cassandra and MongoDB.
- Experience in RDBMS including Oracle and MySQL.
- Hands-on experience with cloud platforms including AWS and Azure.
- Proficient in ad-hoc queries with Hive, Impala (with Kudu) and Phoenix (with HBase).
- Hands-on experience in building real-time data pipelines using Kafka (with ZooKeeper), Spark Streaming and HBase.
- Experience in Kafka real-time Change Data Capture (CDC) using Spring XD.
- Experience in AWS including EMR, ElasticSearch, S3, RDS, DynamoDB, Kinesis, Redshift, Lambda.
- In-Depth knowledge of Machine Learning with Python.
- Experience in creating tables, partitioning, bucketing, loading and aggregating data using Hive.
- Experience in NoSQL Column-Oriented Databases like HBase and Cassandra and their integration with the Hadoop cluster.
- In-Depth knowledge of Algorithms and Data Structures.
- Excellent understanding of object-oriented programming with Python/Java and functional programming with Scala.
- Experience with Software Development Life Cycle (SDLC) from Waterfall to Agile (SCRUM) models.
- Excellent programming and analytical skills; quick learner of new technologies, self-motivated, focused and adaptive to new environments with strong technical and business communication skills.
TECHNICAL SKILLS
Programming Languages: Python 3, Java 8, Scala 2.x
Hadoop/Spark Ecosystem: Hadoop 2.x, Spark 2.x, Hive 2.x, HBase 2.2.0, NiFi 1.9.2, Sqoop 1.4.6, Flume 1.9.0, Kafka 2.3.0, Cassandra 3.11, YARN, Mesos 1.8.0, ZooKeeper 3.4.x
Database: Oracle, MySQL, SQL Server 2008
Java/J2EE: Servlet, JSP, Struts2, Spring, Spring Boot, Spring Batch, EJB, JDBC, Hibernate, MyBatis, Web Services, SOAP, Rest, RabbitMQ, MVC, HTML, CSS, JavaScript, jQuery, XML, JSON, Log4j, JUnit, EasyMock, Mockito
AWS: ElasticSearch, Redshift, Lambda, DynamoDB, Kinesis, EMR, S3, RDS
Azure: Data Factory, Stream Analytics, Synapse, Cosmos DB, SQL, Blob storage
Linux: Shell, Bash
Dev Tools: Git, JIRA, Maven, sbt, Vim, nano, pip, JUnit, Jenkins
Operating System: Windows, Linux, RedHat, CentOS, Ubuntu, macOS
PROFESSIONAL EXPERIENCE
Confidential, Edison NJ
Big Data Engineer
Responsibilities:
- Implemented ETL ingestion pipeline with Spark SQL from PostgreSQL, MongoDB, Kafka to HDFS and Hive.
- Participated in data warehouse modeling of fact and dimension tables for analytics.
- Performed and optimized HiveQL to achieve comprehensive business analysis.
- Orchestrated the ingestion and transformation pipelines with Airflow (see the Airflow sketch following this list).
- Migrated and upgraded existing pipelines with S3, Databricks and Redshift.
- Automated the CI/CD workflow using Jenkins and sbt.
- Cut Spark job runtime by roughly 1/5 by improving parallelism, minimizing shuffling and resolving data skew.
- Implemented an end-to-end log analytics pipeline using Kafka, Spark Structured Streaming and HBase (see the streaming sketch following this list).
- Achieved fault tolerance and exactly-once delivery semantics via checkpointing and idempotent writes.
- Built a monitoring and alerting dashboard using Spark's metrics system, Graphite and Grafana for statistics across multiple real-time pipelines.
- Followed Agile development practices, with Jira as the project management and issue-tracking tool.
- Built and maintained ETL pipelines serving internal teams (DS/BI) and external consumers (partners/clients), handling 120TB/day.
- (Data) Built real-time ETL ingestion pipelines on GCP using Pub/Sub, Dataflow (with Cloud Storage and Cloud Functions triggers) and BigQuery.
- Built and maintained compliance and data-quality check pipelines using Airflow, Pub/Sub and Cloud Functions.
- (DevOps) Automated the CI/CD workflow with CircleCI, sbt, Docker and Terraform.
- Built microservices for an internal monitoring platform using Flask on the back end.
- Contributed to the technical architecture design, documentation, and implementation.
- Gave company-wide training sessions to over 50 people to drive BigQuery adoption.
- (Infra) Maintained auto-scaling, multi-cloud elastic clusters for Spark workloads.
- Supported administration, provisioning, and configuration of CDP (Cloudera Data Platform) Hadoop cluster.
- Drove initiatives to reduce infrastructure costs around BigQuery and Google Cloud Storage.
- Authored a library of Python wrappers to perform DDL operations on top of BigQuery's API.
- Deployed and managed cloud services end to end with Terraform infrastructure as code.
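A minimal Airflow sketch of the kind of orchestration described above, assuming Airflow 2-style imports; the DAG id, schedule, task names and the spark-submit scripts (ingest_postgres.py, transform_hive.py) are illustrative placeholders, not the production pipeline.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="ingest_and_transform",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        ingest = BashOperator(
            task_id="ingest_from_postgres",
            # Hypothetical job script; stands in for the real ingestion step.
            bash_command="spark-submit ingest_postgres.py --date {{ ds }}",
        )
        transform = BashOperator(
            task_id="transform_to_hive",
            bash_command="spark-submit transform_hive.py --date {{ ds }}",
        )
        ingest >> transform  # run the transformation only after ingestion succeeds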
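A minimal PySpark sketch of the streaming pipeline described above (the production version was written in Scala). It assumes a Kafka topic named "events", an HBase table "events" with column family "d" reachable through an HBase Thrift server and the happybase client, an HDFS checkpoint path, and the spark-sql-kafka connector on the classpath; all names are illustrative.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StringType

    spark = SparkSession.builder.appName("log-analytics").getOrCreate()

    schema = (StructType()
              .add("event_id", StringType())
              .add("level", StringType())
              .add("message", StringType()))

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "events")
              .load()
              .selectExpr("CAST(value AS STRING) AS json")
              .select(from_json(col("json"), schema).alias("e"))
              .select("e.*"))

    def write_to_hbase(batch_df, batch_id):
        # Idempotent upserts: using event_id as the row key makes replays after a failure safe.
        def save_partition(rows):
            import happybase  # imported inside the function so executors import it locally
            conn = happybase.Connection("hbase-thrift-host")  # illustrative host
            table = conn.table("events")
            for r in rows:
                if r["event_id"] is None:
                    continue  # skip records that failed JSON parsing
                table.put(r["event_id"].encode(),
                          {b"d:level": (r["level"] or "").encode(),
                           b"d:message": (r["message"] or "").encode()})
            conn.close()
        batch_df.foreachPartition(save_partition)

    query = (events.writeStream
             .foreachBatch(write_to_hbase)
             .option("checkpointLocation", "hdfs:///checkpoints/log-analytics")  # fault tolerance
             .outputMode("append")
             .start())
    query.awaitTermination()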
Environment: Azure, GCP, AWS, S3, Databricks, Redshift, Python 3, Scala 2.11.0, Spark 2.x, Spark metrics system, Graphite, Grafana, Hive 2.3.5, HDFS, HBase, HiveQL, Kafka 2.3.0, Spark SQL, PostgreSQL, MongoDB, CircleCI, sbt, Docker, Terraform, Cloudera Data Platform
Confidential, Bridgewater, NJ
Big Data Engineer/Java Developer
Responsibilities:
- Developed Azure-based batch processing pipeline using Databricks, Data Factory (ADF) and Synapse.
- Migrated structured data from MySQL servers to Azure SQL.
- Leveraged Synapse as data warehouse solution and analytics platform.
- Developed stream processing pipelines using Azure Event Hubs, Databricks, Azure Stream Analytics and Cosmos DB.
- Migrated high volume data from legacy storage to Azure Data Lake Storage (ADLS) and Blob Storage.
- Used Spring Boot, the Spring MVC module and REST to develop microservices.
- Reviewed the existing system, analyzed the new requirements and designed the new services.
- Standardized the request and response formats of the services so the team could work efficiently.
- Developed POJO objects and used MyBatis as the Object-Relational Mapping (ORM) tool to access the persistent data from Oracle.
- Monitored and optimized data pipelines using Azure Monitor.
- Developed and modified Spark jobs (in Python and Scala) in multiple environments, including Azure Databricks, AWS EMR, and Hadoop.
- Built an ETL batch processing pipeline using Spark SQL (with PySpark) and loaded the outputs to HDFS and Hive tables.
- Processed streaming data using Structured Streaming (with Scala and Python) and the Dataset/DataFrame API.
- Deployed CI/CD automation using Jenkins, Ansible and Terraform.
- Developed Hive queries and UDFs to analyze/transform the data in HDFS.
- Involved in creating Hive tables, loading them with data and writing Hive queries that run internally as MapReduce jobs.
- Designed and implemented partitioning (static and dynamic) and bucketing in Hive (see the sketch following this list).
- Applied Hive join optimization techniques and best practices when writing HiveQL scripts.
- Applied structure to large amounts of unstructured data, created and loaded tables in Hive.
- Manipulated tables in Hive and performed and optimized ad-hoc queries with HiveQL to achieve comprehensive data analysis.
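A minimal PySpark sketch of the kind of batch ETL and Hive partitioning/bucketing described above; the source path, database, table and column names are illustrative assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_date, col

    spark = (SparkSession.builder
             .appName("batch-etl")
             .enableHiveSupport()     # use the Hive metastore for saveAsTable
             .getOrCreate())

    orders = (spark.read.json("hdfs:///raw/orders/")            # illustrative input path
              .withColumn("order_date", to_date(col("created_at")))
              .filter(col("amount") > 0))                       # basic data-quality filter

    (orders.write
     .mode("overwrite")
     .format("parquet")
     .partitionBy("order_date")       # partition column for pruning by date
     .bucketBy(8, "customer_id")      # bucketing for faster joins and sampling
     .sortBy("customer_id")
     .saveAsTable("analytics.orders_fact"))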
Environment: Azure, AWS, Python 3, Scala 2.11.0, Spark 2.x, Hive 2.3.5, Kafka 2.3.0, ZooKeeper 3.5.5, Spring Boot, Spring Security
Confidential, Chicago, IL
Hadoop Developer
Responsibilities:
- Ingested initial high-volume (billion-record) data using NiFi from relational databases and local file systems into HDFS.
- Leveraged NiFi's REST API to automate the creation and monitoring of new ingestion pipelines.
- Performed complex ANSI SQL in the Snowflake data warehouse to analyze large volumes of data.
- Constructed classification and regression models in an Anaconda environment (see the sketch following this list).
- Performed data modeling in Python using machine learning frameworks including scikit-learn, Keras and PySpark.
- Performed and optimized ad-hoc queries with HiveQL to achieve comprehensive data analysis.
- Loaded processing outputs to HBase for scalable storage and fast queries.
- Leveraged Python (PySpark and Pandas) script to automate local file processing and validation per requirement.
- Validated data in HBase and Titan (with Gremlin).
- Cooperated with the data science team to extract insights from large volumes of data.
- Built real-time ETL pipelines using Kafka (with ZooKeeper), Spark Streaming (with Scala) and HBase.
- Developed Kafka messaging system, collected events and brokered that data to real-time analytics clusters and data persistence clusters (HDFS).
- Utilized Amazon Elasticsearch to analyze and visualize large amounts of data.
- Used Spark as a drop-in replacement for Hadoop MapReduce jobs to answer queries in a much shorter time.
- Leveraged Spark (with Scala) and Spark SQL to speed analysis across internal and external data sources, and generated batch processing reports.
- Debugged and identified issues in Hadoop jobs reported by QA by configuring them to run against the local file system.
- Involved in evaluation and analysis of Hadoop cluster and different big data analytic tools including Pig, HBase database and Sqoop.
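A minimal scikit-learn sketch of the kind of classification modeling done in the Anaconda environment; the synthetic dataset, estimator and hyperparameters are illustrative, not the actual business data or model.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Synthetic stand-in for the real feature matrix and labels.
    X, y = make_classification(n_samples=5000, n_features=20, n_informative=8, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42)
    model.fit(X_train, y_train)

    # Hold-out evaluation: precision/recall/F1 per class.
    print(classification_report(y_test, model.predict(X_test)))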
Environment: Hadoop 2.9.2, Hive 2.3.5, Spark 2.4.3, Scala 2.12.0, MapReduce, HiveQL, Amazon Elasticsearch, HDFS
Confidential, Mattoon, IL
Spark Developer
Responsibilities:
- Built real-time data pipelines using Kafka (with ZooKeeper), Spark Streaming (with Scala) and HBase.
- Developed Kafka messaging system, collected events generated by data network (e.g., weblogs) and brokered that data to real-time analytics clusters and data persistence clusters (HDFS).
- Developed and optimized a high-throughput, low-latency multi-threaded Kafka producer (see the sketch following this list).
- Provided a comprehensive comparison among relational database, NoSQL database, object store and distributed file system based on OLAP and OLTP business requirement.
- Cooperated with data science team, utilized Spark Streaming and MLlib running predictive models on streaming data for real-time analytics.
- Developed HQL queries to implement select, insert and update operations against the database by creating HQL named queries.
- Used Sqoop to import data into HDFS and Hive from other data systems.
- Developed and maintained large-scale distributed data platforms, with experience in data warehouses, data marts and data lakes.
- Used Hive to analyze data ingested into HBase by using Hive-HBase integration and compute various metrics for reporting on the dashboard.
- Processed streaming data using Structured Streaming with the Dataset API.
- Loaded stream processing outputs to HBase for scalable storage and fast queries.
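A minimal sketch of a multi-threaded Kafka producer using the kafka-python client; the broker address, topic name, batching settings and event shape are illustrative assumptions, not the production configuration.

    import json
    import threading
    from queue import Queue
    from kafka import KafkaProducer

    # A single KafkaProducer instance is thread-safe and can be shared across threads.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        linger_ms=20,          # small batching delay for higher throughput
        batch_size=64 * 1024,
        acks="all",            # wait for full replication to avoid data loss
    )

    events = Queue()

    def worker():
        while True:
            event = events.get()
            producer.send("weblogs", value=event)  # illustrative topic name
            events.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(4)]
    for t in threads:
        t.start()

    # Example usage: enqueue events collected from the data network (e.g., weblogs).
    for i in range(1000):
        events.put({"event_id": i, "path": "/index.html"})

    events.join()
    producer.flush()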
Environment: Scala 2.12, HBase 2.2.0, Hadoop 2.9.2, Spark 2.4.3, Phoenix 5.0.0, Kafka 2.3.0, ZooKeeper 3.5.5
Confidential
Hadoop Developer/Data Analyst
Responsibilities:
- Developed various Big Data workflows using custom MapReduce, Pig, Hive, Sqoop, and Flume.
- Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs for data cleaning and preprocessing; implemented complex MapReduce programs to perform map-side joins using the distributed cache (see the sketch following this list).
- Developed MapReduce jobs using Java for data transformations.
- Developed different components of the system, such as Hadoop processes involving MapReduce and Hive.
- Migrated ETL processes from Oracle to Hive to simplify data manipulation.
- Responsible for developing data pipelines using Sqoop, MapReduce and Hive to extract data from weblogs and store the results for downstream consumption.
- Worked with HiveQL on large volumes of log data to perform trend analysis of user behavior across various online modules.
- Used Sqoop to export data back to relational databases for business reporting.
- Involved in creating Hive and Pig tables, loading data, and writing Hive queries and Pig scripts.
- Involved in Hadoop cluster administration, including adding and removing cluster nodes, capacity planning, performance tuning and cluster monitoring.
- Developed Hive queries and UDFs to analyze/transform the data in HDFS.
- Designed and implemented partitioning (static and dynamic) and bucketing in Hive.
- Used Sqoop to efficiently transfer data between databases and HDFS and used Flume to stream the log data from servers.
- Implemented Flume to ingest streaming log data and aggregate it into HDFS.
- Involved in HDFS maintenance and loading of structured and unstructured data.
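A minimal Hadoop Streaming sketch, in Python, of the map-side join pattern mentioned above; the original jobs were written in Java, and the users.tsv lookup file, field layout and streaming invocation are illustrative assumptions.

    #!/usr/bin/env python
    # Map-side join mapper. Assumes a small "users.tsv" lookup file shipped to each
    # task via the distributed cache, e.g.:
    #   hadoop jar hadoop-streaming.jar -files users.tsv -mapper mapper.py ...
    import sys

    # Load the small side of the join into memory once per mapper task.
    users = {}
    with open("users.tsv") as f:
        for line in f:
            user_id, name = line.rstrip("\n").split("\t")
            users[user_id] = name

    # Join each web-log record against the in-memory lookup table.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue                  # drop malformed records (data cleaning)
        user_id, url, ts = fields[0], fields[1], fields[2]
        name = users.get(user_id, "UNKNOWN")
        print("\t".join([user_id, name, url, ts]))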
Environment: Hadoop, Java, Cloudera Manager, Linux (RedHat, CentOS, Ubuntu), MapReduce, HBase, Sqoop, Pig, HDFS, Flume, Python