Big Data Engineer Resume
Chicago, IL
SUMMARY:
- Big Data Developer with extensive experience across the finance, e-commerce, and insurance industries.
- In-depth knowledge of Hadoop, including HDFS, YARN, MapReduce, and Hive.
- Proficient in Scala programming with Spark, including Spark SQL, Spark Streaming, MLlib, and GraphX.
- Hands-on experience in stream processing with Spark Streaming, Storm, and Flink.
- Experience in NoSQL databases, including HBase (with Phoenix), Cassandra, and MongoDB.
- Experience in RDBMS including Oracle and MySQL.
- Hands-on experience with the Amazon S3 object store service.
- Proficient in ad-hoc queries with Hive, Impala (with Kudu), and Phoenix (with HBase).
- In-depth knowledge of the data ingestion tools NiFi, Sqoop, and Flume.
- Hands-on experience in building real-time data pipelines using Kafka (with Zookeeper), Spark Streaming, and HBase.
- Experience with real-time Change Data Capture (CDC) into Kafka using Spring XD.
- Experience in AWS including EMR, Elasticsearch, S3, RDS, DynamoDB, Kinesis, Redshift, Lambda.
- In-depth knowledge of machine learning with Python.
- In-depth knowledge of algorithms and data structures.
- Excellent understanding of object-oriented programming with Java and functional programming with Scala.
- Excellent programming, analytical, communication, and interpersonal skills; a fast learner and a good team player.
TECHNICAL SKILLS
Hadoop/Spark Ecosystem: Hadoop 2.x, Spark 2.x, Hive 2.x, HBase 2.2.0, NiFi 1.9.2, Sqoop 1.4.6, Flume 1.9.0, Kafka 2.3.0, Yarn 1.17.3, Mesos 1.8.0, Zookeeper 3.4.x
Database: Oracle, MySQL, HBase 2.2.0, Cassandra 3.11
AWS: S3, RDS, DynamoDB, Kinesis, EMR, Elasticsearch, Redshift, Lambda
Programming Language: Java 8, Python 3, Scala 2.x
Software/Framework: Git, JIRA, Maven, sbt, JUnit, Jenkins, Spring, IntelliJ IDEA, Eclipse
Linux: Shell, Bash, Vim, nano, APT, Wget, pip
PROFESSIONAL EXPERIENCE:
Confidential
Big Data Engineer
Responsibilities:
- Ingested initial high-volume data (billions of records) from ERP systems into MySQL, NoSQL databases, Amazon S3, and HDFS.
- Deployed AWS Data Pipeline and built AWS Lambda functions to trigger execution in response to Amazon S3 event notifications.
- Generated batch processing reports using MapReduce and Spark and loaded the outputs into databases.
- Optimized structured data processing using Spark SQL and Structured Streaming.
- Performed and optimized ad-hoc queries with Hive, Impala (with Kudu), and Phoenix (with HBase) for comprehensive data analysis.
- Built real-time data pipelines using Kafka (with Zookeeper), Spark Streaming, and HBase (sketched below).
- Brokered real-time streaming data to data persistence clusters (mainly HDFS) for further batch processing and ad-hoc queries.
- Cooperated with data science teams running risk detection algorithms on streaming data.
- Loaded stream processing outputs to HBase for scalable storage and fast query.
- Used Git for version control and Maven for project management.
Environment: Scala 2.12.0, NiFi 1.9.2, HBase 2.2.0, Hadoop 2.9.2, AWS, Spark 2.4.3, Hive 2.3.5, Impala 3.2.0, Phoenix 5.0.0, Kafka 2.3.0, Zookeeper 3.5.5
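A minimal sketch of the kind of Kafka -> Spark Streaming -> HBase pipeline described above, assuming the spark-streaming-kafka-0-10 integration and the HBase Java client; the broker address, topic, table, and column-family names are illustrative placeholders rather than production values:

```scala
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object EventPipeline {
  def main(args: Array[String]): Unit = {
    // 5-second micro-batches over a direct Kafka stream.
    val ssc = new StreamingContext(new SparkConf().setAppName("event-pipeline"), Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka-broker:9092", // placeholder broker address
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "event-pipeline",
      "auto.offset.reset"  -> "latest"
    )
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Write each micro-batch to HBase, opening one connection per partition.
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val conn  = ConnectionFactory.createConnection()
        val table = conn.getTable(TableName.valueOf("events"))
        records.foreach { rec =>
          // Fall back to partition:offset when a record has no key.
          val rowKey = Option(rec.key).getOrElse(s"${rec.partition}:${rec.offset}")
          val put = new Put(Bytes.toBytes(rowKey))
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes(rec.value))
          table.put(put)
        }
        table.close()
        conn.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Opening the HBase connection inside foreachPartition keeps connection handling on the executors rather than serializing it from the driver.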
Confidential, Chicago, IL
Hadoop Developer
Responsibilities:
- Ingested initial high-volume data (billions of records) using NiFi from relational databases and local file systems into HDFS.
- Leveraged NiFi's REST API to automate the creation and monitoring of new ingestion pipelines.
- Applied structure to large amounts of unstructured data and created and loaded tables in Hive.
- Manipulated Hive tables and performed and optimized ad-hoc HiveQL queries for comprehensive data analysis.
- Utilized Amazon Elasticsearch Service to analyze and visualize large amounts of data.
- Used Spark as a drop-in replacement for Hadoop MapReduce jobs to answer queries in a much shorter time.
- Leveraged Spark (with Scala) and Spark SQL to speed up analysis across internal and external data sources and generate batch processing reports (sketched below).
Environment: Hadoop 2.9.2, Hive 2.3.5, Spark 2.4.3, Scala 2.12.0
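A minimal sketch of using Spark SQL with Hive support in place of a MapReduce report job, as described above; the database, table, and column names are illustrative placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object HiveBatchReport {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() lets Spark SQL read the same metastore tables previously queried with HiveQL.
    val spark = SparkSession.builder()
      .appName("hive-batch-report")
      .enableHiveSupport()
      .getOrCreate()

    // Placeholder database/table name.
    val orders = spark.table("sales.orders")

    val report = orders
      .filter("order_date >= '2019-01-01'")
      .groupBy("region")
      .agg(sum("amount").as("total_amount"))
      .orderBy("region")

    // Persist the batch report back to Hive for downstream ad-hoc queries.
    report.write.mode("overwrite").saveAsTable("reports.sales_by_region")

    spark.stop()
  }
}
```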
Confidential, Chicago, IL
Spark Developer
Responsibilities:
- Built real-time data pipelines using Kafka (with Zookeeper), Spark Streaming (with Scala) and HBase.
- Developed a Kafka messaging system that collected events generated by the data network (e.g., weblogs) and brokered the data to real-time analytics clusters and data persistence clusters (HDFS).
- Developed and optimized a high-throughput, low-latency multi-threaded Kafka producer (sketched below).
- Provided a comprehensive comparison of relational databases, NoSQL databases, object stores, and distributed file systems against OLAP and OLTP business requirements.
- Cooperated with the data science team, using Spark Streaming and MLlib to run predictive models on streaming data for real-time analytics.
- Optimized structured stream processing using Structured Streaming with the Dataset API.
- Loaded stream processing outputs to HBase for scalable storage and fast query.
Environment: Scala 2.12, HBase 2.2.0, Hadoop 2.9.2, Spark 2.4.3, Phoenix 5.0.0, Kafka 2.3.0, Zookeeper 3.5.5
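A minimal sketch of a multi-threaded Kafka producer along the lines described above, assuming the standard Kafka Java client used from Scala; the broker address, topic name, and synthetic messages are placeholders:

```scala
import java.util.Properties
import java.util.concurrent.{Executors, TimeUnit}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object WeblogProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker:9092") // placeholder broker
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    // Throughput-oriented settings: a small linger to batch records, larger batches, compression.
    props.put(ProducerConfig.LINGER_MS_CONFIG, "5")
    props.put(ProducerConfig.BATCH_SIZE_CONFIG, (64 * 1024).toString)
    props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4")
    props.put(ProducerConfig.ACKS_CONFIG, "1")

    // A single KafkaProducer instance is thread-safe and shared across the sender threads.
    val producer = new KafkaProducer[String, String](props)
    val pool = Executors.newFixedThreadPool(4)

    (0 until 4).foreach { worker =>
      pool.submit(new Runnable {
        override def run(): Unit = {
          // In a real pipeline each worker would read weblog events from its source;
          // synthetic messages stand in for them here.
          (0 until 100000).foreach { i =>
            producer.send(new ProducerRecord[String, String]("weblogs", s"$worker-$i", s"event-$i"))
          }
        }
      })
    }

    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.MINUTES)
    producer.close()
  }
}
```

Sharing one producer across threads and letting linger.ms and batch.size drive batching is the usual way to trade a few milliseconds of latency for higher throughput.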
Confidential
Data Analyst
Responsibilities:
- Performed extensive SQL queries to achieve comprehensive data analysis.
- Cleansed and preprocessed the data using pandas (Python) for further modeling and analysis.
- Cooperated with the data science team and applied machine learning algorithms (classification, clustering, regression) using scikit-learn.
- Interpreted and visualized the analysis results using Tableau.
Environment: Python 3, pandas 0.24.0, scikit-learn 0.21.3, MySQL, Tableau