Spark/Hadoop Engineer Resume
St Louis, MO
SUMMARY
- Spark/Hadoop developer with 3+ years of IT experience, including 2+ years on Hadoop, and strong programming experience in Scala and Java.
- Experience with the Big Data/Hadoop ecosystem: Spark, Hive, Sqoop, Kafka, Oozie, HBase, MapReduce, NiFi.
- In-depth understanding of Spark architecture; performed batch and real-time stream processing using Spark Core, Spark SQL, and Spark Streaming.
- Experienced in handling large datasets using Spark's in-memory capabilities, partitions, broadcast variables, accumulators, and efficient joins; used Scala to develop Spark applications (an illustrative sketch follows this summary).
- Tested and optimized Spark applications.
- Proficient in writing HiveQL for large datasets, using transactional and performance-efficient features such as UPSERTs, partitioning, bucketing, and windowing.
- Wrote custom UDFs, UDAFs, UDTFs, and generated optimized execution plans for faster performance.
- Imported data from relational databases to HDFS/Hive, performed operations and exported the results back using Sqoop.
- Wrote custom Kafka consumer programs in Java and implemented a Kafka -> Spark -> HDFS/S3 pipeline.
- Implemented NiFi data workflows in production for streaming and batch processing of micro-batches from multiple data sources; controlled and monitored flows through the NiFi web UI.
- Scheduled jobs and automated workflows using Oozie.
- Experienced with AWS using EMR; performed operations with EC2 instances, S3 buckets, RDS, Lambda, and Redshift for analytical workloads.
- Used HBase to work with large sets of structured, semi-structured and unstructured data coming from a variety of sources.
- Used Tableau to generate reports and created visualization dashboards.
- Experienced working with different file formats like Parquet, Avro, CSV, JSON, Text files.
- Worked with Big Data Hadoop distributions: AWS EMR, Cloudera.
- Developed MapReduce jobs in Java, modeling data-processing problems to fit the MapReduce programming paradigm.
- Followed Agile-Scrum model and used DevOps tools like GitLab, JIRA, Confluence, Jenkins.
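Illustrative sketch (not project code): a minimal Spark application in Scala showing the broadcast-join and accumulator techniques referenced in the summary; input paths and column names are hypothetical.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.broadcast

  object SummarySketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()

      // Hypothetical inputs: a large fact table and a small lookup table.
      val events = spark.read.parquet("hdfs:///data/events")
      val lookup = spark.read.parquet("hdfs:///data/country_codes")

      // Broadcasting the small table avoids shuffling the large one during the join.
      val enriched = events.join(broadcast(lookup), Seq("country_code"))

      // Accumulator counting malformed rows observed while filtering.
      val badRows = spark.sparkContext.longAccumulator("badRows")
      val cleaned = enriched.filter { row =>
        val ok = !row.isNullAt(row.fieldIndex("user_id"))
        if (!ok) badRows.add(1)
        ok
      }

      cleaned.write.mode("overwrite").parquet("hdfs:///data/enriched_events")
      spark.stop()
    }
  }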
TECHNICAL SKILLS
Hadoop/Big Data: Spark, Hive, Sqoop, Kafka, YARN, NiFi, HBase, Oozie, MapReduce, ZooKeeper
Programming: Scala, Java, SQL
Hadoop Distributions: Cloudera, Amazon EMR
Databases/Data Warehouses: Oracle, MySQL, HBase
Amazon Web Services: EMR, EC2, S3, Lambda, RDS, Redshift, IAM
Other tools & SDLC: Tableau, IntelliJ IDEA, Eclipse, SBT, Maven, PuTTY, JIRA, Confluence, Agile-Scrum
PROFESSIONAL EXPERIENCE
Confidential - St. Louis, MO
Spark/Hadoop Engineer
Responsibilities:
- Developed new Spark applications and modified existing ones in Scala to process large datasets through DataFrames and RDDs, applying a wide range of transformations and actions.
- Modified the existing identity-scoring algorithm and its JSON configuration files to suit business needs.
- Developed Spark SQL applications to perform complex operations on structured and semi-structured data stored as Parquet, JSON, and XML files in S3 buckets.
- Developed Scala scripts and UDFs using DataFrames/Datasets in Spark for aggregation, querying, and similarity matching across different types of datasets (see the sketch at the end of this section).
- Tuned the performance of Spark applications by setting an appropriate level of parallelism, tuning memory usage, and applying efficient join strategies.
- Implemented schema extraction for Parquet and Avro file formats when creating Hive tables.
- Used Sqoop to transfer data from EMR to MySQL (S3 -> Sqoop -> Hive (EMR staging)-> Sqoop -> MySQL).
- Performed unit and integration testing with mocked data.
- Experienced with file formats such as Parquet, Avro, JSON, and XML, and compression codecs such as Snappy for efficient storage, retrieval, and processing of files.
- Involved in a POC to develop a Kafka-based pipeline that subscribed to the relevant topics as clients made changes in the UI served by Apache Tomcat.
- Performed UPSERTs on data in the data lake (linking/unlinking patient records).
- Created Entity-Relationship diagrams for the relational database.
- Worked on AWS using EMR; performed operations with EC2 instances, S3 storage, RDS, Lambda, and Redshift for analytical workloads.
- Involved in client meetings, understanding business needs, gathering and analyzing functional requirements, tool selection discussions, attending on/off-shore meetings.
- Experienced with Agile Scrum methodology, GitLab, IntelliJ IDEA, Confluence, JIRA, Jenkins for the project.
Environment: Spark 2.2.0, Scala 2.11.8, Sqoop, Kafka, AWS (EMR, S3, RDS, Lambda, Redshift), IntelliJ IDEA, GitLab, Confluence, JIRA, Jenkins, Agile (Scrum).
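Illustrative sketch (not project code): the kind of Spark SQL work described in this role, reading Parquet and JSON from S3 and registering a Scala UDF for a similarity-style comparison; the bucket names, columns, and toy Jaccard scoring function are assumptions.

  import org.apache.spark.sql.SparkSession

  object S3SparkSqlSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("s3-sql-sketch").getOrCreate()

      // Hypothetical S3 locations for structured and semi-structured inputs.
      val patients = spark.read.parquet("s3://example-bucket/curated/patients/")
      val updates  = spark.read.json("s3://example-bucket/raw/patient_updates/")
      patients.createOrReplaceTempView("patients")
      updates.createOrReplaceTempView("updates")

      // Toy similarity UDF (Jaccard over name tokens), standing in for the real scoring logic.
      spark.udf.register("name_sim", (a: String, b: String) => {
        val x = Option(a).getOrElse("").toLowerCase.split("\\s+").filter(_.nonEmpty).toSet
        val y = Option(b).getOrElse("").toLowerCase.split("\\s+").filter(_.nonEmpty).toSet
        if (x.isEmpty && y.isEmpty) 0.0
        else x.intersect(y).size.toDouble / x.union(y).size.toDouble
      })

      val candidates = spark.sql(
        """SELECT p.patient_id, u.update_id,
          |       name_sim(p.full_name, u.full_name) AS score
          |FROM patients p JOIN updates u ON p.zip = u.zip
          |WHERE name_sim(p.full_name, u.full_name) > 0.8""".stripMargin)

      candidates.write.mode("overwrite").parquet("s3://example-bucket/scored/candidates/")
      spark.stop()
    }
  }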
Confidential - Dallas, TX.
Spark/Hadoop Developer
Responsibilities:
- Developed Spark applications using Scala.
- Used DataFrames/Datasets with Spark SQL to write SQL-style queries against large datasets.
- Built real-time streaming jobs with Spark Streaming to analyze incoming Kafka data over regular window intervals.
- Created a Kafka -> Spark -> HDFS data pipeline along with the team (sketched after this section).
- Collaborated with architects to design Spark equivalents of existing MapReduce models and migrated them to Spark using Scala.
- Tested and optimized Spark applications.
- Created Hive tables and had extensive experience with HiveQL.
- Executed Hive queries on Parquet tables stored in Hive to perform data analysis to meet the business requirements.
- Extended Hive functionality by writing custom UDFs, UDAFs, and UDTFs to process large datasets (a sample UDF sketch appears after this section).
- Performed Hive UPSERTs, partitioning, bucketing, and windowing operations, and wrote efficient queries for faster data processing.
- Imported and exported data between relational database systems and HDFS/Hive using Sqoop.
- Wrote custom Kafka consumer code and modified existing producer code in Java to push data to Spark Streaming jobs.
- Scheduled jobs and automated workflows using Oozie.
- Automated data movement using the NiFi dataflow framework for streaming and batch processing via micro-batches; controlled and monitored data flows through the web UI.
- Worked with the HBase database to handle large sets of structured, semi-structured, and unstructured data coming from different data sources.
- Exported analytical results to MS SQL Server and used Tableau to generate reports and visualization dashboards.
Environment: Cloudera, Spark 2.0, Hive, Hadoop, Java, Scala, Kafka, Sqoop, MapReduce, Oozie, Zookeeper, Tableau, Agile, Eclipse.
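Illustrative sketch (not project code): a Kafka -> Spark Streaming -> HDFS micro-batch pipeline in Scala, following the standard spark-streaming-kafka-0-10 direct-stream pattern; the broker addresses, topic name, consumer group, and output path are hypothetical.

  import org.apache.kafka.common.serialization.StringDeserializer
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka010.KafkaUtils
  import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
  import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

  object KafkaToHdfsSketch {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("kafka-to-hdfs-sketch")
      val ssc  = new StreamingContext(conf, Seconds(30))   // 30-second micro-batches

      // Hypothetical broker list, consumer group, and topic.
      val kafkaParams = Map[String, Object](
        "bootstrap.servers" -> "broker1:9092,broker2:9092",
        "key.deserializer" -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id" -> "events-etl",
        "auto.offset.reset" -> "latest",
        "enable.auto.commit" -> (false: java.lang.Boolean)
      )
      val stream = KafkaUtils.createDirectStream[String, String](
        ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

      // Keep only the message payloads and write each non-empty micro-batch to HDFS.
      stream.map(_.value)
        .foreachRDD { (rdd, time) =>
          if (!rdd.isEmpty())
            rdd.saveAsTextFile(s"hdfs:///data/events/batch-${time.milliseconds}")
        }

      ssc.start()
      ssc.awaitTermination()
    }
  }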
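Illustrative sketch (not project code): a minimal custom Hive UDF written in Scala against the classic org.apache.hadoop.hive.ql.exec.UDF API; the package, class name, and normalization logic are hypothetical stand-ins for the project's actual UDFs.

  package com.example

  import org.apache.hadoop.hive.ql.exec.UDF
  import org.apache.hadoop.io.Text

  // After packaging into a jar, register in Hive with:
  //   ADD JAR hive-udfs.jar;
  //   CREATE TEMPORARY FUNCTION normalize_phone AS 'com.example.NormalizePhone';
  class NormalizePhone extends UDF {
    // Hive resolves evaluate() by reflection; this one keeps digits only.
    def evaluate(input: Text): Text = {
      if (input == null) null
      else new Text(input.toString.replaceAll("[^0-9]", ""))
    }
  }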
Confidential
Hadoop/Java Developer
Responsibilities:
- Created Hive tables, loaded data, executed HQL queries and developed MapReduce programs to perform analytical operations on data and to generate reports.
- Created Hive internal and external tables, using a MySQL-backed metastore to store table schemas; wrote custom UDFs in Java.
- Moved data between MySQL and HDFS using Sqoop.
- Developed MapReduce jobs in Java for log analysis, analytics, and data cleaning.
- Wrote complex MapReduce programs to extract, transform, and aggregate terabytes of data.
- Designed E-R diagrams to work with different tables.
- Wrote SQL queries, stored procedures, PL/SQL blocks, triggers, and views on top of Oracle.
- Developed the application using Core Java, Multi-Threading, Collections, JMS, JSP, Servlet, Maven.
- Developed a multi-threaded archival job in Java using ExecutorService for thread pooling, with Callable jobs and Future tasks (a sketch of this pattern follows this section).
- Redesigned and improved tracking functionality with Java multi-threading using servlets, concurrent queues, and worker threads.
- Developed JUnit and mocking-based tests for various modules.
- Developed RESTful web services to fetch database data for use in the UI.
- Deployed the application on Apache Tomcat, applying OOP principles and design patterns.
- Involved in the implementation of the Software development life cycle (SDLC) that includes Development, Testing, Implementation, and Maintenance Support.
Environment: Java, Hive, Sqoop, MySQL, Multi-threading, JDK, JSP, JMS, Servlet, HTML, CSS, Eclipse, Tomcat, REST.
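Illustrative sketch (not project code): the ExecutorService/Callable/Future archival pattern described above. The role used Java; the same pattern is shown here in Scala over the java.util.concurrent API, with hypothetical directory paths and a hypothetical ArchiveTask class.

  import java.nio.file.{Files, Path, Paths, StandardCopyOption}
  import java.util.concurrent.{Callable, Executors, Future, TimeUnit}
  import scala.collection.JavaConverters._

  object ArchivalJobSketch {
    // One Callable per file: move it into the archive directory and report the result.
    final class ArchiveTask(source: Path, archiveDir: Path) extends Callable[String] {
      override def call(): String = {
        val target = archiveDir.resolve(source.getFileName)
        Files.move(source, target, StandardCopyOption.REPLACE_EXISTING)
        s"archived ${source.getFileName}"
      }
    }

    def main(args: Array[String]): Unit = {
      val inputDir   = Paths.get("/data/outbox")   // hypothetical source directory
      val archiveDir = Paths.get("/data/archive")  // hypothetical archive directory
      Files.createDirectories(archiveDir)

      val pool  = Executors.newFixedThreadPool(4)  // fixed thread pool for parallel moves
      val files = Files.list(inputDir).iterator().asScala.filter(Files.isRegularFile(_)).toList

      // Submit one Callable per file and collect the resulting Futures.
      val futures: List[Future[String]] = files.map(f => pool.submit(new ArchiveTask(f, archiveDir)))

      // Block on each Future so failures surface before shutdown.
      futures.foreach(f => println(f.get()))

      pool.shutdown()
      pool.awaitTermination(1, TimeUnit.MINUTES)
    }
  }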