Sr. Hadoop/Spark Developer Resume

Milwaukee, WI

PROFESSIONAL SUMMARY:

  • 9+ years of experience in Information Technology, including 5+ years in Big Data technologies such as Hadoop and Spark.
  • Excellent understanding of Hadoop architecture and its components, including the Spark ecosystem (Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX), HDFS, MapReduce, Pig, Sqoop, Kafka, Hive, Cassandra, HBase, Oozie, ZooKeeper, Flume, Impala, HCatalog, Storm, Tez, and YARN concepts such as the Resource Manager and Node Manager (Hadoop 2.x).
  • Designed Hive queries and Pig scripts to perform data analysis, data transfer, and data distribution, implementing partitioning, bucketing, and joins.
  • Expertise in writing custom UDFs extending Pig and Hive core functionality; hands-on experience with the ORC, Avro, and Parquet file formats.
  • Developed Spark jobs in Scala in the test environment for faster testing and data processing, and used Spark SQL to query Hive tables from Spark for faster processing. Performed map-side joins on RDDs, Spark SQL, and DataFrames.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved the results in Parquet format on HDFS (see the sketch after this summary).
  • Hands-on experience with Amazon Web Services (AWS) cloud services such as EC2, S3, and EMR; involved in ETL, data integration, and migration.
  • Exported data to databases such as Teradata (data warehouse), SQL Server, and Cassandra using Sqoop, and worked with databases including Snowflake, Teradata, HBase, MongoDB, Cassandra, MySQL, and Oracle.
  • Experience working with Cloudera and Hortonworks distributions.
  • Used Maven for the build process.
  • Extensive experience importing and exporting data using stream-processing platforms such as Flume and Kafka.
  • Experience using ZooKeeper and the Oozie workflow scheduler to manage Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
  • Worked on Java concepts such as multithreading and collections.
  • Created user interfaces using HTML, CSS, and JavaScript.
  • Used JDBC drivers to connect to the backend Oracle database.
  • Developed Servlets and JavaBeans for communication between client and server.
  • Good understanding of and experience with the Agile and Waterfall methodologies of the Software Development Life Cycle (SDLC).
  • Strong analytical, communication, and problem-solving skills, and a keen interest in learning new technical and functional skills.
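
A minimal Scala sketch of the Kafka-to-Parquet pattern described above, assuming the spark-streaming-kafka-0-10 integration; the broker address, topic name, consumer group, batch interval, and HDFS path are illustrative placeholders rather than details from any of the projects below.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object KafkaToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KafkaToParquet").getOrCreate()
    val ssc   = new StreamingContext(spark.sparkContext, Seconds(30))

    // Placeholder Kafka connection settings
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "feed-consumers",
      "auto.offset.reset"  -> "latest")

    // Direct stream over a hypothetical topic carrying JSON events
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    stream.map(_.value).foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        import spark.implicits._
        // Parse each micro-batch of JSON strings into a DataFrame and append as Parquet on HDFS
        val df = spark.read.json(rdd.toDS())
        df.write.mode("append").parquet("hdfs:///data/events/parquet") // placeholder path
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```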

TECHNICAL SKILLS:

Big Data Technologies: Hadoop, MapReduce, HDFS, Hive, Pig, Spark, YARN, ZooKeeper, Sqoop, Oozie, Flume, Impala, HBase, Kafka, Storm, Amazon AWS, Cloudera and Hortonworks

Build and Version Control Tools: Git, Ant, SVN, Maven

Hadoop Distributions: Cloudera, Hortonworks, Amazon EMR, EC2.

Programming Languages: C, C++, Core Java, shell scripting, Scala.

Databases: RDBMS, MySQL, Oracle, Microsoft SQL Server, Teradata SQL, DB2, PL/SQL, Cassandra, MongoDB, Snowflake, HBase.

IDE and Tools: Eclipse, NetBeans, Tableau, Microsoft Visual Studio

Operating System: Windows, Linux/Unix.

Web and Scripting Technologies: JSP & Servlets, JavaScript, XML, HTML, Python, Shell Scripting.

Application Servers: Apache Tomcat, WebSphere, WebLogic.

Methodologies: Agile, SDLC, Waterfall.

Web Services: RESTful, SOAP.

ETL Tools: Talend, Informatica.

Others: Solr, Tez, Cloudbreak, Atlas, Falcon, Ambari, Ambari Views, Ranger, Knox.

PROFESSIONAL EXPERIENCE:

Confidential, Milwaukee, WI

Sr. Hadoop/Spark Developer

Responsibilities:

  • Ingested gigabytes of data from S3 buckets into tables in the Snowflake database.
  • Ingested data into Teradata, a relational data warehouse.
  • Created Sqoop scripts to import/export data between RDBMS sources and the S3 data store.
  • Pulled raw data, in JSON format, from the data lake.
  • Developed Spark applications in Scala to enrich this data by merging it with user profile data.
  • Developed tokenization applications using Spark with a Java framework.
  • Developed Spark/Scala scripts for Absolute Data Quality checks.
  • Involved in data cleansing, event enrichment, data aggregation, de-normalization, and data preparation needed for machine learning and reporting.
  • Used the Split framework, which is built on Spark/Scala scripts.
  • Used an MPP loader, written in Python, to ingest data into tables.
  • Stored data in Parquet, a columnar format.
  • Used the Spark Scala API to implement batch-processing jobs.
  • Troubleshot Spark applications to improve error tolerance.
  • Fine-tuned Spark applications/jobs to improve efficiency and overall processing time for the pipelines.
  • Utilized Spark's in-memory capabilities to handle large datasets.
  • Experienced in working with EMR clusters and S3 in the AWS cloud.
  • Created tables in Snowflake, loading and analyzing data using Spark/Scala scripts; implemented partitioning and dynamic partitions (a sketch of the Snowflake write path follows this list).
  • Involved in continuous integration of the application using Jenkins.
  • Used Git for version control and Maven as the build tool.
  • Followed Agile methodologies to analyze, define, and document the applications supporting functional and business requirements.
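
A minimal sketch of the S3-to-Snowflake load path described in this role, assuming the Snowflake Spark connector; the bucket paths, connection options, join key, and table names are hypothetical placeholders.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object S3ToSnowflake {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("S3ToSnowflake").getOrCreate()

    // Raw JSON landed in the data lake (placeholder bucket/prefix)
    val raw = spark.read.json("s3://data-lake/raw/events/")

    // Example enrichment: join against a user-profile data set already in the lake
    val profiles = spark.read.parquet("s3://data-lake/curated/user_profiles/")
    val enriched = raw.join(profiles, Seq("user_id"), "left")

    // Snowflake connector options; every value here is a placeholder
    val sfOptions = Map(
      "sfURL"       -> "xy12345.snowflakecomputing.com",
      "sfUser"      -> sys.env.getOrElse("SNOWFLAKE_USER", "loader"),
      "sfPassword"  -> sys.env.getOrElse("SNOWFLAKE_PASSWORD", ""),
      "sfDatabase"  -> "ANALYTICS",
      "sfSchema"    -> "PUBLIC",
      "sfWarehouse" -> "LOAD_WH")

    // Append the enriched DataFrame into a Snowflake table through the connector
    enriched.write
      .format("net.snowflake.spark.snowflake")
      .options(sfOptions)
      .option("dbtable", "ENRICHED_EVENTS")
      .mode(SaveMode.Append)
      .save()

    spark.stop()
  }
}
```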

Environment: AWS Elastic MapReduce, Spark, Scala, Python, Jenkins, Amazon S3, Sqoop, Teradata, Snowflake DB, Jupyter Notebook, Git, Maven.

Confidential, Minneapolis, MN

Sr. Hadoop Spark Developer

Responsibilities:

  • Worked on different file formats such as Sequence files, XML, JSON, and Map files using MapReduce programs.
  • Made key decisions on the polling interval for stream processing and which Hadoop stack components to use for better performance.
  • Proposed and implemented a solution to a long-standing issue with ordering data in Kafka queues.
  • Designed and implemented an ETL framework with Sqoop, Pig, and Hive to automate frequently bringing in data from the source and making it available for consumption.
  • Worked on importing and exporting data into HDFS and Hive using Sqoop, and built analytics on Hive tables using Hive Context.
  • Developed Oozie workflows to automate the tasks of loading data into HDFS and pre-processing it with Pig.
  • Loaded and transformed large sets of semi-structured and unstructured data into HBase and Hive.
  • Implemented the MapReduce programming model with XML, JSON, and CSV file formats; used SerDe JARs to load JSON- and XML-format data coming from Kafka queues into Hive tables.
  • Implemented UDFs for Hive extending the GenericUDF, UDTF, and UDAF base classes to change time zones, implement business logic, and extract required parameters according to the business specification.
  • Extensive working knowledge of partitioning, UDFs, performance tuning, and compression-related properties on Hive tables. Developed UNIX shell scripts for creating reports from Hive data.
  • Implemented Spark scripts in Python to extract the required data from the data sets and store it on HDFS.
  • Developed Spark scripts and Python functions that perform transformations and actions on data sets.
  • Configured Spark Streaming in Python to receive real-time data from Kafka and store it on HDFS.
  • Experienced in building analytics on top of Spark using the spark.ml machine learning library.
  • Involved in optimizing Hive queries using map-side joins, partitioning, bucketing, and indexing (a broadcast-join sketch follows this list).
  • Involved in tuning the Spark modules with various memory and resource-allocation parameters, setting the right batch interval time, and varying the number of executors to meet increasing load over time.
  • Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
  • Used Hue for UI-based Pig script execution, Tidal scheduling, and creating tables in Hive.
  • Created Pig Latin scripts to sort, group, join, and filter the enterprise-wide data.
  • Involved in the iteration planning process under the Agile Scrum methodology.
  • Extensive knowledge of Apache NiFi; used and configured different processors to pre-process incoming data and make it uniform and formatted according to the requirements.
  • Implemented unit testing in Java for Pig and Hive applications.
  • Hands-on experience with AWS cloud services such as Redshift clusters and Route 53 domain configuration.
  • Extensively used UNIX shell scripting to pull logs from the servers and monitor them.
  • Worked with the Data Science team to gather requirements for various data mining projects.
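
A small Scala sketch of the map-side join idea mentioned above, expressed here as a Spark SQL broadcast join over Hive tables; the database, table, and column names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object MapSideJoinExample {
  def main(args: Array[String]): Unit = {
    // Hive support lets Spark SQL read the warehouse tables directly
    val spark = SparkSession.builder()
      .appName("MapSideJoinExample")
      .enableHiveSupport()
      .getOrCreate()

    // Placeholder Hive tables: a large fact table and a small dimension table
    val events = spark.table("analytics.web_events")
    val users  = spark.table("analytics.user_dim")

    // broadcast() hints Spark to ship the small table to every executor,
    // turning the join into a map-side join with no shuffle of the fact table
    val joined = events.join(broadcast(users), Seq("user_id"))

    joined.groupBy("country").count().show()
    spark.stop()
  }
}
```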

Environment: Cloudera CDH 5.7, Apache Hadoop 2.6.0 (YARN), Spark 2.1.0, spark.ml, Flume 1.7.0, Eclipse, MapReduce, Hive 1.2.2, Pig 0.17.0, Java, SQL, Sqoop 1.4.6, CentOS, ZooKeeper 3.5.0, NoSQL databases, Apache NiFi, AWS, S3, EMR, Redshift cluster.

Confidential, Nashville, TN

Sr. Big Data Developer/Engineer

Responsibilities:

  • Developed Pig scripts to help perform analytics on JSON and XML data, created Hive tables (external and internal) with static and dynamic partitions, and bucketed the tables to improve efficiency.
  • Used HiveQL to analyze the partitioned and bucketed data and compute various metrics for reporting, and performed data transformations by writing MapReduce and Pig jobs as per business requirements.
  • Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for analysis, and configured Spark Streaming against the Kafka streams to pull the information and store it in HDFS.
  • Designed the architecture of data pipelines/ingestion and optimized ETL workflows; developed syllabus/curriculum data pipelines from syllabus/curriculum web services to HBase and Hive tables.
  • Performed data analysis, feature selection, feature extraction using Apache Spark Machine Learning streaming libraries in Python.
  • Involved in architecting, developing and/or maintaining production-grade cloud solutions in virtualized environments such as Amazon Web Services and Azure.
  • Led, managed, and planned the development and implementation of the Geographic Information Systems (GIS) program.
  • Worked on setting up and configuring AWS EMR clusters, and used Amazon IAM to grant users fine-grained access to AWS resources.
  • Enabled and configured Hadoop services such as HDFS, YARN, Hive, Ranger, HBase, Kafka, Sqoop, Zeppelin Notebook, and Spark/Spark2, and analyzed log data with Apache Spark to predict errors.
  • Involved in deploying big data solutions on large cloud computing infrastructures such as AWS, GCE, and Azure.
  • Evaluated deep learning algorithms for text summarization using Python, Keras, TensorFlow, and Theano on a Cloudera Hadoop system.
  • Designed the database schema and created a data model to store real-time tick data in a NoSQL store.
  • Extracted real-time data using Kafka and Spark Streaming by creating DStreams, converting them into RDDs, processing them, and storing the results in Cassandra.
  • Used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in Amazon S3 bucket.
  • Used the DataStax Spark-Cassandra connector to load data into Cassandra and used CQL to analyze data from Cassandra tables for quick searching, sorting, and grouping (a connector sketch follows this list).
  • Used Sqoop to ingest from DBMSs and Python to ingest logs from client data centers; developed Python and Bash scripts for automation and implemented MapReduce jobs using the Java API and in Python using Spark.
  • Imported data from RDBMS systems like MySQL into HDFS using Sqoop and developed Sqoop jobs to perform incremental imports into Hive tables.
  • Demonstrated experience in managing the collection of geospatial data and understanding of data systems; managed policies concerning the compilation of information and coordination of data through the GIS program, coordinating and overseeing the implementation of those policies.
  • Involved in loading and transforming large sets of structured and semi-structured data, and created data pipelines per the business requirements, scheduled with Oozie coordinators.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done in Python (PySpark).
  • Worked and learned a great deal from Amazon Web Services (AWS) Cloud services like EC2, S3, EBS, RDS and VPC.
  • Integrated MapReduce with HBase to import bulk data into HBase using MapReduce programs.
  • Used Impala and wrote queries to fetch data from Hive tables, and developed several MapReduce jobs using the Java API.
  • Worked with Apache Solr to implement indexing and wrote custom Solr query segments to optimize the search.
  • Created Kafka/Spark Streaming data pipelines for consuming data from external sources and performing transformations in Scala, and contributed to a data pipeline that loads data from different sources such as web, RDBMS, and NoSQL into Apache Kafka or the Spark cluster.
  • Developed multiple POCs using Scala and PySpark, deployed them on the YARN cluster, and compared the performance of Spark and SQL.
  • Worked with XML, extracting tag information from compressed blob data types using XPath and the Scala XML libraries.
  • Involved in creating a data lake by extracting customers' big data from various data sources into Hadoop HDFS. This included data from Excel, flat files, Oracle, SQL Server, MongoDB, Cassandra, HBase, Teradata, and Netezza, as well as log data from servers.
  • Defined data governance rules and administered access rights depending on users' job profiles.
  • Developed Pig and Hive UDFs to implement business logic for processing the data as per requirements; developed Pig UDFs in Java and used UDFs from Piggybank for sorting and preparing the data.
  • Developed Spark scripts using the Scala IDE as per the business requirements.
  • Configured and optimized the Cassandra cluster and developed a real-time Java-based application to work with the Cassandra database.
  • Involved in file movements between HDFS and AWS S3 and extensively worked with S3 bucket in AWS.
  • Developed Spark jobs using Scala on top of YARN/MRv2 for interactive and batch analysis, queried data using Spark SQL on top of the Spark engine for faster processing of data sets, and worked on implementing Spark Framework, a Java-based web framework.
  • Created Hive tables, loaded data and wrote Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
  • Used Docker and Kubernetes to manage microservices for continuous integration and continuous delivery.
  • Developed Spark code by using Scala and Spark-SQL for faster processing and testing and performed complex HiveQL queries on Hive tables.
  • Created Nagios, Grafana, and Graphite dashboards for infrastructure monitoring.
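
A minimal Scala sketch of loading and reading Cassandra through the DataStax Spark-Cassandra connector's data source, as referenced above; the connection host, keyspace, table, and source path are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object CassandraLoadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CassandraLoadExample")
      .config("spark.cassandra.connection.host", "cassandra-host") // placeholder host
      .getOrCreate()

    // Placeholder source: processed tick data already sitting on HDFS
    val ticks = spark.read.parquet("hdfs:///data/ticks/")

    // Write through the Spark-Cassandra connector data source
    ticks.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "market", "table" -> "ticks"))
      .mode("append")
      .save()

    // Read back for quick searching, sorting, and grouping
    val recent = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "market", "table" -> "ticks"))
      .load()
      .filter("symbol = 'ABC'") // placeholder predicate
      .orderBy("ts")

    recent.show(20)
    spark.stop()
  }
}
```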

Environment: Hadoop, Hive, HDFS, Pig, Sqoop, Python, Spark SQL, Machine Learning, MongoDB, AWS, AWS S3, AWS EC2, AWS EMR, Oozie, ETL, Tableau, Spark, Spark Streaming, PySpark, Kafka, Netezza, Apache Solr, Cassandra, Cloudera Distribution, Java, Impala, Web Servers, Maven, MySQL, Grafana, Agile-Scrum.

Confidential, Arizona

Hadoop Developer

Responsibilities:

  • Experience creating data pipelines for different web and mobile application events, filtering and loading consumer-response data from an AWS S3 bucket into Hive external tables at an HDFS location.
  • Involved in working with different file formats such as JSON, Avro, and Parquet, and compression techniques such as Snappy.
  • Constructed Impala scripts for end-user/analyst requirements for ad hoc analysis.
  • Worked with various Hive optimization techniques such as partitioning, bucketing, and map joins.
  • Worked with shell scripts for adding dynamic partitions to the Hive staging table, verifying JSON schema changes in source files, and detecting duplicate files in the source location.
  • Developed UDFs in Spark to capture the values of key-value pairs in encoded JSON strings.
  • Developed a Spark application to filter JSON source data in an AWS S3 location and store it in HDFS with partitions, and used Spark to extract the schema of the JSON files (a sketch follows this list).
  • Used Jenkins for continuous integration and continuous testing.
  • Used SQL for querying data from the tables which are in HDFS.
  • Used Amazon S3 buckets for data staging.
  • Worked with Sqoop for ingesting data into HDFS from other databases.
  • Worked with Impala for massively parallel processing of queries, using HDFS as the underlying storage for Impala.
  • Worked with Elastic Map Reduce for data processing and used HDFS for data storage.
  • Extensive experience working with different Hadoop distributions like Cloudera and Apache distributions.
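
A minimal Spark/Scala sketch of the JSON filtering job described above; the S3 prefix, field names, filter values, and HDFS path are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

object JsonFilterJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("JsonFilterJob").getOrCreate()

    // Placeholder S3 location of raw consumer-response events
    val raw = spark.read.json("s3://consumer-events/raw/")

    // Spark infers the JSON schema; printing it is the schema-extraction step
    raw.printSchema()

    // Keep only the event types of interest and derive a partition column
    val filtered = raw
      .filter(col("event_type").isin("click", "purchase")) // placeholder field/values
      .withColumn("event_date", to_date(col("event_ts")))

    // Land the result on HDFS, partitioned so Hive external tables can pick it up
    filtered.write
      .partitionBy("event_date")
      .mode("overwrite")
      .parquet("hdfs:///warehouse/consumer_events/")

    spark.stop()
  }
}
```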

Environment: Hive, Spark, AWS S3, EMR, SQL, Cloudera, Jenkins, Shell Scripting, HBase, IntelliJ IDE, Sqoop, Impala.

Confidential, Cincinnati, OH

Hadoop Developer

Responsibilities:

  • Worked on loading disparate data sets coming from different sources into the BDPaaS (Hadoop) environment using Spark.
  • Developed UNIX scripts to create batch loads for bringing huge amounts of data from relational databases to the big data platform.
  • Delivery experience on major Hadoop ecosystem components such as Pig, Hive, Spark, Kafka, Elasticsearch, and HBase, with monitoring through Cloudera Manager.
  • Used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in an Amazon S3 bucket.
  • Implemented machine learning algorithms using Spark with Python, and worked on Spark, Storm, Apache Apex, and Python.
  • Involved in analyzing data coming from various sources and creating meta-files and control files to ingest the data into the data lake.
  • Involved in configuring batch jobs to ingest the source files into the data lake, and developed Pig queries to load data into HBase.
  • Leveraged Hive queries to create ORC tables and developed Hive scripts for analyst requirements and analysis (an ORC-table sketch follows this list).
  • Implemented Kafka consumers to move data from Kafka partitions into Cassandra for near real-time analysis and worked extensively on Hive to create, alter and drop tables and involved in writing hive queries.
  • Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig and parsed high-level design spec to simple ETL coding and mapping standards.
  • Created and altered HBase tables on top of data residing in the data lake, and created external Hive tables on the blobs to expose the data through the Hive metastore.
  • Involved in requirement and design phase to implement Streaming Architecture to use real time streaming using Spark and Kafka.
  • Used the Spark API for machine learning; translated a predictive model from SAS code to Spark and used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Created Reports with different Selection Criteria from Hive Tables on the data residing in Data Lake.
  • Worked on Hadoop Architecture and various components such as YARN, HDFS, Node Manager, Resource Manager, Job Tracker, Task Tracker, Name Node, Data Node and MapReduce concepts.
  • Deployed Hadoop components on the Cluster like Hive, HBase, Spark, Scala and others with respect to the requirement.
  • Uploaded and processed terabytes of data from various structured and unstructured sources into HDFS (AWS cloud) using Sqoop.
  • Implemented the business rules in Spark/Scala to put the business logic in place to run the rating engine.
  • Used the Spark UI to observe a submitted Spark job running at the node level, and used Spark to do property-bag parsing of the data to get the required fields.
  • Extensively used ETL methodology to support data extraction, transformation, and loading processes using Hadoop.
  • Used both the Hive context and the SQL context of Spark for initial testing of the Spark jobs, and used WinSCP and FTP to view the data storage structure on the server and to upload the JARs used for spark-submit.
  • Developed code from scratch in Spark using Scala according to the technical requirements.
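
A minimal sketch of the ORC table creation described above, expressed here through Spark's Hive support rather than the Hive CLI; the schema, table, and column names are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object OrcTableBuild {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("OrcTableBuild")
      .enableHiveSupport()
      .getOrCreate()

    // Create a partitioned ORC table in the Hive metastore (placeholder schema)
    spark.sql(
      """CREATE TABLE IF NOT EXISTS curated.claims_orc (
        |  claim_id  STRING,
        |  member_id STRING,
        |  amount    DOUBLE
        |)
        |PARTITIONED BY (claim_date STRING)
        |STORED AS ORC""".stripMargin)

    // Allow dynamic-partition inserts, then load from a hypothetical raw staging table
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql(
      """INSERT OVERWRITE TABLE curated.claims_orc PARTITION (claim_date)
        |SELECT claim_id, member_id, amount, claim_date
        |FROM staging.claims_raw""".stripMargin)

    spark.stop()
  }
}
```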

Environment: Hadoop, MapReduce, YARN, Hive, Pig, HBase, Sqoop, Spark, Scala, MapR, Core Java, R, SQL, Python, Eclipse, Linux, Unix, HDFS, Impala, Cloudera, Kafka, Apache Cassandra, Oozie, ZooKeeper, MySQL, PL/SQL

Confidential, Louisville, KY

Java Developer

Responsibilities:

  • Involved in complete Software Development Life Cycle (SDLC) of the application development like Designing, Developing, Testing and implementing scalable online systems in Java, J2EE, JSP, Servlets and Oracle Database.
  • Created UML Diagrams like Class Diagrams, Sequence Diagrams, Use Case Diagrams using Rational Rose.
  • Implemented MVC architecture using Java Spring Core.
  • Implemented Java/J2EE technologies on the server side such as Servlets, JSP, and JSTL.
  • Implemented Hibernate by creating hbm.xml files to configure Hibernate against the Oracle database.
  • Involved in writing SQL queries, stored procedures, and PL/SQL for the back-end server.
  • Used HTML, JavaScript for creating interactive User Interfaces.
  • Extensively used Custom JSP tags to separate presentation from application layer.
  • Developed JSP Pages and implemented AJAX in them for a responsive User Interface.
  • Involved in developing presentation layer using JSP and Model layer using EJB Session Beans.
  • Implemented unit test cases using JUnit and implemented Log4j for logging and debugging the application.
  • Implemented Maven Build Scripts for building the application.
  • Deployed the application on IBM WebSphere and tested for server-related issues.
  • Used Git as the repository and for version control, and IntelliJ as the IDE for development.

Environment: Java, J2EE, EJB, Servlets, JSP, JSTL, Spring Core, Spring MVC, Hibernate, HTML, CSS, JavaScript, AJAX, Oracle, Stored Procedures, PL/SQL, JUnit, Log4j, Maven, WebSphere, Git, IntelliJ
