
Senior Hadoop/Spark Developer Resume


San Jose, CA

PROFESSIONAL SUMMARY

  • 11+ years of professional IT experience in Big Data environments and the Hadoop ecosystem, with strong experience in Spark, SQL, Scala, and Python development.
  • Hands-on experience across the Hadoop ecosystem, including extensive experience with Big Data technologies such as HDFS, MapReduce, YARN, Spark, Sqoop, Hive, Pig, Impala, Oozie, Oozie Coordinator, ZooKeeper, and HBase.
  • Experience using tools such as Sqoop, Flume, Kafka, Apache Airflow, and Pig to ingest structured, semi-structured, and unstructured data into the cluster.
  • Proficient with the Apache Spark ecosystem, including Spark Core and Spark Streaming, using Scala and Python.
  • Designed both time-driven and data-driven automated workflows using Oozie and used ZooKeeper for cluster coordination.
  • Experience with Hadoop clusters on Google Cloud Platform (GCP), Cloudera CDH, and Hortonworks HDP.
  • Developed highly optimized Spark applications to perform various data cleansing, validation, transformation and summarization activities according to the requirement.
  • Built data pipelines consisting of Spark, Hive, Sqoop, and custom-built input adapters to ingest, transform, and analyze operational data.
  • Developed Spark jobs and Hive Jobs to summarize and transform data.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Python (a short PySpark sketch follows this list).
  • Experience working with structured data using HiveQL: join operations, Hive UDFs, partitioning, bucketing, and internal/external tables.
  • Expertise in writing MapReduce jobs in Java and Python for processing large structured, semi-structured, and unstructured data sets and storing them in HDFS.
  • Experience working with Python, UNIX, and shell scripting.
  • Experience in Extraction, Transformation and Loading ( ETL ) of data from multiple sources like Flat files and Databases.
  • Good knowledge of cloud integration with AWS using Elastic MapReduce ( EMR ), Simple Storage Service ( S3 ), EC2 , Redshift
  • Experience with the complete Software Development Life Cycle (SDLC) process, including requirement gathering, analysis, design, development, testing, implementation, and documentation.
  • Hands on Experience in Spark architecture and its integrations like Spark SQL , Data Frames and Datasets APIs.
  • Worked on Spark to enhance the execution of existing Hadoop processing using Spark Context, Spark SQL, DataFrames, and RDDs.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Spark SQL, and Python.
  • Hands-on experience using Hive tables from Spark, performing transformations, and creating DataFrames over Hive tables.
  • Used Spark-Structured-Streaming to perform necessary transformations.
  • Expertise in converting MapReduce programs into Spark transformations using Spark RDDs.
  • Strong understanding of AWS components such as EC2 and S3.
  • Performed data migrations to GCP.
  • Experience in Implementing Continuous Delivery pipeline with Maven , Ant , Jenkins and AWS .
  • Exposure to CI/CD tools - Jenkins for Continuous Integration, Ansible for continuous deployment.
  • Worked with waterfall and Agile methodologies.
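
The following is a minimal sketch of the Hive-to-Spark conversion pattern referenced above: a HiveQL aggregate rewritten as PySpark DataFrame transformations. The table and column names (orders, status, amount, order_date) are hypothetical placeholders, not taken from any actual project.

    # Minimal sketch: rewriting a HiveQL aggregate as PySpark DataFrame
    # transformations. Table/column names are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("hive-to-dataframe")
             .enableHiveSupport()
             .getOrCreate())

    # Original HiveQL:
    #   SELECT status, SUM(amount) AS total
    #   FROM orders WHERE order_date >= '2020-01-01'
    #   GROUP BY status
    totals = (spark.table("orders")
              .filter(F.col("order_date") >= "2020-01-01")
              .groupBy("status")
              .agg(F.sum("amount").alias("total")))

    totals.write.mode("overwrite").saveAsTable("orders_status_totals")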

TECHNICAL SKILLS

Big Data Technologies: HDFS, MapReduce, Pig, Hive, Sqoop, Oozie, Scala, Spark, Kafka, NiFi, Airflow, Flume, Snowflake, Ambari, Hue

Hadoop Frameworks: Cloudera CDH, Hortonworks HDP, MapR

Database: Confidential 10g/11g, PL/SQL, MySQL, MS SQL Server 2012, DB2

Language: C, C++, Java, Scala, Python

AWS and GCP Components: S3, EMR, EC2, Google Cloud Storage, BigQuery

Methodologies: Agile, Waterfall

Build Tools: Maven, Gradle, Jenkins.

NoSQL/Cloud Databases: HBase, BigQuery, Redshift, Confidential

IDE Tools: Eclipse, NetBeans, IntelliJ

Modelling Tools: Rational Rose, StarUML, Visual Paradigm for UML

BI Tools: Tableau

Operating System: Windows 7/8/10, Vista, UNIX, Linux, Ubuntu, Mac OS X

PROFESSIONAL EXPERIENCE

Confidential, San Jose, CA

Senior Hadoop/Spark Developer

Responsibilities:

  • Experienced in writing Spark Applications in Scala and Python .
  • Analyzed large and critical datasets using Cloudera, HDFS, HBase, MapReduce, Hive, Hive UDF, Pig, Sqoop, Zookeeper and Spark.
  • Analyzed the SQL scripts and designed the solution to implement using Pyspark .
  • Used Kafka consumer’s API in Scala for consuming data from Kafka topics
  • Designed and implemented MapReduce based large-scale parallel relation-learning system.
  • Developed custom aggregate functions using Spark SQL and performed interactive querying.
  • Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines.
  • Developed Python code to gather data from HBase (Cornerstone) and designed the solution for implementation using PySpark.
  • Implemented a Kafka-based model that pulls the latest records into Hive external tables.
  • Imported data into HDFS from various SQL databases and files using Sqoop, and from streaming systems using Storm, into the Big Data Lake.
  • Loaded datasets into Hive and Cassandra from source CSV files using Spark/PySpark.
  • Implemented a continuous delivery pipeline with Docker, GitHub, and AWS.
  • Wrote Pig scripts to store data into HBase.
  • Exported the analyzed data to Teradata using Sqoop for visualization and to generate reports for the BI team.
  • Spark Streaming collects this data from Kafka in near-real-time and performs necessary transformations and aggregation on the fly to build the common learner data model and persists the data in NoSQL store ( HBase ).
  • Migrated the computational code in HQL to PySpark .
  • Completed data extraction, aggregation, and analysis in HDFS using PySpark and stored the required data in Hive.
  • Exposure to Apache Kafka for developing a data pipeline of logs as a stream of messages using producers and consumers.
  • Experienced in fact dimensional modeling (Star schema, Snowflake schema), transactional modeling and SCD (Slowly changing dimension)
  • Wrote complex MapReduce programs.
  • Experience creating conditional tasks and trigger rules to implement joins at specific points in an Apache Airflow DAG (a minimal Airflow sketch follows this list).
  • Installed and configured Hive and wrote Hive UDFs.
  • Involved in HDFS maintenance and administration through the Hadoop Java API.
  • Loaded data into the cluster from dynamically generated files using Flume and from RDBMS using Sqoop.
  • Sound knowledge in programming Spark using Scala.
  • Experience using Apache Airflow to run tasks in parallel and to create and configure its backing database in Postgres or MySQL.
  • Involved in writing Java API’s for interacting with HBase
  • Worked on NiFi data Pipeline to process large set of data and configured Lookup’s for Data Validation and Integrity.
  • Involved in writing Flume and Hive scripts to extract, transform, and load data into Database
  • Used HBase as the data storage
  • Developed MapReduce programs to cleanse data in HDFS obtained from heterogeneous data sources. Processed metadata files into AWS S3 and an Elasticsearch cluster.
  • Installed and configured Hadoop MapReduce , HDFS , developed multiple MapReduce jobs in java for data cleaning and pre-processing.
  • Experienced in Importing and exporting data into HDFS and Hive using Sqoop .
  • Participated in development/implementation of Cloudera Hadoop environment.
  • Used Amazon web services ( AWS ) like EC2 and S3 for small data sets.
  • Building Apache airflow data pipelines in docker container environment in development phase.
  • Implemented AWS services to provide a variety of computing and networking services to meet the needs of applications.
  • Populated HDFS and HBase with huge amounts of data using Apache Kafka .
  • Experienced in running query using Impala and used BI tools to run ad-hoc queries directly on Hadoop.
  • Installed and configured Hive, wrote Hive UDFs, and used MapReduce and JUnit for unit testing.
  • Experienced in working with various data sources such as Teradata and Confidential; successfully loaded files from Teradata to HDFS and loaded data from HDFS into Hive and Impala.
  • Worked on Airflow to run multiple Hive, Spark, BigQuery, and Pig jobs that run independently based on time and data availability.
  • Developed MapReduce programs to parse raw data, populate staging tables, and store the refined data in partitioned tables in the EDW.
  • Deployed and ran Spark jobs on GCP Dataproc clusters.
  • Strong Experience in implementing Data warehouse solutions in GCP Bigquery
  • Worked on various projects to migrate data from on premise databases to AWS Redshift, RDS and S3.
  • Leveraged cloud and GPU computing technologies for automated machine learning and analytics pipelines, such as GCP
  • Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, manage and review data backups, manage and review Hadoop log files.
  • Involved in implementing security on the Hortonworks Hadoop cluster using Kerberos, working with the operations team to move from a non-secured cluster to a secured cluster.
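
Below is a minimal Apache Airflow sketch of the parallel-task / trigger-rule pattern referenced above. The DAG id, schedule, and the hive/spark-submit commands are hypothetical placeholders, and the imports assume Airflow 2.x.

    # Minimal Airflow 2.x sketch: two tasks run in parallel, then fan in to a
    # downstream task with an explicit trigger rule. All names are placeholders.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.utils.trigger_rule import TriggerRule

    with DAG(
        dag_id="daily_ingest",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        hive_load = BashOperator(
            task_id="hive_load",
            bash_command="hive -f /jobs/load_staging.hql",
        )
        spark_transform = BashOperator(
            task_id="spark_transform",
            bash_command="spark-submit /jobs/transform.py {{ ds }}",
        )
        # Join point: runs once both upstream tasks finish, even if one failed,
        # because of the ALL_DONE trigger rule.
        publish = BashOperator(
            task_id="publish",
            bash_command="spark-submit /jobs/publish.py {{ ds }}",
            trigger_rule=TriggerRule.ALL_DONE,
        )

        # hive_load and spark_transform run in parallel, then fan in to publish.
        [hive_load, spark_transform] >> publish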

Environment: Java, Hadoop, Hive, Pig, Sqoop, Scala, Kafka, Flume, HBase, Python, Apache Airflow, PySpark, GCP, Hortonworks, BigQuery, Confidential 10g/11g/12C, Teradata, HDFS, Data Lake, Spark, MapReduce, Ambari, Cloudera, Tableau, Snappy, ZooKeeper, NoSQL, Shell Scripting, Ubuntu, Solr.

Confidential, Long Beach, CA

Hadoop Developer

Responsibilities:

  • Developed real time data processing applications by using Scala and Python and implemented Apache Spark Streaming from various streaming sources like Kafka and JMS.
  • Knowledge of PySpark; used Hive to analyze sensor data and cluster users based on their behavior in the events.
  • Experienced in writing live real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline (a minimal streaming sketch follows this list).
  • Developed Shell, Perl and Python scripts to automate and provide Control flow to Pig scripts.
  • Worked on Amazon AWS concepts like EMR and EC2 web services for fast and efficient processing of Big Data.
  • Involved in loading data from Linux file systems, servers, java web services using Kafka producers and partitions.
  • Applied custom Kafka encoders with custom input formats to load data into Kafka partitions.
  • Implemented a POC with Hadoop and extracted data into HDFS with Spark.
  • Used Spark SQL with Scala for creating data frames and performed transformations on data frames.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive .
  • Developed code to read data stream from Kafka and send it to respective bolts through respective stream.
  • Developed Spark applications using Scala for easy Hadoop transitions.
  • Implemented applications with Scala along with Akka and Play framework .
  • Optimized the code using Pyspark for better performance
  • Worked on Spark streaming using Apache Kafka for real time data processing.
  • Developed MapReduce jobs using the MapReduce Java API and HiveQL.
  • Developed UDF, UDAF, and UDTF functions and used them in Hive queries.
  • Implemented Snowpipe, stages, and file uploads to a Snowflake database using the COPY command.
  • Developed scripts and batch jobs to schedule an Oozie bundle (a group of coordinators) consisting of various Hadoop programs.
  • Experienced in optimizing Hive queries, joins to handle different data sets.
  • Involved in ETL , Data Integration and Migration by writing Pig scripts.
  • Integrated Hadoop with Solr and implemented search algorithms.
  • Experience in Storm for handling real-time processing.
  • Hands on Experience working in Hortonworks distribution.
  • Worked hands-on with NoSQL databases like MongoDB for POC purposes, storing images and URIs.
  • Designed and implemented HBase and associated RESTful web service.
  • Involved in writing test cases and implement test classes using MRUnit and mocking frameworks.
  • Developed Sqoop scripts to extract data from MySQL and load it into HDFS.
  • Very capable at using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop/Spark jobs on AWS.
  • Experience in processing large volume of data and skills in parallel execution of process using Talend functionality.
  • Created multi-node Hadoop and Spark clusters in AWS instances to generate Terabytes of data and stored it in AWS HDFS.
  • Used Talend tool to create workflows for processing data from multiple source systems.
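
Below is a minimal PySpark sketch of the Kafka-to-Spark real-time pipeline referenced above, written with the Structured Streaming API (a DStream-based job would look different). The broker address, topic name, event schema, and output paths are hypothetical placeholders, and the job assumes the spark-sql-kafka connector is available on the classpath.

    # Minimal Structured Streaming sketch: consume JSON events from Kafka,
    # aggregate per event type in 5-minute windows, write Parquet output.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   DoubleType, TimestampType)

    spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

    # Hypothetical schema for the JSON payload on the topic.
    schema = StructType([
        StructField("event_type", StringType()),
        StructField("value", DoubleType()),
        StructField("event_time", TimestampType()),
    ])

    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
           .option("subscribe", "sensor-events")                # placeholder topic
           .load())

    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(F.from_json("json", schema).alias("e"))
              .select("e.*"))

    # Windowed aggregation with a watermark so late data is bounded.
    agg = (events
           .withWatermark("event_time", "10 minutes")
           .groupBy(F.window("event_time", "5 minutes"), "event_type")
           .agg(F.avg("value").alias("avg_value")))

    (agg.writeStream
        .outputMode("append")
        .format("parquet")
        .option("path", "/data/streams/sensor_agg")
        .option("checkpointLocation", "/data/checkpoints/sensor_agg")
        .start()
        .awaitTermination())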

Environment: MapReduce, HDFS, Sqoop, Java, PySpark, Linux, TWS, Hadoop, Pig, Hive, Solr, Spark Streaming, Kafka, Hortonworks, Hue, Spark, Scala, Python, Hadoop Cluster, Amazon Web Services, Talend.

Confidential, TX

Spark Developer

Responsibilities:

  • Worked directly with the Big Data Architecture Team which created the foundation of this Enterprise Analytics initiative in a Hadoop-based Data Lake .
  • Created multi-node Hadoop and Spark clusters to generate terabytes of data and stored it in HDFS.
  • Installed/Configured/Maintained Apache Hadoop clusters for application development and Hadoop tools like Hive, Pig, HBase, Zookeeper and Sqoop.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS.
  • Developed data pipeline using Flume , Sqoop , Pig and Java MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.
  • Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
  • Upgraded the Hadoop cluster from CDH4.7 to CDH5.2 and worked on installing cluster, commissioning & decommissioning of Data Nodes, NameNode recovery, capacity planning, and slots configuration.
  • Implementing large scale data intelligence solutions around Snowflake Data Warehouse.
  • Developed Spark scripts to import large files from Amazon S3 buckets and imported data from sources like HDFS and HBase into Spark RDDs (see the sketch after this list).
  • Involved in converting Hive / SQL queries into Spark transformations using Spark RDD , Scala and Python.
  • Involved in migration of ETL processes from Confidential to Hive to test the easy data manipulation and worked on importing and exporting data from Confidential and DB2 into HDFS and HIVE using Sqoop .
  • Worked on installing Cloudera Manager and CDH, installed the JCE policy file, created a Kerberos principal for the Cloudera Manager Server, and enabled Kerberos using the wizard.
  • Developed Spark jobs using Scala and Python on top of Yarn / MRv2 for interactive and Batch Analysis.
  • Experience with Snowflake Multi-Cluster Warehouses
  • Monitored the cluster for performance, networking, and data integrity issues, and was responsible for troubleshooting issues in the execution of MapReduce jobs by inspecting and reviewing log files.
  • Created 25+ Linux Bash scripts for users, groups, data distribution, capacity planning, and system monitoring.
  • Install OS and administrated Hadoop stack with CDH5 (with YARN) Cloudera distribution including configuration management, monitoring, debugging, and performance tuning.
  • Developed and analyzed SQL scripts and designed solutions for implementation using PySpark.
  • Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze data.
  • Supported MapReduce Programs and distributed applications running on the Hadoop cluster and scripting Hadoop package installation and configuration to support fully-automated deployments.
  • Migrated existing on-premises application to AWS and used AWS services like EC2 and S3 for large data sets processing and storage and worked with ELASTIC MAPREDUCE and setup Hadoop environment in AWS EC2 Instances.
  • Worked with systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters.
  • Perform maintenance, monitoring, deployments, and upgrades across infrastructure that supports all our Hadoop clusters and worked on Hive for further analysis and for generating transforming files from different analytical formats to text files.
  • Created Hive External tables and loaded the data in to tables and query data using HQL and worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
  • Monitoring Hadoop cluster using tools like Nagios , Ganglia , and Cloudera Manager and maintaining the Cluster by adding and removing of nodes using tools like Ganglia, Nagios, and Cloudera Manager.
  • Worked on Hive for exposing data for further analysis and for generating transforming files from different analytical formats to text files.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala , initially done using python ( PySpark )
  • Experienced in building streaming pipelines on Kafka nodes integrated with Spark and Postgres.
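
Below is a minimal sketch of the S3-to-HDFS Spark import referenced above: a batch job that reads large CSV drops from an S3 bucket and writes partitioned Parquet into HDFS. The bucket, paths, and columns are hypothetical placeholders, and the read assumes the hadoop-aws (s3a) connector is configured.

    # Minimal batch sketch: S3 CSV landing zone -> cleaned, partitioned Parquet
    # in HDFS. All paths and columns are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("s3-to-hdfs-parquet").getOrCreate()

    # Read large CSV files from a (placeholder) S3 landing bucket.
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("s3a://example-bucket/landing/transactions/"))

    # Light cleansing before persisting: de-duplicate and stamp the load date.
    cleaned = (df.dropDuplicates()
               .withColumn("load_date", F.current_date()))

    # Write partitioned Parquet into HDFS for downstream Hive/Spark queries.
    (cleaned.write
     .mode("append")
     .partitionBy("load_date")
     .parquet("hdfs:///data/warehouse/transactions/"))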

Environment: Hadoop, MapReduce, Python, Hive, Pig, Sqoop, Spark, Spark Streaming, Spark SQL, AWS EMR, AWS S3, AWS Redshift, Scala, PySpark, MapR, Java, Oozie, Flume, HBase, Hue, Hortonworks, Cloudera Manager, ZooKeeper, Cloudera, Confidential, Kerberos, and RedHat 6.5

Confidential, Denver, CO

Hadoop/Spark Developer

Responsibilities:

  • Involved in Requirements Analysis and design an Object-oriented domain model
  • Implemented test scripts to support test driven development and continuous integration
  • Experience in Importing and exporting data into big data, HDFS and Hive using Sqoop
  • Developed MapReduce programs to clean and aggregate the data
  • Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs (a short pair-RDD sketch follows this list).
  • Developed new listeners for producers and consumers for both RabbitMQ and Kafka.
  • Used microservices with Spring Boot, interacting through a combination of REST calls and Apache Kafka message brokers.
  • Involved in writing queries in SparkSQL using Scala. Worked with Splunk to analyze and visualize data.
  • Worked on integrating Apache Kafka with Spark Streaming process to consume data from external REST APIs and run custom functions.
  • Worked in complete SDLC phase like Requirements, Specification, Design, Implementation and Testing
  • Developed the mechanism for logging and debugging with Log4j
  • Involved in developing database transactions through JDBC.
  • Used GIT for version control
  • Created RESTful services using the Dropwizard framework for various web services involving both JSON and XML.
  • Responsible for troubleshooting issues in the execution of MapReduce jobs by inspecting and reviewing log files.
  • Used Confidential as the database, handled query execution, and was involved in writing SQL scripts and SQL code for procedures and functions.
  • Developed front-end applications that interact with mainframe applications using J2C connectors.
  • Hands-on experience in exporting results into relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Designed, developed, and implemented JSPs in the presentation layer for submission and application screens.
  • Deployed Web, presentation and business components on Apache Tomcat Application Server.
  • Involved in post-production support and testing, and used JUnit for unit testing of the module.
  • Worked in Agile methodology
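
Below is a minimal pair-RDD sketch of the optimization pattern referenced above: reduceByKey combines values on each partition before the shuffle, which is usually much cheaper than groupByKey followed by a sum. The input path and record layout are hypothetical placeholders.

    # Minimal pair-RDD aggregation sketch. Paths and record layout are
    # hypothetical placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pair-rdd-agg").getOrCreate()
    sc = spark.sparkContext

    # Records shaped like "user_id,clicks" (placeholder layout).
    pairs = (sc.textFile("hdfs:///data/raw/clicks.csv")
             .map(lambda line: line.split(","))
             .map(lambda parts: (parts[0], int(parts[1]))))

    # reduceByKey aggregates locally on each partition first, so far less data
    # crosses the network than the equivalent groupByKey(...).mapValues(sum).
    clicks_per_user = pairs.reduceByKey(lambda a, b: a + b)

    (spark.createDataFrame(clicks_per_user, ["user_id", "total_clicks"])
     .write.mode("overwrite")
     .parquet("hdfs:///data/agg/clicks_per_user/"))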

Environment: HDFS, Hive, Sqoop, Pig, Core Java, Maven, HTML, JavaScript, Git, MapR, JUnit, Agile, Log4j, SQL, Python

Confidential

Hadoop Developer

Responsibilities:

  • Developed multiple MapReduce jobs in Java for data cleaning and preprocessing and assisted with data capacity planning and node forecasting.
  • Involved in design and ongoing operation of several Hadoop clusters and Configured and deployed Hive Meta store using MySQL and thrift server
  • Implemented and operated on-premises Hadoop clusters from the hardware to the application layer including compute and storage.
  • Uploaded and processed more than 30 terabytes of data from various structured and unstructured sources into HDFS (Cloudera) using Sqoop and Flume.
  • Designed custom deployment and configuration automation systems to allow for hands-off management of clusters via Cobbler, FUNC, and Puppet.
  • Prepared complete documentation of the Phase-II Talend job design and goals, per the knowledge transferred, and prepared documentation of the support and maintenance work to be followed in Talend.
  • Deployed the company's first Hadoop cluster running Cloudera's CDH2 to a 44-node cluster storing 160TB and connecting via 1 GB Ethernet.
  • Debug and solve the major issues with Cloudera manager by interacting with the Cloudera team .
  • Modified reports and Talend ETL jobs based on feedback from QA testers and users in development and staging environments.
  • Handled importing other enterprise data from different data sources into HDFS using Sqoop and performing transformations using Hive , MapReduce and then loading data into HBase tables.
  • Involved in Cluster Maintenance and removal of nodes using Cloudera Manager .
  • Collaborated with application development teams to provide operational support, platform expansion, and upgrades for Hadoop Infrastructure including upgrades to CDH3 .
  • Participated in Hadoop development Scrum and installed and configured Cognos 8.4/10 and Talend ETL on single- and multi-server environments.

Environment: Cloudera Hadoop, Cloudera, Pig, Hive, Talend, MapReduce, Sqoop, UNIX, Cassandra, Java, Linux, Confidential 11gR2, UNIX Shell Scripting, Kerberos

Confidential

Java Developer

Responsibilities:

  • Implemented Multi-Threaded Environment and used most of the interfaces under the collection framework by using Core Java Concepts .
  • Developed Graphical User Interfaces by using JSF, JSP, HTML, DHTML, Angularjs, CSS, and JavaScript and developed scripts in python for Financial Data coming from SQL Developer based on the requirements specified.
  • Implemented several Java/J2EE design patterns like Spring MVC, Singleton, Spring Dependency Injection and Data Transfer Object.
  • Used JAX-WS (SOAP) for producing web services and involved in writing programs to consume the web services using SOA with CXF framework and developed few web pages using JSP, JSTL, HTML, CSS, Java script, Ajax and JSON.
  • Implemented business logic, data exchange, XML processing and created graphics using Python and Django.
  • Wrote code to fetch data from Web services using JQUERY AJAX via JSON response and updating the HTML pages and developed high traffic web applications using HTML, CSS, and JavaScript, jQuery, Bootstrap, Ext JS, AngularJS, Node.js and react.js.
  • Write SQL queries and create PL/SQL functions/procedures/packages that are optimized for APEX and improve performance and response times of APEX pages and reports
  • Used JQuery library, NodeJS and AngularJS for creation of powerful dynamic WebPages and web applications by using its advanced and cross browser functionality.
  • Used Java Server Pages for content layout and presentation with Python and Extracted and loaded data using Python scripts and PL/SQL packages
  • Worked with various frameworks of JavaScript like BackboneJS, AngularJS, and EmberJS etc.
  • Wrote object-oriented Python using Flask, SQL, Beautiful Soup, httplib2, Jinja2, HTML/CSS, Bootstrap, jQuery, Linux, Sublime Text, and Git.
  • Developed GUI using JSP, Struts, HTML3, CSS3, XHTML, JQuery, Swing and JavaScript to simplify the complexities of the application.
  • Wrote and executed various MySQL database queries from Python using the Python-MySQL connector and the MySQLdb package, generated Django forms to record data of online users, and used PyTest for writing test cases (a small form/test sketch follows).
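
Below is a minimal sketch of the Django form / PyTest pattern mentioned above. The form fields are hypothetical placeholders, and running the test assumes a configured Django settings module (for example via pytest-django).

    # Minimal Django form sketch with a pytest-style validation test.
    # Field names are hypothetical placeholders; assumes Django settings
    # are configured (e.g., via pytest-django).
    from django import forms


    class OnlineUserForm(forms.Form):
        username = forms.CharField(max_length=50)
        email = forms.EmailField()
        signup_date = forms.DateField(required=False)


    def test_rejects_missing_email():
        # Validation should fail when the required email field is absent.
        form = OnlineUserForm(data={"username": "jdoe"})
        assert not form.is_valid()
        assert "email" in form.errors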

Environment: Python, Django, Java, JSF MVC, Spring IOC, APEX, Ruby on Rails, Spring JDBC, Hibernate, ActiveMQ, Log4j, Ant, MySQL, JDK 1.6, J2EE, JSP, Servlets, HTML, LDAP, Salesforce, ESB Mule, JDBC, MongoDB, DAO, EJB 3.0, PL/SQL, React.js, WebSphere, Eclipse, AngularJS, and CVS.

Confidential

Java Developer

Responsibilities:

  • Involved in the analysis, design, and development and testing phases of Software Development Life Cycle (SDLC)
  • Designed and developed framework components and was involved in designing the MVC pattern using the Struts and Spring frameworks.
  • Responsible for developing Use case, Class diagrams and Sequence diagrams for the modules using UML and Rational Rose.
  • Developed the Action Classes, Action Form Classes, created JSPs using Struts tag libraries and configured in Struts-config.xml, Web.xml files.
  • Involved in Deploying and Configuring applications in Web Logic Server.
  • Used SOAP for exchanging XML based messages.
  • Used Microsoft VISIO for developing Use Case Diagrams, Sequence Diagrams and Class Diagrams in the design phase.
  • Developed Custom Tags to simplify the JSP code. Designed UI screens using JSP and HTML.
  • Actively involved in designing and implementing Factory method, Singleton, MVC and Data Access Object design patterns.
  • Web services used for sending and getting data from different applications using SOAP messages. Then used DOM XML parser for data retrieval.
  • Wrote JUnit test cases for the Controller, Service, and DAO layers using Mockito and DBUnit.
  • Developed unit test cases using proprietary framework which is similar to JUNIT.
  • Used JUnit framework for unit testing of application and ANT to build and deploy the application on WebLogic Server.
