We provide IT Staff Augmentation Services!

Spark Developer Resume

Santa Clara, CaliforniA

SUMMARY

  • Over 5 years of IT experience as a Developer, Designer & QA Test Engineer with cross - platform integration experience using Hadoop Ecosystem, Java and Software Functional Testing.
  • Hands on experience in installing, configuring and using Hadoop Ecosystem - HDFS, MapReduce, Pig, Hive, Oozie, Flume, HBase, Spark, Sqoop, Flume and Oozie.
  • Strong understanding of various Hadoop services, MapReduce and YARN architecture.
  • Responsible for writing Map Reduce programs.
  • Experienced in importing-exporting data into HDFS using SQOOP
  • Experience loading data to Hive partitions and creating buckets in Hive.
  • Developed Map Reduce jobs to automate transfer the data from HBase.
  • Expertise in analysis using PIG, HIVEand MapReduce.
  • Experienced in developing UDFs for Hive, PIG using Java.
  • Strong understanding of NoSQL databases like HBase, MongoDB & Cassandra.
  • Scheduling all Hadoop/hive/Sqoop/HBase jobs using Oozie
  • Experience in setting cluster in Amazon EC2 & S3 including the automation of setting & extending the clusters in AWS Amazon cloud.
  • Major strengths are familiarity with multiple software systems, ability to learn quickly new technologies, adapt to new environments, self-motivated, team player, focused adaptive and quick learner with excellent interpersonal, technical and communication skills.
  • Experience in defining detailed application software test plans, including organization, participant, schedule, test and application coverage scope.
  • Experience in gathering and defining functional and user interface requirements for software applications.
  • Experience in real time analytics with Apache Spark (RDD, Data Frames and Streaming API).
  • Used Spark Data Frames API over Cloudera platform to perform analytics on Hive data.
  • Experience in integrating Hadoop with Kafka. Expertise in uploading Click stream data from Kafka to HDFS.
  • Expert in utilizing Kafka for messaging and publishing subscribe messaging system

TECHNICAL SKILLS

Hadoop/Big Data: Hadoop, Map Reduce, HDFS, Zookeeper, Kafka, Hive, Pig, Sqoop, AirflowKafka, Yarn, HBaseNo SQL Databases HBase,Cassandra, mongo DB

Languages: Scala, Java, Python

Java/J2EE Technologies: Applets, Swing, JDBC, JNDI, JSON, JSTL

Frameworks: MVC, Struts, Spring, Hibernate

Operating Systems: Red Hat Linux, Ubuntu Linux and Windows XP/Vista/7/8

Web Technologies: HTML, DHTML, XML

Web/Application servers: Apache Tomcat, WebLogic, JBoss

Databases: SQL Server, MySQL

Tools: and IDE: Anaconda, PyCharm, JuPyter Eclipse, IntelliJ

PROFESSIONAL EXPERIENCE

SPARK DEVELOPER

Confidential, Santa Clara, California

Responsibilities:

  • Responsible for building scalable distributed data solutions using Hadoop
  • Worked on data migration with MapReduce programs intoSpark (Python +PySpark+ SparkSQL)transformations usingSparkand Python.
  • Exploring DAG's, their dependencies and logs usingAirFlowpipelines for automation
  • Tracking operations using sensors until certain criteria is met using AirFlow technology.
  • UsingSpark-Streaming APIs to perform transformations and actions on the fly for building the common learner data model which gets the data from Kafka in near real time and Persists into Cassandra.
  • DevelopedSparkscripts by using Python shell commands as per the requirement.
  • UsedSparkAPI over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Developed Python scripts, UDFFs using both Data frames/SQL and RDD/MapReduce inSpark1.6 for Data Aggregation, queries and writing data back into OLTP system through Sqoop; And Developed enterprise application using Python
  • Expertise in performance tuning ofSparkApplications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
  • Loaded the data intoSparkRDD and do in memory data Computation to generate the Output response.
  • Experience and hands-on knowledge in Akka and LIFT Framework.
  • Used PostgreSQL and No-SQL database and integrated with Hadoop to develop datasets on HDFS
  • Involved in creating partitioned Hive tables, and loading and analyzing data using hive queries, Implemented Partitioning and bucketing in Hive.
  • Worked on a POC to compare processing time of Impala with Apache Hive for batch applications to implement the former in project.
  • Developed Hive queries to process the data and generate the data cubes for visualizing
  • Implemented schema extraction for Parquet and Avro file Formats in Hive.
  • Good experience with Talend open studio for designing ETL Jobs for Processing of data. Experience designing, reviewing, implementing and optimizing data transformation processes in the Hadoop and Talend and Informatica ecosystems.
  • Implemented Partitioning, Dynamic Partitions, Buckets in HIVE.
  • Coordinated with admins and Technical staff for migrating Teradata to Hadoop and Ab Initio to Hadoop
  • Configured Hadoop clusters and coordinated with Big Data Admins for cluster maintenance.

Environment: Hadoop YARN,Spark-Core,Spark-Streaming,Spark-SQL, Python, Kafka, Hive, Sqoop, Amazon AWS, Elastic Search, Impala, Cassandra, Tableau, Informatica, Cloudera, Oracle 10g, Linux.

HADOOP DEVELOPER

Confidential, Detroit, Michigan

Responsibilities:

  • Worked on analyzing Hadoop cluster and different big data analytic tools including Pig, Hive and Sqoop.
  • Created POC on Hortonworks and suggested the best practice in terms HDP, HDF platform
  • Experience in understanding the security requirements for Hadoop and integrating with Kerberos authentication infrastructure- KDC server setup, managing. Management and support of Hadoop Services including HDFS, Hive, Impala, and SPARK.
  • Installing, Upgrading and Managing Hadoop Cluster on Cloudera.
  • Troubleshooting many cloud related issues such as Data Node down, Network failure, login issues and data block missing.
  • Worked as Hadoop Admin and responsible for taking care of everything related to the clusters total of 100 nodes ranges from POC (Proof-of-Concept) to PROD clusters on Cloudera(CDH 5.5.2) distribution.
  • Responsible for Cluster maintenance, Monitoring, commissioning and decommissioning Data nodes, Troubleshooting, Manage and review data backups, Manage & review log files.
  • Day to day responsibilities includes solving developer issues, deployments moving code from one environment to other environment, providing access to new users and providing instant solutions to reduce the impact and documenting the same and preventing future issues.
  • Collaborating with application teams to install operating system and Hadoop updates, patches, version upgrades.
  • Strong experience and knowledge of real time data analytics using Spark Streaming, Kafka and Flume.
  • Migrated Flume with Spark for real time data and developed the Spark Streaming Application with java to consume the data from Kafka and push them into Hive.
  • Configured Kafka for efficiently collecting, aggregating and moving large amounts of click stream data from many different sources to HDFS. Monitored workload, job performance and capacity planning using Cloudera Manager.
  • Involved in Analyzing system failures, identifying root causes, and recommended course of actions.
  • Interacting with Cloudera support and log the issues in Cloudera portal and fixing them as per the recommendations.
  • Imported logs from web servers with Flume to ingest the data into HDFS.
  • Using Flume and Spool directory loading the data from local system to HDFS.
  • Retrieved data from HDFS into relational databases with Sqoop.
  • Parsed cleansed and mined useful and meaningful data in HDFS using Map-Reduce for further analysis Fine tuning hive jobs for optimized performance.
  • Scripting Hadoop package installation and configuration to support fully-automated deployments.
  • Involved in chef-infra maintenance including backup/security fix on Chef Server.
  • Deployed application updates using Jenkins. Installed, configured, and managed Jenkins
  • Triggering the SIT environment build of client remotely through Jenkins.
  • Deployed and configured Git repositories with branching, forks, tagging, and notifications.
  • Experienced and proficient deploying and administering GitHub
  • Deploy builds to production and work with the teams to identify and troubleshoot any issues.
  • Worked on MongoDB database concepts such as locking, transactions, indexes, Shading, replication, schema design.
  • Consulted with the operations team on deploying, migrating data, monitoring, analyzing, and tuning MongoDB applications.
  • Viewing the selected issues of web interface using SonarQube.
  • Developed a fully functional login page for the company's user facing website with complete UI and validations.
  • Installed, Configured and utilized AppDynamics (Tremendous Performance Management Tool) in the whole JBoss Environment (Prod and Non-Prod).
  • Reviewed OpenShift PaaS product architecture and suggested improvement features after conducting research on Competitors products.
  • Migrated data source passwords to encrypted passwords using Vault tool in all the JBoss application servers
  • Participated in Migration undergoing from JBoss 4 to Web logic or JBoss 4 to JBoss 6 and its respective POC.
  • Responsible for upgradation of SonarQube using upgrade center.
  • Resolving tickets submitted by users, P1 issues, troubleshoot the error documenting, resolving the errors.
  • Installed and configured Hive in Hadoop cluster and help business users/application teams fine tune their HIVE QL for optimizing performance and efficient use of resources in cluster.
  • Conduct performance tuning of the Hadoop Cluster and map reduce jobs. Also, the real-time applications with best practices to fix the design flaws.
  • Implemented Oozie work-flow for ETL Process for critical data feeds across the platform.
  • Configured Ethernet bonding for all Nodes to double the network bandwidth
  • Implementing Kerberos Security Authentication protocol for existing cluster.
  • Built high availability for major production cluster and designed automatic failover control using Zookeeper Failover Controller (ZKFC) and Quorum Journal nodes.

Environment: HDFS, Map Reduce, Hive 1.1.0, Kafka, Hue 3.9.0, Pig, Flume, Oozie, Sqoop, Apache Hadoop 2.6, Spark, SOLR, Storm, Cloudera Manager, Red Hat, MySQL, Prometheus, Docker, Puppet.