Spark Developer Resume
Hartford, CT
SUMMARY:
- Extensive IT experience, with an emphasis on designing and implementing analytic solutions for Hadoop- and Scala-based enterprise applications.
- 6 years of implementation and hands-on experience writing Hadoop jobs to analyze data using a wide array of Big Data tools such as Hive, Pig, Flume, Oozie, Sqoop, Kafka, ZooKeeper and HBase.
- An accomplished Hadoop/Spark developer experienced in ingestion, storage, querying, processing and analysis of Big data.
- Extensive experience in developing applications that perform data-processing tasks against Oracle, SQL Server and MySQL databases.
- Hands-on expertise in row-key and schema design for NoSQL databases such as MongoDB, HBase, Cassandra and DynamoDB (AWS).
- Extensively worked with Spark using Scala on clusters for analytics; installed Spark on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle.
- Excellent programming skills in Scala at a high level of abstraction.
- Extensive experience working with various Hadoop distributions, including enterprise versions of Cloudera, and good knowledge of Amazon EMR (Elastic MapReduce).
- Extensive experience importing and exporting streaming data into HDFS using stream-processing platforms such as Flume and the Kafka messaging system.
- Strong experience and knowledge of real-time data analytics using Spark Streaming and Flume.
- Well-versed in Spark components such as Spark SQL, MLlib, Spark Streaming and GraphX.
- Extensively worked with Spark Streaming and Apache Kafka to ingest live stream data.
- Used Scala and Python to convert Hive/SQL queries into RDD transformations in Apache Spark.
- Expertise in writing Spark RDD transformations, actions, DataFrames and case classes for the required input data, and in performing data transformations using Spark Core (see the sketch following this summary).
- Experience in integrating Hive queries into Spark environment using Spark SQL.
- Experience developing data pipelines using Pig, Sqoop and Flume to extract data from weblogs and store it in HDFS; accomplished in developing Pig Latin scripts and using Hive Query Language for data analytics.
- Strong familiarity with creating Hive tables, writing Hive joins and HQL queries, and building complex Hive UDFs.
- Worked on GUI-based Hive interaction tools such as Hue for querying data.
- Experienced in migrating data from different sources using the pub-sub model with Redis and Kafka producers/consumers, and in preprocessing data using Storm topologies.
- Worked on data warehousing and ETL tools such as Informatica and Talend.
- Designed ETL workflows in Tableau and deployed data from various sources to HDFS.
- Generated various knowledge reports using Power BI and Neo4j based on business specifications.
- Experience automating database activities with Unix shell scripts.
- Good analytical, communication and problem-solving skills, and a strong appetite for learning new technical and functional skills.
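As a brief illustration of the Hive/SQL-to-Spark conversions mentioned above, the sketch below re-expresses a Hive aggregation as typed Dataset transformations over a case class. It is a minimal sketch: the orders table, its column names and the Order case class are hypothetical, and a Spark 2.x SparkSession with Hive support is assumed.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class describing one input record.
case class Order(orderId: String, customerId: String, amount: Double)

object HiveToSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-to-spark-sketch")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Roughly: SELECT customerId, SUM(amount) FROM orders WHERE amount > 0 GROUP BY customerId
    val orders = spark.table("orders").as[Order]            // typed Dataset via the case class
    val totals = orders
      .filter(_.amount > 0)                                  // transformation (lazy)
      .groupByKey(_.customerId)
      .mapGroups((cust, rows) => (cust, rows.map(_.amount).sum))
      .toDF("customer_id", "total_amount")

    totals.show(20)                                          // action triggers execution
    spark.stop()
  }
}
```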
TECHNICAL SKILLS:
Big Data Ecosystem: Hadoop (HDFS & MapReduce), Pig, Hive, HBase, ZooKeeper, Sqoop, Flume, Kafka, Apache Spark, Impala, Oozie.
Databases: Oracle, SQL Server, MySQL.
NoSQL Databases: HBase, Cassandra, MongoDB.
Hadoop Distributions: Cloudera, Hortonworks, Apache.
Cloud: AWS, Azure.
Languages: Scala, Python, C.
Web Technologies: JavaScript, jQuery, Bootstrap, AJAX, XML, CSS, HTML, AngularJS.
Web Services: REST, SOAP, JAX-WS, JAX-RPC, JAX-RS, WSDL, Axis2, Apache HTTP.
IDE: Eclipse, NetBeans, IntelliJ.
Operating Systems: MacOS, Linux, UNIX and Windows.
Source Code Control: GitHub, CVS, SVN
ETL Tools: Talend, Informatica.
Development Methodologies: Agile.
PROFESSIONAL EXPERIENCE:
Confidential, Hartford, CT
Spark Developer
Responsibilities:
- Developed end-to-end report-generation code for a quick-reporting platform by writing and executing Spark/Scala jobs.
- As part of the development team, was responsible for gathering, analyzing and documenting business requirements.
- Worked extensively with Oracle DB and developed Sqoop jobs for data ingestion into the NoSQL database MongoDB.
- Wrote Spark programs in Scala and Python for data quality checks.
- Analyzed XML files and developed file mappings for extracting tag information via XPath.
- Extensively worked with Scala/Spark SQL for data cleansing, generating DataFrames and transforming them into row-level DataFrames to populate the aggregate tables in MongoDB (see the sketch after this list).
- Adept at developing generic Spark/Scala methods for transformations and aggregations, and at designing row schemas.
- Designed and developed keyspaces and tables in MongoDB for the E-commerce Insight Analytics reporting platform.
- Successfully delivered reports on different e-commerce tracks by developing Spark/Scala jobs.
- Worked closely with architects and front-end developers to design data models and code optimizations for building ingestion and aggregation tables.
- Coordinated with offshore team members to write and generate test scripts and test cases for numerous user stories.
- Improved the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL and Spark on YARN with Scala.
- Ingested real-time data using Flume and Kafka.
- Worked on importing and exporting data from Oracle and DB2 into HDFS and Hive using Sqoop.
- Responsible for troubleshooting issues in the execution of MapReduce jobs by inspecting and reviewing log files.
- Developed simple to complex MapReduce programs to analyze the datasets as per the requirement.
- Configured periodic incremental imports of data from MySQL into HDFS using Sqoop.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Used Hive for transformations, joins, filters and some pre-aggregations after storing the data in HDFS.
- Used Spark Streaming APIs to perform the necessary transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time.
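The sketch below illustrates the kind of Spark SQL cleansing and aggregation described in the bullets above. It is a minimal sketch under stated assumptions: the input path and column names are hypothetical, a Spark 2.x SparkSession is assumed, and the final load into MongoDB (which would go through the MongoDB Spark connector) is stood in for by a Parquet staging write, since the connector's format name and URI options vary by version.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object EcommerceAggregates {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ecommerce-aggregates-sketch").getOrCreate()
    import spark.implicits._

    // Cleanse: drop rows missing key fields and normalise the product key.
    val raw = spark.read.json("hdfs:///data/pageviews/")          // hypothetical input path
    val clean = raw
      .na.drop(Seq("sessionId", "product", "price"))
      .withColumn("product", lower(trim($"product")))

    // Aggregate per product for the reporting layer.
    val aggregates = clean
      .groupBy($"product")
      .agg(count(lit(1)).as("views"), sum($"price").as("revenue"))

    // The load into MongoDB would happen here via the MongoDB Spark connector;
    // a Parquet staging write stands in for it in this sketch.
    aggregates.write.mode("overwrite").parquet("hdfs:///reports/product_aggregates")
    spark.stop()
  }
}
```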
Environment: Cloudera, DataStax DSE distribution, Spark, Scala, MongoDB, Maven and sbt builds, Solr, Hadoop, AWS, Java, JavaScript, MapReduce, HDFS, Hive, Pig, Impala, Kafka, Cassandra, Cloudera Manager, ETL, Sqoop, Flume, Oozie, ZooKeeper, MySQL, Eclipse, Tableau and Unix.
Confidential, Bloomfield, CT
Hadoop Administrator
Responsibilities:
- Worked with development teams to deploy Oozie workflow jobs that run multiple Hive and Pig jobs independently based on time and data availability.
- Responsible for cluster maintenance: adding and removing cluster nodes, cluster monitoring and troubleshooting, managing and reviewing data backups, and managing and reviewing Hadoop log files.
- Designed multi-node clusters for the production environment based on projected data growth.
- Upgraded Ambari from 2.2.1 to 2.4.0.
- Upgraded HDP from 2.3 to 2.4 on the production cluster and upgraded HDP to v2.5.3.0 on the POC lab.
- Set up cross-realm trust for DistCp inter-cluster data transfers.
- As part of the DTV migration, used Falcon to migrate data from CDH to HDP.
- Performed cluster-sizing exercises with stakeholders to understand data ingestion patterns and provided recommendations.
- Designed and implemented non-production multi-node environments.
- Commissioned and decommissioned data nodes as required.
- Installed and configured HDP cluster and other Hadoop ecosystem components.
- Wrote shell scripts to monitor the health of Hadoop daemon services and respond to warning or failure conditions.
- Deployed a Kafka cluster with a separate ZooKeeper ensemble to enable real-time processing of data using Spark Streaming and storage in HBase (see the sketch after this list).
- Implemented the Capacity Scheduler to securely share the available resources among multiple groups.
- Collected log data from web servers and integrated it into HDFS using Flume.
- Provided support for all the maintenance activities like OS patching, Hadoop upgrades, configuration changes.
- Developed automated Unix shell scripts for running the Balancer, file-system health checks and user/group creation on HDFS.
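A minimal sketch of the Kafka-to-HBase streaming path mentioned above, assuming the spark-streaming-kafka-0-10 integration and the HBase 1.x client API; the broker address, topic name, table name and column family are hypothetical.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object KafkaToHBase {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-to-hbase-sketch"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",               // hypothetical broker list
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "events-consumer")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

    // Write each micro-batch to HBase, opening one connection per partition.
    stream.map(record => (record.key, record.value)).foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("events"))   // hypothetical table
        records.foreach { case (k, v) =>
          val rowKey = if (k == null) v.hashCode.toString else k
          val put = new Put(Bytes.toBytes(rowKey))
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes(v))
          table.put(put)
        }
        table.close(); conn.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```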
Environment: Hortonworks (HDP 2.4), Ambari, MapReduce 2.0 (YARN), HDFS, Hive, HBase, Pig, Oozie, Sqoop, Spark, Flume, Kerberos, ZooKeeper, SmartSense, Airflow, Falcon, DB2, SQL Server 2014, RHEL 6.x, Python.
Confidential, Hartford, CT
Hadoop Developer
Responsibilities:
- Worked on analyzing the Hadoop stack and different big data analytic tools, including Pig, Hive, the HBase database and Sqoop.
- In depth understanding of Classic MapReduce and YARN architectures.
- Developed MapReduce programs for refined queries on big data.
- Created Azure HDInsight clusters and deployed the Hadoop cluster on the cloud platform.
- Used Hive queries to import data into the Microsoft Azure cloud and analyzed the data using Hive scripts.
- Used Ambari on the Azure HDInsight cluster to record and manage the logs of the NameNode and DataNodes.
- Created Hive tables and worked on them for data analysis to meet the requirements.
- Developed a framework to load and transform large sets of unstructured data from UNIX systems into Hive tables.
- Worked with business team in creating Hive queries for ad hoc access.
- Implemented Hive generic UDFs to encapsulate business logic.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Configured, deployed and maintained multi-node Dev and Test Kafka Clusters.
- Deployed Cloudera Hadoop Cluster on Azure for Big Data Analytics.
- Deployed the data to the Hadoop cluster on Azure as a data lake.
- Started using Apache NiFi to copy data from the local file system to HDFS.
- Used sbt to compile and package the Scala code into a jar and deployed it on the cluster using spark-submit (see the sketch after this list).
- Involved in continuous monitoring of operations using Storm.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.
- Implemented indexing of logs from Oozie to Elasticsearch.
- Designed, developed, unit-tested and supported ETL mappings and scripts for data marts using Talend.
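A minimal sketch of the kind of Scala job that would be packaged with sbt and launched with spark-submit, as mentioned above; the object name, logic and paths are hypothetical, and the plain RDD API is used since the environment lists Spark 1.6.1.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal word-count style job; input and output paths come from spark-submit arguments.
object LogSummaryJob {
  def main(args: Array[String]): Unit = {
    val Array(input, output) = args            // expects exactly two arguments
    val sc = new SparkContext(new SparkConf().setAppName("log-summary-sketch"))

    sc.textFile(input)
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
      .saveAsTextFile(output)

    sc.stop()
  }
}
```

Such a job would typically be built with `sbt package` and launched with something like `spark-submit --class LogSummaryJob --master yarn <jar> <input-path> <output-path>`, with the exact jar path and master settings depending on the cluster.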
Environment: Hortonworks, Hadoop, MapReduce, HDFS, Hive, Pig, Sqoop, Apache Kafka, Azure, Apache Storm, Oozie, SQL, Flume, Spark 1.6.1, HBase and GitHub.