Sr. Spark/Hadoop Developer Resume
Grapevine, TX
SUMMARY
- 8+ years of professional IT experience in analyzing requirements and designing and building highly distributed, mission-critical products and applications.
- Highly dedicated and results-oriented Hadoop developer with 4+ years of strong end-to-end Hadoop development experience, with varying levels of expertise across different Big Data environment projects.
- Expertise in core Hadoop and the Hadoop technology stack, which includes HDFS, MapReduce, Oozie, Hive, Sqoop, Pig, Flume, HBase, Spark, Kafka, and ZooKeeper.
- Experience with the RDD architecture, implementing Spark operations on RDDs, and optimizing Spark transformations and actions.
- Reviewed and managed Hadoop log files by consolidating logs from multiple machines using Flume.
- Collected log data from web servers and ingested it into HDFS using Flume.
- Wrote Flume configuration files for importing streaming log data into HBase.
- Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems and vice versa.
- Experience in installation and setup of various Kafka producers and consumers along with the Kafka brokers and topics.
- Hands on experience in application development using Java, RDBMS, and Linux shell scripting.
- Experienced in managing Hadoop cluster using Cloudera Manager Tool.
- Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions and Data warehouse tools for reporting and data analysis.
- Experience with Oozie Workflow Engine in running workflow jobs with actions that run Java MapReduce and Pig jobs.
- Strong hands-on experience with PySpark, using Spark libraries through Python scripting for data analysis.
- Implemented data science algorithms like shift detection in critical data points using Spark, doubling the performance.
- Experience in importing and exporting the data using Sqoop from HDFS to Relational Database systems/mainframe and vice-versa.
- Extended Hive and Pig core functionality by writing custom UDFs.
- Experience in analyzing data using HiveQL, Pig Latin, and custom Map Reduce programs in Java.
- Experience in Apache Flume for efficiently collecting, aggregating, and moving large amounts of log data.
- Involved in developing web services using REST and the HBase native client API to query data from HBase.
- Experienced in working with structured data using Hive QL, join operations, Hive UDFs, partitions, bucketing and internal/external tables.
- Created custom UDFs for Pig and Hive to incorporate Python/Java methods and functionality into Pig Latin and HQL (HiveQL).
- Involved in converting Cassandra/Hive/SQL queries into Spark transformations using Spark RDDs in Scala and Python (a minimal PySpark sketch follows this summary).
- Used a highly available AWS environment to launch applications in different regions and implemented CloudFront with AWS Lambda to reduce latency.
- Implemented CRUD operations using CQL on top of Cassandra file system.
- Used Cassandra CQL with Java APIs to retrieve data from Cassandra tables.
- Set up Solr for distributed indexing and search.
- Used Solr indexing to enable searches on non-primary-key columns in Cassandra keyspaces.
- Excellent working Knowledge in Spark Core, Spark SQL, Spark Streaming.
- Hands-on exposure to Amazon Web Services, the AWS command line interface, and AWS Data Pipeline.
- Work experience with cloud infrastructure like Amazon Web Services (AWS).
- Extensive experience working with various Hadoop distributions, such as enterprise versions of Cloudera (CDH4/CDH5) and Hortonworks, with good knowledge of the MapR distribution, IBM BigInsights, and Amazon EMR (Elastic MapReduce).
- Experience designing and developing POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
- Developed automated processes for flattening upstream JSON data from Cassandra, using Hive UDFs to flatten the JSON data.
- Expertise in developing responsive front-end components with JavaScript, JSP, HTML, XHTML, Servlets, Ajax, and AngularJS.
- Experience as a Java Developer in Web/intranet, client/server technologies using Java, J2EE, Servlets, JSP, JSF, EJB, JDBC and SQL.
- Experience in setting up automated monitoring and escalation infrastructure for Hadoop Cluster using Ganglia and Nagios.
- Good understanding of Apache Hue.
- Techno-functional responsibilities include interfacing with users, identifying functional and technical gaps, estimates, designing custom solutions, development, leading developers, producing documentation, and production support.
- Proficient with version control systems such as GitHub and SVN.
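Illustrative example for the Hive/SQL-to-Spark conversion work noted above: a minimal PySpark sketch, with hypothetical table and column names, showing how a HiveQL aggregation can be expressed as DataFrame transformations.

```python
# Minimal PySpark sketch; table and column names are hypothetical, not from a real project.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hiveql-to-spark-sketch")
         .enableHiveSupport()            # allow reading/writing Hive-managed tables
         .getOrCreate())

# Rough equivalent of:
#   SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
#   FROM sales.orders GROUP BY region;
orders = spark.table("sales.orders")     # hypothetical Hive table
summary = (orders
           .groupBy("region")
           .agg(F.count("*").alias("orders"),
                F.sum("amount").alias("revenue")))

# Persist the aggregate back to the warehouse for downstream reporting.
summary.write.mode("overwrite").saveAsTable("sales.orders_by_region")
```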
TECHNICAL SKILLS
Hadoop Distributions: Hortonworks, Cloudera (CDH3, CDH4, CDH5), Apache, Amazon AWS (EMR), MapR, and Azure.
Hadoop Data Services: HDFS, MapReduce, YARN, Hive, Pig, Pentaho, HBase, ZooKeeper, Sqoop, Oozie, Cassandra, Spark, Scala, Storm, Flume, Kafka, Avro, Parquet, Snappy, and NiFi.
Hadoop Operational Services: Zookeeper, Oozie
NoSQL Databases: HBase, Cassandra, MongoDB, Neo4j, Redis
Cloud Services: Amazon AWS
Languages: SQL, PL/SQL, Pig Latin, HiveQL, Unix shell scripting, HTML, XML (XSD, XSLT, DTD), C, C++, Java, JavaScript, Python, Scala
ETL Tools: Informatica, IBM DataStage, Talend
Java & J2EE Technologies: Core Java, Servlets, Hibernate, Spring, Struts, JSP, JDBC, EJB
Application Servers: WebLogic, WebSphere, Tomcat.
Databases: Oracle, MySQL, DB2, Teradata, MS SQL Server, SQL/NoSQL, HBase, Cassandra, Neo4j
Operating Systems: UNIX, Windows, iOS, LINUX
Methodologies: Agile (Scrum), Waterfall
Other Tools: Putty, WinSCP, Stream Weaver.
PROFESSIONAL EXPERIENCE
Confidential, Grapevine, TX
Sr. Spark/Hadoop Developer
Responsibilities:
- Extensively migrated the existing architecture to Spark Streaming to process live streaming data.
- Responsible for Spark Core configuration based on the type of input source.
- Executed Spark code written in Scala for Spark Streaming/SQL for faster data processing.
- Performed SQL joins among Hive tables to produce input for the Spark batch process.
- Gathered the business requirements from the Business Partners and Subject Matter Experts.
- Developed Python code to gather data from HBase and designed the solution for implementation using PySpark.
- Developed PySpark code to mimic the transformations performed in the on-premises environment.
- Analyzed the SQL scripts and designed solutions to implement them using PySpark; created custom new columns depending on the use case while ingesting data into the Hadoop data lake with PySpark.
- Analyzed the Cassandra database and compared it with other open-source NoSQL databases to determine which one better suits the current requirements.
- Integrated Cassandra as a distributed persistent metadata store to provide metadata resolution for network entities on the network.
- Implemented Spark using Scala and also used PySpark (Python) for faster testing and processing of data.
- Designed multiple Python packages that were used within a large ETL process to load 2 TB of data from an existing Oracle database into a new PostgreSQL cluster.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
- Loaded data from the Linux file system into HDFS and vice versa.
- Developed UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation queries, writing results back into OLTP systems through Sqoop.
- Implemented advanced procedures like text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
- Installed and monitored Hadoop ecosystems tools on multiple operating systems like Ubuntu, CentOS.
- Exported the analyzed patterns back into Teradata using Sqoop.
- Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
- Participated in the development/implementation of the Cloudera Impala Hadoop environment.
- Utilized the Apache Hadoop environment provided by Cloudera.
- Collected data using Spark Streaming and loaded it into the Cassandra cluster.
- Developed Scala scripts using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark for data aggregation and queries, writing data back into the OLTP system through Sqoop.
- Extensively used ZooKeeper as a job scheduler for Spark jobs.
- Extended Hive and Pig core functionality by writing custom UDFs.
- Wrote Java code to format XML documents and upload them to the Solr server for indexing.
- Used AWS to export MapReduce jobs into Spark RDD transformations.
- Wrote Terraform templates for automation requirements in AWS services.
- Used the Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
- Deployed and configured AWS EC2 for client websites moving off self-hosted services, for scalability.
- Worked with multiple teams to provision AWS infrastructure for development and production environments.
- Experience designing and monitoring Kafka multi-data-center clusters.
- Designed number of partitions and replication factor for Kafka topics based on business requirements.
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done in Python (PySpark).
- Implemented a distributed messaging queue to integrate with Cassandra using Apache Kafka and Zookeeper.
- Experience with Kafka and Spark integration for real-time data processing (see the Spark Streaming sketch following the Environment line for this project).
- Developed Kafka producer and consumer components for real-time data processing (see the producer/consumer sketch at the end of this list).
- Hands-on experience setting up Kafka MirrorMaker for data replication across clusters.
- Experience in configuring, designing, implementing, and monitoring Kafka clusters and connectors.
- Performed Oracle SQL tuning using explain plans.
- Manipulated, serialized, and modeled data in multiple formats such as JSON and XML.
- Involved in setting up MapReduce 1 and MapReduce 2.
- Prepared Avro schema files for generating Hive tables.
- Used Impala connectivity from the user interface (UI) and queried the results using Impala QL.
- Worked on physical transformations of the data model, which involved creating tables, indexes, joins, views, and partitions.
- Involved in analysis, design, system architecture design, process interface design, and documentation.
- Used Jira for bug tracking and Bitbucket to check in and check out code changes.
- Involved in Cassandra data modeling to create keyspaces and tables in a multi-data-center DSE Cassandra database.
- Utilized Agile and Scrum Methodology to help manage and organize a team of developers with regular code review sessions.
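Illustrative producer/consumer sketch for the Kafka work noted above: a minimal Python example using the kafka-python client; the client library, broker addresses, topic name, and consumer group are assumptions for illustration only.

```python
# Minimal Kafka producer/consumer sketch using the kafka-python client.
# Broker addresses, topic name, and consumer group are hypothetical.
import json
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["broker1:9092", "broker2:9092"]   # hypothetical brokers
TOPIC = "clickstream-events"                 # hypothetical topic

# Producer: serialize events as JSON and wait for full acknowledgement.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",
)
producer.send(TOPIC, {"user_id": 42, "action": "page_view"})
producer.flush()

# Consumer: read from the beginning of the topic as part of a consumer group.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    group_id="spark-ingest",                 # hypothetical consumer group
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.partition, message.offset, message.value)
    break                                     # one message is enough for the sketch
```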
Environment: Cloudera, Spark, Impala, Sqoop, Flume, Cassandra, Kafka, Hive, ZooKeeper, Oozie, RDBMS, AWS.
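Illustrative sketch for the Kafka/Spark streaming integration in this project: a minimal PySpark Structured Streaming example; the broker, topic, output paths, and the Structured Streaming API choice (rather than DStreams) are assumptions for illustration only.

```python
# Minimal Spark Structured Streaming sketch reading from Kafka; all names are hypothetical.
# Requires the spark-sql-kafka package on the classpath
# (e.g. --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<spark version>).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("kafka-structured-streaming-sketch")
         .getOrCreate())

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "clickstream-events")
          .option("startingOffsets", "latest")
          .load())

# Kafka delivers key/value as binary; cast the value to a string for downstream parsing.
lines = events.select(F.col("value").cast("string").alias("json_value"))

# Land the stream as Parquet files with checkpointing for fault tolerance.
query = (lines.writeStream
         .format("parquet")
         .option("path", "/data/streams/clickstream")
         .option("checkpointLocation", "/data/checkpoints/clickstream")
         .outputMode("append")
         .start())
query.awaitTermination()
```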
Confidential
Sr. Spark/Hadoop Developer
Responsibilities:
- Analyzed large and critical datasets using Cloudera, HDFS, HBase, MapReduce, Hive, Hive UDF, Pig, Sqoop, Zookeeper and Spark.
- Responsible for managing data coming from different sources.
- Developed Batch Processing jobs using Pig and Hive.
- Involved in gathering the business requirements from the Business Partners and Subject Matter Experts.
- Worked with different file formats such as TextFile, Avro, ORC, and Parquet for Hive querying and processing (see the format-conversion sketch at the end of this list).
- Imported and exported data into HDFS and Hive using Sqoop.
- Implemented Elastic Search on Hive data warehouse platform.
- Good experience analyzing the Hadoop cluster with different analytic tools like Pig and Impala.
- Experienced in managing and reviewing Hadoop log files.
- Extracted files from CouchDB through Sqoop, placed them in HDFS, and processed them.
- Experienced in running Hadoop streaming jobs to process terabytes of XML-format data.
- Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files.
- Created concurrent access for Hive tables with shared and exclusive locking that can be enabled in Hive with the help of Zookeeper implementation in the cluster.
- Stored and loaded data from HDFS to Amazon S3 and backed up the namespace data to NFS.
- Implemented NameNode backup using NFS for high availability.
- Designed workflows and coordinators in Oozie to automate and parallelize Hive jobs on the Apache Hadoop environment by Hortonworks (HDP 2.2).
- Responsible for building scalable distributed data solutions using a Hadoop cluster environment with the Hortonworks distribution.
- Integrated HiveServer2 with Tableau using the Hortonworks Hive ODBC driver, for auto-generation of Hive queries for non-technical business users.
- Troubleshooting, managing, and reviewing data backups and Hadoop log files on the Hortonworks cluster.
- Used Pig to perform data validation on data ingested using Sqoop and Flume, and pushed the cleansed data set into MongoDB.
- Ingested streaming data with Apache NiFi into Kafka.
- Worked with NiFi to manage the flow of data from sources through automated data flows.
- Designed and implemented the MongoDB schema.
- Wrote services to store and retrieve user data from the MongoDB for the application on devices.
- Used Mongoose API to access the MongoDB from NodeJS.
- Created and implemented business validation and coverage price gap rules on Hive using the Talend tool.
- Wrote shell scripts to automate rolling day-to-day processes.
- Wrote shell scripts to monitor Hadoop daemon services and respond accordingly to any warning or failure conditions.
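Illustrative sketch for the file-format and Sqoop ingestion work in this project: a minimal PySpark example, with hypothetical HDFS path, column names, and target table, that reads delimited text landed in HDFS and persists it as a partitioned, Parquet-backed Hive table.

```python
# Minimal PySpark sketch; HDFS path, column names, and target table are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("text-to-parquet-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Raw tab-delimited files landed in HDFS (e.g. by a Sqoop import).
raw = (spark.read
       .option("delimiter", "\t")
       .csv("/data/landing/customers"))

# Assign column names to the positional CSV columns.
customers = raw.toDF("customer_id", "name", "state", "load_date")

# Persist as a partitioned, Parquet-backed Hive table for efficient querying.
(customers.write
 .mode("overwrite")
 .partitionBy("load_date")
 .format("parquet")
 .saveAsTable("warehouse.customers"))
```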
Environment: Apache Flume, Hive, Pig, HDFS, Zookeeper, Sqoop, RDBMS, AWS, MongoDB, Talend, Shell Scripts, Eclipse, WinSCP, Hortonworks.