Sr. Spark/Scala Developer Resume
East Greenwich, Rhode Island
SUMMARY
- An accomplished Hadoop/Spark developer experienced in ingestion, storage, querying, processing and analysis of big data.
- Extensive experience in developing applications that perform data processing tasks against Teradata, Oracle, SQL Server and MySQL databases.
- Hands-on expertise in designing row keys and schemas for NoSQL databases such as MongoDB 3.0.1, HBase, Cassandra and DynamoDB (AWS).
- Extensively worked with Spark using Scala on clusters for computational analytics; installed Spark on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle.
- Working knowledge of Amazon's Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as a storage mechanism.
- Experienced in implementing schedulers using Oozie, Airflow, Crontab and shell scripts.
- Prepared JIL scripts for scheduling workflows using AutoSys and automated jobs with Oozie.
- Good working experience importing data using Sqoop and SFTP from sources such as RDBMS, Teradata, Mainframes, Oracle and Netezza into HDFS, and performing transformations on it using Hive, Pig and Spark.
- Extensive experience in importing and exporting streaming data into HDFS using stream processing platforms like Flume and Kafka messaging system.
- Strong experience and knowledge of real time data analytics using Spark Streaming, Kafka and Flume.
- Exposure to Data Lake implementation using Apache Spark; developed data pipelines and applied business logic using Spark.
- Hands-on experience with analytical tools such as SAS, R, RStudio, Python, NumPy, scikit-learn, Spark MLlib, Neo4j and GraphDB.
- Expertise in using Kafka as a messaging system to implement real-time Streaming solutions.
- Extensively worked on Spark Streaming and Apache Kafka to fetch live stream data.
- Used Scala and Python to convert Hive/SQL queries into RDD transformations in Apache Spark.
- Expertise in writing Spark RDD transformations, actions, DataFrames and case classes for the required input data, and performed data transformations using Spark Core (see the sketch following this summary).
- Worked on GUI Based Hive Interaction tools like Hue, Karmasphere for querying the data.
- Proficient in NoSQL databases including HBase, Cassandra, MongoDB and its integration with Hadoop cluster.
- Working knowledge in installing and maintaining Cassandra by configuring the cassandra.yaml file as per the business requirement and performed reads/writes using Java JDBC connectivity.
- Experience in writing Complex SQL queries, PL/SQL, Views, Stored procedure, triggers, etc.
- Involved in maintaining the Big Data servers using Ganglia and Nagios.
- Wrote multiple MapReduce jobs using the Java API, Pig and Hive for data extraction, transformation and aggregation from multiple file formats including Parquet, Avro, XML, JSON, CSV and ORC, along with compression codecs such as Gzip, Snappy and LZO.
- Good experience in optimizing MapReduce algorithms using mappers, reducers, combiners and partitioners to deliver the best results for large datasets.
- Expert in coding Teradata SQL, Teradata stored procedures, macros and triggers.
- Installed, Configured and Administered PostgreSQL Databases in the Dev, Staging and Prod environment.
- Extracted data from various data sources including OLE DB, Excel, flat files and XML.
- Experienced in using build tools such as Ant, SBT and Maven (with Log4j for logging) to build and deploy applications to servers.
- Experienced in migrating data from different sources using the pub-sub model in Redis and Kafka producers/consumers, and preprocessing data using Storm topologies.
- Experienced in writing Ad Hoc queries using Cloudera Impala, also used Impala analytical functions. Good understanding of MPP databases such as HP Vertica.
- Experience in Enterprise search using SOLR to implement full text search with advanced text analysis, faceted search, filtering using advanced features like dismax, extended dismax and grouping.
- Worked on data warehousing and ETL tools like Informatica, Talend, and Pentaho.
- Designed ETL workflows in Tableau and deployed data from various sources to HDFS.
- Working experience with test management and testing tools such as HP Quality Center, HP ALM, LoadRunner, QTP and Selenium.
- Worked on the ELK stack (Elasticsearch, Logstash, Kibana) for log management.
- Experience in managing and reviewing Hadoop log files.
- Hands-on knowledge of core Java concepts such as exceptions, collections, data structures, I/O, multithreading, and serialization/deserialization for streaming applications.
- Used project management services such as JIRA for tracking issues and bugs related to code, GitHub for code reviews, and version control tools including CVS, Git, PVCS and SVN.
- Good understanding of all aspects of testing, including unit, regression, white-box and black-box testing in Agile environments.
- Good analytical, communication and problem-solving skills, and an enthusiasm for learning new technical and functional skills.
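
As a brief illustration of the Hive/SQL-to-Spark conversions referenced above, the following minimal Scala sketch rewrites a simple aggregation query as typed Dataset transformations and an action. The `db.transactions` table, the `Txn` case class and its columns are hypothetical names used only for this example, not from an actual engagement.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type and table; stands in for the "required input data".
case class Txn(accountId: String, amount: Double, txnDate: String)

object HiveToSparkExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveToSparkExample")
      .enableHiveSupport()                          // requires a Hive metastore
      .getOrCreate()
    import spark.implicits._

    // Equivalent of: SELECT accountId, SUM(amount) FROM db.transactions
    //                WHERE txnDate >= '2017-01-01' GROUP BY accountId
    val totals = spark.table("db.transactions")
      .as[Txn]                                      // typed Dataset of case-class rows
      .filter(_.txnDate >= "2017-01-01")            // transformation
      .groupByKey(_.accountId)
      .mapGroups((acct, rows) => (acct, rows.map(_.amount).sum))

    totals.show(10)                                 // action triggers execution
    spark.stop()
  }
}
```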
TECHNICAL SKILLS
Big Data Technologies: HDFS, YARN, MapReduce, Hive, Pig, Impala, Sqoop, Storm, Flume, Spark, Apache Kafka, Zookeeper, Solr, Ambari, Oozie, MongoDB, Cassandra, Mahout, Puppet, Avro, Parquet, Snappy, Falcon.
NoSQL Databases: HBase, Cassandra, MongoDB, Amazon DynamoDB, Redis
Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, MapR and Apache.
Languages: C, C++, Java, Scala, Python, XML, XHTML, HTML, AJAX, CSS, SQL, PL/SQL, Pig Latin, HiveQL, JavaScript, Unix shell scripting
Java & J2EE Technologies: Core Java, Java 7, Java 8, Hibernate, Spring Framework, JSP, Servlets, JavaBeans, JDBC, EJB 3.0, Java Sockets, jQuery, JSF, PrimeFaces, SOAP, XSLT, DHTML; messaging services: JMS, MQ Series, MDB; J2EE MVC, Struts 2.1, Spring 3.2 MVC, Spring Web, JUnit, MRUnit
Source Code Control: GitHub, CVS, SVN, ClearCase
Application Servers: WebSphere, WebLogic, JBoss, Tomcat
Cloud Computing Tools: Amazon AWS (S3, EMR, EC2, Lambda, VPC, Route 53, CloudWatch, CloudFront), Microsoft Azure
Databases: Teradata, Oracle 10g/11g, Microsoft SQL Server, MySQL, DB2
DB languages: MySQL, PL/SQL, PostgreSQL & Oracle
Build & Logging Tools: Jenkins, Maven, Ant, Log4j
Business Intelligence Tools: Tableau, Splunk
Development Tools: Eclipse, IntelliJ, Microsoft SQL Studio, Toad, NetBeans
ETL Tools: Talend, Pentaho, Informatica, Ab Initio
Development Methodologies: Agile, Scrum, Waterfall, V model, Spiral
PROFESSIONAL EXPERIENCE
Confidential, East Greenwich, Rhode Island
Sr. Spark/Scala Developer
Responsibilities:
- Involved in analyzing business requirements and prepared detailed specifications that follow project guidelines required for project development.
- Developed Spark applications using Scala for easy Hadoop transitions.
- Developed various data loading strategies and performed various transformations for analyzing the datasets by using Hortonworks Distribution for Hadoop ecosystem.
- Used Spark and Spark SQL to read Parquet data and create Hive tables using the Scala API (see the Parquet-to-Hive sketch after this list).
- Performed advanced procedures such as text analytics and processing using the in-memory computing capabilities of Spark with Scala.
- Developed Spark code using Scala and Spark SQL for faster processing and testing.
- Implemented Spark sample programs in Python using PySpark.
- Analyzed the SQL scripts and designed solutions to implement them using PySpark.
- Developed PySpark code to mimic the transformations performed in the on-premises environment.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time.
- Responsible for loading data pipelines from web servers and Teradata using Sqoop with Kafka and the Spark Streaming API.
- Developed Kafka producers and consumers, Cassandra clients and Spark components on HDFS and Hive.
- Populated HDFS and HBase with huge amounts of data using Apache Kafka.
- Used Kafka to ingest data into the Spark engine.
- Configured, deployed and maintained multi-node Dev and Test Kafka clusters.
- Used Apache Spark with the ELK cluster to obtain specific visualizations that required more complex data processing/querying.
- Worked with the ELK Stack cluster to import logs into Logstash, send them to Elasticsearch nodes and create visualizations in Kibana.
- Indexed documents into Elasticsearch.
- Implemented Elasticsearch on the Hive data warehouse platform.
- Managed and scheduled Spark jobs on a Hadoop cluster using Oozie.
- Experienced with scripting languages such as Python and shell scripts.
- Developed various Python scripts to find vulnerabilities with SQL Queries by doing SQL injection, permission checks and performance analysis.
- Tested Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, on Pig and Hive jobs.
- Designed and implemented incremental imports into Hive tables and wrote Hive queries to run on Tez.
- Configured various views in Ambari such as the Hive view, Tez view, and YARN Queue Manager.
- Built data pipelines using Kafka and Akka to handle terabytes of data.
- Wrote shell scripts that run multiple Hive jobs to incrementally refresh Hive tables used to generate Tableau reports for the business.
- Experienced with Apache Spark for implementing advanced procedures such as text analytics and processing, using in-memory computing capabilities written in Scala.
- Developed Solr web apps to query and visualize Solr-indexed data from HDFS.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala and Python.
- Worked on Spark SQL, created DataFrames by loading data from Hive tables, prepared the data and stored it in AWS S3.
- Used Spark Streaming APIs to perform on-the-fly transformations and actions for the common learner data model, which receives data from Kafka in near real time and persists it into Cassandra (see the streaming sketch after this list).
- Created custom UDFs for Pig and Hive to incorporate Python methods and functionality into Pig Latin and HiveQL.
- Extensively worked on Text, ORC, Avro and Parquet file formats and compression techniques like Snappy, Gzip and Zlib.
- Implemented NiFi on Hortonworks (HDP 2.4) and recommended solutions to ingest data from multiple data sources into HDFS and Hive using NiFi.
- Ingested data from RDBMS, performed data transformations, exported the transformed data to Cassandra per the business requirement and used Cassandra through Java services.
- Experience in NoSQL Column-Oriented Databases like Cassandra and its Integration with Hadoop cluster.
- Wrote ETL jobs to read from web APIs using REST and HTTP calls and loaded the data into HDFS using Java and Talend.
- Along with the infrastructure team, designed and developed a Kafka- and Storm-based data pipeline.
- Created partitions and buckets based on state to further process the data using bucket-based Hive joins.
- Worked on sequence files, RC files, map-side joins, bucketing and partitioning for Hive performance enhancement and storage improvement, utilizing Hive SerDes such as RegEx, JSON and Avro.
- Used Oozie operational services for batch processing and scheduling workflows dynamically.
- Worked entirely in Agile methodology and developed Spark scripts using the Scala shell.
- Involved in loading and transforming large datasets from relational databases into HDFS and vice versa using Sqoop imports and exports.
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and moved data between MySQL and HDFS using Sqoop.
- Used Hibernate ORM framework with Spring framework for data persistence and transaction management.
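
The two sketches below illustrate, under assumed names, patterns described in the bullets above. First, the Parquet-to-Hive step: reading Parquet data with Spark SQL and registering it as a Hive table through the Scala API. The HDFS path, database, table and partition column are placeholders, not the project's actual ones.

```scala
import org.apache.spark.sql.SparkSession

object ParquetToHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ParquetToHive")
      .enableHiveSupport()
      .getOrCreate()

    // Read the raw Parquet files from an assumed landing path.
    val events = spark.read.parquet("hdfs:///data/raw/events/")

    events.write
      .mode("overwrite")
      .partitionBy("event_date")        // assumes an event_date column in the data
      .saveAsTable("analytics.events")  // registers a Hive-managed table
  }
}
```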
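
Second, a rough sketch of the Kafka-to-Spark Streaming-to-Cassandra flow, using the spark-streaming-kafka-0-10 direct stream and the DataStax Spark-Cassandra connector. The broker address, topic, keyspace, table and the `LearnerEvent` record are invented for illustration; the actual learner data model is not reproduced here.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.KafkaUtils
import com.datastax.spark.connector.streaming._   // spark-cassandra-connector

// Hypothetical event record; field names map to an assumed Cassandra table.
case class LearnerEvent(learnerId: String, course: String, score: Double)

object KafkaToCassandraStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("LearnerModelStream")
      .set("spark.cassandra.connection.host", "cassandra-host")  // placeholder host
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",                    // placeholder broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "learner-model",
      "auto.offset.reset"  -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("learner-events"), kafkaParams))

    // Parse CSV-style messages and persist them to an assumed table
    // learning.learner_events(learner_id, course, score).
    stream.map(_.value.split(","))
      .filter(_.length == 3)
      .map(f => LearnerEvent(f(0), f(1), f(2).toDouble))
      .saveToCassandra("learning", "learner_events")

    ssc.start()
    ssc.awaitTermination()
  }
}
```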
Environment: Hortonworks, Hadoop, Hive, MapReduce, Sqoop, Kafka, Spark, YARN, Pig, Cassandra, Oozie, Shell scripting, Scala, ELK, Tez, Maven, Java, JUnit, Agile methodologies, NiFi, MySQL, Tableau, AWS, EC2, S3, Power BI, Solr.
Confidential, St Louis, MO
Sr. Hadoop/Big Data Engineer
Responsibilities:
- Worked on analyzing the Hadoop cluster using different big data analytic tools including Pig, Hive, Oozie, Zookeeper, Sqoop, Spark, Kafka and Impala with the Cloudera distribution.
- Responsible for building scalable distributed data solutions using Apache Hadoop and Spark.
- Deployed a scalable Hadoop cluster on AWS using S3 as the underlying file system for Hadoop.
- Worked on the cluster disaster recovery plan for the Hadoop cluster by implementing cluster data backup in Amazon S3 buckets.
- Worked with AWS Elastic MapReduce (EMR) and set up the Hadoop environment on AWS EC2 instances.
- Developed Spark scripts using the Scala IDE per the business requirements.
- Collected JSON data from an HTTP source and developed Spark APIs that help perform inserts and updates in Hive tables.
- Used Spark Streaming APIs to perform necessary transformations and actions on the fly to build the common learner data model, which receives data from Kinesis in near real time and persists it into HBase.
- Worked with XML, extracting tag information using XPath and Scala XML libraries from compressed BLOB data types.
- Wrote several RESTful APIs in the Scala functional language to implement the defined functionality.
- Involved in making changes to Spark APIs when migrating from Spark 1.6 to Spark 2.1.
- Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive (see the sketch after this list).
- Worked on the integration of the Kafka messaging service for near-live stream processing.
- Partitioned data streams using Kafka; designed and configured a Kafka cluster to accommodate heavy throughput of 1 million messages per second, and used Kafka producer APIs to produce messages.
- Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
- Handled importing data from different data sources into HDFS using Sqoop, performed transformations using Hive and loaded the data into HDFS.
- Built and maintained scalable data pipelines using theHadoopecosystem and other open source components like Hive and HBase.
- Captured data from existing databases that provide SQL interfaces using Sqoop.
- Enabled speedy reviews and first mover advantages by using Oozie to automate data loading into theHadoopDistributed File System (HDFS).
- Used BulkLoad to import data into HBase to analyze data from Cassandra tables for quick searching, sorting and grouping.
- Used Impala and Tableau to create various reporting dashboards.
- Used external tables in Impala for data analysis.
- Configured and implemented Jenkins, Maven and Nexus for continuous integration.
- Worked with various AWS components such as EC2, S3, IAM, VPC, RDS, Route 53, SNS and SQS.
- Used Jira for bug tracking and Bitbucket to check in and check out code changes.
- Experienced working in Dev, Staging and Prod environments.
- Worked with SCRUM team in delivering agreed user stories on time for every Sprint.
- Actively involved in code review and bug fixing to improve performance.
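
A minimal sketch of the Hive-backed analytics referenced above (Spark SQL over YARN, comparing fresh feed data with an EDW reference table). The database, table and column names are assumptions made for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

object HiveTrendAnalysis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveTrendAnalysis")
      .enableHiveSupport()
      .getOrCreate()

    // Join fresh feed data with an EDW reference table (names are placeholders).
    val trends = spark.sql(
      """SELECT r.category, f.sale_dt, f.units
        |FROM   staging.daily_sales f
        |JOIN   edw.product_ref r ON f.product_id = r.product_id
        |WHERE  f.sale_dt >= date_sub(current_date(), 30)""".stripMargin)

    // Aggregate the last 30 days by category so analysts can spot emerging trends.
    trends.groupBy(col("category"))
      .agg(sum(col("units")).as("units_30d"))
      .orderBy(col("units_30d").desc)
      .show(20)

    spark.stop()
  }
}
```

Submitted with `spark-submit --master yarn`, this would run against the Hive metastore configured on the cluster.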
Environment: CDH, Spark, Spark Streaming, Spark SQL, AWS EMR, HDFS, Hive, Apache Kafka, Sqoop, Java (JDK SE 6, 7), Scala, Impala, Tableau, Shell scripting, Maven, Eclipse, Oracle, Bitbucket, Oozie, MySQL, SOAP, Cassandra.
Confidential, Houston,TX
Sr. Hadoop/Spark Developer
Responsibilities:
- Developed Spark applications using Scala and Java, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Worked with Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, Spark MLlib, DataFrames, pair RDDs and Spark on YARN.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build the common learner data model, which receives data from Flume in near real time and persists it to MongoDB.
- Imported weblogs and unstructured data using Apache Flume and stored them in a Flume channel.
- Loaded CDRs from relational databases using Sqoop, and from other sources into the Hadoop cluster via Flume.
- Developed business logic in Flume interceptor in Java.
- Implemented quality checks and transformations using Flume interceptors.
- Consumed XML messages using Flume and processed the XML files using Spark Streaming to capture UI updates.
- Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files (see the sketch after this list).
- Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.
- Worked and learned a great deal from AWS Cloud services like EC2, S3, EBS, RDS and VPC.
- Migrated an existing on-premises application to AWS; used AWS services such as EC2 and S3 for small data set processing and storage while maintaining the Hadoop cluster on AWS EMR.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Created data models for the client's transactional logs and analyzed the data from MongoDB tables for quick searching, sorting and grouping.
- Used HiveQL to analyze partitioned and bucketed data, and executed Hive queries on Parquet tables stored in Hive to perform data analysis meeting the business specification logic.
- Experience in using Avro, Parquet, RCFile and JSON file formats, developed UDFs in Hive and Pig.
- Worked with Log4j framework for logging debug, info & error data.
- Implemented ETL standards utilizing proven data processing patterns with open source standard tools like Talend and Pentaho for more efficient processing.
- Well versed on Data Warehousing ETL concepts using Informatica Power Center, OLAP, OLTP and AutoSys.
- Developed Oozie coordinators to schedule Pig and Hive scripts to create Data pipelines.
- Wrote several MapReduce jobs using the Java API and used Jenkins for continuous integration.
- Set up and worked on Kerberos authentication principals to establish secure network communication on the cluster, and tested HDFS, Hive, Pig and MapReduce access to the cluster for new users.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
- Modified Ant scripts to build the JARs, class files, WAR files and EAR files.
- Developed applications using Eclipse and used Maven as the build and deployment tool.
- Generated various kinds of reports using Power BI and Tableau based on Client specification.
- Responsible for generating actionable insights from complex data to drive real business results for various application teams and worked in Agile Methodology projects extensively.
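
A small sketch of the JSON-flattening preprocessing job mentioned above, done with Spark DataFrames. The input path and the nested `order`/`items` field names are assumptions, not the project's actual schema.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

object FlattenJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FlattenJson").getOrCreate()

    // Assumed S3 location of the raw JSON documents.
    val raw = spark.read.json("s3a://example-bucket/incoming/orders/")

    // Pull nested fields up to the top level and explode the items array.
    val flat = raw
      .select(
        col("order.id").as("order_id"),
        col("order.customer.name").as("customer"),
        explode(col("order.items")).as("item"))
      .select(
        col("order_id"),
        col("customer"),
        col("item.sku").as("sku"),
        col("item.qty").as("qty"))

    // Write the flattened records out as a delimited flat file.
    flat.write.option("header", "true").csv("s3a://example-bucket/curated/orders_flat/")
  }
}
```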
Environment: Spark, Spark Streaming, Spark SQL, AWS EMR, CDH, HDFS, Hive, Pig, Apache Kafka, Sqoop, Java (JDK SE 6, 7), Scala, Shell scripting, Linux, MySQL, Oracle Enterprise DB, Solr, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau, SOAP, Cassandra and Agile methodologies.
Confidential, Mattoon, IL
Hadoop Developer
Responsibilities:
- Experienced in migrating and transforming large sets of structured, semi-structured and unstructured raw data from HBase through Sqoop into HDFS for further processing.
- Worked with the Cloudera support team to fine-tune the cluster.
- Extracted data on customers' everyday transactions from DB2, exported it to Hive and set up online analytical processing.
- Wrote multiple MapReduce programs in Java for data extraction, transformation and aggregation from multiple file formats including XML, JSON, CSV and other codec file formats.
- Wrote Java programs to retrieve data from HDFS and provide it to REST services.
- Implemented business logic by writing UDFs in Java and used various UDFs from other sources (see the UDF sketch after this list).
- Implemented Sqoop for large data transfers from RDBMS to HDFS/HBase/Hive and vice versa.
- Implemented partitioning, bucketing in Hive for better organization of the data.
- Involved in using HCatalog to access Hive table metadata from MapReduce or Pig code.
- Created HBase tables, used HBase sinks and loaded data into them to perform analytics using Tableau.
- Installed, configured and maintained Flume, Hive, Pig, Sqoop and Oozie on the Hadoop cluster.
- Created multiple Hive tables, ran Hive queries on that data, and implemented partitioning, dynamic partitioning and buckets in Hive for efficient data access.
- Experienced in running batch processes using Pig Latin scripts and developed Pig UDFs for data manipulation according to business requirements.
- Hands-on experience developing optimal strategies for distributing web log data over the cluster, and importing and exporting stored web log data into HDFS and Hive using Sqoop.
- Developed several REST web services which produces both XML and JSON to perform tasks, leveraged by both web and mobile applications.
- Developed unit test cases for Hadoop MapReduce jobs and driver classes with the MRUnit testing library.
- Continuously monitored and managed the Hadoop cluster using Cloudera Manager and the web UI.
- Designed the logical and physical data model, generated DDL scripts, and wrote DML scripts for Oracle 10g database.
- Managed and scheduled several jobs to run over time on the Hadoop cluster using Oozie.
- Integrated Oozie with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Map-Reduce, Pig, Hive, and Sqoop) as well as system specific jobs (such as Java programs and shell scripts).
- Used Maven for building JAR files of MapReduce programs and deployed them to the cluster.
- Worked on various compression techniques like GZIP and LZO.
- Monitored workload, job performance and capacity planning using Cloudera Manager.
- Performed Cluster tasks like adding, removing of nodes without any effect on running jobs.
- Installed Qlik Sense Desktop 2.x, developed applications for users and made reports using QlikView.
- Configured different Qlik Sense roles and attribute-based access control.
- Maintained System integrity of all sub-components (primarily HDFS, MR, HBase, and Hive).
- Helped in design of Scalable Big Data Clusters and solutions and involved in defect meetings.
- Followed Agile methodology for the entire project and supported testing teams.
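
A tiny Hive UDF sketch related to the UDF work noted above. It is written in Scala for consistency with the other examples in this resume (the project's UDFs were in Java), and the phone-number normalization rule is a made-up stand-in for the real business logic.

```scala
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Classic Hive UDF API: Hive calls evaluate() once per row.
class NormalizePhone extends UDF {
  def evaluate(input: Text): Text = {
    if (input == null) {
      null
    } else {
      val digits = input.toString.replaceAll("[^0-9]", "")           // strip formatting
      new Text(if (digits.length == 10) "1" + digits else digits)    // assumed rule
    }
  }
}
```

Once packaged into a JAR, such a function would be registered in Hive with `ADD JAR` and `CREATE TEMPORARY FUNCTION ... AS 'NormalizePhone'` before use in queries.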
Environment: Apache Hadoop, MapReduce, HDFS, HBase, CentOS 6.4, Unix, REST web services, Ant 1.6, Elasticsearch, Hive, Pig, Oozie, Java (JDK 1.5), JSON, Eclipse, QlikView, Qlik Sense, Oracle Database, Jenkins, Maven, Sqoop.
Speck Systems
Java Developer
Responsibilities:
- Developed rules based on different state policies using Spring MVC, iBatis ORM, Spring Web Flow, JSP, JSTL, Oracle, MSSQL, SOA, XML, XSD, JSON, AJAX and Log4j.
- Involved in various phases of Software Development Life Cycle (SDLC) such as requirements gathering, modeling, analysis, design, development and testing.
- Generated the use case diagrams, Activity diagrams, Class diagrams and Sequence Diagrams in the design phase using Star UML tool.
- Followed an Agile methodology in the project.
- Used Maven as the build tool for building and deploying the application.
- Developed the user interface using the Spring Framework, jQuery and AJAX.
- Used Spring Framework AOP features and JDBC module features to persist data to the database for a few applications, and used the Spring IoC feature to obtain the Hibernate session factory and resolve other bean dependencies.
- Involved in SSH key hashing and SFTP transfer of files.
- Extensively worked with Apache libraries for developing custom web services.
- Developed the persistence layer using the Hibernate Framework by configuring mappings in Hibernate mapping files, and created DAOs and POs.
- Developed various Java beans to carry out business processes, was effectively involved in impact analysis, and developed test cases using JUnit and test-driven development.
- Developed application service components and configured beans using Spring IOC, creation of Hibernate mapping files and generation of database schema.
- Created RESTful web services interface to Java-based runtime engine and accounts.
- Conducted thorough code walkthroughs with team members to check functional coverage and coding standards.
- Actively involved in writing SQL using a SQL query builder.
- Actively used the defect-tracking tool JIRA to create and track defects during the QA phase of the project.
- Used TortoiseSVN to maintain file versions, took responsibility for merging code from branch to trunk, and created new branches when new feature implementation started.
- Used DAO pattern to retrieve the data from database.
- Worked with the WebSphere application server, which handled various requests from clients.
Confidential
Java Developer
Responsibilities:
- Participated in all the phases of the Software development life cycle (SDLC) which includes Development, Testing, Implementation and Maintenance.
- Involved in collecting client requirements and preparing the design documents.
- Implemented Spring MVC architecture and the Spring BeanFactory using IoC and AOP concepts.
- Developed Java classes to execute the business logic and collect input data from users using Java and Oracle.
- Involved in creation of scripts to create, update and delete data from the tables.
- Followed Agile methodology to analyze, define and document the application supporting functional and business requirements.
- Wrote JSP using HTML tags for designing UI for different pages.
- Extensively used OOD concepts in overall design and development of the system.
- Developed user interface using Spring JSP to simplify the complexities of the application.
- Responsible for Development, unit testing and implementation of the application.
- Used Agile methodology to design, develop and deploy the changes.
- Extensively used tools such as AccVerify, Checkstyle and Clockworks to check the code.
Environment: Java/J2EE, JSP, XML, Spring Framework, Hibernate, Eclipse (IDE), Microservices, JavaScript, Struts, Tiles, Ant, SQL, PL/SQL, Oracle, Windows, UNIX, SOAP, Jasper Reports.