Sr Application Developer (Spark) Resume
Seattle, WA
SUMMARY
- 8+ years of total IT experience, including Java application development, database management, and Big Data technologies using the Hadoop ecosystem.
- 4 years of experience in Big Data analytics using various Hadoop ecosystem tools and the Spark framework.
- Solid understanding of distributed systems architecture and the MapReduce and Spark execution frameworks for large-scale parallel processing.
- Worked extensively on Hadoop ecosystem components: MapReduce, Pig, Hive, HBase, Flume, Sqoop, Hue, Oozie, Spark, and Kafka.
- Experience working with all major Hadoop distributions, including Cloudera (CDH), Hortonworks (HDP), and AWS EMR.
- Developed highly scalable Spark applications using Spark Core, DataFrames, Spark SQL, and Spark Streaming APIs in Scala.
- Gained good experience troubleshooting and fine-tuning Spark applications.
- Experience working with DStreams in Spark Streaming, accumulators, broadcast variables, various levels of caching, and optimization techniques in Spark.
- Worked on real-time data integration using Kafka, Spark Streaming, and HBase.
- In-depth understanding of NoSQL databases such as HBase and their integration with Hadoop clusters.
- Strong working experience in extracting, wrangling, ingesting, processing, storing, querying, and analyzing structured, semi-structured, and unstructured data.
- Solid understanding of Hadoop MRv1 and MRv2 (YARN) architecture.
- Developed, deployed, and supported several MapReduce applications in Java to handle semi-structured and unstructured data.
- Sound knowledge of map-side joins, reduce-side joins, shuffle and sort, the distributed cache, compression techniques, and multiple Hadoop input and output formats.
- Solid experience working with CSV, text, SequenceFile, Avro, Parquet, ORC, and JSON data formats.
- Expertise in working with the Hive data warehouse tool: creating tables, distributing data through static and dynamic partitioning and bucketing, and optimizing HiveQL queries.
- Involved in ingesting structured data from SQL Server, MySQL, and Teradata into HDFS and Hive using Sqoop. Experience in writing ad hoc queries in Hive and analyzing data using HiveQL.
- Extensive experience in performing ETL on structured and semi-structured data using Pig Latin scripts.
- Expertise in moving structured schema data between Pig and Hive using HCatalog.
- Proficient in creating Hive DDLs and Hive UDFs. Designed and implemented Hive and Pig UDFs using Python and Java for evaluating, filtering, loading, and storing data.
- Experience in migrating data using Sqoop from HDFS and Hive to relational database systems and vice versa according to client requirements.
- Experienced in working with Amazon Web Services (AWS), using EC2 for computing and S3 as the storage mechanism. Familiar with Kerberos.
- Experienced in job workflow scheduling and monitoring tools like Oozie.
- Proficient knowledge of and hands-on experience in writing shell scripts in Linux.
- Developed core modules in large cross-platform applications using Java, JSP, Servlets, Hibernate, RESTful services, JDBC, JavaScript, XML, and HTML.
- Extensive experience in developing and deploying applications using WebLogic, Apache Tomcat, and JBoss. Worked on Podium and Talend.
- Development experience with RDBMS, including writing SQL queries, views, stored procedures, and triggers, and working with data lakes.
- Strong understanding of Software Development Lifecycle (SDLC) and various methodologies (Waterfall, Agile).
TECHNICAL SKILLS
Big Data Technologies: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Hue, Ambari, ZooKeeper, Kafka, Apache Spark, Spark Streaming, Impala, HBase
Hadoop Distributions: Cloudera, Hortonworks, Apache, AWS EMR, Databricks
Languages: C, Java, PL/SQL, Python, Pig Latin, HiveQL, Scala, Regular Expressions
IDE & Build Tools, Design: Eclipse, NetBeans, IntelliJ, JIRA, Microsoft Visio, PyCharm
Web Technologies: HTML, CSS, JavaScript, XML, JSP, RESTful, SOAP
Operating Systems: Windows (XP, 7, 8, 10), UNIX, Linux, Ubuntu, CentOS
Reporting Tools: Tableau, Power View for Microsoft Excel, Talend, MicroStrategy
Databases: Oracle, SQL Server, MySQL, MS Access, NoSQL Database (HBase, Cassandra, MongoDB), Teradata, IBM DB2
Build Automation tools: SBT, Ant, Maven
PROFESSIONAL EXPERIENCE
Confidential - Seattle, WA
Sr Application Developer (Spark)
Responsibilities:
- Developed and reviewed Spark code, including Airflow DAGs, Databricks notebooks, Delta table DDLs, metadata SQL, and other SQL scripts.
- Deployed code to the Dev, QA, PreProd, and Prod environments by adhering to the Git process flow and the standards set by the release management process.
- Created technical design documentation and Support/OPS turnover documentation by following the OPS checklist.
- Raised change requests once the code was in PreProd.
- Handled Airflow orchestration, especially configuring the DAG start date, scheduled time, and other parameters.
- Mainly developed PySpark code in Databricks using existing load patterns (Full, Incremental, and Backfill) for forecasting (Region and Country) rawCustomerSales and pubCustomerSales.
- Wrote Spark DataFrames that mainly use CSV, Parquet, and Delta file formats. Used Spark SQL, joins, views, and partitioning extensively.
- Validated the source data and generated the output data in the required format using PySpark transformations.
- Submitted jobs to clusters administered by other Linux teams.
Environment: Databricks, Azure Data Lake Storage (Gen1), Oracle EDW, PySpark (mainly), Spark SQL, Scala Spark (occasionally), Jenkins, PyCharm, Git, Spark BDA server, PuTTY for tunneling into Airflow environments.
Confidential - Irving, Texas
Sr Application Developer
Responsibilities:
- Responsible for mapping data before ingestion according to the business problem.
- Responsible for ingesting large volumes of data into the Spark cluster from IBM DB2 databases using queries. Also used HDFS and S3 along with IBM DB2.
- Developed Spark scripts with PySpark and Java, using PyCharm and Spring Boot, that perform the internalization process.
- Mainly developed PySpark code using existing resources, such as QA code written in Python and the Hanweck BRD, to eliminate previous design flaws and improve performance.
- Wrote Spark DataFrames, Datasets, and RDDs that mainly use PSV, Avro, and Parquet file formats. Used Spark SQL extensively.
- Tuned the performance of Spark applications using Spark performance tuning techniques.
- Built a proof of concept (POC) using Kafka and Spark Streaming to fetch data from the ONCORE application into our analytics application.
Environment: HDFS, S3, IBM DB2, PySpark (mainly), Java Spark (occasionally), Docker, Maven, Git, Kubernetes, Unix.
Hadoop/Kafka Developer
Confidential
Responsibilities:
- Responsible for ingesting large volumes of IoT data into Kafka.
- Developed microservices with Java and Spring Boot.
- Identified that the existing Jenkins pipelines used Scripted syntax and suggested switching to Declarative syntax to reduce deployment time.
- Wrote Kafka producers to stream data from external REST APIs to Kafka topics.
- Worked with security groups in the AWS cloud and with S3.
- Gained good experience with continuous integration of applications using Jenkins.
- Used Chef and Terraform as infrastructure as code (IaC) for defining Jenkins plugins.
- Responsible for maintaining the inbound rules of security groups and preventing duplication of EC2 instances.
- Used Git and Docker for builds.
Environment: Shell Scripting, Git, AWS EMR, Kafka, AWS S3, AWS EC2, Java, Spring Boot, Eclipse IDE, Maven, Chef, Jenkins, Terraform, Docker, Infrastructure as a Service (IaaS), Cloudera (CDH).
Confidential - Chicago, IL
Hadoop/Spark Developer
Responsibilities:
- Responsible for ingesting large volumes of user behavioral data and customer profile data to Analytics Data store.
- Developed custom multi-threaded Java-based ingestion jobs as well as Sqoop jobs for ingesting from FTP servers and data warehouses.
- Developed many Spark applications for performing data cleansing, event enrichment, data aggregation, de-normalization and data preparation needed for machine learning exercise.
- Worked on troubleshooting Spark applications to make them more error tolerant.
- Worked on fine-tuning Spark applications to improve the overall processing time for the pipelines.
- Wrote Kafka producers to stream data from external REST APIs to Kafka topics.
- Wrote Spark Streaming applications to consume data from Kafka topics and write the processed streams to HBase.
- Experienced in handling large datasets using Spark's in-memory capabilities, broadcast variables, effective and efficient joins, transformations, and other capabilities.
- Worked extensively with Sqoop for importing data from Oracle.
- Worked with EMR clusters in the AWS cloud and with S3.
- Involved in creating Hive tables and in loading and analyzing data using Hive scripts.
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Gained good experience with continuous integration of applications using Jenkins.
- Used Reporting tools like Tableau to connect with Impala for generating daily reports of data.
- Collaborated with the infrastructure, network, database, application and BA teams to ensure data quality and availability.
Environment: Spark, Hive, Sqoop, Shell Scripting, AWS EMR, Kafka, AWS S3, MapReduce, Scala, Eclipse, Maven, Cloudera (CDH)
Confidential - Seattle, WA
Hadoop Developer
Responsibilities:
- Worked closely with business analysts to gather requirements and design reliable and scalable data pipelines using AWS EMR.
- Developed Spark applications in Scala using DataFrames and the Spark SQL API for faster processing of data.
- Developed highly optimized Spark applications to perform various data cleansing, validation, transformation, and summarization activities according to the requirements.
- Built a data pipeline consisting of Spark, Hive, Sqoop, and custom-built input adapters to ingest, transform, and analyze operational data.
- Developed Spark jobs and Hive jobs to summarize and transform data.
- Used Spark for interactive queries, processing of streaming data, and integration with the NoSQL database DynamoDB.
- Involved in converting Hive queries into Spark transformations using Spark DataFrames in Scala.
- Built real-time data pipelines by developing Kafka producers and Spark Streaming applications to consume the data.
- Handled importing data from relational databases into S3 using Sqoop and performing transformations using Hive and Spark.
- Exported the processed data to Redshift using the Redshift load utilities, for further visualization and report generation by the BI team.
- Used Hive to analyze the partitioned and bucketed data and computed various metrics for reporting.
- Developed Hive scripts in HiveQL to de-normalize and aggregate the data.
- Scheduled and executed workflows in Oozie to run various jobs.
Environment: AWS EMR, S3, Spark, Hive, Sqoop, Eclipse, Java, SQL, Linux (CentOS), DynamoDB, Maven.
Confidential - Denver, CO
Hadoop Developer
Responsibilities:
- Worked with the business team to gather the requirements and participated in the Agile planning meetings to finalize the scope of each development.
- Responsible for building scalable distributed data solutions on the Cloudera Hadoop distribution.
- Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Implemented data pipelines by chaining multiple mappers using the ChainMapper API.
- Developed multiple MapReduce batch jobs in Java for loading data into HDFS in SequenceFile format.
- Ingested structured data from a wide array of RDBMSs into HDFS as incremental imports using Sqoop.
- Involved in writing Pig scripts to wrangle the raw data, store it in HDFS, and load it into Hive tables using HCatalog.
- Configured Flume agents on different data sources to capture the streaming log data from the web servers.
- Implemented Flume (multiplexing) to stream data from upstream pipes into HDFS.
- Created Hive external tables with clustering and partitioning on the date for optimizing the performance of ad-hoc queries.
- Involved in writing HiveQL scripts on Beeline, Impala, and the Hive CLI for consumer data analysis to meet business requirements.
- Exported data from HDFS to the DWH using Sqoop export in allowinsert mode through a staging table.
- Worked with different file formats and compression techniques to ensure optimal performance of Hive queries.
- Involved in creating Hive tables from a wide range of data formats, such as CSV, text, SequenceFile, Avro, Parquet, ORC, JSON, and custom formats using SerDe.
- Transformed the semi-structured log data to fit into the schema of the Hive tables using Pig.
- Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs.
- Involved in testing and in designing low-level and high-level documentation for the business requirements.
Environment: Cloudera Hadoop, Eclipse, Java, Sqoop, Pig, Oozie, Hive, Flume, CentOS, MySQL, Oracle DB.
Confidential - Denver, CO
Hadoop Developer
Responsibilities:
- Responsible for developing efficient MapReduce programs for more than 20 years’ worth of claim data to detect and separate fraudulent claims.
- Developed MapReduce programs of medium to high complexity from scratch.
- Uploaded and processed more than 30 terabytes of data from various structured and unstructured sources into HDFS using Sqoop and Flume.
- Played a key role in setting up a 100-node Hadoop cluster utilizing MapReduce by working closely with the Hadoop Administration team.
- Worked with the advanced analytics team to design fraud detection algorithms and then developed MapReduce programs to run the algorithms efficiently on the huge datasets.
- Developed Java programs to perform data scrubbing for unstructured data.
- Responsible for designing and managing the Sqoop jobs that uploaded the data from Oracle to HDFS and Hive.
- Created Hive tables to import large data sets from various relational databases using Sqoop and exported the analyzed data back for visualization and report generation by the BI team.
- Used Flume to collect log data with error messages across the cluster.
- Designed and Maintained Oozie workflows to manage the flow of jobs in the cluster.
- Played a key role in the installation and configuration of various Hadoop ecosystem tools such as Hive, Pig, and HBase.
- Successfully loaded files from Teradata into HDFS, and from HDFS into Hive.
- Used ZooKeeper and Oozie for coordinating the cluster and scheduling workflows.
- Developed Oozie workflows and scheduled them to run data/time-dependent Hive and Pig jobs.
- Designed and developed Dashboards for Analytical purposes using Tableau.
- Analyzed the Hadoop log files using Pig scripts to monitor errors.
- Provided upper management with daily updates on project progress, including the classification levels in the data.
Confidential - Mechanicsburg, PA
Java Developer
Responsibilities:
- Developed web applications by coordinating requirements, user stories, use cases, screen mockups, schedules, and activities.
- Worked closely with client business stakeholders on agile development teams.
- Supported users by developing documentation and assistance tools.
- Developed the presentation layer using the Spring Framework, including modules such as Spring MVC and JDBC.
- Implemented RESTful web services using Jersey to integrate different application components.
- Developed RESTful Web services for transmission of data in JSON/XML format.
- Involved in writing SQL queries, functions, views, triggers, and stored procedures in an Oracle relational database.
- Used Sqoop to ingest structured data from Oracle database to HDFS.
- Involved in writing and running MapReduce batch jobs in Java for data wrangling on the cluster.
- Developed map-side and reduce-side joins using DistributedCache on various data sets.
- Developed Pig Latin scripts to transform the data according to the business requirements.
- Developed Pig UDFs in Java, extending eval and filter functions, to filter semi-structured data.
Environment: Java, J2EE, Eclipse, JSP, Servlets, Spring, JavaScript, HTML, RESTful, Shell Scripting, XML, Oracle 10g, Cloudera Hadoop, MapReduce, Pig, HDFS.