
Big Data Developer/Cloudera/Amazon EMR/Spark/Scala Resume


Wilmington, Delaware

SUMMARY

  • Over 9.5 years of progressive experience as a Big Data and BI Developer. Extensive experience in Big Data, IBM Netezza, Hadoop, Hive, HBase, Impala, R, Scala, Spark, Spark SQL, SSIS, SSAS, SSRS, BI, ScalaTest, JUnit, FunSuite, Mahout, MDX, Scalding MapReduce, Oozie, and core Java.
  • Certified Apache Spark developer from O'Reilly (License: 1.x-0344).
  • Created Spark Scala applications with a test-driven approach using ScalaTest, JUnit, FunSuite, and FunSpec.
  • Built an enterprise Spark ingestion framework for different sources (S3, Salesforce, Excel, SFTP, FTP, and JDBC databases) that is fully metadata driven with complete code reuse, letting junior developers concentrate on core business logic rather than Spark/Scala coding.
  • Built Spark + R clusters for data scientists' ML workloads in under 20 minutes on AWS using the AMPLab spark-ec2 cluster setup tool (https://github.com/amplab/spark-ec2).
  • Built a Spark ETL tool for data scientists to migrate data from the enterprise cluster to a Spark + R + R Shiny cluster for building large ML models and data analysis jobs using several algorithms in R and Spark.
  • Used Spark transformations (map, join, reduceByKey, flatMap, distinct, filter) and actions (count, collect).
  • Used narrow- and wide-dependency transformations in Spark.
  • Reduced data shuffle across nodes by hash partitioning and persisting RDDs before wide-dependency transformations.
  • Parsed JSON data using HiveContext in Spark SQL and ran SQL queries against it.
  • Parsed XML using the Scala XML package.
  • Used the Databricks spark-csv package to create DataFrames with Spark SQL.
  • Automated spark-submit jobs using UNIX Bash scripts and the Autosys scheduler.
  • Submitted Spark jobs with spark-submit in local mode for development and testing, and in YARN mode (client and cluster) in production.
  • Built Spark fat (uber) JARs and slim JARs using the Maven build tool.
  • Built Spark publisher batch jobs to onboard data from source systems to Kafka topics.
  • Used broadcast variables for better performance when joining two RDDs (see the broadcast join sketch at the end of this summary).
  • Created Spark batch subscriber jobs to pull data from Kafka topics.
  • Optimized JVM garbage collection by tuning spark.storage.memoryFraction.
  • Enabled spark.speculation to mitigate slow-running tasks on straggler nodes.
  • Analysed Spark execution plans and RDD lineage using RDD.toDebugString.
  • Used Spark Streaming checkpointing (write-ahead logs) for streaming failover recovery.
  • Created Kafka direct-connection jobs that track Kafka offsets for failover recovery and checkpointing.
  • Used updateStateByKey with a checkpoint directory in Spark Streaming jobs to aggregate streaming data by key (see the stateful streaming sketch at the end of this summary).
  • Used Spark Streaming window and sliding-duration functions.
  • Integrated Spark Streaming, Kafka, and Cassandra for a POC.
  • Created Scalding MapReduce jobs for very large datasets.
  • Created a Lambda architecture with Kafka and Spark as the ETL framework and SQL Server as the target database.
  • Used Sqoop commands to load Hive, HBase, and Impala Parquet tables from IBM Netezza, SQL Server, and Oracle, automated with Unix Bash and Autosys.
  • Migrated 5 TB+ static data (legacy) from IBM Netezza to Impala Parquet tables.
  • Migrated over 50 SSIS ETL packages to Java and Bash scripts.
  • Partitioned data in Impala Parquet tables for better query performance.
  • Ran COMPUTE STATS on all Impala tables for better memory management.
  • Built recommendation models on very large datasets using Mahout's User-Based, Item-Based, and SlopeOne recommenders.
  • Built Stochastic Gradient Descent (SGD), Naïve Bayes, and Random Forest classifiers using Mahout, and calculated area under the curve (AUC) to evaluate the classification models.
  • Tested data mining models covering clustering, association rules, classification, time series, text mining, and text classification using Mahout and R.
  • Applied time-series techniques such as the freehand, least-squares, and moving-average methods in R to predict future data points.
  • Built batch-indexing (MapReduce) Solr jobs to index data from Netezza and Hadoop (HDFS), and built facet filters and response writers in the Solr Admin console and Cloudera Search.
  • Good exposure to the Kimball methodology and the Kimball approach to slowly changing dimensions.
  • Experience in building OLAP cubes, data warehouses, and ETL, applying filters, and using data mining techniques and algorithms (decision trees, clustering, time series, and Naïve Bayes).
  • Built packages in BIDS (Business Intelligence Development Studio) to import and export data, handle updates with incremental loads, and manage dimensional changes using Slowly Changing Dimensions (Ralph Kimball methodology) and Merge Join.
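
The broadcast join mentioned above, as a minimal sketch: a small lookup map is shipped to every executor once and the join happens map-side with no shuffle. The SparkContext setup and the sample (customerId, amount) data are illustrative assumptions, not code from any specific engagement.

    // Minimal broadcast-join sketch (illustrative data and names).
    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastJoinSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("broadcast-join-sketch"))

        // Large RDD of (customerId, amount) pairs and a small (customerId -> region) lookup.
        val transactions = sc.parallelize(Seq((1, 250.0), (2, 75.5), (1, 10.0)))
        val regions      = Map(1 -> "EMEA", 2 -> "AMER")

        // Broadcast the small map once per executor, then enrich map-side without a shuffle.
        val regionsBc = sc.broadcast(regions)
        val enriched = transactions.map { case (custId, amount) =>
          (custId, amount, regionsBc.value.getOrElse(custId, "UNKNOWN"))
        }

        enriched.collect().foreach(println)
        sc.stop()
      }
    }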
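
The updateStateByKey pattern referenced above, as a minimal sketch that keeps a running count per key across micro-batches. It assumes a socket text source; the host, port, and checkpoint path are placeholders.

    // Minimal stateful-streaming sketch (placeholder source and checkpoint path).
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StatefulStreamingSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("stateful-streaming-sketch"), Seconds(10))
        ssc.checkpoint("hdfs:///tmp/streaming-checkpoint") // required by updateStateByKey

        // Carry a running count per word across micro-batches.
        val updateCounts = (newValues: Seq[Int], state: Option[Int]) =>
          Some(newValues.sum + state.getOrElse(0))

        val counts = ssc.socketTextStream("localhost", 9999)
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .updateStateByKey(updateCounts)

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }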

TECHNICAL SKILLS

Languages: SQL, PL/SQL, MDX, DMX, C, C#, NZSQL, UNIX shell, Bash, R, Scala, Spark, Spark SQL, Mahout, Rattle, and core Java.

Big Data: Hadoop, HDFS, Hive, Impala, Spark, Scalding, Scala, Mahout and HBase.

Analytics: Mahout, R and Rattle.

Testing: ScalaTest, JUnit with FunSuite.

Analytics Algorithms: Clustering (K-Means, Hierarchical), Classification (Stochastic Gradient Descent (SGD), Naïve Bayes and Random Forest), Time Series, Recommenders/Association Rules (User-Based Recommender, Item-Based Recommender and Mahout SlopeOne Recommender) and Text Mining.

Databases: Microsoft SQL Server, Oracle and Sybase.

MPP Databases: IBM Netezza, SQL Server PDW, Cassandra and Hadoop (Hive, IMPALA).

Scripting Languages: UNIX shell and Bash.

PROFESSIONAL EXPERIENCE

Confidential, Wilmington, Delaware

Big Data Developer/Cloudera/Amazon EMR/Spark/Scala

Responsibilities:

  • Responsible for building data pipelines to copy data from AWS S3 (Amazon Web Services Simple Storage Service) to Cloudera Hadoop HDFS with Apache Spark 1.6 and 2.2.1, using the Databricks spark-csv parser and the AWS SDK (see the ingestion sketch after this list).
  • Built an enterprise Spark ingestion framework for different sources (S3, Salesforce, Excel, SFTP, FTP, and JDBC databases) that is fully metadata driven with complete code reuse, letting junior developers concentrate on core business logic rather than Spark/Scala coding.
  • Built Spark + R clusters for data scientists' ML workloads in under 20 minutes on AWS using the AMPLab spark-ec2 cluster setup tool (https://github.com/amplab/spark-ec2).
  • Built a Spark ETL tool for data scientists to migrate data from the enterprise cluster to a Spark + R + R Shiny cluster for building large ML models and data analysis jobs using several algorithms in R and Spark.
  • Built a POC clustering model to group patients with a stage 1 bladder cancer diagnosis and stage 2 symptoms who could be candidates for the Imfinzi stage 2 bladder cancer drug; this information is sent to AstraZeneca sales reps along with HCP (doctor) details so they can follow up and suggest Imfinzi for those patients.
  • Used scopt to pass getopt-style command-line arguments to Spark applications (see the scopt sketch after this list).
  • Used Zeppelin and Jupyter in Hue to write and test POC Spark code before packaging JARs for production applications.
  • Used Livy to connect to Spark from RStudio within the same VPN, using the SparkR, sparklyr, dplyr, and DBI packages.
  • Migrated Spark 1.6 applications to Spark 2.2.1.
  • Automated Spark code builds and test cases using ScalaTest FunSuite/FunSpec for testing, Maven for builds, and Jenkins and Hudson for continuous integration.
  • Built an Amazon EMR Hadoop cluster, migrated data from the Cloudera cluster to S3 using Cloudera BDR (Backup and Disaster Recovery), and created external Hive tables in EMR pointing to the S3 locations.
  • Ingested and extracted data from Salesforce (Veeva application) using the springml spark-salesforce library (https://github.com/springml/spark-salesforce).
  • Responsible for building ETL transformations to move data from HDFS to the data warehouse and then to the analytical layer using Impala, automated with metadata-driven Bash scripts.
  • Automated ETL jobs using the Oozie shell action, created workflows and coordinators, and migrated existing Autosys jobs to Oozie.
  • Built a reusable email engine (Java, Hive JDBC, Impala JDBC, and Apache POI) to send Excel reports, HTML table reports, recon alerts, etc., across several projects.
  • Fine-tuned Impala queries for larger table joins using the [SHUFFLE] and [BROADCAST] join hints.
  • Built incremental ingestion from S3 using the S3 LastModified attribute, the AWS CLI --query option, and Spark.
  • Built an outbound data pipeline using Spark to read data from Hive tables and write CSV output to S3 for downstream systems.
  • Built a POC using the Kite CLI to convert local CSV files to Parquet.
  • Wrote a POC Java MapReduce application to read data from S3, convert it to Parquet, and land it on HDFS; this was later migrated to Spark.
  • Helped data scientists implement R + Spark using SparkR and sparklyr.
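
A minimal sketch of the S3-to-HDFS ingestion pattern described above, assuming Spark 1.6 with the Databricks spark-csv package on the classpath and S3 credentials already configured; the bucket, file, and output paths are placeholders.

    // Minimal S3 CSV ingestion sketch (placeholder paths; spark-csv assumed on the classpath).
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object S3CsvIngestSketch {
      def main(args: Array[String]): Unit = {
        val sc  = new SparkContext(new SparkConf().setAppName("s3-csv-ingest-sketch"))
        val sql = new SQLContext(sc)

        // Read a CSV file from S3 into a DataFrame, with a header row and inferred schema.
        val df = sql.read
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("s3n://example-bucket/incoming/sample.csv")

        // Land it on HDFS as Parquet for downstream Hive/Impala tables.
        df.write.mode("overwrite").parquet("hdfs:///data/raw/sample")

        sc.stop()
      }
    }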
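
A minimal sketch of the scopt argument handling mentioned above, assuming scopt 3.x; the JobConfig case class and option names are hypothetical.

    // Minimal scopt sketch (hypothetical options; scopt 3.x assumed).
    import scopt.OptionParser

    case class JobConfig(inputPath: String = "", outputPath: String = "", runDate: String = "")

    object ScoptArgsSketch {
      def main(args: Array[String]): Unit = {
        val parser = new OptionParser[JobConfig]("ingest-job") {
          opt[String]('i', "input").required().action((x, c) => c.copy(inputPath = x))
            .text("source path, e.g. an s3:// or hdfs:// URI")
          opt[String]('o', "output").required().action((x, c) => c.copy(outputPath = x))
            .text("target path for the Parquet output")
          opt[String]('d', "run-date").action((x, c) => c.copy(runDate = x))
            .text("optional business date, yyyy-MM-dd")
        }

        parser.parse(args, JobConfig()) match {
          case Some(config) => println(s"Running with $config") // hand off to the Spark job here
          case None         => sys.exit(1)                      // scopt has already printed usage/errors
        }
      }
    }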

Confidential, San Jose, CA

Contract Apache Spark Developer

Responsibilities:

  • Responsible for migrating Teradata analytical SQL reporting queries to Spark SQL and Hive.
  • Migrated the Spark cluster from 1.3 to 1.6.1 locally.
  • Downloaded and installed Spark 1.6 locally, then submitted and tested jobs in YARN client and cluster modes.
  • Configured Spark settings (dynamic memory allocation, shuffle partitions, etc.) in the spark-defaults.conf file.
  • Automated Spark SQL jobs using Confidential's in-house scheduling framework and UNIX crontab.
  • Migrated Teradata OLAP and window SQL functions to Spark SQL and HiveQL (see the window-function sketch after this list).
  • Submitted Spark jobs using spark-submit.
  • Migrated Teradata DW data to HDFS using Sqoop with Teradata connectors.
  • Migrated over 400 TB of data from various source systems to HDFS using Sqoop.
  • Created external tables, ORC tables, and Parquet tables in the Hive DW.
  • Created a Spark SQL automation tool for reusability and code reuse to run Spark SQL across different projects.
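
A minimal sketch of how a Teradata-style ROW_NUMBER() OVER (...) reporting query maps onto Spark SQL, assuming a HiveContext on Spark 1.6 (window functions require it there); the orders table and column names are placeholders.

    // Minimal window-function sketch (placeholder table and columns; HiveContext assumed).
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object WindowFunctionSketch {
      def main(args: Array[String]): Unit = {
        val sc  = new SparkContext(new SparkConf().setAppName("window-fn-sketch"))
        val sql = new HiveContext(sc)

        // Latest order per customer: the Spark SQL equivalent of a Teradata
        // ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) reporting query.
        val latestOrders = sql.sql(
          """SELECT customer_id, order_id, order_ts
            |FROM (
            |  SELECT customer_id, order_id, order_ts,
            |         ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_ts DESC) AS rn
            |  FROM orders
            |) t
            |WHERE rn = 1""".stripMargin)

        latestOrders.show()
        sc.stop()
      }
    }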

Confidential

Application Developer

Responsibilities:

  • Responsible for onboarding new clients into Ocean (IBM Netezza) and Impala.
  • Integrated data from different source systems (Foreign Exchange, Cash, Fixed Income, Derivatives, Rates, and Equities) into IBM Netezza data warehouse tables and into Hadoop Impala (Parquet), HBase, and Solr.
  • Built ETL and control-flow components using Java, UNIX Bash, and metadata.
  • Created Spark Scala applications with a test-driven approach using ScalaTest, JUnit, FunSuite, and FunSpec.
  • Created a Spark Scala code-based ETL tool to extract data from different databases and write to target tables (http://tinyurl.com/odkbpua).
  • Parsed XML using the Scala XML package (see the XML parsing sketch after this list).
  • Migrated Java FTL transformations to Spark and Spark SQL for performance gains (parallel processing).
  • Migrated Perl scripts to Spark and Spark SQL for performance gains (parallel processing).
  • Submitted Spark jobs with spark-submit in local mode for development and testing, and in YARN mode (client and cluster) in production.
  • Optimized JVM garbage collection by tuning spark.storage.memoryFraction.
  • Enabled spark.speculation to mitigate slow-running tasks on straggler nodes.
  • Analysed Spark execution plans and RDD lineage using RDD.toDebugString.
  • Wrote Spark test cases using ScalaTest and JUnit with FunSuite.
  • Migrated 5 TB+ static data (legacy) from IBM Netezza to Impala Parquet tables.
  • Loaded data into Hadoop HDFS using Sqoop commands, built external tables on top of it, and loaded the data into Impala Parquet tables.
  • Built batch-indexing (MapReduce) Solr jobs to index data from Netezza and Hadoop (HDFS), and built facet filters and response writers in the Solr Admin console and Cloudera Search.
  • Partitioned data in Impala Parquet tables for better query performance.
  • Tuned several Impala queries to achieve better performance.
  • Ran COMPUTE STATS on all Impala tables for better memory management.
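
A minimal sketch of parsing XML with the Scala XML package as referenced above; the trade-style payload and element names are illustrative, not the actual feed schema.

    // Minimal scala.xml parsing sketch (illustrative payload and element names).
    import scala.xml.XML

    object XmlParseSketch {
      def main(args: Array[String]): Unit = {
        // Illustrative payload; real files can be read with XML.loadFile("path/to/file.xml").
        val payload = XML.loadString(
          """<trades>
            |  <trade id="T1"><ccy>USD</ccy><notional>1000000</notional></trade>
            |  <trade id="T2"><ccy>EUR</ccy><notional>250000</notional></trade>
            |</trades>""".stripMargin)

        // One tuple per <trade> element: (id attribute, currency, notional).
        val rows = (payload \ "trade").map { t =>
          ((t \ "@id").text, (t \ "ccy").text, (t \ "notional").text.toDouble)
        }

        rows.foreach(println)
      }
    }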

Confidential

Business Intelligence Developer

Responsibilities:

  • Business Intelligence Developer responsible for building the data warehouse for A&E (Accident and Emergency), inpatient, and outpatient data from scratch.
  • Designed and built the Accident and Emergency (A&E) data warehouse from scratch, using Visio 2010 for data warehouse modelling and SSAS.
  • Built SSIS packages for ETL (Extract, Transform, Load) to move data from the source system database to staging, then to the warehouse, and finally to the SSAS OLAP cube, using Derived Column expressions, data type conversions, lookups against dimension tables, and SCD (Kimball) Type 2 handling.
  • Built SSAS cubes with calculations, named sets, calculated members, and KPIs using MDX.
  • Built hierarchies such as Y-M-D, Y-S-Q-M-D, and Y-M-W-D in the time dimension, used for drill-down analysis of the data.
  • Built reports using SSRS and Report Builder and shared them via SharePoint with different departments of the trust.
  • Tuned SQL queries using indexing, SQL Server Profiler, and execution plans.
  • Created SQL Server jobs and scheduled them to run different ETL tasks.
  • Checked the performance of data warehouse queries using DMVs, and rebuilt and reorganised indexes.

Confidential

BI Developer

Responsibilities:

  • Database and BI support role using SQL Server 2008 and BIDS (Business Intelligence Development Studio).
  • Building MI (Management Information) reports using Excel and fetching data using Macros.
  • Building Dashboards reports using NUQLEUS 3D Software suite and posting them to Dashboards.
  • Database support for FilenetMI, Nuqleus3D, InternetUsage and PCMON.

Confidential

BI Developer

Responsibilities:

  • Developed dashboards for executives to capture key performance metrics regarding compliance and risk management.
  • Created drill down reports using SSRS and deployed to SharePoint server.
  • Created MDX reports and KPIs using SSAS and SSRS to predict risk and compliance across the business.
  • Created dimensions, degenerate dimensions, time dimensions, and fact tables in SSAS.
  • Partitioned SSAS cubes by time period and improved SSAS performance using the Usage-Based Optimization wizard.
  • Tuned several slow-running queries using DMVs, SQL Profiler, and Management Studio execution plans.
  • Built clustered and non-clustered indexes on SQL Server tables when required.
