
Hadoop Spark Scala Developer Resume


McLean, VA

SUMMARY

  • Over 16 years of IT experience in analysis, design, development, documentation, implementation, and testing of software systems in Python, Scala, Java, J2EE, REST API, Oracle, and AWS technologies.
  • Experience in importing and exporting multi-terabyte volumes of data using Sqoop between RDBMS and HDFS.
  • Experience working with Hadoop clusters using Cloudera (CDH3 & CDH4) and Hortonworks distributions.
  • Experienced in implementing Hadoop on AWS EMR clusters and Microsoft Azure big data with Databricks.
  • 7 years of experience in administration, configuration, and management of open-source technologies like Spark, Kafka, Zookeeper, Docker, and Kubernetes on RHEL.
  • Good experience in writing Spark applications using Scala/Python/Java.
  • Experience in creating Resilient Distributed Datasets and DataFrames with appropriate pushdown predicate filtering.
  • Experience in supporting and monitoring Spark Jobs through Spark web UI & Grafana.
  • Involved in performance tuning Spark applications by choosing the right batch interval and tuning memory.
  • Experienced in fine-tuning Spark jobs using DataFrame repartition and coalesce techniques.
  • Experienced in Scala functional programming using closures, partial functions, currying, and monads.
  • Experience in developing applications using Hadoop ecosystem components: MapReduce, Hive, Pig, Oozie, HBase, and Flume.
  • Implemented built-in operators in Spark and PySpark such as map, flatMap, filter, reduceByKey, groupByKey, aggregateByKey, and combineByKey.
  • Experienced in Git repository and Maven builds.
  • Experienced in using Kerberos authentication within the Hadoop application framework.
  • Experienced in Apache Hive creating tables, data distribution by implementing partitioning and bucketing, writing and optimizing HiveQL queries and analyzing large datasets.
  • Experienced in creating Hive transactional tables and using the MERGE upsert operation to move data from staging to final tables.
  • Experience in writing simple to complex Pig scripts for processing and analyzing large volumes of data.
  • Experience using Impala for data processing on top of Hive for better resource utilization.
  • Experienced in using the Spark MongoDB connector to read MongoDB collections and store the transformed data into Hive tables.
  • Experience in generating Kafka streams and consuming data from Kafka topics with Spark DStreams.
  • Extensive hands-on experience in Hadoop file system commands for file handling operations.
  • Experience in using Spark over MapReduce for faster and more efficient data processing and analytics.
  • Experience in creating Spark DataFrame transformations using withColumn, withColumnRenamed, and drop operations to modify DataFrame columns.
  • Experience in writing Spark SQL queries using anti joins for upsert operations (a short sketch follows this list).
  • Hands-on experience in building spark-submit pipelines and generating DAGs using Airflow.
  • Experience in creating AWS Glue Spark jobs and scheduling their execution.
  • Worked on developing ETL processes to load data from multiple data sources to HDFS using Sqoop, perform structural modifications using Hive and analyze data using visualization/reporting tools.
  • Experience in analyzing large datasets and deriving actionable insights for process improvement.
  • Worked on loading and transforming large sets of structured, semi-structured, and unstructured data.
  • Experience in working with different file formats like Avro, Parquet, ORC, Sequence, and JSON files.
  • Background with traditional databases such as Oracle, MySQL, MS SQL Server, and PostgreSQL.
  • Good understanding of Web Services like SOAP, REST and build tools like SBT, Maven, and Gradle.
  • Experience with Jenkins and JFrog Artifactory images for deployment automation.
  • Good analytical, interpersonal, communication, and problem-solving skills with the ability to quickly master new concepts, capable of working in a group as well as independently.
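
A minimal PySpark sketch of the column transformations and anti-join upsert pattern mentioned above; the table and column names (staging_tbl, final_tbl, id, amt) are hypothetical placeholders and both tables are assumed to end up with the same curated schema.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("upsert-sketch").enableHiveSupport().getOrCreate()

    # Hypothetical tables: staging_tbl holds raw columns (id, amt, tmp_col);
    # final_tbl holds the curated schema (record_id, amount).
    staging = spark.table("staging_tbl")
    final = spark.table("final_tbl")

    # Column-level changes with withColumn, withColumnRenamed, and drop.
    staging = (staging
               .withColumn("amount", F.col("amt").cast("double"))
               .withColumnRenamed("id", "record_id")
               .drop("amt", "tmp_col"))

    # Left anti join keeps only the final rows whose key is not in staging;
    # unioning the refreshed staging rows back in gives the upserted result.
    unchanged = final.join(staging, on="record_id", how="left_anti")
    upserted = unchanged.unionByName(staging)

    # Coalesce to limit small output files before writing back to Hive.
    upserted.coalesce(8).write.mode("overwrite").format("orc").saveAsTable("final_tbl_upserted")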

TECHNICAL SKILLS

Hadoop Distribution: Cloudera (CDH 4 and 5), Hortonworks

Hadoop Ecosystem: HDFS, Sqoop, Hive, Pig, Impala, MapReduce, Spark Core, Spark SQL

Databases: Oracle, MySQL, MS SQL Server, PostgreSQL

NoSQL Database: HBase, Cassandra, MongoDB

Data warehouse: Redshift

Cloud: AWS, Azure, Google Cloud

AWS: S3, EMR, EC2, Athena, Glue

Languages: Scala, Java, Python

Operating System: Windows, UNIX / Linux, Mac

PROFESSIONAL EXPERIENCE

Confidential

Hadoop Spark Scala Developer

Responsibilities:

  • Involved in building the PySpark framework for generating DataFrames in Palantir Foundry (see the Foundry transform sketch after this list).
  • Responsible for building contour graphs from Spark DataFrames within Palantir Foundry.
  • Involved in scheduling Palantir Foundry jobs to run on trigger events or at scheduled times.
  • Involved in performance tuning Spark SQL and analyzing Spark logs and the DAG in Palantir Foundry.
  • Optimized datasets by creating partitioning and bucketing in Hive and performance tuning Hive queries.
  • Created RDDs in Spark and extracted data from the data warehouse onto Spark RDDs.
  • Experience working with Spark SQL and creating RDDs using the PySpark SparkContext and SparkSession.
  • Experience in implementing efficient storage formats like Avro, Parquet, and ORC.
  • Monitored workloads and job performance; managed, reviewed, and troubleshot Hadoop log files.
  • Involved in creating EMR clusters and designing the termination of EMR clusters after successful completion of spark-submit jobs.
  • Involved in Spark Structured Streaming, sending data from SQL Server to Cassandra for Spark MLlib activity.
  • Involved in end-to-end data processing: ingestion, processing, quality checks, and splitting.
  • Actively interacted with team members on issue and problem resolution.
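
A minimal sketch of a Foundry DataFrame transform of the kind described in the first bullet, using the Palantir Foundry Python transforms API as I understand it; the dataset paths and columns are hypothetical placeholders, not the project's actual pipeline.

    from transforms.api import transform_df, Input, Output
    from pyspark.sql import functions as F

    @transform_df(
        Output("/Project/datasets/clean_orders"),           # placeholder output dataset path
        raw_orders=Input("/Project/datasets/raw_orders"),    # placeholder input dataset path
    )
    def compute(raw_orders):
        # Simple Foundry transform: filter and derive a column before the
        # dataset is used for contour analysis and downstream scheduled jobs.
        return (raw_orders
                .filter(F.col("status") == "ACTIVE")
                .withColumn("order_year", F.year(F.col("order_date"))))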

Environment: Hortonworks 2.6, Ambari, AWS (EMR, EC2, & S3), Palantir Foundry, Python, Scala, Spark, Hive, Pig, Airflow, Flume, Sqoop, Jenkins, HBase, MongoDB, and Redshift.

Confidential, McLean, VA

Senior Bigdata Spark/Hive Developer

Responsibilities:

  • Developed shell scripts and scheduled the Spark jobs using Autosys.
  • Developed a PySpark framework to consume MongoDB collections and save them into Hive tables (see the MongoDB-to-Hive sketch after this list).
  • Developed Spark SQL and tuned queries for overall Spark job performance.
  • Developed Hive internal tables and used Bloom filters to improve Hive query performance.
  • Tuned various Hive parameters to resolve out-of-memory errors during Hive table append and upsert operations.
  • Used ORC, Parquet and Avro file formats for the Hive tables.
  • Developed shell scripts for Sqoop jobs to migrate data from Netezza and DB2 databases into HDFS.
  • Developed AWS Glue jobs and crawlers to generate Glue tables from a MySQL database, transform the data, and save it to an S3 location (see the Glue job sketch after this list).
  • Monitored AWS Glue jobs using Amazon CloudWatch and the Spark web UI.
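
A minimal sketch of the MongoDB-to-Hive flow from the first bullet, assuming the MongoDB Spark connector is on the classpath; the connection URI, collection, and table names are placeholders, and the read format string may differ by connector version ("mongo" for the 3.x connector, "mongodb" for 10.x).

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("mongo-to-hive")
             .config("spark.mongodb.input.uri",
                     "mongodb://mongo-host:27017/sales.orders")  # placeholder URI/db/collection
             .enableHiveSupport()
             .getOrCreate())

    # Read the MongoDB collection into a DataFrame (connector 3.x format name).
    orders = spark.read.format("mongo").load()

    # Light transformation before persisting to Hive.
    orders = orders.withColumn("load_dt", F.current_date())

    # Save as an ORC-backed Hive table (placeholder database/table).
    orders.write.mode("overwrite").format("orc").saveAsTable("staging_db.orders")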
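
A minimal AWS Glue PySpark job sketch along the lines of the Glue bullet above; the catalog database, table name, and S3 path are hypothetical placeholders.

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.context import SparkContext
    from pyspark.sql import functions as F

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the table registered by a Glue crawler over the MySQL source (placeholder names).
    dyf = glue_context.create_dynamic_frame.from_catalog(database="mysql_db", table_name="orders")

    # Transform with plain Spark, then convert back to a DynamicFrame for the Glue writer.
    df = dyf.toDF().withColumn("load_dt", F.current_date())
    out = DynamicFrame.fromDF(df, glue_context, "out")

    glue_context.write_dynamic_frame.from_options(
        frame=out,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/curated/orders/"},  # placeholder path
        format="parquet",
    )
    job.commit()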

Environment: Hortonworks, Ambari, AWS (EMR, EC2, & S3), AWS Glue, Hive, Sqoop, Python, Spark, MongoDB, Git, Jenkins, JFrog Artifactory, Autosys.

Confidential, Sterling, VA

Senior Bigdata Spark Developer

Responsibilities:

  • Involved in generating Spark DataFrames from MongoDB source collections and saving the transformed DataFrames into the Hive database.
  • Involved in incremental loads of Hive tables with the MERGE command and fine-tuning Hive tables with Bloom filters.
  • Involved in producing Kafka streams from SQL Server and HTTP requests, with Spark Streaming consuming the Kafka streams and writing into Cassandra (see the streaming sketch after this list).
  • Responsible for building Spark ETL jobs, generating DataFrames from Cassandra tables, and writing the transformed tables back into Cassandra.
  • Responsible for monitoring Kafka stream activity (Kafka Connect and topics) and Spark DStream activity.
  • Involved in designing Cassandra tables with respect to Spark DataFrame pushdown filtering (primary, clustering, and secondary keys).
  • Responsible for developing Spark SQL queries for better Spark job performance.
  • Responsible for monitoring Spark jobs using the Spark UI.
  • Responsible for scheduling Spark jobs through shell scripts using Autosys.
  • Responsible for optimizing Spark jobs to improve run times and resolve out-of-memory failures.
  • Experienced in implementing storage formats like Parquet, Avro, and ORC.
  • Experienced in parsing JSON dumps to generate Spark DataFrames.
  • Experienced with Git repositories, Maven builds, and deployments using Jenkins.
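
A minimal Structured Streaming sketch of the Kafka-to-Cassandra flow described above, assuming the spark-sql-kafka and spark-cassandra-connector packages are available; the broker, topic, keyspace, and column names are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka-to-cassandra").getOrCreate()

    schema = StructType([
        StructField("order_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    # Consume JSON events from the Kafka topic (placeholder broker/topic).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "orders")
              .load()
              .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    def write_to_cassandra(batch_df, batch_id):
        # Micro-batch write through the Cassandra connector (placeholder keyspace/table).
        (batch_df.write
         .format("org.apache.spark.sql.cassandra")
         .options(keyspace="sales", table="orders")
         .mode("append")
         .save())

    query = events.writeStream.foreachBatch(write_to_cassandra).start()
    query.awaitTermination()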

Environment: Hortonworks, Ambari, AWS (EMR, EC2, & S3), Scala, Python, Spark, Kafka, Cassandra, MongoDB, Airflow, Jenkins, MS SQL Server, Autosys, and Redshift.

Confidential, McLean, VA

Hadoop Spark Scala Developer

Responsibilities:

  • Involved in building the Scala Spark framework for generating DataFrames from HDFS and writing the DataFrames to S3 buckets.
  • Responsible for mapping Hive tables and designing the data transformations to move data to Redshift.
  • Involved in copying files of various formats (JSON, CSV, etc.) to HDFS using NiFi.
  • Involved in running and scheduling spark-submit jobs through Oozie, Airflow, and Korn shell (see the Airflow sketch after this list).
  • Involved in performance tuning Spark SQL and spark-submit jobs.
  • Developed Hive tables on data using different storage formats and compression techniques.
  • Optimized datasets by creating partitioning and bucketing in Hive and performance tuning Hive queries.
  • Created RDDs in Spark and extracted data from the data warehouse onto Spark RDDs.
  • Experience working with Spark SQL and creating RDDs using the PySpark SparkContext and SparkSession.
  • Experience collecting and moving logs from an exec source to HDFS using Flume.
  • Experience in implementing efficient storage formats like Avro, Parquet, and ORC.
  • Monitored workloads and job performance; managed, reviewed, and troubleshot Hadoop log files.
  • Involved in creating EMR clusters and designing the termination of EMR clusters after successful completion of spark-submit jobs.
  • Involved in Spark Structured Streaming, sending data from Hive to the Redshift data warehouse for Spark MLlib activity.
  • Involved in end-to-end data processing: ingestion, processing, quality checks, and splitting.
  • Exported the analyzed data to relational databases using Sqoop for visualization and report generation.
  • Actively interacted with team members on issue and problem resolution.
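
A minimal Airflow DAG sketch for scheduling a spark-submit job as mentioned above, assuming the apache-airflow-providers-apache-spark package and a spark_default connection are available; the DAG id, application path, schedule, and configuration are placeholders.

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    with DAG(
        dag_id="hdfs_to_s3_daily",          # placeholder DAG id
        start_date=datetime(2023, 1, 1),
        schedule_interval="0 2 * * *",      # run daily at 02:00
        catchup=False,
    ) as dag:

        load_to_s3 = SparkSubmitOperator(
            task_id="spark_hdfs_to_s3",
            application="/opt/jobs/hdfs_to_s3.py",   # placeholder application path
            conn_id="spark_default",
            conf={"spark.sql.shuffle.partitions": "200"},
        )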

Environment: Hortonworks 2.6, Ambari, AWS (EMR, EC2, & S3), Python, Spark, Hive, Pig, Airflow, Flume, Sqoop, Jenkins, HBase, DataStage, and Redshift.

Confidential, Augusta, ME

Hadoop Spark Bigdata Engineer

Responsibilities:

  • Responsible for importing data from Oracle and MySQL databases to HDFS using Sqoop for further transformation.
  • Responsible for creating Hive tables on top of HDFS and developed Hive queries to analyze the data.
  • Implemented AWS EMR clusters for generating a Hadoop POC environment.
  • Actively involved in SQL and Azure SQL DW code development using T-SQL.
  • Developed Hive tables on data using different storage formats and compression techniques.
  • Optimized datasets by creating partitioning and bucketing in Hive and performance tuning Hive queries.
  • Created RDDs in Spark and extracted data from the data warehouse onto Spark RDDs.
  • Consumed data from Kafka topics into Spark DStreams (see the DStream sketch after this list). Configured different topologies for the Spark cluster and deployed them on a regular basis.
  • Monitored Spark jobs in the production environment and took appropriate steps to fine-tune them.
  • Extensive hands-on experience in Hadoop file system commands for file handling operations.
  • Loaded data from local file systems to HDFS and vice versa using hadoop fs commands.
  • Designed and developed ETL workflows using Oozie, automating the extraction of data from different databases into HDFS using Sqoop scripts, transformation and analysis in Hive/Pig, and parsing of raw data using Spark.
  • Extensively worked on the Spark Core and Spark SQL modules for faster testing and processing of data.
  • Experience working with Spark SQL and creating RDDs using the Scala SparkContext and SparkSession.
  • Experience collecting and moving logs from an exec source to HDFS using Flume.
  • Troubleshot Azure Data Factory and SQL issues and performance.
  • Performed component unit testing using the Azure Emulator.
  • Analyzed escalated incidents within the Azure SQL database. Implemented test scripts to support test-driven development and continuous integration.
  • Worked with the Informatica ETL tool, Oracle Database, PL/SQL, Python, and shell scripts.
  • Involved in database design, creating tables, views, stored procedures, functions, triggers, and indexes. Strong experience in data warehousing and ETL using DataStage.
  • Worked on MicroStrategy report development and analysis, providing mentoring, guidance, and troubleshooting to analysis team members in solving complex reporting and analytical problems.
  • Extensively used filters, facts, Consolidations, Transformations and Custom Groups to generate reports for Business analysis.
  • Assisted with the design and development of MicroStrategy dashboards and interactive documents using MicroStrategy Web and Mobile.
  • Extracted data from SQL Server 2008 into data marts, views, and/or flat files for Tableau workbook consumption using T-SQL. Partitioned and queried the data in Hive for further analysis by the BI team.
  • Managed Tableau extracts on Tableau Server and administered Tableau Server.
  • Extensively worked on data extraction, transformation, and loading using BTEQ, FastLoad, and MultiLoad from Oracle to Teradata.
  • Extensively used the Teradata FastLoad/MultiLoad utilities to load data into tables.
  • Used Teradata SQL Assistant to build the SQL queries.
  • Performed data reconciliation in various source systems and in Teradata.
  • Involved in writing complex SQL queries using correlated subqueries, joins, and recursive queries.
  • Worked extensively on date manipulations in Teradata.
  • Extracted data from Oracle using SQL scripts, loaded it into Teradata using FastLoad/MultiLoad, and transformed it according to business transformation rules to insert/update the data in data marts.
  • Performed Hadoop cluster installation, configuration, and maintenance, cluster monitoring, troubleshooting, and certification of environments for production readiness.
  • Monitored workloads and job performance; managed, reviewed, and troubleshot Hadoop log files.
  • Developed workflows in Oozie to automate the tasks of loading data into HDFS, with pre-processing and transformations using Sqoop scripts, Pig scripts, and Hive queries.
  • Created Scala programs to develop reports for business users.
  • Involved in end-to-end data processing: ingestion, processing, quality checks, and splitting.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports.
  • Actively interacted with team members on issue and problem resolution.
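
A minimal sketch of the Kafka DStream consumption mentioned above, using the legacy Spark Streaming KafkaUtils API that matches this CDH/Spark generation (removed in Spark 3.x); the broker and topic names are placeholders.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="kafka-dstream-sketch")
    ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

    # Direct stream from a Kafka broker list (placeholder host/topic).
    stream = KafkaUtils.createDirectStream(
        ssc, ["orders"], {"metadata.broker.list": "broker:9092"})

    # Each record is a (key, value) pair; count messages per micro-batch.
    stream.map(lambda kv: kv[1]).count().pprint()

    ssc.start()
    ssc.awaitTermination()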

Environment: CDH (CDH4 & CDH5), Hadoop 2.4, Azure, Spark, PySpark, Hive, Pig, Oozie, Flume, Databricks, Sqoop, Cloudera Manager, Jenkins, Cassandra, MS SQL Server, Oracle RDBMS.
