Hadoop Spark Scala Developer Resume
McLean, VA
SUMMARY
- Over 16 years of IT experience in analysis, design, development, documentation, implementation, and testing of software systems in Python, Scala, Java, J2EE, REST API, Oracle, and AWS technologies.
- Experience in importing and exporting multiple terabytes of data using Sqoop from RDBMS to HDFS and vice versa.
- Experience working with Hadoop clusters using Cloudera (CDH3 & CDH4) and Hortonworks.
- Experienced in implementing Hadoop on AWS EMR clusters and Microsoft Azure Big Data with Databricks.
- 7 years of experience in administration, configuration, and management of open-source technologies like Spark, Kafka, Zookeeper, Docker, and Kubernetes on RHEL.
- Good experience in writing Spark applications using Scala/Python/Java.
- Experience in creating Resilient Distributed Datasets and DataFrames with appropriate predicate pushdown filtering.
- Experience in supporting and monitoring Spark Jobs through Spark web UI & Grafana.
- Involved in performance tuning of Spark applications by setting the right batch interval and tuning memory.
- Experienced in fine-tuning Spark jobs using DataFrame repartition and coalesce techniques.
- Experienced in Scala functional programming using closures, partial functions, currying, and monads.
- Experience in developing applications using the Hadoop ecosystem: MapReduce, Hive, Pig, Oozie, HBase, Flume.
- Implemented built-in operators in Spark and PySpark such as map, flatMap, filter, reduceByKey, groupByKey, aggregateByKey, and combineByKey (see the RDD sketch after this summary).
- Experienced in Git repository and Maven builds.
- Experienced in using Kerberos authentication within the Hadoop application framework.
- Experienced in Apache Hive creating tables, data distribution by implementing partitioning and bucketing, writing and optimizing HiveQL queries and analyzing large datasets.
- Experienced in creating Hive transactional tables and using the MERGE upsert operation to move data from staging to final tables.
- Experience in writing simple to complex Pig scripts for processing and analyzing large volumes of data.
- Experience using Impala for faster interactive data processing on top of Hive.
- Experienced in using the Spark MongoDB connector to read MongoDB collections and store the transformed data in Hive tables.
- Experience in Kafka stream generation and in consuming data from Kafka topics with Spark DStreams.
- Extensive hands-on experience in Hadoop file system commands for file handling operations.
- Experience in using Spark over MapReduce for faster, more efficient data processing and analytics.
- Experience in creating Spark DataFrame transformations using withColumn, withColumnRenamed, and drop operations to modify DataFrame columns (see the DataFrame sketch after this summary).
- Experience in writing Spark SQL queries using anti joins for upsert operations.
- Hands-on experience in building spark-submit pipelines and authoring DAGs using Airflow.
- Experience in creating AWS Glue Spark jobs and scheduling their execution.
- Worked on developing ETL processes to load data from multiple data sources to HDFS using Sqoop, perform structural modifications using Hive and analyze data using visualization/reporting tools.
- Experience in analyzing large datasets and deriving actionable insights for process improvement.
- Worked on loading and transforming large sets of structured, semi-structured, and unstructured data.
- Experience in working with different file formats like Avro, Parquet, ORC, Sequence, and JSON files.
- Background with traditional databases such as Oracle, MySQL, MS SQL Server, and PostgreSQL.
- Good understanding of Web Services like SOAP, REST and build tools like SBT, Maven, and Gradle.
- Experience in Jenkins and JFrog Artifactory images for deployment automation.
- Good analytical, interpersonal, communication, and problem-solving skills with the ability to quickly master new concepts and to work in a group as well as independently.
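For illustration, a minimal Scala sketch of the core RDD operators listed above (map, filter, reduceByKey, aggregateByKey); the input path, record layout, and application name are hypothetical.

```scala
// Minimal sketch of core RDD operators; dataset, path, and record layout are hypothetical.
import org.apache.spark.sql.SparkSession

object RddOperatorsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-operators-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: lines of "customerId,amount"
    val lines = sc.textFile("hdfs:///data/sales/part-*")

    val pairs = lines
      .map(_.split(","))
      .filter(_.length == 2)
      .map(a => (a(0), a(1).toDouble))

    // Total amount per customer with reduceByKey
    val totals = pairs.reduceByKey(_ + _)

    // (sum, count) per customer with aggregateByKey, then average
    val averages = pairs
      .aggregateByKey((0.0, 0L))(
        (acc, v) => (acc._1 + v, acc._2 + 1),
        (a, b) => (a._1 + b._1, a._2 + b._2))
      .mapValues { case (sum, count) => sum / count }

    totals.take(10).foreach(println)
    averages.take(10).foreach(println)
    spark.stop()
  }
}
```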
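Similarly, a minimal Scala sketch of the DataFrame column operations (withColumn, withColumnRenamed, drop) combined with a left_anti join for the upsert pattern described above; database, table, and column names are hypothetical.

```scala
// Minimal sketch of DataFrame column transformations and a left_anti-join based upsert;
// table and column names are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataFrameUpsertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-upsert-sketch")
      .enableHiveSupport()
      .getOrCreate()

    val staging = spark.table("staging_db.orders")   // hypothetical staging table
      .withColumn("load_dt", current_date())          // add an audit column
      .withColumnRenamed("ord_id", "order_id")        // normalize a column name
      .drop("tmp_flag")                                // drop a scratch column

    val target = spark.table("final_db.orders")       // hypothetical final table

    // Keep target rows whose keys are NOT in staging, then append the fresh staging rows
    val unchanged = target.join(staging, Seq("order_id"), "left_anti")
    val upserted  = unchanged.unionByName(staging)

    upserted.write.mode("overwrite").saveAsTable("final_db.orders_upserted")
    spark.stop()
  }
}
```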
TECHNICAL SKILLS
Hadoop Distribution: Cloudera (CDH 4 and 5), Hortonworks
Hadoop Ecosystem: HDFS, Sqoop, Hive, Pig, Impala, Map Reduce, Spark Core, Spark SQL
Databases: Oracle, MySQL, MS SQL Server, PostgreSQL
NoSQL Databases: HBase, Cassandra, MongoDB
Data warehouse: Redshift
Cloud: AWS, Azure, Google Cloud
AWS: S3, EMR, EC2, Athena, Glue
Languages: Scala, Java, Python
Operating System: Windows, UNIX / Linux, Mac
PROFESSIONAL EXPERIENCE
Confidential
Hadoop Spark Scala Developer
Responsibilities:
- Involved in building a PySpark framework for generating DataFrames in Palantir Foundry.
- Responsible for building Contour graphs from Spark DataFrames within Palantir Foundry.
- Involved in scheduling Palantir Foundry jobs to run on trigger events or at scheduled times.
- Involved in performance tuning Spark SQL and analyzing Spark logs and DAGs in Palantir Foundry.
- Optimized datasets by creating partitions and buckets in Hive and performance-tuning Hive queries (see the sketch after this list).
- Created RDDs in Spark and extracted data from the data warehouse onto Spark RDDs.
- Experience working with Spark SQL and creating RDDs using the PySpark SparkContext and SparkSession.
- Experience in implementing efficient storage formats like Avro, Parquet and ORC.
- Monitored workloads and job performance; managed, reviewed, and troubleshot Hadoop log files.
- Involved in creating EMR clusters and designing the termination of EMR clusters after successful completion of spark-submit jobs.
- Involved in Spark Structured Streaming, sending data from SQL Server to Cassandra for Spark MLlib workloads.
- Involved in end-to-end data processing like ingestion, processing, and quality checks and splitting.
- Actively involved in interacting with team members for issue and problem resolution.
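For illustration, a minimal Scala sketch of writing a partitioned and bucketed Hive table from Spark, as referenced in the Hive optimization bullet above; the source path, database, table, column names, and bucket count are hypothetical.

```scala
// Minimal sketch of writing a partitioned and bucketed Hive table from a DataFrame;
// source path, database, table, and column names are hypothetical.
import org.apache.spark.sql.SparkSession

object PartitionedHiveTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitioned-hive-table-sketch")
      .enableHiveSupport()
      .getOrCreate()

    val events = spark.read.parquet("hdfs:///raw/events")   // hypothetical source path

    events.write
      .partitionBy("event_date")        // prune partitions on date predicates
      .bucketBy(32, "customer_id")      // cluster rows on the join key
      .sortBy("customer_id")
      .format("orc")
      .mode("overwrite")
      .saveAsTable("analytics.events_partitioned")

    spark.stop()
  }
}
```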
Environment: Hortonworks 2.6, Ambari, AWS (EMR, EC2, & S3), Palantir Foundry, Python, Scala, Spark, Hive, Pig, Airflow, Flume, Sqoop, Jenkins, HBase, MongoDB, and Redshift.
Confidential, McLean, VA
Senior Bigdata Spark/Hive Developer
Responsibilities:
- Developed shell scripts and scheduled the Spark jobs using Autosys.
- Developed a PySpark framework to consume MongoDB collections and save them into Hive tables (see the sketch after this list).
- Developed Spark SQL and tuned queries for overall Spark job performance.
- Developed Hive internal tables and used bloom filters to improve Hive query performance.
- Used various Hive parameters to resolve out-of-memory errors during Hive table append and upsert operations.
- Used ORC, Parquet, and Avro file formats for Hive tables.
- Developed shell scripts for Sqoop exports to migrate data from Netezza and DB2 databases into HDFS.
- Developed AWS Glue jobs and crawlers to generate Glue tables from a MySQL database, transform the data, and save it to an S3 location.
- Monitored the AWS Glue jobs using AWS CloudWatch and the Spark web UI.
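A minimal Scala sketch of the MongoDB-to-Hive flow referenced above (the project used a PySpark framework; Scala is shown here only to keep a single language across examples); the connection URI, collection, and table names are hypothetical, and the exact reader options depend on the MongoDB Spark connector version.

```scala
// Minimal sketch of reading a MongoDB collection and saving it as a Hive table;
// URI, collection, filter, and table names are hypothetical.
import org.apache.spark.sql.SparkSession

object MongoToHiveSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mongo-to-hive-sketch")
      .enableHiveSupport()
      .getOrCreate()

    val customers = spark.read
      .format("mongo")                                              // mongo-spark-connector 2.x/3.x source
      .option("uri", "mongodb://mongo-host:27017/crm.customers")    // hypothetical URI
      .load()

    customers
      .filter("active = true")          // hypothetical transformation
      .write
      .mode("overwrite")
      .saveAsTable("crm_db.customers")

    spark.stop()
  }
}
```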
Environment: Hortonworks, Ambari, AWS (EMR, EC2, & S3), AWS Glue, Hive, Sqoop, Python, Spark, MongoDB, Git, Jenkins, JFrog Artifactory, Autosys.
Confidential, Sterling, VA
Senior Bigdata Spark Developer
Responsibilities:
- Involved in generating Spark DataFrames from MongoDB source collections and saving the transformed DataFrames into Hive.
- Involved in incremental loads of Hive tables using the MERGE command and fine-tuning Hive tables with bloom filters.
- Involved in producing Kafka streams from SQL Server and HTTP requests, and using Spark Streaming to consume the Kafka streams and write into Cassandra (see the sketch after this list).
- Responsible for building Spark ETL jobs, generating DataFrames from Cassandra tables, and writing the transformed tables back into Cassandra.
- Responsible for monitoring Kafka stream activity (Kafka Connect and topics) and Spark DStream activity.
- Involved in designing Cassandra tables with Spark DataFrame pushdown filtering in mind (primary key, clustering, and secondary keys).
- Responsible for developing Spark SQL queries for better Spark job performance.
- Responsible for monitoring Spark jobs using the Spark UI.
- Responsible for scheduling Spark jobs through shell scripts using Autosys.
- Responsible for optimizing Spark jobs for better run times and resolving out-of-memory failures.
- Experienced in implementing storage formats like Parquet, Avro, and ORC.
- Experienced in extracting JSON dumps to generate Spark DataFrames.
- Experienced in Git repositories, Maven builds, and deployments using Jenkins.
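A minimal Scala sketch of consuming a Kafka topic with Spark Structured Streaming and writing each micro-batch to Cassandra through the DataStax Spark Cassandra Connector, as referenced above; the broker, topic, keyspace, table, host, and checkpoint path are hypothetical.

```scala
// Minimal sketch of a Kafka -> Structured Streaming -> Cassandra pipeline;
// broker, topic, keyspace, table, and checkpoint location are hypothetical.
import org.apache.spark.sql.{DataFrame, SparkSession}

object KafkaToCassandraSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-cassandra-sketch")
      .config("spark.cassandra.connection.host", "cassandra-host")  // hypothetical host
      .getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "orders")
      .load()
      .selectExpr("CAST(key AS STRING) AS order_id", "CAST(value AS STRING) AS payload")

    // Write each micro-batch to a Cassandra table via the connector's batch sink
    val writeToCassandra: (DataFrame, Long) => Unit = (batch, batchId) => {
      batch.write
        .format("org.apache.spark.sql.cassandra")
        .options(Map("keyspace" -> "sales", "table" -> "orders"))
        .mode("append")
        .save()
    }

    val query = stream.writeStream
      .foreachBatch(writeToCassandra)
      .option("checkpointLocation", "hdfs:///checkpoints/orders")
      .start()

    query.awaitTermination()
  }
}
```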
Environment: Hortonworks, Ambari, AWS (EMR, EC2, & S3), Scala, Python, Spark, Kafka, Cassandra, MongoDB, Airflow, Jenkins, MS SQL Server, Autosys, and Redshift.
Confidential, McLean, VA
Hadoop Spark Scala Developer
Responsibilities:
- Involved in building a Scala Spark framework for generating DataFrames from HDFS and writing them to S3 buckets (see the sketch after this list).
- Responsible for mapping Hive tables and designing the data transformations to move data to Redshift.
- Involved in copying files of various formats (JSON, CSV, etc.) to HDFS using NiFi.
- Involved in running and scheduling spark-submit jobs through Oozie, Airflow, and Korn shell.
- Involved in performance tuning Spark SQL and spark-submit jobs.
- Developed Hive tables on data using different storage formats and compression techniques.
- Optimized datasets by creating partitions and buckets in Hive and performance-tuning Hive queries.
- Created RDDs in Spark and extracted data from the data warehouse onto Spark RDDs.
- Experience working with Spark SQL and creating RDDs using the PySpark SparkContext and SparkSession.
- Experience working with collecting and moving logs from an exec source to HDFS using Flume.
- Experience in implementing efficient storage formats like Avro, Parquet and ORC.
- Monitored workloads and job performance; managed, reviewed, and troubleshot Hadoop log files.
- Involved in creating EMR clusters and designing the termination of EMR clusters after successful completion of spark-submit jobs.
- Involved in Spark Structured Streaming, sending data from Hive to the Redshift data warehouse for Spark MLlib workloads.
- Involved in end-to-end data processing like ingestion, processing, and quality checks and splitting.
- Exported the analyzed data to relational databases using Sqoop for visualization and report generation.
- Actively involved in interacting with team members for issue and problem resolution.
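A minimal Scala sketch of the HDFS-to-S3 DataFrame flow referenced in the first bullet above; the source path, S3 bucket, and column names are hypothetical, and S3 credentials are assumed to come from the EMR instance profile.

```scala
// Minimal sketch of reading from HDFS and writing partitioned Parquet to S3;
// paths, bucket, and column names are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HdfsToS3Sketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hdfs-to-s3-sketch").getOrCreate()

    val transactions = spark.read
      .option("header", "true")
      .csv("hdfs:///landing/transactions")                    // hypothetical HDFS landing zone

    transactions
      .withColumn("amount", col("amount").cast("double"))
      .withColumn("load_dt", current_date())
      .write
      .mode("overwrite")
      .partitionBy("load_dt")
      .parquet("s3a://analytics-bucket/curated/transactions") // hypothetical S3 target

    spark.stop()
  }
}
```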
Environment: Hortonworks 2.6, Ambari, AWS (EMR, EC2, & S3), Python, Spark, Hive, Pig, Airflow, Flume, Sqoop, Jenkins, HBase, DataStage, and Redshift.
Confidential, Augusta, ME
Hadoop Spark Bigdata Engineer
Responsibilities:
- Responsible for importing data from Oracle and MySQL databases to HDFS using Sqoop for further transformation.
- Responsible for creating Hive tables on top of HDFS and developed Hive Queries to analyze the data.
- Implemented AWS EMR clusters to create a Hadoop POC environment.
- Actively involved in SQL and Azure SQL DW code development using T-SQL.
- Developed Hive tables on data using different storage formats and compression techniques.
- Optimized datasets by creating partitions and buckets in Hive and performance-tuning Hive queries.
- Created RDDs in Spark and extracted data from the data warehouse onto Spark RDDs.
- Consumed data from Kafka topics into Spark DStreams (see the sketch after this list). Configured different topologies for the Spark cluster and deployed them on a regular basis.
- Monitored Spark jobs in the production environment and took appropriate steps to fine-tune them.
- Extensive hands-on experience in Hadoop file system commands for file handling operations.
- Loaded data from local file systems to HDFS and vice versa using hadoop fs commands.
- Designed and developed ETL workflows using Oozie, including automating the extraction of data from different databases into HDFS using Sqoop scripts, transformation and analysis in Hive/Pig, and parsing the raw data using Spark.
- Extensively worked on the core and Spark SQL modules of Spark for faster testing and processing of data.
- Experience working with Spark SQL and creating RDDs using the Scala SparkContext and SparkSession.
- Experience working with collecting and moving logs from an exec source to HDFS using Flume.
- Troubleshooting Azure Data Factory and SQL issues and performance.
- Component unit testing using Azure Emulator.
- Analyzed escalated incidents within the Azure SQL database. Implemented test scripts to support test-driven development and continuous integration.
- Worked on ETL tool Informatica, Oracle Database and PL/SQL, Python and Shell Scripts.
- Involved in database design, creating Tables, Views, Stored Procedures, Functions, Triggers and Indexes. Strong experience in Data Warehousing and ETL using Datastage.
- Worked on MicroStrategy report development, analysis, providing mentoring, guidance and troubleshooting to analysis team members in solving complex reporting and analytical problems.
- Extensively used filters, facts, consolidations, transformations, and custom groups to generate reports for business analysis.
- Assisted with the design and development of MicroStrategy dashboards and interactive documents using MicroStrategy Web and Mobile.
- Extracted data from SQL Server 2008 into data marts, views, and/or flat files for Tableau workbook consumption using T-SQL. Partitioned and queried the data in Hive for further analysis by the BI team.
- Managed Tableau extracts on Tableau Server and administered Tableau Server.
- Extensively worked on data extraction, transformation, and loading using BTEQ, FastLoad, and MultiLoad from Oracle to Teradata.
- Extensively used the Teradata FastLoad/MultiLoad utilities to load data into tables.
- Used Teradata SQL Assistant to build SQL queries.
- Performed data reconciliation across various source systems and Teradata.
- Involved in writing complex SQL queries using correlated sub queries, joins, and recursive queries.
- Worked extensively on date manipulations in Teradata.
- Extracted data from Oracle using SQL scripts, loaded it into Teradata using FastLoad/MultiLoad, and transformed it according to business transformation rules to insert/update data in the data marts.
- Handled installation, configuration, and maintenance of the Hadoop cluster, including cluster monitoring, troubleshooting, and certifying environments for production readiness.
- Monitored workloads and job performance; managed, reviewed, and troubleshot Hadoop log files.
- Developed workflows in Oozie to automate the tasks of loading data into HDFS, pre-processing, and transformations with Sqoop scripts, Pig scripts, and Hive queries.
- Created Scala programs to develop reports for business users.
- Involved in end-to-end data processing like ingestion, processing, and quality checks and splitting.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports.
- Actively involved in interacting with team members for issue and problem resolution.
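A minimal Scala sketch of consuming a Kafka topic into Spark DStreams (spark-streaming-kafka-0-10), as referenced above; the broker, consumer group, topic name, and batch interval are hypothetical.

```scala
// Minimal sketch of a Kafka direct DStream; broker, group, topic, and interval are hypothetical.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object KafkaDStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-dstream-sketch")
    val ssc = new StreamingContext(conf, Seconds(30))   // 30-second batch interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "sensor-consumers",
      "auto.offset.reset" -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("sensor-events"), kafkaParams))

    // Count records per micro-batch and print a sample to the driver log
    stream.map(_.value()).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```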
Environment: CDH (CDH4 & CDH5), Hadoop 2.4, Azure, Spark, PySpark, Hive, Pig, Oozie, Flume, Databricks, Sqoop, Cloudera Manager, Jenkins, Cassandra, MS SQL Server, Oracle RDBMS.