Hadoop Spark Scala Developer Resume
McLean, VA
SUMMARY
- Over 16 years of IT experience in analysis, design, development, documentation, implementation, and testing of software systems in Python, Scala, Java, J2EE, REST API, Oracle, and AWS technologies.
- Experience in importing and exporting multiple terabytes of data using Sqoop from RDBMS to HDFS and vice versa.
- Experience working with Hadoop clusters using Cloudera (CDH3 & CDH4) and Hortonworks.
- Experienced in implementing Hadoop on AWS EMR clusters and Microsoft Azure Big Data with Databricks.
- 7 years of experience in administration, configuration, and management of open-source technologies like Spark, Kafka, Zookeeper, Docker, and Kubernetes on RHEL.
- Good experience in writing Spark applications using Scala/Python/Java.
- Experience in creating Resilient Distributed Datasets and DataFrames with appropriate predicate pushdown filtering.
- Experience in supporting and monitoring Spark Jobs through Spark web UI & Grafana.
- Involved in performance tuning of Spark applications by setting the right batch interval and tuning memory.
- Experienced in fine-tuning Spark jobs using DataFrame repartition and coalesce techniques.
- Experienced in Scala functional programming using closures, partial functions, currying, and monads.
- Experience in developing applications using the Hadoop ecosystem: MapReduce, Hive, Pig, Oozie, HBase, Flume.
- Implemented built-in operators in Spark and PySpark such as map, flatMap, filter, reduceByKey, groupByKey, aggregateByKey, and combineByKey (see the RDD sketch after this summary).
- Experienced in Git repository and Maven builds.
- Experienced in using Kerberos authentication within the Hadoop application framework.
- Experienced in Apache Hive creating tables, data distribution by implementing partitioning and bucketing, writing and optimizing HiveQL queries and analyzing large datasets.
- Experienced in creating Hive transactional tables and using the MERGE upsert operation to move data from staging to final tables.
- Experience in writing simple to complex Pig scripts for processing and analyzing large volumes of data.
- Experience using Impala for faster interactive data processing on top of Hive.
- Experienced in using the Spark MongoDB connector to read MongoDB collections and store the transformed data in Hive tables.
- Experience in Kafka stream generation and in consuming data from Kafka topics with Spark DStreams.
- Extensive hands-on experience in Hadoop file system commands for file handling operations.
- Experience in using Spark over MapReduce for faster, more efficient data processing and analytics.
- Experience in creating Spark DataFrame transformations using withColumn, withColumnRenamed, and drop operations to modify DataFrame columns (see the DataFrame sketch after this summary).
- Experience in writing Spark SQL queries using anti joins for upsert operations.
- Hands-on experience in building spark-submit pipelines and authoring DAGs using Airflow.
- Experience in creating AWS Glue Spark jobs and scheduling their execution.
- Worked on developing ETL processes to load data from multiple data sources to HDFS using Sqoop, perform structural modifications using Hive and analyze data using visualization/reporting tools.
- Experience in analyzing large datasets and deriving actionable insights for process improvement.
- Worked on loading and transforming large sets of structured, semi-structured, and unstructured data.
- Experience in working with different file formats like Avro, Parquet, ORC, Sequence, and JSON files.
- Background with traditional databases such as Oracle, MySQL, MS SQL Server, and PostgreSQL.
- Good understanding of Web Services like SOAP, REST and build tools like SBT, Maven, and Gradle.
- Experience in Jenkins and JFrog Artifactory images for deployment automation.
- Good analytical, interpersonal, communication, and problem-solving skills with the ability to quickly master new concepts and to work in a group as well as independently.
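For illustration, a minimal Scala sketch of the core RDD operators listed above (map, filter, reduceByKey, aggregateByKey); the input path, record layout, and application name are hypothetical.

```scala
// Minimal sketch of core RDD operators; dataset, path, and record layout are hypothetical.
import org.apache.spark.sql.SparkSession

object RddOperatorsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-operators-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: lines of "customerId,amount"
    val lines = sc.textFile("hdfs:///data/sales/part-*")

    val pairs = lines
      .map(_.split(","))
      .filter(_.length == 2)
      .map(a => (a(0), a(1).toDouble))

    // Total amount per customer with reduceByKey
    val totals = pairs.reduceByKey(_ + _)

    // (sum, count) per customer with aggregateByKey, then average
    val averages = pairs
      .aggregateByKey((0.0, 0L))(
        (acc, v) => (acc._1 + v, acc._2 + 1),
        (a, b) => (a._1 + b._1, a._2 + b._2))
      .mapValues { case (sum, count) => sum / count }

    totals.take(10).foreach(println)
    averages.take(10).foreach(println)
    spark.stop()
  }
}
```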
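Similarly, a minimal Scala sketch of the DataFrame column operations (withColumn, withColumnRenamed, drop) combined with a left_anti join for the upsert pattern described above; database, table, and column names are hypothetical.

```scala
// Minimal sketch of DataFrame column transformations and a left_anti-join based upsert;
// table and column names are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataFrameUpsertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-upsert-sketch")
      .enableHiveSupport()
      .getOrCreate()

    val staging = spark.table("staging_db.orders")   // hypothetical staging table
      .withColumn("load_dt", current_date())          // add an audit column
      .withColumnRenamed("ord_id", "order_id")        // normalize a column name
      .drop("tmp_flag")                                // drop a scratch column

    val target = spark.table("final_db.orders")       // hypothetical final table

    // Keep target rows whose keys are NOT in staging, then append the fresh staging rows
    val unchanged = target.join(staging, Seq("order_id"), "left_anti")
    val upserted  = unchanged.unionByName(staging)

    upserted.write.mode("overwrite").saveAsTable("final_db.orders_upserted")
    spark.stop()
  }
}
```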
TECHNICAL SKILLS
Hadoop Distribution: Cloudera (CDH 4 and 5), Hortonworks
Hadoop Ecosystem: HDFS, Sqoop, Hive, Pig, Impala, Map Reduce, Spark Core, Spark SQL
Databases: Oracle, MySQL, MS SQL Server, PostgreSQL
NoSQL Databases: HBase, Cassandra, MongoDB
Data warehouse: Redshift
Cloud: AWS, Azure, Google Cloud
AWS: S3, EMR, EC2, Athena, Glue
Languages: Scala, Java, Python
Operating System: Windows, UNIX / Linux, Mac
PROFESSIONAL EXPERIENCE
Confidential
Hadoop Spark Scala Developer
Responsibilities:
- Involved in building a PySpark framework for generating DataFrames in Palantir Foundry.
- Responsible for building Contour graphs from Spark DataFrames within Palantir Foundry.
- Involved in scheduling Palantir Foundry jobs to run on trigger events or at scheduled times.
- Involved in performance tuning Spark SQL and analyzing Spark logs and DAGs in Palantir Foundry.
- Optimized datasets by creating partitions and buckets in Hive and performance-tuning Hive queries (see the sketch after this list).
- Created RDDs in Spark and extracted data from the data warehouse onto Spark RDDs.
- Experience working with Spark SQL and creating RDDs using the PySpark SparkContext and SparkSession.
- Experience in implementing efficient storage formats like Avro, Parquet and ORC.
- Monitored workloads and job performance; managed, reviewed, and troubleshot Hadoop log files.
- Involved in creating EMR clusters and designing the termination of EMR clusters after successful completion of spark-submit jobs.
- Involved in Spark Structured Streaming, sending data from SQL Server to Cassandra for Spark MLlib workloads.
- Involved in end-to-end data processing like ingestion, processing, and quality checks and splitting.
- Actively involved in interacting with team members for issue and problem resolution.
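For illustration, a minimal Scala sketch of writing a partitioned and bucketed Hive table from Spark, as referenced in the Hive optimization bullet above; the source path, database, table, column names, and bucket count are hypothetical.

```scala
// Minimal sketch of writing a partitioned and bucketed Hive table from a DataFrame;
// source path, database, table, and column names are hypothetical.
import org.apache.spark.sql.SparkSession

object PartitionedHiveTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitioned-hive-table-sketch")
      .enableHiveSupport()
      .getOrCreate()

    val events = spark.read.parquet("hdfs:///raw/events")   // hypothetical source path

    events.write
      .partitionBy("event_date")        // prune partitions on date predicates
      .bucketBy(32, "customer_id")      // cluster rows on the join key
      .sortBy("customer_id")
      .format("orc")
      .mode("overwrite")
      .saveAsTable("analytics.events_partitioned")

    spark.stop()
  }
}
```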
Environment: Hortonworks 2.6, Ambari, AWS (EMR, EC2, & S3), Palantir Foundry, Python, Scala, Spark, Hive, Pig, Airflow, Flume, Sqoop, Jenkins, HBase, MongoDB, and Redshift.
Confidential, McLean, VA
Senior Bigdata Spark/Hive Developer
Responsibilities:
- Developed shell scripts and scheduled the Spark jobs using Autosys.
- Developed a PySpark framework to consume MongoDB collections and save them into Hive tables (see the sketch after this list).
- Developed Spark SQL and tuned queries for overall Spark job performance.
- Developed Hive internal tables and used bloom filters to improve Hive query performance.
- Used various Hive parameters to resolve out-of-memory errors during Hive table append and upsert operations.
- Used ORC, Parquet, and Avro file formats for Hive tables.
- Developed shell scripts for Sqoop exports to migrate data from Netezza and DB2 databases into HDFS.
- Developed AWS Glue jobs and crawlers to generate Glue tables from a MySQL database, transform the data, and save it to an S3 location.
- Monitored the AWS Glue jobs using AWS CloudWatch and the Spark web UI.
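A minimal Scala sketch of the MongoDB-to-Hive flow referenced above (the project used a PySpark framework; Scala is shown here only to keep a single language across examples); the connection URI, collection, and table names are hypothetical, and the exact reader options depend on the MongoDB Spark connector version.

```scala
// Minimal sketch of reading a MongoDB collection and saving it as a Hive table;
// URI, collection, filter, and table names are hypothetical.
import org.apache.spark.sql.SparkSession

object MongoToHiveSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mongo-to-hive-sketch")
      .enableHiveSupport()
      .getOrCreate()

    val customers = spark.read
      .format("mongo")                                              // mongo-spark-connector 2.x/3.x source
      .option("uri", "mongodb://mongo-host:27017/crm.customers")    // hypothetical URI
      .load()

    customers
      .filter("active = true")          // hypothetical transformation
      .write
      .mode("overwrite")
      .saveAsTable("crm_db.customers")

    spark.stop()
  }
}
```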
Environment: Hortonworks, Ambari, AWS (EMR, EC2, & S3), AWS Glue, Hive, Sqoop, Python, Spark, MongoDB, Git, Jenkins, JFrog Artifactory, Autosys.
Confidential, Sterling, VA
Senior Bigdata Spark Developer
Responsibilities:
- Involved in generating Spark DataFrames from MongoDB source collections and saving the transformed DataFrames into Hive.
- Involved in incremental loads of Hive tables using the MERGE command and fine-tuning Hive tables with bloom filters.
- Involved in producing Kafka streams from SQL Server and HTTP requests, and using Spark Streaming to consume the Kafka streams and write into Cassandra (see the sketch after this list).
- Responsible for building Spark ETL jobs, generating DataFrames from Cassandra tables, and writing the transformed tables back into Cassandra.
- Responsible for monitoring Kafka stream activity (Kafka Connect and topics) and Spark DStream activity.
- Involved in designing Cassandra tables with Spark DataFrame pushdown filtering in mind (primary key, clustering, and secondary keys).
- Responsible for developing Spark SQL queries for better Spark job performance.
- Responsible for monitoring Spark jobs using the Spark UI.
- Responsible for scheduling Spark jobs through shell scripts using Autosys.
- Responsible for optimizing Spark jobs for better run times and resolving out-of-memory failures.
- Experienced in implementing storage formats like Parquet, Avro, and ORC.
- Experienced in extracting JSON dumps to generate Spark DataFrames.
- Experienced in Git repositories, Maven builds, and deployments using Jenkins.
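A minimal Scala sketch of consuming a Kafka topic with Spark Structured Streaming and writing each micro-batch to Cassandra through the DataStax Spark Cassandra Connector, as referenced above; the broker, topic, keyspace, table, host, and checkpoint path are hypothetical.

```scala
// Minimal sketch of a Kafka -> Structured Streaming -> Cassandra pipeline;
// broker, topic, keyspace, table, and checkpoint location are hypothetical.
import org.apache.spark.sql.{DataFrame, SparkSession}

object KafkaToCassandraSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-cassandra-sketch")
      .config("spark.cassandra.connection.host", "cassandra-host")  // hypothetical host
      .getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "orders")
      .load()
      .selectExpr("CAST(key AS STRING) AS order_id", "CAST(value AS STRING) AS payload")

    // Write each micro-batch to a Cassandra table via the connector's batch sink
    val writeToCassandra: (DataFrame, Long) => Unit = (batch, batchId) => {
      batch.write
        .format("org.apache.spark.sql.cassandra")
        .options(Map("keyspace" -> "sales", "table" -> "orders"))
        .mode("append")
        .save()
    }

    val query = stream.writeStream
      .foreachBatch(writeToCassandra)
      .option("checkpointLocation", "hdfs:///checkpoints/orders")
      .start()

    query.awaitTermination()
  }
}
```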
Environment: Hortonworks, Ambari, AWS (EMR, EC2, & S3), Scala, Python, Spark, Kafka, Cassandra, MongoDB, Airflow, Jenkins, MS SQL Server, Autosys, and Redshift.
Confidential, McLean, VA
Hadoop Spark Scala Developer
Responsibilities:
- Involved in building a Scala Spark framework for generating DataFrames from HDFS and writing them to S3 buckets (see the sketch after this list).
- Responsible for mapping Hive tables and designing the data transformations to move data to Redshift.
- Involved in copying files of various formats (JSON, CSV, etc.) to HDFS using NiFi.
- Involved in running and scheduling spark-submit jobs through Oozie, Airflow, and Korn shell.
- Involved in performance tuning Spark SQL and spark-submit jobs.
- Developed Hive tables on data using different storage formats and compression techniques.
- Optimized datasets by creating partitions and buckets in Hive and performance-tuning Hive queries.
- Created RDDs in Spark and extracted data from the data warehouse onto Spark RDDs.
- Experience working with Spark SQL and creating RDDs using the PySpark SparkContext and SparkSession.
- Experience working with collecting and moving logs from an exec source to HDFS using Flume.
- Experience in implementing efficient storage formats like Avro, Parquet and ORC.
- Monitored workloads and job performance; managed, reviewed, and troubleshot Hadoop log files.
- Involved in creating EMR clusters and designing the termination of EMR clusters after successful completion of spark-submit jobs.
- Involved in Spark Structured Streaming, sending data from Hive to the Redshift data warehouse for Spark MLlib workloads.
- Involved in end-to-end data processing like ingestion, processing, and quality checks and splitting.
- Exported the analyzed data to relational databases using Sqoop for visualization and report generation.
- Actively involved in interacting with team members for issue and problem resolution.
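A minimal Scala sketch of the HDFS-to-S3 DataFrame flow referenced in the first bullet above; the source path, S3 bucket, and column names are hypothetical, and S3 credentials are assumed to come from the EMR instance profile.

```scala
// Minimal sketch of reading from HDFS and writing partitioned Parquet to S3;
// paths, bucket, and column names are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HdfsToS3Sketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hdfs-to-s3-sketch").getOrCreate()

    val transactions = spark.read
      .option("header", "true")
      .csv("hdfs:///landing/transactions")                    // hypothetical HDFS landing zone

    transactions
      .withColumn("amount", col("amount").cast("double"))
      .withColumn("load_dt", current_date())
      .write
      .mode("overwrite")
      .partitionBy("load_dt")
      .parquet("s3a://analytics-bucket/curated/transactions") // hypothetical S3 target

    spark.stop()
  }
}
```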
Environment: Hortonworks 2.6, Ambari, AWS (EMR, EC2, & S3), Python, Spark, Hive, Pig, Airflow, Flume, Sqoop, Jenkins, HBase, DataStage, and Redshift.
Confidential, Augusta, ME
Hadoop Spark Bigdata Engineer
Responsibilities:
- Responsible for importing data from Oracle and MySQL databases to HDFS using Sqoop for further transformation.
- Responsible for creating Hive tables on top of HDFS and developed Hive Queries to analyze the data.
- Implemented AWS EMR clusters to create a Hadoop POC environment.
- Actively involved in SQL and Azure SQL DW code development using T-SQL.
- Developed Hive tables on data using different storage formats and compression techniques.
- Optimized datasets by creating partitions and buckets in Hive and performance-tuning Hive queries.
- Created RDDs in Spark and extracted data from the data warehouse onto Spark RDDs.
- Consumed data from Kafka topics into Spark DStreams (see the sketch after this list). Configured different topologies for the Spark cluster and deployed them on a regular basis.
- Monitored Spark jobs in the production environment and took appropriate steps to fine-tune them.
- Extensive hands-on experience in Hadoop file system commands for file handling operations.
- Loaded data from local file systems to HDFS and vice versa using hadoop fs commands.
- Designed and developed ETL workflows using Oozie, including automating the extraction of data from different databases into HDFS using Sqoop scripts, transformation and analysis in Hive/Pig, and parsing the raw data using Spark.
- Extensively worked on the core and Spark SQL modules of Spark for faster testing and processing of data.
- Experience working with Spark SQL and creating RDDs using the Scala SparkContext and SparkSession.
- Experience working with collecting and moving logs from an exec source to HDFS using Flume.
- Troubleshooting Azure Data Factory and SQL issues and performance.
- Component unit testing using Azure Emulator.
- Analyzed escalated incidents within the Azure SQL database. Implemented test scripts to support test-driven development and continuous integration.
- Worked on ETL tool Informatica, Oracle Database and PL/SQL, Python and Shell Scripts.
- Involved in database design, creating Tables, Views, Stored Procedures, Functions, Triggers and Indexes. Strong experience in Data Warehousing and ETL using Datastage.
- Worked on MicroStrategy report development, analysis, providing mentoring, guidance and troubleshooting to analysis team members in solving complex reporting and analytical problems.
- Extensively used filters, facts, consolidations, transformations, and custom groups to generate reports for business analysis.
- Assisted with the design and development of MicroStrategy dashboards and interactive documents using MicroStrategy Web and Mobile.
- Extracted data from SQL Server 2008 into data marts, views, and/or flat files for Tableau workbook consumption using T-SQL. Partitioned and queried the data in Hive for further analysis by the BI team.
- Managed Tableau extracts on Tableau Server and administered Tableau Server.
- Extensively worked on data extraction, transformation, and loading using BTEQ, FastLoad, and MultiLoad from Oracle to Teradata.
- Extensively used the Teradata FastLoad/MultiLoad utilities to load data into tables.
- Used Teradata SQL Assistant to build SQL queries.
- Performed data reconciliation across various source systems and Teradata.
- Involved in writing complex SQL queries using correlated sub queries, joins, and recursive queries.
- Worked extensively on date manipulations in Teradata.
- Extracted data from Oracle using SQL scripts, loaded it into Teradata using FastLoad/MultiLoad, and transformed it according to business transformation rules to insert/update data in the data marts.
- Handled installation, configuration, and maintenance of the Hadoop cluster, including cluster monitoring, troubleshooting, and certifying environments for production readiness.
- Monitored workloads and job performance; managed, reviewed, and troubleshot Hadoop log files.
- Developed workflows in Oozie to automate the tasks of loading data into HDFS, pre-processing, and transformations with Sqoop scripts, Pig scripts, and Hive queries.
- Created Scala programs to develop reports for business users.
- Involved in end-to-end data processing like ingestion, processing, and quality checks and splitting.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports.
- Actively involved in interacting with team members for issue and problem resolution.
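A minimal Scala sketch of consuming a Kafka topic into Spark DStreams (spark-streaming-kafka-0-10), as referenced above; the broker, consumer group, topic name, and batch interval are hypothetical.

```scala
// Minimal sketch of a Kafka direct DStream; broker, group, topic, and interval are hypothetical.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object KafkaDStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-dstream-sketch")
    val ssc = new StreamingContext(conf, Seconds(30))   // 30-second batch interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "sensor-consumers",
      "auto.offset.reset" -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("sensor-events"), kafkaParams))

    // Count records per micro-batch and print a sample to the driver log
    stream.map(_.value()).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```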
Environment: CDH (CDH4 & CDH5), Hadoop 2.4, Azure, Spark, PySpark, Hive, Pig, Oozie, Flume, Databricks, Sqoop, Cloudera Manager, Jenkins, Cassandra, MS SQL Server, Oracle RDBMS.