Senior Big Data Spark Developer Resume
SUMMARY
- Over 15 years of IT experience in the analysis, design, development, documentation, implementation, and testing of software systems using Python, Scala, Java, J2EE, Akka, REST APIs, MySQL, and AWS technologies.
- Expertise in the design and development of web and enterprise applications using Typesafe technologies such as Scala, Akka, the Play Framework, and Slick.
- Experience importing and exporting multi-terabyte datasets between RDBMS and HDFS using Sqoop.
- Experience working with Hadoop clusters on Cloudera (CDH3 & CDH4) and Hortonworks distributions.
- Experienced in running Hadoop workloads on AWS EMR clusters and on Microsoft Azure with Databricks.
- 7 years of experience administering, configuring, and managing open-source technologies such as Spark, Kafka, ZooKeeper, Docker, and Kubernetes on RHEL.
- Good experience in writing Spark applications using Scala/Java/Python.
- Experience creating Resilient Distributed Datasets (RDDs) and DataFrames with appropriate predicate pushdown filtering.
- Experience supporting and monitoring Spark jobs through the Spark web UI and Grafana.
- Performance-tuned Spark applications by choosing appropriate batch intervals and tuning memory.
- Experienced in fine-tuning Spark jobs using DataFrame repartition and coalesce techniques.
- Experienced in Scala functional programming using closures, currying, and monads.
- Experience with the Hadoop ecosystem: MapReduce, Hive, Pig, Oozie, HBase, and Flume.
- Implemented built-in operators in Spark and PySpark such as map, flatMap, filter, reduceByKey, groupByKey, aggregateByKey, and combineByKey (a short sketch follows this summary).
- Experienced with Git repositories and Maven builds.
- Added security to the cluster by integrating Kerberos.
- Experience working with the Hive data warehouse: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries to analyze large datasets.
- Experience in writing simple to complex Pig scripts for processing and analyzing large volumes of data.
- Experience using Impala on top of Hive for faster, interactive data processing.
- Experience producing Kafka streams and consuming them from Kafka topics with Spark DStreams.
- Extensive hands-on experience with Hadoop file system commands for file handling operations.
- Experience using Spark over MapReduce for faster, more efficient data processing and analytics.
- Experience creating Spark DataFrame transformations using withColumn, withColumnRenamed, and drop operations to modify DataFrame columns.
- Experience writing Spark SQL queries that use anti joins for upsert operations (a second sketch follows this summary).
- Hands-on experience building spark-submit pipelines and generating DAGs with Airflow.
- Experience creating AWS Glue Spark jobs and scheduling their execution.
- Hands-on experience setting up workflows with the Oozie workflow engine for managing and scheduling Hadoop jobs.
- Worked on developing ETL processes to load data from multiple sources into HDFS using Flume and Sqoop, performing structural modifications using Hive, and analyzing data with visualization/reporting tools.
- Experience in analyzing large datasets and deriving actionable insights for process improvement.
- Worked on loading and transforming large sets of structured, semi-structured, and unstructured data.
- Experience working with different file formats such as Avro, Parquet, ORC, SequenceFile, and JSON.
- Background with traditional databases such as Oracle, MySQL, MS SQL Server, PostgreSQL.
- Good understanding of Web Services like SOAP, REST and build tools like SBT, Maven, and Gradle.
- Experience with Jenkins for deployment automation.
- Detail-oriented with strong multitasking and interpersonal skills and the ability to produce high-quality results.
- Good analytical, interpersonal, communication, and problem-solving skills with the ability to quickly master new concepts, capable of working in a group as well as independently.
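
The Spark operator experience above can be illustrated with a minimal Scala sketch; the sample data, field layout, and object name are assumptions for illustration only, not taken from any specific project.

    import org.apache.spark.sql.SparkSession

    object RddOperatorsSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("rdd-operators-sketch")
          .master("local[*]")            // local master only for this sketch
          .getOrCreate()
        val sc = spark.sparkContext

        // Hypothetical input: one whitespace-separated line of words per record.
        val lines = sc.parallelize(Seq("spark kafka hive", "spark hive", "kafka"))

        val counts = lines
          .flatMap(_.split("\\s+"))      // split each line into words
          .filter(_.nonEmpty)            // drop empty tokens
          .map(word => (word, 1))        // pair each word with a count of 1
          .reduceByKey(_ + _)            // combine counts per word on the map side

        // aggregateByKey with a (sum, count) accumulator to derive an average per key.
        val scores = sc.parallelize(Seq(("spark", 10), ("spark", 20), ("hive", 5)))
        val avg = scores
          .aggregateByKey((0, 0))(
            (acc, v) => (acc._1 + v, acc._2 + 1),
            (a, b) => (a._1 + b._1, a._2 + b._2))
          .mapValues { case (sum, n) => sum.toDouble / n }

        counts.collect().foreach(println)
        avg.collect().foreach(println)
        spark.stop()
      }
    }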
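
The anti-join upsert pattern mentioned above can be sketched in Spark SQL as follows; the table and column names here are hypothetical. Rows of the target with no match in the incoming data are kept, and every incoming row either updates or inserts.

    import org.apache.spark.sql.SparkSession

    object AntiJoinUpsertSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("anti-join-upsert-sketch")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Hypothetical target and incoming datasets keyed by "id".
        val target   = Seq((1, "old"), (2, "old")).toDF("id", "value")
        val incoming = Seq((2, "new"), (3, "new")).toDF("id", "value")

        target.createOrReplaceTempView("target")
        incoming.createOrReplaceTempView("incoming")

        // LEFT ANTI JOIN keeps only unmatched target rows; UNION ALL appends
        // the incoming rows, giving the upserted result.
        val upserted = spark.sql(
          """
            |SELECT t.id, t.value
            |FROM target t
            |LEFT ANTI JOIN incoming i ON t.id = i.id
            |UNION ALL
            |SELECT id, value FROM incoming
            |""".stripMargin)

        upserted.orderBy("id").show()   // expected: (1, old), (2, new), (3, new)
        spark.stop()
      }
    }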
TECHNICAL SKILLS
Hadoop Distribution: Cloudera (CDH 4 and 5), Hortonworks
Hadoop Ecosystem: HDFS, Sqoop, Flume, Hive, Pig, Impala, Map Reduce, Spark Core, Spark SQL, Oozie
Databases: Oracle, MySQL, MS SQL Server, PostgreSQL
NoSQL Database: HBase, Cassandra, MongoDB
Data warehouse: Redshift
AWS: S3, EMR, EC2, Athena, Glue
Languages: Scala, Java, Python
Operating System: Windows, UNIX / Linux, Mac
PROFESSIONAL EXPERIENCE
Confidential
Senior Big Data Spark Developer
Responsibilities:
- Generated Spark DataFrames from MongoDB source collections and saved the transformed DataFrames into the Hive database.
- Performed incremental loads into Hive tables using the merge command and fine-tuned the tables with bloom filters.
- Generated Kafka streams produced from SQL Server and HTTP requests, and used Spark Streaming to consume the Kafka streams and write into Cassandra (a sketch follows this list).
- Built Spark ETL jobs that generated DataFrames from Cassandra tables and wrote the transformed tables back into Cassandra.
- Monitored Kafka stream activity (Kafka Connect and topics) and Spark DStream activity.
- Designed Cassandra tables to support Spark DataFrame pushdown filtering on primary, clustering, and secondary keys.
- Developed Spark SQL queries for better Spark job performance.
- Monitored Spark jobs using the Spark UI.
- Scheduled Spark jobs through shell scripts using Autosys.
- Optimized Spark jobs for better run times and to resolve out-of-memory failures.
- Experienced in implementing storage formats like Parquet, Avro, and ORC.
- Experienced in extracting JSON dumps to generate Spark DataFrames.
- Experienced with Git repositories, Maven builds, and deployments using Jenkins.
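
A minimal Scala sketch of the Kafka-to-Cassandra streaming flow described in this role; the broker address, topic, keyspace, table, and message layout are illustrative assumptions, and the Cassandra write relies on the DataStax Spark Cassandra Connector.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
    import com.datastax.spark.connector._

    object KafkaToCassandraSketch {
      def main(args: Array[String]): Unit = {
        // Hypothetical endpoints; a real job would take these from configuration.
        val conf = new SparkConf()
          .setAppName("kafka-to-cassandra-sketch")
          .set("spark.cassandra.connection.host", "127.0.0.1")
        val ssc = new StreamingContext(conf, Seconds(10))   // 10-second batch interval

        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "localhost:9092",
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "orders-consumer",
          "auto.offset.reset"  -> "latest")

        val stream = KafkaUtils.createDirectStream[String, String](
          ssc,
          LocationStrategies.PreferConsistent,
          ConsumerStrategies.Subscribe[String, String](Seq("orders"), kafkaParams))

        // Parse hypothetical "id,amount" messages and persist each micro-batch to Cassandra.
        stream
          .map(record => record.value.split(","))
          .filter(_.length == 2)
          .map(fields => (fields(0), fields(1).toDouble))
          .foreachRDD { rdd =>
            rdd.saveToCassandra("sales", "orders", SomeColumns("id", "amount"))
          }

        ssc.start()
        ssc.awaitTermination()
      }
    }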
Environment: Hortonworks, Ambari, AWS (EMR, EC2, & S3), Scala, Python, Spark, Kafka, Cassandra, MongoDB, Airflow, Jenkins, MS SQL Server, Autosys, and Redshift.
Confidential
Hadoop Spark Scala Developer
Responsibilities:
- Built a Scala Spark framework for generating DataFrames from HDFS and writing the DataFrames to S3 buckets (a sketch follows this list).
- Mapped the Hive tables and designed the data transformations needed to move data to Redshift.
- Copied files of various formats (JSON, CSV, etc.) to HDFS using NiFi.
- Ran and scheduled spark-submit jobs through Oozie, Airflow, and Korn shell scripts.
- Performance-tuned Spark SQL and spark-submit jobs.
- Developed Hive tables using different storage formats and compression techniques.
- Optimized datasets by partitioning and bucketing Hive tables and by performance-tuning Hive queries.
- Created RDDs in Spark and extracted data from the data warehouse onto Spark RDDs.
- Experience working with Spark SQL and creating RDDs using the PySpark SparkContext and SparkSession.
- Experience collecting and moving logs from exec sources to HDFS using Flume.
- Experience in implementing efficient storage formats like Avro, Parquet and ORC.
- Monitored workloads and job performance; managed, reviewed, and troubleshot Hadoop log files.
- Created EMR clusters and designed their termination after successful completion of spark-submit jobs.
- Worked with Spark Structured Streaming and moved data from Hive to the Redshift data warehouse for Spark MLlib workloads.
- Involved in end-to-end data processing: ingestion, processing, quality checks, and splitting.
- Exported the analyzed data to relational databases using Sqoop for visualization and report generation.
- Actively interacted with team members on issue and problem resolution.
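
A minimal Scala sketch of the HDFS-to-S3 DataFrame framework described in this role; the paths, column names, and partitioning choices are illustrative assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object HdfsToS3Sketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("hdfs-to-s3-sketch").getOrCreate()

        // Hypothetical paths; a real job would take these as arguments.
        val sourcePath = "hdfs:///data/raw/events"
        val targetPath = "s3a://example-bucket/curated/events"

        val raw = spark.read.json(sourcePath)

        // Keep only valid rows, add a load date, and repartition before the write
        // so each output partition contains a manageable number of files.
        val curated = raw
          .filter(col("event_id").isNotNull)
          .withColumn("load_date", current_date())
          .repartition(col("load_date"))

        curated.write
          .mode("overwrite")
          .partitionBy("load_date")
          .parquet(targetPath)

        spark.stop()
      }
    }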
Environment: Hortonworks 2.6, Ambari, AWS (EMR, EC2, & S3), Python, Spark, Hive, Pig, Airflow, Flume, Sqoop, Jenkins, HBase, DataStage, and Redshift.
Confidential
Hadoop Spark DevOps Engineer
Responsibilities:
- Imported data from Oracle and MySQL databases into HDFS using Sqoop for further transformation.
- Created Hive tables on top of HDFS and developed Hive queries to analyze the data.
- Implemented AWS EMR clusters to build a Hadoop POC environment.
- Actively involved in SQL and Azure SQL DW code development using T-SQL.
- Developed Hive tables using different storage formats and compression techniques.
- Optimized datasets by partitioning and bucketing Hive tables and by performance-tuning Hive queries (a sketch follows this list).
- Created RDDs in Spark and extracted data from the data warehouse onto Spark RDDs.
- Consumed data from Kafka topics into Spark DStreams; configured different topologies for the Spark cluster and deployed them on a regular basis.
- Monitored Spark jobs in the production environment and took appropriate steps to fine-tune them.
- Used Hadoop file system commands extensively for file handling operations.
- Loaded data from local file systems to HDFS and vice versa using hadoop fs commands.
- Designed and developed ETL workflows in Oozie, automating extraction of data from different databases into HDFS with Sqoop scripts, transformation and analysis in Hive/Pig, and parsing of the raw data with Spark.
- Worked extensively with the Spark Core and Spark SQL modules for faster testing and processing of data.
- Experience working with Spark SQL and creating RDDs using the Scala SparkContext and SparkSession.
- Experience collecting and moving logs from exec sources to HDFS using Flume.
- Troubleshot Azure Data Factory and SQL issues and performance problems.
- Performed component unit testing using the Azure Emulator.
- Analyzed escalated incidents within the Azure SQL database; implemented test scripts to support test-driven development and continuous integration.
- Worked with the Informatica ETL tool, Oracle Database, PL/SQL, Python, and shell scripts.
- Involved in database design, creating tables, views, stored procedures, functions, triggers, and indexes; strong experience in data warehousing and ETL using DataStage.
- Worked on MicroStrategy report development and analysis, providing mentoring, guidance, and troubleshooting to analysis team members solving complex reporting and analytical problems.
- Extensively used filters, facts, consolidations, transformations, and custom groups to generate reports for business analysis.
- Assisted with the design and development of MicroStrategy dashboards and interactive documents using MicroStrategy Web and Mobile.
- Extracted data from SQL Server 2008 into data marts, views, and/or flat files for Tableau workbook consumption using T-SQL. Partitioned and queried the data in Hive for further analysis by the BI team.
- Managed Tableau extracts on Tableau Server and administered Tableau Server.
- Worked extensively on data extraction, transformation, and loading from Oracle to Teradata using BTEQ, FastLoad, and MultiLoad.
- Extensively used the Teradata FastLoad and MultiLoad utilities to load data into tables.
- Used Teradata SQL Assistant to build SQL queries.
- Performed data reconciliation across various source systems and in Teradata.
- Involved in writing complex SQL queries using correlated subqueries, joins, and recursive queries.
- Worked extensively on date manipulations in Teradata.
- Extracted data from Oracle using SQL scripts, loaded it into Teradata using FastLoad/MultiLoad, and transformed it according to business transformation rules to insert/update the data in data marts.
- Handled installation, configuration, and maintenance of the Hadoop cluster, cluster monitoring, troubleshooting, and certifying environments for production readiness.
- Monitored workloads and job performance; managed, reviewed, and troubleshot Hadoop log files.
- Developed workflows in Oozie to automate loading data into HDFS and pre-processing and transforming it with Sqoop scripts, Pig scripts, and Hive queries.
- Created Scala programs to develop reports for business users.
- Involved in end-to-end data processing: ingestion, processing, quality checks, and splitting.
- Exported the analyzed data to relational databases using Sqoop for visualization and report generation.
- Actively interacted with team members on issue and problem resolution.
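
A minimal Scala/HiveQL sketch of the Hive partitioning and bucketing optimization described in this role; the database, table, column names, and bucket count are illustrative assumptions.

    import org.apache.spark.sql.SparkSession

    object HivePartitionBucketSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-partition-bucket-sketch")
          .enableHiveSupport()
          .getOrCreate()

        // Partition by load date so queries prune whole directories,
        // and bucket by customer_id so joins on that key shuffle less data.
        spark.sql(
          """
            |CREATE TABLE IF NOT EXISTS sales.orders_optimized (
            |  order_id     BIGINT,
            |  customer_id  BIGINT,
            |  amount       DOUBLE
            |)
            |PARTITIONED BY (load_date STRING)
            |CLUSTERED BY (customer_id) INTO 32 BUCKETS
            |STORED AS ORC
            |""".stripMargin)

        // Dynamic-partition insert from a hypothetical staging table.
        spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
        spark.sql(
          """
            |INSERT OVERWRITE TABLE sales.orders_optimized PARTITION (load_date)
            |SELECT order_id, customer_id, amount, load_date
            |FROM sales.orders_staging
            |""".stripMargin)

        // A query filtered on load_date scans only that partition's files.
        spark.sql(
          "SELECT customer_id, SUM(amount) FROM sales.orders_optimized " +
          "WHERE load_date = '2020-01-01' GROUP BY customer_id").show()

        spark.stop()
      }
    }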
Environment: CDH (CDH4 & CDH5), Hadoop 2.4, Azure, Spark, PySpark, Hive, Pig, Oozie, Flume, Databricks, Sqoop, Cloudera Manager, Jenkins, Cassandra, MS SQL Server, Oracle RDBMS.