
Spark Developer Resume


New York, NY

SUMMARY:

  • Over 10 years of professional IT experience in analyzing, designing, developing, testing, implementing, and maintaining data warehouse business systems using ETL tools and data processing/storage technologies such as Oracle, Teradata, and Big Data/Hadoop technologies.
  • Experience with the Hadoop Distributed File System (HDFS), Impala, Hive, HBase, Spark, Hue, the MapReduce framework, Sqoop, and YARN.
  • Exposure to design and development of database driven systems.
  • Good knowledge of Hadoop architectural components such as the Hadoop Distributed File System, NameNode, DataNode, TaskTracker, JobTracker, and MapReduce programming.
  • Experience in developing and deploying applications using Hadoop-based components such as YARN (MR2), HDFS, Hive, Pig, HBase, Flume, Sqoop, Spark, Oozie, ZooKeeper, Impala, and Hue.
  • Hands on experience in importing and exporting data into HDFS and Hive using Sqoop.
  • Experience in working with Informatica BDM.
  • Good data migration project experience.
  • Exposure to column-oriented NoSQL databases such as HBase and Cassandra.
  • Extensive experience working with structured, semi-structured, and unstructured data by implementing complex MapReduce programs using design patterns.
  • Excellent knowledge of multiple platforms such as Cloudera and Hortonworks.
  • Experienced in Apache Hadoop MapReduce programming, Pig scripting, Hive warehousing, and distributed applications.
  • Familiar with data architecture including data ingestion pipeline design, Hadoop information architecture, data modeling and data mining, machine learning and advanced data processing.
  • Experience using various Hadoop Distributions (Cloudera, Hortonworks) to fully implement and leverage new Hadoop features.
  • Strong experience in architecting batch-style, large-scale distributed computing applications using tools like Spark SQL, Spark DataFrames, and Hive (see the sketch after this summary).
  • Experience in working on Microsoft Azure.
  • Knowledge of NoSQL and relational databases, including HBase and MySQL.
  • Experience in designing, developing and implementing connectivity products that allow efficient exchange of data between our core database engine and the Hadoop ecosystem.
  • Experienced in job workflow scheduling and monitoring tools like Oozie.
  • Advanced Unix shell scripting, SQL, AWK, and sed programming.
  • Extensive experience in Analysis, Design, Data Extraction, Cleansing, Transformation and Loading into Data Marts.
  • Experience in Dimensional Data Modeling (Star Schema, Snow-Flake Schema) Data Architecture, Business and Data Analysis.
  • Experience in writing C++ routines.
  • Designed Technical Design Specifications and Mapping Documents with Transformation Rules.
  • Extensively worked on Infosphere DataStage Parallel Extender Edition.
  • Used both Pipeline and Partition Parallelism for improving performance.
  • Experience in developing Parallel jobs using various stages like Join, Merge, Lookup, Funnel, Sort, Transformer, Copy, Remove Duplicate, Filter, Pivot and Aggregate stages for grouping and summarizing on key performance indicators used in decision support systems.
  • Very good experience with the Oracle Connector stage.
  • Used IBM Change Data Capture (CDC) to capture the updates.
  • Frequently used Peek, Row Generator and Column Generator Stages to debug.
  • Expertise in Software Development Life Cycle (SDLC) of Projects - System study, Analysis, Physical and Logical design, Coding and implementing business applications.
  • Expertise in performing data migration from various legacy systems to target databases.
  • Expertise in Data Modeling, OLAP/ OLTP Systems, generation of Surrogate Keys.
  • In depth knowledge of Star Schema, Snowflake Schema, Dimensional Data Modeling, Fact and Dimension tables.
  • Experience in Data Warehouse development, worked with Data Migration and ETL using IBM Infosphere DataStage with Oracle, Teradata.
  • Experience in creating the ETL mapping documents for Extraction, Transformation and loading data into data Warehouse.
  • Extensive experience in development, debugging, troubleshooting, monitoring and performance tuning using Infosphere DataStage Designer, Director, and Administrator.
  • Prepared job sequences and job schedules to automate the ETL processes.
  • Experience in handling multiple relational databases like Oracle and Teradata for Extraction, Staging and Production data warehouse environments.
  • Experience with Tivoli workload Scheduler and ZENA Scheduler.
  • Experience with UNIX Shell Scripting for Data Validations and Scheduling the Infosphere DataStage Jobs.
  • Proficient in development methodologies such as Agile and Waterfall.
  • Used Infosphere DataStage Version Control to promote Infosphere DataStage jobs from Development to Testing and then to Production Environment.
  • Strong analytical, problem solving and leadership skills and has ability to interact with various levels of management to understand requests and validate job requirements.
  • Team player with strong ability to quickly adapt to any dynamic developments in projects and capable of working in groups as well as independently.
  • Trained in the Informatica tool.
  • Trained in BI / Reporting tools like Cognos.
  • Good understanding of AWS EMR.
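The Spark SQL and DataFrame work summarized above follows the usual batch pattern: read a Hive table, transform it, and write the result back to HDFS. Below is a minimal sketch of that pattern in Spark Scala; the table, column, and path names (warehouse.sales, region, amount, /data/marts/daily_sales_rollup) are hypothetical placeholders rather than details from the actual projects.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    object DailySalesRollup {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("DailySalesRollup")
          .enableHiveSupport()          // lets Spark read/write Hive metastore tables
          .getOrCreate()

        // Read a Hive table as a DataFrame (placeholder table name)
        val sales = spark.table("warehouse.sales")

        // Aggregate with the DataFrame API (equivalent to a GROUP BY in Hive)
        val rollup = sales
          .groupBy("region", "sale_date")
          .agg(sum("amount").alias("total_amount"))

        // Write the result back to HDFS as Parquet
        rollup.write.mode("overwrite").parquet("/data/marts/daily_sales_rollup")

        spark.stop()
      }
    }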

TECHNICAL SKILLS:

Big Data: Apache Hadoop, Hive, HDFS, Spark, MapReduce, Sqoop, Zookeeper, Scala, Databricks

Data Mirror: IBM Change Data Capture (CDC)

Data modeling tools: Erwin.

Cloud Technologies: Microsoft Azure, Google Cloud, Amazon S3.

Operating Systems: Windows XP, 7, UNIX, AIX, Linux

Ascential Software: Infosphere DataStage (versions 8.0.1, 8.1, 8.7, 11.5) Parallel Extender, C++ Routines

Languages: SQL, UNIX (AIX) shell scripting, AWK programming, sed programming, Scala, Python, Java

Databases: Oracle 11g/10g/9i, Teradata, Netezza, DB2, Vertica

NoSQL: HBase, Cassandra

Applications and Tools: Microsoft Office (Excel, Word, PowerPoint)

Other: SQL Developer, SQL Plus, Edit Plus, Putty, Toad, MyEclipse, Squirrel

Scheduler: Tivoli (9.1) Scheduling and Monitoring, ZENA Scheduler, Control-M

Version Control tools: SVN

CI tools: Git, BitBucket, Maven

Migration Tools: CLM

BI tool: Cognos, Tableau

Ticketing Tools: CA Service Desk (CASD), Service Now (SNOW)

Hadoop Distributions: Cloudera, Hortonworks, MapR

PROFESSIONAL EXPERIENCE:

Confidential, New York, NY

Spark Developer

Responsibilities:

  • Optimized the loyalty account Pig script; post-optimization, execution time came down from 5 hours to 1.5 hours, which also reduced the number of MapReduce job initializations.
  • Modularized and optimized the long-running Hive scripts; identified the SORs (Systems of Record) and applied parallelism in the event engine (scheduling tool).
  • Performed all code migrations across the E1, E2, and E3 environments using the CI/CD XLR process.
  • Designed a new data model to redesign the existing Spark framework.
  • Processed up to 20 billion records daily using a Spark framework in Scala.
  • Calculated the number of executors and the memory required for Spark jobs.
  • Optimized long-running Spark jobs affected by data skew.
  • Designed the framework to load data from Hive into the Cassandra server.
  • Developed UDFs in Spark based on requirements (see the sketch after this list).
  • Experience working on Maven projects.
  • Used the Spark API over MapR Hadoop YARN to perform analytics on data in Hive.
  • Implemented Spark jobs using Scala and Spark SQL for faster testing and processing of data.
  • Implemented Apache Pig scripts to load data from and store data into Hive.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark SQL and Scala.
  • Analyzed the SQL scripts and designed the solution in Spark with Scala.
  • Experience working in an Agile environment.
  • Experience using different file formats like ORC and Parquet, and compression formats like Snappy.
  • Migrated Elasticsearch 5.x to Elasticsearch 7.x and converted the Spark Scala scripts to support it.
  • Established SSL authentication to Elasticsearch 7.x from Spark.
  • Created Elasticsearch 7.x indexes and mappings from Spark using HttpClient request and response libraries.
  • Loaded data into Elasticsearch 7.x using the Elastic load library ESload.
  • Designed ETL pipelines from Spark to Amazon S3.
  • Transferred ETL data into AWS EC2, S3 and EMR
  • Created Spark Utility to connect to Amazon S3 and Google Cloud Storage
  • Strictly followed the AWS DevOps model for CI/CD integration.
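A minimal sketch of the Spark UDF pattern noted above: a small Scala function registered for use from Spark SQL. The loyalty-tier rule, the function name loyalty_tier, and the table loyalty.accounts are illustrative assumptions, not the production logic.

    import org.apache.spark.sql.SparkSession

    object LoyaltyTierUdf {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("LoyaltyTierUdf")
          .enableHiveSupport()
          .getOrCreate()

        // Plain Scala function carrying a hypothetical business rule
        val tier = (points: Long) =>
          if (points >= 10000L) "PLATINUM"
          else if (points >= 1000L) "GOLD"
          else "STANDARD"

        // Register the function so it can be called from Spark SQL
        spark.udf.register("loyalty_tier", tier)

        // Use the UDF inside a SQL statement over a Hive table (placeholder name)
        val accounts = spark.sql(
          """SELECT account_id, loyalty_tier(points) AS tier
            |FROM loyalty.accounts""".stripMargin)

        accounts.show(20, truncate = false)
        spark.stop()
      }
    }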

Environment: Hive, SQL, Pig, Scala, Python, Shell Scripting, Unix Scripting, Spark, MapR, Cassandra, Java, Parquet, ORC, Hadoop, HDFS, AWS S3, Elasticsearch 7.x, Google Cloud.

Confidential, Atlanta, GA

Spark Developer

Responsibilities:

  • Wrote Sqoop scripts to load incremental data into HDFS in Avro file format; identified the split-by columns and the number of mappers needed to ingest data from SQL Server into the Azure Data Lake.
  • Coordinated with ETL developers on the preparation of Hive and Pig scripts.
  • Proposed and designed the architecture and framework of the HDFS landing-zone layer to build the data lake in the Azure cloud.
  • Ingested more than 50 tables into the HDFS landing zone using Apache Sqoop.
  • Created Hive external tables on top of Avro files.
  • Experience working with the Hive ORC file format; created partitions and buckets on the external Hive tables.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Spark SQL, Spark DataFrames, Scala, and Python.
  • Implemented SCD Type 2 in Spark as part of a POC (see the sketch after this list).
  • Designed and implemented Spark jobs to support distributed data processing.
  • Implemented a Python application to take backups of the Hive schema.
  • Defined a framework to create the Hive tables with properties for faster query processing.
  • Developed Python modules to connect to Hive.
  • Very good understanding of Hive partitioning and bucketing; designed both managed and external tables in Hive to optimize performance.
  • Good understanding of the Hive ORC file format structure and data storage layout.
  • Good knowledge of and experience implementing advanced Hive ORC features such as stripes, row-index strides, and Bloom filters.
  • Worked with SCRUM team in delivering agreed user stories on time for every sprint.
  • Trained in and followed Agile methodology for the entire project.
  • Optimized Hive queries using best practices and the right parameters, with technologies like Hadoop, YARN, Python, and PySpark.
  • Worked on reading and writing multiple data formats like ORC, Parquet on HDFS using PySpark.
  • Analyzed the SQL scripts and redesigned them using PySpark SQL for faster performance.
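A rough outline of the SCD Type 2 proof of concept mentioned above, written in Spark Scala for consistency with the other sketches in this resume: compare the incoming snapshot against the open dimension rows, close out rows whose tracked attribute changed, and emit new current versions. All table, column, and path names here are hypothetical.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object ScdType2Poc {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("ScdType2Poc")
          .enableHiveSupport()
          .getOrCreate()

        // Current (open) dimension rows and the incoming source snapshot (placeholder tables)
        val current  = spark.table("dw.dim_customer").filter(col("is_current") === true)
        val incoming = spark.table("staging.customer_snapshot")

        // Rows whose tracked attribute changed: each drives one "expire" and one "insert"
        val changed = incoming.alias("n")
          .join(current.alias("o"), col("n.customer_id") === col("o.customer_id"))
          .filter(col("n.address") =!= col("o.address"))

        // 1) Expired versions of the old rows, closed out as of today
        val expired = changed.select(
          col("o.customer_id"), col("o.address"), col("o.eff_date"),
          current_date().alias("end_date"), lit(false).alias("is_current"))

        // 2) New current versions carrying the changed attribute
        val newVersions = changed.select(
          col("n.customer_id"), col("n.address"),
          current_date().alias("eff_date"),
          lit(null).cast("date").alias("end_date"), lit(true).alias("is_current"))

        // Write both sets to a staging path; merging them back into the dimension
        // table is left to a separate load step in this sketch.
        expired.unionByName(newVersions)
          .write.mode("overwrite").parquet("/data/staging/dim_customer_scd2_delta")

        spark.stop()
      }
    }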

Environment: Hadoop HDFS, Pig, Sqoop, HBase, Hive, MapReduce, Cassandra, Oozie, MySQL, Java, PySpark, Python

Confidential, Atlanta, GA

Big data Developer

Responsibilities:

  • Analyzed large data sets by running Hive queries and Pig scripts
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
  • Experienced in defining job flows using Oozie
  • Responsible for managing data coming from different sources and applications.
  • Good Knowledge of analyzing data in HBase using Hive and Pig.
  • Involved in Unit level and Integration level testing.
  • Prepared design documents and functional documents.
  • Involved in running Hadoop jobs for processing millions of records of text data
  • Involved in loading data from local file system (LINUX) to HDFS
  • Assisted in exporting analyzed data to relational databases using Sqoop
  • Submitted a detailed report of daily activities on a weekly basis.

Environment: Hadoop HDFS, Pig, Sqoop, HBase, Hive, Flume, MapReduce, Cassandra, Oozie, MySQL, Java

Confidential, Atlanta, GA

Hadoop Developer

Responsibilities:

  • Loaded data into Parquet tables by applying transformations using Impala.
  • Experienced in working with the Spark ecosystem using PySpark and Hive/Impala queries on different data formats like text files and Parquet.
  • Utilized SQL scripts for supporting existing applications.
  • Executed parameterized Pig, Hive, Impala, and UNIX batches in production.
  • Created Hive/Pig scripts for ETL purposes.
  • Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts in Spark, and effective, efficient joins and transformations during the ingestion process itself (see the sketch after this list).
  • Loaded and transformed data into HDFS from large sets of structured data using IBM DataStage.
  • Migrated Oracle code to Hive/Impala.
  • Optimized and performance-tuned HiveQL; formatted table columns using Hive functions.
  • Involved in designing and creating Hive tables to load data into Hadoop and in processing tasks such as merging, sorting, and joining tables.
  • Responsible for all the data flow and quality of data.
  • Automated the process using UNIX and prepared standardized jobs for production.
  • Involved in design and implementation of standards for development, testing and deployment.
  • Created Hive/Impala/Pig scripts, identified parameters as per requirements to apply transformations, and performed unit testing on data as per design.
  • Designed, developed, and unit-tested scripts for data items using Hive/Impala.
  • Assisted in the design and development of ETL procedures per business requirements for the financial domain.
  • Used REST API commands to load CSV files into Hadoop HDFS.
  • Implemented slowly changing dimension logic in Impala.
  • Worked on troubleshooting Kerberos login issues in production.
  • Designed an ETL procedure to REFRESH the Impala tables after data loads.
  • Experience handling Hive external and managed tables in Hadoop.
  • Experienced in running queries using Impala and used BI tools to run ad hoc queries directly on Hadoop.
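One of the join techniques referenced above, the broadcast join, can be sketched as follows: when one side of a join is a small lookup table, broadcasting it to every executor avoids shuffling the large fact table. The table names and paths are hypothetical, and the example is in Spark Scala for consistency with the other sketches in this resume.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    object BroadcastJoinExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("BroadcastJoinExample")
          .enableHiveSupport()
          .getOrCreate()

        val transactions = spark.read.parquet("/data/raw/fct_transactions") // large fact data (placeholder path)
        val branches     = spark.table("ref.dim_branch")                    // small lookup table (placeholder name)

        // Broadcast hint: ships the small table to every executor so the join
        // is performed map-side without shuffling the large fact table.
        val enriched = transactions.join(broadcast(branches), Seq("branch_id"))

        enriched.write.mode("overwrite").parquet("/data/curated/transactions_enriched")
        spark.stop()
      }
    }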

Environment: Hadoop, HDFS, Hive, Python, Scala, Spark 2.0, SQL, Teradata, UNIX Shell Scripting, HUE, IBM DataStage, Sqoop, HDInsight, Microsoft Azure, Cloudera, Hortonworks

Confidential, Atlanta, GA

Hadoop/Spark Developer

Responsibilities:

  • Expert in implementing advanced procedures like text analytics and processing using in-memory computing frameworks such as Apache Spark, written in Scala.
  • Developed and executed shell scripts to automate the jobs
  • Wrote complex Hive queries and UDFs.
  • Worked on reading multiple data formats on HDFS using PySpark.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
  • Developed multiple POCs using PySpark, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL/Teradata.
  • Analyzed the SQL scripts and designed the solution to implement using PySpark.
  • Involved in loading data from UNIX file system to HDFS.
  • Migrated data from the ODS to the Hadoop data lake using automation scripts.
  • Extracted the data from Teradata into HDFS using Sqoop
  • Handled importing of data from various data sources, performed transformations using Hive, MapReduce, Spark and loaded data into HDFS.
  • Used the Talend tool to perform transformations and for data ingestion into HDFS.
  • Managed and reviewed Hadoop log files.
  • Involved in analysis, design, testing phases and responsible for documenting technical specifications
  • Very good understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance (see the sketch after this list).
  • Worked extensively on the core and Spark SQL modules of Spark.
  • Experienced in running Hadoop streaming jobs to process terabytes of data.
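A short illustration of the Hive partitioning and bucketing design mentioned above, expressed through Spark for consistency with the other sketches here: an external table partitioned by load date, plus a bucketed managed table written with Spark's DataFrame writer (Spark's own bucketing; Hive-style bucketing would instead use CLUSTERED BY in the DDL). The database, table, column names, and locations are hypothetical.

    import org.apache.spark.sql.SparkSession

    object HiveTableLayout {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("HiveTableLayout")
          .enableHiveSupport()
          .getOrCreate()

        // External table partitioned by load date: queries that filter on
        // load_date scan only the matching partition directories.
        spark.sql(
          """CREATE EXTERNAL TABLE IF NOT EXISTS dw.customer_events (
            |  customer_id BIGINT,
            |  event_type  STRING,
            |  event_ts    TIMESTAMP
            |)
            |PARTITIONED BY (load_date STRING)
            |STORED AS ORC
            |LOCATION '/data/dw/customer_events'""".stripMargin)

        // Managed table bucketed on the join key via the DataFrame writer,
        // which spreads rows across a fixed number of files per bucket column value hash.
        spark.table("staging.customer_events_raw")   // placeholder source table
          .write
          .bucketBy(32, "customer_id")
          .sortBy("customer_id")
          .saveAsTable("dw.customer_events_bucketed")

        // Filtering on the partition column prunes the scan to one partition.
        spark.sql(
          "SELECT event_type, COUNT(*) AS events FROM dw.customer_events " +
            "WHERE load_date = '2021-01-15' GROUP BY event_type").show()

        spark.stop()
      }
    }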

Environment: Hadoop, HDFS, Hive, Python, Scala, Spark, SQL, Teradata, UNIX Shell Scripting, Talend

Confidential

Senior ETL/Infosphere DataStage Developer

Responsibilities:

  • Created Infosphere DataStage job templates for Change Data Capture and Surrogate Key Assignment processes.
  • Provide the development team with Infosphere DataStage job templates that would be used as a standard for ETL development across the project.
  • Wrote shell scripts to automate DataStage jobs.
  • Conducted a detailed analysis of relevant heterogeneous data sources like flat files, Oracle to use for the ETL process.
  • Developed jobs based on the given functional design documents; handled coding and testing and supported user acceptance testing.
  • Involved in production implementation process.
  • Used the DataStage Designer to develop processes for extracting data.
  • Implemented and maintained the ETL (extraction, transformation, cleansing) process using DataStage and developed the jobs used in it.
  • Performed import and export of DataStage components and table definitions using DataStage Manager.
  • Created batches for different regions, scheduled to perform data loading.
  • Involved in Unit testing and preparation of test cases for the developed jobs.

Environment: WebSphere (Infosphere DataStage/PX 8.0.1) on UNIX, Oracle for the data warehouse, Shell Scripting, Teradata, SQL Developer.

Confidential

ETL/Infosphere DataStage Developer

Responsibilities:

  • Responsible for development, maintenance, and support of ETL templates and jobs to load source system data into Oracle warehouse and mentoring of team members.
  • Utilize Infosphere DataStage to design, build, and support jobs used to populate Oracle warehouse.
  • Manage migrations from development to test to production.
  • Create Tivoli jobs and scheduling to run cycles in test and production.
  • Manage all operations of the data warehouse.
  • Worked on the Netezza appliance to extract and load data into the DW.
  • Create Infosphere DataStage job templates for developers to follow when creating new jobs.
  • Mentor team of 8 developers on usage of Infosphere DataStage to build, run, and test jobs and sequences.
  • Analyze and determine jobs and sequences that need to be modified or created and assign such to development team members.
  • Create/set up new Infosphere DataStage projects, including Unix directories and environment variables.
  • Track team members' assignments, completion dates, and progress using an Excel spreadsheet.

Environment: DataStage/Server 8.1 on AIX and Oracle Databases for source and Data Warehouse, Netezza
