Sr Big Data/Spark Developer Resume
Atlanta, GA
SUMMARY
- Overall 8+ years of experience in Big Data and Software Development.
- Excellent knowledge of Hadoop ecosystem components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, and the MapReduce programming paradigm.
- Expertise with Hadoop ecosystem tools, including Pig, Hive, HDFS, MapReduce, Sqoop, Spark, Kafka, YARN, HBase, Flume, and Oozie.
- Implemented MapReduce programs in Java for data processing and data transformations. Multiple mappers are implemented to handle data from multiple sources.
- Experience in manipulating/analyzing large datasets and finding patterns and insights within structured and unstructured data.
- Experienced in building data pipelines, data ingestion flows using Spark and Hadoop MapReduce.
- Implemented PySpark scripts for data cleansing, transformations, pre-processing, and post-processing using Spark.
- Experienced in writing complex MapReduce programs that work with different file formats like Text, Sequence, XML, JSON, Parquet, and Avro.
- Experience in designing and developing applications in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle (RDBMS).
- Strong experience in writing applications in Python using libraries such as Pandas, NumPy, SciPy, and Matplotlib.
- Good knowledge of Machine Learning algorithms using Python and of concepts such as data preprocessing, Regression, Classification, etc., along with appropriate model selection techniques.
- Experience with the Oozie workflow scheduler to manage Hadoop jobs as a Directed Acyclic Graph (DAG) of actions with control flows.
- Great expertise in SQL databases such as Oracle DB, MySQL, PostgreSQL, and Teradata.
- Good understanding of NoSQL databases and hands-on work experience in writing applications on NoSQL databases like HBase, Cassandra, and MongoDB.
- Experience in writing Spark SQL scripts and improving performance by implementing techniques like partitioning and bucketing on Hive tables.
- Experience in migrating the data from RDBMS to HDFS and vice-versa using Sqoop.
- Extensive experience in importing and exporting data using stream processing platforms like Flume and Kafka.
- Strong experience with Hadoop distributions like Cloudera and Hortonworks.
- Hands-on experience writing custom UDFs to extend Spark, Hive, and Pig functionality.
- Worked in AWS environment for development and deployment of Spark/MapReduce jobs.
- Good working knowledge of Azure Cloud components such as Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, and Azure HDInsight.
- Experience in analyzing, designing, developing ETL strategies, processes, and writing ETL specifications.
- Experience in dealing with various data sources like Oracle PL/SQL, T-SQL, SQL Server 2012, Teradata v15/13.
- Experience on Physical & Logical Data Modeling, Dimensional Modeling using Star and Snowflake Schemas, Data marts, OLAP, FACT & Dimensions tables.
- Experience in creating Mappings and Mapplets transformations like Source Qualifier, Aggregator, Expression, Lookup, Filter, Joiner, Union, Router, Rank, Sequence Generator, SQL, HTTP transformations, and Transaction control using Informatica Power Center 8.6.
- Experienced in using Git to maintain code versions, with knowledge of CI/CD tools such as Bamboo and Jenkins for code build and deployment in production environments.
- Good exposure with Agile software development process and familiar with Sprint concept through different stages of project life cycle which includes development, testing, debugging, documentation, end-user training, and production support.
- Aptitude for analyzing and identifying problems and coming up with out-of-the-box solutions.
- Ability to achieve organizational integration, assimilate job requirements, employ new ideas, concepts, methods, and technologies.
- Excellent communication, interpersonal and analytical skills.
- Self-motivated, quick learner and adaptive to new and challenging technological environments.
TECHNICAL SKILLS
Bigdata Tools: Hadoop (Cloudera, Hortonworks), HDFS, MapReduce, YARN, Apache Spark, Apache Sqoop, Hive, Pig, Flume, Kafka, Impala, Oozie.
Programming: Python, R, C++, C, Java, Scala
Cloud Computing Tools: Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)
Query Languages: SQL, T-SQL, PL/SQL.
Data Modeling: Star Schema, Snowflake Schema, Erwin 4.0, Dimension Data Modeling.
SQL Databases: Oracle 11g, SQL Server 2012, Teradata 15
NoSQL Databases: MongoDB, HBase, Cassandra
Scheduling Tools: Control-M, Autosys, Informatica Scheduler, Oozie
Operating Systems: Windows 7/XP/NT/95/98/2000, UNIX and LINUX
Other Tools: SQL*Plus, Toad, SQL Navigator, Putty, WINSCP, MS-Office, SQL Developer.
PROFESSIONAL EXPERIENCE
Confidential, Atlanta, GA
Sr Big Data/Spark Developer
Environment: Spark (PySpark), Spark SQL, AWS, Python, Cloudera, Azure, Hive, Sqoop, Impala, HDFS, JSON, Linux, Oozie, RDBMS and various file formats.
Responsibilities:
- Implemented Sqoop jobs to load data from RDBMS like MySQL and Teradata into HDFS.
- Developed Python scripts in Spark to load data from different file formats such as Text, CSV, JSON, and Avro files into HDFS.
- Developed Snowpipes for continuous ingestion of data using an event handler from AWS (S3 bucket).
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Created Hive tables on the HDFS data in the data lake and performed data validation on different tables using HiveQL.
- Optimized Hive queries and was involved in performance tuning using advanced techniques like partitioning (static/dynamic), bucketing, and optimized joins.
- Involved in designing and developing enhancements of CSG using AWS APIs.
- Implemented Spark UDFs in Python for data transformations and aggregations.
- Performed wide transformations, narrow transformations and actions like filter, Lookup, Join, Aggregator, count, etc. as per requirement on Spark Data Frames.
- Hands-on experience on monitoring the Spark applications through the Spark UI and identifying executor failures, data skewness and other runtime issues.
- Working with AWS stack S3, EC2, Snowball, EMR, Athena, Glue, Redshift, DynamoDB, RDS, Aurora, IAM, Firehose, and Lambda.
- Improved the job performance by using Spark SQL API for intermediate queries, and Paired RDDs.
- Experience in handling JSON datasets and writing custom Python functions to parse through JSON data using Spark.
- Processed big data with Amazon EMR across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Demonstrated expert level technical capabilities in areas of Azure Batch and Interactive solutions, Azure Machine learning solutions and operationalizing end to end Azure Cloud Analytics solutions.
- Used Control-M to automate the jobs for scheduling, monitoring, and Reporting.
- Resolved Spark and YARN resource management issues, including shuffle issues, out-of-memory issues, heap space errors, and schema compatibility.
- Designed an end-to-end Azure cloud-based analytics dashboard for a state government to show real-time updates for its 2016 state assembly elections. The solution utilized Power BI, Enterprise Gateway, and Azure SQL Server.
- Involved in configuring Hadoop ecosystem components like Hive and Sqoop.
- Involved in story-driven agile development methodology and actively participated in daily scrum meetings.
- Worked extensively on JIRA for task assigning, work logs, and reporting issues on JIRA stories, tasks, and sub-tasks.
Confidential, CO
Sr Big Data Engineer
Environment: Cloudera Manager, HDFS, AWS, Spark, Hive, Azure, Pig, Impala, HiveQL, Kafka, SparkSQL, Oozie.
Responsibilities:
- Collaborate with technology leads and architects in conceptualizing objectives and developing new Hadoop applications to process large datasets in an agile environment.
- Responsible for data extraction and data ingestion from different data sources into HDFS by creating ETL pipelines.
- Create S3 buckets, manage their policies, and utilize S3 and Glacier for storage and backup on AWS.
- Involved in designing data warehouse using Hive external tables for Job Control & Data Quality to track job history and successful completion of production jobs.
- Create Hive external tables for consumption and store data in HDFS partitions as Parquet and Text file formats.
- Implement Kafka scripts and create jobs to import data from RDBMS (Oracle, SQL) into HBase tables using incremental and full refresh imports.
- Built mid to large clusters on AWS cloud using multiple instances of the Amazon EC2 cloud. This was to enable use cases which used distributions of Cloudera Hadoop or Hortonworks Hadoop.
- Conducted numerous training sessions, demonstration sessions on Big Data for various Government and Private sector customers ramping them up on Azure Big Data solutions.
- Develop a Python framework to convert Hive/SQL queries into Spark transformations using Spark RDDs, Spark SQL, and Datasets, and perform aggregations on data stored in memory.
- Develop Python code to gather the data from HBase and implement machine learning solutions using PySpark.
- Design and implement complex HiveQL scripts using Hive CDC, window functions, joins, and partitions to process complex data sets as per business requirements.
- Utilize SparkSQL to extract and process data by parsing using Datasets or RDDs with transformations and actions (map, flatMap, filter, reduce, reduceByKey).
- Develop Spark SQL applications in PySpark to build analytics on incoming claims data.
- Write an exporter module in Go to push YARN application and scheduler data to the Prometheus time series database to generate reports.
- Design and implement a test environment on AWS.
- Develop exporter framework in Python to get scheduler, job history data from YARN REST APIs and store into HBase tables, generate custom alerts to monitor SLA bound Hadoop applications.
- Manage and schedule jobs by defining Hive, Spark, Sqoop, and Python actions on a Hadoop cluster using Oozie workflows and the Oozie Coordinator engine.
- Involved in migration of data from legacy systems to cloud instances such as Azure Databricks.
- Create Spark clusters and configure high-concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
- Monitor and troubleshoot performance of applications, take corrective actions in case of failures, and evaluate possible enhancements to meet SLAs.
Confidential, GA
Data Engineer
Environment: Hadoop, MapReduce, Hortonworks Distribution, ORC, Ambari, Pig, HDFS, Sqoop, SQL, HBase, Eclipse, Autosys, AWS, Azure
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop and migrating legacy Retail applications' ETL to Hadoop.
- Created MapReduce jobs for Log Analysis, Recommendation, and Analytics.
- Enabled CloudWatch and Ganglia on large AWS-based cloud clusters for effective monitoring of the clusters.
- Handled HDFS data and created external tables using Hive, in order to analyze visitors per day, page views and most purchased products.
- Installed a Hortonworks Hadoop cluster on Confidential Azure cloud in the UK region to satisfy the customer’s data locality needs.
- Deployed the Big Data Hadoop application using Talend on AWS (Amazon Web Services) and also on Microsoft Azure.
- Composed Pig Scripts to create MapReduce jobs and performed ETL procedures on the data in HDFS.
- Utilized MapReduce and Sqoop to load, combine, store and examine web log data from distinctive web servers.
- Reviewed the HDFS usage and system design for future scalability and fault-tolerance.
- Ensured timely completion of the unit and integration testing effort on the project by collaborating with business SMEs/IT, interface teams, and stakeholders.
- Developed a data pipeline using Kafka to store data into HDFS and performed real-time analytics on the incoming data.
- Generated Pig Latin scripts to sort, group, join and filter the enterprise data.
- Widely used Pig for data cleansing.
- Generated Data model for Hive tables.
- Responsible for managing data originating from diverse sources.
- Used AutoSys to automate the jobs for scheduling, monitoring and Reporting.
- Executed tasks like Installation, Configuration, and Upgrade of LINUX operating systems and SQL database.
- Prepared Project maintenance, Test synopsis, Test result, Test case, and Go-Live plan documents for project release.
Confidential
Informatica ETL Developer
Environment: Informatica Power Center 8.6.1/8.1.1, Oracle 10g, TOAD 10.1 for Oracle, DB2, Flat Files, PL/SQL, OBIEE 11g, ERWIN, Windows 2000, UNIX PERL scripting, Control-M
Responsibilities:
- Designed, developed, and documented the ETL (Extract, Transform, and Load) strategy to populate the Data Mart from the various source systems.
- Worked on Informatica 8.6.1 client tools like Source Analyzer, Warehouse Designer, Mapping Designer, Workflow Manager, and Workflow Monitor.
- Involved in design and development of complex ETL mappings.
- Implemented partitioning and bulk loads for loading large volumes of data.
- Based on the requirements, used various transformations like Source Qualifier, Normalizer, Expression, Filter, Router, Update Strategy, Sorter, Lookup, Aggregator, Joiner, and Sequence Generator in the mappings.
- Developed Mapplets, Worklets and Reusable Transformations for reusability.
- Identified performance bottlenecks and was involved in performance tuning of sources, targets, mappings, transformations, and sessions to optimize session performance.
- Identified bugs in existing mappings/workflows by analyzing the data flow and evaluating transformations.
- Performed performance tuning using session partitions, dynamic cache memory, and index cache.
- Developed Informatica SCD Type-I and Type-II mappings. Extensively used almost all of the Informatica transformations, including complex Lookups, Stored Procedures, Update Strategy, Mapplets, and others.
- Implemented update strategies, incremental loads, Data capture and Incremental Aggregation.
- Created procedures, functions, packages, and triggers using PL/SQL.
- Created UNIX Shell scripts to automate the process.
- Developed Documentation for all the routines (Mappings, Sessions, and Workflows).
- Involved in scheduling the workflows through Autosys Job scheduler using UNIX scripts.
- Played a key role in all the testing phases and was responsible for production support as well.
