Hadoop Architect/Lead Spark Developer Resume
Sunnyvale, CA
SUMMARY:
- Hadoop Architect/Lead Spark Developer with 11+ years of IT experience in analysis, design, development, deployment, and maintenance of Java/Python/Scala applications on the Big Data/Hadoop ecosystem (YARN, HDFS, Hive, Impala, Pig, Flume, Oozie, Sqoop, Zookeeper, Spark batch/streaming, Kafka, and MapReduce) across domains such as Finance, Insurance, Retail, Banking, Healthcare, and eCommerce.
- Experience in Hadoop/Big Data technologies such as Hadoop, Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Storm, Flink, Flume, Impala, Kafka, and Spark, with hands-on experience writing MapReduce/YARN and Spark/Scala jobs.
- Good IT experience with special emphasis on analysis, design, development, and testing of ETL methodologies across all phases of data warehousing.
- Expertise in OLTP/OLAP system study, analysis, and E-R modeling, developing database schemas such as Star and Snowflake schemas used in relational and dimensional modeling.
- Experience in optimizing and performance tuning of Mappings and implementing the complex business rules by creating re-usable Transformations, Mapplets and Tasks.
- Solid experience in the cloud with Amazon Web Services (AWS: EC2, S3, CloudWatch, RDS, EMR, SNS) and Google Cloud Platform (GCP).
- Responsible for developing data pipelines using Flume, Sqoop, and Pig to extract data from weblogs and store it in HDFS.
- Queried Vertica and SQL Server for data validation and developed validation worksheets in Excel to validate the Tableau dashboards.
- Knowledge of GCP Cloud Pub/Sub, microservices event sourcing, and Cloud Functions.
- Experience in implementing, migrating, and deploying workloads on Azure VMs.
- Developed Spark code using Scala and Spark-SQL for faster testing and data processing.
- Extensively used SQL and PL/SQL for development of Procedures, Functions, Packages and Triggers.
- Experienced with Tableau Desktop and Tableau Server, with a good understanding of Tableau architecture.
- Experienced in integrating Kafka with Spark Streaming for high-speed data processing.
- Experience in implementing AWS solutions using EC2 and S3, and Azure Storage.
- Experienced in developing business reports by writing complex SQL queries using views, macros, volatile and global temporary tables.
- Working with the AWS team to test our Apache Spark ETL application on EMR/EC2 using S3.
- Experience in designing both time driven and data driven automated workflows using Oozie.
- Experienced with workflow schedulers and data architecture, including data ingestion pipeline design and data modeling.
- Configured Elasticsearch on Amazon Web Services with static IP authentication security features.
- Experience with the AWS Cloud platform and its features, including EC2, AMI, EBS, CloudWatch, AWS Config, Auto Scaling, IAM user management, and AWS S3.
- Managed AWS EC2 instances utilizing Auto Scaling, Elastic Load Balancing and Glacier for our QA and UAT environments as well as infrastructure servers for GI.
- Working as a Big Data Architect for the last 4 years with a strong background in the big data stack (Spark, Scala, Hadoop, Storm, HDFS, MapReduce, Kafka, Hive, Cassandra, Python, Sqoop, and Pig), covering both batch and streaming workloads.
- Hands-on experience with Apache Spark and its components (Spark Core and Spark SQL).
- Experienced in converting HiveQL queries into Spark transformations using Spark RDDs and Scala (see the sketch after this list).
- Hands-on experience in in-memory data processing with Apache Spark.
- Developed Spark scripts using the Scala shell as per requirements.
- Experience in designing and developing POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
- Broad understanding and experience of real-time analytics and batch processing using Apache Spark.
- Hands-on experience in AWS (Amazon Web Services), Cassandra, Python, and cloud computing.
- Experience with agile development methodologies like Scrum and Test-Driven Development, Continuous Integration
- Ability to translate business requirements into system design
- Experience in importing and exporting data between HDFS and RDBMS/non-RDBMS systems using Sqoop.
- Solved performance issues in Hive and Pig scripts with an understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.
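Below is a minimal PySpark sketch of the HiveQL-to-Spark conversion mentioned above (the work itself used Scala as well; PySpark is shown since Python is also listed). The table and column names (sales, region, amount) are hypothetical placeholders, not taken from any project above.

```python
# Minimal sketch: a HiveQL aggregation rewritten as Spark DataFrame transformations.
# Assumes a Hive table named "sales" is registered in the metastore (hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hiveql-to-spark")
         .enableHiveSupport()      # read Hive tables through the metastore
         .getOrCreate())

# Original HiveQL:
#   SELECT region, SUM(amount) AS total FROM sales GROUP BY region
hive_df = spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

# Same logic expressed as DataFrame transformations
df_total = (spark.table("sales")
            .groupBy("region")
            .agg(F.sum("amount").alias("total")))

df_total.show()
```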
PROFESSIONAL EXPERIENCE:
Confidential, Sunnyvale, CA
Hadoop Architect/Lead Spark Developer
Responsibilities:
- Worked on different tools for Presto to process large datasets.
- Worked on core tables of the Revenue Data Feed (RDF), which calculates revenue for Facebook advertisers.
- Involved in testing and migration to Presto.
- Worked extensively with data migration, data cleansing, data profiling, and ETL processes for data warehouses.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Experienced with the tools in the Hadoop ecosystem, including Pig, Hive, HDFS, Sqoop, Spark, YARN, and Oozie.
- Used Python with the Spark Python API (PySpark) to create and analyze Spark DataFrames (see the sketch after this list).
- Involved in importing real-time data into Hadoop using Kafka and implemented a daily Oozie job.
- Experienced in writing complex SQL Queries, Stored Procedures, Triggers, Views, Cursors, Joins, Constraints, DDL, DML and User Defined Functions to implement the business logic.
- Developed a custom ETL solution with batch-processing and real-time data ingestion pipelines to move data in and out of Hadoop using Python and shell scripts.
- Experience in Large Data processing and transformation using Hadoop-Hive and Sqoop.
- Built real-time predictive analytics capabilities using Spark Streaming, Spark SQL, and Oracle Data Mining tools.
- Experience with Tableau for Data Acquisition and visualizations.
- Working with the AWS team to test our Apache Spark ETL application on EMR/EC2 using S3.
- Assisted in data analysis, star schema data modeling and design specific to data warehousing and business intelligence environment.
- Have been using PySpark with Jupyter in Docker containers.
- Expertise in platform related Hadoop Production support tasks by analyzing the job logs.
- Have been using Spark with Python, working with RDDs in the PySpark module, which communicates with the JVM through the Py4J library.
- Monitored System health and logs and responded accordingly to any warning or failure conditions.
- Hands-on experience in Spark, Cassandra, Scala, Python, and Spark Streaming, creating RDDs and applying operations (transformations and actions).
- Used Hive to analyze partitioned and bucketed data and compute various metrics for reporting.
- Handled importing of data from various data sources, performed transformations using Hive, Spark and loaded data into HDFS.
- Developed Spark code and Spark SQL for faster testing and processing of data.
- Snapped the cleansed data to the analytics cluster for business reporting.
- Hands on experience on AWS platform with S3 & EMR.
- Experience working with different data formats such as flat files, ORC, Avro, and JSON.
- Automated business reports on the data lake using Bash scripts in UNIX and delivered them to business owners.
- Provided design recommendations and thought leadership to sponsors/stakeholders that improved review processes, resolved technical problems, and suggested solutions.
- Hands-on expertise in running Spark and Spark SQL.
- Experienced in analyzing and optimizing RDDs by controlling partitions for the given data.
- Worked on the MapR Hadoop platform to implement Big Data solutions using Hive, MapReduce, shell scripting, and Java technologies.
- Used Struts (MVC) for the implementation of business model logic.
- Evaluated deep learning algorithms for text summarization using Python, Keras, TensorFlow, and Theano on a Cloudera Hadoop system.
- Experienced in querying data using Spark SQL on top of Spark engine.
- Experience in managing and monitoring Hadoop cluster using Cloudera Manager.
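A minimal sketch of the PySpark DataFrame creation and analysis referenced above. The HDFS path and column names (user_id, event_ts) are hypothetical placeholders, not taken from the actual project.

```python
# Minimal sketch: create a Spark DataFrame from semi-structured data and run
# a few simple analyses on it. Paths and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-analysis").getOrCreate()

# Load JSON event data from HDFS (placeholder path)
events = spark.read.json("hdfs:///data/raw/events/")

# Basic profiling: row count and distinct users
print("rows:", events.count())
print("distinct users:", events.select("user_id").distinct().count())

# Daily event counts as a DataFrame transformation
daily = (events
         .withColumn("event_date", F.to_date("event_ts"))
         .groupBy("event_date")
         .count()
         .orderBy("event_date"))
daily.show()
```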
Environment: Amazon Web Services, Vertica, Informatica PowerCenter, PySpark, Spark, AWS, Kafka, AWS S3, Apache Hadoop, Hive, Pig, Shell Script, ETL, Tableau, Agile Methodology.
Confidential, Richmond, VA
Hadoop Architect/Lead Spark Developer
Responsibilities:
- Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
- Real-time experience with the Hadoop Distributed File System, the Hadoop framework, and parallel processing implementations (AWS EMR, Cloudera), with hands-on experience in HDFS.
- Used the AWS Glue Data Catalog for storing the schema/metadata of Hive external tables.
- Used long-running AWS EMR clusters for processing Spark jobs and AWS S3 buckets for storing data.
- Created Hive internal/external tables and used the metastore for storing metadata.
- Used Sqoop jobs for ingestion from various sources such as Oracle, Salesforce, and SAS.
- Worked on various complex SQL queries against the Oracle source database.
- Worked with CloudFormation templates and CI/CD tools like Concourse to automate the data pipeline.
- Worked on Auto Scaling and CloudWatch monitoring creation and updates via the AWS CLI.
- Designed a Data Quality Framework to perform schema validation and data profiling on Spark (PySpark); see the sketch after this list.
- Experience in implementing AWS solutions using EC2 and S3, and Azure Storage.
- Responsible for monitoring the Tableau dashboards for reporting and providing refined data to end users.
- Worked with Big Data technologies such as Hadoop, MapReduce frameworks, HBase, and Hive.
- Performed data analysis, feature selection, feature extraction using Apache Spark Machine Learning streaming libraries in Python.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data using Hadoop/Big Data concepts.
- Experience in AWS, implementing solutions using services like (EC2, S3, RDS, Redshift, VPC).
- Extensive development experience in different IDEs such as Eclipse, NetBeans, and IntelliJ.
- Worked as a Hadoop consultant on MapReduce/Pig/Hive/Sqoop.
- Worked with Apache Hadoop ecosystem components such as HDFS, Hive, Sqoop, Pig, and MapReduce.
- Good exposure to GitHub and Jenkins.
- Exposed to Agile environment and familiar with tools like JIRA, Confluence.
- Provided recommendations to machine learning group about customer roadmap.
- Sound knowledge in Agile methodology- SCRUM, Rational Tools.
- Lead architecture and design of data processing, warehousing and analytics initiatives.
- Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Used Apache NiFi for ingestion of data from IBM MQ (message queues).
- Identified query duplication, complexity, and dependencies to minimize migration effort. Technology stack: Oracle, Cloudera, Hortonworks HDP cluster, Attunity Visibility, Cloudera Navigator Optimizer, AWS Cloud, and DynamoDB.
- As a POC, used Spark for data transformation of larger data sets.
- Worked on setting up and configuring AWS EMR clusters and used Amazon IAM to grant fine-grained access to AWS resources to users.
- Worked on Sequence files, RC files, map-side joins, bucketing, and partitioning for Hive performance enhancement and storage improvement.
- Enabled and configured Hadoop services such as HDFS, YARN, Hive, Ranger, HBase, Kafka, Sqoop, Zeppelin Notebook, and Spark/Spark2.
- Worked on Spark, Scala, Python, Storm, and Impala.
- Extensive experience in Spark Streaming (version 1.5.2) through the core Spark API, using Scala and Java to transform raw data from several data sources into baseline data.
- Created dashboards on Tableau and on Elasticsearch with Kibana.
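A minimal PySpark sketch of the kind of schema validation and data profiling the Data Quality Framework above might perform. The expected schema, S3 path, and column names are hypothetical placeholders, not the actual framework.

```python
# Minimal sketch: validate that an incoming dataset has the required columns,
# then profile null counts per column. All names and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("dq-checks").getOrCreate()

# Expected contract for the dataset (hypothetical)
expected = StructType([
    StructField("account_id", StringType(), False),
    StructField("balance",    DoubleType(), True),
])

df = spark.read.parquet("s3://example-bucket/curated/accounts/")

# Schema validation: fail fast if required columns are missing
missing = {f.name for f in expected.fields} - set(df.columns)
if missing:
    raise ValueError(f"Schema validation failed, missing columns: {missing}")

# Simple profiling: null counts per expected column
nulls = df.select([F.sum(F.col(f.name).isNull().cast("int")).alias(f.name)
                   for f in expected.fields])
nulls.show()
```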
Environment: Apache Spark, Scala, Spark-Core, Spark-Streaming, Python, Spark-SQL, Hadoop, MapReduce, HDFS, Hive, Pig, MongoDB, Sqoop, Oozie, MySQL, Java (jdk1.7), AWS
Confidential, Atlanta, GA
Lead Hadoop / Spark Developer
Responsibilities:
- Worked with variables and parameter files and designed ETL framework to create parameter files to make it dynamic.
- Currently working on the Teradata to HP Vertica data migration project: working extensively with the COPY command for extracting data from files into Vertica, monitoring the ETL jobs, and validating the data loaded into the Vertica DW.
- Built a full-service catalog system with a complete workflow using Elasticsearch, Logstash, Kibana, Kinesis, and CloudWatch.
- Responsible for data extraction and data ingestion from different data sources into Hadoop Data Lake by creating ETL pipelines using Pig, and Hive.
- Experienced in transferring data from different data sources into HDFS using Kafka producers, consumers, and brokers.
- Preprocessed the logs and semi-structured content stored on HDFS using Pig and imported the processed data into the Hive warehouse, which enabled business analysts to write Hive queries.
- Worked on data migration from Hadoop clusters to the cloud. Good knowledge of cloud components such as AWS S3, EMR, ElastiCache, and EC2.
- Responsible for writing Hive and Pig scripts as ETL tools to perform transformations, event joins, traffic filtering, and pre-aggregations before storing data in HDFS. Developed Vertica UDFs to preprocess the data for analysis.
- Designed the reporting application that uses the Spark SQL to fetch and generate reports on HBase.
- Built a custom batch aggregation framework for creating reporting aggregates in Hadoop.
- Experience working with the Hive data warehouse tool: creating tables, distributing data by implementing partitioning and bucketing, and writing and optimizing Hive queries. Built a real-time pipeline for streaming data using Kafka and Spark Streaming (see the sketch after this list).
- Experienced with NoSQL databases such as HBase, MongoDB, and Cassandra, and wrote a Storm topology to accept events from the Kafka producer and emit them into Cassandra DB.
- Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files.
- Strong hands-on experience with PySpark, using Spark libraries through Python scripting for data analysis.
- Wrote Python Script to access databases and execute scripts and commands.
- Involved in converting Cassandra/Hive/SQL queries into Spark transformations using Spark RDDs in Scala and Python.
- Created ODBC connection through Sqoop between Hortonworks and SQL Server
- Built and published customized interactive reports and dashboards, with report scheduling, using Tableau Server. Created new schedules and checked tasks daily on the server.
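A minimal sketch of the Kafka-to-Spark streaming pipeline referenced above, written with Structured Streaming for brevity (the original work also used the earlier DStream API). The broker address, topic name, and windowing are hypothetical placeholders, and the spark-sql-kafka connector package is assumed to be on the classpath.

```python
# Minimal sketch: consume a Kafka topic with Spark Structured Streaming and
# count events per minute. Broker and topic names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Subscribe to a Kafka topic (requires the spark-sql-kafka package)
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "clickstream")
       .load())

# Kafka values arrive as bytes; cast to string and count events per 1-minute window
counts = (raw.selectExpr("CAST(value AS STRING) AS value", "timestamp")
          .groupBy(F.window("timestamp", "1 minute"))
          .count())

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```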
Environment: Hadoop, Hive, Apache Spark, Apache Kafka, Hortonworks, AWS, ElasticSearch, Lambda, Apache Cassandra, HBase, MongoDB, SQL, Sqoop, Flume, Oozie, Java (JDK 1.6), Eclipse, Informatica PowerCenter 9.1, Tableau, Teradata 13.x, Teradata SQL Assistant.
Confidential
Sr Hadoop/Python Developer
Responsibilities:
- Led the AML Cards North America development and DQ team successfully to implement the compliance project.
- Involved in the project from the POC stage and worked from data staging through population of the DataMart and reporting. Worked in an onsite-offshore environment.
- Completely responsible for creating data model for storing & processing data and for generating & reporting alerts. This model is being implemented as standard across all regions as a global solution.
- Involved in discussions and guiding other region teams on SCB Big data platform and AML cards data model and strategy.
- Responsible for technical design and review of data dictionary (Business requirement).
- Responsible for providing technical solutions and work arounds.
- Migrated the needed data from the data warehouse and product processors into HDFS using Sqoop, and imported various formats of flat files into HDFS.
- Involved in discussion with source systems for issues related to DQ in data.
- Implemented partitioning, dynamic partitions, buckets, and custom UDFs in Hive (see the sketch after this list).
- Used Hive for data processing and batch data filtering.
- Supported and Monitored Map Reduce Programs running on the cluster.
- Monitored logs and responded accordingly to any warning or failure conditions.
- Responsible for preserving code and design integrity using SVN and SharePoint.
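A minimal PySpark sketch of loading data into a dynamically partitioned Hive table, in the spirit of the partitioning work referenced above. The database, table, and column names are hypothetical placeholders, not the actual project model.

```python
# Minimal sketch: dynamic-partition insert into a Hive table via Spark SQL.
# All database/table/column names (aml.txn_alerts, txn_staging) are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-dynamic-partitions")
         .enableHiveSupport()
         .getOrCreate())

# Allow fully dynamic partitioning for the insert below
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("CREATE DATABASE IF NOT EXISTS aml")
spark.sql("""
    CREATE TABLE IF NOT EXISTS aml.txn_alerts (
        txn_id STRING,
        amount DOUBLE
    )
    PARTITIONED BY (txn_date STRING)
    STORED AS ORC
""")

# Partition value comes from the data itself (dynamic partitioning);
# assumes a staging table aml.txn_staging already exists.
spark.sql("""
    INSERT OVERWRITE TABLE aml.txn_alerts PARTITION (txn_date)
    SELECT txn_id, amount, txn_date
    FROM aml.txn_staging
""")
```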
Environment: Apache Hadoop, HDFS, Hive, MapReduce, Pig, HBase, Zookeeper, Oozie, MongoDB, Python, Java, Sqoop
Confidential
SDET/Python Engineer
Responsibilities:
- Involved in testing various business rules: user/customer functionality, change process, configuration data requirements, legacy data requirements, and access permissions requirements.
- Experienced in developing applications according to client requirements.
- Extensively used TestNG for assertions and grouping test cases.
- In-depth understanding of object-oriented programming and skilled in developing automated test scripts using Selenium in Python and Java (see the sketch after this list).
- Developed automation test cases, executed these test scripts from test lab and logged defects in JIRA.
- Developed and executed SQL queries to verify the proper insertion, deletion and updates into the Oracle supporting tables.
- Reviewed database test cases according to assigned Requirements to validate reports by retrieving data with complex SQL queries from oracle database.
- Designed, developed, and implemented an MVC-pattern-based, keyword-driven automation testing framework utilizing Python, Java, JUnit, and Selenium WebDriver, and performed data-driven testing with the framework for all test cases.
- Designed data-driven testing framework in Selenium and captured data dynamically from web controls.
- Used automated scripts and performed functionality testing during the various phases of the application development using Selenium.
- Demonstrated ability to solve complex automation challenges involving Ajax, dynamic objects, custom object types, unexpected event handling.
- Reported defects to developers and discussed the issues in weekly status meetings.
- Prepared user documentation with screenshots for UAT (User Acceptance testing).
- Experienced in using SVN repository for source code management.
- Used Ant and Maven as build management tools.
- Extensively used continuous integration tool Jenkins for performing test scripts execution.
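A minimal sketch of an automated Selenium WebDriver check in Python, as referenced above. The URL, element locators, credentials, and expected page title are hypothetical placeholders.

```python
# Minimal sketch: drive a login form and assert on the landing page title.
# Assumes chromedriver is available on PATH; all locators are hypothetical.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/login")

    driver.find_element(By.ID, "username").send_keys("test_user")
    driver.find_element(By.ID, "password").send_keys("secret")
    driver.find_element(By.ID, "login-button").click()

    # Simple assertion-style check on the landing page title
    assert "Dashboard" in driver.title, f"Unexpected title: {driver.title}"
finally:
    driver.quit()
```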
Environment: Python, Java, JavaScript, HTML, CSS, XPath, Selenium WebDriver, Eclipse, JUnit, Jira, SVN, Jenkins, Maven, Windows, Oracle 10g, Agile Methodology.
Confidential
Software Engineer
Responsibilities:
- Involved in testing various business rules: user/customer functionality, change process, configuration data requirements, legacy data requirements, and access permissions requirements.
- Experienced in developing applications according to client requirements.
- Extensively used TestNG for assertions and grouping test cases.
- In-depth understanding of object-oriented programming and skilled in developing automated test scripts using Selenium in Python and Java.
- Developed automation test cases, executed these test scripts from test lab and logged defects in JIRA.
- Developed and executed SQL queries to verify the proper insertion, deletion and updates into the Oracle supporting tables.
- Reviewed database test cases according to assigned Requirements to validate reports by retrieving data with complex SQL queries from oracle database.
- Designed, developed, and implemented an MVC-pattern-based, keyword-driven automation testing framework utilizing Python, Java, JUnit, and Selenium WebDriver, and performed data-driven testing with the framework for all test cases.
- Designed a data-driven testing framework in Selenium and captured data dynamically from web controls (see the sketch after this list).
- Used automated scripts and performed functionality testing during the various phases of the application development using Selenium.
- Demonstrated ability to solve complex automation challenges involving Ajax, dynamic objects, custom object types, unexpected event handling.
- Reported defects to developers and discussed the issues in weekly status meetings.
- Prepared user documentation with screenshots for UAT (User Acceptance testing).
- Experienced in using SVN repository for source code management.
- Used Ant and Maven as build management tools.
- Extensively used continuous integration tool Jenkins for performing test scripts execution.
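A minimal sketch of a data-driven Selenium check in Python, as referenced above: test inputs are read from a CSV file and replayed against a search form. The CSV file name, URL, and locators are hypothetical placeholders.

```python
# Minimal sketch: data-driven testing where each CSV row supplies a search term
# and an expected result string. Assumes chromedriver on PATH; names are hypothetical.
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    with open("search_terms.csv", newline="") as f:
        for row in csv.DictReader(f):        # expected columns: term, expected
            driver.get("https://example.com/search")
            box = driver.find_element(By.NAME, "q")
            box.clear()
            box.send_keys(row["term"])
            box.submit()
            assert row["expected"] in driver.page_source, (
                f"'{row['expected']}' not found for term '{row['term']}'")
finally:
    driver.quit()
```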
Environment: Python, Java, JavaScript, HTML, CSS, XPath, Selenium WebDriver, Eclipse, JUnit, Jira, SVN, Jenkins, Maven, Windows, Oracle 10g, Agile Methodology.