Sr. Spark Developer Resume
Rochester, MN
SUMMARY
- 8+ years of IT development experience spanning Big Data, Apache Spark, Python, Hadoop, Scala, Java, SQL, and cloud technologies.
- Experience in requirement analysis, system design, development and testing of various applications.
- Experienced in Agile methodologies including Extreme Programming, Scrum, and Test-Driven Development (TDD).
- Experienced with frameworks such as Flask and Django and Python packages such as PySide, PyQtGraph, NumPy, and Matplotlib.
- Proficient in Hive optimization techniques such as bucketing and partitioning.
- Experienced in loading datasets into Hive for ETL (Extract, Transform, and Load) operations.
- Experience importing and exporting data with Sqoop between relational database systems and HDFS.
- Excellent knowledge of Hadoop architecture, including HDFS, Job Tracker, Task Tracker, NameNode, DataNode, and the MapReduce programming paradigm.
- Worked with HBase to perform fast lookups (updates, inserts, and deletes) in Hadoop.
- Experience with the Oozie workflow scheduler to manage Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
- Extensive experience importing and exporting data using stream-ingestion platforms such as Flume.
- Developed Apache Spark jobs using Scala and Python for faster data processing and used the Spark Core and Spark SQL libraries for querying.
- Played a key role in migrating Cassandra and Hadoop clusters to AWS and defined read/write strategies.
- Experience with Apache Hadoop components such as HDFS, MapReduce, Hive, HBase, Pig, Sqoop, Spark, and Flume for Big Data and Big Data analytics.
- Experience developing MapReduce programs on Apache Hadoop to analyze big data per requirements.
- Extensive experience using Maven as a build tool to produce deployable artifacts from source code.
- Involved in writing data transformations and data cleansing using Pig operations, with good experience retrieving and processing data using Hive.
- Experienced in developing web services with Python and good working experience processing large datasets with Spark using Scala and PySpark (a minimal PySpark sketch follows this summary).
- Experience working with AWS services such as EC2, EMR, Glue, S3, KMS, Kinesis, Lambda, API Gateway, and IAM.
- Expert in implementing advanced procedures such as text analytics and processing using in-memory computing with Apache Spark in Scala.
- Experience with Snowflake Multi-Cluster Warehouses.
- Hands-on experience automating data processing using Python.
- Experience creating Spark Streaming jobs to process large datasets in real time.
- Experience in text analytics, developing statistical machine learning and data mining solutions to various business problems, generating data visualizations with R, SAS, and Python, and creating dashboards with tools such as Tableau.
- Proficient with tools such as Erwin (Data Modeler, Model Mart, Navigator), ER Studio, IBM Metadata Workbench, Oracle data profiling tools, Informatica, Oracle Forms, Reports, SQL*Plus, Toad, and Crystal Reports.
- Good understanding of Hadoop Gen1/Gen2 architecture and hands-on experience with Hadoop components such as Job Tracker, Task Tracker, NameNode, Secondary NameNode, DataNode, MapReduce concepts, and the YARN architecture, including the Node Manager, Resource Manager, and Application Master.
- Expertise in relational database systems (RDBMS) such as MySQL, Oracle, and MS SQL Server, and NoSQL database systems such as HBase, MongoDB, and Cassandra.
- Experience in Microsoft Azure cloud services such as SQL Data Warehouse, Azure SQL Server, Azure Databricks, Azure Data Lake, Azure Blob Storage, and Azure Data Factory.
- Flexible working across operating systems such as Unix/Linux (CentOS, Red Hat, Ubuntu) and Windows environments.
- Experience with Software development tools such as JIRA, GIT, SVN.
- Good experience creating build scripts using Maven; extensively used Log4j to develop logging standards and mechanisms.
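The following is a minimal, illustrative PySpark sketch of the kind of batch ETL described above (reading raw files, cleansing, and writing a partitioned Hive table); the paths, table, and column names are hypothetical placeholders, not from a specific project.

```python
# Minimal PySpark batch ETL sketch (illustrative only): reads raw CSV,
# applies simple cleansing, and writes a partitioned Hive table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("etl-sketch")
    .enableHiveSupport()          # allows saveAsTable() against the Hive metastore
    .getOrCreate()
)

# Read raw data (placeholder path).
orders = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("hdfs:///data/raw/orders/")
)

# Basic cleansing and derivation (hypothetical columns).
clean = (
    orders
    .dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_ts"))
    .filter(F.col("amount") > 0)
)

# Write as a partitioned Hive table (partitioning mirrors the Hive optimization noted above).
(
    clean.write
    .mode("overwrite")
    .partitionBy("order_date")
    .format("parquet")
    .saveAsTable("analytics.orders_clean")
)
```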
TECHNICAL SKILLS
Big Data Tools: Apache Spark, Spark Streaming, Kafka, Cassandra, HBase, Impala, HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper
Hadoop Distribution: Cloudera CDH, Apache, AWS, Hortonworks HDP
Programming Languages: SQL, PL/SQL, Python, PySpark, Pig Latin, HiveQL, Scala, UNIX Shell Scripting
Spark Components: RDD, Spark SQL, Spark Streaming
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile
Cloud Management: MS Azure, Amazon Web Services (AWS), Snowflake
Databases: Oracle 12c/11g/10g, MySQL, MS SQL Server, DB2
NoSQL Databases: MongoDB, HBase, Cassandra
OLAP Tools: Tableau, SSAS, Business Objects, Crystal Reports 9
ETL/Data Warehouse Tools: Informatica, Tableau
Version Control: CVS, SVN, ClearCase, Git
Operating System: Windows, Unix, Sun Solaris
PROFESSIONAL EXPERIENCE
Confidential, Rochester MN
Sr. Spark Developer
Responsibilities:
- Installed, configured, and maintained data pipelines.
- Designed the business requirement collection approach based on the project scope and SDLC methodology.
- Built Apache Avro schemas for publishing messages to topics and enabled the relevant serialization formats for message publishing and consumption.
- Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on data in HDFS.
- Used Apache NiFi to copy data from the local file system to HDP. Thorough understanding of various AML modules, including Watch List Filtering, Suspicious Activity Monitoring, CTR, CDD, and EDD.
- Designed and implemented multiple ETL solutions across various data sources using extensive SQL scripting, ETL tools, Python, shell scripting, and scheduling tools; performed data profiling and data wrangling of XML, web feeds, and files using Python, Unix, and SQL.
- Created on-demand tables over S3 files using AWS Lambda functions and AWS Glue with Python and PySpark.
- Used Sqoop to move data between RDBMS sources and HDFS.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.
- Used SSIS to build automated multi-dimensional cubes.
- Used Spark Streaming to receive real-time data from Kafka and stored the stream data in HDFS and NoSQL databases such as HBase and Cassandra using Python (see the streaming sketch at the end of this list).
- Set up clusters on Amazon EC2 and S3, including automation for provisioning and scaling the clusters in AWS.
- Connected Tableau from the client side to AWS IP addresses to view the end results.
- Extracted files from Hadoop and dropped them into S3 on daily and hourly schedules.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Authored Python (PySpark) scripts with custom UDFs for row/column manipulation, merges, aggregations, stacking, data labeling, and all cleansing and conforming tasks.
- Developed Kafka producers and consumers, HBase clients, and Spark and Hadoop MapReduce jobs, along with components on HDFS, Pig, and Hive.
- Experienced in change implementation, monitoring, and troubleshooting of Snowflake databases on AWS and cluster-related issues.
- Loaded data from different sources into a data warehouse and performed data aggregations for business intelligence using Python.
- Wrote Spark transformation and action jobs to pull data from source databases/log files and migrate it to the destination Cassandra database.
- Designed and implemented incremental Sqoop jobs to read data from DB2 and load it into Hive tables, and connected Tableau to HiveServer2 for generating interactive reports.
- Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Created monitors, alarms, notifications, and logs for Lambda functions, Glue jobs, and EC2 hosts using CloudWatch.
- Transformed business problems into Big Data solutions and defined the Big Data strategy and roadmap.
- Developed solutions leveraging ETL tools and identified opportunities for process improvement using Informatica and Python.
- Worked on Oracle databases, Redshift, and Snowflake.
- Created multiple dashboards in Tableau for various business needs.
- Prepared and uploaded SSRS reports; managed database and SSRS permissions.
- Started working with AWS for storage and handling of terabytes of data for customer BI reporting tools.
- Used SQL Server Management Studio to check data in the database against the given requirements.
- Validated the test data in DB2 tables on Mainframes and on Teradata using SQL queries.
- Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java for event-driven processing; created Lambda jobs and configured roles using the AWS CLI.
- Automated and scheduled recurring reporting processes using UNIX shell scripting and Teradata utilities such as MultiLoad (MLOAD), BTEQ, and FastLoad.
- Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modeling using Erwin.
- Used Oozie to automate data processing and data loading into the Hadoop Distributed File System.
- Developed automated regression scripts in Python to validate the ETL process across multiple databases, including AWS Redshift, Oracle, MongoDB, and SQL Server (T-SQL).
- Set up data sharing between two Snowflake accounts.
- Worked with analysis tools such as Tableau for regression analysis, pie charts, and bar graphs.
- Created a serverless data ingestion pipeline on AWS using MSK (Kafka) and Lambda functions.
- Developed Java applications that read data from MSK (Kafka) and write it to DynamoDB.
- Developed NiFi workflows to pick up data from REST API servers, the data lake, and SFTP servers and send it to a Kafka broker.
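Illustrative sketch for the streaming work above: a minimal Spark Structured Streaming job (a newer API than the DStream-based Spark Streaming referenced in the bullets) that reads JSON events from Kafka, applies a small cleansing UDF, and persists the stream to HDFS as Parquet. Broker addresses, topic, schema, and paths are hypothetical, and the spark-sql-kafka connector package is assumed to be on the classpath.

```python
# Minimal sketch (assumed topic, brokers, paths, and schema) of a Spark
# Structured Streaming job: Kafka -> cleansing UDF -> Parquet on HDFS.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructType, StructField, DoubleType

spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
])

# Hypothetical cleansing UDF: normalize the event type label.
normalize = F.udf(lambda s: (s or "unknown").strip().lower(), StringType())

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
    .option("subscribe", "events")                       # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

events = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withColumn("event_type", normalize(F.col("event_type")))
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/streams/events/")        # placeholder sink
    .option("checkpointLocation", "hdfs:///chk/events/")   # required for fault tolerance
    .outputMode("append")
    .start()
)
query.awaitTermination()
```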
Environment: Cloudera Manager (CDH5), Spark, PySpark, Hadoop, HDFS, NiFi, Pig, Hive, AWS (S3, EC2, Auto Scaling, CloudFormation, CloudWatch, IAM, Glue, Security Groups), Kafka, Scrum, Git, Sqoop, Oozie, Informatica, Tableau, Snowflake, OLTP, OLAP, HBase, Cassandra, SQL Server, Python, Shell Scripting, XML, Unix.
Confidential, Tampa, FL
Big Data Engineer/Spark Developer
Responsibilities:
- Used Sqoop to import data from RDBMS source systems and loaded the data into Hive staging and base tables.
- Worked extensively on the Spark Core and Spark SQL modules of Spark.
- Implemented reprocessing of failed messages in Kafka using offset IDs.
- Worked extensively with Sqoop for importing metadata from Oracle.
- Planned Azure Storage migration: Blob Storage for document and media files, Table Storage for structured datasets, Queue Storage for reliable messaging in workflow processing, and File Storage for sharing file data.
- Worked with a PaaS architect on a complex Azure data center assessment and migration project.
- Performed several ad-hoc data analyses on the Azure Databricks analysis platform, tracked on a Kanban board.
- Used Azure reporting services to upload and download reports.
- Handled different file types such as JSON, XML, flat files, and CSV, using the appropriate SerDes or parsing logic to load them into Hive tables (a small PySpark sketch follows this list).
- Implemented software enhancements to port legacy software systems to Spark and Hadoop ecosystems on the Azure cloud.
- Worked on loading data into Snowflake in the cloud from various sources.
- Extracted, transformed, and loaded data from source systems into Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.
- Used sbt to develop Scala-based Spark projects and executed them with spark-submit.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Translated business requirements into SAS code for use within internal systems and models.
- Developed multiple Kafka producers and consumers per the software requirement specifications.
- Used Kafka for log aggregation, gathering physical log files from servers and placing them in a central location such as HDFS for processing.
- Built a Hortonworks cluster on Confidential Azure to extract actionable insights from data collected by IoT sensors installed in excavators.
- Installed a Hortonworks Hadoop cluster on the Confidential Azure cloud in the UK region to satisfy the customer's data locality needs.
- Used various Spark transformations and actions for cleansing the input data.
- Implemented OLAP multi-dimensional cube functionality using Azure SQL Data Warehouse.
- Good exposure to Azure Cloud, ADF, ADLS, Azure DevOps (VSTS), and portal services.
- Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back.
- Implemented test scripts to support test driven development and continuous integration.
- Used partitioning techniques for faster performance.
- Optimized HiveQL/Pig scripts by using execution engines such as Tez and Spark.
- Ensured ETL/ELT jobs succeeded and loaded data successfully into Snowflake.
- Analyzed production jobs in case of abends and fixed the issues.
- Loaded real time data from various data sources into HDFS using Kafka.
- Developed MapReduce jobs for data cleanup in Python.
- Prepared Tableau reports and dashboards with calculated fields, parameters, sets, groups, and bins, and published them on the server.
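A minimal PySpark sketch of the multi-format ingestion described above, with hypothetical ADLS paths, column names, and Hive table names; JSON and CSV are read with the built-in readers, while XML would typically require an add-on reader such as the spark-xml package (omitted here).

```python
# Minimal sketch (hypothetical paths, columns, and table) showing how JSON and
# CSV inputs can be parsed with PySpark and loaded into a Hive staging table.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("multi-format-ingest-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# JSON: schema inferred per file; each record becomes a row.
customers_json = spark.read.json(
    "abfss://raw@exampleaccount.dfs.core.windows.net/customers/json/"
)

# CSV / flat files: header and delimiter handled by reader options.
customers_csv = (
    spark.read
    .option("header", "true")
    .option("delimiter", "|")
    .csv("abfss://raw@exampleaccount.dfs.core.windows.net/customers/flat/")
)

# Align columns and load both sources into the same Hive staging table.
common_cols = ["customer_id", "name", "country"]
(
    customers_json.select(common_cols)
    .unionByName(customers_csv.select(common_cols))
    .write.mode("append")
    .saveAsTable("staging.customers")
)
```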
Environment: Spark, Scala, Hadoop, MapReduce, Sqoop, Pig, Hive, Snowflake, JSON, XML, Kafka, Python, Oracle, Agile Scrum, Azure, Azure Databricks, Azure Synapse Analytics, Azure Data Lake, DAX.
Confidential
Big Data Engineer
Responsibilities:
- Analyzed existing application programs and tuned SQL queries using execution plans, Query Analyzer, SQL Profiler, and Database Engine Tuning Advisor to enhance performance.
- Created data quality scripts using SQL and Hive to validate successful data loads and data quality; created various types of data visualizations using Python and Tableau.
- Used Spark and Spark SQL to read Parquet data and create tables in Hive using the Scala API.
- Used Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology to get the job done.
- Created various complex SSIS/ETL packages to extract, transform, and load data.
- Defined facts and dimensions and designed the data marts using Ralph Kimball's dimensional data mart modeling methodology in Erwin.
- Was responsible for ETL and data validation using SQL Server Integration Services.
- Worked on Big Data workloads on AWS cloud services, i.e., EC2, S3, EMR, Glue, and DynamoDB.
- Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
- Used Python and SAS to extract, transform, and load source data from transaction systems and generated reports, insights, and key conclusions.
- Defined and deployed monitoring, metrics, and logging systems on AWS.
- Worked on publishing interactive data visualization dashboards, reports, and workbooks on Tableau and SAS Visual Analytics.
- Wrote various data normalization jobs for new data ingested into Redshift.
- Developed SSRS reports and SSIS packages to extract, transform, and load data from various source systems.
- Implemented and managed ETL solutions and automated operational processes.
- Used Kafka capabilities such as distribution, partitioning, and the replicated commit log service for messaging systems by maintaining feeds, and created applications that monitor consumer lag within Apache Kafka clusters.
- Used ZooKeeper to store offsets of messages consumed for a specific topic and partition by a specific consumer group in Kafka.
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS.
- Used the AWS Glue Data Catalog with crawlers to pull data from S3 and perform SQL query operations.
- Involved in forward engineering logical models into physical models and data models using Erwin, with subsequent deployment to the Enterprise Data Warehouse.
- Created ad-hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
- Migrated on-premises database structures to the Confidential Redshift data warehouse.
- Experienced in using the Spark application master to monitor Spark jobs and capture their logs.
- Built PL/SQL procedures, functions, triggers, and packages to summarize data and populate summary tables used for generating reports with improved performance.
- Managed security groups on AWS, focusing on high availability, fault tolerance, and auto scaling using Terraform templates, along with continuous integration and continuous deployment using AWS Lambda and AWS CodePipeline.
- Worked on data pre-processing and cleaning to perform feature engineering, and applied data imputation techniques for missing values in the dataset using Python (see the imputation sketch after this list).
- Created entity relationship diagrams (ERDs), functional diagrams, and data flow diagrams, enforced referential integrity constraints, and created logical and physical models using Erwin.
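A minimal sketch of the missing-value imputation step mentioned above, using pandas and scikit-learn with hypothetical file and column names: numeric columns are filled with the median, categorical columns with the most frequent value.

```python
# Minimal imputation sketch (hypothetical columns and input file):
# median fill for numeric features, most-frequent fill for categoricals.
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("claims_sample.csv")   # placeholder input file

numeric_cols = ["claim_amount", "customer_age"]
categorical_cols = ["state", "product_type"]

# Median imputation for numeric features.
num_imputer = SimpleImputer(strategy="median")
df[numeric_cols] = num_imputer.fit_transform(df[numeric_cols])

# Most-frequent imputation for categorical features.
cat_imputer = SimpleImputer(strategy="most_frequent")
df[categorical_cols] = cat_imputer.fit_transform(df[categorical_cols])
```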
Environment: Hadoop, Spark, Informatica, RDS, NoSQL, AWS, Apache Kafka, Python, ZooKeeper, SQL Server, Erwin, Oracle, Redshift, MySQL, PostgreSQL.
Confidential
Data Engineer
Responsibilities:
- Created consumption views on top of metrics to reduce the running time for complex queries.
- Involved in functional, integration, regression, smoke, and performance testing; tested Hadoop MapReduce jobs developed in Python, Pig, and Hive.
- Responsible for performing sort, join, aggregation, filter, and other transformations on datasets using Spark (a minimal sketch follows this list).
- Generated custom SQL to verify dependencies for the daily, weekly, and monthly jobs.
- Used the Spark-MongoDB connector to load data into MongoDB and analyzed data from MongoDB collections for quick searching, sorting, and grouping.
- Worked on multiple PoCs with Apache NiFi, such as executing Spark and Sqoop scripts through NiFi, creating scatter-gather patterns, ingesting data from Postgres to HDFS, fetching Hive metadata and storing it in HDFS, and creating a custom NiFi processor for filtering text from FlowFiles.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Using Nebula Metadata, registered business and technical datasets for the corresponding SQL scripts.
- Created performance dashboards in Tableau, Excel, and PowerPoint for key stakeholders.
- Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations to build the data model, and persisted the data in HDFS.
- Evaluated the traffic and performance of daily-deal PLA ads, compared those items with non-daily-deal items to assess the possibility of increasing ROI, and suggested improvements and modifications to existing BI components (reports, stored procedures).
- Implemented the defect tracking process using JIRA by assigning bugs to the development team.
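A minimal PySpark sketch of the sort/join/aggregate/filter transformations described above; dataset paths and column names are hypothetical placeholders.

```python
# Minimal sketch (hypothetical datasets and columns) of common DataFrame
# transformations: filter, join, aggregate, and sort.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transformations-sketch").getOrCreate()

orders = spark.read.parquet("hdfs:///data/orders/")        # placeholder path
customers = spark.read.parquet("hdfs:///data/customers/")  # placeholder path

result = (
    orders
    .filter(F.col("status") == "COMPLETE")                       # filter
    .join(customers, on="customer_id", how="inner")              # join
    .groupBy("country")                                          # aggregate
    .agg(F.sum("amount").alias("total_amount"),
         F.countDistinct("customer_id").alias("customers"))
    .orderBy(F.desc("total_amount"))                             # sort
)
result.show(20)
```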
Environment: Hadoop, MapReduce, Hive, Sqoop, MongoDB, Tableau, Scala, Teradata, AWS, Python, Pig, Jira, MS Excel, PowerPoint
Confidential
Data Engineer
Responsibilities:
- Analyzed requirements and impact by participating in Joint Application Development (JAD) sessions with business clients online.
- Configured, monitored, and optimized Flume agents to capture web logs from the VPN server and load them into the Hadoop data lake.
- Used ETL to extract files for the external vendors and coordinated that effort.
- Performed and automated SQL Server version upgrades, patch installs and maintained relational databases.
- Created HBase tables, used HBase sinks, and loaded data into them to perform analytics using Tableau.
- Developed complex parameterized reports which were used for making current and future business decisions.
- Modified and maintained SQL Server stored procedures, views, ad-hoc queries, and SSIS packages used in the search engine optimization process (a small sketch follows this list).
- Developed logical and physical data models that capture current-state and future-state data elements and data flows using Erwin.
- Responsible for development, support, and maintenance of ETL (Extract, Transform, and Load) processes using Oracle and Informatica PowerCenter.
- Performed unit tests on all code and packages.
- Monitored and tuned database resources and activities for SQL Server databases.
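A minimal sketch, with hypothetical server, database, procedure, and table names, of how one of the maintained SQL Server stored procedures could be invoked and spot-checked from Python via pyodbc; this is illustrative only, not the actual maintenance tooling.

```python
# Minimal pyodbc sketch (hypothetical names): call a SQL Server stored
# procedure and spot-check its output table.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=example-sql-server;DATABASE=SeoReporting;"   # placeholder server/db
    "Trusted_Connection=yes;"
)
cursor = conn.cursor()

# Execute a parameterized stored procedure via the ODBC call escape syntax.
cursor.execute("{CALL dbo.usp_RefreshKeywordStats (?)}", "2015-06-30")
conn.commit()

# Spot-check the refreshed output (placeholder table and columns).
cursor.execute(
    "SELECT TOP 10 keyword, impressions, clicks "
    "FROM dbo.KeywordStats ORDER BY impressions DESC"
)
for row in cursor.fetchall():
    print(row.keyword, row.impressions, row.clicks)

conn.close()
```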
Environment: Hadoop, MS SQL, Stored Procedures, Views, ad-hoc queries, SSIS, MS Office, Windows.