Senior Big Data Engineer Resume


Fremont, CA

SUMMARY

  • Over 9 years of IT development experience, including Big Data, Apache Spark, Python, Hadoop, Scala, Java, SQL, and cloud technologies.
  • Experience in requirement analysis, system design, development, and testing of various applications.
  • Experienced in using Agile methodologies including Extreme Programming, Scrum, and Test-Driven Development (TDD).
  • Experienced in frameworks like Flask and Django, and Python packages such as PySide, PyQtGraph, NumPy, and Matplotlib.
  • Proficient in Hive optimization techniques such as bucketing and partitioning.
  • Experienced in loading datasets into Hive for ETL (Extract, Transform, and Load) operations.
  • Experience in importing and exporting data using Sqoop between relational database systems and HDFS.
  • Excellent knowledge of Hadoop architecture, including HDFS, Job Tracker, Task Tracker, Name Node, Data Node, and the MapReduce programming paradigm.
  • Worked with HBase to perform quick lookups (updates, inserts, and deletes) in Hadoop.
  • Experience with the Oozie workflow scheduler to manage Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
  • Extensive experience importing and exporting data using ingestion platforms like Flume.
  • Developed Apache Spark jobs using Scala and Python for faster data processing and used the Spark Core and Spark SQL libraries for querying (a minimal PySpark sketch follows this list).
  • Played a key role in migrating Cassandra and Hadoop clusters to AWS and defined read/write strategies.
  • Experience with Apache Hadoop components such as HDFS, MapReduce, Hive, HBase, Pig, Sqoop, Spark, and Flume for Big Data and Big Data analytics.
  • Experience in developing MapReduce programs using Apache Hadoop to analyze big data per requirements.
  • Extensive experience using Maven as a build tool for building deployable artifacts from source code.
  • Wrote data transformations and data cleansing using Pig operations; good experience retrieving and processing data using Hive.
  • Experienced in developing web services with Python and in processing large datasets with Spark using Scala and PySpark.
  • Experience working with AWS services such as EC2, EMR, Glue, S3, KMS, Kinesis, Lambda, API Gateway, and IAM.
  • Expert in implementing advanced procedures such as text analytics and processing using in-memory computing with Apache Spark in Scala.
  • Experience with Snowflake multi-cluster warehouses.
  • Hands-on experience in data processing automation using Python.
  • Experience in creating Spark Streaming jobs to process huge datasets in real time.
  • Experience in text analytics, developing statistical machine learning and data mining solutions to various business problems, generating data visualizations using R, SAS, and Python, and creating dashboards using tools like Tableau.
  • Proficient with tools such as Erwin (Data Modeler, Model Mart, Navigator), ER Studio, IBM Metadata Workbench, Oracle data profiling tools, Informatica, Oracle Forms and Reports, SQL*Plus, Toad, and Crystal Reports.
  • Good understanding of Hadoop Gen1/Gen2 architecture and hands-on experience with Hadoop components such as Job Tracker, Task Tracker, Name Node, Secondary Name Node, Data Node, and MapReduce concepts, as well as the YARN architecture, including Node Manager, Resource Manager, and Application Master.
  • Expertise in relational database systems (RDBMS) such as MySQL, Oracle, and MS SQL, and NoSQL databases such as HBase, MongoDB, and Cassandra.
  • Experience in Microsoft Azure cloud services such as SQL Data Warehouse, Azure SQL Server, Azure Databricks, Azure Data Lake, Azure Blob Storage, and Azure Data Factory.
  • Flexible working across operating systems such as Unix/Linux (CentOS, Red Hat, Ubuntu) and Windows environments.
  • Experience with software development tools such as Jira, Git, and SVN.
  • Good experience in creating build scripts using Maven; extensively used Log4j to develop logging standards and mechanisms.
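
The Spark SQL bullet above references the following sketch: a minimal PySpark example of querying a partitioned Hive table, assuming a cluster with a reachable Hive metastore. The database, table, and column names (sales_db.transactions, txn_date, amount) are illustrative placeholders, not details from any engagement described here.

```python
from pyspark.sql import SparkSession

# Minimal sketch: enable Hive support and query a partitioned Hive table.
# Assumes a Hive metastore is reachable; table and column names are placeholders.
spark = (
    SparkSession.builder
    .appName("hive-partition-query")
    .enableHiveSupport()
    .getOrCreate()
)

# Filtering on the partition column (txn_date) lets Spark prune partitions,
# reading only the matching directories instead of scanning the whole table.
daily_totals = spark.sql("""
    SELECT txn_date, SUM(amount) AS total_amount
    FROM sales_db.transactions
    WHERE txn_date = '2020-01-15'
    GROUP BY txn_date
""")
daily_totals.show()
```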

TECHNICAL SKILLS:

Big Data Tools: Apache Spark, Spark Streaming, Kafka, Cassandra, HBase, Impala, HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper

Hadoop Distribution: Cloudera CDH, Apache, AWS, Hortonworks HDP

Programming Languages: SQL, PL/SQL, Python, UNIX, PySpark, Pig, HiveQL, Scala, Shell Scripting

Spark Components: RDD, Spark SQL, Spark Streaming

Data Modeling Tools: Erwin Data Modeler, ER Studio v17

Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile

Cloud Management: MS Azure, Amazon Web Services (AWS), Snowflake

Databases: Oracle 12c/11g/10g, MySQL, MS SQL, DB2

NoSQL Databases: MongoDB, HBase, Cassandra

OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9

ETL/Data Warehouse Tools: Informatica, Tableau

Version Control: CVS, SVN, Clear Case, Git

Operating System: Windows, Unix, Sun Solaris

PROFESSIONAL EXPERIENCE

Confidential, Fremont, CA

Senior Big Data Engineer

Responsibilities:

  • Installed, configured, and maintained data pipelines.
  • Designed the business requirement collection approach based on the project scope and SDLC methodology.
  • Built Apache Avro schemas for publishing messages to topics and enabled the relevant serialization formats for message publishing and consumption.
  • Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
  • Used Apache NiFi to copy data from the local file system to HDP. Thorough understanding of various AML modules, including Watch List Filtering, Suspicious Activity Monitoring, CTR, CDD, and EDD.
  • Designed and implemented multiple ETL solutions with various data sources using extensive SQL scripting, ETL tools, Python, shell scripting, and scheduling tools; performed data profiling and data wrangling of XML, web feeds, and files using Python, Unix, and SQL.
  • Created on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.
  • Used Sqoop to channel data between various RDBMS sources and HDFS.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.
  • Used SSIS to build automated multi-dimensional cubes.
  • Used Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Python, as well as in NoSQL databases such as HBase and Cassandra (see the sketch after this list).
  • Set up clusters on Amazon EC2 and S3, including automation for provisioning and extending the clusters in AWS.
  • Connected Tableau from the client side to AWS IP addresses to view the end results.
  • Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
  • Converted Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Authored Python (PySpark) scripts for custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks.
  • Developed Kafka producers and consumers, HBase clients, and Spark and Hadoop MapReduce jobs, along with components on HDFS, Pig, and Hive.
  • Experience in change implementation, monitoring, and troubleshooting of AWS Snowflake databases and cluster-related issues.
  • Loaded data from different sources into a data warehouse and performed data aggregations for business intelligence using Python.
  • Wrote Spark transformation and action jobs to read data from source databases/log files and migrate it to the destination Cassandra database.
  • Designed and implemented Sqoop incremental jobs to read data from DB2 and load it into Hive tables, and connected Tableau via HiveServer2 to generate interactive reports.
  • Collected data from an AWS S3 bucket in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Created monitors, alarms, notifications, and logs for Lambda functions, Glue jobs, and EC2 hosts using CloudWatch.
  • Transformed business problems into Big Data solutions and defined the Big Data strategy and roadmap.
  • Developed solutions leveraging ETL tools and identified opportunities for process improvement using Informatica and Python.
  • Worked on Oracle databases, Redshift, and Snowflake.
  • Created multiple dashboards in Tableau for multiple business needs.
  • Prepared and uploaded SSRS reports; managed database and SSRS permissions.
  • Worked with AWS for storing and holding terabytes of data for customer BI reporting tools.
  • Used SQL Server management tools to check the data in the database against the given requirements.
  • Validated the test data in DB2 tables on mainframes and on Teradata using SQL queries.
  • Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java for event-driven processing; created Lambda jobs and configured roles using the AWS CLI.
  • Automated and scheduled recurring reporting processes using UNIX shell scripting and Teradata utilities such as MLOAD, BTEQ, and FastLoad.
  • Worked on dimensional and relational data modeling using star and snowflake schemas, OLTP/OLAP systems, and conceptual, logical, and physical data modeling using Erwin.
  • Automated data processing with Oozie, including data loading into the Hadoop Distributed File System.
  • Developed automated regression scripts in Python to validate ETL processes across multiple databases such as AWS Redshift, Oracle, MongoDB, and SQL Server (T-SQL).
  • Set up data sharing between two Snowflake accounts.
  • Worked with analysis tools such as Tableau for regression analysis, pie charts, and bar graphs.
  • Created a serverless data ingestion pipeline on AWS using MSK (Kafka) and Lambda functions.
  • Developed Java applications that read data from MSK (Kafka) and write it to DynamoDB.
  • Developed NiFi workflows to pick up data from REST API servers, the data lake, and SFTP servers and send it to a Kafka broker.
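
A hedged sketch of the Kafka-to-HDFS flow referenced in the Spark Streaming bullet above, written here with Spark Structured Streaming in Python; the broker address, topic name, and HDFS paths are placeholders, and the Spark-Kafka connector package is assumed to be available on the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Sketch only: consume a Kafka topic and land the raw payload in HDFS as Parquet.
# Broker, topic, and paths below are illustrative placeholders.
spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "learner-events")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the payload to string before persisting.
parsed = events.select(
    col("key").cast("string").alias("key"),
    col("value").cast("string").alias("value"),
    col("timestamp"),
)

query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/learner/raw")
    .option("checkpointLocation", "hdfs:///checkpoints/learner-raw")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```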

Environment: Cloudera Manager (CDH5), Spark, Hadoop, PySpark, HDFS, NiFi, Pig, Hive, AWS, S3, EC2, Auto Scaling, CloudFormation, CloudWatch, IAM, Glue, Security Groups, Kafka, Scrum, Git, Sqoop, Oozie, Informatica, Tableau, Snowflake, OLTP, OLAP, HBase, Cassandra, SQL Server, Python, Shell Scripting, XML, Unix.

Confidential, Tampa, Florida

Big Data Engineer/Spark Developer

Responsibilities:

  • Used Sqoop to import data from RDBMS source systems and loaded it into Hive staging and base tables.
  • Worked extensively on the Spark Core and Spark SQL modules.
  • Implemented reprocessing of failed messages in Kafka using offset IDs.
  • Worked extensively with Sqoop for importing metadata from Oracle.
  • Planned Azure Storage usage and migrated to Blob Storage for documents and media files, Table storage for structured datasets, Queue storage for reliable workflow messaging, and File storage for shared file data.
  • Worked with the PaaS architect on a complex project for the Azure data center assessment and migration.
  • Performed several ad hoc data analyses on the Azure Databricks analytics platform, tracked on a Kanban board.
  • Used Azure reporting services to upload and download reports.
  • Handled different file types such as JSON, XML, flat files, and CSV, using appropriate SerDes or parsing logic to load them into Hive tables.
  • Implemented software enhancements to port legacy software systems to Spark and Hadoop ecosystems on the Azure cloud.
  • Worked on loading data into Snowflake in the cloud from various sources.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.
  • Used sbt to develop Scala-based Spark projects and executed them using spark-submit.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (see the sketch after this list).
  • Translated business requirements into SAS code for use within internal systems and models.
  • Developed multiple Kafka producers and consumers per the software requirement specifications.
  • Used Kafka for log aggregation, gathering physical log files from servers and placing them in a central location such as HDFS for processing.
  • Built a Hortonworks cluster on Azure to extract actionable insights from data collected by IoT sensors installed in excavators.
  • Installed a Hortonworks Hadoop cluster on the Azure cloud in the UK region to satisfy the customer's data locality needs.
  • Used various Spark transformations and actions to cleanse the input data.
  • Implemented OLAP multi-dimensional cube functionality using Azure SQL Data Warehouse.
  • Good exposure to the Azure cloud, ADF, ADLS, Azure DevOps (VSTS), and portal services.
  • Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back.
  • Implemented test scripts to support test-driven development and continuous integration.
  • Used partitioning techniques for faster performance.
  • Optimized HiveQL/Pig scripts by using execution engines such as Tez and Spark.
  • Ensured ETL/ELT jobs succeeded and loaded data successfully into Snowflake.
  • Analyzed production jobs in case of abends and fixed the issues.
  • Loaded real-time data from various data sources into HDFS using Kafka.
  • Developed MapReduce jobs in Python for data cleanup.
  • Prepared Tableau reports and dashboards with calculated fields, parameters, sets, groups, and bins, and published them to the server.
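
A minimal PySpark sketch of the multi-format extract/transform/aggregate pattern referenced above; the ADLS container paths, column names, and schema are assumed for illustration, and storage credentials are presumed to be configured on the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Sketch: combine JSON and CSV sources, then roll up daily usage counts.
# Paths and column names are illustrative placeholders.
spark = SparkSession.builder.appName("multi-format-etl").getOrCreate()

json_df = spark.read.json("abfss://raw@account.dfs.core.windows.net/usage/json/")
csv_df = (
    spark.read.option("header", True)
    .csv("abfss://raw@account.dfs.core.windows.net/usage/csv/")
)

# Align both sources on a common set of columns before combining them.
usage = json_df.select("user_id", "event_type", "event_ts").unionByName(
    csv_df.select("user_id", "event_type", "event_ts")
)

# Aggregate: daily event counts per user and event type, a typical usage rollup.
daily_usage = (
    usage.withColumn("event_date", F.to_date("event_ts"))
    .groupBy("user_id", "event_date", "event_type")
    .count()
)

daily_usage.write.mode("overwrite").parquet(
    "abfss://curated@account.dfs.core.windows.net/usage/daily/"
)
```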

Environment: Spark, Hadoop, Sqoop, Hive, Snowflake, JSON, XML, Kafka, Python, MapReduce, Oracle, Agile Scrum, Pig, Scala, Azure, Azure Databricks, DAX, Azure Synapse Analytics, Azure Data Lake.

Confidential

Big Data Engineer

Responsibilities:

  • Analyzed existing application programs and tuned SQL queries using execution plans, Query Analyzer, SQL Profiler, and Database Engine Tuning Advisor to enhance performance.
  • Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data; created various types of data visualizations using Python and Tableau.
  • Used Spark and Spark SQL to read Parquet data and create tables in Hive using the Scala API.
  • Used Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology to get the job done.
  • Created various complex SSIS/ETL packages to extract, transform, and load data.
  • Defined facts and dimensions and designed the data marts using Ralph Kimball's dimensional data mart modeling methodology with Erwin.
  • Responsible for ETL and data validation using SQL Server Integration Services.
  • Worked on Big Data on AWS cloud services, i.e., EC2, S3, EMR, Glue, and DynamoDB.
  • Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
  • Used Python and SAS to extract, transform, and load source data from transaction systems, and generated reports, insights, and key conclusions.
  • Defined and deployed monitoring, metrics, and logging systems on AWS.
  • Published interactive data visualization dashboards, reports, and workbooks on Tableau.
  • Wrote various data normalization jobs for new data ingested into Redshift.
  • Developed SSRS reports and SSIS packages to extract, transform, and load data from various source systems.
  • Implemented and managed ETL solutions and automated operational processes.
  • Used Kafka capabilities such as distribution, partitioning, and the replicated commit log for messaging systems by maintaining feeds, and created applications that monitor consumer lag within Apache Kafka clusters.
  • Used ZooKeeper to store offsets of messages consumed for a specific topic and partition by a specific consumer group in Kafka.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS.
  • Used the AWS Glue catalog with crawlers to catalog data from S3 and perform SQL query operations.
  • Involved in forward engineering the logical models to generate physical models using Erwin, generating data models, and deploying them to the enterprise data warehouse.
  • Created ad hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
  • Migrated the on-premises database structure to the Amazon Redshift data warehouse.
  • Used the Spark application master to monitor Spark jobs and capture their logs.
  • Built PL/SQL procedures, functions, triggers, and packages to summarize data into summary tables used for generating reports with improved performance.
  • Managed security groups on AWS, focusing on high availability, fault tolerance, and auto scaling using Terraform templates, along with continuous integration and continuous deployment using AWS Lambda and AWS CodePipeline.
  • Worked on data pre-processing and cleaning to support feature engineering, and applied imputation techniques for missing values in the dataset using Python (see the sketch after this list).
  • Created Entity Relationship Diagrams (ERDs), functional diagrams, and data flow diagrams, enforced referential integrity constraints, and created logical and physical models using Erwin.
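
A small pandas sketch of the missing-value imputation referenced in the pre-processing bullet above; the file name and column names are illustrative assumptions, not drawn from the project itself.

```python
import pandas as pd

# Sketch: simple imputation strategies during data pre-processing.
# "claims.csv" and its columns are hypothetical.
df = pd.read_csv("claims.csv")

# Numeric gaps: fill with the column median, which is robust to outliers.
df["claim_amount"] = df["claim_amount"].fillna(df["claim_amount"].median())

# Categorical gaps: fill with the most frequent value (mode).
df["region"] = df["region"].fillna(df["region"].mode().iloc[0])

# Rows still missing the key identifier cannot be joined downstream, so drop them.
df = df.dropna(subset=["customer_id"])
```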

Environment: Hadoop, Spark, Informatica, RDS, NoSQL, AWS, Apache Kafka, Python, ZooKeeper, SQL Server, Erwin, Oracle, Redshift, MySQL, PostgreSQL.

Confidential

Tibco BW Developer

Responsibilities:

  • Gathered requirements from the client via email and telephone; analyzed the requirements and clarified scope, timeline, and development effort.
  • Developed use case diagrams and sequence diagrams based on the integration requirements.
  • Designed the architecture for the TIBCO BW application based on the requirements and interacted with the client on requirement gaps arising from the design.
  • Set up the environment needed for application development, including the TIBCO Administrator server, TIBCO EMS server configurations, and the Tomcat server.
  • Designed the database based on the application architecture.
  • Created tables and sequences in Oracle and inserted the required data into the tables.
  • Developed TIBCO BW processes to integrate the applications based on the requirements and architecture of the application.
  • Developed the Java classes needed as part of the TIBCO application to create request and response payloads for HTTP and TIBCO EMS communications.
  • Generated XML schemas and used XMLBeans to parse XML files.
  • Created Ant scripts to automate the build process and build the JAR files.
  • Implemented Maven in the projects to centralize code dependencies.
  • Developed code to create XML files and flat files with data retrieved from databases and XML files.
  • Implemented OAuth APIs for authentication and authorization of input and output payloads in HTTP communication.

Environment: Java, JavaScript, TIBCO BW 5.x, TIBCO EMS 8.x, Oracle 11g, Apache Ant, Maven, HTTP, XML, JSON.

Confidential

Java Developer

Responsibilities:

  • Participated in project team meetings to gather requirements and understand the end users' system.
  • Analyzed business requirements and identified the mapping documents required for system and functional testing across all test scenarios.
  • Implemented HTML, DHTML, JavaScript, AJAX, jQuery, JSP, and tag libraries in developing view pages.
  • Created use case diagrams, sequence diagrams, class diagrams, and ER models based on the requirements.
  • Created stored procedures and functions; used JDBC to process database calls for SQL Server databases.
  • Created data sources and helper classes using Hibernate, utilized by all the interfaces to access and manipulate data.
  • Developed a web application using Spring MVC and JSP to manage the defects raised from the code deployed in different environments (DEV, TST, STG, PRD).
  • Built JUnit test cases for various modules.
  • Maintained the existing code base developed in the Spring and Hibernate frameworks by adding new features and fixing bugs.
  • Maintained code versions using Git.
  • Involved in application server configuration and production issue resolution.
  • Documented common problems prior to go-live and while actively involved in a production support role.
  • Documented the project flow and the required steps to be followed during application deployment in Confluence.

Environment: Java/J2EE, JavaScript, jQuery, JSP, JUnit, Spring MVC, Hibernate, XML, HTML, CSS, Oracle, AJAX, Maven, Git.
