Senior Big Data Engineer Resume
Foster City, CA
SUMMARY
- 8+ years of IT experience in analysis, design, development, implementation, maintenance, and support, with a focus on developing strategic methods for deploying big data technologies to efficiently solve large-scale data processing requirements.
- Experienced in working within the SDLC using both Agile and Waterfall methodologies.
- Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
- Good knowledge of AWS CloudFormation templates; configured the SQS service through the Java API to send and receive messages.
- Wrote MapReduce code to process and parse data from various sources and store the parsed data in HBase and Hive using HBase-Hive integration.
- Responsible for developing custom UDFs, UDAFs, and UDTFs in Pig and Hive.
- Responsible for building scalable distributed data solutions using Hadoop; handled job management with the Fair Scheduler and developed job-processing scripts using Oozie workflows.
- Migrated projects from Cloudera Hadoop Hive storage to Azure Data Lake Store to support Confidential's digital transformation strategy.
- Able to spin up different AWS instances, including EC2-Classic and EC2-VPC, using CloudFormation templates.
- Worked with real-time data processing and streaming techniques using Spark Streaming and Kafka.
- Good knowledge of NoSQL databases such as HBase, Cassandra, and MongoDB.
- Experienced with other Hadoop ecosystem tools such as ZooKeeper, Oozie, and Impala.
- Educated client/business users on the pros and cons of various Azure PaaS and SaaS solutions, ensuring the most cost-effective approaches were taken into consideration.
- Experience in designing star and snowflake schemas for data warehouses.
- Exploratory data analysis and data wrangling with R and Python.
- Configured ZooKeeper to coordinate and support Kafka, Spark, Spark Streaming, HBase, and HDFS.
- Enabled monitoring and Azure Log Analytics to alert the support team on usage and statistics of the daily runs.
- Experience in developing MapReduce programs on Apache Hadoop to analyze big data as per requirements.
- Experience developing Kafka producers and consumers for streaming millions of events per second (see the producer sketch after this list).
- Implemented sentiment analysis and text analytics on Twitter social media feeds and market news using Scala and Python.
- Worked on real-time streaming and performed transformations on the data using Kafka and Spark Streaming.
- Used Oozie and ZooKeeper operational services for cluster coordination and workflow scheduling.
- Experience with Hadoop Streaming, writing MapReduce jobs in Perl and Python in addition to Java.
- Used HBase to store the majority of the data, which needs to be partitioned by region.
- Implemented a log producer in Scala that watches application logs, transforms incremental log entries, and sends them to a Kafka- and ZooKeeper-based log collection platform.
- Built the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of data sources using SQL and big data technologies such as Hadoop Hive and Azure Data Lake Storage.
- Expertise in Python and Scala; wrote user-defined functions (UDFs) for Hive and Pig in Python.
- Good knowledge of writing MapReduce jobs through Pig, Hive, and Sqoop.
- Moved data between cloud and on-premises Hadoop using DistCp and a proprietary ingest framework.
- Worked on NoSQL databases including HBase and Cassandra.
- Developed Pig Scripts, Pig UDFs and Hive Scripts, Hive UDFs to load data files.
- Experienced in development and support of Oracle, SQL, PL/SQL, and T-SQL queries.
- Worked on setting up and configuring AWS EMR clusters and used Amazon IAM to grant users fine-grained access to AWS resources.
- Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on data in HDFS.
- Experience in designing and implementing data structures and using common business intelligence tools for data analysis.
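A minimal sketch of the kind of Kafka producer referenced above; the kafka-python client, broker address, and topic name are assumptions for illustration:

```python
# Minimal Kafka producer sketch; the kafka-python client, broker address,
# and topic name ("app-events") are assumptions for illustration.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],                       # assumed broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",                                               # wait for full acknowledgement
    linger_ms=5,                                              # small batching window for throughput
)

def publish_event(event: dict) -> None:
    """Send one event to the hypothetical 'app-events' topic."""
    producer.send("app-events", value=event)

if __name__ == "__main__":
    publish_event({"user_id": 42, "action": "login"})
    producer.flush()  # ensure buffered records are delivered before exit
```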
TECHNICAL SKILLS
Hadoop/Spark Ecosystem: Hadoop, MapReduce, Pig, Hive/Impala, YARN, Kafka, Flume, Sqoop, Oozie, Zookeeper, Spark, Airflow, MongoDB, Cassandra, HBase, and Storm
Hadoop Distribution: Cloudera and Hortonworks
Programming Languages: PL/SQL, SQL, Python, Java, Scala
Databases: Oracle, MySQL, SQL Server, PostgreSQL, HBase, Snowflake, Cassandra, MongoDB
Operating Systems: Linux, Windows, Ubuntu, Unix
Analytics Tools: Tableau, Microsoft SSIS, SSAS and SSRS
Data Warehousing & BI: Star Schema, Snowflake schema, SAS, SSIS and Splunk
ETL Tools: Informatica, Talend, PowerCenter
Cloud Services: AWS, Azure
PROFESSIONAL EXPERIENCE
Confidential, Foster City, CA
Senior Big Data Engineer
Responsibilities:
- Extensively utilized Databricks notebooks for interactive analysis utilizing Spark APIs.
- Extensive experience working with SQL, with deep knowledge of T-SQL (MS SQL Server).
- Worked with the data science group on pre-processing and feature engineering, and helped move machine learning algorithms into production.
- Experience working with the Azure cloud platform (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Synapse, SQL DB, and SQL DWH).
- Used Azure Synapse to provide a unified experience to ingest, explore, prepare, manage, and serve data for immediate BI and machine learning needs.
- Worked on Kafka and Spark integration for real-time data processing.
- Developed a data pipeline using Kafka and Spark to store data into HDFS.
- Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with differing schemas into Hive ORC tables (see the sketch after this list).
- Experience in configuring, designing, implementing, and monitoring Kafka clusters and connectors.
- Used Azure Event Grid as a managed event service to easily route events across many different Azure services and applications.
- Managed resources and scheduling across the cluster using Azure Kubernetes Service (AKS).
- Performed data cleansing and applied transformations using Databricks and Spark data analysis.
- Extensive knowledge of data transformations, mapping, cleansing, monitoring, debugging, performance tuning, and troubleshooting of Hadoop clusters.
- Developed Spark Scala scripts for data mining and performed transformations on large datasets to deliver real-time insights and reports.
- Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using PySpark.
- Provided guidance to the development team working on PySpark as an ETL platform.
- Used Azure Databricks as a fast, easy, and collaborative Spark-based platform on Azure.
- Used Databricks to integrate easily with the whole Microsoft stack.
- Designed and automated custom-built input connectors using Spark, Sqoop, and Oozie to ingest and analyze data from RDBMS sources into Azure Data Lake.
- Involved in building an enterprise data lake using Data Factory and Blob Storage, enabling different teams to work with more complex scenarios and ML solutions.
- Used Azure Synapse to manage processing workloads and serve data for BI and predictions.
- Responsible for the design and deployment of Spark SQL scripts and Scala shell commands based on functional specifications.
- Prepared data for interactive Power BI dashboards and reporting.
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Created ADF pipelines using Linked Services, Datasets, and Pipelines to extract, transform, and load data to and from various sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and write-back tools.
- Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using SQL activities.
- Scripting on Linux and macOS platforms: Bash, GitHub, and the GitHub API.
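A minimal PySpark sketch of the CSV-to-Hive-ORC load described above; the landing path, column list, and target table are hypothetical, and schema differences are handled here by padding missing columns:

```python
# Minimal PySpark sketch of loading CSV files with differing schemas into a Hive ORC table.
# The landing path, column list, and target table are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = (
    SparkSession.builder
    .appName("csv_to_hive_orc")
    .enableHiveSupport()                         # lets saveAsTable target Hive-managed tables
    .getOrCreate()
)

# Read a batch of CSV files; header and type inference tolerate per-file differences.
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/landing/csv/*.csv")                   # assumed landing path
)

# Normalize to a common target schema by padding any missing columns with nulls.
expected_cols = ["customer_id", "event_ts", "amount"]   # illustrative column set
for c in expected_cols:
    if c not in df.columns:
        df = df.withColumn(c, lit(None))

(
    df.select(*expected_cols)
      .write
      .format("orc")
      .mode("append")
      .saveAsTable("analytics.events_orc")       # hypothetical Hive database.table
)
```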
Environment: Hadoop, Spark, Hive, Sqoop, HBase, Oozie, Talend, Kafka, Azure (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, AKS), Scala, Python, Cosmos DB, MS SQL, MongoDB, Ambari, Power BI, Azure DevOps, Microservices, K-Means, KNN, Ranger, Git
Confidential, Jersey City, NJ
Big Data Engineer
Responsibilities:
- Used Agile methodology in developing the application, which included iterative application development, weekly Sprints, stand up meetings and customer reporting backlogs.
- Created and managed cloud VMs with AWS EC2 Command line clients and AWS management console.
- Migrated on-premises database structures to the Confidential Redshift data warehouse; worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
- Monitored resources and applications using AWS CloudWatch, including creating alarms for metrics such as EBS, EC2, ELB, RDS, S3, and SNS, and configured notifications for alarms generated based on defined events.
- Solved performance issues in Hive and Pig scripts with an understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.
- Handled importing of data from various data sources, performed transformations using Hive, MapReduce, Spark and loaded data into HDFS.
- Designed and built ETL workflows, leading the effort to program data extraction from various sources into the Hadoop file system; implemented end-to-end ETL workflows using Teradata, SQL, TPT, and Sqoop and loaded data into Hive data stores.
- Analyzed and developed programs around the extract logic and data load type for Hadoop ingest processes, using tools such as Sqoop, Spark, Scala, Kafka, and Unix shell scripts.
- Designed the incremental and historical extract logic to load data from flat files into the Massive Event Logging Database (MELD) from various servers.
- Assisted with the analysis of data used for Tableau reports and the creation of dashboards.
- Designed and implemented large-scale distributed solutions in AWS.
- Wrote technical design documents based on the data mapping and functional details of the tables.
- Extracted batch and real-time data from DB2, Oracle, SQL Server, Teradata, and Netezza into Hadoop (HDFS) using Teradata TPT, Sqoop, Apache Kafka, and Apache Storm.
- Developed Apache Spark jobs for data cleansing and pre-processing.
- Used Scala to write programs for faster testing and processing of data.
- Wrote code and created Hive jobs to parse logs and structure them in tabular format to facilitate effective querying of the log data.
- Created functions and assigned roles in AWS Lambda to run Python scripts, and used Lambda with Java for event-driven processing; created Lambda jobs and configured roles using the AWS CLI (see the sketch after this list).
- Automated ETL tasks and data workflows for the ingest data pipeline through the UC4 scheduling tool.
- Experience in change implementation, monitoring, and troubleshooting of Snowflake databases on AWS and cluster-related issues.
- Deployment support including change management and preparation of deployment instructions.
- Developed UDFs in Java as needed for use in Pig and Hive queries.
- Automated cloud deployments using Chef, Python, and AWS CloudFormation templates.
- Optimized MapReduce jobs to use HDFS efficiently through various compression mechanisms.
- Used ORC and Parquet file formats in Hive.
- Developed efficient Pig and Hive scripts with joins on datasets using various techniques.
- Wrote documentation of program development, subsequent revisions, and coded instructions in the project's GitHub repository.
- Wrote Spark programs to improve the performance and optimization of existing Hadoop algorithms using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
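A minimal sketch of the kind of event-driven Python Lambda handler described above; the trigger is assumed to be an S3 ObjectCreated notification, and the processing step is a placeholder:

```python
# Minimal event-driven AWS Lambda handler in Python; assumes an S3 ObjectCreated trigger.
# The processing step is a placeholder (e.g., staging data for a Redshift COPY).
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Read each newly created S3 object and log a short summary."""
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        obj = s3.get_object(Bucket=bucket, Key=key)
        body = obj["Body"].read()

        # Placeholder for the actual event-driven processing.
        print(json.dumps({"bucket": bucket, "key": key, "size_bytes": len(body)}))

    return {"status": "ok", "records": len(records)}
```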
Environment: RHEL, HDFS, Map-Reduce, Hive, AWS, EC2, S3, Lambda, Redshift, Pig, Sqoop, Oozie, Teradata, Oracle SQL, UC4, Kafka, GitHub, Hortonworks data platform distribution, Spark, Scala.
Confidential, Raritan, NJ
Data Engineer
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop.
- Used Spark Streaming APIs to perform transformations and actions on the fly for building the common learner data model, which gets data from Kafka in near real time and persists it into Cassandra (see the sketch after this list).
- Used the Python NLTK toolkit for smart MMI interactions.
- Migrated an existing on-premises application to AWS; used AWS services such as EC2 and S3 for small-dataset processing and storage, and maintained the Hadoop cluster on AWS EMR.
- Involved in developing web services using SOAP for sending data to and receiving data from the external interface in XML format.
- Optimized existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
- Worked extensively with AWS components such as Airflow, Elastic MapReduce (EMR), Athena, and Snowflake.
- Experience writing Sqoop scripts for importing and exporting data between RDBMS and HDFS.
- Ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra for data access and analysis.
- Developed tools using Python, Shell scripting, XML to automate tasks.
- Enhanced the system by adding Python XML SOAP request/response handlers to add accounts, modify trades, and apply security updates.
- Implemented schema extraction for Parquet and Avro file Formats in Hive.
- Developed Hive scripts in Hive QL to de-normalize and aggregate the data.
- Used Spark API over Cloudera Hadoop Yarn to perform analytics on data in Hive.
- Developed Scala scripts using both Data frames/SQL and RDD/MapReduce in Spark for Data Aggregation, queries and writing data back into OLTP system through SQOOP.
- Experienced in performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
- Developed Python code for tasks, dependencies, SLA watchers, and time sensors for each job, enabling workflow management and automation with the Airflow tool.
- Worked on migrating Map Reduce programs into Spark transformations using Spark and Scala.
- Extracted data from Teradata into HDFS and dashboards using Spark Streaming.
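A sketch of the Kafka-to-Cassandra streaming flow described above, shown here with PySpark Structured Streaming; the broker, topic, schema, keyspace/table, and checkpoint path are hypothetical, and the Spark Cassandra Connector is assumed to be on the classpath:

```python
# Sketch of a Kafka -> Spark -> Cassandra flow using PySpark Structured Streaming.
# Broker, topic, schema, keyspace/table, and checkpoint path are hypothetical; the
# Spark Cassandra Connector package is assumed to be available to the Spark job.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka_to_cassandra").getOrCreate()

event_schema = StructType([
    StructField("learner_id", StringType()),
    StructField("activity", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # assumed broker
    .option("subscribe", "learner-events")               # assumed topic
    .load()
)

# Parse the Kafka value payload (JSON) into typed columns.
events = (
    raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
       .select("e.*")
)

def write_to_cassandra(batch_df, batch_id):
    # Append each micro-batch to a hypothetical Cassandra table via the connector.
    (batch_df.write
        .format("org.apache.spark.sql.cassandra")
        .option("keyspace", "learning")
        .option("table", "learner_events")
        .mode("append")
        .save())

query = (
    events.writeStream
    .foreachBatch(write_to_cassandra)
    .option("checkpointLocation", "/tmp/checkpoints/learner_events")
    .start()
)
query.awaitTermination()
```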
Environment: Hadoop, Spark, Python, RESTful, Flask, AWS, Hive, Sqoop, HDFS, Cloudera, MapReduce, Oracle, MySQL, PostgreSQL, Shell Scripting.
Confidential
Hadoop Developer
Responsibilities:
- Involved in creating Hive tables, loading data, and writing Hive queries that run internally as MapReduce jobs. Developed a custom file system plugin for Hadoop so it can access files on the data platform.
- Installed and configured Pig and wrote Pig Latin scripts.
- Developed shell scripts for running Hive scripts in Hive and Impala.
- Wrote MapReduce job using Pig Latin. Involved in ETL, Data Integration and Migration.
- Imported data using Sqoop to load data from Oracle to HDFS on a regular basis.
- Developed scripts and batch jobs to schedule various Hadoop programs.
- Wrote Hive queries for data analysis to meet business requirements.
- Imported and exported data between HDFS and Oracle Database using Sqoop.
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Collected the JSON data from HTTP Source and developed Spark APIs that helps to do inserts and updates in Hive tables.
- This plugin allows Hadoop MapReduce programs, HBase, Pig, and Hive to work unmodified and access files directly.
- Created Hive tables and worked on them using HiveQL; experienced in defining job flows.
- Designed and implemented a MapReduce-based, large-scale parallel relation-learning system.
- Handled Hive queries using Spark SQL integrated with the Spark environment (see the sketch after this list).
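A minimal PySpark sketch of running a Hive query through Spark SQL as described above; the database and table names are hypothetical:

```python
# Minimal PySpark sketch of running a Hive query through Spark SQL.
# The database and table names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive_on_spark_sql")
    .enableHiveSupport()      # reuse the Hive metastore so existing tables are queryable
    .getOrCreate()
)

# A HiveQL-style aggregation executed by the Spark SQL engine instead of MapReduce.
daily_counts = spark.sql("""
    SELECT dt, COUNT(*) AS events
    FROM raw_db.click_logs          -- hypothetical Hive table
    GROUP BY dt
    ORDER BY dt
""")

daily_counts.show(20, truncate=False)
```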
Environment: Hadoop, MapReduce, HDFS, Hive, Java, Hadoop distribution of Cloudera, Pig, HBase, Linux, XML, Eclipse, Oracle, PL/SQL, MongoDB, Toad.
Confidential
Data Warehouse Developer
Responsibilities:
- Involved in migrating PowerCenter folders from Development to Production Repository using Repository Manager.
- Modified several of the existing mappings and created several new mappings based on the user requirements.
- Used Informatica Repository Manager for managing all the repositories (development, test & validation), was also involved in migration of folders from one repository to another.
- Extracted data from sources such as Oracle and fixed-width and delimited flat files, transformed the data according to business requirements, and then loaded it into Oracle.
- Ran data loads into all environments using DAC (Data Warehouse Application Console), a centralized console providing access to the entire Siebel Data Warehouse application; it allows you to create, configure, and execute modular data warehouse applications in a parallel, high-performing environment.
- Extensively involved in testing by writing some QA procedures, for testing the target data against source data.
- Developed the design document for each ETL mapping, defining the source and target tables and all the fields, transformations and the join condition, which helped the users to better understand the type of data.
- Maintained existing mappings by resolving performance issues.
- Created Mappings using Mapping Designer to load the data from various sources using different transformations like Aggregator, Expression, Stored Procedure, External Procedure, Filter, Joiner, Lookup, Router, Sequence Generator, Source Qualifier, and Update Strategy transformations.
- Created Mapping Parameters and Variables.
- Produced a Unit Test Document, which captures the test conditions and scripts, expected/actual results.
Environment: Informatica PowerCenter, Siebel, DAC (Data Warehouse Application Console), HP-UX, Windows, Oracle 9i/10g, SQL, PL/SQL, SQL*Loader, TOAD, Erwin