Sr. Big Data Engineer Resume
New Jersey
SUMMARY
- Over 8 years of experience in Data Engineering, Data Pipeline Design, Development and Implementation, and Data Modeling.
- Strong experience in the Software Development Life Cycle (SDLC), including Requirements Analysis, Design Specification and Testing, in both Waterfall and Agile methodologies.
- Developed dataset processes for data modeling and data mining; recommended ways to improve data reliability, efficiency and quality.
- Expertise in using major components of the Hadoop ecosystem such as HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, Zookeeper and Hue.
- Good understanding of distributed systems, HDFS architecture, and the internal workings of the MapReduce and Spark processing frameworks.
- Experience in importing and exporting data between HDFS and relational database systems using Sqoop, and loading it into partitioned Hive tables.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
- Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data and to run the required validations on the data (a minimal sketch follows this summary).
- Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL solutions, and of data warehouse tools for reporting and data analysis.
- Experience in using Kafka and Kafka brokers to initialize the Spark context and process live streaming data.
- Extensive knowledge of RDBMSs such as Oracle, Microsoft SQL Server and MySQL, along with DevOps practices.
- Extensive experience working on various databases and database script development using SQL and PL/SQL.
- Excellent understanding and knowledge of job workflow scheduling and coordination tools/services like Oozie and Zookeeper.
- Experienced in working with Amazon Web Services (AWS), using EC2 for computing and S3 as a storage mechanism.
- Capable of using AWS utilities such as EMR, S3 and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
- Worked on Spark SQL, creating DataFrames by loading data from Hive tables, preparing the data and storing it in AWS S3.
- Hands-on experience in using other Amazon Web Services like Auto Scaling, Redshift, DynamoDB and Route 53.
- Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro files, JSON files and XML files. Skilled in using columnar file formats like RCFile, ORC and Parquet. Good understanding of compression techniques used in Hadoop processing, such as gzip, Snappy and LZO.
- Over 2 years of experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks and Azure SQL Data Warehouse, controlling and granting database access, and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity in the present project.
- Working experience with NoSQL databases like HBase, Azure Cosmos DB, MongoDB and Cassandra, covering both functionality and implementation.
- Good understanding and knowledge of NoSQL databases like MongoDB, Azure Cosmos DB, HBase and Cassandra, as well as PostgreSQL.
- Experience in extracting files from MongoDB through Sqoop, placing them in HDFS and processing them.
- Excellent programming skills with experience in Python, SQL and C programming.
- Worked with various programming languages using IDEs like Eclipse, NetBeans and IntelliJ, along with tools such as PuTTY and Git.
- Experienced in working with SDLC, Agile and Waterfall methodologies.
- Very strong interpersonal skills and the ability to work both independently and in a group; learns quickly and adapts easily to the working environment.
- Good exposure to interacting with clients and solving application environment issues; communicates effectively with people at different levels, including stakeholders, internal teams and senior management.
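A minimal PySpark sketch of the kind of DataFrame-based validation on Hive data described above; the table and column names are hypothetical and the session settings would depend on the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical table and column names, for illustration only.
spark = (SparkSession.builder
         .appName("hive-data-validation")
         .enableHiveSupport()
         .getOrCreate())

orders = spark.table("sales_db.orders")  # read a Hive table as a DataFrame

# Basic validations: null keys, duplicate keys, and out-of-range amounts.
null_keys = orders.filter(F.col("order_id").isNull()).count()
dup_keys = (orders.groupBy("order_id").count()
            .filter(F.col("count") > 1).count())
bad_amounts = orders.filter(F.col("amount") < 0).count()

print(f"null keys={null_keys}, duplicate keys={dup_keys}, negative amounts={bad_amounts}")
```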
TECHNICAL SKILLS
Big Data Technologies: Kafka, Cassandra, Apache Spark, Spark Streaming, Delta Lake, HBase, Impala, HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper
Hadoop Distribution: Cloudera CDH, Apache, AWS, Hortonworks HDP
Programming Languages: SQL, PL/SQL, Python, R, PySpark, Pig, HiveQL, Scala, Shell Scripting, Regular Expressions
Spark components: RDD, Spark SQL (Data Frames and Dataset), and Spark Streaming
Cloud Infrastructure: AWS, Azure
Databases: Oracle, Teradata, MySQL, SQL Server, NoSQL Databases (HBase, MongoDB)
Scripting & Query Languages: Shell scripting, SQL
Version Control: CVS, SVN, ClearCase, Git
Build Tools: Maven, SBT
Containerization Tools: Kubernetes, Docker, Docker Swarm
Reporting & Development Tools: JUnit, Eclipse, Visual Studio, NetBeans, Azure Databricks, CI/CD, Linux/UNIX, Google Cloud Shell, Power BI, SAS and Tableau
PROFESSIONAL EXPERIENCE
Confidential, New Jersey
Sr. Big Data Engineer
Responsibilities:
- Worked on migrating Python, PySpark, Hive, shell scripts from on-prem to AWS.
- Built an AWS EMR pipeline to migrate scripts from Hadoop (Cloudera) to AWS S3, Vertica and Snowflake.
- Responsible for Data Analysis, data governance and monitoring AWS pipelines.
- Built multiple ETL/ELT pipelines and deployed them into cloud environments using CI/CD pipelines.
- Expertise in working on PySpark and optimizing the scripts to make good use of resources.
- Expertise in developing Python and Bash scripts and deploying them to Git.
- Automated the tasks like fetching job status/backfilling data.
- Used orchestration tools including Oozie and created workflows to schedule jobs.
- Triggered pipelines ad hoc as part of support activities.
- Created Python scripts to read CSV, JSON and Parquet files from S3 buckets and load them into AWS S3 and Snowflake.
- Expertise in using Presto, Beeline, Impala (on-prem) and UNIX scripting.
- Utilized Spark SQL API in PySpark to extract and load data and perform complex SQL queries.
- Worked on Snowflake Schemas and Data Warehousing and processed batch and streaming data load pipeline using Snowpipe and Matillion from data lake Confidential AWS S3 bucket.
- Involved in building an information pipeline and performed analysis utilizing AWS stack (EMR, EC2, S3, RDS, Lambda, Glue, SQS, and Redshift).
- Created batch scripts to retrieve data from AWS S3 storage and perform the appropriate transformations in Scala using the Spark framework.
- Developed a Python script to transfer and extract data from on-premises systems to AWS S3 via REST APIs. Implemented a microservices-based cloud architecture using Spring Boot.
- Developed and programmed the ETL pipeline in Python to collect data from the Redshift data warehouse.
- Developed Spark applications in Python (PySpark) on a distributed environment to load a huge number of CSV files with differing schemas into Hive ORC tables (see the sketch after this list).
- Used Hive, created Hive tables, and was involved in data loading and writing Hive UDFs.
- Worked on creating tables in HQL and automating table creation in PROD via automated scripts.
- Transformed HiveQL to Spark scripts for most of the jobs.
- Scheduled cron jobs wherever required.
- Optimized Map/Reduce Jobs to use HDFS efficiently by using various compression mechanisms.
- Developed various UDFs in Map-Reduce and Python for Pig and Hive.
- Handled data integrity checks using Hive queries, Hadoop and Spark.
- Worked on performing transformations and actions on RDDs and Spark Streaming data with Scala.
- Responsible for loading data pipelines from web servers using Sqoop, Kafka and the Spark Streaming API.
- Used Spark for interactive queries, processing of streaming data and integration with popular NoSQL databases for huge volumes of data.
- Implemented Spark using Scala and SparkSQL for faster testing and processing of data.
- Data Processing: Processed data using MapReduce and YARN. Worked on Kafka as a proof of concept for log processing.
- Used GitHub repositories to maintain the code, versioning and delivering of the source code.
- Used visualization tools such as Power View in Excel and Tableau for creating dashboards, visualizing data and generating reports.
- Worked as a team player and individual contributor to deliver tasks.
- Maintained documentation for the team to track jobs and their respective tables.
- Involved in support activities to maintain pipelines and debug / fix any prod issues.
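A minimal sketch of the PySpark load pattern referenced above (CSV files read from S3 with an explicit schema and written into a partitioned Hive ORC table); the bucket, schema and table names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = (SparkSession.builder
         .appName("s3-csv-to-hive-orc")
         .enableHiveSupport()
         .getOrCreate())

# Explicit schema so CSV files with drifting columns fail loudly instead of
# silently corrupting the load. Field names are hypothetical.
schema = StructType([
    StructField("event_id", StringType(), True),
    StructField("event_ts", StringType(), True),
    StructField("amount", DoubleType(), True),
])

df = (spark.read
      .option("header", "true")
      .schema(schema)
      .csv("s3://example-bucket/landing/events/"))

# Derive a partition column and append into a partitioned Hive ORC table.
df = df.withColumn("load_date", F.to_date(F.substring("event_ts", 1, 10)))
(df.write
 .mode("append")
 .format("orc")
 .partitionBy("load_date")
 .saveAsTable("analytics_db.events_orc"))
```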
Environment: AWS (S3, EMR, AWS CLI), Azure, SQL, PySpark, HDFS, HiveQL, Impala, Oozie, Advanced Excel (creating formulas, pivot tables, HLOOKUP, VLOOKUP, Macros), Spark, Python, Power BI, Tableau, Presto, Hive/Hadoop, Snowflake, NoSQL, Data Pipeline, Data Analysis, Data Processing, ETL/ELT, Spark Streaming, Data Wrangling, Data Mining, Spark UDFs, JIRA, Git, Bash scripting.
Confidential, Detroit, MI
AWS Data Engineer
Responsibilities:
- Responsible for building scalable and distributed data solutions using Cloudera CDH. Worked on analyzing Hadoop cluster and different Big Data analytic tools including Spark, Hive, HDFS, Sqoop, Pig and Python.
- Developed Spark Streaming by consuming static and streaming data from different sources.
- Used Spark Streaming to stream data from external sources via the Kafka service and migrated an existing on-premises application to AWS. Used AWS services like EC2 for processing datasets and S3 for storing small datasets. Experienced in maintaining a Hadoop cluster on AWS EMR.
- Performed configuration, deployment, and support of cloud services in Amazon web services (AWS).
- Designed and developed functionality to get JSON document from MongoDB document store and send it to the client using RESTful web service. Implemented a Data interface to get information of customers using REST API and pre-process data using MapReduce and store it into HDFS.
- Built and configured a virtual data center in AWS cloud to support Enterprise Data Warehouse hosting including Virtual Private Cloud, Security Groups, and Elastic Load Balancer.
- Created and configured Snowflake warehouse strategy to move a terabyte of data from S3 into Snowflake via PUT scripts. Loaded data from AWS S3 bucket to Snowflake database.
- Utilized AWS services with a focus on big data analytics, enterprise data warehouse and business intelligence solutions to ensure optimal architecture, scalability and flexibility.
- Designed the AWS architecture, cloud migration, AWS EMR, DynamoDB, Redshift and event processing using Lambda functions.
- Developed Spark applications using Python and Spark SQL for faster processing of data. Developed Spark-based ingestion framework to ingest data from different sources into HDFS and then loaded data into Cassandra.
- Imported data from relational databases like MS SQL and Oracle into HDFS using Sqoop incremental imports.
- Implemented data ingestion and cluster handling for real-time processing using Kafka. Developed a data pipeline using Spark Streaming and Kafka to store data in HDFS, then used Spark to process the data in HDFS.
- Customized the BI tool for the manager team that performs query analytics using HiveQL.
- Involved in converting HiveQL into Spark transformations using Spark RDDs, Python and Scala. Created Hive UDFs to process business logic that varies based on policy.
- Performed data transformations in Hive and used partitions, buckets to improve performance.
- Developed ETL parsing and analytics using Python/Spark to build a structured data model in Elasticsearch for consumption by the API and UI.
- Developed ETL jobs using Spark-Python to migrate data from Oracle to Cassandra tables.
- Worked on loading data into Spark RDDs and performing advanced procedures like text analytics using Spark's in-memory computation capabilities to generate the output response.
- Designed and constructed AWS data pipelines using resources such as Lambda, SQS, SNS, S3 and EMR, receiving event notifications from S3 and writing them to Postgres.
- Involved in writing Java API for AWS Lambda to manage some of the AWS EMR clusters.
- Set up persistent storage in AWS using Elastic Block Store (EBS), S3 and Glacier; created volumes and configured snapshots for EC2 instances.
- Automated backups of the short-term data store to S3 buckets and EBS using the Amazon CLI, and created nightly AMIs of mission-critical production servers as backups.
- Used the AWS SDK (Boto3) and AWS CLI for data transfers to and from Amazon S3 buckets (see the sketch after this list).
- Worked with the Agile and Scrum software development frameworks for managing software development.
- Involved in setting up CI/CD pipelines using Jenkins. Worked along with the DevOps team and managed the Jenkins integration service with Puppet. Experience in creating GitLab repositories with specified branching strategies.
- Used SVN for version control and used JIRA to track issues.
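A minimal Boto3 sketch of the S3 transfers referenced above; the bucket, keys and local paths are hypothetical.

```python
import boto3

# Hypothetical bucket, keys and local paths, for illustration only.
s3 = boto3.client("s3")

# Upload a local extract to S3.
s3.upload_file("/tmp/daily_extract.csv", "example-bucket", "landing/daily_extract.csv")

# Download a processed file back from S3.
s3.download_file("example-bucket", "curated/summary.parquet", "/tmp/summary.parquet")

# List objects under a prefix to confirm the transfer.
resp = s3.list_objects_v2(Bucket="example-bucket", Prefix="landing/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```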
Environment: Cloudera, HDFS, MapReduce, Hive, Sqoop, Spark, AWS (EC2, EMR, IAM, S3, Lambda, Redshift, SQS, SNS), Python, Scala, Oracle, Cassandra, Snowflake, MS SQL, MongoDB, Agile, Jenkins, SVN, JIRA, GitLab
Confidential, Boise, ID
Data Engineer
Responsibilities:
- Created pipelines in ADF using Linked Services, Datasets and Pipelines to extract, transform and load data to and from different sources like Azure SQL, Blob Storage, Azure SQL Data Warehouse and a write-back tool.
- Strong experience leading multiple Azure big data and data transformation implementations.
- Implemented large Lambda architectures using Azure Data platform capabilities like Azure Data Lake, Azure Data Factory, Azure Data Catalog, HDInsight, Azure SQL Server, Azure ML and Power BI.
- Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
- Managed a hosted Kubernetes environment, making it quick and easy to deploy and manage containerized applications without container orchestration expertise.
- Undertook data analysis and collaborated with the downstream analytics team to shape the data according to their requirements.
- Used Azure Event Grid, a managed event service that makes it easy to manage events across many different Azure services and applications.
- Used Service Bus to decouple applications and services from each other, providing benefits such as load-balancing work across competing workers.
- Leveraged Delta Lake's scalable metadata handling and unified streaming and batch processing.
- Used Delta Lake time travel, as data versioning enables rollbacks, full historical audit trails and reproducible machine learning experiments.
- Used Delta Lake's merge, update and delete operations to enable complex use cases (see the sketch after this list).
- Used Azure Databricks for fast, easy and collaborative spark-based platform on Azure.
- Used Databricks to integrate easily with the whole Microsoft stack.
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Used Azure Data Catalog, which helps organize data assets and get more value from existing investments.
- Used Azure Synapse to bring data warehousing and big data analytics together with a unified experience to ingest, explore, prepare, manage and serve data for immediate BI and machine learning needs.
- Experienced in performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
- Created Build and Release for multiple projects (modules) in production environment using Visual Studio Team Services (VSTS).
- Designed and Developed Real time Stream processing Application using Spark, Kafka, Scala, and Hive to perform Streaming ETL and apply Machine Learning.
- Created partitioned and bucketed Hive tables in Parquet file format with Snappy compression, then loaded data into the Parquet Hive tables from Avro Hive tables.
- Involved in running all the Hive scripts through Hive, Impala and Hive on Spark, and some through Spark SQL.
- Used Azure Data Factory with the SQL API and MongoDB API to integrate data from MongoDB, MS SQL and cloud sources (Blob Storage, Azure SQL DB, Cosmos DB).
- Responsible for resolving the issues and troubleshooting related to performance of Hadoop cluster.
- Involved in designing and developing tables in HBase and storing aggregated data from Hive Table.
- Used Jira for bug tracking and Bitbucket to check in and check out code changes.
- Analyzed the SQL scripts and redesigned them using PySpark SQL for faster performance.
- Developed Spark applications in Python (PySpark) on a distributed environment to load a huge number of CSV files with differing schemas into Hive ORC tables.
- Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using PySpark.
- Provided guidance to the development team working on PySpark as an ETL platform.
- Utilized machine learning algorithms such as linear regression, multivariate regression, PCA, K-means, & KNN for data analysis.
- Used Apache Spark DataFrames, Spark SQL and Spark MLlib extensively, designing and developing POCs using Scala, Spark SQL and the MLlib library.
- Performed all necessary day-to-day Git support for different projects and was responsible for maintaining the Git repositories and access control strategies.
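A minimal sketch of the Delta Lake merge and time-travel usage referenced above, assuming the Delta Lake Python package is available on the Databricks/Spark cluster; the paths, join key and table layout are hypothetical.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Hypothetical mount paths and key column, for illustration only.
spark = SparkSession.builder.appName("delta-merge-demo").getOrCreate()

updates = spark.read.format("parquet").load("/mnt/landing/customer_updates")
target = DeltaTable.forPath(spark, "/mnt/delta/customers")

# Upsert: update matching rows, insert new ones.
(target.alias("t")
 .merge(updates.alias("u"), "t.customer_id = u.customer_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Time travel: read an older version of the table for audit or rollback checks.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/customers")
print(v0.count())
```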
Environment: Hadoop 2.x, Hive v2.3.1, Spark v2.1.3, Databricks, Lambda, Glue, Azure Event Grid, Azure Synapse Analytics, Azure Data Catalog, Service Bus, ADF, Delta Lake, Blob Storage, Cosmos DB, Python, PySpark, Java, Scala, SQL, Sqoop v1.4.6, Kafka
Confidential, Denver, CO
Big Data Engineer
Responsibilities:
- Experience in job management using the Fair Scheduler; developed job processing scripts using Oozie workflows.
- Used Spark and Hive to implement the transformations needed to join the daily ingested data to historical data.
- Used Spark Streaming APIs to perform the necessary transformations and actions on the fly to build the common learner data model, which gets its data from Kafka in near real time (see the sketch after this list).
- Knowledge of the Tableau administration tool for configuration, adding users, managing licenses and data connections, scheduling tasks and embedding views by integrating with other platforms.
- Developed dimensions and fact tables for data marts like the Monthly Summary and Inventory data marts, with dimensions such as Time, Services, Customers and Policies.
- Developed reusable transformations to load data from flat files and other data sources to the Data Warehouse.
- Assisted the operations support team with transactional data loads by developing SQL Loader and Unix scripts.
- Implemented a Python script to call the Cassandra REST API, performed transformations and loaded the data into Hive.
- Worked extensively in Python and built a custom ingestion framework.
- Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, effective and efficient joins, and transformations during the ingestion process itself.
- Experienced in writing live real-time processing jobs using Spark Streaming with Kafka.
- Created Cassandra tables to store data in various formats coming from different sources.
- Developed Spark scripts by using Scala shell commands as per the requirement.
- Used the Spark API over an EMR cluster (Hadoop YARN) to perform analytics on data in Hive.
- Developed Scala scripts and UDFs using DataFrames/SQL/Datasets and RDDs in Spark for data aggregation, queries and writing data back into the OLTP system through Sqoop.
- Experienced in performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames and pair RDDs.
- Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark.
- Developed logistic regression models (using R and Python) to predict subscription response rate based on customer variables like past transactions, response to prior mailings, promotions, demographics, interests and hobbies, etc.
- Created Tableau dashboards/reports for data visualization, reporting and analysis and presented them to the business.
- Created and managed groups, workbooks and projects, database views, data sources and data connections.
- Worked with the business development managers and other team members on report requirements based on existing reports/dashboards, timelines, testing and technical delivery.
- Designed and developed data integration programs in a Hadoop environment with the NoSQL data store Cassandra for data access and analysis.
- Generated custom SQL to verify the dependencies for the daily, weekly and monthly jobs.
- Using Nebula Metadata, registered business and technical datasets for the corresponding SQL scripts.
- Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats like text and CSV files.
- Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
- Closely involved in scheduling daily and monthly jobs with preconditions/postconditions based on requirements.
- Monitored the daily, weekly and monthly jobs and provided support in case of failures/issues.
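A minimal sketch of the Kafka-to-Spark Streaming pattern referenced above, using the direct DStream API that matches the Spark 1.6 line listed in the environment below; the broker addresses and topic name are hypothetical.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Hypothetical brokers and topic, for illustration only.
sc = SparkContext(appName="learner-model-stream")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["learner-events"],
    kafkaParams={"metadata.broker.list": "broker1:9092,broker2:9092"},
)

# Each record is a (key, value) pair; keep the value and do a per-batch count.
events = stream.map(lambda kv: kv[1])
events.count().pprint()

ssc.start()
ssc.awaitTermination()
```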
Environment: Hadoop YARN, Spark 1.6, Spark Streaming, Spark SQL, Scala, Kafka, Python, Hive, Sqoop 1.4.6, Impala, Tableau, Talend, Oozie, Control-M, Java, AWS S3, Oracle 12c, Linux
Confidential
Data Engineer
Responsibilities:
- Experience in building and architecting multiple data pipelines, including end-to-end ETL and ELT processes for data ingestion and transformation in GCP.
- Strong understanding of AWS components such as EC2 and S3.
- Implemented a continuous delivery pipeline with Docker and GitHub.
- Experience in fact/dimensional modeling (star schema, snowflake schema), transactional modeling and SCDs (slowly changing dimensions).
- Devised PL/SQL Stored Procedures, Functions, Triggers, Views, and packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.
- Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python.
- Worked with Google Cloud Functions in Python to load data into BigQuery for CSV files arriving in a GCS bucket (see the sketch after this list).
- Devised simple and complex SQL scripts to check and validate Dataflow in various applications.
- Used Hive to implement a data warehouse and stored data in HDFS. Stored data in Hadoop clusters set up in AWS EMR.
- Performed data preparation using Pig Latin to get the data into the required format.
- Used PCA to reduce dimensionality and compute eigenvalues and eigenvectors, and used OpenCV to analyze CT scan images to identify disease.
- Processed the image data through the Hadoop distributed system using MapReduce, then stored it in HDFS.
- Created Session Beans and controller Servlets for handling HTTP requests from Talend
- Used Git for version control with the data engineering team and data scientist colleagues.
- Developed and deployed data pipelines in clouds such as AWS and GCP.
- Performed data engineering functions: data extract, transformation, loading, and integration in support of enterprise data infrastructures - data warehouse, operational data stores and master data management
- Implemented Partitioning, Dynamic Partitions, Buckets in HIVE.
- Developed database management systems for easy access, storage and retrieval of data.
- Performed DB activities such as indexing, performance tuning, and backup and restore.
- Expertise in writing Hadoop Jobs for analyzing data using Hive QL (Queries), Pig Latin (Data flow language), and custom MapReduce programs in Java.
- Proficient in machine learning techniques (decision trees, linear/logistic regression) and statistical modeling; skilled in data visualization with libraries like Matplotlib and Seaborn.
- Hands-on experience with big data tools like Hadoop, Spark and Hive.
- Experience implementing machine learning back-end pipelines with Pandas and NumPy.
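A minimal sketch of the GCS-triggered Cloud Function pattern referenced above, loading a newly arrived CSV file into BigQuery; the dataset and table names are hypothetical.

```python
from google.cloud import bigquery

def load_csv_to_bq(event, context):
    """Background Cloud Function triggered by a new object in a GCS bucket."""
    client = bigquery.Client()
    uri = f"gs://{event['bucket']}/{event['name']}"

    # Hypothetical destination table, for illustration only.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    load_job = client.load_table_from_uri(
        uri, "analytics_dataset.raw_events", job_config=job_config
    )
    load_job.result()  # wait for the load job to finish
    print(f"Loaded {uri} into analytics_dataset.raw_events")
```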
Environment: GCP, BigQuery, GCS bucket, Cloud Shell, Docker, Kubernetes, AWS, Apache Airflow, Python, Pandas, Matplotlib, Seaborn, NumPy, ETL workflows, Scala, Spark
Confidential
Data & Reporting Analyst
Responsibilities:
- Performed data transformations like filtering, sorting, and aggregation using Pig
- Created Sqoop jobs to import data from SQL, Oracle and Teradata into HDFS.
- Created Hive tables to push the data to MongoDB.
- Wrote complex aggregation queries in MongoDB for report generation (see the sketch after this list).
- Developed Bash scripts to bring the TLOG file from the FTP server and then process it to load into Hive tables.
- Automated workflows using shell scripts and Control-M jobs to pull data from various databases into the Hadoop data lake.
- Extensively used the DB2 database to support SQL processing.
- Involved in story-driven Agile development methodology and actively participated in daily scrum meetings.
- Insert-overwrote the Hive data with HBase data daily to get fresh data every day, and used Sqoop to load data from DB2 into the HBase environment.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala; good experience using the Spark shell and Spark Streaming.
- Designed, developed and maintained Big Data streaming and batch applications using Storm.
- Created Hive, Phoenix, HBase tables and HBase integrated Hive tables as per the design using ORC file format and Snappy compression.
- Developed Oozie workflows for daily incremental loads, which get data from Teradata and import it into Hive tables.
- Developed scripts to run scheduled batch cycles using Oozie and present data for reports
- Worked on a POC for building a movie recommendation engine based on Fandango ticket sales data using Scala and Spark Machine Learning library.
- Developed a big data ingestion framework to process multi-TB data, including data quality checks and transformation, stored it in efficient formats like Parquet, and loaded it into Amazon S3 using the Spark Scala API.
- Implemented automation, traceability and transparency for every step of the process to build trust in data and streamline data science efforts using Python, Java, Hadoop Streaming, Apache Spark, Spark SQL, Scala, Hive and Pig.
- Designed highly efficient data model for optimizing large-scale queries utilizing Hive complex data types and Parquet file format.
- Performed data validation and transformation using Python and Hadoop streaming.
- Developed highly efficient Pig Java UDFs utilizing advanced concepts like the Algebraic and Accumulator interfaces to populate ADP Benchmarks cube metrics.
- Loaded data from different data sources (Teradata and DB2) into HDFS using Sqoop and loaded it into partitioned Hive tables.
- Created Sqoop jobs and Pig and Hive scripts for data ingestion from relational databases to compare with historical data.
- Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
- Developed Pig scripts to transform the data into a structured format, automated through Oozie coordinators.
- Used Splunk to capture, index and correlate real-time data in a searchable repository from which it can generate reports and alerts.
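A minimal PyMongo sketch of the kind of aggregation query used for report generation referenced above; the connection string, collection and field names are hypothetical.

```python
from pymongo import MongoClient

# Hypothetical connection string, collection and field names, for illustration only.
client = MongoClient("mongodb://localhost:27017")
orders = client["retail"]["orders"]

# Monthly sales per store, the kind of aggregate used for report generation.
pipeline = [
    {"$match": {"status": "COMPLETED"}},
    {"$group": {
        "_id": {"store": "$store_id", "month": {"$substr": ["$order_date", 0, 7]}},
        "total_sales": {"$sum": "$amount"},
        "order_count": {"$sum": 1},
    }},
    {"$sort": {"_id.month": 1, "total_sales": -1}},
]

for row in orders.aggregate(pipeline):
    print(row["_id"], row["total_sales"], row["order_count"])
```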
Environment: Hadoop, HDFS, Spark, Hive, Pig, Sqoop, Oozie, DB2, Java, Python, Oracle, SQL, Splunk, Unix, Shell Scripting.