
Sr. Big Data Engineer Resume


New Jersey

SUMMARY

  • Over 8 years of experience in Data Engineering, Data Pipeline Design, Development and Implementation, and Data Modeling.
  • Strong experience in the Software Development Life Cycle (SDLC), including Requirements Analysis, Design Specification and Testing, in both Waterfall and Agile methodologies.
  • Developed data set processes for data modeling and data mining; recommended ways to improve data reliability, efficiency and quality.
  • Expertise in using major Hadoop ecosystem components such as HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, Zookeeper and Hue.
  • Good understanding of distributed systems, HDFS architecture, Internal working details of MapReduce and Spark processing frameworks.
  • Experience in importing and exporting data using Sqoop between HDFS and relational database systems, and loading it into partitioned Hive tables.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
  • Used the Spark DataFrames API over the Cloudera platform to perform analytics on Hive data and used DataFrame operations to perform the required validations on the data (see the PySpark sketch after this list).
  • Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions and Data Warehouse tools for reporting and data analysis.
  • Experience in using Kafka and Kafka brokers to initiate the Spark context and process live streaming data.
  • Extensive knowledge of RDBMS such as Oracle, Microsoft SQL Server and MySQL, along with DevOps practices.
  • Extensive experience working on various databases and database script development using SQL and PL/SQL.
  • Excellent understanding and knowledge of job workflow scheduling and locking tools/services like Oozie and Zookeeper.
  • Experienced in working with Amazon Web Services (AWS) using EC2 for computing and S3 as storage mechanism.
  • Capable of using AWS utilities such as EMR, S3 and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
  • Worked on Spark SQL, created Dataframes by loading data from Hive tables and created prep data and stored in AWS S3.
  • Hands on experience in using other Amazon Web Services like Autoscaling, RedShift, DynamoDB, Route53.
  • Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro files, JSON files and XML files. Skilled in using columnar file formats like RC, ORC and Parquet. Good understanding of compression techniques used in Hadoop processing, such as Gzip, Snappy and LZO.
  • Over 2 years of experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the SQL Activity in the present project.
  • Working experience with NoSQL databases such as HBase, MongoDB, Cassandra and Azure Cosmos DB, as well as PostgreSQL, including their functionality and implementation.
  • Experience in extracting files from MongoDB through Sqoop, placing them in HDFS and processing them.
  • Excellent programming skills with experience in Python, SQL and C Programming.
  • Worked on various programming languages using IDEs like Eclipse, NetBeans and IntelliJ, along with tools such as PuTTY and Git.
  • Experienced in working in SDLC, Agile and Waterfall Methodologies.
  • Very strong interpersonal skills and the ability to work independently and within a group; can learn quickly and adapt easily to the working environment.
  • Good exposure to interacting with clients and solving application environment issues; can communicate effectively with people at different levels, including stakeholders, internal teams and senior management.
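
The bullets above mention converting Hive/SQL queries into Spark DataFrame transformations and validating the data. A minimal PySpark sketch of that pattern follows; the database, table and column names (sales_db.transactions, amount, txn_date) are illustrative placeholders, not actual project objects.

```python
# Minimal PySpark sketch: read a Hive table into a DataFrame and apply
# validation/aggregation logic of the kind described above.
# NOTE: sales_db.transactions and its columns are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-to-dataframe-sketch")
         .enableHiveSupport()              # needed to read managed Hive tables
         .getOrCreate())

df = spark.table("sales_db.transactions")

# Basic validation: drop null/non-positive amounts, normalize the date column
validated = (df
             .filter(F.col("amount").isNotNull() & (F.col("amount") > 0))
             .withColumn("txn_date", F.to_date("txn_date")))

# HiveQL-style aggregation expressed with DataFrame operations
daily_totals = (validated
                .groupBy("txn_date")
                .agg(F.sum("amount").alias("total_amount"),
                     F.count("*").alias("txn_count")))

daily_totals.write.mode("overwrite").saveAsTable("sales_db.daily_totals")
```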

TECHNICAL SKILLS

Big Data Technologies: Kafka, Cassandra, Apache Spark, Spark Streaming, Delta Lake, HBase, Impala, HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper

Hadoop Distribution: Cloudera CDH, Apache, AWS, Hortonworks HDP

Programming Languages: SQL, PL/SQL, Python, R, PySpark, Pig, HiveQL, Scala, Shell Scripting, Regular Expressions

Spark components: RDD, Spark SQL (Data Frames and Dataset), and Spark Streaming

Cloud Infrastructure: AWS, Azure

Databases: Oracle, Teradata, MySQL, SQL Server, NoSQL databases (HBase, MongoDB)

Scripting & Query Languages: Shell scripting, SQL

Version Control: CVS, SVN, ClearCase, Git

Build Tools: Maven, SBT

Containerization Tools: Kubernetes, Docker, Docker Swarm

Reporting & Development Tools: JUnit, Eclipse, Visual Studio, NetBeans, Azure Databricks, CI/CD, Linux/Unix, Google Cloud Shell, Power BI, SAS and Tableau

PROFESSIONAL EXPERIENCE

Confidential, New Jersey

Sr. Big Data Engineer

Responsibilities:

  • Worked on migrating Python, PySpark, Hive, shell scripts from on-prem to AWS.
  • Built AWS EMR pipeline to migrate scripts from Hadoop Cloudera to AWS S3, Vertica and Snowflake
  • Responsible for Data Analysis, data governance and monitoring AWS pipelines.
  • Built multiple ETL/ELT pipelines and deployed them into cloud environments using CI/CD pipelines.
  • Expertise in working on PySpark and optimizing the scripts to make good use of resources.
  • Expertise in developing Python and Bash scripts and deploying them to Git.
  • Automated tasks such as fetching job status and backfilling data.
  • Used orchestration tools including Oozie and created workflows to schedule jobs.
  • Triggered pipelines ad hoc as part of the support activities.
  • Created scripts in Python to read CSV, JSON and Parquet files from S3 buckets and load them into AWS S3 and Snowflake.
  • Expertise in using Presto, Beeline, Impala (on-prem) and UNIX scripting.
  • Utilized Spark SQL API in PySpark to extract and load data and perform complex SQL queries.
  • Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines using Snowpipe and Matillion from the Confidential data lake AWS S3 bucket.
  • Involved in building an information pipeline and performed analysis utilizing AWS stack (EMR, EC2, S3, RDS, Lambda, Glue, SQS, and Redshift).
  • Created batch scripts to retrieve data from AWS S3 storage and make the appropriate transformations in Scala using the Spark framework.
  • Developed a Python script using REST APIs to transfer and extract data from on-premises systems to AWS S3. Implemented a microservices-based cloud architecture using Spring Boot.
  • Developed and programmed an ETL pipeline in Python to collect data from the Redshift data warehouse.
  • Developed Spark applications in Python (PySpark) on a distributed environment to load a huge number of CSV files with different schemas into Hive ORC tables (see the sketch after this list).
  • Used Hive and created Hive tables and involved in data loading and writing Hive UDFs.
  • Worked on creating tables in HQL and automating table creation in PROD via automated scripts.
  • Transformed HiveQL to Spark scripts for most of the jobs.
  • Scheduled cron jobs wherever required.
  • Optimized Map/Reduce Jobs to use HDFS efficiently by using various compression mechanisms.
  • Developed various UDFs in Map-Reduce and Python for Pig and Hive.
  • Handled data integrity checks using Hive queries, Hadoop and Spark.
  • Worked on performing transformations & actions on RDDs and Spark Streaming data with Scala.
  • Responsible for loading Data pipelines from web servers using Sqoop, Kafka and Spark Streaming API.
  • Used Spark for interactive queries, processing of streaming data and integration with popular NoSQL database for huge volume of data.
  • Implemented Spark using Scala and SparkSQL for faster testing and processing of data.
  • Processed data using MapReduce and YARN. Worked on Kafka as a proof of concept for log processing.
  • Used GitHub repositories to maintain the code, versioning and delivery of the source code.
  • Used visualization tools such as Power View in Excel and Tableau for creating dashboards, visualizing data and generating reports.
  • Worked as a team player and individual contributor to deliver the tasks.
  • Maintained the documentation for the team to be able to track the jobs and respective tables.
  • Involved in support activities to maintain pipelines and debug / fix any prod issues.
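
As referenced above, here is a hedged PySpark sketch of loading CSV files from S3 into a partitioned Hive ORC table; the bucket path, table name and columns are assumptions for illustration only.

```python
# Hedged PySpark sketch: load CSV files from S3 into a partitioned Hive ORC table.
# The S3 path and target table are assumptions, not the actual project objects.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("s3-csv-to-hive-orc")
         .enableHiveSupport()
         .getOrCreate())

# Read raw CSV files landed in S3; schemas vary per batch, so infer them here
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("s3://example-landing-bucket/incoming/*.csv"))

# Light standardization before the warehouse load
cleaned = (raw
           .dropDuplicates()
           .withColumn("load_date", F.current_date()))

# Append into a Hive table stored as ORC, partitioned by load date
(cleaned.write
 .mode("append")
 .format("orc")
 .partitionBy("load_date")
 .saveAsTable("analytics.events_orc"))
```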

Environment: AWS (S3, EMR, AWS CLI), Azure, SQL, PySpark, HDFS, HiveQL, Impala, Oozie, Advanced Excel (formulas, pivot tables, HLOOKUP, VLOOKUP, macros), Spark, Python, Power BI, Tableau, Presto, Hive/Hadoop, Snowflake, NoSQL, Data Pipeline, Data Analysis, Data Processing, ETL/ELT, Spark Streaming, Data Wrangling, Data Mining, Spark UDFs, JIRA, Git, Bash scripting.

Confidential, Detroit, MI

AWS Data Engineer

Responsibilities:

  • Responsible for building scalable and distributed data solutions using Cloudera CDH. Worked on analyzing Hadoop cluster and different Big Data analytic tools including Spark, Hive, HDFS, Sqoop, Pig and Python.
  • Developed Spark Streaming by consuming static and streaming data from different sources.
  • Used Spark Streaming to stream data from external sources using Kafka service and migrated an existing on-premises application to AWS. Used AWS services like EC2 for processing datasets and S3 storing small datasets. Experienced in maintaining Hadoop cluster on AWS EMR.
  • Performed configuration, deployment, and support of cloud services in Amazon web services (AWS).
  • Designed and developed functionality to get JSON documents from the MongoDB document store and send them to the client using a RESTful web service. Implemented a data interface to get customer information using a REST API, pre-process the data using MapReduce and store it in HDFS.
  • Built and configured a virtual data center in AWS cloud to support Enterprise Data Warehouse hosting including Virtual Private Cloud, Security Groups, and Elastic Load Balancer.
  • Created and configured Snowflake warehouse strategy to move a terabyte of data from S3 into Snowflake via PUT scripts. Loaded data from AWS S3 bucket to Snowflake database.
  • Utilized AWS services with focus on big data analytics, enterprise data warehouse and business intelligence solutions to ensure optimal architecture, scalability, flexibility.
  • Designed AWS architecture, Cloud migration, AWS EMR, DynamoDB, Redshift and event processing using lambda function.
  • Developed Spark applications using Python and Spark SQL for faster processing of data. Developed Spark-based ingestion framework to ingest data from different sources into HDFS and then loaded data into Cassandra.
  • Imported data from relational databases like MS SQL and Oracle into HDFS using Sqoop incremental imports.
  • Implemented data ingestion and handled clusters in real-time processing using Kafka. Developed a data pipeline using Spark Streaming and Kafka to store data in HDFS, and used Spark to process the data in HDFS (see the streaming sketch after this list).
  • Customized the BI tool for the manager team that performs query analytics using HiveQL.
  • Involved in converting HiveQL into Spark transformations using Spark RDDs, Python and Scala. Created Hive UDFs to process business logic that varies based on policy.
  • Performed data transformations in Hive and used partitions, buckets to improve performance.
  • Developed ETL parsing and analytics using Python/Spark to build a structured data model in Elasticsearch for consumption by the API and UI.
  • Developed ETL jobs using Spark-Python to migrate data from Oracle to Cassandra tables.
  • Worked on loading data into Spark RDDs and performed advanced procedures like text analytics using the in-memory data computation capabilities of Spark to generate the output response.
  • Designed and constructed AWS data pipelines using various AWS resources such as Lambda, SQS, SNS, S3 and EMR, receiving event notifications from S3 and writing them to Postgres.
  • Involved in writing a Java API for AWS Lambda to manage some of the AWS EMR clusters.
  • Assisted with continuous storage in AWS using Elastic Block Storage, S3 and Glacier; created volumes and configured snapshots for EC2 instances.
  • Automated the backups for the short-term data store to S3 buckets and EBS using the Amazon CLI and created nightly AMIs for mission-critical production servers as backups.
  • Used the Amazon SDK (Boto3) and Amazon CLI for data transfers to and from Amazon S3 buckets.
  • Worked with Agile and Scrum software development framework for managing software development.
  • Involved in setting up CI/CD pipelines using Jenkins. Worked along with DevOps team and managed Jenkins integration service with Puppet. Experience in creating GitLab repositories with specified branching strategies.
  • Used SVN for version control and used JIRA to track issues.
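
As referenced above, a hedged sketch of the Kafka-to-HDFS pipeline using PySpark Structured Streaming follows; it assumes the spark-sql-kafka connector is available at submit time, and the broker addresses, topic name and paths are placeholders.

```python
# Hedged PySpark Structured Streaming sketch: Kafka topic -> Parquet files on HDFS.
# Broker addresses, topic name and paths are placeholders; requires the
# spark-sql-kafka connector package at submit time.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("kafka-to-hdfs-stream")
         .getOrCreate())

# Consume events from a Kafka topic
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "clickstream")
          .option("startingOffsets", "latest")
          .load())

# Kafka delivers key/value as binary; cast the payload to string for downstream parsing
parsed = events.select(
    F.col("key").cast("string").alias("key"),
    F.col("value").cast("string").alias("payload"),
    F.col("timestamp"))

# Persist micro-batches to HDFS as Parquet, with checkpointing for fault tolerance
query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/clickstream")
         .option("checkpointLocation", "hdfs:///checkpoints/clickstream")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()
```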

Environment: Cloudera, HDFS, MapReduce, Hive, Sqoop, Spark, AWS (EC2, EMR, IAM, S3, Lambda, Redshift, SQS, SNS), Python, Scala, Oracle, Cassandra, Snowflake, MS SQL, MongoDB, Agile, Jenkins, SVN, JIRA, GitLab

Confidential, Boise, ID

Data Engineer

Responsibilities:

  • Created pipelines in ADF using Linked Services/Datasets/Pipelines to extract, transform and load data from different sources like Azure SQL, Blob Storage, Azure SQL Data Warehouse and the write-back tool, and in the reverse direction.
  • Strong experience leading multiple Azure Big Data and data transformation implementations.
  • Implemented large Lambda architectures using Azure Data platform capabilities like Azure Data Lake, Azure Data Factory, Azure Data Catalog, HDInsight, Azure SQL Server, Azure ML and Power BI.
  • Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
  • Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the SQL Activity.
  • Managed host Kubernetes environment, making it quick and easy to deploy and manage containerized applications without container orchestration expertise.
  • Undertook data analysis and collaborated with the downstream analytics team to shape the data according to their requirements.
  • Used Azure Event Grid as a managed event service to easily route events across many different Azure services and applications.
  • Used Service Bus to decouple applications and services from each other, providing benefits like load-balancing work across competing workers.
  • Used Delta Lake, which offers scalable metadata handling and unification of streaming and batch workloads.
  • Used Delta Lake time travel, as data versioning enables rollbacks, full historical audit trails and reproducible machine learning experiments.
  • Used Delta Lake merge, update and delete operations to enable complex use cases (see the sketch after this list).
  • Used Azure Databricks as a fast, easy and collaborative Spark-based platform on Azure.
  • Used Databricks to integrate easily with the whole Microsoft stack.
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Used Azure Data Catalog, which helps organize data and get more value from existing investments.
  • Used Azure Synapse to bring data integration, enterprise data warehousing and big data analytics together with a unified experience to ingest, explore, prepare, manage and serve data for immediate BI and machine learning needs.
  • Experienced in performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
  • Created Build and Release for multiple projects (modules) in production environment using Visual Studio Team Services (VSTS).
  • Designed and Developed Real time Stream processing Application using Spark, Kafka, Scala, and Hive to perform Streaming ETL and apply Machine Learning.
  • Created Partitioned and Bucketed Hive tables in Parquet File Formats with Snappy compression and then loaded data into Parquet hive tables from Avro hive tables.
  • Involved in running all the Hive scripts through Hive, Impala and Hive on Spark, and some through Spark SQL.
  • Used Azure Data Factory, the SQL API and the MongoDB API, and integrated data from MongoDB, MS SQL and cloud sources (Blob, Azure SQL DB, Cosmos DB).
  • Responsible for resolving issues and troubleshooting related to the performance of the Hadoop cluster.
  • Involved in designing and developing tables in HBase and storing aggregated data from Hive Table.
  • Used Jira for bug tracking and Bitbucket to check in and check out code changes.
  • Analyzed the SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Developed Spark applications in Python (PySpark) on a distributed environment to load a huge number of CSV files with different schemas into Hive ORC tables.
  • Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using PySpark.
  • Provided guidance to the development team working on PySpark as an ETL platform.
  • Utilized machine learning algorithms such as linear regression, multivariate regression, PCA, K-means, & KNN for data analysis.
  • Used Apache Spark DataFrames, Spark SQL and Spark MLlib extensively, and developed and designed POCs using Scala, Spark SQL and the MLlib libraries.
  • Performed all necessary day-to-day Git support for different projects; responsible for maintenance of the Git repositories and the access control strategies.
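
As noted in the Delta Lake bullet above, here is a minimal PySpark sketch of a Delta merge (upsert) plus a time-travel read on Databricks; the storage paths, table layout and join key are assumptions for illustration.

```python
# Hedged Delta Lake sketch for Databricks: MERGE (upsert) plus a time-travel read.
# Storage paths and the customer_id join key are illustrative assumptions.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta-merge-sketch").getOrCreate()

# Incremental changes arriving from an upstream source (assumed location)
updates = spark.read.parquet(
    "abfss://landing@examplestore.dfs.core.windows.net/customer_changes/")

# Target Delta table previously created in the lake
target = DeltaTable.forPath(
    spark, "abfss://curated@examplestore.dfs.core.windows.net/customers/")

# Upsert: update matching rows, insert new ones
(target.alias("t")
 .merge(updates.alias("s"), "t.customer_id = s.customer_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Time travel: read an earlier version of the table for audits or rollback checks
previous = (spark.read.format("delta")
            .option("versionAsOf", 0)
            .load("abfss://curated@examplestore.dfs.core.windows.net/customers/"))
```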

Environment: Hadoop 2.x, Hive v2.3.1, Spark v2.1.3, Databricks, Lambda, Glue, Azure Event Grid, Azure Synapse Analytics, Azure Data Catalog, Service Bus, ADF, Delta Lake, Blob Storage, Cosmos DB, Python, PySpark, Java, Scala, SQL, Sqoop v1.4.6, Kafka

Confidential, Denver, CO

Big Data Engineer

Responsibilities:

  • Experience in job management using Fair Scheduler; developed job processing scripts using Oozie workflows.
  • Used Spark and Hive to implement the transformations needed to join the daily ingested data to historic data.
  • Used Spark Streaming APIs to perform the necessary transformations and actions on the fly to build the common learner data model, which gets the data from Kafka in near real time.
  • Knowledge in Tableau Administration Tool for Configuration, adding users, managing licenses and data connections, scheduling tasks, embedding views by integrating with other platforms.
  • Developed dimensions and fact tables for data marts like Monthly Summary, Inventory data marts with various Dimensions like Time, Services, Customers, and policies.
  • Developed reusable transformations to load data from flat files and other data sources into the Data Warehouse.
  • Assisted the operations support team with transactional data loads by developing SQL*Loader and Unix scripts.
  • Implemented a Python script to call the Cassandra REST API, performed transformations and loaded the data into Hive.
  • Extensively worked in Python and built a custom ingestion framework.
  • Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, effective and efficient joins, and transformations during the ingestion process itself (see the sketch after this list).
  • Experienced in writing live Real-time Processing using Spark Streaming with Kafka.
  • Created Cassandra tables to store various data formats of data coming from different sources.
  • Developed Spark scripts using Scala shell commands as per the requirement.
  • Used Spark API over EMR Cluster Hadoop YARN to perform analytics on data in Hive.
  • Developed Scala scripts and UDFs using DataFrames/SQL/Datasets and RDDs in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
  • Experienced in performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
  • Optimized existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames and Pair RDDs.
  • Performed advanced procedures like text analytics and processing using the in-memory computing capabilities of Spark.
  • Developed logistic regression models (using R programming and Python) to predict subscription response rate based on customer’s variables like past transactions, response to prior mailings, promotions, demographics, interests and hobbies, etc.
  • Created Tableau dashboards/reports for data visualization, Reporting and Analysis and presented it to Business.
  • Created/ Managed Groups, Workbooks and Projects, Database Views, Data Sources and Data Connections
  • Worked with the business development managers and other team members on report requirements based on existing reports/dashboards, timelines, testing and technical delivery.
  • Designed and developed data integration programs in a Hadoop environment with the NoSQL data store Cassandra for data access and analysis.
  • Generated custom SQL to verify the dependencies for the daily, weekly and monthly jobs.
  • Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts
  • Experienced in working with the Spark ecosystem using Spark SQL and Scala queries on different formats like text and CSV files.
  • Developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
  • Closely involved in scheduling daily and monthly jobs with preconditions/postconditions based on the requirement.
  • Monitored the daily, weekly and monthly jobs and provided support in case of failures/issues.
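
A small PySpark sketch of the join and partition tuning mentioned above (broadcasting a small dimension table and repartitioning before the write); the table names, partition count and output path are assumed values, not the project's actual objects.

```python
# Hedged PySpark sketch of join/partition tuning: broadcast a small dimension table
# and repartition before the write. Table names, partition count and output path
# are assumed values for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-tuning-sketch").getOrCreate()

daily = spark.table("warehouse.daily_ingest")    # large, freshly ingested dataset
policies = spark.table("warehouse.policy_dim")   # small dimension table

# Broadcasting the small side avoids shuffling the large dataset for the join
joined = daily.join(broadcast(policies), "policy_id", "left")

# Cache only if the result is reused by multiple downstream actions
joined.cache()

# Repartition on the join key to control parallelism and output file sizes
(joined.repartition(200, "policy_id")
 .write.mode("overwrite")
 .format("parquet")
 .save("hdfs:///data/curated/daily_joined"))
```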

Environment: Hadoop YARN, Spark 1.6, Spark Streaming, Spark SQL, Scala, Kafka, Python, Hive, Sqoop 1.4.6, Impala, Tableau, Talend, Oozie, Control-M, Java, AWS S3, Oracle 12c, Linux

Confidential

Data Engineer

Responsibilities:

  • Experience in building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP.
  • Strong understanding of AWS components such as EC2 and S3.
  • Implemented a Continuous Delivery pipeline with Docker and GitHub.
  • Experience in fact dimensional modeling (Star schema, Snowflake schema), transactional modeling and SCD (Slowly changing dimension)
  • Devised PL/SQL Stored Procedures, Functions, Triggers, Views, and packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.
  • Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python.
  • Worked with Google Cloud Functions in Python to load data into BigQuery for CSV files arriving in a GCS bucket (see the sketch after this list).
  • Devised simple and complex SQL scripts to check and validate Dataflow in various applications.
  • Used Hive to implement a data warehouse and stored data in HDFS. Stored data in Hadoop clusters set up in AWS EMR.
  • Performed data preparation using Pig Latin to get the right data format needed.
  • Used PCA to reduce dimensionality and compute eigenvalues and eigenvectors, and used OpenCV to analyze the CT scan images to identify the disease in the CT scans.
  • Processed the image data through the Hadoop distributed system using MapReduce, then stored it in HDFS.
  • Created Session Beans and controller Servlets for handling HTTP requests from Talend
  • Used Git for version control with Data Engineer team and Data Scientists colleagues.
  • Developed and deployed data pipeline in cloud such as AWS and GCP
  • Performed data engineering functions: data extract, transformation, loading, and integration in support of enterprise data infrastructures - data warehouse, operational data stores and master data management
  • Implemented Partitioning, Dynamic Partitions, Buckets in HIVE.
  • Developed database management systems for easy access, storage and retrieval of data.
  • Performed DB activities such as indexing, performance tuning, and backup and restore.
  • Expertise in writing Hadoop Jobs for analyzing data using Hive QL (Queries), Pig Latin (Data flow language), and custom MapReduce programs in Java.
  • Proficient in machine learning techniques (Decision Trees, Linear/Logistic Regression) and statistical modeling; skilled in data visualization with libraries like Matplotlib and Seaborn.
  • Hands-on experience with big data tools like Hadoop, Spark and Hive.
  • Experience implementing machine learning back-end pipelines with Pandas and NumPy.
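
As referenced above, here is a hedged sketch of a Python Cloud Function that loads a CSV arriving in a GCS bucket into BigQuery; the project, dataset and table IDs are hypothetical, and the function assumes a storage object-finalize trigger.

```python
# Hedged sketch of a background Cloud Function (Python) that loads a CSV file
# arriving in a GCS bucket into BigQuery. The target table ID is hypothetical.
from google.cloud import bigquery

def load_csv_to_bigquery(event, context):
    """Triggered by a GCS object-finalize event; 'event' carries bucket and name."""
    client = bigquery.Client()

    uri = f"gs://{event['bucket']}/{event['name']}"      # file that just landed
    table_id = "example-project.analytics.raw_events"    # assumed target table

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,     # header row
        autodetect=True,         # infer the schema from the file
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # wait so failures surface in the function logs
```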

Environment: GCP, BigQuery, GCS Bucket, Cloud Shell, Docker, Kubernetes, AWS, Apache Airflow, Python, Pandas, Matplotlib, Seaborn, NumPy, ETL workflows, Scala, Spark

Confidential

Data & Reporting Analyst

Responsibilities:

  • Performed data transformations like filtering, sorting, and aggregation using Pig
  • Created Sqoop jobs to import data from SQL, Oracle and Teradata into HDFS.
  • Created Hive tables to push the data to MongoDB.
  • Wrote complex aggregate queries in MongoDB for report generation (see the sketch after this list).
  • Developed Bash scripts to bring the TLOG files from the FTP server and then process them to load into Hive tables.
  • Automated workflows using shell scripts and Control-M jobs to pull data from various databases into the Hadoop Data Lake.
  • Extensively used the DB2 database to support the SQL workload.
  • Involved in story-driven Agile development methodology and actively participated in daily scrum meetings.
  • Insert-overwrote the Hive data with HBase data daily to get fresh data every day, and used Sqoop to load data from DB2 into the HBase environment.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala; good experience in using Spark Shell and Spark Streaming.
  • Designed, developed and maintained Big Data streaming and batch applications using Storm.
  • Created Hive, Phoenix and HBase tables and HBase-integrated Hive tables as per the design, using the ORC file format and Snappy compression.
  • Developed Oozie workflows for daily incremental loads, which get data from Teradata and then import it into Hive tables.
  • Developed scripts to run scheduled batch cycles using Oozie and present data for reports
  • Worked on a POC for building a movie recommendation engine based on Fandango ticket sales data using Scala and Spark Machine Learning library.
  • Developed a big data ingestion framework to process multi-TB data, including data quality checks and transformation, stored it in efficient storage formats like Parquet and loaded it into Amazon S3 using the Spark Scala API and Spark.
  • Implemented automation, traceability and transparency for every step of the process to build trust in data and streamline data science efforts using Python, Java, Hadoop Streaming, Apache Spark, Spark SQL, Scala, Hive and Pig.
  • Designed highly efficient data model for optimizing large-scale queries utilizing Hive complex data types and Parquet file format.
  • Performed data validation and transformation using Python and Hadoop streaming.
  • Developed highly efficient Pig Java UDFs utilizing advanced concepts like the Algebraic and Accumulator interfaces to populate ADP Benchmarks cube metrics.
  • Loaded the data from different data sources (Teradata and DB2) into HDFS using Sqoop and loaded it into partitioned Hive tables.
  • Created Sqoop jobs and Pig and Hive scripts for data ingestion from relational databases to compare with historical data.
  • Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
  • Developed Pig scripts to transform the data into a structured format, automated through Oozie coordinators.
  • Used Splunk to capture, index and correlate real-time data in a searchable repository from which it can generate reports and alerts.
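
As referenced above, a brief PyMongo sketch of the kind of aggregate query used for report generation; the connection string, collection and field names are assumptions for the example.

```python
# Hedged PyMongo sketch of an aggregate query for report generation.
# Connection string, collection and field names are assumptions for the example.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
transactions = client["retail"]["tlog_transactions"]   # hypothetical db/collection

# Daily sales totals per store, sorted by revenue
pipeline = [
    {"$match": {"status": "COMPLETED"}},
    {"$group": {
        "_id": {"store_id": "$store_id", "day": "$transaction_date"},
        "total_sales": {"$sum": "$amount"},
        "transactions": {"$sum": 1},
    }},
    {"$sort": {"total_sales": -1}},
]

for row in transactions.aggregate(pipeline):
    print(row["_id"], row["total_sales"], row["transactions"])
```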

Environment: Hadoop, HDFS, Spark, Hive, Pig, Sqoop, Oozie, DB2, Java, Python, Oracle, SQL, Splunk, Unix, Shell Scripting.
