
GCP Data Engineer Resume


AZ

SUMMARY

  • 10+ years of IT experience in the design, development, maintenance, and support of big data applications.
  • Exposure to Spark, Spark Streaming, Spark MLlib, and Scala, including creating and handling DataFrames in Spark with Scala.
  • Hands-on experience with Spark SQL queries and DataFrames: importing data from data sources, performing transformations and read/write operations, and saving the results to an output directory in HDFS.
  • Hands-on experience in developing Spark applications using Spark components such as RDD transformations, Spark Core, Spark MLlib, Spark Streaming, and Spark SQL.
  • Hands-on experience in installing and configuring Cloudera Apache Hadoop ecosystem components such as Flume, HBase, ZooKeeper, Oozie, Hive, Sqoop, and Pig.
  • Imported data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and moved data between MySQL and HDFS in both directions using Sqoop.
  • Stored data in AWS S3, used similarly to HDFS, and ran EMR jobs on the stored data.
  • Wrote Azure PowerShell scripts to copy or move data from the local file system to HDFS Blob storage.
  • Implemented a log producer in Scala that watches for application logs, transforms incremental logs, and sends them to a Kafka- and ZooKeeper-based log collection platform.
  • Developed Python code to gather data from HBase and designed the solution for implementation in Spark.
  • Experience with job workflow scheduling and monitoring tools such as Oozie, and good knowledge of ZooKeeper.
  • Hands-on experience in installing, configuring, and using Hadoop ecosystem components such as Hadoop MapReduce, HDFS, HBase, Hive, Sqoop, Pig, ZooKeeper, and Flume.
  • Strong experience and knowledge of real time data analytics using Spark, Kafka and Flume.
  • Hands-on experience in capturing data from existing relational databases (Oracle, MySQL, SQL and Teradata) that provide SQL interfaces using Sqoop.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
  • Experience in creating Docker containers leveraging existing Linux containers and AMIs, in addition to creating Docker containers from scratch.
  • Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
  • Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
  • Implemented a Continuous Delivery pipeline with Docker, GitHub, and AWS.
  • Implemented techniques with Spark for improving the performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and Spark on YARN.
  • Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS.
  • Extensive usage of Azure Portal, Azure PowerShell, Storage Accounts, Certificates and Azure Data Management.
  • Virtualized the servers using Docker for test and dev environment needs, and automated configuration using Docker containers.
  • Developed Scala scripts using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark for data aggregation, queries, and writing data back into the OLTP system through Sqoop.
  • Productionized models in cloud environments, including automated processes and CI/CD pipelines.
  • Worked with teams in setting up AWS EC2 instances using different AWS services such as S3, EBS, Elastic Load Balancer, Auto Scaling groups, VPC subnets, and CloudWatch.
  • Worked with NoSQL databases such as HBase, Cassandra, DynamoDB (AWS), and MongoDB.
  • Created and maintained various Shell and Python scripts for automating various processes and optimized Map Reduce code, pig scripts and performance tuning and analysis.
  • Developed workflows in Oozie and Airflow to automate the tasks of loading data into HDFS and pre-processing with Pig and Hive.
  • Developed cloud services, including Jenkins and Nexus on Docker, using Terraform.
  • Developed workflow in Oozie to automate the tasks of loading the data into HDFS and processing.
  • Designed and Developed database architecture to create and enhance the Enterprise Applications.
  • Actively worked with the business team to gather and fully understand the business requirements.
  • Used programming languages to create complex functions for use in databases.
  • Thorough knowledge of Software Development Life Cycle (SDLC) with deep understanding of various phases like Requirements gathering, Analysis, Design, Development and Testing.

TECHNICAL SKILLS

Big Data Ecosystem: Hadoop, MapReduce, Pig, Hive, Flink, YARN, Kafka, Flume, Sqoop, Impala, CI/CD, Oozie, Zookeeper, Spark 2.0, Ambari, Mahout, MongoDB, Cassandra, Avro, Storm, Parquet and Snappy.

Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, MapR and Apache

Languages: Java, Python, JRuby, SQL, HTML, DHTML, Scala, JavaScript, XML and C/C++

NoSQL Databases: Cassandra, MongoDB and HBase

Java Technologies: Servlets, JavaBeans, JSP, JDBC, JNDI, EJB and Struts

XML Technologies: XML, XSD, DTD, JAXP (SAX, DOM), JAXB

Development Methodology: Agile, Waterfall

Development / Build Tools: Eclipse, Ant, Maven, IntelliJ, JUnit and Log4j

Frameworks: Struts, Spring and Hibernate

App/Web servers: WebSphere, WebLogic, JBoss and Tomcat

DB Languages: MySQL, PL/SQL, PostgreSQL and Oracle

Cloud Technologies: AWS, Azure

PROFESSIONAL EXPERIENCE

Confidential, AZ

GCP Data Engineer

Responsibilities:

  • Collaborated with Business Analysts and SMEs across departments to gather business requirements and identify workable items for further development.
  • Partnered with ETL developers to ensure that data is well cleaned and the data warehouse is up to date for reporting purposes, using Pig.
  • Selected and generated data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and stored the data in AWS Redshift.
  • Built reports for monitoring data loads into GCP and driving reliability at the site level.
  • Migrated Hive and MapReduce jobs from on-premises MapR to the AWS cloud using EMR and Qubole.
  • Configured Ansible to manage AWS environments and automate the build process for core AMIs used by all application deployments, including Auto Scaling and CloudFormation scripts.
  • Implemented a production-ready, load-balanced, highly available, fault-tolerant Kubernetes infrastructure.
  • Managed Kubernetes charts using Helm. Created reproducible builds of the Kubernetes applications, managed Kubernetes manifest files, and managed releases of Helm packages.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
  • Added authentication to the Flink dashboard using Nginx to ensure that Flink jobs run safely.
  • Created Talend ETL jobs for data services with Talend Open Studio (version 6.5.1) using ETL methodologies and best practices.
  • Used Ansible to manage systems configuration to facilitate interoperability between existing infrastructure and new infrastructure in alternate physical data centres or the cloud (AWS).
  • Involved in Sqoop implementation, which helps in loading data from various RDBMS sources to Hadoop systems and vice versa.
  • Developed multi-cloud strategies to make better use of GCP (for its PaaS) and Azure (for its SaaS).
  • Hands-on experience with Google Cloud Platform (GCP) big data products: BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer.
  • Solely responsible for the Voice server (dynamic rendering IVR using speech recognition, Text-to-Speech, and high availability).
  • Automated data loading into the Hadoop Distributed File System (HDFS) with Oozie.
  • Led the team in developing real-time streaming applications using PySpark, Apache Flink, Kafka, and Hive on a distributed Hadoop cluster.
  • Educated developers on how to commit their work and how to make use of the CI/CD pipelines that are in place.
  • Developed a queryable state for Flink in Scala to query streaming data, enriching the functionality of the framework.
  • Configured and monitored distributed and multi-platform servers using Nagios and Splunk.
  • Involved in migrating the on-premises Hadoop system to GCP (Google Cloud Platform).
  • Created a private cloud using Kubernetes that supports DEV, TEST, and PROD environments.
  • Hands-on experience in installing, configuring, and using Hadoop ecosystem components such as Hadoop MapReduce, HDFS, HBase, Hive, Sqoop, Pig, ZooKeeper, and Flume.
  • Strong experience in writing scripts using Python API, PySpark API and Spark API for analyzing the data.
  • Working knowledge of Kubernetes in GCP; created new monitoring techniques using Stackdriver's log router and designed reports in Data Studio.
  • Helped individual teams set up their repositories in Bitbucket, maintain their code, and set up jobs that make use of the CI/CD environment.
  • Designed and developed the Voice and Synchronization server and developed the mobile application clients on Windows CE and J2ME.
  • Experience in using Kafka and Kafka brokers to initiate the Spark context and process live streaming data.
  • Used PySpark and Pandas to calculate the moving average and RSI score of stocks and loaded the results into the data warehouse.
  • Experience with other Hadoop ecosystem tools such as ZooKeeper, Oozie, and Impala.
  • Experience in working with MapReduce programs, Pig scripts, and Hive commands to deliver the best results.
  • Migrated previously written cron jobs to Airflow/Composer in GCP.
  • Created DataStage jobs using different stages such as Transformer, Aggregator, Sort, Join, Merge, Lookup, Data Set, Funnel, Remove Duplicates, Copy, Modify, Filter, Change Data Capture, Change Apply, Sample, Surrogate Key, Column Generator, Row Generator, etc.
  • Proficient in big data tools such as Hive and Spark (Java) and the relational data warehouse tool Teradata.
  • Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, and Python with a broad variety of machine learning methods, including classification, regression, and dimensionality reduction.
  • Expertise in writing Scala code using higher-order functions for iterative algorithms in Spark, with performance in mind.
  • Worked on Natural Language Processing with the NLTK module of Python and developed NLP models for sentiment analysis.
  • Set up full CI/CD pipelines so that each commit a developer makes goes through the standard software lifecycle process and is tested thoroughly before it reaches production.
  • Experience in GCP Dataproc, GCS, Cloud Functions, BigQuery, Azure Data Factory, and Databricks.
  • Configured ZooKeeper, worked on Hadoop High Availability with the ZooKeeper failover controller, and added support for a scalable, fault-tolerant data solution.
  • Built data pipelines in Airflow in GCP for ETL-related jobs using different Airflow operators.
  • Designed and developed Flink pipelines to consume streaming data from Kafka and applied business logic to massage, transform, and serialize raw data.
  • Set up a data preprocessing pipeline to guarantee consistency between the training data and newly arriving data.
  • Responsible for building scalable distributed data solutions using Hadoop; involved in job management using the Fair Scheduler and developed job processing scripts using Oozie workflows.
  • Explored Spark to improve the performance and optimization of the existing algorithms in Hadoop using Spark context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
  • Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for small data sets processing and storage.
  • Experience in Oracle Cloud Infrastructure (OCI), migrating Oracle E-Business Suite R12 to OCI (lift and shift).
  • Worked on Scala for implementing Spark (Java) machine learning libraries and Spark Streaming.
  • Experience in developing enterprise-level solutions using batch processing and streaming frameworks (Spark Streaming, Apache Kafka, and Apache Flink).
  • Developed Pig Scripts, Pig UDFs and Hive Scripts, Hive UDFs to load data files.
  • Developed MapReduce programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables in the EDW.
  • Used Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Python and NoSQL databases such as HBase and Cassandra.
  • Experience in Oracle Cloud Services (OCS, JCS, ICS, IoT, PaaS, IaaS).
  • Tuned DataStage transformations and jobs to enhance their performance.
  • Involved in integration of Hadoop cluster with spark engine to perform BATCH and GRAPHX operations.
  • Developed individual and batch jobs in Talend and Tibco BW; designed, developed, validated, and deployed the Talend ETL processes to the Oracle Data Warehouse (ODW).
  • Responsible for importing data from Postgres to HDFS and Hive using Sqoop.
  • Experienced in migrating HiveQL into Impala to minimize query response time.
  • Implemented Avro and Parquet data formats for Apache Hive computations to handle custom business requirements.
  • Experience in Oracle Cloud Infrastructure (OCI), migrating Oracle Fusion Middleware to OCI.
  • Implemented the Spark Scala code for Data Validation in Hive
  • Imported data from AWS S3 into Spark RDD, Performed transformations and actions on RDD's.
  • Utilized SQOOP, Kafka, Flume and Hadoop File system APIs for implementing data ingestion pipelines
  • Defined process flow using DataStage job sequences.
  • Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer
  • Configured ZooKeeper to coordinate and support Kafka, Spark, Spark Streaming, HBase, and HDFS.
  • Experienced in developing MapReduce programs using Apache Hadoop for working with big data.
  • Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
  • Used the AWS CLI to suspend an AWS Lambda function and to automate backups of ephemeral data stores to S3 buckets and EBS.
  • Worked with Oracle Cloud Infrastructure Classic (OCI-C): managed VM instance orchestrations and their associated storage and networking resources, OCI-C My Services, the Dashboard, and the environment consoles.
  • Used HBase/Phoenix to support front-end applications that retrieve data using row keys.
  • Boosted the performance of regression models by applying polynomial transformation and feature selection, and used those methods to select stocks.
  • Developed custom Kafka producers and consumers for publishing and subscribing to Kafka topics.
  • Developed a data ingestion application to bring data from the source system to HBase using Spark Streaming and Kafka.
  • Captured data daily from OLTP systems and various XML, Excel, and CSV sources and loaded it into Talend ETL tools.
  • Generated report on predictive analytics using Python and Tableau including visualizing model performance and prediction results.
  • Developed shared container jobs and template DataStage jobs.
  • Utilized Agile and Scrum methodology for team and project management.
  • Used Git for version control with colleagues.
  • Expertise in using Docker to run and deploy the applications in multiple containers like Docker Swarm and Docker Wave.
  • Developed complex Talend ETL jobs to migrate the data from flat files to database. Pulled files from mainframe into Talend execution server using multiple ftp components.
  • Used Sqoop to channel data from different sources of HDFS and RDBMS
  • Involved in extensive data validation by writing several complex SQL queries and Involved in back-end testing and worked with data quality issues.
  • Architected and designed serverless application CI/CD using the AWS Serverless Application Model (Lambda).
  • Developed stored procedures and views in Snowflake and used them in Talend for loading dimensions and facts.
  • Developed merge scripts to UPSERT data into Snowflake from an ETL source.

Environment: HDFS, Hive, Spark, Kafka, Linux, Python, NumPy, Pandas, Tableau, GitHub, AWS EMR/EC2/S3/Redshift, Lambda, Pig, MapReduce, Cassandra, Snowflake, Unix, Shell Scripting, Git.
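
Illustration of the moving average and RSI calculation mentioned above: a minimal sketch in plain Python, not the PySpark and Pandas code used on the project, which is not shown here. The function names, the 14-period default, and the simple-average form of RSI (rather than Wilder's smoothed form) are assumptions made for the example.

```python
def moving_average(prices, window):
    """Simple moving average: one value for each full window of prices."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(prices[i - window:i]) / window
        for i in range(window, len(prices) + 1)
    ]


def rsi(prices, period=14):
    """Relative Strength Index over the last `period` price changes.

    Assumption: uses simple averages of gains and losses, not
    Wilder's exponential smoothing.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    if len(prices) < period + 1:
        raise ValueError("need at least period + 1 prices")

    recent = prices[-(period + 1):]
    changes = [recent[i + 1] - recent[i] for i in range(period)]

    avg_gain = sum(c for c in changes if c > 0) / period
    avg_loss = sum(-c for c in changes if c < 0) / period

    if avg_loss == 0:
        # No losses in the window: RSI is at its maximum by convention.
        return 100.0
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)
```

In a Pandas or PySpark setting the same logic would typically be expressed with rolling or window functions over a price column instead of Python lists.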

Confidential, Foster City, CA

GCP Data Engineer

Responsibilities:

  • Developed the Crawlers Java ETL framework to extract data from Cerner clients' databases and ingest it into HDFS and HBase for long-term storage.
  • Created Oozie workflows to manage the execution of the Crunch jobs and Vertica pipelines.
  • Experience developing Kafka producers and consumers for streaming millions of events per second.
  • Set up Azure infrastructure such as storage accounts, integration runtimes, service principal IDs, and app registrations to enable scalable and optimized support for business users' analytical requirements in Azure.
  • Used Zookeeper to provide coordination services to the cluster.
  • Installed Oozie workflow engine to run multiple Hive and Pig Jobs.
  • Analyzed the SQL scripts and designed the solution to implement using Scala.
  • Utilized Waterfall methodology for team and project management.
  • Used ZooKeeper in the cluster to provide concurrent access to Hive tables with shared and exclusive locking.
  • Created PySpark code that uses Spark SQL to generate DataFrames from the Avro-formatted raw layer and writes them to data service layer internal tables in ORC format.
  • Experienced in managing and reviewing Hadoop log files.
  • Worked with Oracle Cloud, MS Azure, and AWS.
  • Experienced in building a data warehouse on the Azure platform using Azure Databricks and Data Factory.
  • Performed data manipulation on extracted data using Python Pandas.
  • Work with subject matter experts and project team to identify, define, collate, document and communicate the data migration requirements.
  • Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and Scala for data cleaning and pre-processing.
  • Experience in writing SQOOP Scripts for importing and exporting data from RDBMS to HDFS.
  • Experience in using Zookeeper and Oozie operational services to coordinate clusters and scheduling workflows
  • Extensive usage of Azure Portal, Azure PowerShell, Storage Accounts, Certificates and Azure Data Management.
  • Designed and implemented Sqoop for the incremental job to read data from DB2 and load it to Hive tables, and connected to Tableau for generating interactive reports using HiveServer2.
  • Prepared data migration plans including migration risk, milestones, quality, and business sign-off details.
  • Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
  • Designed and Developed Real Time Data Ingestion frameworks to fetch data from Kafka to Hadoop.
  • Developed Airflow DAGs in python by importing the Airflow libraries.
  • Utilized Airflow to schedule, automatically trigger, and execute the data ingestion pipeline.
  • Inserted data from multiple CSV files into MySQL, SQL Server, and PostgreSQL using Spark.
  • Utilized the clinical data to generate features to describe the different illnesses by using LDA Topic Modelling.
  • Hands-on use of Spark and Scala API's to compare the performance of Spark with Hive and SQL, and Spark SQL to manipulate Data Frames in .
  • Involved in the design and deployment of a Hadoop cluster and different big data analytic tools including Pig, Hive, HBase, Oozie, ZooKeeper, Sqoop, Flume, Spark, Impala, and Cassandra with the Hortonworks distribution.
  • Hands-on experience with Ab Initio ETL, data mapping, transformation, and loading in a complex, high-volume environment.
  • Used Kafka functionalities like distribution, partition, replicated commit log service for messaging systems by maintaining feeds.
  • Created a data service layer of internal tables in Hive for data manipulation and organization.
  • Developed Oozie coordinators to schedule Pig and Hive scripts to create Data pipelines.
  • Developed custom Pig UDFs in Java and used UDFs from Piggybank for sorting and preparing the data.
  • Worked on migrating Map Reduce programs into Spark transformations using Spark and Scala.
  • Used broadcast variables in spark, effective & efficient Joins, caching, and other capabilities for data processing.
  • Involved in continuous integration of the application using Jenkins.
  • Used Oozie operational services for batch processing and scheduling workflows dynamically.
  • Supported in setting up QA environment and updating configurations for implementing scripts with Pig, Hive and Sqoop.

Environment: Ubuntu, Hadoop, Spark, PySpark, NiFi, Jenkins, Talend, Spark SQL, Spark MLlib, Pig, Python, Tableau, GitHub, AWS EMR/EC2/S3, and OpenCV

Confidential, Reston, VA

Data Engineer

Responsibilities:

  • Developed Spark scripts by using Scala shell commands as per the requirement.
  • Worked on analysing Hadoop cluster using different big data analytic tools including Flume, Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Spark and Kafka.
  • Developed Scala scripts, UDFs using both Data frames/SQL/Data sets and RDD in Spark for Data Aggregation, queries and writing data back into OLTP system through Sqoop.
  • Built Big Data analytical framework for processing healthcare data for medical research using Python, Java, Hadoop, Hive and Pig. Integrated R scripts with Map reduce jobs.
  • Started working with AWS for storage and handling of terabytes of data for customer BI reporting tools.
  • Built data pipelines for CI/CD in Spark that clean the data, apply business logic, and create solutions on the data.
  • Responsible for building and running resilient data pipelines in production; implemented ETL/ELT in StreamSets to load a multi-terabyte enterprise data warehouse.
  • Configured Hadoop tools like Hive, Pig, Zookeeper, Flume, Impala and Sqoop.
  • Experienced with Scala and Spark in improving the performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, pair RDDs, and Spark on YARN.
  • Selecting appropriate AWS services to design and deploy an application based on given requirements.
  • Migrated from SAS to the Hadoop-Spark framework and achieved 10x-50x better performance and speed.
  • Championed JIRA sprints across the products that improved team engagement and productivity.
  • Used Apache Nifi for the movement of data between the layers.
  • Experience in configuring the Zookeeper to coordinate the servers in clusters and to maintain the data consistency which is important for decision making in the process.
  • Developed Map Reduce programs in Java for parsing the raw data and populating staging Tables.
  • Used Delta Lake to store the results of business calculations; the final output of business transformations is stored in Delta Lake.
  • Experience in ingesting data using Sqoop from HDFS to Relational Database Systems (RDBMS)- Oracle, DB2 and SQL Server and from RDBMS to HDFS.
  • Experienced in working with AWS Athena Serverless Query Services.
  • Experience in working from scratch gathering business requirements from users, building road maps, writing user stories, planning, and executing developments, writing documents and working with cross functional agile teams.
  • Migrated HiveQL queries on structured data into Spark SQL to improve performance.
  • As part of an upgrade to the existing data processing systems, converted major logical units of Hive queries to DataFrames in Spark SQL for better performance.
  • Involved in Functional Testing, Integration testing, Regression Testing, Smoke testing and performance Testing. Tested Hadoop, Map Reduce developed in python, pig, Hive.
  • Used electronic medical record (EMR) software, which providers use to manage patient medical records and automate clinical workflows. EHR systems allow providers to create customizable templates for taking notes during patient encounters.
  • Used a Customer Health Record (CHR) repository, a database of patient information collected from various clinical IT systems. A CHR is centralized and allows healthcare providers to quickly access patient information at the point of care. A CHR designed to hold data specifically for analytics is a clinical data warehouse.
  • Developed TDCH scripts for importing and exporting data into S3 and Hive.
  • Worked on the CI/CD pipeline, integrating code changes into the Git repository and building with Jenkins. Utilized Kafka to capture and process near-real-time streaming data.

Environment: AWS Services, S3, EMR, Spark, Oozie, Teradata, Unix, TDCH, Python, PySpark, Scala.

Confidential, Charlotte, NC

Data Engineer

Responsibilities:

  • Worked on development of data ingestion pipelines using ETL tool, Talend & bash scripting with big data technologies including but not limited to Hive, Impala, Spark, Kafka, and Talend.
  • Experience in developing scalable & secure data pipelines for large datasets.
  • Gathered requirements for ingestion of new data sources including life cycle, data quality check, transformations, and metadata enrichment.
  • Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS.
  • Developed data pipeline using Flume, Sqoop, Pig and Java Map Reduce to ingest customer behavioral data into HDFS for analysis.
  • Supported data quality management by implementing proper data quality checks in data pipelines.
  • Built machine learning models to showcase big data capabilities using PySpark and MLlib.
  • Enhanced the data ingestion framework by creating more robust and secure data pipelines.
  • Implemented data streaming capability using Kafka and Talend for multiple data sources.
  • Involved in SQOOP implementation which helps in loading data from various RDBMS sources to Hadoop systems and vice versa.
  • Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).
  • Managed the S3 data lake; responsible for maintaining and handling inbound and outbound data requests through the big data platform.
  • Used Sqoop to transfer data between relational databases and Hadoop.
  • Knowledge of implementing JILs to automate jobs in the production cluster.
  • Worked with SCRUM team in delivering agreed user stories on time for every Sprint.
  • Worked on analyzing and resolving the production job failures in several scenarios.
  • Implemented UNIX scripts to define the use case workflow and to process the data files and automate the jobs.

Environment: Spark, Redshift, Python, HDFS, Hive, Pig, Sqoop, Scala, Kafka, Shell scripting, Linux, Jenkins, Eclipse, Git, Oozie, Talend, Agile Methodology.

Confidential, Minneapolis, MN

Hadoop Developer

Responsibilities:

  • Installed/Configured/Maintained Apache Hadoop clusters for application development and Hadoop tools like Hive, Pig, Zookeeper and Sqoop. Implemented Partitioning, Dynamic Partitions, Buckets in HIVE.
  • Installed and Configured Sqoop to import and export the data into Hive from Relational databases.
  • Administered large Hadoop environments, including building and supporting cluster setup, performance tuning, and monitoring in an enterprise environment.
  • Closely monitored and analyzed MapReduce job executions on the cluster at the task level and optimized Hadoop cluster components to achieve high performance.
  • Used Python & SAS to extract, transform & load source data from transaction systems, and generated reports, insights, and key conclusions.
  • Designed and Developed data mapping procedures ETL-Data Extraction, Data Analysis and Loading process for integrating data using R programming.
  • Monitored the functioning of the Hadoop cluster through MCS and worked on NoSQL databases including HBase.
  • Used Hive and created Hive tables and involved in data loading and writing Hive UDFs and worked with Linux server admin team in administering the server hardware and operating system.
  • Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems as well as RDBMS and NoSQL data stores for data access and analysis.
  • Configured Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS.

Environment: Hadoop YARN, Spark, Spark Streaming, Spark SQL, Scala, Pig, Python, Hive, Sqoop, MapReduce, NoSQL, HBase, Tableau, Oracle, Linux
