
Sr. Hadoop/Big Data Engineer Resume


Emeryville, CA

PROFESSIONAL SUMMARY:

  • 8+ years of expertise in Hadoop, Big Data analytics, and Linux, including architecture, design, installation, configuration, and management of Apache Hadoop clusters across the MapR, Hortonworks, and Cloudera distributions.
  • Hands-on experience in installing, configuring, monitoring, and using Hadoop components such as Hadoop MapReduce, HDFS, HBase, Hive, Sqoop, Pig, Zookeeper, Oozie, Apache Spark, and Impala.
  • Working experience with large-scale Hadoop environment builds and support, including design, configuration, installation, performance tuning, and monitoring.
  • Working knowledge of monitoring tools and frameworks such as Splunk, InfluxDB, Prometheus, Sysdig, Datadog, AppDynamics, New Relic, and Nagios.
  • Setting up automated monitoring and escalation infrastructure for Hadoop Cluster using Ganglia and Nagios.
  • Standardized Splunk forwarder deployment, configuration, and maintenance across a variety of Linux platforms; also worked with DevOps tools such as Puppet and Git.
  • Hands-on experience configuring Hadoop clusters in professional environments and on Amazon Web Services (AWS) using EC2 instances.
  • Experience with the complete Software Development Life Cycle, including design, development, testing, and implementation of moderately to highly complex systems.
  • Hands-on experience in installation, configuration, support, and management of Hadoop clusters using the Apache, Hortonworks, Cloudera, and MapR distributions.
  • Extensive experience in installing, configuring, and administering Hadoop clusters for major Hadoop distributions such as CDH5 and HDP.
  • Good knowledge of using Apache NiFi to automate data movement between different Hadoop systems.
  • Experience in configuring Ranger and Knox to provide security for Hadoop services (Hive, HBase, HDFS, etc.).
  • Experience in administration of Kafka and Flume streaming using Cloudera Distribution.
  • Developed automated Unix shell scripts for performing RUNSTATS, REORG, REBIND, COPY, LOAD, BACKUP, IMPORT, EXPORT, and other database maintenance activities.
  • Experienced with deployment, maintenance, and troubleshooting of applications on Microsoft Azure cloud infrastructure. Excellent knowledge of NoSQL databases such as HBase and Cassandra.
  • Experience with large-scale Hadoop clusters, handling all Hadoop environment builds, including design, cluster setup, and performance tuning.
  • Experience in setting up HBase replication and MapR-DB replication between two clusters.
  • Implemented release processes such as DevOps and Continuous Delivery methodologies for existing builds and deployments; also experienced with scripting languages such as Python, Perl, and shell.
  • Modified reports and Talend ETL jobs based on the feedback from QA testers and Users in development and staging environments.
  • Deployed Grafana dashboards for monitoring cluster nodes, using Graphite as the data source and collectd as the metric sender (see the sketch after this list).
  • Proficiency with application servers such as WebSphere, WebLogic, JBoss, and Tomcat.
  • Experienced in developing MapReduce programs using Apache Hadoop for working with Big Data.
  • Responsible for designing highly scalable big data clusters to support varied data storage and computation needs across Hadoop, Cassandra, MongoDB, and Elasticsearch.
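As a minimal sketch of the Graphite side of that monitoring setup: pushing a custom metric to Graphite's plaintext listener (default port 2003) so it can be charted in Grafana. The host name, metric path, and value below are illustrative placeholders, not details from any specific cluster.

```python
# Minimal sketch: send one data point to Graphite's plaintext listener so it
# can be charted on a Grafana dashboard. Host and metric path are placeholders.
import socket
import time

GRAPHITE_HOST = "graphite.example.com"   # hypothetical Graphite/carbon host
GRAPHITE_PORT = 2003                     # default carbon plaintext port

def send_metric(path, value, timestamp=None):
    """Send one data point using Graphite's plaintext protocol: 'path value timestamp'."""
    timestamp = int(timestamp or time.time())
    line = "{} {} {}\n".format(path, value, timestamp)
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5) as sock:
        sock.sendall(line.encode("utf-8"))

if __name__ == "__main__":
    # Example: report remaining HDFS capacity for a node (value is a placeholder).
    send_metric("hadoop.cluster1.namenode.capacity_remaining_gb", 1234.5)
```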

TECHNICAL SKILLS:

Big Data Ecosystem: Hadoop, MapReduce, Spark, Storm, HDFS, HBase, Cassandra, MongoDB, Zookeeper, Hive, Pig, Sqoop, Flume, Kafka, Oozie, Logstash and Zeppelin

Operating Systems: Windows, UNIX, LINUX, MAC.

Programming Languages: C++, Java, Scala, Python, Oracle PL/SQL, Ruby

Scripting Languages: JavaScript, Shell Scripting

Web Technologies: HTML, XHTML, XML, CSS, JavaScript, JSON, SOAP, WSDL.

Hadoop Distribution: Hortonworks, Cloudera.

Java/J2EE Technologies: Java, J2EE, JDBC.

Database: Oracle, MS Access, MySQL, SQL, NoSQL (HBase, MongoDB).

IDE/Tools: Eclipse, IntelliJ, SBT, DBeaver, DataGrip, SQL Developer, TOAD

Methodologies: J2EE Design Patterns, Scrum, Agile, Waterfall

ETL Tools: Informatica 8.X/9.X, IBM Data stage

Reporting tools: SAP BusinessObjects 4.X/3.X, Tableau 10.X

PROFESSIONAL EXPERIENCE:

Confidential, Emeryville CA

Sr. Hadoop/Big Data Engineer

Responsibilities:

  • Architected, designed, installed, configured, and managed Apache Hadoop clusters across the MapR, Hortonworks, and Cloudera distributions.
  • Developed data pipelines using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer behavioural data and financial histories into the Hadoop cluster for analysis (see the PySpark sketch after this list).
  • Developed Spark jobs and Hive Jobs to summarize and transform data.
  • Developed Spark scripts for data analysis in Python and wrote Spark applications in Scala and Python (PySpark).
  • Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
  • Built on-premises data pipelines using Kafka and Spark for real time data analysis.
  • Implemented complex Hive UDFs to execute business logic within Hive queries.
  • Set up security using Kerberos and AD on Hortonworks and Cloudera CDH clusters.
  • Installed a Kerberos-secured Kafka cluster (without encryption) for a POC and set up Kafka ACLs.
  • Created a NoSQL solution for a legacy RDBMS using Kafka, Spark, Solr, and the HBase indexer for ingestion, with Solr and HBase for real-time querying.
  • Worked in Administration, Installing, Upgrading and Managing distributions of Hadoop clusters with MapR 5.1 on a cluster of 200+ nodes in different environments such as Development, Test and Production (Operational & Analytics) environments.
  • Created instances in AWS and migrated data from the data centre to AWS using Snowball and AWS migration services.
  • Enabled encryption at rest for data in HDFS using Ranger KMS (TDE).
  • Masked PHI/PII fields using Ranger and Atlas.
  • Worked with Hadoop security tools Kerberos, Ranger, and Knox on HDP 2.x stack and CDH 5.x.
  • Configured an F5 load balancer for highly available components (Ranger, Oozie, Knox, Atlas, NiFi).
  • Used AWS cloud services (VPC, EC2, S3, RDS, Redshift, Data Pipeline, EMR, DynamoDB, WorkSpaces, Lambda, Kinesis, SNS, SQS).
  • Used NoSQL databases including HBase, MongoDB, and Cassandra.
  • Extracted files from Cassandra and MongoDB through Sqoop and placed in HDFS and processed.
  • Loaded files into Hive and HDFS from MongoDB and Solr.
  • Extracted BSON files from MongoDB and placed in HDFS and processed.
  • Created Airflow Scheduling scripts in Python
  • Installed and configured apache airflow for workflow management and created workflows in python.
  • Provisioning and managing multi-node Hadoop Clusters on public cloud environment Amazon Web Services (AWS) - EC2 and on private cloud infrastructure.
  • Involved in a proof of concept for a Hadoop cluster in AWS, using EC2 instances, EBS volumes, and S3 to configure the cluster. Assisted in migrating from on-premises Hadoop services to cloud-based data analytics on AWS, including migrating the on-premises data.
  • Used Apache SOLR for indexing HBase tables and querying the indexes.
  • Troubleshooting issues in the execution of MapReduce jobs by inspecting and reviewing log files.
  • Extensively worked on Elasticsearch querying and indexing to retrieve documents at high speed.
  • Installed, configured, and maintained several Hadoop clusters which includes HDFS, YARN, Hive, HBase, Knox, Kafka, Oozie, Ranger, Atlas, Infra Solr, Zookeeper, and Nifi in Kerberized environments.
  • Deployed a Hadoop cluster using Hortonworks Ambari (HDP 2.2) integrated with SiteScope for monitoring and alerting.
  • Analysed the SQL scripts and designed the solution for implementation using PySpark.
  • Developed RESTful endpoints (Rest Controller) to retrieve data or perform an operation on the back end.
  • Converted MapReduce programs into Spark transformations using Spark RDDs and Scala.
  • Developed UDFs in Java for Hive and Pig; worked on reading multiple data formats on HDFS using Scala.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Developed multiple POCs using Scala and deployed on the Yarn cluster, compared the performance of Spark, with Hive and SQL/Teradata.
  • Implemented Kerberos security in all environments. Defined file system layout and data set permissions.
  • Managed the Hadoop cluster with IBM BigInsights and the Hortonworks Data Platform.
  • Scheduled Oozie workflow engine to run multiple Hive and Pig jobs, which independently run with time.
  • Documented EDL (Enterprise Data Lake) best practices and standards, including data management.
  • Performed regular maintenance, commissioning and decommissioning nodes as disk failures occurred, using the MapR File System.
  • Worked on installing the cluster, commissioning and decommissioning Data Nodes, Name Node recovery, capacity planning, and slots configuration in the MapR Control System (MCS).
  • Used Sqoop to import and export data from HDFS to RDBMS and vice-versa.
  • Worked on setting up high availability for a major production cluster and designed automatic failover control using ZooKeeper and quorum journal nodes.
  • Implemented release processes such as DevOps and Continuous Delivery methodologies for existing builds and deployments.
  • Designed, developed, and provided ongoing support for data warehouse environments.
  • Worked on Oracle Big Data SQL to integrate big data analysis into existing applications.
  • Used Oracle Big Data Appliance for Hadoop and NoSQL processing, integrating data in Hadoop and NoSQL with data in Oracle Database.
  • Worked with different relational database systems such as Oracle (PL/SQL); used Unix shell scripting and Python.
  • Developed applications, which access the database with JDBC to execute queries, prepared statements, and procedures.
  • Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and NiFi for data cleaning and pre-processing.
  • Worked with Cloudera Navigator and Unravel Data for auditing Hadoop access.
  • Performed data blending of Cloudera Impala and Teradata ODBC data sources in Tableau.
  • Used Spark Streaming to divide streaming data into batches as an input to spark engine for batch processing.
  • Handled Sqoop configuration: JDBC drivers for the respective relational databases, controlling parallelism, the distributed cache, and the import process, compression codecs, importing data into Hive and HBase, incremental imports, configuring saved jobs and passwords, the free-form query option, and troubleshooting.
  • Created MapR-DB tables and was involved in loading data into those tables.
  • Collected and aggregated large amounts of streaming data into HDFS using Flume, configuring multiple agents and Flume sources.
  • Extensively worked on ETL mappings, analysis, and documentation of OLAP reports.
  • Responsible for implementation and ongoing administration of MapR 4.0.1 infrastructure.
  • Maintained operations, installation, and configuration of a 150+ node cluster with the MapR distribution.
  • Worked on Linux systems administration on production and development servers (Red Hat Linux, and UNIX utilities).
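As a rough illustration of the ingest pipelines described above (Spark feeding Hive for analysis), here is a minimal PySpark sketch. The landing path, column names, and Hive table name are placeholder assumptions, not details from the actual project.

```python
# Minimal PySpark sketch of an ingest/cleanse/load job into a partitioned Hive
# table. Paths, columns, and table names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("customer-behaviour-ingest")
         .enableHiveSupport()          # needed to write Hive-managed tables
         .getOrCreate())

# Ingest raw CSV landed on HDFS (schema inferred here for brevity).
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("hdfs:///landing/customer_events/"))   # hypothetical landing path

# Basic cleansing and standardisation before analysis.
clean = (raw.dropDuplicates(["event_id"])
            .filter(F.col("customer_id").isNotNull())
            .withColumn("event_date", F.to_date("event_ts")))

# Persist into a partitioned Hive table for downstream Hive/Impala queries.
(clean.write
      .mode("append")
      .partitionBy("event_date")
      .format("parquet")
      .saveAsTable("analytics.customer_events"))
```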

Confidential, San Francisco CA

Sr. Hadoop/Big Data Engineer

Responsibilities:

  • Architected, designed, installed, configured, and managed Apache Hadoop clusters across the MapR, Hortonworks, and Cloudera distributions.
  • Developed ETL data pipelines using Spark, Spark Streaming, and Scala.
  • Built real-time data pipelines with modularized code and dynamic configurations to handle multiple data transfer requests without changes to production jobs.
  • Loaded data from RDBMS to Hadoop using Sqoop
  • Worked collaboratively to manage build outs of large data clusters and real time streaming with Spark.
  • Responsible for loading data pipelines from web servers using Sqoop, Kafka, and the Spark Streaming API (see the streaming sketch after this list).
  • Developed Kafka producers, partitioned topics across brokers, and set up consumer groups.
  • Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.
  • Developed the batch scripts to fetch the data from AWS S3 storage and do required transformations in Scala using Spark framework.
  • Implemented Spark using Scala and SparkSQL for faster testing and processing of data.
  • Processed data using MapReduce and YARN; worked on Kafka as a proof of concept for log processing.
  • Tested Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, on Pig and Hive jobs.
  • Monitored the Hive metastore and the cluster nodes with the help of Hue.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
  • Created AWS EC2 instances and used JIT servers.
  • Developed various UDFs in Map-Reduce and Python for Pig and Hive.
  • Handled data integrity checks using Hive queries, Hadoop, and Spark.
  • Worked on performing transformations & actions on RDDs and Spark Streaming data with Scala.
  • Implemented the Machine learning algorithms using Spark with Python.
  • Defined job flows and developed simple to complex Map Reduce jobs as per the requirement.
  • Optimized Map/Reduce Jobs to use HDFS efficiently by using various compression mechanisms
  • Developed PIG UDFs for manipulating the data according to Business Requirements and worked on developing custom PIG Loaders.
  • Responsible for handling streaming data from web server console logs.
  • Installed Oozie workflow engine to run multiple Hive and Pig Jobs.
  • Developed PIG Latin scripts for the analysis of semi structured data.
  • Used Hive and created Hive tables and involved in data loading and writing Hive UDFs.
  • Used Sqoop to import data into HDFS and Hive from other data systems.
  • Installed and configured Apache Hadoop to test the maintenance of log files in Hadoop cluster.
  • Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
  • Worked on developing ETL processes (Data Stage Open Studio) to load data from multiple data sources into HDFS using Flume and Sqoop, and performed structural modifications using MapReduce and Hive.
  • Involved in NoSQL database design, integration and implementation
  • Loaded data into NoSQL database HBase.
  • Developed Kafka producer and consumers, HBase clients, Spark and Hadoop MapReduce jobs with components on HDFS, Hive.
  • Very good understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
  • Developed workflows for the complete end-to-end ETL process: getting data into HDFS, validating it and applying business logic, storing clean data in Hive external tables, exporting data from Hive to RDBMS sources for reporting, and escalating data quality issues.
  • Handled importing of data from various data sources performed transformations using Spark and loaded data into hive.
  • Involved in performance tuning of Hive (ORC table) for design, storage, and query perspectives.
  • Developed and deployed using Hortonworks HDP 2.3.0 in production and HDP 2.6.0 in the development environment.
  • Worked extensively with Sqoop for importing and exporting the data from HDFS to Relational Database systems
  • Worked in developing Pig scripts to create the relationship between the data present in the Hadoop cluster.
  • Performed advanced procedures such as text analytics and processing using the in-memory computing capabilities of Spark with Python and Scala.
  • Worked on implementing data lake and responsible for data management in Data Lake.
  • Developed Ruby Script to map the data to the production environment.
  • Experience in analyzing data using Hive, HBase and custom Map Reduce program.
  • Developed Hive UDFs and Pig UDFs using Python scripts.
  • Experienced in working with IBM Data Science tool and responsible for injecting the processed data to IBM Data Science tool.
  • Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
  • Worked on the Oozie workflow engine to run workflow jobs with actions that run Hadoop Map/Reduce and Pig jobs.
  • Responsible for configuring the cluster in IBM Cloud and maintaining the number of nodes as per requirements.
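The streaming ingest described above was built with Scala and the Spark Streaming API; the sketch below shows the same idea in PySpark Structured Streaming for brevity. The broker address, topic name, schema, and output paths are placeholder assumptions.

```python
# Hedged sketch of a Kafka -> Spark streaming ingest; broker, topic, schema,
# and paths are illustrative only.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-stream-ingest").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("action", StringType()),
    StructField("amount", DoubleType()),
])

# Subscribe to a Kafka topic; the value arrives as bytes, so cast and parse JSON.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
          .option("subscribe", "web-events")                   # hypothetical topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Write micro-batches to HDFS as Parquet, with checkpointing for recovery.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///streams/web_events/")
         .option("checkpointLocation", "hdfs:///checkpoints/web_events/")
         .outputMode("append")
         .start())

query.awaitTermination()
```

Note that running this requires the Spark-Kafka connector package (e.g. the spark-sql-kafka artifact matching the Spark version) to be supplied at submit time.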

Confidential, Columbus OH

Sr. Hadoop/Big Data Developer

Responsibilities:

  • Migrated on-premises ETL pipelines running on IBM Netezza to AWS: developed and automated processes to migrate data to AWS S3, ran ETL using Spark on EC2, and delivered data on S3, AWS Athena, and AWS Redshift (see the sketch after this list).
  • Involved in requirements gathering and building a data lake on top of HDFS; worked with the GoCD CI/CD tool to deploy the application and worked within a framework for big data testing.
  • Used the Hortonworks distribution for the Hadoop ecosystem.
  • Created Sqoop jobs for importing data from relational database systems into HDFS and for exporting results back into the databases.
  • Extensively used Pig for data cleansing using Pig scripts and Embedded Pig scripts.
  • Scheduled the Oozie workflow engine to run multiple Hive and Pig jobs.
  • Wrote Python scripts to analyse customer data.
  • Created partitioned tables in Hive, designed a data warehouse using Hive external tables, and wrote Hive queries for analysis.
  • Captured the data logs from web server into HDFS using Flume for analysis.
  • Performed advanced procedures like text analytics and processing using the in-memory computing capabilities of Spark.
  • Implemented Spark RDD transformations to Map business analysis and apply actions on top of transformations.
  • Involved in migrating Map Reduce jobs into Spark jobs and used SparkSQL and Data frames API to load structured data into Spark clusters.
  • Used DataFrames and RDDs for data transformations.
  • Designed and Developed Spark workflows using Scala for data pull from cloud-based systems and applying transformations on it.
  • Used Spark Streaming to consume topics from the distributed messaging source Event Hub and periodically push batches of data to Spark for real-time processing.
  • Tuned Cassandra and MySQL for optimizing the data.
  • Implemented monitoring and established best practices around the usage of Elasticsearch.
  • Used the Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
  • Hands-on experience with Hortonworks tools such as Tez and Ambari.
  • Worked on Apache NiFi as an ETL tool for batch and real-time processing.
  • Fetch and generate monthly reports. Visualization of those reports using Tableau.
  • Developed Tableau visualizations and dashboards using Tableau Desktop.
  • Extracted files from Cassandra through Sqoop and placed in HDFS for further processing.
  • Strong working experience on Cassandra for retrieving data from Cassandra clusters to run queries.
  • Experience in Data modelling using Cassandra.
  • Very good understanding of Cassandra cluster mechanisms, including replication strategies, snitches, gossip, consistent hashing, and consistency levels.
  • Used the DataStax Spark-Cassandra connector to load data into Cassandra and used CQL to analyse data from Cassandra tables for quick searching, sorting, and grouping.
  • Worked with BI (Business Intelligence) teams in generating the reports and designing ETL workflows on Tableau. Deployed data from various sources into HDFS and building reports using Tableau.
  • Extensively created Map-Reduce jobs to power data for search and aggregation.
  • Managed Hadoop jobs as DAGs using the Oozie workflow scheduler.
  • Involved in developing code to write canonical model JSON records from numerous input sources to Kafka Queues.
  • Involved in loading data from Linux file systems, servers, and Java web services using Kafka producers and consumers.
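As an illustration of the S3-to-Spark-to-Athena flow described in the first bullet, here is a hedged PySpark sketch. The bucket names, prefixes, and columns are assumptions for the example only, not details of the Netezza migration itself.

```python
# Illustrative sketch of an S3 -> Spark ETL -> partitioned Parquet on S3 flow
# that Athena (or Redshift Spectrum) can query; names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("netezza-offload-etl").getOrCreate()

# Source extracts previously landed on S3 (e.g. CSV dumps from the warehouse).
orders = (spark.read
          .option("header", "true")
          .csv("s3a://example-landing/orders/"))        # hypothetical bucket

# Typical warehouse-style transformation: type casting and derived columns.
curated = (orders
           .withColumn("order_amount", F.col("order_amount").cast("double"))
           .withColumn("order_date", F.to_date("order_ts"))
           .filter(F.col("order_id").isNotNull()))

# Write partitioned Parquet back to S3; a table defined over this location
# (via the Glue catalog or a DDL statement) makes it queryable from Athena.
(curated.write
        .mode("overwrite")
        .partitionBy("order_date")
        .parquet("s3a://example-curated/orders/"))
```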

Confidential, San Ramon CA

Hadoop/Big Data Developer

Responsibilities:

  • Worked on analysing the Hadoop cluster and different big data analytic tools, including Pig, the HBase database, and Sqoop.
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Implemented nine nodes CDH3 Hadoop cluster on CentOS
  • Implemented Apache Crunch library on top of map reduce and spark for data aggregation.
  • Involved in loading data from LINUX file system to HDFS.
  • Worked on installing cluster, commissioning & decommissioning of data node, name node recovery, capacity planning, and slots configuration.
  • Implemented a script to transmit suspiring information from Oracle to HBase using Sqoop.
  • Implemented best income logic using Pig scripts and UDFs.
  • Exported the analysed data to the relational databases using Sqoop for visualization and to generate reports
  • Applied design patterns and OO design concepts to improve the existing Java/J2EE based code base.
  • Developed JAX-WS web services.
  • Handled Type 1 and Type 2 slowly changing dimensions.
  • Importing and exporting data into HDFS from database and vice versa using Sqoop.
  • Wrote Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying.
  • Involved in the design, implementation and maintenance of Data warehouses
  • Involved in creating Hive tables, loading with data and writing Hive queries.
  • Implemented custom interceptors for Flume to filter data as per requirements.
  • Used Hive and Pig to analyse data in HDFS to identify issues and behavioural patterns.
  • Created internal and external Hive tables and defined static and dynamic partitions for optimized performance (see the sketch after this list).
  • Configured daily workflow for extraction, processing and analysis of data using Oozie Scheduler.
  • Proactively involved in ongoing maintenance, support and improvements in Hadoop cluster.
  • Wrote Pig Latin scripts for running advanced analytics on the data collected.
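To illustrate the external-table and dynamic-partitioning work mentioned above, here is a small sketch that runs HiveQL through Spark SQL so the example is self-contained; the original work used Hive directly. The database, table, staging table, and HDFS location are assumptions.

```python
# Sketch of external-table creation plus a dynamic-partition insert, run here
# through Spark SQL; database, tables, and location are assumed to exist.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning-demo")
         .enableHiveSupport()
         .getOrCreate())

# External table: Hive manages only the metadata; data stays at the HDFS path.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS logs.web_logs (
        user_id STRING,
        url     STRING,
        status  INT
    )
    PARTITIONED BY (log_date STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/external/web_logs'
""")

# Allow dynamic partitioning so partitions are derived from the data itself.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Dynamic-partition insert from a staging table (last column feeds log_date).
spark.sql("""
    INSERT INTO TABLE logs.web_logs PARTITION (log_date)
    SELECT user_id, url, status, log_date
    FROM logs.web_logs_staging
""")
```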

Confidential

Hadoop Developer

Responsibilities:

  • Generated the classes and interfaces from the designed UML sequence diagrams and coded as per those plans along with the team.
  • Provided suggestions on converting to Hadoop using MapReduce, Hive, Sqoop, Flume, and Pig Latin.
  • Experienced in writing Spark applications for data validation, cleansing, transformations, and custom aggregations (see the sketch after this list).
  • Imported data from different sources into Spark RDD for processing.
  • Developed custom aggregate functions using Spark SQL and performed interactive querying.
  • Worked on installing cluster, commissioning & decommissioning of Data node, Name node high availability, capacity planning, and slots configuration.
  • Responsible for managing data coming from different sources.
  • Imported and exported data into HDFS using Flume.
  • Experienced in analysing data with Hive and Pig.
  • Setup and benchmarked Hadoop/HBase clusters for internal use.
  • Set up a Hadoop cluster on Amazon EC2 using Apache Whirr for a POC.
  • Worked on developing applications in Hadoop Big Data Technologies-Pig, Hive, Map-Reduce, Oozie, Flume, and Kafka.
  • Experienced in managing and reviewing Hadoop log files.
  • Helped with Big Data technologies for integration of Hive with HBase and Sqoop with HBase.
  • Analysed data with Hive, Pig and Hadoop Streaming.
  • Involved in transferring legacy relational database tables to HDFS and HBase tables using Sqoop.
  • Involved in cluster coordination services through Zookeeper and in adding new nodes to an existing cluster.
  • Moved the data from traditional databases like MySQL, MS SQL Server and Oracle into Hadoop.
  • Worked on Integrating Talend and SSIS with Hadoop and performed ETL operations.
  • Installed Hive, Pig, Flume, Sqoop and Oozie on the Hadoop cluster.
  • Used Flume to collect, aggregate and push log data from different log servers.
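As a hedged sketch of the Spark validation, cleansing, and custom-aggregation work described above, the following PySpark example splits records into valid and rejected sets, cleans them, and computes a simple aggregate. The input path and column names are illustrative assumptions.

```python
# Hedged sketch of Spark-based validation, cleansing, and aggregation; the
# input path and column names are placeholders, not project specifics.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data-validation").getOrCreate()

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///raw/transactions/"))       # hypothetical input

# Validation: split records into good and bad sets based on simple rules.
valid = df.filter(F.col("txn_id").isNotNull() & (F.col("amount") > 0))
invalid = df.subtract(valid)

# Cleansing: trim strings, drop duplicates, normalise the currency code.
clean = (valid.dropDuplicates(["txn_id"])
              .withColumn("currency", F.upper(F.trim(F.col("currency")))))

# Aggregation: per-currency transaction counts and totals.
summary = (clean.groupBy("currency")
                .agg(F.count("*").alias("txn_count"),
                     F.sum("amount").alias("total_amount")))

summary.show()
print("rejected records:", invalid.count())
```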

Confidential

Big Data Developer

Responsibilities:

  • Provided suggestions on converting to Hadoop using MapReduce, Hive, Sqoop, Flume, and Pig Latin; collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
  • Extracted data from various locations and loaded it into Oracle tables using SQL*Loader.
  • Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, managing and reviewing data backups, and managing and reviewing Hadoop log files.
  • Developed the Pig Latin code for loading, filtering and storing the data.
  • Created, developed, modified, and maintained database objects, PL/SQL packages, functions, stored procedures, triggers, views, and materialized views to extract data from different sources.
  • Extracted data from Teradata into HDFS using Sqoop; analysed the data by performing Hive queries and running Pig scripts to understand user behaviour.
  • Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from an Oracle database into HDFS using Sqoop.
  • Imported and exported data into HDFS and Hive using Sqoop.
  • Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
  • Installed Oozie workflow engine to run multiple Hive and Pig jobs.
  • Developed Hive queries to process the data and generate data cubes for visualization (see the cube sketch after this list).
  • Worked on loading data from MySQL to HBase where necessary using Sqoop.
  • Responsible for building scalable distributed data solutions using Hadoop; worked hands-on with the ETL process.
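The data-cube generation above was done with Hive queries; the sketch below shows the equivalent rollup using Spark's cube() API, which produces the same grouping-set style output as a Hive GROUP BY ... WITH CUBE. The input table and columns are assumptions.

```python
# Sketch of a "data cube" aggregation using Spark's cube() API; the source
# table and column names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("sales-cube")
         .enableHiveSupport()
         .getOrCreate())

sales = spark.table("analytics.sales")          # hypothetical Hive table

# Cube over region and product: subtotals for every combination plus grand
# totals, ready to be exported for visualisation.
cube = (sales.cube("region", "product")
             .agg(F.sum("amount").alias("total_amount"),
                  F.count("*").alias("order_count"))
             .orderBy("region", "product"))

cube.write.mode("overwrite").saveAsTable("analytics.sales_cube")
```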

Confidential

Informatica Developer

Responsibilities:

  • Performed the role of an ETL developer, which included requirement gathering and analysis, specification design, development, testing, and documentation.
  • Worked on Informatica Power Center to develop mappings, sessions and workflows.
  • Implemented Change Data Capture (CDC) and slowly changing dimension Type 2 mappings using MD5 algorithms (see the sketch after this list).
  • Extracted data from sources such as SQL Server, Oracle, flat files, and CSV files and transformed the data according to the requirements.
  • Created mappings using various transformations such as Filter, Expression, Sequence Generator, Update Strategy, Lookup, Router, Joiner, and Aggregator to create robust mappings in the Informatica PowerCenter Designer.
  • Involved in performance tuning for tables with high volumes of data using various partitioning methods in both Oracle and Informatica.
  • Worked on incremental approach to load data into various targets.
  • Worked on complex mapping design using UDT and Java transformations.
  • Was involved in unit testing using the Debugger and wrote various unit test cases.
  • Wrote documentation in Confluence to describe program development, logic, coding, testing, changes, and corrections.
  • Working knowledge of Power Center Visio to create reusable mapping templates.
  • Was involved in mappings automation using Java mapfwk library.
  • Created database tables, including analysis of the data to find the columns needed for creation of CDC KEY, NAT KEY, and PRIM KEY constraints.
  • Good knowledge of Data Vault Architecture.
  • Guided and helped teammates in crises to achieve the project goals.
  • Interacted with the onsite team/client on a daily basis regarding development activities.
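The CDC and SCD Type 2 mappings above were built in Informatica PowerCenter; purely as an illustration of the MD5 change-detection idea behind them, here is a small Python/pandas sketch. The key column, tracked attributes, and expectation that the target already carries an MD5 column are assumptions for the example.

```python
# Illustrative Python sketch of MD5-based change detection for an SCD Type 2
# load -- not the Informatica implementation; column names are assumptions.
import hashlib
import pandas as pd

TRACKED_COLS = ["name", "address", "phone"]   # attributes that trigger a new version

def row_hash(row):
    """MD5 over the tracked attributes, used to detect changed records."""
    payload = "|".join(str(row[c]) for c in TRACKED_COLS)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def detect_changes(source: pd.DataFrame, target: pd.DataFrame) -> pd.DataFrame:
    """Return source rows that are new, or whose tracked attributes changed,
    relative to a target that stores the previously computed MD5 per key."""
    src = source.copy()
    src["md5"] = src.apply(row_hash, axis=1)
    tgt = target.rename(columns={"md5": "md5_existing"})[["customer_id", "md5_existing"]]
    merged = src.merge(tgt, on="customer_id", how="left")
    changed = merged[merged["md5_existing"].isna() |
                     (merged["md5"] != merged["md5_existing"])]
    # Downstream, changed rows would expire the current version and insert a new one.
    return changed.drop(columns=["md5_existing"])
```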
