
Big Data Hadoop Developer Resume


St. Louis, Missouri

SUMMARY

  • Over 10 years of IT experience as a Developer, Designer & QA Engineer with cross-platform integration experience using the Hadoop ecosystem, Java, and functional automation
  • Hands-on experience in installing, configuring, and architecting Hadoop and Hortonworks clusters and services - HDFS, MapReduce, YARN, Pig, Hive, Oozie, Flume, HBase, Spark, and Sqoop
  • Responsible for writing MapReduce programs
  • Experienced in loading data into Hive partitions, creating buckets in Hive, and developing MapReduce jobs to automate data transfer from HBase
  • Experienced in developing Java UDFs for Hive and Pig
  • Experienced in NoSQL DBs like HBase, MongoDB and Cassandra and wrote advanced queries and sub-queries
  • Scheduled all Hadoop/Hive/Sqoop/HBase jobs using Oozie
  • Set up clusters in Amazon EC2 and S3, including automating the setup and extension of clusters in AWS
  • Practiced Agile Scrum methodology, contributed to TDD, CI/CD and all aspects of the SDLC
  • Experienced in defining detailed application software test plans, including organization, participants, schedule, and test and application coverage scope
  • Gathered and defined functional and UI requirements for software applications
  • Experienced in real time analytics with Apache Spark RDD, Data Frames and Streaming API
  • Used Spark Data Frames API over Cloudera platform to perform analytics on Hive data
  • Experienced in integrating Hadoop with Kafka and in uploading clickstream data to HDFS.
  • Expert in utilizing Kafka as a messaging and publish-subscribe system.
  • Experienced with Docker and Kubernetes on multiple cloud providers, from helping developers build and containerize their applications (CI/CD) to deploying them on public or private clouds.
  • Installed and configured the OpenShift platform for managing Docker containers and Kubernetes clusters.
  • Practiced DevOps for microservices using Kubernetes as the orchestrator.
  • Created templates and wrote Bash, Ruby, Python, and PowerShell scripts to automate tasks.
  • Good knowledge of and hands-on experience with monitoring tools like Splunk and Nagios.
  • Knowledge of protocols and services such as FTP, SSH, HTTP, HTTPS, TCP/IP, DNS, VPNs, and firewall groups.
  • Completed application builds for web applications, web services, Windows services, console applications, and client GUI applications.
  • Experienced in troubleshooting and automated deployment to web and application servers like WebSphere, WebLogic, JBoss, and Tomcat.
  • Experienced in integrating deployments with multiple build systems and providing an application model that handles multiple projects.
  • Hands-on experience integrating REST APIs with the cloud environment to access resources.
  • Developed PySpark programs, created DataFrames, and worked on transformations.
  • Worked on data processing, transformations, and actions in Spark using Python (PySpark).
  • Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
  • Experienced in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (a brief sketch follows this summary).
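
As an illustration of the Spark SQL work summarized above, the following is a minimal PySpark sketch of an extract-transform-aggregate job; the bucket paths, table layout, and column names (customer_id, event_ts, duration_sec) are hypothetical placeholders rather than details from an actual engagement.

```python
# Minimal PySpark sketch of a Spark-SQL extract/transform/aggregate job.
# All paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("usage-pattern-etl").getOrCreate()

# Extract: read two hypothetical source formats
events = spark.read.json("s3a://example-bucket/raw/usage_events/")          # JSON feed
customers = spark.read.parquet("s3a://example-bucket/curated/customers/")   # Parquet dimension

# Transform: join and aggregate usage per customer per day
daily_usage = (
    events.join(customers, "customer_id")
          .withColumn("event_date", F.to_date("event_ts"))
          .groupBy("customer_id", "event_date")
          .agg(F.count("*").alias("event_count"),
               F.sum("duration_sec").alias("total_duration_sec"))
)

# Load: write the aggregate back out as Parquet for downstream analysis
daily_usage.write.mode("overwrite").parquet("s3a://example-bucket/analytics/daily_usage/")
```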

TECHNICAL SKILLS

Hadoop/Big Data: Hadoop, MapReduce, HDFS, Zookeeper, Kafka, Hive, Pig, Sqoop, Oozie, Flume, YARN, HBase, Spark with Scala

NoSQL Databases: HBase, Cassandra, MongoDB

Languages: Java, Python, Scala, PySpark, UNIX shell scripts

Java/J2EE Technologies: Applets, Swing, JDBC, JNDI, JSON, JSTL

Frameworks: Spring, Hibernate

Operating Systems: Red Hat Linux, Ubuntu Linux and Windows XP/Vista/7/8

Web/Application servers: Apache Tomcat, WebLogic, JBoss

Databases: SQL Server, MySQL

IDE: Eclipse, IntelliJ

PROFESSIONAL EXPERIENCE

BIG DATA HADOOP DEVELOPER

Confidential, St Louis, Missouri

Responsibilities:

  • Preparing Design Documents (Request-Response Mapping Documents, Hive Mapping Documents).
  • Experienced with batch processing of data sources using Apache Spark and Elasticsearch.
  • Experienced in implementing Spark RDD transformations and actions to implement business analysis
  • Migrated HiveQL queries on structured data into Spark SQL to improve performance
  • Documented the data flow from the application through Kafka and Storm to HDFS and Hive tables
  • Configured, deployed, and maintained a single-node Storm cluster in the DEV environment
  • Developed predictive analytics using Apache Spark Scala APIs
  • Developed solutions to pre-process large sets of structured and semi-structured data with different file formats (text, Avro, SequenceFile, XML, JSON, ORC, and Parquet).
  • Handled importing of data from RDBMS into HDFS using Sqoop.
  • Collected JSON data from an HTTP source and developed Spark APIs that perform inserts and updates in Hive tables.
  • Developed Spark scripts to import large files from Amazon S3 buckets.
  • Developed Spark core and Spark SQL scripts using Scala for faster data processing.
  • Experienced in data cleansing processing using Pig Latin operations and UDFs.
  • Experienced in writing Hive Scripts for analyzing data in Hive warehouse using Hive Query Language (HQL).
  • Involved in creating Hive tables, loading with data and writing hive queries to process the data.
  • Load the data into Spark RDD and performed in-memory data computation to generate the output response.
  • Developed Spark code and Spark-SQL/Streaming for faster testing and processing of data.
  • Created scripts to automate the process of Data Ingestion.
  • Experienced in using big data testing frameworks such as MRUnit and PigUnit for testing raw data, and executed performance scripts.
  • Extracted, transformed, and loaded (ETL) data from multiple federated data sources (JSON, relational databases, etc.) with DataFrames in Spark.
  • Integrated visualizations into a Spark application using Databricks and popular visualization libraries (ggplot, matplotlib).
  • Implemented discretization and binning, data wrangling, cleaning, transforming, merging, and reshaping data frames using Python.
  • Wrote an MR2 batch job to fetch required data from the database and store it in a static CSV file.
  • Developed a Spark job to process files from Vision EMS and AMN Cache, identify violations, and send them to Smarts as SNMP traps.
  • Automated workflows using shell scripts to schedule Spark jobs via crontab.
  • Developed data pipeline using Flume, Sqoop to ingest customer behavioral data and purchase histories into HDFS for analysis.
  • Experience in deploying data from various sources into HDFS and building reports using Tableau.
  • Developed a data pipeline using Kafka and Storm to store data into HDFS.
  • Developed REST APIs using Scala and Play framework to retrieve processed data from Cassandra database.
  • Performed real time analysis on the incoming data.
  • Re-engineered n-tiered architecture involving technologies like EJB, XML and JAVA into distributed applications.
  • Loading data into HBase using Bulk Load and Non-bulk load.
  • Created HBase column families to store various data types coming from various sources.
  • Loaded data into the cluster from dynamically generated files
  • Assisted in upgrading, configuration, and maintenance of various Hadoop infrastructures
  • Created common audit and error logging processes and a job monitoring and reporting mechanism
  • Troubleshot performance issues with ETL/SQL tuning.
  • Developed and maintained the continuous integration and deployment systems using Jenkins
  • Created PySpark data frames to bring data from DB2 to Amazon S3 (see the sketch after this list).
  • Optimized the PySpark jobs to run on a Kubernetes cluster for faster data processing
  • Worked on reading and writing multiple data formats like JSON, ORC, and Parquet on HDFS using PySpark.
  • Analyzed the SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Built S3 buckets, managed policies for S3 buckets, and used S3 buckets and Glacier for storage and backup on AWS.
  • Worked with other teams to help develop the Puppet infrastructure to conform to various requirements.
  • Performed troubleshooting and monitoring of the Linux servers on AWS using Nagios and New Relic
  • Managed and administered AWS services: CLI, EC2, VPC, S3, ELB, Glacier, Route 53, CloudTrail, IAM, and Trusted Advisor.
  • Created automated pipelines in AWS CodePipeline to deploy Docker containers in AWS ECS using services like CloudFormation, CodeBuild, CodeDeploy, S3, and Puppet.
  • Worked on JIRA for defect/issue logging and tracking and documented all work using Confluence.
  • Integrated services like GitHub, AWS CodePipeline, Jenkins, and AWS Elastic Beanstalk to create a deployment pipeline.
  • Strong exposure to automating maintenance tasks in the big data environment through the Cloudera Manager API.
  • Good knowledge of Oracle 9i, 10g, and 11g databases and excellent at writing SQL queries and scripts.
  • Worked on setting up and configuring AWS EMR clusters and used Amazon IAM to grant users fine-grained access to AWS resources
  • Experienced in building S3 buckets, managing policies for S3 buckets, and using S3 buckets and Glacier for storage and backup on AWS.
  • Able to lead a team of developers and coordinate smooth delivery of the project.
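
The DB2-to-S3 transfer called out in the list above can be sketched roughly as below; the JDBC URL, credentials, table, partition column, and bucket names are placeholders, and the IBM DB2 JDBC driver is assumed to be available on the Spark classpath.

```python
# Hedged sketch of a DB2 -> S3 PySpark transfer; all connection details are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("db2-to-s3").getOrCreate()

orders = (
    spark.read.format("jdbc")
         .option("url", "jdbc:db2://db2-host:50000/SALESDB")   # placeholder host/database
         .option("driver", "com.ibm.db2.jcc.DB2Driver")        # DB2 driver assumed on classpath
         .option("dbtable", "SALES.ORDERS")                    # placeholder table
         .option("user", "etl_user")
         .option("password", "********")
         .load()
)

# Land the extract on S3 as Parquet, partitioned by a hypothetical date column
(orders.write.mode("append")
       .partitionBy("ORDER_DATE")
       .parquet("s3a://example-bucket/landing/orders/"))
```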

Environment: HDFS, Apache Spark, Kafka, Cassandra, Hive, Scala, Java, Sqoop, shell scripting, AWS

BIG DATA HADOOP DEVELOPER

Confidential, Stamford, CT

Responsibilities:

  • Administered, provisioned, patched, and maintained Cloudera Hadoop clusters on Linux
  • Designed and developed applications on the data lake to transform the data according to business users' requirements for analytics.
  • Developed shell scripts to perform data quality validations such as record counts, file name consistency, and duplicate files, and for creating tables and views.
  • Created views that mask the PHI columns of a table so that PHI data in the view cannot be seen by unauthorized teams.
  • Worked with the Parquet file format to get better storage and performance for publish tables.
  • Developed analytical components using Scala, Spark, Apache Mesos, and Spark Streaming.
  • Experience in using the Docker container system with the Kubernetes integration
  • Developed a Web Application using Java with the Google Web Toolkit API with PostgreSQL
  • Used R for prototype on a sample data exploration to identify the best algorithmic approach and then wrote Scala scripts using spark machine learning module.
  • Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
  • Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, Caffe, TensorFlow, MLlib, Python, and a broad variety of machine learning methods including classification, regression, dimensionality reduction, etc.
  • Excellent understanding/knowledge of Hadoop architecture and various components such as HDFS, HBase, Hive, Job Tracker, Task Tracker, Name Node, Data Node, and the MapReduce programming paradigm
  • Built a Kafka-Spark-Cassandra Scala simulator and prototypes for Met stream, a big data consultancy.
  • Designed a data analysis pipeline in Python, using Amazon Web Services such as S3, EC2 and Elastic Map Reduce
  • Expert in implementing advanced procedures like text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
  • The system is a Python- and Scala-based analytics platform with ML libraries.
  • Worked with NoSQL platforms and gained an extensive understanding of relational databases versus NoSQL platforms; created and worked on large data frames with schemas of more than 300 columns.
  • Ingestion of data into Amazon S3 using Sqoop and apply data transformations using Python scripts.
  • Creating Hive tables, loading and analyzing data using hive scripts. Implemented Partitioning, Dynamic Partitions in HIVE.
  • Deployed and analyzed large chunks of data using HIVE as well as HBase.
  • Worked on querying data using SparkSQL on top of pyspark engine.
  • Used Amazon EMR to perform the Pyspark Jobs on the Cloud.
  • Created Hive tables to store various data formats of PII data coming from the raw hive tables.
  • Developed Sqoop jobs to import/export data from RDBMS to S3 data store.
  • Designed and implemented PySpark UDFs for evaluating, filtering, loading, and storing data (a brief sketch follows this list).
  • Fine-tuned PySpark applications/jobs to improve efficiency and overall processing time for the pipelines.
  • Knowledge of writing Hive queries and running scripts in Tez mode to improve performance on the Hortonworks Data Platform.
  • Used a microservices architecture, with Spring Boot based services interacting through REST.
  • Built Spring Boot microservices for the delivery of software products across the enterprise
  • Created the ALB, ELBs and EC2 instances to deploy the applications into cloud environment.
  • Providing service discovery for all microservices using Spring Cloud Kubernetes project
  • Developed fully functional responsive modules based on Business Requirements using Scala.
  • Worked in building servers like DHCP, PXE with kick-start, DNS and NFS and used them in building infrastructure in a Linux Environment. Automated build, testing and integration with Ant, Maven and JUnit.
  • Developed Spark scripts using Python on Azure HDInsight for Data Aggregation, Validation and verified its performance over MR jobs.
  • Built pipelines to move hashed and un-hashed data from Azure Blob to the Data Lake.
  • Utilized Azure HDInsight to monitor and manage the Hadoop Cluster.
  • Collaborated on insights with Data Scientists, Business Analysts and Partners.
  • Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark with Python.
  • Created pipelines to move data from on-premises servers to Azure Data Lake
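
The PySpark UDF work referenced above might look roughly like the following sketch, which combines a simple masking UDF with the PHI-masked view idea; the table, column names, and masking rule are assumptions for illustration only.

```python
# Illustrative-only PySpark UDF sketch; tables, columns, and masking logic are placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("phi-masking").enableHiveSupport().getOrCreate()

def mask_phi(value):
    """Keep only the last four characters of a PHI value; mask the rest."""
    if value is None:
        return None
    return "*" * max(len(value) - 4, 0) + value[-4:]

mask_phi_udf = F.udf(mask_phi, StringType())   # register the Python function as a Spark UDF

# Hypothetical publish table with PHI columns
patients = spark.table("curated.patients")
masked = (patients
          .withColumn("ssn", mask_phi_udf("ssn"))
          .withColumn("phone", mask_phi_udf("phone")))

# Expose only the masked result to downstream consumers
masked.createOrReplaceTempView("patients_masked_v")
```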

Environment: Apache Hive, HBase, Spark, Azure, Python, Agile, StreamSets, Bitbucket, Cloudera, Shell Scripting, Amazon EMR, Amazon S3, PyCharm, Jenkins, Scala, Java.

BIG DATA HADOOP DEVELOPER/ARCHITECT

Confidential, New York, NY

Responsibilities:

  • Administered, provisioned, patched, and maintained Cloudera Hadoop clusters on Linux
  • Experienced in Spark Streaming, creating RDDs, and applying transformations and actions
  • Developed Spark applications using Scala for easy Hadoop transitions.
  • Used Spark and Spark-SQL to read the parquet data and create the tables in hive using the Scala API
  • Developed Spark code using Scala and Spark-SQL for faster processing and testing.
  • Implemented Spark programs using PySpark and analyzed the SQL scripts and designed the solutions
  • Worked directly with the Big Data Architecture Team which created the foundation of this Enterprise Analytics initiative in a Hadoop-based Data Lake.
  • Created multi-node Hadoop and Spark clusters in AWS instances to generate terabytes of data and stored it in AWS HDFS.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS.
  • Developed data pipeline using Flume, Sqoop, Pig and Java map reduce to ingest customer behavioral data and financial histories into HDFS for analysis.
  • Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
  • Upgraded the Hadoop cluster from CDH4.7 to CDH5.2 and worked on installing the cluster, commissioning and decommissioning of Data Nodes, Name Node recovery, capacity planning, and slots configuration.
  • Developed Spark scripts to import large files from Amazon S3 buckets and imported the data from different sources like HDFS/HBase into SparkRDD.
  • Involved in migration of ETL processes from Oracle to Hive to test the easy data manipulation and worked on importing and exporting data from Oracle and DB2 into HDFS and HIVE using Sqoop.
  • Worked on installing Cloudera Manager and CDH, installed the JCE policy file, created a Kerberos principal for the Cloudera Manager Server, and enabled Kerberos using the wizard.
  • Developed Spark jobs using Scala and Python on top of Yarn/MRv2 for interactive and Batch Analysis.
  • Monitored the cluster for performance, networking, and data integrity issues, and was responsible for troubleshooting issues in the execution of MapReduce jobs by inspecting and reviewing log files.
  • Created 25+ Linux Bash scripts for users, groups, data distribution, capacity planning, and system monitoring.
  • Installed the OS and administered the Hadoop stack with the CDH5 (with YARN) Cloudera distribution, including configuration management, monitoring, debugging, and performance tuning.
  • Supported MapReduce Programs and distributed applications running on the Hadoop cluster and scripting Hadoop package installation and configuration to support fully automated deployments.
  • Migrated existing on-premises application to AWS and used AWS services like EC2 and S3 for large data sets processing and storage and worked with ELASTIC MAPREDUCE and setup Hadoop environment in AWS EC2 Instances.
  • Worked with systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters.
  • Perform maintenance, monitoring, deployments, and upgrades across infrastructure that supports all our Hadoop clusters and worked on Hive for further analysis and for generating transforming files from different analytical formats to text files.
  • Created Hive external tables, loaded data into them, and queried the data using HQL (see the sketch after this list); worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
  • Worked and learned a great deal from AWS Cloud services like EC2, S3, EBS, RDS and VPC.
  • Monitored the Hadoop cluster and maintained it by adding and removing nodes, using tools such as Nagios, Ganglia, and Cloudera Manager.
  • Worked on Hive to expose data for further analysis and to generate and transform files from different analytical formats to text files.
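
A minimal sketch of the Hive external table pattern mentioned above, driven through spark.sql; the database, schema, HDFS location, and partition value are placeholders.

```python
# Sketch of creating and querying a Hive external table; names and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-external-table").enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.web_logs (
        user_id  STRING,
        url      STRING,
        event_ts TIMESTAMP
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/landing/web_logs'
""")

# Register a newly landed partition, then query it with ordinary HQL
spark.sql("ALTER TABLE analytics.web_logs ADD IF NOT EXISTS PARTITION (event_date='2016-01-01')")
spark.sql("""
    SELECT event_date, COUNT(*) AS hits
    FROM analytics.web_logs
    GROUP BY event_date
""").show()
```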

Environment: Hadoop, MapReduce, Hive, Pig, Sqoop, Python, Spark, Spark Streaming, Spark SQL, AWS EMR, AWS S3, AWS Redshift, Scala, PySpark, MapR, Java, Oozie, Flume, HBase, Nagios, Ganglia, Hue, Cloudera Manager, Zookeeper, Cloudera, Oracle, Kerberos and RedHat 6.5

BIG DATA HADOOP DEVELOPER/ARCHITECT

Confidential, New York, NY

Responsibilities:

  • Administered, provisioned, patched, and maintained Cloudera Hadoop clusters on Linux
  • Experienced in Spark Streaming, creating RDDs, and applying transformations and actions
  • Developed Spark applications using Scala for easy Hadoop transitions.
  • Used Spark and Spark-SQL to read the parquet data and create the tables in hive using the Scala API
  • Developed Spark code using Scala and Spark-SQL for faster processing and testing.
  • Implemented Spark programs using PySpark and analyzed the SQL scripts and designed the solutions
  • Loaded data pipelines from web servers and Teradata using Sqoop with Kafka and Spark Streaming API.
  • Developed Kafka pub-sub and Cassandra clients, along with Spark components on HDFS and Hive
  • Populated HDFS and HBase with huge amounts of data using Apache Kafka.
  • Experienced with different scripting languages such as Python and shell scripts.
  • Developed various Python scripts to find vulnerabilities with SQL Queries by doing SQL injection, permission checks and performance analysis.
  • Tested Apache TEZ, an extensible framework for building high performance batch and interactive data processing applications, on Pig and Hive jobs.
  • Designed and implemented Incremental Imports into Hive tables and writing Hive queries to run on TEZ
  • Wrote shell scripts that run multiple Hive jobs, automating incremental loads of different Hive tables that are used to generate reports in Tableau for business use.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala, and Python.
  • Worked on Spark SQL, created Data frames by loading data from Hive tables and created prep data and stored in AWS S3.
  • Used Spark Streaming APIs to perform on-the-fly transformations and actions to build the common learner data model, which receives data from Kafka in near real time and persists it into Cassandra (a brief sketch follows this list).
  • Involved in creating custom UDFs for Pig and Hive to incorporate Python methods and functionality into Pig Latin and HQL (HiveQL)
  • Extensively worked on Text, ORC, Avro and Parquet file formats and compression techniques like Snappy, Gzip and Zlib.
  • Experience in NoSQL Column-Oriented Databases like Cassandra and its Integration with Hadoop cluster.
  • Created, managed, and utilized policies for S3 buckets and Glacier for storage and backup on AWS.
  • Wrote ETL jobs to read from web APIs using REST and HTTP calls and loaded into HDFS using Java and Talend.
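
The Kafka-to-Cassandra streaming flow described above can be sketched as follows. The original work used the older DStream-based Spark Streaming API; this version is written against Structured Streaming and assumes the DataStax Spark Cassandra Connector is on the classpath. Brokers, topic, schema, keyspace, and table names are all placeholders.

```python
# Hedged sketch: Kafka -> Structured Streaming -> Cassandra via foreachBatch.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("learner-events").getOrCreate()

# Hypothetical schema for learner events arriving as JSON on Kafka
event_schema = StructType([
    StructField("learner_id", StringType()),
    StructField("course_id", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
            .option("subscribe", "learner-events")               # placeholder topic
            .load())

events = (raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
             .select("e.*"))

def write_to_cassandra(batch_df, batch_id):
    # Persist each micro-batch into a placeholder Cassandra keyspace/table
    (batch_df.write.format("org.apache.spark.sql.cassandra")
             .options(keyspace="learning", table="learner_events")
             .mode("append")
             .save())

query = (events.writeStream
               .foreachBatch(write_to_cassandra)
               .option("checkpointLocation", "hdfs:///checkpoints/learner-events")  # placeholder path
               .start())
query.awaitTermination()
```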

Environment: Hadoop 2.0, YARN Resource Manager, SQL, Python, Kafka, Hive, Sqoop 1.4.6, Qlik Sense, Tableau, Oozie, Jenkins, Linux, Scala 2.12, Spark 2.4.3.

BIG DATA HADOOP DEVELOPER

Confidential, New York, NY

Responsibilities:

  • Worked on installing Kafka on Virtual Machine and created topics for different users
  • Installed Zookeepers, brokers, schema registry, control Center on multiple machines.
  • Developed SSL security layers, set up ACL/SSL security for users, and assigned multiple topics
  • Worked on Hadoop cluster and data querying tools Hive to store and retrieve data.
  • Involved in the complete Software Development Life Cycle (SDLC) while developing applications.
  • Reviewed and managed Hadoop log files by consolidating logs from multiple machines using Flume.
  • Developed Oozie workflow for scheduling ETL process and Hive Scripts.
  • Started using Apache NiFi to copy data from the local file system to HDFS.
  • Worked with teams to analyze anomaly detection and data ratings.
  • Implemented custom input format and record reader to read XML input efficiently using SAX parser.
  • Involved in writing queries in SparkSQL using Scala. Worked with Splunk to analyze and visualize data.
  • Integrated Cassandra to provide metadata resolution for network entities on the network
  • Experienced in Spark RDD operations and optimized transformations and actions
  • Involved in working with Impala for data retrieval process.
  • Exported data from Impala to Tableau reporting tool, created dashboards on live connection.
  • Designed multiple Python packages that were used within a large ETL process used to load 2TB of data from an existing Oracle database into a new PostgreSQL cluster
  • Loaded data from Linux file system to HDFS and vice-versa
  • Developed UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation queries and wrote the results back into OLTP systems through Sqoop (see the sketch after this list).
  • Leveraged ETL methods for ETL solutions and data warehouse tools for reporting and analysis
  • Used CSVExcelStorage to parse files with different delimiters in Pig.
  • Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala.
  • Developed multiple MapReduce jobs in java to clean datasets.
  • Developed code to write canonical model JSON records from numerous input sources to Kafka queues.
  • Performed streaming of data into Apache ignite by setting up cache for efficient data analysis.
  • Collected the logs data from web servers and integrated in to HDFS using Flume.
  • Developed UNIX shell scripts for creating the reports from Hive data.
  • Prepared Avro schema files for generating Hive tables, created the Hive tables, loaded data into them, and queried the data using HQL
  • Installed and Configured Hadoop cluster using AWS for POC purposes.
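
The aggregation-and-write-back pattern noted above can be sketched like this. In the project the export step went through Sqoop; to keep the example self-contained it writes back over Spark's JDBC writer instead, and the Hive table, PostgreSQL URL, and columns are placeholders.

```python
# Rough sketch: aggregate a Hive-landed table and push results to an OLTP store over JDBC.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("aggregate-writeback").enableHiveSupport().getOrCreate()

# Aggregate raw transactions previously landed in a (placeholder) Hive table
daily_totals = (spark.table("staging.transactions")
                     .groupBy("account_id", F.to_date("txn_ts").alias("txn_date"))
                     .agg(F.sum("amount").alias("daily_amount"),
                          F.count("*").alias("txn_count")))

# Write the aggregates back to a (placeholder) OLTP PostgreSQL database
(daily_totals.write.format("jdbc")
             .option("url", "jdbc:postgresql://oltp-host:5432/reporting")
             .option("driver", "org.postgresql.Driver")
             .option("dbtable", "daily_account_totals")
             .option("user", "report_user")
             .option("password", "********")
             .mode("append")
             .save())
```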

Environment: Hadoop MapReduce 2 (YARN), NiFi, HDFS, PIG, Hive, Flume, Cassandra, Eclipse, Sqoop, Spark, Splunk, Maven, Cloudera, Linux shell scripting

BIG DATA HADOOP DEVELOPER

Confidential, New York, NY

Responsibilities:

  • Developed NiFi workflows to automate the data movement between different Hadoop systems.
  • Configured, deployed, and maintained multi-node Dev and Test Kafka clusters.
  • Developed Spark scripts by using Scala shell commands as per the requirement.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark 1.6 for data aggregation, queries, and writing data back into the OLTP system through Sqoop.
  • Implemented Spark using Scala and SparkSQL for faster testing and processing of data.
  • Imported large datasets from DB2 to Hive Table using Sqoop
  • Implemented Apache PIG scripts to load data from and to store data into Hive.
  • Partitioned and bucketed Hive tables and compressed data with Snappy to load data into Parquet Hive tables from Avro Hive tables (a brief sketch follows this list)
  • Involved in running all the Hive scripts through Hive, Impala, and Hive on Spark, and some through Spark SQL
  • Worked and learned a great deal from AWS Cloud services like EC2, S3, EBS, RDS and VPC.
  • Responsible for implementing ETL process through Kafka-Spark-HBase Integration as per the requirements of customer facing API
  • Worked on Batch processing and real-time data processing on Spark Streaming using Lambda architecture
  • Developed and maintained workflow scheduling jobs in Oozie for importing data from RDBMS to Hive
  • Utilized Spark Core, Spark Streaming and Spark SQL API for faster processing of data instead of using MapReduce in Java
  • Responsible for data extraction and data integration from different data sources into Hadoop Data Lake by creating ETL pipelines Using Spark, MapReduce, Pig, and Hive.
  • Fetched live stream data from DB2 to HBase table using Spark Streaming and Apache Kafka.
  • Loaded the data into Spark RDDs and performed in-memory computation to generate the output response.
  • Used Spark for interactive queries, processing of streaming data and integration with MongoDB
  • Wrote different pig scripts to clean up the ingested data and created partitions for the daily data.
  • Developed Spark programs with Scala to process the complex unstructured and structured data sets
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala and Python.
  • Analyzed the SQL scripts and designed the solution to implement using Spark.
  • Involved in converting MapReduce programs into Spark transformations using Spark RDD in Scala.
  • Used Oozie workflows to coordinate Pig and Hive scripts
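
The Avro-to-Parquet reload with partitioning and Snappy compression mentioned above might be driven through spark.sql roughly as below (bucketing is omitted here for brevity); database, table, and column names are placeholders.

```python
# Sketch: reload an Avro staging table into a partitioned, Snappy-compressed Parquet Hive table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-to-parquet").enableHiveSupport().getOrCreate()

# Allow dynamic partitioning and request Snappy-compressed Parquet output
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("SET parquet.compression=SNAPPY")

spark.sql("""
    CREATE TABLE IF NOT EXISTS curated.orders_parquet (
        order_id    STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS PARQUET
""")

# Reload the Avro-backed staging table into the compressed, partitioned Parquet table
spark.sql("""
    INSERT OVERWRITE TABLE curated.orders_parquet PARTITION (order_date)
    SELECT order_id, customer_id, amount, order_date
    FROM staging.orders_avro
""")
```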

Environment: Hadoop, HDFS, Pig, Sqoop, Shell Scripting, Ubuntu, Linux Red Hat, Spark, Scala, Hortonworks, Cloudera Manager, Apache Yarn.

BIG DATA HADOOP DEVELOPER

Confidential, New York, NY

Responsibilities:

  • Responsible for implementation, administration and management of Hadoop infrastructures
  • Evaluated Hadoop infrastructure requirements and designed/deployed solutions (high availability, big data clusters); involved in cluster monitoring and troubleshooting Hadoop issues
  • Worked with application teams to install OSs and Hadoop updates, patches, version upgrades as required
  • Helped maintain and troubleshoot UNIX and Linux environment
  • Analyzed and evaluated system security threats and safeguards
  • Developed Pig program for loading and filtering the streaming data into HDFS using Flume.
  • Experienced in handling data from different datasets, join them and preprocess using Pig join operations.
  • Developed HBase data model on top of HDFS data to perform real time analytics using Java API.
  • Developed different kind of custom filters and handled pre-defined filters on HBase data using API.
  • Imported and exported data from Teradata to HDFS and vice-versa.
  • Strong understanding of Hadoop eco system such as HDFS, MapReduce, HBase, Zookeeper, Pig, Hadoop streaming, Sqoop, Oozie and Hive
  • Implemented Secondary sorting to sort reducer output globally in map reduce.
  • Implemented data pipeline by chaining multiple mappers by using Chained Mapper.
  • Created Hive dynamic partitions to load time-series data (a brief sketch follows this list)
  • Handled different types of joins in Hive such as map joins, bucket map joins, and sorted bucket map joins.
  • Created tables, partitions, and buckets and performed analytics using Hive ad-hoc queries.
  • Experienced in importing/exporting data into HDFS/Hive from relational databases and Teradata using Sqoop.
  • Handled continuous streaming data from different sources using Flume and set destination as HDFS.
  • Integrated Spring schedulers with the Oozie client as beans to handle cron jobs.
  • Experience with CDH distribution and Cloudera Manager to manage and monitor Hadoop clusters
  • Actively participated in software development lifecycle (scope, design, implement, deploy, test), including design and code reviews.
  • Involved in story-driven agile development methodology and actively participated in daily scrum meetings.
  • Worked on the Spring framework for multi-threading.
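
A minimal sketch of the Hive dynamic-partition load for time-series data mentioned above. The original work ran these statements in Hive itself; they are wrapped in spark.sql here only to keep the example in Python, and the database, tables, and columns are placeholders.

```python
# Sketch: Hive dynamic-partition insert for time-series data, expressed through spark.sql.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dynamic-partitions").enableHiveSupport().getOrCreate()

# Enable dynamic partitioning so Hive derives the partition from the data itself
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("""
    CREATE TABLE IF NOT EXISTS metrics.sensor_readings (
        sensor_id STRING,
        reading   DOUBLE,
        read_ts   TIMESTAMP
    )
    PARTITIONED BY (read_date STRING)
""")

# The read_date partition for each row is taken from the last column of the SELECT
spark.sql("""
    INSERT INTO TABLE metrics.sensor_readings PARTITION (read_date)
    SELECT sensor_id, reading, read_ts, to_date(read_ts) AS read_date
    FROM staging.sensor_readings_raw
""")
```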

Environment: Hadoop, HDFS, MapReduce, Hive, Pig, Sqoop, RDBMS/DB, flat files, Teradata, MySQL, CSV, Avro data files, Java, J2EE.
