Hadoop Developer Resume
Bentonville, AR
SUMMARY:
- 7+ years of total IT experience.
- 5+ years of hands-on experience with the Hadoop ecosystem and Big Data components including Apache Spark, Scala, Apache NiFi, Python 3, Pandas, HDFS, YARN, Sqoop, Hive, Pig, MapReduce, DataFrames, Datasets, and Kafka.
- Exported data to Excel using Python Pandas.
- Excellent experience in Oracle SQL, PL/SQL, and data modeling across data mart, database creation, and migration projects.
- Actively worked in data warehouse environments.
- Excellent experience installing, configuring, and using Apache Hadoop ecosystem components such as the Hadoop Distributed File System (HDFS), MapReduce, Hive, HBase, and Sqoop, along with Scala, Git, Maven, and JSON.
- Experience performing in-memory data processing and real-time streaming analytics using Apache Spark with Scala, Java, and Python.
- Created a data-statistics (profiling) tool using Apache Spark 2.0 that read Parquet files and generated output in JSON format; a minimal sketch of this pattern follows the summary.
- Performance-tuned the Spark 2.0 data profiler by adjusting configuration parameters.
- Experience with advanced procedures such as text analytics and processing using the in-memory computing capabilities of Spark, written in Scala.
- Experience with Cloudera distributions (5.6, 5.7, 5.8) and the Hortonworks platform.
- Developed analytical components using Scala, Spark, and Spark Streaming.
- Hands-on experience using Spark RDDs, DataFrames, Datasets, Spark SQL, and Pair RDDs.
- Experience with Structured Streaming in Apache Spark and near-real-time streaming using Kafka.
- Hands-on experience importing and exporting data with Sqoop between HDFS and relational database systems (RDBMS).
- Hands-on experience benchmarking and performance-tuning Hive queries using partitioning, bucketing, and map-side joins.
- Extensive experience working with structured data using HiveQL: join operations, writing custom UDFs, and optimizing Hive queries.
- Worked as Data Modeler for Data Mart Design in the Data Warehousing environment.
- Worked as Data Modeler on database creation for data migration.
- Expertise in working with the Apache Hive data warehouse infrastructure: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries.
- Strong experience analyzing large amounts of data using HiveQL.
- Large-scale batch and stream processing using Apache Spark.
- Handled partitioning of large Hive tables and optimized query performance using Tez, vectorization, and bucketing.
- Hands on experience in setting up workflow using Apache OOZIE workflow engine for managing and scheduling Hadoop jobs. Other job schedulers used such as Autosys, Cron.
- Built real-time data solutions using HBase to handle huge data volumes.
- Expertise in handling file formats such as SequenceFile, RC, ORC, Text/CSV, Avro, and Parquet, analyzed using HiveQL.
- Used Cron job scheduler along with Autosys to schedule and monitor Spark jobs.
- Ingested clickstream log data from multiple sources using Kafka, applying transformations before loading it into HDFS and HBase.
- Wrote application programming interfaces (APIs) using Python 3.0.
- Recursively copied data from S3 buckets.
- Experienced with the Java API and REST to access HBase data.
- Experience in Object-Oriented Analysis and Design (OOAD) and software development using UML methodology; good knowledge of J2EE design patterns.
- Good understanding of XML methodologies (XML, XSL, XSD) including Web Services and SOAP.
- Proficient with application servers such as WebSphere, WebLogic, JBoss, and Tomcat.
- Experience with build tools such as Maven within Eclipse.
- Hands-on experience in UNIX shell scripting.
- Expertise in writing efficient SQL and PL/SQL procedures, functions, packages, triggers, and collections, as well as advanced PL/SQL, dynamic SQL, analytical functions in Oracle, and performance tuning.
- Hands-on experience with Amazon Web Services (AWS): Elastic MapReduce (EMR), S3 storage, EC2 instances, and data warehousing.
- Worked extensively with AWS cloud services such as EC2, S3, and EBS.
- Working experience with Big Data in the cloud using AWS EC2 and Microsoft Azure; handled Redshift and DynamoDB databases holding roughly 3 PB of data.
- Experience in Effort Estimation, Scheduling, Project planning, execution, Management and closure.
- Strong leadership, conflict resolution, communication and facilitation skills.
- Strong process orientation and client interaction capabilities.
- Extensive exposure to all stages of software development, having worked in both Waterfall and Agile models.
- Sound understanding of continuous integration & continuous deployment environments.
- Strong exposure to Data Management, Governance and Controls functions.
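
Below is a minimal Scala sketch of the Parquet-to-JSON statistics pattern mentioned in the summary. The input/output paths and the particular statistics computed are illustrative assumptions, not details from the original tool.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Minimal sketch of a Parquet -> JSON column-statistics job on Spark 2.x.
// Paths and the chosen statistics are placeholders, not project specifics.
object DataProfilerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("data-profiler-sketch").getOrCreate()

    // Read the source Parquet data set.
    val df = spark.read.parquet("hdfs:///data/input/events.parquet")

    // Per-column statistics: total rows, distinct values, and null count.
    val stats = df.columns.map { c =>
      df.agg(
        count(lit(1)).as("rows"),
        countDistinct(col(c)).as("distinct_values"),
        sum(when(col(c).isNull, 1).otherwise(0)).as("null_count")
      ).withColumn("column", lit(c))
    }.reduce(_ union _)

    // Emit the profile as JSON, one record per column.
    stats.coalesce(1).write.mode("overwrite").json("hdfs:///data/output/profile_json")
    spark.stop()
  }
}
```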
TECHNICAL SKILLS:
Technology: Apache Spark, Hadoop Ecosystem, Oracle, Teradata, Visual Basic 6.0
Programming Languages: Oracle PL/SQL, DB2 SQL, Scala, Python, Java, C, C++, Java SE, XML, JSP/Servlets, HTML.
Hadoop Platforms: Cloudera (5.6, 5.7, 5.8)
Databases: Oracle, DB2, Teradata, AWS RDS, Postgres
Big Data Ecosystem: Spark, HDFS, MapReduce, Hive, Apache NiFi, Pig, Sqoop, Flume, Kafka, HBase, Python Pandas.
Cluster Management Tools: Cloudera Manager, Hadoop Security Tools, Hortonworks
Scripting Language: HTML, XML, Scala, Python
Operating Systems: Windows Vista/XP/NT/2000, UNIX, Mac OS
Methodologies: Agile, Waterfall, Lean, V Model
Application Software: E-Commerce Software, Database Systems, Web Portal Software, Data Warehousing
Version Control Tools: GitHub, VSS, Perforce, SVN, TFS
PROFESSIONAL EXPERIENCE:
Hadoop Developer
Confidential, Bentonville, AR
Responsibilities:
- Handled importing data from various sources such as Teradata, Oracle, DB2, SQL Server, and Greenplum using DataStage, Sqoop, and TPT; performed transformations using Hive and loaded the data into HDFS.
- Handled exporting data from HDFS back to databases such as Teradata and Greenplum using Sqoop and TPT.
- Developed end-to-end scalable distributed data pipelines that received data through the Kafka distributed messaging system and persisted it into HDFS using Apache Spark with Scala (see the streaming sketch after this list).
- Involved in creating Hive tables and writing multiple Hive queries to load them for analyzing market data coming from distinct sources.
- Participated in requirements gathering and documented business requirements by conducting workshops and meetings with various business users.
- Involved in sprint planning (Agile methodology) for each implementation task.
- Created extensive SQL queries for data extraction to test the data against databases such as Oracle, Teradata, and DB2.
- Involved in preparing the design flow for DataStage objects to pull data from upstream applications, apply the required transformations, and load it into downstream applications.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
- Performed advanced operations such as text analytics and processing, using the in-memory computing capabilities of Spark in Scala.
- Migrated an existing on-premises application to AWS; used services such as EC2 and S3 for small-dataset processing and storage, and maintained the Hadoop cluster on AWS EMR.
- Queried data using Spark SQL and implemented Spark RDDs in Scala.
- Worked extensively with AWS cloud services such as EC2, S3, EBS, RDS, and VPC.
- Collaborated with Business Analysts to clarify application requirements.
- Followed procedures and standards set by the project.
- Performed structured application code reviews and walkthroughs.
- Identified the technical cause and potential impact of errors and implemented coding or configuration changes.
- Created/updated documentation for the application as code changes were applied.
- Participated in pre- and post-implementation support activities.
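
As referenced in the pipeline bullet above, a hedged sketch of the Kafka-to-HDFS pattern in Spark Streaming with Scala. The broker, topic, consumer group, batch interval, and output path are placeholder assumptions, not values from the actual project.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Sketch: receive messages from Kafka and persist each micro-batch to HDFS.
object KafkaToHdfsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-to-hdfs-sketch")
    val ssc = new StreamingContext(conf, Seconds(30)) // 30-second micro-batches

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "market-data-pipeline",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("market-data"), kafkaParams))

    // Write the message values of each non-empty batch into HDFS, keyed by batch time.
    stream.map(record => record.value)
      .foreachRDD { (rdd, time) =>
        if (!rdd.isEmpty()) {
          rdd.saveAsTextFile(s"hdfs:///data/market/raw/batch_${time.milliseconds}")
        }
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
```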
Environment: IBM Information Server (DataStage 9.1/11.5), UNIX scripting, Hadoop, HDFS, Hive, Sqoop, Oracle, Teradata, AWS, Greenplum, DB2, CA7
Hadoop Developer
Confidential, Chicago, IL
Responsibilities:
- Installed multi-node clusters on the Cloudera platform with the help of the admin team.
- Ingested data received from various database providers onto HDFS using Sqoop for analysis and data processing.
- Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
- Wrote Pig and Hive scripts with UDFs in MapReduce and Python to perform ETL on AWS cloud services.
- Ingested data from various file systems into HDFS using UNIX command-line utilities.
- Worked with Pig, the NoSQL database HBase, and Sqoop for analyzing the Hadoop cluster as well as big data.
- Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data; analyzed the SQL scripts and designed the solution to be implemented in Scala (see the Spark SQL sketch after this list).
- Ran data import and export jobs to copy data to and from HDFS using Sqoop.
- Defined job flows on EC2 servers to load and transform large sets of structured, semi-structured, and unstructured data.
- Implemented NoSQL databases, first Cassandra and later HBase, and monitored the other tools and processes running on YARN.
- Wrote and implemented Apache Pig scripts to load data from and store data into Hive.
- Wrote Hive UDFs to extract data from staging tables and analyzed the web log data using HiveQL.
- Involved in creating Hive tables, loading data, and writing Hive queries that run MapReduce in the backend; applied partitioning and bucketing where required.
- Used UML for the dataflow design for testing and filtering data.
- Used ZooKeeper for various types of centralized configuration.
- Tested the data coming from the source before processing.
- Tested and resolved critical problems faced in the project.
- Managed Hadoop log files using Flume and Kafka.
- Designed Oozie jobs to automate processing of similar data.
- Provided design recommendations and thought leadership to stakeholders that improved review processes and resolved technical problems.
- Debugged technical issues and resolved errors.
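
A sketch of the Spark SQL/Hive pattern referenced in the list above: creating a partitioned, bucketed Hive table and querying web log metrics through Spark with Hive support. The table name, columns, bucket count, and file format are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: partitioned, bucketed Hive table plus a daily web-log aggregation.
object HiveSparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-spark-sql-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Partitioned, bucketed table for raw web logs (executed as Hive DDL).
    spark.sql(
      """CREATE TABLE IF NOT EXISTS web_logs (
        |  visitor_id STRING,
        |  url        STRING,
        |  duration   INT
        |)
        |PARTITIONED BY (log_date STRING)
        |CLUSTERED BY (visitor_id) INTO 32 BUCKETS
        |STORED AS ORC""".stripMargin)

    // Daily unique visitors, page views, and average visit duration.
    val daily = spark.sql(
      """SELECT log_date,
        |       COUNT(DISTINCT visitor_id) AS unique_visitors,
        |       COUNT(*)                   AS page_views,
        |       AVG(duration)              AS avg_visit_duration
        |FROM web_logs
        |GROUP BY log_date""".stripMargin)

    daily.show(20, truncate = false)
    spark.stop()
  }
}
```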
Environment: Java 8, Eclipse, Hadoop, Hive, HBase, Cassandra, Linux, MapReduce, Pig, HDFS, Oozie, Shell Scripting, MySQL.
Big Data Developer
Confidential, Plano, TX
Responsibilities:
- Responsible for loading customer’s data and event logs into HBase.
- Loaded HTML data into an S3 repository using Apache NiFi.
- Created HBase tables to store variable data formats of input data coming from different sources.
- Involved in adding huge volumes of data, as rows and columns, to HBase.
- Created jobs using Apache Spark to pre-process the data.
- Performance-tuned Apache Spark code by adjusting GC settings, memory configuration parameters, Spark default configuration parameters, and Kryo serialization (see the tuning sketch after this list).
- Developed data pipeline to store data from Amazon AWS to HDFS.
- Implemented Kafka consumers to move data from Kafka partitions into Spark for analysis and processing.
- Tuned Kafka to increase consumer throughput.
- Analyzed the web log data using HiveQL to extract the number of unique visitors per day, page views, visit duration, and the most-visited page on the website.
- Used HiveQL to find correlations in customers' browser log data and analyzed them to build risk profiles for such sites.
- Designed the HBase row key to store text and JSON as key values, structuring it so that gets and scans return data in sorted order (see the row-key sketch after this list).
- Worked on cluster coordination services through ZooKeeper.
- Built applications using Maven and integrated with CI servers such as Jenkins to run build jobs.
- Scheduled jobs using cron.
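
As a companion to the performance-tuning bullet above, a hedged sketch of the Spark configuration levers involved (Kryo serialization, memory settings, GC logging). The registered class and the concrete values are examples only; real settings would be sized to the actual cluster.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch of Spark tuning: Kryo serialization, memory/shuffle defaults, GC logging.
object SparkTuningSketch {
  // Hypothetical domain class registered with Kryo to avoid writing full class names.
  case class ClickEvent(visitorId: String, url: String, ts: Long)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("spark-tuning-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[ClickEvent]))
      // Example memory and shuffle settings; depend on executor sizing.
      .set("spark.executor.memory", "8g")
      .set("spark.memory.fraction", "0.6")
      .set("spark.sql.shuffle.partitions", "400")
      // Surface GC details in executor logs so collection pauses can be inspected.
      .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails")

    val spark = SparkSession.builder().config(conf).getOrCreate()
    // ... job logic would go here ...
    spark.stop()
  }
}
```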
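
And a sketch of the row-key design bullet above, using the standard HBase Java client from Scala. The composite key of source id plus zero-padded reversed timestamp, and the table and column names, are illustrative assumptions rather than the project's actual schema.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// Sketch: composite row key so lexicographic scans return newest events first.
object HBaseRowKeySketch {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("customer_events"))

    try {
      val sourceId = "web-portal"
      val eventTimeMs = System.currentTimeMillis()
      // Reverse the timestamp so newest rows sort first within a source id.
      val reversedTs = Long.MaxValue - eventTimeMs
      val rowKey = f"$sourceId%s|$reversedTs%019d"

      val put = new Put(Bytes.toBytes(rowKey))
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"),
        Bytes.toBytes("""{"event":"page_view","page":"/home"}"""))
      table.put(put)
    } finally {
      table.close()
      connection.close()
    }
  }
}
```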
Environment: Hadoop, Spark, HBase, Apache NiFi, Hive, Kafka, cron, AutoSys, GitHub, Maven.
Confidential
Oracle PL SQL Developer
Responsibilities:
- Enhanced database objects by adding new Oracle 10g features.
- Made database objects compatible with Oracle 10g.
- Tested all database objects to qualify Flexcube for Oracle 10g support and standards.
Environment: Oracle 10g, SQL Navigator, PL/SQL Developer, MS VSS (source code control), EditPlus.