- Big Data developer with 7+ years of experience in the software industry, including experience developing applications in Java.
- Expertise in installing, configuring, and administering clusters of major Hadoop distributions.
- Hands-on experience in installing, configuring, and using Hadoop ecosystem components like Hadoop, MapReduce, HDFS, HBase, Hive, Sqoop, Pig, ZooKeeper, Oozie, Flume, Spark, Kafka, and Storm.
- Skilled in managing and reviewing Hadoop log files.
- Expert in writing MapReduce jobs in Java, Scala, and Python.
- Expert in working with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries.
- Experienced in loading data to Hive partitions and creating buckets in Hive
- Expert in using Apache Sqoop to import and export data to and from HDFS and Hive.
- Good at writing Pig scripts (python) and Hive Queries
- In-depth understanding of Data Structures and Algorithms.
- Experience in Hadoop MapReduce programming, Pig Latin, HiveQL, and HDFS.
- Experience with the Oozie workflow engine to run multiple Hive and Pig jobs that execute independently, triggered by time and data availability.
- Experience in writing Shell Scripts (bash, SSH, Python).
- Strong Java/JEE application development background with experience in defining technical and functional specifications
- Proficient in developing Extraction, Transformation and Loading (ETL) strategies and in UNIX shell scripting.
- Experienced in source control repositories viz. SVN, GitHub.
- Experienced in detailed system design using use case analysis, functional analysis, and modeling programs with UML class, sequence, activity, and state diagrams.
- Worked with data warehouse architecture and designed Star Schema, Snowflake Schema, fact and dimension tables, and physical and logical data models.
- Designed Mapping documents for Big Data Application.
- Expertise in successful implementation of projects by following Software Development Life Cycle, including Documentation, Implementation, Unit testing, System testing, build and release.
- Experience with Oracle 9i/10g, MySQL, and SQL Server databases.
- Experience using Agile and Extreme Programming methodologies.
Confidential, Chicago, IL
- Worked on an open-source cluster computing framework based on Apache Spark.
- Participated in the design and development of large-scale changes to enterprise data warehouses.
- Partnered with the solution and data architecture team to create flexible, agile, and impactful data solutions.
- Collected real-time data from IoT devices installed in trucks through Kafka into HDFS.
- Partnered with the risk data management team to define business and data requirements.
- Coordinated with the information management and business intelligence departments.
- Designed and developed a new module for predictive analysis and data inference in a distributed environment, using Java, Hive, Sqoop, and Spark.
- Enhanced existing components to move the data-intensive system from a traditional architecture to a highly available, scalable one.
- Designed and developed Spark applications in Scala to compare the performance of Spark with Hive and Impala.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Handled large datasets during ingestion using partitioning, Spark's in-memory capabilities, broadcasts, efficient joins, and transformations.
- Analyzed SQL scripts and designed solutions to implement them in PySpark.
- Used Hive for ad hoc queries and quick results.
- Prepared Low Level Design documents for all development work involving minor and major changes.
- Prepared High Level Design documents to give an overall picture of system integration.
- Prepared a unit test document for each release, clearly recording the steps followed and the scenarios covered during unit testing.
- Debugged log files whenever a problem occurred in the system and performed root cause analysis.
- Reviewed code and suggested improvements.
Technologies Used: Scala, Python, Spark SQL, Hive, Sqoop, Spark, Oracle, Cloudera, YARN, HDFS, Kafka, Impala, XML, XSL, UML, Multi-threading, Servlets, JUnit 4.8, MRUnit, Linux, ZooKeeper, Ganglia Monitoring, JSP, JavaScript, Apache Log4j
Confidential, Richmond, VA
Hadoop/Spark Developer
- Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Troubleshot the cluster by reviewing log files.
- Involved in performance tuning of Spark applications, including setting the right batch interval and tuning memory.
- Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
- Created cluster usage reports from DataNode, NameNode, ResourceManager, and Navigator log data.
- Used Spark's in-memory computing capabilities with Scala to perform advanced procedures such as text analytics and processing.
- Imported data using Sqoop from Teradata using Teradata connector.
- Used Oozie to orchestrate workflows.
- Developed Spark programs in Java and Scala that process data faster than standard MapReduce programs.
- Created Hive tables and worked on them for data analysis to meet business requirements.
- Designed and implemented a large-scale parallel relation-learning system.
- Installed and benchmarked Hadoop/HBase clusters for internal use.
- Wrote HBase client programs in Java and web services.
- Modeled, serialized, and manipulated data in multiple forms (XML).
- Shared responsibility for administration of Hadoop ecosystem.
- Supported post-production enhancements.
- Experience with data modeling concepts: star schema, dimensional modeling, and relational (ER) design.
Technologies Used: Hadoop, MapReduce, HDFS, Hive, Spark, Java, Scala, Cloudera, HBase, Linux, XML, MySQL Workbench, Java 6, Eclipse, Oracle 10g, PL/SQL, SQL*PLUS
Confidential, Dublin, OH
- Analyzed the Hadoop cluster and various big data analytics tools, including Pig, HBase, and Sqoop.
- Responsible for building scalable distributed data solutions using Hadoop.
- Implemented a nine-node CDH3 Hadoop cluster on Red Hat Enterprise Linux 5 and 6.
- Loaded data from the Linux file system into HDFS and used Sqoop to export the analyzed data to relational databases for visualization and report generation by the BI team.
- Worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode recovery, capacity planning, and slot configuration.
- Created HBase tables to store variable data formats of PII data coming from different portfolios.
- Implemented best income logic using Pig scripts and UDFs, and implemented test scripts to support test-driven development and continuous integration.
- Tuned the performance of Pig queries and worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
- Responsible for managing data coming from different sources and involved in loading data from the UNIX file system to HDFS.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data, and managed cluster coordination services through ZooKeeper.
- Experience in managing and reviewing Hadoop log files and in job management using the Fair Scheduler.
- Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, and managing and reviewing data backups and Hadoop log files.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs, and analyzed large data sets to determine the optimal way to aggregate and report on them.
- Supported setting up the QA environment and updating configurations for implementing scripts with Pig and Sqoop.
Technologies Used: Hadoop, HDFS, Pig, Sqoop, HBase, Shell Scripting, Ubuntu, Linux (RHEL)
Hadoop Admin/Developer
- Created Hive tables and loaded retail transactional data from Teradata using Sqoop.
- Loaded home mortgage data from the existing DWH tables (SQL Server) to HDFS using Sqoop.
- Responsible for Operating system and Hadoop Cluster monitoring using tools like Nagios, Ganglia, Cloudera Manager.
- Worked on POC and implementation & integration of Cloudera & Hortonworks for multiple clients.
- Involved in Hadoop administration on Cloudera, Hortonworks and Apache Hadoop 1.x & 2.x for multiple projects.
- Built and maintained a bill forecasting product that helps reduce electricity consumption by leveraging the features and functionality of Cloudera Hadoop.
- Created ETL jobs to load Twitter JSON data into MongoDB and jobs to load data from MongoDB into Data warehouse.
- Analyzed the Hadoop cluster using different big data analytics tools, including Kafka, Pig, Hive, and MapReduce.
- Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
- Streamed data in real time using Spark with Kafka.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS using Scala.
- Imported and exported data into HDFS using Sqoop and Kafka.
- Wrote Hive Queries to have a consolidated view of the mortgage and retail data.
- Loaded data back into Teradata for Basel reporting and for business users to analyze and visualize using Datameer.
- Orchestrated hundreds of Sqoop scripts, Pig scripts, and Hive queries using Oozie workflows and sub-workflows.
- Loaded load-ready files from mainframes into Hadoop and converted them to ASCII format.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting
- Responsible for software installation, configuration, and upgrades, backup and recovery, commissioning and decommissioning data nodes, cluster setup, daily cluster performance monitoring, and keeping clusters healthy on different Hadoop distributions (Hortonworks and Cloudera).
- Wrote Hive and Pig scripts as an ETL tool to perform transformations, event joins, bot-traffic filtering, and some pre-aggregations before storing the data in HDFS.
- Developed MapReduce programs to write data with headers and footers and Shell scripts to convert the data to fixed-length format suitable for Mainframes CICS consumption.
- Used Maven for continuous build integration and deployment.
- Followed Agile methodology for development using XP practices (TDD, continuous integration).
- Participated in daily scrum meetings and iterative development.
- Supported team using Talend as ETL tool to transform and load the data from different databases.
- Exposure to burn-up, burn-down charts, dashboards, velocity reporting of sprint and release progress.
Technologies Used: Hadoop, MapReduce, Cloudera, Hive, Pig, Kafka, Sqoop, Avro, ETL, Hortonworks, Datameer, Teradata, SQL Server, IBM Mainframes, Java 7.0, Log4j, JUnit, MRUnit, SVN, JIRA
Confidential, Greenville, SC
Java / Hadoop Developer
- Provisioning, building, and supporting physical and virtual (VMware) Linux servers for production, QA, and development environments.
- Responsible for implementation and ongoing administration of Hadoop infrastructure.
- Deploying new hardware and software environments required for Hadoop and expanding existing environments.
- HDFS support and maintenance.
- Teaming with the infrastructure, network, database, application, and business intelligence teams to ensure high data quality and availability.
- Collaborating with application teams to install operating system and Hadoop updates, patches, version upgrades when required.
- Screening Hadoop cluster job performance and performing capacity planning.
- Monitoring Hadoop cluster connectivity and security.
- Working with data delivery teams to set up new Hadoop users, including setting up Linux users and testing HDFS, Hive, Pig, and MapReduce access for the new users.
- Loading and transforming large sets of structured, semi-structured, and unstructured data from HBase through Sqoop and placing them in HDFS for further processing.
- Performing Linux systems administration on production and development servers (Red Hat Linux, CentOS and other UNIX utilities).
- Installing Patches and packages on Unix/Linux Servers.
- Installation and Configuration of VMware vSphere client, Virtual Server creation and resource allocation.
- Performance Tuning, Client/Server Connectivity and Database Consistency Checks using different Utilities.
- Shell scripting for Linux/Unix Systems Administration and related tasks.
Technologies Used: Red Hat Linux/CentOS 4, 5, 6, Logical Volume Manager, Hadoop, VMware ESX 5.1/5.5, Apache and Tomcat Web Server, Oracle 11, 12, Oracle RAC 12c, HPSM, HPSA
- Involved in Design, Development and Support phases of Software Development Life Cycle (SDLC)
- Gathered data for requirements and use case development.
- Reviewed the functional, design, source code and test specifications
- Developed the complete frontend using JavaScript and CSS.
- Implemented backend configuration for the DAO and XML generation modules of DIS.
- Used JDBC for database access and applied the Data Transfer Object (DTO) design pattern.
- Performed unit testing and rigorous integration testing of the whole application.
- Wrote and executed test scripts using JUnit and actively participated in system testing.
- Developed XML parsing tool for regression testing
- Worked on documentation that meets required compliance standards and monitored end-to-end testing activities.