- Experienced Hadoop developer with 8+ years of software development experience and 4+ years of experience with, and a strong foundation in, distributed storage systems such as HDFS and HBase in big data environments.
- Excellent understanding of the complexities associated with big data, with experience developing modules and code in MapReduce, Hive, Pig and Spark to address those complexities.
- Worked with data import and export using Sqoop and Flume; familiar with messaging systems such as Kafka.
- Comprehensive experience in big data processing using the Hadoop ecosystem, including Pig, Hive, HDFS, MapReduce (MRv1 and YARN), Mahout, HBase, Sqoop, Flume, Kafka, Oozie, Zookeeper, Scala, Storm, Impala, Apache Drill, Solr & Lucene, Talend and Teradata.
- Experience in collecting metrics for Hadoop clusters using Ambari & Cloudera Manager.
- Worked extensively on different Hadoop distributions like Cloudera’s CDH and Hortonworks HDP.
- Experience in YARN environments with Storm, Spark, Kafka and Avro.
- Extensively used ETL methodology to support data extraction, transformation and loading using Informatica.
- Excellent technical and analytical skills with clear understanding of ETL design and project architecture based on reporting requirements.
- Implemented proofs of concept on the Hadoop stack and different big data analytic tools, including migration from different databases (e.g. Netezza) to Hadoop.
- Excellent programming skills at a higher level of abstraction using Scala and Spark.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data.
- Good understanding of real time analytics with Apache Spark (RDD and Datasets API).
- Capable of designing Spark SQL based on functional specifications.
- Expertise in YARN environments with Storm, Scala, Spark and Kafka.
- Strong experience working with real-time streaming applications and batch-style, large-scale distributed computing applications using tools such as Spark Streaming, Kafka, Flume, MapReduce and Hive.
- Configured Flume to extract data from web server output files and load it into HDFS.
- Extensive hands-on experience writing MapReduce jobs in Java.
- Performed data analysis using Hive and Pig. Experience in analyzing large datasets using HiveQL and Pig Latin.
- Experience in developing custom UDFs for Pig and Hive to incorporate Python/Java methods and functionality into Pig Latin and HQL (HiveQL); also used UDFs from the Piggybank UDF repository.
- Good at analyzing data with Pig scripts using grouping, sorting and joining functions.
- Good understanding of NoSQL databases like MongoDB, Cassandra, and HBase.
- Built real-time big data solutions using HBase with billions of records.
- Skilled in Cassandra maintenance and performance tuning on both database and server.
- Managed and scheduled batch jobs on a Hadoop cluster using Oozie.
- Skilled in using Zookeeper to provide coordination services to cluster.
- Worked on different file formats like Avro, Parquet, RC file format, JSON format.
- Experienced in creating tables on top of Parquet format in Impala.
- Involved in writing scripts for building disaster recovery process for current
- Good working knowledge of cloud integration with Amazon Web Services components such as EMR, EC2 and S3.
- Planned, deployed, monitored and maintained AWS cloud infrastructure consisting of multiple nodes.
- Experienced in working with Hadoop/big data storage and analytical frameworks on the AWS cloud using tools such as SSH and PuTTY.
- Used Apache NiFi for loading PDF Documents from Microsoft SharePoint to HDFS.
- Used Avro serialization technique to serialize data for handling schema evolution.
- Experienced in using analytics packages like R and the algorithms provided by Mahout.
- Familiar with using Apache Drill in data-intensive distributed applications for interactive analysis of large-scale datasets.
- Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems and vice-versa.
- Expertise in search technologies such as Solr and Lucene.
- Built scripts using Maven and deployed the application on the JBoss application server.
- Involved in creating mappings, active transformations and also reusable transformations.
- Expertise in distributed and web environments, focused on core Java technologies such as Collections, Multithreading, IO, Exception Handling and Memory Management.
- Highly skilled in Planning, Designing, Developing and Deploying Data Warehouses/Data Marts.
- Experienced in analyzing business requirements and translating them into functional and technical
- Developed monitoring and notification tools using Python.
Big Data Technologies: HDFS, MapReduce, Pig, Hive, Sqoop, Flume, Hue, Impala, YARN, Oozie, Zookeeper, MapR Converged Data Platform, CDH, HDP, EMR, Apache Spark, Apache Kafka, Apache Storm, Apache Crunch, Avro, Parquet, Apache NiFi.
Databases: Netezza, SQL Server, MySQL, Oracle, DB2.
Development Methodologies: Waterfall, Agile Methodologies (Scrum).
Frameworks: MVC, Struts, Hibernate, Spring.
IDE Development Tools: Eclipse, NetBeans, Visual Studio.
Java Technologies: Java, J2EE, JDBC, JUnit, Log4j.
NoSQL Databases: HBase, MongoDB, Cassandra, Neo4j.
Operating Systems: Windows, Linux, Unix.
Programming Languages: C, Java, Python, Unix, Shell Scripting, C++.
Software Management Technologies: SVN, Git, Jira, Maven.
Web Servers: WebLogic, WebSphere, Apache Tomcat, JBoss.
Sr. Hadoop Developer / Spark Developer
- Led a team of 3 offshore and 2 onshore resources in successfully planning, designing and building the end-to-end solution that enables user-driven analytics on top of the dealer data.
- Worked with key stakeholders from different business groups to identify the core requirements for building the next-generation analytic solution, using Impala as the processing framework and Hadoop for storage on the current dealer data lake.
- Involved in migrating MapReduce jobs to Spark jobs, and used Spark SQL to load structured and semi-structured data into Spark clusters.
- Implemented Spark using Scala and SparkSQL for faster testing and processing of data.
- Extracted web logs using Spark Streaming jobs written in JavaScript.
- Converted all the vap processing from Netezza and implemented it using Spark and RDDs.
- Developed Kafka producer and consumers, Spark and Hadoop MapReduce jobs.
- Explored Spark for improving the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, pair RDDs and Spark on YARN.
- Enhanced and optimized product Spark code to aggregate, group and run data mining tasks using the Spark framework.
- Used HCatalog to access Hive table metadata from MapReduce and Pig code.
- Uploaded streaming data from Kafka to HDFS, HBase and Hive by integrating with Storm.
- Orchestrated hundreds of Sqoop scripts, Python scripts and Hive queries using Oozie workflows and sub-workflows.
- Used AWS cloud services to launch Linux and Windows machines, created security groups, and wrote basic PowerShell scripts to take backups and mount network shared drives.
- Analyzed HBase data in Hive by creating external partitioned and bucketed tables.
- Solved performance issues in Hive and Pig scripts by understanding how joins, grouping and aggregation translate to MapReduce jobs.
- Processed large data sets on the Hadoop cluster: data stored on HDFS was preprocessed and validated using Pig, then the processed data was stored in the Hive warehouse, enabling business analysts to get the required data from Hive.
- Wrote a generic, extensive data quality check framework in Impala for use by the application.
- Tuned performance in Hive and Impala using methods including, but not limited to, dynamic partitioning, bucketing, indexing, file compression, vectorization and cost-based optimization.
- Developed Impala scripts to meet end user and analyst requirements for ad hoc analysis.
- Designed and presented a plan on Impala.
- Performed data validation, identified and resolved issues between Oracle and Hadoop, and helped the client retire the Oracle instance that used to host the DDSW solution.
- Implemented Fair Scheduler on the job tracker to allocate the fair amount of resources to small jobs.
- Hands on experience in database performance tuning and Data modeling.
- Implemented automatic failover using Zookeeper and the Zookeeper Failover Controller.
- Developed Oozie workflow for scheduling and orchestrating the ETL process.
- Designed and implemented Java MapReduce programs to support distributed data processing.
- Managed Solr search engine work including indexing data, tuning relevance, developing custom tokenizers and filters, and adding functionality such as playlists, custom sorting and regionalization.
- Designed and developed automation test scripts using Python.
- Worked with cache data stored in Cassandra.
- Developed a data pipeline using Flume, Sqoop, Pig and Java MapReduce to ingest customer behavioral data and financial histories into HDFS.
- Used the Apache Hue web interface to monitor the Hadoop cluster and run jobs.
- Used Mahout and MapReduce to parallelize a single iteration.
- Responsible for implementing the application system with core Java and the Spring framework.
- Ingested log data into an ETL pipeline that transforms text-format data and loads it into HDFS.
- Participated in daily scrum meetings and iterative development.
- Worked with a small team to develop an initial prototype of a NiFi big data pipeline that demonstrated an end-to-end scenario of data ingestion and processing.
- Worked with Apache NiFi to develop custom tasks for transforming and distributing data among cloud systems.
- Involved in troubleshooting and performance tuning of reports and resolving issues within Tableau Server and Reports.
- Analyzed the quality of source data using Talend Data Quality.
- Used various Teradata Index techniques to improve the query performance.
- Participated in choosing the tech stack (Scala, Akka, Cassandra) for the new microservices.
- Stored the thermostat data as CSV files in storage blobs. Developed a PowerShell script to move data between Azure blobs, both within the same subscription and across different subscriptions.
- Configured and managed Azure Storage using PowerShell, Azure Portal and Azure Virtual Machines for high-availability solutions.
ENVIRONMENT: Hadoop, MapReduce, Hive, HDFS, Pig, Sqoop, Flume, HBase, Spark, Zookeeper, Ambari (Hortonworks), AWS, MySQL, Impala, Python, UNIX.
Confidential - Kansas City, KS
- Collected and organized logs from multiple services and made them available in a standard format to multiple consumers.
- Extracted producers' big data from various data sources into Hadoop HDFS, including data from Excel, ERP systems, databases, CSV files and log data from sensors/meters.
- Used Sqoop to efficiently transfer data between databases and HDFS, and used Flume to stream log data from servers/sensors.
- Developed MapReduce programs to cleanse and parse data in HDFS obtained from various data sources.
- Used the Hive data warehouse tool to analyze the unified historic data in HDFS to identify issues and behavioral patterns.
- Created internal and external Hive tables as required, defined with appropriate static and dynamic partitions for efficiency.
- Used the RegEx, JSON and Avro SerDes (serialization and deserialization) packaged with Hive to parse the contents of streamed log data.
- Implemented custom Hive UDFs to integrate weather and geographical data with producer’s business data to achieve comprehensive data analysis.
- Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive and Sqoop, as well as system-specific jobs.
- Worked along with the Hadoop Operations team in Hadoop cluster planning, installation, maintenance, monitoring and upgrades.
- Reviewed HDFS usage and system design for future scalability and fault tolerance.
- Merged small files and loaded them into HDFS using Java code, maintaining the merge-tracking history in HBase.
- Used Pig as an ETL tool for transformations, joins and some pre-aggregations before storing data in HDFS, and developed a MapReduce program for parsing and loading information into HDFS.
- Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
- Used Apache Storm for extracting the data by designing a topology as per client requirement.
- Integrated the supervisor node in Storm using Zookeeper.
- Used Flume to collect, aggregate and store web log data from different sources such as web servers, mobile and network devices, and pushed it into HDFS.
- Extensive experience in writing Pig scripts to transform raw data from several data sources into baseline data.
- Developed and ran MapReduce jobs on YARN and Hadoop clusters to produce daily and monthly reports as per user needs.
- Worked on NoSQL databases including MongoDB, Cassandra and HBase.
- Worked with multiple Input Formats such as Text File, Key Value, Sequence File input format.
- Involved in developing a multithreaded environment to improve the performance of merging operations.
- Designed and developed analytic systems to extract meaningful data from large-scale structured and unstructured health data.
- Developed several shell/Python scripts for file validation and to call the jar files that run specific processes on the data.
- Created CloudFormation templates and deployed AWS resources using them.
- Wrote scripts to automate the importing and exporting of data using Sqoop from HDFS to Netezza and vice versa.
- Developed a core ETL framework using Informatica power designer and Netezza as the backend database, which is metadata driven and stages the source files on the fly based on the file naming convention to run the desired workflow for the process.
- Created action filters, parameters and calculated sets for preparing dashboards and worksheets in Tableau.
- Implemented custom error handling in Talend jobs and also worked on different methods of logging.
- Worked exclusively with Teradata SQL Assistant to interface with Teradata.
- Added a script to find fixed tickets for the Akka repository using shell scripting, and converted Akka classes to use new functionalities over deprecated functionalities.
- Worked on Hue integration with LDAP for authentication/authorization.
- Automated OS builds and application installation through Puppet and Chef.
- Utilized Ansible, Puppet, Git and Rundeck to install and configure Linux environments with successful production.
- Tested the possible use of a graph database (Neo4j) by combining different sources and finding the relevant path for a node.
ENVIRONMENT: Hortonworks Hadoop, HDFS, Hive, HQL scripts, Kafka, MapReduce, Storm, Java, HBase, Pig, Sqoop, Shell Scripts, Oozie coordinator, MySQL, Linux.
Confidential - Detroit, MI
- Involved in all phases of the SDLC during the development of the AHA solution at Confidential.
- Responsible for building scalable distributed data solutions using Hadoop.
- Involved in installing and configuring Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
- Wrote simple to complex MapReduce jobs using Hive and Pig.
- Developed MapReduce jobs that use HDFS efficiently by applying various compression mechanisms.
- Integrated MapReduce with HBase to import bulk data using MR programs.
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
- Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
- Collected log data from web servers and integrated it into HDFS using Flume.
- Developed processes to load data from server logs into HDFS using Flume and from the Unix file system into HDFS.
- Implemented NameNode backup using NFS for high availability.
- Created Hive external tables, loaded data into them and queried the data using HQL.
- Used UDFs to implement business logic in Hadoop.
- Implemented business logic by writing UDFs in Java and used various UDFs from Resources.
- Hands-on experience in monitoring and managing the Hadoop cluster using Cloudera Manager.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Reviewed and modified Unix scripts used for batch jobs.
- Wrote shell scripts to automate administration tasks.
- Developed Scala scripts and UDFs using both SQL and RDD/MapReduce in Spark for data aggregation and queries, and wrote data back into the RDBMS through Sqoop.
- Created plugins in Apache Drill for low-latency queries and used it with Tableau over JDBC.
- Developed Cluster coordination services through Zookeeper.
- Experience in Ambari and HDP upgrades with minimal downtime.
- Wrote shell scripts to automate rolling day-to-day processes.
- Experience in managing and reviewing Hadoop log files.
- Performed aggregations and analysis on large sets of log data collected using custom-built input adapters.
- Provided connections using JDBC to the database and developed SQL queries to manipulate the data.
- Wrote and executed various MySQL database queries from Python using the Python-MySQL connector and the MySQLdb package.
- Extracted data from source systems and transformed it into newer systems using Talend DI components.
- Excellent knowledge of ETL tools such as Informatica for efficiently loading data into and extracting data from Teradata over various connections.
- Contributed to the open source project Akka, a framework for simplifying the construction of concurrent and distributed applications on the JVM (Java Virtual Machine).
- Set up high availability for NameNode, ResourceManager, Hive, Oozie and Hue.
- Implemented Infrastructure automation through Puppet, for auto provisioning, code deployments, software installation and configuration updates.
ENVIRONMENT: Hortonworks Hadoop, HDFS, Hive, HQL scripts, Scala, Kafka, MapReduce, Storm, Java, HBase, Pig, Sqoop, Shell Scripts, Oozie coordinator, MySQL, Linux, Tableau.
- Involved in the systems study and designing of the project.
- Used JDBC to connect to the Oracle Database.
- Created JSP and Servlet pages for the Analytical Engine.
- Created JSP pages for Payroll Processing, Human Resource Solutions, Retirement Services, Time & Labor Management, Tax & Compliance Management, Employee Benefits Administration, Screening and Selection Services, and Professional Employer Organization.
- Developed complex SQL stored procedures, complex views and database triggers using Oracle to calculate various values necessary in project development.
- Used OO techniques and UML methodology (use cases, sequence diagrams, activity diagrams).
- Developed server side applications using Servlets, EJBs, and JDBC.
- Implemented the business layer by using Hibernate with Spring DAO and also developed mapping files and
- Implemented error logging aspects (Log4j) using Spring AOP.
- Used SVN version control tool.
- Responsible for cross browser compatibility and hence exposure to popular browsers.
- Involved in fixing bugs and unit testing with test cases using JUnit.
- Developed user and technical documentation.
- Used Eclipse for writing code and CVS for version control.
- Involved in fixing bugs and minor enhancements for the front-end module.
- Resolved scalability and performance issues in both the applications and the WebLogic Application Server.
- Coordinated work with DB team, QA team, Business Analysts and Client Reps to complete the client requirements efficiently.
ENVIRONMENT: Java, J2EE, JDBC, Spring, Hibernate, Java Servlets, Struts MVC, Oracle, HTML, CVS, PL/SQL, SOAP, Eclipse.