- 8+ years of experience in the analysis, design, development, testing, maintenance, and use of software applications, including around 5 years in Big Data with the Hadoop framework (HDFS, Hive, Pig, MapReduce, Sqoop, Oozie), MongoDB, Cassandra, AWS, ETL, and Cloudera environments, and years of experience in Java/J2EE.
- Excellent understanding of Hadoop architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node and Map Reduce programming paradigm.
- Excellent working experience with Hadoop distributions such as Hortonworks, Cloudera, and IBM Big Insights.
- Strong hands on experience with Hadoop ecosystem components like Hadoop Map Reduce, YARN, HDFS, Hive, Pig, Hbase, Storm, Sqoop, Impala, Oozie, Kafka, Spark, and ZooKeeper.
- Expertise in loading and transforming large sets of structured, semi - structured and unstructured data.
- Experienced in using analytical tools such as R and Python to identify trends and relationships between different pieces of data, draw appropriate conclusions, and translate analytical findings into risk management and marketing strategies that drive value.
- Expertise in optimizing Map Reduce algorithms using Mappers, Reducers, and combiners to deliver the best results for the large datasets.
- Experienced with cloud: Hadoop-on-Azure, AWS/EMR, Cloudera Manager (also direct-Hadoop-EC2 (non EMR)).
- Very good experience in writing Map Reduce jobs using Java native code, Pig, and Hive for various business use cases.
- Strong Experience in writing Pig scripts and Hive Queries and Spark SQL queries to analyze large datasets and troubleshooting errors.
- Well versed in Relational Database Design/Development with Database Mapping, PL/SQL Queries, Stored Procedures and Packages using Oracle, DB2, Teradata and MySQL Databases.
- Excellent working experience in designing and implementing complete end-to-end Hadoop infrastructure, including Pig, Hive, Sqoop, Oozie, Flume, and ZooKeeper.
- Have extensive knowledge and working experience on Software Development Life Cycle (SDLC), Service-Oriented architecture (SOA), Rational Unified Process (RUP), Object Oriented Analysis and Design (OOAD), UML and J2EE Architecture.
- Extensive knowledge of OOPS, OOAD, UML concepts (Use Cases, Class Diagrams, Sequence Diagrams, Deployment Diagrams etc), SEI-CMMI and SixSigma.
- Proficiency in using frameworks and tools like Struts, Ant, JUnit, WebSphere Studio Application Developer (WSAD5.1), JBuilder, Eclipse, IBM Rapid Application Developer (RAD)
- Expertise in designing and coding Stored Procedures, Triggers, Cursors and Functions using PL/SQL.
- Expertise in developing XML documents with XSD validations, SAX, DOM, JAXP parsers to parse the data held in XML documents.
- Experienced with GUI/IDE tools such as Eclipse, JBuilder and WSAD5.0, and skilled in writing Ant scripts for development and deployment purposes.
- Expertise in using Java performance tuning tools like JMeter and JProfiler, and Log4j for logging.
- Extensive Experience in using MVC (Model View Controller) architecture for developing applications using JSP, JavaBeans, Servlets.
- Highly self-motivated and goal-oriented team player with strong analytical, debugging and problem-solving skills. Strong in object-oriented analysis and design. Diversified knowledge and ability to learn new technologies quickly.
- Knowledge in implementing enterprise Web Services, SOA, UDDI, SOAP, JAX-RPC, XSD, WSDL and AXIS.
- Expertise in working with various databases like Oracle and SQLServer using Hibernate, SQL, PL/SQL, Stored procedures.
Hadoop/Big Data: HDFS, MapReduce, Hive, Pig, HBase, Sqoop, Impala, Oozie, Kafka, Spark, ZooKeeper, Storm, YARN, AWS, AWS S3, AWS EMR, PySpark, NiFi, AWS Glue.
Java & J2EE Technologies: Core Java, Servlets, JSP, JDBC, JNDI, Java Beans
IDEs: Eclipse, NetBeans, IntelliJ
Frameworks: MVC, Struts, Hibernate, Spring
Databases: Oracle, MySQL, DB2, Teradata, MS-SQL Server.
NoSQL Databases: HBase, Cassandra, MongoDB
Web Servers: WebLogic, WebSphere, Apache Tomcat
Network Protocols: TCP/IP, UDP, HTTP, DNS, DHCP
ETL Tools: Informatica BDM, Talend.
Web Development: HTML, DHTML, XHTML, CSS, JavaScript, AJAX
XML/Web Services: XML, XSD, WSDL, SOAP, Apache Axis, DOM, SAX, JAXP, JAXB, XMLBeans.
Methodologies/Design Patterns: OOAD, OOP, UML, MVC2, DAO, Factory pattern, Session Facade
Operating Systems: Windows, AIX, Sun Solaris, HP-UX.
Sr. Big Data Developer/Engineer
Confidential, Nashville, TN
- Responsible for design and development of analytic models, applications and supporting tools, which enable Developers to create algorithms/models in a big data ecosystem.
- Built Hadoop data lakes and developed the architecture used in implementations within the organization.
- Collaborate with various stakeholders (Domain Architects, Solution Architects, and Business Analysts) and provide Initial datasets and founding feature sets to Data Scientists for building Machine learning predictive models using Pyspark.
- Installed and set up Hadoop CDH clusters for development and production environments, and installed and configured Hive, Pig, Sqoop, Flume, Cloudera Manager and Oozie on the Hadoop cluster.
- Planned hardware and software installation for the production cluster and coordinated with multiple teams to complete it.
- Ingest Legacy datasets into HDFS using Sqoop Scripts and populate Enterprise DataLake by importing tables from Oracle, Greenplum databases and Mainframe Sources and store them in partitioned HIVE tables using ORC and Zlib compression.
- Monitored multiple Hadoop clusters environments using Cloudera Manager. Monitored workload, job performance and collected metrics for Hadoop cluster when required.
- Set up and configured AWS EMR clusters, used Amazon IAM to grant users fine-grained access to AWS resources, and implemented solutions using services such as EC2, S3, RDS, Redshift, and VPC.
- Involved in implementing High Availability and automatic failover infrastructure to overcome the NameNode single point of failure using ZooKeeper services.
- Responsible for writing Hive queries for analyzing data in the Hive warehouse using Hive Query Language (HQL) and Hive UDFs in Python.
- Involved in installing EMR clusters on AWS and used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in Amazon S3 bucket.
- Identified query duplication, complexity and dependency to minimize migration efforts. Technology stack: Oracle, Hortonworks HDP cluster, Attunity Visibility, Cloudera Navigator Optimizer, AWS Cloud and DynamoDB.
- Performed an upgrade in development environment from CDH 4.2 to CDH 4.6 and automated end to end workflow from Data preparation to presentation layer for Artist Dashboard project using Shell Scripting.
- Converted Informatica ETLs to Spark Scala ETLs and stored data in Hive external tables so end users and analysts could perform ad hoc analysis.
- Design & Develop ETL workflow using Oozie for business requirements, which includes automating the extraction of data from MySQL database into HDFS using Sqoop scripts.
- Used Spark Streaming through the core Spark API with Scala and Java to transform raw data from several data sources into baseline data.
- Design and create the Complete "ETL" process from end-to-end using Talend jobs and create the test cases for validating the Data in the Data Marts and in the Data Warehouse.
- Captured data daily from OLTP systems and various XML, Excel and CSV sources and loaded it into Talend ETL tools.
- Lead design of high-level conceptual and logical models that facilitate a cross-system/cross functional view of data requirements and Involved in understanding and creating Logical and Physical Data model using Erwin Tool.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
- Designed the schema for, configured, and deployed AWS Redshift for optimal storage and fast retrieval of data. Used Spark DataFrames, Spark SQL, and Spark MLlib extensively, and designed and developed POCs using Scala, Spark SQL and MLlib libraries.
- Developed MapReduce programs to extract and transform data sets; the resulting data sets were loaded into Cassandra, and vice versa, using Kafka. Used the Kafka messaging system registered to Cassandra brokers to pull the data into HDFS.
- Involved in querying data using Spark SQL on top of Spark engine and involved in managing and monitoring Hadoop cluster using Cloudera Manager.
- Used Talend ETL tool for monitoring and managing complex deployments with ease to generate the response.
- Conducting RCA to find out data issues and resolve production problems and proactively involved in ongoing maintenance, support and improvements in Hadoop cluster.
- Implemented Kafka consumers to move data from Kafka partitions into Cassandra for near real time analysis.
- Performed data analytics in Hive and exported the resulting metrics back to the Oracle database using Sqoop, and used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Involved in designing and architecting data warehouses and data lakes on regular (Oracle, SQL Server) high performance (Netezza and Teradata) and big data (Hadoop - MongoDB, Hive, Cassandra and HBase) databases.
- Collaborating with business users/product owners/developers to contribute to the analysis of functional requirements.
Environment: Cloudera Hadoop, Talend Open Studio, MapReduce, Informatica BDM, Python, HDFS, NiFi, JanusGraph, Hive, Pig, Sqoop, Oozie, Flume, ZooKeeper, LDAP, MongoDB, HBase, Erwin, Cassandra, Spark, Scala, AWS EMR, S3, Kafka, SQL, Data Warehousing, Java, Tableau, XML, PL/SQL, RDBMS and PySpark.
Confidential, Chicago, IL
- Involved in Installing, Configuring Hadoop Eco System, Cloudera Manager using CDH3, CDH4 Distributions.
- Involved in the end-to-end process of Hadoop jobs that used technologies such as Sqoop, Pig, Hive, MapReduce, Spark and shell scripts (for scheduling a few jobs). Extracted and loaded data with Sqoop into the data lake environment (Amazon S3), where it was accessed by business users and data scientists.
- Responsible to manage data coming from various sources and involved in HDFS maintenance and loading of structured and unstructured data and visualize the HDFS data to customer using BI tool with the help of Hive ODBC Driver.
- Generation of business reports from DataLake using Hadoop SQL (Impala) as per the Business Needs and automation of Business reports using Bash scripts in UNIX on Datalake by sending them to business owners.
- Utilized Apache Spark with Python to develop and execute Big Data Analytics and Machine learning applications, executed machine learning use cases under Spark ML and Mllib.
- Configured and monitored a MongoDB cluster in AWS and established connections from Hadoop to MongoDB for data transfer.
- Used Erwin Data Modeler and Erwin Model Manager to create Conceptual, Logical and Physical data models and maintain the model versions in Model Manager for further enhancements.
- Used the Scala API for programming in Apache Spark and imported data from Teradata using Sqoop with the Teradata connector.
- Developed multiple POCs using Scala and PySpark, deployed them on the YARN cluster, and compared the performance of Spark and SQL.
- Developed Spark scripts by using Scala shell commands as per the requirement and analyzed the data using Amazon EMR.
- Developed export framework using Python, Sqoop, Oracle & Mysql and Created Data Pipeline of Map Reduce programs using Chained Mappers.
- Worked on a POC of Talend integration with Hadoop and created Talend jobs to extract data from Hadoop.
- Installed Kafka on the Hadoop cluster and wrote the producer and consumer code in Java to establish a connection from a Twitter source to HDFS.
- Involved in the design and deployment of a Hadoop cluster and different Big Data analytic tools including Pig, Hive, HBase, Oozie, ZooKeeper, Sqoop, Flume, Spark, Impala, and Cassandra with the Hortonworks distribution.
- Imported data using Sqoop to load data from MySQL to HDFS on regular basis and implemented Optimized join base by joining different data sets to get top claims based on state using Map Reduce.
- Worked on social media (Facebook, Twitter etc) data crawling using Java and R language and MongoDB for unstructured data storage.
- Optimizing of existing algorithms in Hadoop using Spark Context, Spark-SQL, Data Frames and Pair RDD's.
- Implemented solutions for ingesting data from various sources and processing the Data-at-Rest utilizing BigData technologies such as Hadoop, MapReduce Frameworks, HBase, Hive, Oozie, Flume, Sqoop etc
- Installed Hadoop, Map Reduce, HDFS, and AWS and developed multiple MapReduce jobs in PIG and Hive for data cleaning and pre-processing
- Integrated the Quartz scheduler with Oozie workflows to get data from multiple data sources in parallel using fork.
- Created Partitions, Buckets based on State to further process using Bucket based Hive joins.
- Responsible for developing data pipeline with Amazon AWS to extract the data from weblogs and store in HDFS.
- Created Hive Generic UDF's to process business logic that varies based on policy and Imported Relational Data base data using Sqoop into Hive Dynamic partition tables using staging tables.
- Worked on custom Pig Loaders and storage classes to work with variety of data formats such as JSON and XML file formats.
- Worked with teams in setting up AWS EC2 instances by using different AWS services like S3, EBS, Elastic Load Balancer, and Auto scaling groups, VPC subnets and CloudWatch.
- Used the Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Import the data from different sources like HDFS/Hbase into Spark RDD and developed a data pipeline using Kafka and Storm to store data into HDFS.
- Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig, and Sqoop.
- Used Spark as an ETL tool to remove duplicates, join, and aggregate the input data before storing it in a blob.
- Extensively worked on developing Informatica Mappings, Mapplets, Sessions, Worklets and Workflows for data loads.
- Developed Java code that creates the mapping in Elasticsearch before data is indexed into it.
- Monitored the cluster using Cloudera Manager and developed unit test cases using the JUnit, EasyMock and MRUnit testing frameworks.
Environment: Hadoop, HDFS, HBase, Spark, MapReduce, Teradata, Informatica BDM, MySQL, Java, Python, Hive, Pig, Data Warehousing, Sqoop, Flume, Oozie, SQL, Cloudera Manager, Erwin, MongoDB, Cassandra, Scala, AWS EMR, S3, EC2, RDBMS, XML, Elasticsearch, Kafka, Tableau, ETL.
Confidential, Conshohocken, PA
- Imported Data from Different Relational Data Sources like RDBMS, Teradata to HDFS using Sqoop.
- Imported bulk data into HBase using MapReduce programs and performed analytics on time-series data in HBase using the HBase API.
- Designed and implemented Incremental Imports into Hive tables and used Rest API to Access HBase data to perform analytics.
- Installed/Configured/Maintained Apache Hadoop clusters for application development and Hadoop tools like Hive, Pig, HBase, Flume, Oozie, Zookeeper and Sqoop.
- Created POC to store Server Log data in MongoDB to identify System Alert Metrics.
- Implemented usage of Amazon EMR for processing Big Data across a Hadoop Cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Imported data from various data sources, performed transformations using Hive and MapReduce, and loaded the data into HDFS. Extracted data from MySQL into HDFS using Sqoop.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS.
- Worked in Loading and transforming large sets of structured, semi structured and unstructured data
- Involved in collecting, aggregating and moving data from servers to HDFS using Apache Flume
- Implemented end-to-end systems for Data Analytics, Data Automation and integrated with custom visualization tools using R, Hadoop and MongoDB, Cassandra.
- Involved in Installation and configuration of Cloudera distribution Hadoop, NameNode, Secondary NameNode, JobTracker, TaskTrackers and DataNodes.
- Wrote Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying on the log data.
- Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, Python, and a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
- Used S3 Bucket to store the jar's, input datasets and used Dynamo DB to store the processed output from the input data set.
- Worked with Cassandra for non-relational data storage and retrieval on enterprise use cases and wrote MapReduce jobs using Java API and Pig Latin.
- Improving the performance and optimization of existing algorithms in Hadoop using Spark context, Spark-SQL and Spark YARN.
- Involved in creating a data lake by extracting customers' Big Data from various data sources into Hadoop HDFS. This included data from Excel, flat files, Oracle, SQL Server, MongoDB, Cassandra, HBase, Teradata, and Netezza, as well as log data from servers.
- Doing data synchronization between EC2 and S3, Hive stand-up, and AWS profiling.
- Created reports for the BI team using Sqoop to export data into HDFS and Hive and involved in creating Hive tables and loading them into dynamic partition tables.
- Involved in managing and reviewing the Hadoop log files and migrated ETL jobs to Pig scripts to do Transformations, even joins and some pre-aggregations before storing the data to HDFS.
- Worked with the Talend ETL tool and used features such as context variables and components such as tOracleInput, tOracleOutput, tFileCompare, tFileCopy, and tOracleClose.
- Worked on NoSQL databases including HBase and MongoDB. Configured MySQL Database to store Hive metadata.
- Deployment and Testing of the system in Hadoop MapR Cluster and worked on different file formats like Sequence files, XML files and Map files using Map Reduce Programs.
- Developed multiple MapReduce jobs in Java for data cleaning and preprocessing and imported data from RDBMS environment into HDFS using Sqoop for report generation and visualization purpose using Tableau.
- Developed the ETL mappings using mapplets and re-usable transformations, and various transformations such as source qualifier, expression, connected and un-connected lookup, router, aggregator, filter, sequence generator, update strategy, normalizer, joiner and rank transformations in Power Center Designer.
- Worked on Oozie workflow engine for job scheduling and created and maintained Technical documentation for launching HADOOP Clusters and for executing PigScripts.
- Involved in Design, Development and Support phases of Software Development Life Cycle (SDLC)
- Primarily responsible for design and development using Java, J2EE, XML, Oracle SQL, PLSQL and XSLT.
- Involved in gathering data for requirements and use case development.
- Implemented DAO's using Spring Jdbc support to interact with the RMA database. Spring framework was used for transaction handling.
- Reviewed the functional, design, source code and test specifications
- Worked with Spring Core, Spring AOP, Spring Integration Framework with JDBC.
- Implemented backend configuration on DAO, and XML generation modules of DIS
- Developed the persistence layer using Hibernate ORM to transparently store objects in the database.
- Implemented RESTful web services using Spring that support JSON data formats.
- Used JDBC for database access, and also used Data Transfer Object (DTO) design patterns
- Unit testing and rigorous integration testing of the whole application
- Implemented the user interface using the Struts2 MVC framework, Struts Tag Library, HTML, CSS, and JSP.
- Wrote and executed test scripts using JUnit and was actively involved in system testing.
- Developed an XML parsing tool for regression testing, worked on documentation that meets required compliance standards, and monitored end-to-end testing activities.