- Around 8 years of professional IT experience in Big Data and Java across the Financial, Insurance and Digital Services industries.
- Worked with Big Data distributions like Cloudera (CDH 3 and 4) with Cloudera Manager.
- Hands-on experience with major components of the Hadoop ecosystem such as MapReduce, HDFS, YARN, Hive, Pig, HBase, Sqoop, Oozie, Cassandra, Impala and Flume.
- Experience with the Hadoop 2.0 YARN architecture and developing YARN applications on it.
- Experience with Apache Spark's Core, Spark SQL, Streaming and MLlib components.
- Experience with distributed systems, large-scale non-relational data stores and multi-terabyte data warehouses.
- Experienced in developing UDFs for Hive using Java.
- Firm grip on data modeling, database performance tuning and NoSQL map-reduce systems.
- Responsible for setting up processes for Hadoop based application design and implementation.
- Experience in managing HBase database and using it to update/modify the data.
- Experience in running MapReduce and Spark jobs over YARN.
- Developed and configured Kafka brokers to pipeline server log data into Spark Streaming.
- Handled data in various file formats such as Sequence, Avro, RC, Parquet and ORC.
- Strong knowledge of the scalability and applications of Spark and its components: Core, SQL and DataFrames.
- Good experience with general data analytics on distributed computing clusters such as Hadoop, using Apache Spark, Impala and Scala.
- Hands-on experience in complete project life cycle (design, development, testing and implementation) of Client Server and Web applications.
- Collected logs from source systems into HDFS using Kafka and performed analytics on them.
- Involved in developing complex ETL transformations and performance tuning.
- Extensively worked with Teradata utilities such as BTEQ, FastExport, FastLoad and MultiLoad to export and load data to/from different source systems, including flat files.
- Strong experience in analyzing large data sets by writing PySpark scripts and Hive queries.
- Good understanding on Spark SQL and Spark Streaming.
- Worked on developing a NiFi flow prototype for data ingestion into HDFS.
- Hands-on experience with query tools such as Teradata SQL Assistant, TOAD, PL/SQL Developer and Queryman.
- Good understanding of NoSQL databases such as HBase, Cassandra and MongoDB.
- Experience in Object Oriented Analysis and Design (OOAD) and development of software using UML Methodology, good knowledge of J2EE design patterns and Core Java design patterns.
- Experience with middleware architectures using Sun Java technologies such as J2EE, JSP and Servlets, and application servers such as WebSphere and WebLogic.
- Good interpersonal and communication skills, strong problem-solving skills, the ability to pick up new technologies with ease, and a strong team player.
Big Data: HDFS, MapReduce, Hive, Pig, ZooKeeper, Apache Spark (Core, MLlib, Spark SQL and DataFrames), NiFi
Utilities: Sqoop, Flume, Kafka, Oozie and AutoSys
NoSQL Databases: HBase, Cassandra
Languages: C, C++, Java, Python, J2EE, PL/SQL, MR, Pig Latin, HiveQL, Unix shell scripting and Scala
Operating Systems: Sun Solaris, RedHat Linux, Ubuntu Linux and Windows XP/Vista/7/8
Web Technologies: HTML, DHTML, XML, AJAX, WSDL, SOAP
Databases and Datawarehousing: Teradata, DB2, Oracle 9i/10g/11g, SQL Server, MySQL
Tools and IDEs: Maven, Toad, Eclipse, NetBeans, ANT, Hudson, Sonar, JDeveloper, Assent PMD, DB Visualizer
Confidential, Harvey, IL
- Extracted and updated the data into HDFS using Sqoop import and export command line utility interface.
- Responsible for developing data pipeline using Flume, Sqoop, and Pig to extract the data from weblogs and store in HDFS.
- Developed transformations using custom MapReduce, Pig and Hive
- Performed map-side joins in both Pig and Hive
- Optimized joins in Hive using techniques such as sort-merge join and map-side join
- Controlled parallelism at the relation and script level in Pig
- Implemented partitioning and bucketing techniques in Hive
- Developed Spark programs using Scala APIs to compare the performance of Spark with Hive and SQL.
- Built an ingestion framework using Apache NiFi to ingest files from SFTP and financial data into HDFS
- Worked with Senior Engineer on configuring Kafka for streaming data.
- Worked on Spark Streaming to consume ongoing data from Kafka and store the stream to HDFS.
- Developed and configured Kafka brokers to pipeline server log data into Spark Streaming.
- Developed scripts to create external tables and updated partition information on a daily basis
- Converted MR algorithms into Spark transformations and actions by creating RDDs and pair RDDs
- Built reusable Hive UDF libraries for business requirements, enabling users to use these UDFs in Hive queries
- Involved in converting Hive/SQL queries into Spark functionality and analyzing them using the Scala API
- Built Spark Scripts by utilizing Scala shell commands depending on the requirement.
- Responsible for developing scalable distributed data solutions using Hadoop.
- Loaded cache data into HBase using Sqoop.
- Built Spark DataFrames to process huge amounts of structured data
- Used JSON to represent complex data structures within a MapReduce job
- Stored and preprocessed logs and semi-structured content on HDFS using MapReduce and imported them into the Hive warehouse
- Loaded all data sets into Hive and Cassandra from source CSV files using Spark/PySpark.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Migrated the computational code in HQL to PySpark.
- Completed data extraction, aggregation and analysis in HDFS using PySpark and stored the needed data in Hive.
- Developed Python code to gather data from HBase (Cornerstone) and designed the solution to be implemented using PySpark.
- Developed Pig Latin scripts to extract data from web server output files and load it into HDFS
- Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
- Leveraged cloud and GPU computing technologies such as AWS and GCP for automated machine learning and analytics pipelines
- Streamlined Hadoop jobs and workflow operations using Oozie workflows, scheduled through AutoSys on a monthly basis
- Worked extensively on AWS components like Elastic MapReduce (EMR), Elastic Compute Cloud (EC2), Simple Storage Service (S3)
- Used Amazon Cloud Watch to monitor and track resources on AWS.
- Performed data analysis on NoSQL databases such as HBase and Cassandra
- Analyzed HBase data in Hive by creating external partitioned and bucketed tables
- Performed a POC on single-member debugging on Spark and Hive
- Implemented a Continuous Delivery pipeline with Docker, GitHub and AWS
Environment: Hadoop 2.x, Apache Spark, Spark SQL, DataFrames, Scala, HDFS, Hive, Oozie, Kafka, AutoSys, Oracle, Teradata, Python/PySpark, MapReduce, Sqoop, HBase, Shell Scripting, Pig, Core Java, Cassandra, NiFi, Cloudera Hadoop Distribution, PL/SQL, Toad, Windows NT, Linux
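The map-side join work above (Pig/Hive) follows a broadcast pattern: the small table is held in memory and the large table is streamed past it, avoiding a reduce-side shuffle. A minimal plain-Python sketch of that idea; the column names and rows are illustrative only, not from the actual pipelines:

```python
def map_side_join(fact_rows, dim_rows, key):
    # Build an in-memory lookup from the small table (the "broadcast" side),
    # then stream the large table row by row -- no shuffle needed.
    lookup = {row[key]: row for row in dim_rows}
    for row in fact_rows:
        match = lookup.get(row[key])
        if match is not None:          # inner-join semantics
            joined = dict(row)
            # Merge the non-key columns from the dimension row
            joined.update({k: v for k, v in match.items() if k != key})
            yield joined

# Hypothetical fact and dimension data
facts = [{"cust_id": 1, "amount": 100}, {"cust_id": 2, "amount": 50}]
dims = [{"cust_id": 1, "state": "IL"}]

result = list(map_side_join(facts, dims, "cust_id"))
```

Hive applies the same idea when a map-side join is enabled and one input fits in memory; unmatched fact rows (here `cust_id` 2) are simply dropped, as in any inner join.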
Confidential, River Woods, IL
- Responsible for installation and configuration of Hadoop ecosystem components using the CDH 5.2 distribution.
- Responsible for managing data coming from different sources; involved in HDFS maintenance and loading of structured and unstructured data.
- Worked on Big Data processing of clinical and non-clinical data using MapReduce.
- Developed ETL processes using Spark, Scala, Hive and HBase; closely collaborated with both the onsite and offshore teams
- Configured Kafka to read and write messages from external programs.
- Configured Kafka to handle real time data.
- Scheduled several time-based Oozie workflows by developing Python scripts
- Visualized HDFS data for customers using a BI tool with the help of the Hive ODBC driver.
- Customized a BI tool for the management team to perform query analytics using HiveQL.
- Imported data using Sqoop to load data from MySQL to HDFS on a regular basis.
- Created partitions and buckets based on state to further process data using bucket-based Hive joins.
- Created Hive generic UDFs to process business logic that varies based on policy.
- Moved relational database data into Hive dynamic-partition tables using Sqoop and staging tables.
- Monitored the cluster using Cloudera Manager.
- Capable of creating real time data streaming solutions and batch style large scale distributed computing applications using Apache Spark, Spark Streaming, Kafka and Flume.
- Analyzed the requirements to develop the framework.
- Designed and developed architecture for data services ecosystem spanning Relational, NoSQL and Big Data technologies.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Loaded and transformed large sets of structured, semi-structured and unstructured data using Hadoop/Big Data concepts.
- Developed Java Spark Streaming scripts to load raw files and the corresponding metadata files into AWS S3 and an Elasticsearch cluster.
- Developed Python scripts to get the most recent S3 keys from Elasticsearch.
- Wrote Python scripts to fetch S3 files using the Boto3 module.
- Implemented PySpark logic to transform and process various formats of data like XLSX, XLS, JSON and TXT.
- Built scripts to load PySpark-processed files into a Redshift database using various PySpark logic.
- Developed MapReduce programs to cleanse the data in HDFS obtained from heterogeneous data sources.
- Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs, and used Oozie Operational Services for batch processing and scheduling workflows dynamically.
- Worked on migration of existing applications and development of new applications using AWS cloud services.
- Involved in a POC to migrate MapReduce jobs into Spark RDD transformations using Scala.
- Worked on Spark using Python and Spark SQL for faster testing and processing of data.
- Configured Spark Streaming to receive real-time data from Apache Kafka and store the stream data to HDFS using Scala.
- Worked with data investigation, discovery and mapping tools to scan every single data record from many sources.
- Implemented Shell script to automate the whole process.
- Fine-tuning PySpark applications/jobs to improve the efficiency and overall processing time for the pipelines.
- Knowledge of writing Hive queries and running scripts in Tez mode to improve performance on Hortonworks Data Platform.
- Wrote PySpark jobs in AWS Glue to merge data from multiple tables
- Utilized crawlers to populate the AWS Glue Data Catalog with metadata table definitions
- Generated a script in AWS Glue to transfer the data
- Utilized AWS Glue to run ETL jobs and run aggregations in PySpark code.
- Integrated Apache Storm with Kafka to perform web analytics.
- Uploaded clickstream data from Kafka to HDFS, HBase and Hive by integrating with Storm.
- Extracted data from SQL Server to create automated visualization reports and dashboards on Tableau.
- Responsible for Cluster maintenance, adding and removing cluster nodes, Cluster Monitoring and Troubleshooting, Managing and reviewing data backups & log files.
Environment: AWS S3, Java, Maven, Python, Spark, Scala, Kafka, Elasticsearch, MapR cluster, NiFi, Sqoop, Oozie, Flume, Hortonworks, Cloudera, Cassandra, MapReduce, HDFS, Azure, MongoDB, Amazon Redshift, Shell script, pandas, PySpark, Pig, Hive, JSON, AWS Glue.
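The partition-by-state Hive work in this role relies on Hive physically grouping rows into per-value directories (e.g. `state=IL/`) on HDFS. A minimal plain-Python sketch of that layout logic, with hypothetical columns and values:

```python
from collections import defaultdict

def partition_by(rows, col):
    # Mimic Hive dynamic partitioning: group rows under per-value
    # "directory" keys like state=IL, the way Hive lays partitions
    # out on HDFS; the partition column itself is not stored in the data.
    parts = defaultdict(list)
    for row in rows:
        part_key = f"{col}={row[col]}"
        parts[part_key].append({k: v for k, v in row.items() if k != col})
    return dict(parts)

# Hypothetical policy rows
rows = [
    {"policy": "A1", "state": "IL"},
    {"policy": "A2", "state": "WI"},
    {"policy": "A3", "state": "IL"},
]
layout = partition_by(rows, "state")
```

Because a query filtered on `state` only has to read the matching directory, partition pruning turns a full scan into a scan of one partition; bucketing subdivides each partition further by a hash of the bucket column.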
Big Data Engineer
- Extracted data from relational databases such as SQL Server and MySQL by developing Scala and SQL code
- Uploaded the data to Hive and combined new tables with existing databases
- Developed code to pre-process large sets of various types of file formats such as Text, Avro, Sequence files, XML, JSON and Parquet
- Configured big data workflows to run on top of Hadoop, comprising heterogeneous jobs such as Pig, Hive, Sqoop and MapReduce
- Loaded various formats of structured and unstructured data from Linux file system to HDFS
- Used Combiners and Partitioners in MapReduce programming
- Wrote Pig scripts to ETL the data into a NoSQL database for faster analysis
- Read from Flume and involved in pushing batches of data to HDFS and HBase for real-time processing of the files
- Parsed XML data into a structured format and loaded it into HDFS
- Scheduled various ETL processes and Hive scripts by developing Oozie workflows
- Utilized Tableau to visualize the analyzed data and performed report design and delivery
- Created POC for Flume implementation
- Involved in reviewing both functional and non-functional aspects of the business model
- Communicated and presented the models to business customers and executives.
Environment: Hadoop, HDFS, Map Reduce, Sqoop, HBase, Shell Scripting, PIG, HIVE, Scala, Oozie, Core Java, Hortonworks Distribution, LINUX
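Parsing XML into a structured format before loading to HDFS, as in this role, can be sketched with Python's standard library. The `<record>` element and field names here are hypothetical placeholders for the actual feed schema:

```python
import xml.etree.ElementTree as ET

def parse_records(xml_text):
    # Flatten each <record> element into a plain dict so the rows can be
    # emitted as delimited text suitable for loading into HDFS/Hive.
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in rec} for rec in root.iter("record")]

# Hypothetical sample feed
sample = """
<records>
  <record><id>1</id><name>alpha</name></record>
  <record><id>2</id><name>beta</name></record>
</records>
"""
rows = parse_records(sample)
```

For multi-gigabyte feeds a streaming parser (`ET.iterparse`) would replace `fromstring`, but the flattening step is the same.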
- Involved in the core product development using J2EE, JSF and Hibernate
- Actively involved in the full life cycle Object Oriented application development - Object Modeling, Database Mapping, GUI Design.
- Worked on requirement gathering, high level design and Waterfall model to get best result
- Created data access using SQL and PL/SQL stored procedures
- Used Hibernate annotations with Java for various stages in the application
- Built SOAP web services to export and import attachments from files to associated applications
- Developed DAO (data access objects) using Spring Framework
- Deployed the components into WebSphere Application Server
- Wrote SQL queries, including joins, triggers, stored procedures and views, using MySQL
- Implemented JSPs and EJBs in the JSF framework to handle the workflow of the application
- Developed Unit Test Cases, used JUnit for unit testing of the application
Environment: Java, J2EE, Struts, SQL, JAX-RPC, XML, RAD, WebSphere, MQ, Agile, JSPs, SOAP
- Understanding and updating Functional Specification.
- Developing JSP Web Pages.
- Developing JAX-WS Services.
- Developed Web Services using WSDL, SOAP, XML to provide facility to obtain quote, receive updates to the quote, customer information, status updates and confirmations.
- Used SAX/DOM XML Parser for parsing the XML file
- Developed build-automation and install-automation tooling for nightly build and deployment of the Total Defense product across all available virtual machines on an ESX server using Ant, Core Java, the VIX API and AutoIt.
- Wrote test cases using JUnit, following test-first development.
- Used Log4j to create log files to debug and trace the application.
- Trained freshers in Core Java.
Environment: Core Java, Servlets & JSPs, JDBC, JAX-WS, WSDL, SOAP, XML, Oracle, Ant, AutoIt.
JR Software Engineer
- Worked with a team of developers on Python applications for risk management
- Created SQL queries to pull data from relational databases
- Gathered business requirements and converted them into SQL stored procedures for database-specific projects
- Developed the DAO layer for the application using Spring Hibernate Template support
- Managed connectivity using JDBC for querying/inserting & data management including triggers and stored procedures.
- Developed various EJBs for handling business logic and data manipulations from database.
- Involved in design of JSP’s and Servlets for navigation among the modules.
- Designed cascading style sheets and the XML part of the Order Entry and Product Search modules, and did client-side validations with JavaScript.
- Developed Tableau visualizations and dashboards using Tableau Desktop
- Designed and developed data management system using MySQL
- Wrote python scripts to parse XML documents and load the data in database
- Expertise in writing constraints, indexes, views, stored procedures, cursors, triggers and user-defined functions
- Created unit test/regression test framework for working/new code
- Interfaced with third-party vendors to customize UI/UX solutions
- Elegantly implemented page designs in standards-compliant dynamic XHTML and CSS
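The parse-XML-and-load-into-MySQL scripts from this role might look like the sketch below; `sqlite3` stands in for the actual MySQL database, and the `<order>` schema and table name are hypothetical:

```python
import sqlite3
import xml.etree.ElementTree as ET

def load_orders(xml_text, conn):
    # Parse <order> elements out of the XML document and bulk-insert
    # them into an orders table via parameterized statements.
    conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, product TEXT)")
    root = ET.fromstring(xml_text)
    rows = [(int(o.findtext("id")), o.findtext("product")) for o in root.iter("order")]
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)

# In-memory database standing in for MySQL
conn = sqlite3.connect(":memory:")
n = load_orders(
    "<orders><order><id>1</id><product>widget</product></order></orders>", conn
)
```

With MySQL the only substantive changes would be the connector (e.g. a DB-API driver in place of `sqlite3`) and the `%s` placeholder style.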