- 8+ years of IT experience in various domains with Hadoop Ecosystems and Java J2EE technologies.
- Strong hands-on experience in Spark Core, Spark SQL, Spark Streaming, and Spark machine learning using Scala and Python.
- Solid understanding of RDD operations in Apache Spark, i.e., transformations and actions, persistence (caching), accumulators, broadcast variables, and optimizing broadcasts.
- In-depth understanding of Apache Spark job execution components such as the DAG, lineage graph, DAG scheduler, task scheduler, stages, and tasks.
- Experience in exposing Apache Spark as web services.
- Good understanding of the driver, executors, and the Spark web UI.
- Experience in submitting Apache Spark and MapReduce jobs to YARN.
- Experience in real time processing using Apache Spark and Kafka.
- Migrated Python machine learning modules to scalable, high-performance, fault-tolerant distributed systems such as Apache Spark.
- Strong experience in Spark SQL UDFs, Hive UDFs, and Spark SQL performance tuning. Hands-on experience working with input file formats such as ORC, Parquet, JSON, and Avro.
- Good expertise in coding in Python, Scala and Java.
- Good understanding of the MapReduce framework architectures (MRv1 and YARN).
- Good knowledge and understanding of Hadoop architecture and various components of the Hadoop ecosystem: HDFS, MapReduce, Pig, Sqoop, and Hive.
- Developed various MapReduce applications to perform ETL workloads on terabytes of data.
- Hands-on experience in cleansing semi-structured and unstructured data using Pig Latin scripts.
- Good working knowledge of creating Hive tables and using HQL for data analysis to meet business requirements.
- Experience in managing and reviewing Hadoop log files.
- Good working experience with NoSQL databases such as Cassandra and MongoDB.
- Responsible for managing data coming from different sources; involved in HDFS maintenance and loading of structured and unstructured data.
- Experience in importing and exporting the data using Sqoop from HDFS to Relational Database systems/mainframe and vice-versa.
- Experience in working with Flume to load log data from multiple sources directly into HDFS.
- Experience in scheduling time-driven and data-driven Oozie workflows.
- Used ZooKeeper with distributed HBase for cluster configuration and management.
- Worked with the Avro data serialization system.
- Experience in fine-tuning MapReduce jobs for better scalability and performance.
- Experience in writing shell scripts to dump the shared data from landing zones to HDFS.
- Experience in performance tuning the Hadoop cluster by gathering and analyzing the existing infrastructure.
- Expertise in client-side design and validation using HTML and JavaScript.
- Excellent communication and interpersonal skills; detail-oriented, analytical, time-bound, and a responsible team player able to coordinate in a team environment, with a high degree of self-motivation and a quick learner.
Big Data Frameworks: Hadoop, Spark, Scala, Hive, Kafka, AWS, Cassandra, HBase, Flume, Pig, Sqoop, MapReduce, Cloudera, MongoDB.
Big Data Distributions: Cloudera, Hortonworks, Amazon EMR
Programming languages: Core Java, Scala, Python, SQL, Shell Scripting
Operating Systems: Windows, Linux (Ubuntu)
Databases: Oracle, SQL Server
Designing Tools: Eclipse
Java Technologies: JSP, Servlets, JUnit, Spring, Hibernate
Linux Experience: System Administration Tools, Puppet, Apache
Web Services: RESTful and SOAP
Development methodologies: Agile, Waterfall
Logging Tools: Log4j
Application / Web Servers: CherryPy, Apache Tomcat, WebSphere
Messaging Services: ActiveMQ, Kafka, JMS
Version Control Tools: Git, SVN, and CVS
Hadoop and Data Science Platform Engineer
- Performed benchmarking of federated queries in Spark and compared their performance by running the same queries on Presto.
- Defined Spark configurations to optimize federated queries by tuning the number of executors, executor memory, and executor cores.
- Created partitioned and bucketed Hive tables for data analysis.
- Successfully migrated data from Hive to MemSQL via the Spark engine, with the largest table being 1.2T.
- Successfully ran benchmarking queries on the MemSQL database and measured the performance of each query.
- Compared the performance of each benchmark query across different solutions such as Spark, Teradata, MemSQL, Presto, and Hive (using the Tez engine) by creating a bar graph in Numbers.
- Successfully migrated data from Teradata to MemSQL using Spark by persisting DataFrames to MemSQL.
- Provided a solution using Hive and Sqoop (to export/import data) for faster data loads, replacing the traditional ETL process with HDFS for loading data into target tables.
- Developed Spark scripts in Scala as per requirements.
- Developed Java code using both RDDs and DataFrames/SQL/Datasets in Spark 1.6 and Spark 2.1 for data aggregation, queries, and writing data.
- Used Grafana to analyze the usage of Spark executors for different queues on different clusters.
- Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, efficient joins, and transformations during the ingestion process itself.
- Developed Hive queries to process the data and generate data cubes for visualization.
- Converted SQL code to Spark code using Java and Spark SQL/Streaming for faster testing and processing of data; imported and indexed data from HDFS for secure searching, reporting, analysis, and visualization in Splunk.
- Worked extensively with Hive, SQL, Scala, Spark, and shell scripting.
- Completed assigned radars on time and stored the code in a Git repository.
- Tested the Python, R, Livy, and Teradata JDBC interpreters by executing sample paragraphs.
- Performed CI/CD builds of Zeppelin, Azkaban and Notebook using Ansible.
- Built a new version of Zeppelin by applying Git patches and changing the artifacts using Maven.
- Worked on shell scripting to determine the status of various components in the data science platform.
- Performed data copying activities in a distributed environment using Ansible.
- Built an Apache NiFi flow to migrate data from MSSQL and MySQL databases to the staging table.
- Set up the control table used to generate the package ID, batch ID, and status for each batch.
- Performed batch processing on large sets of data.
- Performed transformations on large data sets using the Apache NiFi Expression Language.
- Unit tested the migration of MySQL and MSSQL tables using the NiFi flow.
- Used DBeaver to connect to databases on different sources.
- Ran queries to verify the data types of the columns being migrated to the staging table.
- Responsible for monitoring data from source to target.
- Successfully populated the staging tables in the MySQL database without any data mismatch errors.
- Followed Agile methodology with VersionOne, attending scrums and scrum planning sessions.
Environment: Spark Core, Spark SQL, MemSQL, Presto, Teradata, Hive, Apache Zeppelin, Maven, GitHub, IntelliJ, Nginx, Redis, Monit, Linux, Shell Scripting, Ansible, Apache NiFi.
Confidential, Peoria, Illinois
- Used PySpark DataFrames to read text, CSV, and image data from HDFS, S3, and Hive.
- Worked closely with data scientists to build predictive models using PySpark.
- Developed Spark scripts using Scala shell commands as per requirements.
- Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark 1.6 for data aggregation, queries, and writing data back into the OLTP system through Sqoop.
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Used Spark Streaming APIs to perform the necessary transformations and actions on the fly to build the common learner data model, which gets data from Kafka in near real time and persists it into Cassandra.
- Cleaned input text data using the PySpark machine learning feature extraction API.
- Created features to train algorithms.
- Used various algorithms from the PySpark ML API.
- Trained the model using historical data stored in HDFS and Amazon S3.
- Used Spark Streaming to load the trained model and predict on real-time data from Kafka.
- Stored the results in MongoDB.
- The web application picks up the data stored in MongoDB.
- Used Apache Zeppelin for visualization of big data.
- Fully automated job scheduling, monitoring, and cluster management without human intervention using Airflow.
- Built Apache Spark as a web service using Flask.
- Migrated Python scikit-learn machine learning code to DataFrame-based Spark machine learning algorithms.
- Responsible for developing a data pipeline on AWS to extract data from weblogs and store it in HDFS.
Environment: Spark Core, Spark SQL, Spark Streaming, Spark machine learning, Python, scikit-learn, pandas DataFrame, AWS, Kafka, Hive, MongoDB, GitHub, Airflow, Amazon S3, Amazon EMR.
Confidential, Madison, WI
- Extracted the data from Teradata into HDFS using Sqoop.
- Used Flume to collect, aggregate, and store web log data from different sources such as web servers, mobile devices, and network devices, and pushed it into HDFS.
- Implemented MapReduce programs on log data to transform it into a structured format to find user information.
- Extensive experience in writing Pig scripts to transform raw data from several data sources into baseline data.
- Analyzed the web log data using HiveQL to extract the number of unique visitors per day, page views, and visit duration.
- Utilized Flume to filter the JSON input data read from the web servers and retrieve only the data needed for analytics.
- Developed UDFs for Hive and wrote complex Hive queries for data analysis.
- Developed a well-structured and efficient ad-hoc environment for functional users.
- Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team.
- Loaded cache data into HBase using Sqoop.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.
- Wrote ETLs using Hive and processed the data as per business logic.
- Hands-on experience with Amazon EC2 Spot integration and Amazon S3 integration.
- Optimized EMRFS for Hadoop to read from and write to AWS S3 directly, in parallel, and efficiently.
- Extensive work in ETL process consisting of data transformation, data sourcing, mapping, conversion and loading using Informatica.
- Extensively used ETL processes to load data from flat files into the target database by applying business logic on transformation mapping for inserting and updating records when loaded.
- Created Talend ETL jobs to read data from an Oracle database and import it into HDFS.
- Worked on data serialization formats for converting complex objects into sequences of bits using the Avro, RC, and ORC file formats.
Environment: Apache Hadoop, Hortonworks HDP 2.0, HDFS, MapReduce, Sqoop, Flume, Pig, Hive, HBase, Oozie, Teradata, Talend, Avro, Java, Linux
Confidential - Omaha, NE
- Imported data from our relational data stores to Hadoop using Sqoop.
- Created various MapReduce jobs for performing ETL transformations on the transactional and application-specific data sources.
- Wrote Pig scripts and executed them using the Grunt shell.
- Worked on the conversion of existing MapReduce batch applications for better performance.
- Performed big data analysis using Pig and user-defined functions (UDFs).
- Worked on loading tables to Impala for faster retrieval using different file formats.
- The system was initially developed using Java. The Java filtering program was restructured to place the business rule engine in a JAR that can be called from both Java and Hadoop.
- Created Reports and Dashboards using structured and unstructured data.
- Upgraded the operating system and/or Hadoop distribution using Puppet as new versions were released.
- Performed joins, group by, and other operations in MapReduce using Java and Pig.
- Processed the output from Pig and Hive and formatted it before sending it to the Hadoop output file.
- Used Hive definitions to map the output file to tables.
- Set up and benchmarked Hadoop/HBase clusters for internal use.
- Wrote data ingesters and MapReduce programs.
- Reviewed the HDFS usage and system design for future scalability and fault tolerance.
- Wrote MapReduce/HBase jobs.
- Worked with HBase, a NoSQL database.
Environment: Apache Hadoop 2.x, MapReduce, HDFS, Hive, Pig, HBase, Sqoop, Flume, Linux, Java 7, Eclipse, NoSQL.
- Full life cycle experience including requirements analysis, high level design, detailed design, UMLs, data model design, coding, testing and creation of functional and technical design documentation.
- Used the Spring Framework for MVC architecture with Hibernate to implement DAO code, and used web services to interact with other modules and for integration testing.
- Developed and implemented GUI functionality using JSP, JSTL, Tiles and AJAX.
- Designed the database and was involved in developing SQL scripts.
- Used SQL Navigator and was involved in testing the application.
- Implemented design patterns such as MVC-2, Front Controller, Composite View, and the Struts framework design patterns to improve performance.
- Used ClearCase and Subversion for source version control.
- Wrote Ant scripts to automate the builds and installation of modules.
- Involved in writing test plans and conducted unit tests using JUnit.
- Used Log4j for logging statements during development.
- Designed and implemented a log data indexing and search module, optimized for performance and accuracy, to provide full-text search capability for archived log data using the Apache Lucene library.
- Involved in the testing and integrating of the program at the module level.
- Worked with production support team in debugging and fixing various production issues.
Environment: Java 1.5, AJAX, XML, Spring 3.0, Hibernate 2.0, Struts 1.2, Web Services, WebSphere 7.0, JUnit, Oracle 10g, SQL, PL/SQL, Log4j, RAD 7.0/7.5, ClearCase, Unix, HTML, CSS, JavaScript.
- Worked with the business community to define business requirements and analyze the possible technical solutions.
- Performed requirements gathering, business process flow analysis, business process modeling, and business analysis.
- Extensively used UML and Rational Rose to design various use cases, class diagrams, and sequence diagrams.
- Developed the application using Spring MVC architecture.
- Developed custom tags for the table utility component.
- Used various Java, J2EE APIs including JDBC, XML, Servlets, and JSP.
- Designed and developed web pages using Servlets and JSPs, and used XML/XSL/XSLT as a repository.
- Involved in Java application testing and maintenance in development and production.
- Involved in developing the customer form data tables and maintaining customer support and customer data in MySQL database tables.
- Involved in mentoring specific projects in application of the new SDLC based on the Agile Unified Process, especially from the project management, requirements and architecture perspectives.
- Designed and developed View, Model, and Controller components implementing the MVC framework.