Sr. Hadoop & Data Science Platform Engineer Resume
Sunnyvale, CA
SUMMARY:
- Results-driven IT professional with over 10 years of experience in the Hadoop ecosystem and Java/J2EE technologies.
- Hands-on experience in Spark Core, Spark SQL, Spark Streaming and Spark machine learning using Scala and Python programming languages.
- Solid understanding of RDD operations in Apache Spark, i.e., transformations and actions, persistence (caching), accumulators, broadcast variables, and broadcast optimization (see the sketch at the end of this summary).
- In-depth understanding of Apache Spark job execution components such as the DAG, lineage graph, DAG scheduler, task scheduler, stages, and tasks.
- Strong experience in real-time processing using Apache Spark and Kafka.
- Strong experience in Spark SQL UDFs, Hive UDFs, and Spark SQL performance tuning. Hands-on experience working with input file formats such as ORC, Parquet, JSON, and Avro.
- Good expertise in coding in Python, Scala and Java.
- Good understanding of the MapReduce framework architectures (MRv1 and YARN).
- Good understanding of Hadoop architecture and the various components of the Hadoop ecosystem - HDFS, MapReduce, Pig, Sqoop, and Hive.
- Hands-on experience in cleansing semi-structured and unstructured data using Pig Latin scripts.
- Strong knowledge of creating Hive tables and using HiveQL for data analysis to meet business requirements.
- Experience in managing and reviewing Hadoop log files.
- Good working experience with NoSQL databases such as Cassandra and MongoDB.
- Experience in importing and exporting data using Sqoop from HDFS to relational database systems/mainframes and vice versa.
- Experience in working with Flume to load log data from multiple sources directly into HDFS.
- Experience in scheduling time driven and data driven Oozie workflows.
- Experience in fine-tuning MapReduce jobs for better scalability and performance.
- Experience in writing shell scripts to dump shared data from landing zones to HDFS.
- Experience in performance tuning the Hadoop cluster by gathering and analyzing the existing infrastructure.
- Expertise in client-side design and validation using HTML and JavaScript.
- Excellent communication and interpersonal skills; a detail-oriented, analytical, and responsible team player who coordinates well in a team environment, with a high degree of self-motivation and the ability to learn quickly.
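The following is a minimal, hedged Scala sketch of the Spark RDD concepts listed above (lazy transformations, an action, caching, and a broadcast variable); the sample data and lookup values are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object RddBasicsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-basics").getOrCreate()
    val sc = spark.sparkContext

    // Broadcast a small lookup table once to every executor instead of shipping it with each task.
    val countryNames = sc.broadcast(Map("US" -> "United States", "IN" -> "India"))

    val events = sc.parallelize(Seq(("US", 3), ("IN", 5), ("US", 2)))

    // Transformations are lazy: nothing executes until an action is called.
    val byCountry = events
      .reduceByKey(_ + _)
      .map { case (code, total) => (countryNames.value.getOrElse(code, code), total) }
      .cache()                       // persist the result for reuse across actions

    // Actions trigger execution of the lineage built above.
    byCountry.collect().foreach(println)
    println(s"distinct countries: ${byCountry.count()}")
  }
}
```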
TECHNICAL SKILLS:
Big Data Frameworks: Hadoop, Spark, Hive, Kafka, Cassandra, HBase, Flume, Pig, Sqoop, MapReduce, MongoDB, AWS
Big Data Distributions: Cloudera, Hortonworks, Amazon EMR
Languages: Core Java, Scala, Python, SQL, Shell Scripting
Operating Systems: Windows, Linux (Ubuntu)
Databases: Oracle, SQL Server
IDEs: Eclipse
Java Technologies: JSP, Servlets, JUnit, Spring, Hibernate
Web Technologies: XML, HTML, JavaScript, jQuery, JSON
Linux Experience: System Administration Tools, Puppet, Apache
Web Services: RESTful, SOAP
Development methodologies: Agile, Waterfall
Logging Tools: Log4j
Application / Web Servers: CherryPy, Apache Tomcat, WebSphere
Messaging Services: ActiveMQ, Kafka, JMS
Version Tools: Git, SVN and CVS
PROFESSIONAL EXPERIENCE:
Confidential- Sunnyvale, CA
Sr. Hadoop & Data Science Platform Engineer
Responsibilities:
- Performed benchmarking of federated queries in Spark and compared their performance by running the same queries on Presto.
- Defined Spark configurations to optimize federated queries by tuning the number of executors, executor memory, and executor cores (see the sketch after this list).
- Created partitioned and bucketed Hive tables for data analysis.
- Successfully migrated data from Hive to MemSQL via the Spark engine, with the largest table being 1.2 TB.
- Successfully ran benchmark queries against the MemSQL database and measured the performance of each query.
- Compared the performance of each benchmark query across solutions such as Spark, Teradata, MemSQL, Presto, and Hive (on the Tez engine) by charting the results in Numbers.
- Successfully migrated data from Teradata to MemSQL using Spark by persisting DataFrames to MemSQL.
- Provided a solution using Hive and Sqoop (to export/import data) for faster data loads by replacing the traditional ETL process with HDFS-based loading of the target tables.
- Developed Spark scripts by using Scala as per the requirement.
- Developed Spark jobs in Java using both the RDD and DataFrame/Dataset/SQL APIs in Spark 1.6 and Spark 2.1 for data aggregation, queries, and writing data.
- Used Grafana to analyze the usage of Spark executors across different queues on different clusters.
- Handled large datasets during the ingestion process itself using partitioning, Spark in-memory capabilities, broadcasts, efficient joins, and other transformations.
- Developed Hive queries to process the data and generate data cubes for visualization.
- Converted SQL code to Spark code using Java and Spark SQL/Streaming for faster testing and processing of data; imported and indexed data from HDFS for secure searching, reporting, analysis, and visualization in Splunk.
- Worked extensively on Hive, SQL, Scala, Spark, and Shell.
- Completed assigned Radars on time and stored the code in the Git repository.
- Tested the Python, R, Livy, Teradata, and JDBC interpreters by executing sample paragraphs.
- Performed CI/CD builds of Zeppelin, Azkaban and Notebook using Ansible.
- Built a new version of Zeppelin by applying Git patches, changing the artifacts using Maven.
- Worked on shell scripting to determine the status of various components in the data science platform.
- Performed data copying activities in a distributed environment using Ansible.
- Built an Apache NiFi flow to migrate data from MS SQL and MySQL databases to the staging tables.
- Set up the control table used to generate the package ID, batch ID, and status for each batch.
- Performed batch processing on large sets of data.
- Performed transformations on large data sets using the Apache NiFi Expression Language.
- Unit-tested the migration of MySQL and MS SQL tables using the NiFi flow.
- Used DBeaver for connecting to the different databases that are on different sources.
- Ran queries to verify the data types of the columns being migrated to the staging tables.
- Responsible for monitoring data from source to target.
- Successfully populated the staging tables in MySQL database without any data mismatch errors.
- Worked in an Agile methodology using VersionOne, attending daily scrums and sprint planning.
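A hedged Scala sketch of the Hive-to-MemSQL load and executor tuning described above; the executor sizing values, host, database, table names, and credentials are placeholders, and a plain JDBC writer (MemSQL speaks the MySQL wire protocol) is shown instead of the MemSQL Spark connector.

```scala
import org.apache.spark.sql.SparkSession

object HiveToMemSqlSketch {
  def main(args: Array[String]): Unit = {
    // Executor sizing of the kind tuned for the federated-query benchmarks (illustrative values).
    val spark = SparkSession.builder()
      .appName("hive-to-memsql")
      .config("spark.executor.instances", "20")
      .config("spark.executor.memory", "8g")
      .config("spark.executor.cores", "4")
      .enableHiveSupport()
      .getOrCreate()

    // Read a Hive table and persist it to MemSQL over JDBC.
    val df = spark.table("analytics.orders")
    df.write
      .format("jdbc")
      .option("url", "jdbc:mysql://memsql-host:3306/benchmark")
      .option("dbtable", "orders_staging")
      .option("user", "loader")
      .option("password", sys.env.getOrElse("MEMSQL_PASSWORD", ""))
      .option("driver", "com.mysql.jdbc.Driver")
      .mode("overwrite")
      .save()
  }
}
```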
Environment: Spark Core, Spark SQL, MemSQL, Presto, Teradata, Hive, Apache Zeppelin, Maven, GitHub, IntelliJ, Nginx, Redis, Monit, Linux, Shell Scripting, Ansible, Apache NiFi
Confidential - Peoria, IL
Hadoop Scala/Spark Developer
Responsibilities:
- Used PySpark DataFrames to read text data, CSV data, and image data from HDFS, S3, and Hive.
- Worked closely with data scientists to build predictive models using PySpark.
- Developed Spark scripts by using Scala shell commands as per the requirement.
- Developed Scala scripts and UDFs using both the DataFrame/Dataset/SQL and RDD APIs in Spark 1.6 for data aggregation, queries, and writing data back into the OLTP system through Sqoop.
- Implemented partitioning, dynamic partitions, and bucketing in Hive.
- Used the Spark Streaming APIs to perform transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time and persists it into Cassandra (see the sketch after this list).
- Cleaned input text data using the PySpark ML feature-extraction API.
- Created features to train algorithms.
- Used various algorithms of PySpark ML API.
- Trained model using historical data stored in HDFS and Amazon S3.
- Used Spark Streaming to load the trained model to predict on real time data from Kafka.
- Stored the result in MongoDB.
- The web application picks up the data stored in MongoDB.
- Used Apache Zeppelin for visualization of big data.
- Fully automated job scheduling, monitoring, and cluster management, without human intervention, using Airflow.
- Built Apache Spark functionality into a web service using Flask.
- Migrated Python scikit-learn machine learning code to DataFrame-based Spark ML algorithms.
- Responsible for developing a data pipeline on Amazon AWS to extract data from web logs and store it in HDFS.
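A hedged Scala sketch of the Kafka-to-Cassandra streaming path described above, assuming the spark-streaming-kafka-0-10 integration and the DataStax Spark Cassandra Connector; the topic, keyspace, table, and record layout are hypothetical.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import com.datastax.spark.connector._

object LearnerEventStream {
  case class LearnerEvent(learnerId: String, course: String, score: Double)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("learner-event-stream")
      .set("spark.cassandra.connection.host", "cassandra-host")   // placeholder host
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka-host:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "learner-model")

    // Subscribe to the (hypothetical) learner-events topic.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("learner-events"), kafkaParams))

    // Parse CSV-style messages and persist each micro-batch to Cassandra.
    stream.map(_.value.split(","))
      .filter(_.length == 3)
      .map(f => LearnerEvent(f(0), f(1), f(2).toDouble))
      .foreachRDD(_.saveToCassandra("learning", "learner_events"))

    ssc.start()
    ssc.awaitTermination()
  }
}
```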
Environment: Spark Core, Spark SQL, Spark Streaming, Spark ML, Python, scikit-learn, pandas DataFrames, AWS, Kafka, Hive, MongoDB, GitHub, Airflow, Amazon S3, Amazon EMR
Confidential - Charlotte, NC
Hadoop Spark Developer
Responsibilities:
- Imported data from our relational data stores to Hadoop using Sqoop.
- Created various MapReduce jobs for performing ETL transformations on the transactional and application specific data sources.
- Wrote Pig scripts and executed them using the Grunt shell.
- Worked on the conversion of existing MapReduce batch applications for better performance.
- Performed big data analysis using Pig and user-defined functions (UDFs).
- Worked on loading tables to Impala for faster retrieval using different file formats.
- The system was initially developed in Java; the Java filtering program was restructured to place the business rule engine in a JAR that can be called from both Java and Hadoop.
- Created Reports and Dashboards using structured and unstructured data.
- Upgraded the operating system and/or Hadoop distribution as new versions were released, using Puppet.
- Performed joins, group-bys, and other operations in MapReduce using Java and Pig (see the MapReduce sketch after this list).
- Processed the output from Pig and Hive and formatted it before writing it to the Hadoop output file.
- Used Hive table definitions to map the output files to tables.
- Set up and benchmarked Hadoop/HBase clusters for internal use.
- Wrote data ingesters and MapReduce programs.
- Reviewed the HDFS usage and system design for future scalability and fault tolerance.
- Wrote MapReduce/HBase jobs.
- Worked with HBase, a NoSQL database.
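To make the MapReduce group-by work above concrete, here is a hedged sketch written in Scala against the standard Hadoop MapReduce API; the input layout, field positions, and job name are assumptions.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.collection.JavaConverters._

// Mapper: emits (accountId, 1) for every well-formed CSV record.
class TxnMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  private val outKey = new Text()
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    val fields = value.toString.split(",")
    if (fields.length > 1) {     // skip malformed rows
      outKey.set(fields(0))      // assumed: first column is the group-by key
      context.write(outKey, one)
    }
  }
}

// Reducer: sums the counts per key (a simple group-by aggregation).
class TxnReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    val total = values.asScala.map(_.get).sum
    context.write(key, new IntWritable(total))
  }
}

object TxnCountJob {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "txn-count-by-account")
    job.setJarByClass(getClass)
    job.setMapperClass(classOf[TxnMapper])
    job.setReducerClass(classOf[TxnReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```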
Environment: Apache Hadoop 2.x, MapReduce, HDFS, Hive, Pig, HBase, Sqoop, Flume, Linux, Java 7, Eclipse, NoSQL
Confidential - Salt Lake City, UT
Hadoop Developer
Responsibilities:
- Developed Big Data Solutions that enabled the business and technology teams to make data-driven decisions on the best ways to acquire customers and provide them business solutions.
- Installed and configured Apache Hadoop, Hive, and HBase.
- Worked on Hortonworks cluster, which was used to process the big data.
- Developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
- Used Sqoop to pull data into the Hadoop Distributed File System from RDBMSs and vice versa.
- Defined workflows using Oozie.
- Created partitions on Hive tables and analyzed the data to compute various metrics for reporting.
- Created the data model for Hive tables.
- Involved in managing and reviewing Hadoop log files.
- Used Pig as ETL tool to do transformations, joins and pre-aggregations before loading data onto HDFS.
- Worked on large sets of structured, semi-structured, and unstructured data.
- Responsible for managing data coming from different sources.
- Installed and configured Hive and developed Hive UDFs to extend its core functionality (see the UDF sketch after this list).
- Responsible for loading data from UNIX file systems to HDFS.
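A minimal sketch of the kind of Hive UDF mentioned above, written in Scala against Hive's simple UDF class; the function name and normalization logic are illustrative.

```scala
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Illustrative Hive UDF: trims and upper-cases free-text codes before analysis.
// Packaged into a JAR, added with ADD JAR, and registered with
//   CREATE TEMPORARY FUNCTION normalize_code AS 'NormalizeCode';
class NormalizeCode extends UDF {
  def evaluate(input: Text): Text = {
    if (input == null) null
    else new Text(input.toString.trim.toUpperCase)
  }
}
```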
Environment: Apache Hadoop 2.x, MapReduce, HDFS, Hive, HBase, Pig, Oozie, Linux, Java 7, Eclipse
Confidential - Bridgewater, NJ
Java Developer
Responsibilities:
- Involved in requirements analysis, high level design, detailed design, UMLs, data model design, coding, testing and creation of functional and technical design documentation.
- Used the Spring Framework for the MVC architecture with Hibernate to implement the DAO code, and used web services to interact with other modules and for integration testing.
- Developed and implemented GUI functionality using JSP, JSTL, Tiles and AJAX.
- Designed database and involved in developing SQL Scripts.
- Used SQL Navigator and was involved in testing the application.
- Implemented design patterns such as MVC-2, Front Controller, and Composite View, along with Struts framework design patterns, to improve performance.
- Used ClearCase and Subversion for source version control.
- Wrote Ant scripts to automate the builds and installation of modules.
- Involved in writing Test plans and conducted Unit Tests using JUnit .
- Used Log4j for logging statements during development.
- Designed and implemented the log data indexing and search module, with optimization for performance and accuracy, to provide full-text search over archived log data using the Apache Lucene library (see the sketch after this list).
- Involved in the testing and integrating of the program at the module level.
- Worked with production support team in debugging and fixing various production issues.
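A hedged sketch of the Lucene-based log indexing and search module described above, written in Scala against a recent Lucene API (the original project used an earlier Lucene release); the directory choice, field names, and sample log lines are illustrative.

```scala
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document.{Document, Field, StringField, TextField}
import org.apache.lucene.index.{DirectoryReader, IndexWriter, IndexWriterConfig}
import org.apache.lucene.queryparser.classic.QueryParser
import org.apache.lucene.search.IndexSearcher
import org.apache.lucene.store.ByteBuffersDirectory

object LogSearchSketch {
  def main(args: Array[String]): Unit = {
    val analyzer = new StandardAnalyzer()
    val dir = new ByteBuffersDirectory()   // in-memory index, for the sketch only

    // Index a couple of archived log lines as Lucene documents.
    val writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))
    Seq(
      ("2010-03-01 10:15:02", "ERROR payment service timeout on order 4411"),
      ("2010-03-01 10:15:09", "INFO order 4411 retried successfully")
    ).foreach { case (ts, line) =>
      val doc = new Document()
      doc.add(new StringField("ts", ts, Field.Store.YES))
      doc.add(new TextField("message", line, Field.Store.YES))
      writer.addDocument(doc)
    }
    writer.close()

    // Full-text query against the "message" field.
    val searcher = new IndexSearcher(DirectoryReader.open(dir))
    val query = new QueryParser("message", analyzer).parse("payment AND timeout")
    searcher.search(query, 10).scoreDocs
      .foreach(hit => println(searcher.doc(hit.doc).get("message")))
  }
}
```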
Environment: Java 1.5, AJAX, XML, Spring 3.0, Hibernate 2.0, Struts 1.2, Web Services, WebSphere 7.0, JUnit, Oracle 10g, SQL, PL/SQL, Log4j, RAD 7.0/7.5, ClearCase, Unix, HTML, CSS, JavaScript
Confidential
Java Developer
Responsibilities:
- Worked with the business community to define business requirements and analyze the possible technical solutions.
- Requirement gathering, Business Process flow, Business Process Modeling and Business Analysis.
- Extensively used UML and Rational Rose for designing to develop various use cases, class diagrams and sequence diagrams.
- Used JavaScript for client-side validations, and AJAX to create interactive front-end GUI.
- Developed application using Spring MVC architecture.
- Developed custom tags for the table utility component.
- Used various Java, J2EE APIs including JDBC, XML, Servlets, and JSP.
- Designed and implemented the UI using Java, HTML, JSP and JavaScript.
- Designed and developed web pages using Servlets and JSPs, and used XML/XSL/XSLT as the content repository format (see the servlet sketch after this list).
- Involved in Java application testing and maintenance in development and production.
- Involved in developing the customer form data tables and maintaining customer support and customer data in MySQL database tables.
- Involved in mentoring specific projects in application of the new SDLC based on the Agile Unified Process, especially from the project management, requirements and architecture perspectives.
- Designed and developed Views, Model and Controller components implementing MVC Framework.
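An illustrative sketch of the Servlet/JSP request flow described above, written in Scala for consistency with the other sketches; the servlet name, request parameter, and JSP path are hypothetical.

```scala
import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse}

// Reads a request parameter, places model data in request scope, and forwards to a JSP view.
class CustomerLookupServlet extends HttpServlet {
  override def doGet(req: HttpServletRequest, resp: HttpServletResponse): Unit = {
    val customerId = Option(req.getParameter("customerId")).getOrElse("")
    req.setAttribute("customerId", customerId)                       // model data for the view
    req.getRequestDispatcher("/customer.jsp").forward(req, resp)     // JSP renders the page
  }
}
```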
Environment: JDK 1.3, J2EE, JDBC, Servlets, JSP, XML, XSL, CSS, HTML, DHTML, JavaScript, UML, Eclipse 3.0, Tomcat 4.1, MySQL