Data Engineer Resume
Palo Alto
PROFESSIONAL SUMMARY:
- 7+ years of technical expertise in all phases of the SDLC (Software Development Life Cycle), with a major concentration on Big Data analysis frameworks, relational databases, NoSQL databases, and Java/J2EE technologies, following recommended software practices.
- 3+ years of industry IT experience in data manipulation using Big Data Hadoop ecosystem components: MapReduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Spark, Kafka, Flume, Sqoop, Oozie, Avro, AWS, Spark integration with Cassandra, Solr, and ZooKeeper.
- Extensive experience working with Cloudera (CDH4 & 5) and Hortonworks Hadoop distributions and AWS Amazon EMR to fully leverage and implement new Hadoop features.
- Experience in Azure Data Factory (ADF) creating multiple pipelines and activities using Azure for full and incremental data loads.
- Strong experience and knowledge of real time data analytics using Spark Streaming, Kafka and Flume. Good experience in writing Spark applications using Python and Scala.
- Worked on replacing MR jobs and Hive scripts with Spark SQL and Spark data transformations for efficient data processing.
- Involved in converting Cassandra/Hive/SQL queries into Spark transformations using RDDs and Scala.
- Knowledge of unifying data platforms using Kafka producers/consumers and implementing pre-processing using Storm topologies.
- Hands-on experience with the data ingestion tools Kafka and Flume and the workflow management tool Oozie.
- Experience processing Avro data files using Avro tools and MapReduce programs.
- Hands-on experience in writing MapReduce programs in Java to handle different data sets using Map and Reduce tasks.
- Good understanding and knowledge of Hadoop architecture and hands-on experience with Hadoop components such as NameNode and DataNode, MapReduce concepts, Spark execution concepts, and the HDFS framework.
- Developed multiple MapReduce jobs to perform data cleaning and preprocessing.
- Expert in working with the Hive data warehouse tool: creating tables, distributing data by implementing partitioning and bucketing, and writing and optimizing HiveQL queries.
- Designed Hive queries and Pig scripts to perform data analysis, data transfer, and table design.
- Implemented ad-hoc queries using Hive to perform analytics on structured data.
- Expertise in writing Hive UDFs and Generic UDFs to incorporate complex business logic into Hive queries.
- Experienced in optimizing Hive queries by tuning configuration parameters.
- Involved in designing the data model in Hive for migration of the ETL process into Hadoop and wrote Pig scripts to load data into the Hadoop environment.
- Experience in developing data ingestion, data processing, and analytical pipelines for Big Data, relational databases, and NoSQL databases.
- Compared performance of Hive and Big SQL for our data warehousing systems.
- Implemented Sqoop for large dataset transfers between Hadoop and RDBMS.
- Extensively used Apache Flume to collect the logs and error messages across the cluster.
- Experience in composing shell scripts to dump the shared information from MySQL servers to HDFS.
- Worked on implementing and optimizing Hadoop/MapReduce algorithms for Big Data analytics.
- Developed graphs using the Graphical Development Environment (GDE) with various Ab Initio components and migrated a few graphs to Hadoop.
- Team player with good interpersonal, communication, and presentation skills.
- Exceptional ability to learn and master new technologies and to deliver outputs on short deadlines.
- Detailed understanding of Software Development Life Cycle (SDLC) and experience in project implementation methodologies including Waterfall and Agile.
TECHNICAL SKILLS:
Big Data Ecosystems: Hadoop, MapReduce, HDFS, ZooKeeper, Hive, Pig, Sqoop, Oozie, Flume, YARN, Spark, NiFi
Database Languages: SQL, PL/SQL, Oracle
Programming Languages: Java, Scala, Python (can read and understand)
Frameworks: Spring, Hibernate, JMS
Scripting Languages: JSP, Servlets, JavaScript, XML, HTML, Python
Web Services: RESTful web services
Databases: RDBMS, HBase, Cassandra
IDE: Eclipse, IntelliJ
Platforms: Windows, Linux, Unix
Application Servers: Apache Tomcat, WebSphere, WebLogic, JBoss
Methodologies: Agile, Waterfall
ETL Tools: Talend
PROFESSIONAL EXPERIENCE:
Confidential, Palo Alto
Data Engineer
Responsibilities:
- Migrated code from Ab Initio (ETL tool) to Hadoop using Hive and Spark, according to the complexity of the Ab Initio graphs.
- Developed SQL logic using Spark SQL to match business requirements.
- Used various windowing functions and developed advanced clustered queries in Spark SQL.
- Queried data using Spark SQL on top of the Spark engine for faster data set processing.
- Designed and developed an efficient architecture to process data using PySpark programs.
- Developed optimized distributed applications with Spark Core and Spark SQL in Python, integrating REST, fact, and dimensional data and feeding the data to HDFS and SQL Server.
- Optimized and migrated data-intensive batch jobs from Ab Initio to Spark ETL jobs.
- Developed scalable transformation, aggregation, and rollup operations with Hive and optimized SLAs by utilizing Hive-based partitions and buckets and storing the data in different file formats (Parquet, Avro, ORC) using suitable compression codecs (Snappy, LZ4, gzip, LZO, bzip) based on application needs.
- Developed graphs using Graphical Development Environment (GDE) with various Ab Initio components.
- Developed MapReduce batch jobs in Java for loading data into HDFS in sequential format.
- Ingested structured data from RDBMS to HDFS as incremental imports using Sqoop.
- Developed Sqoop scripts to import and export data from relational sources and handled incremental loading of the transaction data by date.
- Involved in writing Pig scripts to wrangle the raw data, store it in HDFS, and load it into Hive tables using HCatalog.
- Created Hive external tables with clustering and partitioning on the date for optimizing the performance of ad-hoc queries.
- Involved in creating Hive tables on a wide range of data formats such as text, sequential, Avro, Parquet, and ORC.
- Transformed the semi-structured log data to fit into the schema of the Hive tables using Pig.
- Evaluated the suitability of Hadoop and its ecosystem for the project, implementing and validating various proof of concept (POC) applications to eventually adopt them as part of the Big Data Hadoop initiative.
- Coordinated with Hadoop Admin team on implementing the DDLs for new applications.
- Worked on Incident and Change Management for creating tickets and CRQ using ASK NOW.
- Worked in an Agile framework, tracking tasks on a sprint basis using a JIRA board.
- Worked on ESP and D-series to create collections for scheduling Job Docs in Production DCs.
- Followed the D2P process to test and debug scripts from lower to higher environments.
- Worked with distributed copy (DistCp) to move application data across clusters.
Confidential, TX
Hadoop developer
Responsibilities:
- Developed Spark applications in Python utilizing DataFrames and the Spark SQL API for faster processing of data.
- Worked with Spark libraries to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Developed Spark jobs and Hive Jobs to summarize and transform data.
- Built real-time data pipelines by developing Kafka producers and Spark Streaming applications to consume the data.
- Worked on batch processing and real-time data processing with Spark Streaming.
- Developed Spark applications in Python and implemented an Apache Spark data processing project to handle data from NoSQL databases and streaming sources.
- Worked with cloud services such as Azure and was involved in ETL, data integration, and migration.
- Created Azure Data Factory pipelines to consume data from external sources and load it into Azure SQL databases.
- Experience in Azure Data Factory (ADF) creating multiple pipelines and activities using Azure for full and incremental data loads.
- Extracted, transformed, and loaded data from source systems into Azure data storage services using Azure Data Factory, and ingested data into Azure Blob Storage.
- Wrote Spark-Streaming applications to consume the data from Kafka topics and write the processed streams to HBase.
- Collaborated with analytics and business teams to improve data models and increase data accessibility; performed data analysis to troubleshoot data quality issues at the source and assisted business teams.
- Responsible for understanding the business analytics requirements for HDInsight, analyzing the data and correlating it with business requirements, and building data pipelines to generate the data.
- Worked on solving performance issues in Hive with an understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.
- Developed Hive scripts in HiveQL to de-normalize and aggregate the data.
- Expertise in creating Hive tables and loading and analyzing data using Hive queries.
- Performed transformations, cleaning, and filtering on imported data using Hive and loaded the final data into HDFS.
- Developed Hive queries on different tables to find insights. Automated the process of building data pipelines for data scientists performing predictive, classification, descriptive, and prescriptive analytics.
- Built a NiFi system for replicating the whole database.
- Created NiFi flows to trigger Spark jobs and used PutEmail processors to send notifications on any failures.
Confidential, WA
Big data engineer
Responsibilities:
- Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Serialized JSON data and stored it in tables using Spark SQL.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Knowledge of cloud infrastructure technologies in Azure.
- Experience with Confidential Azure Cloud services, Storage Accounts, Azure data storage, Azure Data Factory, Data Lake, and Virtual Networks.
- Part of a team that helps Confidential customers build Big Data and advanced analytics solutions on the Confidential Azure cloud using Azure data services or open source software.
- Used Scala to convert Hive/SQL queries into RDD transformations in Apache Spark.
- Worked with Azure Monitoring and Data Factory.
- Supported migrations from on-premises to Azure.
- Provided support services to enterprise customers related to Microsoft Azure Cloud networking, with experience in handling critical situation cases.
- Implemented Spark scripts using Scala and Spark SQL to load Hive tables into Spark for faster processing of data.
- Experience in writing shell scripts to automate the process flow.
- Experience in writing business analytics scripts using HiveQL.
- Provided consulting and cloud architecture for premier customers and internal projects running on the MS Azure platform, targeting high availability of services and low operational costs.
- Optimized test content and process with a 20% reduction in false positives. Used SQL and Excel to pull, analyze, polish, and visualize data.
- Followed Agile methodology and Scrum meetings to track, optimize, and tailor features to customer needs.
Confidential, IL
Hadoop consultant
Responsibilities:
- Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
- Performed advanced procedures such as text analytics and processing, using the in-memory computing capabilities of Spark with Scala.
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS.
- Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
- Configured Spark Streaming to receive ongoing information from Kafka and store the stream information in HDFS.
- Used Spark and Spark SQL to read the Parquet data and create tables in Hive using the Scala API.
- Experienced in using the Spark application master to monitor Spark jobs and capture their logs.
- Worked on Spark using Python and Spark SQL for faster testing and processing of data.
- Developed multiple Kafka Producers and Consumers as per the software requirement specifications.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build a common learner data model that receives data from Kafka in near real time and persists it to Cassandra.
- Experience in building real-time data pipelines with Kafka Connect and Spark Streaming.
- Used Kafka and Kafka brokers, initiated the Spark context, and processed live streaming information with RDDs; also used Kafka to load data into HDFS and NoSQL databases.
- Used Zookeeper to store offsets of messages consumed for a specific topic and partition by a specific Consumer Group in Kafka.
- Used Kafka functionality such as distribution, partitioning, and the replicated commit log service for messaging systems by maintaining feeds, and created applications that monitor consumer lag within Apache Kafka clusters.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build the common learner data model.
- Involved in Cassandra cluster planning, with a good understanding of Cassandra cluster mechanisms.
- Used Sqoop to import data into Cassandra tables from relational databases such as Oracle and MySQL, and designed column families.
- Developed efficient MapReduce programs for filtering out the unstructured data and developed multiple MapReduce jobs to perform data cleaning and preprocessing on Hortonworks.
- Implemented a data interface to get customer information using a REST API, pre-processed the data using MapReduce 2.0, and stored it in HDFS (Hortonworks).
- Maintained ELK (Elasticsearch, Logstash, and Kibana) and wrote Spark scripts using the Scala shell.
- Worked in an AWS environment for development and deployment of custom Hadoop applications.
- Strong experience in working with Elastic MapReduce (EMR) and setting up environments on Amazon AWS EC2 instances.
- Wrote Oozie workflows to run Sqoop and HQL scripts in Amazon EMR.
- Created custom UDFs for Pig and Hive to incorporate Python methods and functionality into Pig Latin and HQL (HiveQL).
- Developed shell scripts to generate Hive CREATE statements from the data and load the data into the tables.
- Involved in writing custom MapReduce programs using the Java API for data processing.
- Created Hive tables as internal or external tables per requirements, defined with appropriate static and dynamic partitions and bucketing for efficiency.
Confidential, Naperville, Illinois
Big Data Developer
Responsibilities:
- Experienced in migrating and transforming large sets of structured, semi-structured, and unstructured raw data from RDBMS, Oracle DB, and Teradata through Sqoop and placing it in HDFS for further processing.
- Wrote multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file formats, including XML, JSON, CSV, and other codec file formats.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
- Created multiple Hive tables, ran Hive queries on that data, and implemented partitioning, dynamic partitioning, and buckets in Hive for efficient data access. Gained good experience with NoSQL databases such as MongoDB and HBase.
- Implemented Sqoop for large data transfers from RDBMS to HDFS/HBase/Hive.
- Used Pig to perform data validation on the data ingested using Sqoop and Flume and pushed the cleansed data set into MongoDB.
- Designed and implemented the MongoDB schema and wrote services to store and retrieve user data from MongoDB for the application on devices. Used the Mongoose API to access MongoDB from Node.js.
- Wrote a Java program to retrieve data from HDFS and provide it to REST services.
- Implemented partitioning and bucketing in Hive for better organization of the data.
- Involved in using HCatalog to access Hive table metadata from MapReduce or Pig code.
- Installed, configured and maintained Flume, Hive, Pig, Sqoop and Oozie on the Hadoop cluster.
- Hands-on experience in developing optimal strategies for distributing web log data over the cluster and importing and exporting stored web log data into HDFS and Hive using Sqoop.
- Developed several REST web services that produce both XML and JSON to perform tasks, leveraged by both web and mobile applications.
- Developed unit test cases for Hadoop MapReduce jobs and driver classes with the MR testing library.
- Continuously monitored and managed the Hadoop cluster using Cloudera Manager and the Web UI.
- Designed the logical and physical data model, generated DDL scripts, and wrote DML scripts for the Oracle 10g database. Followed Agile methodology for the entire project and supported testing teams.
Confidential
Java Developer
Responsibilities:
- Involved in all the phases of the life cycle of the project from requirements gathering to quality assurance testing.
- Developed class diagrams and sequence diagrams using Rational Rose.
- Responsible for developing rich web interface modules with Struts tags, JSP, JSTL, CSS, and JavaScript.
- Implemented UI screens using JSP, client-side JavaScript validations, and Struts action forms.
- Implemented cookie handling at the servlet initialization level.
- Developed the presentation layer using the Struts framework and performed validations using the Struts Validator plug-in.
- Developed the application using the Spring MVC framework.
- Created SQL scripts for the Oracle database.
- Implemented the persistence layer using Spring JDBC to store and update data in the database.
- Produced web services using the WSDL/SOAP standard.
- Implemented J2EE design patterns such as the Singleton pattern with the Factory pattern.
- Extensively involved in the creation of Session Beans and MDBs using EJB.
- Experienced in writing SQL queries, triggers, functions, and stored procedures to implement the DAO layer using JDBC.
- Involved in various phases of the Software Development Life Cycle (SDLC), such as design, development, and unit testing.
- Developed and deployed UI layer logic of sites using JSP, XML, JavaScript, HTML/DHTML, and Ajax.
- Designed use case diagrams, class diagrams, and sequence diagrams as part of the design phase using Rational Rose.
- Actively participated in requirements gathering, analysis, design, and testing phases.
- Extensively involved in writing stored procedures for data retrieval, storage, and updates in the Oracle database using JDBC.
- Built and deployed the application using Ant.
- Used JIRA to track bugs.
Confidential
Java Developer
Responsibilities:
- Actively involved from the start of the project, from requirements gathering to quality assurance testing.
- Coded and developed a multi-tier architecture in Java, J2EE, and Servlets.
- Involved in gathering business requirements, analyzing the project, and creating UML diagrams such as use cases, class diagrams, sequence diagrams, and flowcharts.
- Developed client-side web services components using JAX-WS technologies.
- Extensively used JUnit for testing the application code for server-client data transfer.
- Developed the front end using JSTL, JSP, HTML, and JavaScript.
- Created new and maintained existing web pages built with JSP and Servlets.
- Extensively worked on views, stored procedures, triggers, and SQL queries for loading the data (staging) to enhance and maintain the existing functionality.
- Involved in developing web services using SOAP for sending and receiving data from external interfaces.
- Involved in database design and in developing SQL queries and stored procedures on MySQL.
- Consumed web services (WSDL, SOAP, and UDDI) from third parties for authorizing payments to/from customers.
- Developed Hibernate mapping files (.hbm.xml) for mapping declarations.
- Wrote and modified database queries and stored procedures for Oracle9i.
Confidential
Jr Java Developer
Responsibilities:
- Involved in requirements analysis and the design of an object-oriented domain model.
- Involved in detailed documentation and wrote functional specifications for the module.
- Involved in development of the application with Java and J2EE technologies.
- Developed and maintained an elaborate services-based architecture utilizing open source technologies such as Hibernate ORM and Spring Framework.
- Designed and documented REST/HTTP APIs, including JSON data formats and API versioning strategy.
- Developed server-side services using Java multithreading, Struts MVC, and Java Web Services (SOAP, Axis).
- Used microservices as the communication medium between different APIs and processed a large number of small processes; also involved in creating and configuring build files using Ant.
- Developed the Controller Servlet, a framework component for the presentation layer.
- Investigated MVC framework technologies, including JSF-based ones (ICEfaces, RichFaces), to implement the MVC architecture of the product.
- Developed the application using JSF, Spring, and JDO technologies, which communicated with mainframe software.
- Designed, developed, and implemented JSPs in the presentation layer for Submission, Application, and reference implementation.
