Data Lake Engineer Resume
Framingham, MA
SUMMARY
- Overall 10 years of IT industry experience in product development, implementation and maintenance of various applications using Big Data ecosystems in Linux environments
- Overall 6 years of experience in analytics using Big Data technologies, with hands-on experience in storing, querying, processing and analyzing data
- Comprehensive work experience in implementing Big Data projects using Apache Hadoop, Pig, Hive, HBase, Spark, Sqoop, Flume, Zookeeper, Oozie
- Experience with distributed systems, large-scale non-relational data stores and multi-terabyte data warehouses
- Excellent knowledge of Hadoop architecture: Hadoop Distributed File System (HDFS), Job Tracker, Task Tracker, Name Node, Data Node and the MapReduce programming paradigm
- Hands-on experience building data pipelines using Hadoop components Sqoop, Hive, Pig, MapReduce, Spark, Spark SQL
- Hands-on experience in various Big Data application phases like Data Ingestion, Data Analytics and Data Visualization
- Experience in developing efficient solutions to analyze large data sets
- Experience working on Hortonworks / Cloudera / MapR distributions
- Extensively worked on MRv1 and MRv2 Hadoop architectures
- Experience working on Spark, RDDs, DAGs, Spark SQL and Spark Streaming
- Experience in importing and exporting data using Sqoop between HDFS and Relational Database Management Systems
- Populated HDFS with huge amounts of data using Apache Kafka and Flume
- Excellent knowledge of data mapping, extracting, transforming and loading from different data sources
- Worked with different file formats such as TextFile, SequenceFile, Avro, ORC and Parquet for Hive querying and processing
- Experience in developing custom MapReduce Programs in Java using Apache Hadoop for analyzing Big Data as per the requirement; writing Python automation scripts for applications
- Well experienced in data transformation using custom MapReduce, Hive and Pig scripts for different types of file formats
- Expertise in extending Hive and Pig core functionality by writing custom UDFs and UDAFs
- Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning and bucketing
- Experience building solutions with NoSQL databases, such as HBase, Cassandra, MongoDB
- Experienced in Apache Spark for implementing advanced procedures such as text analytics and processing, leveraging in-memory computing capabilities in Scala
- Experience in Kafka installation & integration with Spark Streaming
- Used Spark Streaming to divide streaming data into micro-batches as input to the Spark engine for batch processing (see the sketch following this summary)
- Experience in designing both time driven and data driven automated workflows using Oozie
- Good understanding of ZooKeeper for monitoring and managing Hadoop jobs
- Good understanding of ETL tools and how they can be applied in a Big Data environment
- Experience monitoring MapReduce jobs and YARN applications
- Experience working with Microsoft Azure cloud services: Azure Data Lake Storage Gen1 & Gen2, Azure Data Factory and other services
- Experience working with Databricks notebooks & integration of the notebooks with Azure Data Factory
- Hands-on experience with Amazon Elastic MapReduce (EMR), S3 storage, EC2 instances and Data Warehousing
- Experience with RDBMS and writing SQL and PL/SQL scripts used in stored procedures
- Used Git for source code and version control management
- Strong understanding of Agile and Waterfall SDLC methodologies
- Experience working with both small and large groups; successful in meeting new technical challenges and finding solutions that meet customer needs
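The Kafka and Spark Streaming bullets above describe micro-batching an incoming stream for the Spark engine. The sketch below is a minimal illustration of that pattern written against the spark-streaming-kafka-0-10 API; the broker, topic, consumer group and HDFS paths are assumptions, not details from the projects described here.

```scala
// Minimal sketch, assuming the broker, topic, consumer group and HDFS paths below;
// Spark Streaming groups the Kafka stream into micro-batches for the Spark engine.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object KafkaToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-to-hdfs")
    val ssc = new StreamingContext(conf, Seconds(30))   // 30-second micro-batches

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",             // hypothetical broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "hdfs-loader",                       // hypothetical consumer group
      "auto.offset.reset" -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)  // hypothetical topic
    )

    // Each micro-batch arrives as an RDD and is appended to HDFS for batch processing
    stream.map(_.value).foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) rdd.saveAsTextFile(s"hdfs:///data/raw/events/${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```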
TECHNICAL SKILLS
Big Data Technologies: HDFS, YARN, MapReduce, Pig, Hive, HBase, Spark, Spark SQL, Spark Streaming, Sqoop, Flume, Kafka, ZooKeeper, Oozie
Big Data Distributions: Hortonworks, Cloudera, MapR, Amazon Elastic MapReduce (EMR)
Programming Languages: Java, Python, Scala, C++, R, JavaScript, Shell Script
Operating Systems: Linux, Windows, Unix
RDBMS: Oracle, MySQL, MS SQL Server
NoSQL Databases: HBase, Cassandra, MongoDB
Frameworks: Spring, Hibernate, Struts
Web Servers: Apache Tomcat, WebSphere, WebLogic
Version Control: Git, SVN, CVS
Integrated Development Environments (IDEs): Spyder, Java Eclipse IDE, NetBeans, Microsoft SQL Studio
Web Technologies: HTML, CSS, Bootstrap, JavaScript, DOM, XML, Servlets
PROFESSIONAL EXPERIENCE
Confidential, Framingham MA
Data Lake Engineer
Responsibilities:
- Worked in developing data lake for the GBT (Global Business Transactions) reporting team
- Worked in developing hierarchy application for the ECH (Enterprise Customer Hierarchy) team
- Worked in developing a unified data platform for the SVC (Single View Customer) team
- Involved in complete project life cycle starting from design discussion to production deployment
- Worked closely with the business team to gather their requirements
- Assisted in designing and developing the data lake and ETL using Python and the Hadoop ecosystem
- Coordinated with clients' developers in tuning query performance for all services
- Involved in developing queries in MySQL, Oracle and DB2
- Worked with Hadoop components HDFS, MapReduce, Hive, Sqoop, Hue and Kafka for Couchbase NoSQL data extraction
- Worked with Microsoft Azure cloud services to migrate on-premises data from RDBMS sources (PostgreSQL) and cloud-based FTP servers to Azure Data Lake Storage Gen1 & Gen2
- Worked with Azure Databricks notebooks for compute, using Spark RDDs and Spark SQL processing, and integrated the notebooks into Azure Data Factory pipelines (see the sketch following this section)
- Tested the code performance in development and Quality Assurance environments
- Responsible for supporting the client after production release
- Followed Agile Methodologies while working on the project
Environment: Hadoop, Spark, HDFS, MapReduce, YARN, Hive, Hue, Sqoop, Kafka, SQL, GitHub, Python, Linux, Tidal Scheduler, Microsoft Azure Data Lake Storage (Gen1 & Gen2), Databricks notebooks, Spark RDDs, Spark SQL, Oracle, MySQL, PostgreSQL and DB2 relational databases, Couchbase NoSQL database
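A minimal sketch of the Databricks-to-Data-Factory pattern described above: a Scala notebook reads a raw extract from Azure Data Lake Storage Gen2, applies a Spark SQL transformation, and writes a curated output back to the lake, with the Data Factory pipeline running the notebook as an activity. The storage account, container, paths and column names are illustrative assumptions.

```scala
// Databricks notebook cell (Scala). Account, container, paths and column names are
// illustrative assumptions; credentials would normally come from a secret scope.
val rawPath = "abfss://raw@examplelake.dfs.core.windows.net/gbt/transactions/"

// Read the landed on-premises extract into a DataFrame (`spark` is predefined in Databricks)
val transactions = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(rawPath)

// Spark SQL processing step: aggregate to the reporting grain
transactions.createOrReplaceTempView("transactions")
val daily = spark.sql(
  """SELECT customer_id, to_date(txn_ts) AS txn_date, SUM(amount) AS total_amount
    |FROM transactions
    |GROUP BY customer_id, to_date(txn_ts)""".stripMargin)

// Write the curated output back to the lake; the Azure Data Factory pipeline
// simply invokes this notebook as a pipeline activity.
daily.write.mode("overwrite")
  .parquet("abfss://curated@examplelake.dfs.core.windows.net/gbt/daily_totals/")
```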
Confidential, Chicago IL
Sr. Hadoop Developer
Responsibilities:
- Involved in complete project life cycle starting from design discussion to production deployment
- Worked closely with the business team to gather their requirements and new support features
- Involved in running POCs on different use cases of the application and maintained a standard document of best coding practices
- Designed the data lake on a 16-node cluster using the Hortonworks distribution
- Responsible for building scalable distributed data solutions using Hadoop
- Installed, configured and implemented high availability Hadoop Clusters with required services (HDFS, Hive, HBase, Spark, ZooKeeper)
- Implemented Kerberos for authenticating all the services in Hadoop Cluster
- Configured ZooKeeper to coordinate the servers in clusters to maintain the data consistency
- Involved in designing the end-to-end data pipeline to ingest data into the data lake
- Wrote scripts to automate application deployments and configuration, and monitored YARN applications
- Configured and developed Sqoop scripts to migrate the data from relational databases like Oracle, Teradata to HDFS
- Used Flume for collecting and aggregating large amounts of streaming data into HDFS
- Wrote MapReduce jobs in Java to parse the raw data, populate staging tables and store the refined data
- Developed MapReduce programs as part of predictive analytical model development
- Built reusable Hive UDF libraries for business requirements, enabling business analysts to use these UDFs in Hive queries
- Created different staging tables like ingestion tables and preparation tables in Hive environment
- Optimized Hive queries and used Hive on top of Spark engine
- Worked on Sequence files, Map side joins, Bucketing, Static and Dynamic Partitioning for Hive performance enhancement and storage improvement
- Tested Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, on Pig and Hive jobs
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Spark SQL and Scala
- Worked on the Spark core and Spark SQL modules of Spark extensively
- Created tables in HBase to store data arriving in variable formats from different upstream sources
- Leveraged AWS cloud services such as EC2, auto scaling and VPC (Virtual Private Cloud) to build secure, highly scalable and flexible systems that handled expected and unexpected load bursts and could evolve quickly during development iterations
- Developed batch scripts to fetch data from AWS S3 storage and perform the required transformations in the Spark framework using Scala (see the sketch following this section)
- Configured various workflows to run on top of Hadoop using Oozie; these workflows comprise heterogeneous jobs such as Hive, Sqoop and MapReduce
- Experience in managing and reviewing Hadoop log files
- Utilized capabilities of Tableau such as Data extracts, Data blending, Forecasting, Dashboard actions and table calculations to build dashboards
- Followed Agile Methodologies while working on the project
- Performed bug fixing and 24X7 production support for running the processes
Environment: Java, Scala, Hadoop, Hortonworks, AWS, HDFS, YARN, MapReduce, Hive, Spark, Kafka, Sqoop, Oozie, ZooKeeper, Oracle, Teradata, MySQL
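A minimal sketch of the S3-to-Spark batch step and the Hive-to-Spark conversion described above. The bucket, paths, table and column names are illustrative assumptions; the same aggregation is shown once as Spark SQL and once as RDD transformations.

```scala
// Minimal sketch; bucket, paths, database/table and column names are assumptions.
import org.apache.spark.sql.SparkSession

object S3BatchTransform {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("s3-batch-transform")
      .enableHiveSupport()
      .getOrCreate()

    // Fetch the raw extract from AWS S3 storage
    val orders = spark.read.parquet("s3a://example-raw-bucket/orders/")

    // A HiveQL-style aggregation expressed as Spark SQL
    orders.createOrReplaceTempView("orders")
    val byRegion = spark.sql(
      "SELECT region, COUNT(*) AS order_cnt FROM orders GROUP BY region")

    // The same logic written as RDD transformations, for jobs kept on the RDD API
    val byRegionRdd = orders.select("region").rdd
      .map(row => (row.getString(0), 1L))
      .reduceByKey(_ + _)

    // Persist the curated result back into the Hive warehouse
    // (assumes a "curated" database already exists)
    byRegion.write.mode("overwrite").saveAsTable("curated.orders_by_region")
    spark.stop()
  }
}
```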
Confidential, Washington DC
Hadoop Developer
Responsibilities:
- Experience with the complete SDLC process, including staging, code reviews, source code management and the build process
- Implemented Big Data platforms as data storage, retrieval and processing systems
- Developed data pipeline using Kafka, Sqoop, Hive and Java MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis
- Involved in managing nodes on the Hadoop cluster and monitoring cluster job performance using Cloudera Manager
- Wrote Sqoop scripts for importing and exporting data into HDFS and Hive
- Wrote MapReduce jobs to discover trends in data usage by the users
- Loaded and transformed large sets of structured, semi-structured and unstructured data using Pig
- Experienced in using Pig for transformations, event joins, filtering and pre-aggregations before storing the data in HDFS
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting
- Involved in developing Hive UDFs for functionality not available out of the box in Hive
- Created sub-queries for filtering and faster query execution
- Experienced in migrating HiveQL to Impala to minimize query response time
- Used HCatalog to access Hive table metadata from MapReduce and Pig scripts
- Experience in writing and tuning Impala queries, creating views for ad-hoc and business processing
- Experience loading and transforming large amounts of structured and unstructured data into HBase, with exposure to handling automatic failover in HBase
- Ran POCs in Spark to benchmark the implementation
- Developed Spark jobs using Scala in test environment for faster data processing and querying
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala (see the sketch following this section)
- Configured big data workflows to run on top of Hadoop using Oozie; these workflows comprise heterogeneous jobs such as Pig, Hive and Sqoop, with cluster coordination handled through ZooKeeper
- Hands on experience in Tableau for Data Visualization and analysis on large data sets, drawing various conclusions
- Involved in developing a test framework for data profiling and validation using interactive queries, collecting all test results into audit tables to compare results over time
- Documented all requirements, code and implementation methodologies for review and analysis
- Extensively used GitHub as a code repository and Phabricator for managing day to day development process and to keep track of the issues
Environment: Java, Scala, Hadoop, Spark, HDFS, MapReduce, Yarn, Hive, Pig, Impala, Oozie, Sqoop, Flume, Kafka, Teradata, SQL, GitHub, Phabricator, Amazon Web Services
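A minimal sketch of the MapReduce-to-Spark migration mentioned above: a classic map/reduce aggregation over usage logs rewritten as Spark RDD transformations in Scala. The input and output paths and the tab-separated log layout (userId, bytes) are illustrative assumptions.

```scala
// Minimal sketch; paths and the tab-separated log layout are assumptions.
import org.apache.spark.sql.SparkSession

object UsageByUser {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("usage-by-user").getOrCreate()
    val sc = spark.sparkContext

    val logs = sc.textFile("hdfs:///data/usage/logs/")

    val usage = logs
      .map(_.split("\t"))                                 // map phase: parse each record
      .filter(f => f.length >= 2 && f(1).matches("\\d+")) // drop malformed lines
      .map(f => (f(0), f(1).toLong))                      // emit (userId, bytes)
      .reduceByKey(_ + _)                                 // reduce phase: sum bytes per user

    usage.saveAsTextFile("hdfs:///data/usage/bytes_by_user")
    spark.stop()
  }
}
```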
Confidential
Hadoop Developer
Responsibilities:
- Worked on a Hortonworks cluster, which provides an open-source platform based on Apache Hadoop for analyzing, storing and managing big data
- Worked with analysts to determine and understand business requirements
- Loaded and transformed large datasets of structured, semi-structured and unstructured data using Hadoop/Big Data concepts
- Developed data pipeline using Flume, Sqoop, Pig and MapReduce to ingest customer data and financial histories into HDFS for analysis
- Used MapReduce and Flume to load, aggregate, store and analyze web log data from different web servers
- Created MapReduce programs to handle semi-structured and unstructured data such as XML, JSON and Avro data files, and sequence files for log files
- Involved in submitting and tracking MapReduce jobs using Job Tracker
- Wrote Pig Latin scripts for data cleansing, ETL operations and query optimization of existing scripts
- Wrote Hive UDFs to sort struct fields and return complex data types (see the sketch following this section)
- Created Hive tables from JSON data using data serialization frameworks such as Avro
- Experience writing reusable custom Hive and Pig UDFs in Java and using existing UDFs from Piggybank and other sources
- Experience working with the NoSQL database HBase for real-time data analytics
- Integrated Hive tables with HBase to perform row-level analytics
- Developed Oozie workflows for daily incremental loads, which Sqoop data from Teradata and Netezza and import it into Hive tables
- Involved in performance tuning using execution engines such as Tez
- Performed performance tuning and troubleshooting of MapReduce jobs by analyzing and reviewing Hadoop log files
- Implemented Daily Cron jobs that automate parallel tasks of loading the data into HDFS using AutoSys and Oozie coordinator jobs
- Developed a suite of unit test cases for Mapper, Reducer and Driver classes using the MR testing library
Environment: Hortonworks, Java, Hadoop, HDFS, MapReduce, Tez, Hive, Pig, Oozie, Sqoop, Flume, Teradata, Netezza, Tableau
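The Hive UDF bullets above refer to UDFs written in Java; the sketch below shows the same simple-UDF shape in Scala for consistency with the other examples here, since the Hive UDF API is JVM-based. The function name and behavior (normalizing a free-text field) are illustrative assumptions rather than the actual UDFs from this project.

```scala
// Illustrative Hive simple UDF; the function name and normalization logic are assumptions.
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

class NormalizeText extends UDF {
  // Hive resolves evaluate() by reflection and calls it once per row
  def evaluate(input: Text): Text = {
    if (input == null) null
    else new Text(input.toString.trim.toLowerCase)
  }
}

// Registered from HiveQL after packaging into a jar, for example:
//   ADD JAR /path/to/udfs.jar;
//   CREATE TEMPORARY FUNCTION normalize_text AS 'NormalizeText';
//   SELECT normalize_text(customer_name) FROM staging.customers;
```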
Confidential
Hadoop Developer
Responsibilities:
- Installed the Cloudera distribution of the Hadoop cluster and the HDFS, Pig, Hive, Sqoop, Flume and MapReduce services
- Responsible for providing an open-source platform based on Apache Hadoop for analyzing, storing and managing big data
- Loaded and transformed large sets of structured, semi-structured and unstructured data
- Responsible for managing data coming from different sources
- Imported and exported data into HDFS and Hive using Sqoop
- Wrote Hive queries
- Involved in loading data from the UNIX file system into HDFS (see the sketch following this section)
- Created Hive tables, loaded them with data and wrote queries that run internally as MapReduce jobs, performing data analysis per business requirements
- Worked with analysts to determine and understand business requirements
- Loaded and transformed large datasets of structured, semi-structured and unstructured data using Hadoop/Big Data concepts
- Developed data pipeline using Flume, Sqoop, Pig and MapReduce to ingest customer data and financial histories into HDFS for analysis
- Used MapReduce and Flume to load, aggregate, store and analyze web log data from different web servers
- Created MapReduce programs to handle semi-structured and unstructured data such as XML, JSON and Avro data files, and sequence files for log files
- Involved in submitting and tracking MapReduce jobs using Job Tracker
- Wrote Pig Latin scripts for data cleansing, ETL operations and query optimization of existing scripts
- Wrote Hive UDFs to sort struct fields and return complex data types
- Created Hive tables from JSON data using data serialization frameworks such as Avro
- Experience writing reusable custom Hive and Pig UDFs in Java and using existing UDFs from Piggybank and other sources
- Experience working with the NoSQL database HBase for real-time data analytics
- Integrated Hive tables with HBase to perform row-level analytics
- Performed performance tuning and troubleshooting of MapReduce jobs by analyzing and reviewing Hadoop log files
- Developed unit test cases for Mapper, Reducer and Driver classes using the MR testing library
- Supported operations team in Hadoop cluster maintenance including commissioning and decommissioning nodes and upgrades
- Provided technical assistance to all development projects
- Hands-on experience with Qlik Sense for Data Visualization and Analysis on large data sets, drawing various insights
- Created dashboards using Qlik Sense and performed Data extracts, Data blending, Forecasting, and table calculations
Environment: Hortonworks, Java, Hadoop, HDFS, MapReduce, Hive, Pig, Oozie, Sqoop, Flume, Netezza, Qlik Sense
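A minimal sketch of the UNIX-filesystem-to-HDFS load mentioned above, using the Hadoop FileSystem API from Scala; the source file, target directory and copy options are illustrative assumptions (the same step is also commonly done with hdfs dfs -put).

```scala
// Minimal sketch; source and target paths are assumptions.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object LoadToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()                 // picks up core-site.xml / hdfs-site.xml
    val fs = FileSystem.get(conf)

    val local = new Path("/data/exports/customers.csv")    // file on the Unix host
    val target = new Path("/user/etl/staging/customers/")  // HDFS staging directory

    if (!fs.exists(target)) fs.mkdirs(target)
    // delSrc = false, overwrite = true: keep the local copy, replace any stale HDFS file
    fs.copyFromLocalFile(false, true, local, target)
    fs.close()
  }
}
```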
Confidential
Java Developer
Responsibilities:
- Built the application based on Rational Unified Process (RUP)
- Analyzed and developed UML diagrams with Rational Rose, including class diagrams, sequence diagrams, use case diagrams and activity diagrams
- Implemented the Middle-Tier employing design patterns like MVC, Business Delegate, Service Locator, Session Façade, Data Access Objects (DAO’s)
- Developed using MVC architecture, employing the Struts Framework and using the Validator and Tiles frameworks as plug-ins with Struts
- Developed user interface using JSP, JSP Tag libraries (JSTL) and Struts Tag Libraries
- Used EJB’s in the application and developed Session beans to house business login at the middle tier level
- Used Java Message Service (JMS) for reliable and asynchronous exchange of important information
- Used Hibernate in data access layer to access and update the information in database
- Implemented various XML technologies like XML schemas, JAXB parsers for cross platform data transfer
- Used JSON to pass objects between web pages and server-side application
- Used XSL-FO to generate PDF reports
- Extensively worked on XML parsers (SAX/DOM)
- Used WSDL and SOAP protocol for Web Services implementation
- Used JDBC to access DB2 UDB database for accessing customer information
- Developed application level logging using Log4J
- Used CVS for version control and JUnit for unit testing
- Involved in development of Tables, Indices, Stored procedures, Database Triggers and Functions
- Involved in documenting the application
Environment: J2EE 1.7, WebSphere Application Server v8.0, RAD, JSP 2.0, EJB 3.1, Struts 2.0, JMS, JSON, JDBC, JNDI, XML, XSL, XSLT, XSL-FO, WSDL, SOAP, Hibernate 4.0, RUP, Rational Rose (2000), Log4J, JUnit, CVS, IBM DB2 v8.2, Red Hat Linux, RESTful web services