
Sr. Hadoop Engineer / Talend Developer Resume


Chicago, IL

SUMMARY

  • Results-driven IT professional with nearly 10 years of experience in the development, implementation, deployment, and maintenance of solutions built on Hadoop ecosystem technologies.
  • Excellent experience in Hadoop ecosystem technologies such as HDFS, MapReduce, YARN, Spark, Hive, Pig, Oozie, Sqoop, Flume, ZooKeeper, and HBase.
  • Experience in Hadoop big data integration with Talend ETL, performing data extraction, loading, and transformation processes.
  • Experience in scheduling Talend jobs using Talend Administration Console (TAC).
  • Good understanding of the Data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables.
  • Good understanding of Ralph Kimball and Bill Inmon methodologies.
  • Experience with data modeling tools like ER/Studio, ERwin, and PowerDesigner.
  • Experience working with data warehousing concepts such as OLAP, OLTP, star schema, snowflake schema, logical data modeling, physical modeling, and dimensional data modeling; utilized the Talend tStatCatcher, tDie, and tLogRow components to create a generic job for capturing processing statistics.
  • In-depth understanding of Hadoop architecture and its components such as JobTracker, TaskTracker, NameNode, DataNode, ResourceManager, and MapReduce concepts.
  • Extensive experience working with Teradata, Oracle, Netezza, SQL Server, and MySQL databases.
  • Excellent understanding and knowledge of NOSQL databases like MongoDB, HBase, and Cassandra.
  • Strong experience working with different Hadoop distributions like Cloudera, Hortonworks, MapR and Apache distributions.
  • Experience in installing, configuring, supporting, and managing Hadoop clusters using Apache and Cloudera (CDH 5.x) distributions, including on Amazon Web Services (AWS).
  • Experience in Amazon AWS services such as EMR, EC2, S3, Cloud Formation, and RedShift which provides fast and efficient processing of Big Data.
  • Experience delivering projects with varying timelines using Agile (Scrum), Waterfall, RUP, and Kanban methodologies, working with remote team members and driving projects to success.
  • Experience managing change requests during the product/system development lifecycle (SDLC) and creating work breakdown structures (WBS); followed project management guidelines as specified by the Project Management Body of Knowledge (PMBOK).

TECHNICAL SKILLS

Big Data Ecosystem: Hadoop, MapReduce, Pig, Hive, YARN, Kafka, Flume, Sqoop, Impala, Oozie, ZooKeeper, Spark, Solr, Storm, Drill, Ambari, Mahout, MongoDB, Cassandra, Avro, Parquet and Snappy

Hadoop Distributions: Cloudera, MapR, Hortonworks, IBM BigInsights

Languages: Java, Scala, Python, JRuby, SQL, HTML, DHTML, JavaScript, XML and C/C++

NoSQL Databases: Cassandra, MongoDB, HBase

Java Technologies: Servlets, JavaBeans, JSP, JDBC, JNDI, EJB, Struts

XML Technologies: XML, XSD, DTD, JAXP (SAX, DOM), JAXB, JSON

Web Design Tools: HTML, DHTML, AJAX, JavaScript, jQuery, CSS, Angular.js, ExtJS

Development / Build Tools: Eclipse, Ant, Maven, Gradle, IntelliJ, Junit, log4J

Frameworks: Struts, Spring, Hibernate

App/Web servers: WebSphere, WebLogic, JBoss, Tomcat

DB Languages: MySQL, PL/SQL, PostgreSQL, Oracle

RDBMS: Teradata, Oracle … MS SQL Server, MySQL, DB2

Operating systems: UNIX, LINUX, Mac OS, Windows Variants

Data analytical tools: R, SAS, MATLAB

ETL Tools: Tableau, Talend, Informatica, Ab Initio, Hyperion

PROFESSIONAL EXPERIENCE

Confidential, Chicago IL

Sr. Hadoop Engineer / Talend Developer

Responsibilities:

  • Worked closely with business users for requirements gathering, understanding intent, and defining scope; responsible for project status updates to business users.
  • Performed analysis and provided summaries for business questions, initiating proactive investigations into data issues that impact reporting, business analysis, or program execution.
  • Researched and recommended a suitable technology stack for Hadoop migration, considering the current enterprise architecture.
  • Created Talend Spark jobs that collect data from relational databases and load it into HBase.
  • Worked on extracting and enriching HBase data across multiple tables using joins in Spark.
  • Worked on writing APIs to load the processed data to HBase tables.
  • Replaced existing MapReduce programs with Spark applications written in Scala.
  • Built on-premise data pipelines using Kafka and Spark Streaming, consuming the feed from the API streaming gateway REST service (see the sketch after this list).
  • Processed extremely large volumes of XML data in Hadoop through Oracle.
  • Integrated and tested on the Oracle Big Data Appliance.
  • Developed Hive UDFs to handle data quality and create filtered datasets for further processing.
  • Utilized Agile methodology tools such as Kanban and Scrum to track all project management processes.
  • Tracked product progress including bug reports using Jira and MS Project
  • Wrote Sqoop scripts to import data into Hive/HDFS from RDBMS.
  • Worked on Kafka streams API for data transformation.
  • Created and updated test plans, test cases, and design steps; performed defect and bug tracking using HP QC/ALM, Bugzilla, and JIRA.
  • Created and updated the Requirements Traceability Matrix (RTM) to link test cases to requirements.
  • Implemented the ELK stack (Elasticsearch, Logstash & Kibana) logging framework on AWS.
  • Set up Spark on EMR to process large volumes of data stored in Amazon S3.
  • Developed Oozie workflow for scheduling & orchestrating the ETL process.
  • Used Talend to create workflows for processing data from multiple source systems.
  • Created sample flows in Talend and StreamSets with custom-coded JARs, and analyzed the performance of StreamSets versus Kafka Streams.
  • Wrote queries, stored procedures, functions, PL/SQL packages, and triggers in Oracle, along with reports and scripts.
  • Created sessions and batches to move data at specific intervals and on demand using Server Manager.
  • Authenticated access with Kerberos on Oracle Big Data Appliance
  • Worked on a mobile application and performed manual testing within an Agile (Scrum) SDLC, including three-week sprints and daily stand-up meetings.
  • Involved in analyzing requirement specifications and developed Test Plans, Test Scenarios and Test Cases to cover overall quality assurance testing.
  • Developed Hive Queries to analyze the data in HDFS to identify issues and behavioral patterns.
  • Created indexes for various statistical parameters on Elastic Search and generated visualization using Kibana
  • Involved in writing optimized Pig Script along with developing and testing Pig Latin Scripts.
  • Deployed applications using Jenkins framework integrating Git- version control with it.
  • Participated in production support on a regular basis to support the Analytics platform
  • Used Rally for task/bug tracking.
  • Used GIT for Version Control.
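
Below is a minimal, self-contained sketch of the kind of Kafka-to-HDFS pipeline described in this list, written with PySpark Structured Streaming for illustration; the broker address, topic name, event schema, and HDFS paths are hypothetical placeholders rather than the production Talend/Spark jobs.

    # pipeline_sketch.py - hypothetical Kafka -> Spark -> HDFS pipeline
    # Requires the spark-sql-kafka package on the Spark classpath.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("gateway-events-ingest").getOrCreate()

    # Assumed shape of the JSON events coming off the REST gateway feed.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("payload", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
           .option("subscribe", "gateway-events")               # placeholder topic
           .load())

    # Kafka delivers bytes; cast the value and parse the JSON payload.
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(F.from_json("json", schema).alias("e"))
              .select("e.*"))

    # Land micro-batches as Parquet on HDFS; checkpointing makes the stream restartable.
    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/gateway/events")
             .option("checkpointLocation", "hdfs:///checkpoints/gateway_events")
             .trigger(processingTime="1 minute")
             .start())
    query.awaitTermination()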

Environment: MapR, Hadoop, HBase, HDFS, AWS, Pig, Erwin, Hive, Drill, Spark, Spark SQL, Spark Streaming, MapReduce, Kafka, Flume, Sqoop, Oozie, Jupyter Notebook, Python 2.9, PL/SQL, Docker, Hyperion, Scala, HP ALM, Talend Big Data Studio 6.0, Shell Scripting, Java, Oracle Data Integrator 12c

Confidential - Richmond VA

Hadoop Engineer

Responsibilities:

  • Worked with highly unstructured and semi structured data of 100TB+ in size
  • Involved in loading data from UNIX file system to HDFS using command and scripts.
  • Created Hive tables using HiveQL, loading data and writing Hive queries that run internally as MapReduce jobs.
  • Loaded data from different sources (databases & files) into Hive using Talend.
  • Migrated data from relational databases (Oracle & Teradata) and external sources to HDFS using Sqoop, Flume, and Spark.
  • Analyzed the data by performing Hive queries and running Pig scripts.
  • Designed both Managed and External tables in Hive to optimize performance.
  • Regular monitoring of Hadoop Cluster to ensure installed applications are free from errors and warnings.
  • Optimized Hive queries using Partitioning and Bucketing techniques
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Developed Pig and Hive scripts for end users, analysts, and product managers to support ad hoc analysis.
  • Managed External tables in Hive for optimized performance using Sqoop jobs.
  • Solved performance issues in Hive and Pig scripts with an understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.
  • Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Worked in a Kerberos-secured Hadoop environment supported by the Cloudera team.
  • Worked on loading and transforming of large sets of structured, semi structured, and unstructured data.
  • Developed functions, views and triggers for automation.
  • Responsible for gathering data migration requirements.
  • Developed Spark jobs and Hive jobs to summarize and transform data (see the sketch after this list).
  • Implemented Spark Scala applications using higher-order functions for both batch and interactive analysis requirements.
  • Developed Spark scripts for data analysis in both Python and Scala.
  • Built on-premise data pipelines using Kafka and Spark for real-time data analysis.
  • Created reports in Tableau for visualization of the data sets created, and tested native Drill, Impala, and Spark connectors.
  • Indexed documents using Elasticsearch.
  • Analyzed Test Strategy and Test Plan documents to generate logical Test Scenarios and Test Cases.
  • Wrote test cases and executed them manually from HP ALM to test the application for functional, system integration, smoke, regression, and stress testing.
  • Involved in creating gap analysis document, clearly identifying the data, business process and workflows of the organization with respect to salesforce.com implementation.
  • Analyzed the SQL scripts and designed the solution to implement using Scala.
  • Implemented complex Hive UDFs to execute business logic within Hive queries.
  • Responsible for bulk-loading data into HBase using MapReduce by directly creating HFiles and loading them.
  • Evaluated the performance of Spark SQL vs. Impala vs. Drill on offline data as part of a PoC.
  • Worked on Solr configuration and customizations based on requirements.
  • Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and loading the results back into HDFS.
  • Wrote T-SQL to implement stored procedures and functions for different tasks.
  • Responsible for creating databases, tables, indexes, unique/check constraints, views, stored procedures, triggers, and rules.
  • Implemented the ELK (Elasticsearch, Logstash, Kibana) stack to collect and analyze the logs produced by the Spark cluster.
  • Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
  • Collected and aggregated large amounts of log data using Flume, staging it in HDFS for further analysis.
  • Responsible for developing data pipeline by implementing Kafka producers and consumers.
  • Exported the analyzed data to Impala to generate reports for the BI team.
  • Worked on managing and reviewing Hadoop Log files to resolve any configuration issues.
  • Developed a program to extract named entities from OCR files.
  • Fixed defects as needed during the QA phase, support QA testing, troubleshoot defects and identify the source of defects.
  • Used Mingle and later moved to JIRA for task/bug tracking.
  • Used GIT for version control
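
A minimal sketch, for illustration only, of a PySpark job of the kind used above to summarize Sqoop-landed data into a partitioned Hive table; the database, table, and column names here are hypothetical placeholders.

    # daily_summary_sketch.py - hypothetical Spark summarization into a partitioned Hive table
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("daily-txn-summary")
             .enableHiveSupport()          # read/write Hive tables directly
             .getOrCreate())

    # Staging table assumed to be loaded by a Sqoop import job.
    raw = spark.table("staging.transactions")

    # Summarize per account per day.
    daily = (raw.groupBy("account_id", "txn_date")
                .agg(F.sum("amount").alias("total_amount"),
                     F.count(F.lit(1)).alias("txn_count")))

    # Write as Parquet, partitioned by date, so downstream Hive queries can prune partitions.
    (daily.write
          .mode("overwrite")
          .format("parquet")
          .partitionBy("txn_date")
          .saveAsTable("analytics.daily_txn_summary"))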

Environment: Hadoop 2.2, Informatica PowerCenter 9.x, Erwin, HDFS, HBase, Flume 1.4, Sqoop 1.4.3, Hive 0.13.1, Avro 1.7.4, Parquet 1.4, MapR, Cloudera, AWS, Pig, Impala, Drill, Spark SQL, Hyperion, OCR, ZooKeeper, PL/SQL, Cosmos DB, Tableau, HP ALM, Shell Scripting, Gerrit, Java, Redis, Elasticsearch, Oracle Data Integrator

Confidential - Atlanta GA

Hadoop Developer

Responsibilities:

  • Responsible for building scalable distributed data solutions using Hadoop.
  • Handled importing of data from multiple data sources (Oracle, SQL Server) using Sqoop, performed Cleaning, Transformations and Joins using Pig.
  • Pushed data as delimited files into HDFS using Talend Big Data Studio.
  • Involved in writing MapReduce programs using Java.
  • Loaded and transformed data into HDFS from large sets of structured data in Oracle/SQL Server using Talend Big Data Studio.
  • Exported analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Provided support to data analyst in running Hive queries.
  • Continuous monitoring and managing the Hadoop cluster using Cloudera Manager.
  • Created Hive tables, partitions to store different Data formats.
  • Involved in loading data from UNIX file system to HDFS.
  • Worked on managing and reviewing Hadoop log files.
  • Consolidated all defects, reported them to PMs/leads for prompt fixes by the development teams, and drove them to closure.
  • Supported existing BI solution, data marts and ETL processes.
  • Migration of 100+ TBs of data from different databases (i.e. Oracle, SQL Server) to Hadoop.
  • Worked with various file formats (Avro, Parquet, and text) and SerDes, using Snappy compression.
  • Used Pig Custom Loaders to load different forms of data files such as XML, JSON and CSV.
  • Designed a dynamic partitioning mechanism in Hive for optimal query performance, reducing report generation time to meet SLA requirements.
  • Analyzed the requirement to setup a cluster.
  • Worked on analyzing Hadoop cluster and different big data analytic tools including MapReduce, Hive and Spark.
  • Involved in loading data from the Linux file system, servers, and Java web services using Kafka producers and partitions.
  • Implemented custom Kafka encoders for custom input formats to load data into Kafka partitions.
  • Implemented Storm topologies to pre-process data before moving it into HDFS.
  • Implemented Kafka high-level consumers to get data from Kafka partitions and move it into HDFS.
  • Implemented POC to migrate MapReduce programs into Spark transformations using Spark and Scala.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
  • Developed the MapReduce programs to parse the raw data and store the pre-Aggregated data in the partitioned tables.
  • Loaded and transformed large sets of structured, semi structured, and unstructured data with MapReduce, Hive and pig.
  • Developed MapReduce programs in Java for parsing the raw data and populating staging Tables.
  • Implemented Python scripts for writing MapReduce programs using Hadoop Streaming (see the sketch after this list).
  • Involved in using HCATALOG to access Hive table metadata for MapReduce or Pig code.
  • Implemented custom serializers, interceptors, sources, and sinks in Flume as required to ingest data from multiple sources.
  • Set up a fan-out workflow in Flume to design a V-shaped architecture that takes data from many sources and ingests it into a single sink.
  • Developed Shell, Perl and Python scripts to automate and provide Control flow to Pig scripts.
  • Implemented monitoring on all NiFi flows to receive notifications when no data flows through a flow for more than a specified time.
  • Converted unstructured data to structured data by writing Spark code.
  • Indexed documents using Apache Solr.
  • Set up SolrCloud for distributed indexing and search.
  • Worked on NoSQL databases like Cassandra and MongoDB for POC purposes, storing images and URIs.
  • Integrated bulk data into the Cassandra file system using MapReduce programs.
  • Worked on MongoDB for distributed storage and processing.
  • Designed and implemented Cassandra and associated RESTful web service.
  • Implemented Row Level Updates and Real time analytics using CQL on Cassandra Data.
  • Worked on analyzing and examining customer behavioral data using Cassandra.
  • Created partitioned tables in Hive and mentored the analyst and SQA teams in writing Hive queries.
  • Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS.
  • Involved in cluster setup, monitoring, test benchmarks for results.
  • Involved in build/deploy applications using Maven and integrated with CI/CD server Jenkins.
  • Involved in Agile methodologies, daily Scrum meetings, and sprint planning.
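
For illustration, a minimal Hadoop Streaming mapper/reducer pair in Python of the kind referenced in this list, counting events per date; the input layout, field positions, and HDFS paths are assumptions, not the production code.

    #!/usr/bin/env python
    # mapper.py - emits "event_date<TAB>1" for each tab-delimited input record
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue                 # skip malformed records
        event_date = fields[1]       # assumed position of the date column
        print("%s\t1" % event_date)

    #!/usr/bin/env python
    # reducer.py - sums the counts per key (Hadoop delivers input sorted by key)
    import sys

    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key == current_key:
            count += int(value)
        else:
            if current_key is not None:
                print("%s\t%d" % (current_key, count))
            current_key, count = key, int(value)
    if current_key is not None:
        print("%s\t%d" % (current_key, count))

    # Example submission (paths are placeholders):
    # hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    #   -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py \
    #   -input /data/raw/events -output /data/agg/events_by_date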

Environment: Hadoop 2.6 cluster, Informatica 9.x, HDFS, Flume 1.5, Sqoop 1.4.3, Erwin, Hive 1.0.1, Pig, NiFi, Spark 1.4, HBase, XML, JSON, Teradata, Oracle, MongoDB, Cassandra, AWS Redshift, Python, Scala, Snowflake, Solr, ZooKeeper, MySQL, Talend Big Data Studio 6.0/5.5, Shell Scripting, Linux Red Hat, Java, Oracle Hyperion 12c

Confidential

Hadoop & ETL Developer/Administrator

Responsibilities:

  • Responsible for loading customer data and event logs from Oracle and Teradata databases into HDFS using Sqoop.
  • End-to-end performance tuning of Hadoop clusters and Hadoop MapReduce routines against very large data sets.
  • Developed Pig UDFs to pre-process the data for analysis.
  • Loaded data from the Linux file system to HDFS.
  • Imported and exported data into HDFS and Hive using Sqoop and Flume.
  • Used Cloudera Manager, an end-to-end tool, to manage Hadoop operations.
  • Wrote MapReduce jobs to generate reports on the number of activities created on a particular day from data dumped from multiple sources, with the output written back to HDFS.
  • Installed and configured Hadoop HDFS, MapReduce, Pig, Hive, and Sqoop.
  • Wrote Pig Scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
  • Prepared a Tez build from source and ran Hive query jobs using the Tez execution engine rather than MapReduce jobs for better performance.
  • Participated in the requirements gathering, design, development, testing, and analysis phases of the project, documenting business requirements by conducting workshops/meetings with various business users.
  • Participated in client calls to gather and analyze the requirement.
  • Worked on importing and exporting data into HDFS from database and vice versa using Sqoop.
  • Worked on resource management of the Hadoop cluster, including adding/removing cluster nodes for maintenance and capacity needs.
  • Responsible for monitoring the Hadoop cluster using Zabbix/Nagios.
  • Converted the existing relational database model to the Hadoop ecosystem.
  • Installed and configured Flume and Oozie on the Hadoop cluster.
  • Managed, defined, and scheduled jobs on the Hadoop cluster.
  • Generated datasets and loaded them into the Hadoop ecosystem.
  • Worked with Linux systems and RDBMS database on a regular basis to ingest data using Sqoop.
  • Continuous monitoring and managing the Hadoop cluster through Cloudera Manager.
  • Involved in review of functional and non-functional requirements.
  • Implemented frameworks using Java and Python to automate the ingestion flow (see the sketch after this list).
  • Responsible to manage data coming from different sources.
  • Loaded CDRs from relational databases using Sqoop and from other sources into the Hadoop cluster using Flume.
  • Worked on processing large volumes of data in parallel using Talend functionality.
  • Involved in loading data from UNIX file system and FTP to HDFS.
  • Designed and implemented HIVE queries and functions for evaluation, filtering, loading and storing of data.
  • Creating Hive tables and working on them using HiveQL.
  • Developed data pipeline using Kafka and Storm to store data into HDFS.
  • Created reporting views in Impala using Sentry policy files.
  • Developed Hive queries to analyze the output data.
  • Collected the logs data from web servers and stored in to HDFS using Flume.
  • Used HIVE to do transformations, event joins and some pre-aggregations before storing the data onto HDFS.
  • Implemented several Akka actors responsible for loading data into Hive.
  • Design and implement Spark jobs to support distributed data processing.
  • Supported the existing MapReduce programs running on the cluster.
  • Developed and implemented two service endpoints (end to end) in Java using the Play framework, Akka, and Hazelcast.
  • Wrote shell scripts to monitor the health of Hadoop daemon services and respond accordingly to warning or failure conditions.
  • Wrote Java code to format XML documents; upload them to Solr server for indexing.
  • Involved in Hadoop cluster tasks such as adding and removing nodes without affecting running jobs or data.
  • Developed PowerCenter mappings to extract data from various databases and flat files and load it into the data mart using Informatica.
  • Followed agile methodology for the entire project.
  • Installed and configured Apache Hadoop, Hive and Pig environment.
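
A minimal sketch of the kind of Python automation used around the ingestion flow described above, shown here as a thin wrapper over the Sqoop CLI; the JDBC URL, table list, and HDFS paths are placeholders, and credential handling is omitted.

    #!/usr/bin/env python
    # ingest_sketch.py - hypothetical wrapper that drives Sqoop imports for a list of tables
    import subprocess
    import sys

    def sqoop_import(jdbc_url, table, target_dir, mappers=4):
        """Run a Sqoop import for one table; return True on success."""
        cmd = [
            "sqoop", "import",
            "--connect", jdbc_url,
            "--table", table,
            "--target-dir", target_dir,
            "--num-mappers", str(mappers),
            "--as-textfile",
            # --username / --password-file would normally be supplied here
        ]
        rc = subprocess.call(cmd)
        if rc != 0:
            sys.stderr.write("Sqoop import failed for %s (rc=%d)\n" % (table, rc))
        return rc == 0

    if __name__ == "__main__":
        jdbc = "jdbc:oracle:thin:@//dbhost:1521/ORCL"          # placeholder connection
        for tbl in ["CUSTOMERS", "CALL_DETAIL_RECORDS"]:       # placeholder tables
            sqoop_import(jdbc, tbl, "/data/raw/%s" % tbl.lower())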

Environment: Hadoop, Hortonworks, HDFS, Pig, Hive, Flume, Sqoop, Ambari, Ranger, Python, Akka, Play framework, Informatica, Elasticsearch, Linux (Ubuntu), Solr

Confidential

Data Analyst

Responsibilities:

  • Acted as a liaison between IT developers and business stakeholders and was instrumental in resolving conflicts between the management and technical teams.
  • Worked closely with business users for requirements gathering, understanding intent, and defining scope; responsible for project status updates to business users.
  • Performed analysis and provided summaries for business questions, initiating proactive investigations into data issues that impact reporting, business analysis, or program execution.
  • Created views for reporting purpose which involves complex SQL queries with sub-queries, inline views, multi table joins, with clause and outer joins as per the functional needs in the Business Requirements Document (BRD).
  • Involved in performance tuning of slowly running SQL queries and created indexes, constraints and rules on database objects for optimization.
  • Developed functions, views and triggers for automation.
  • Assisted in mining data from the SQL database that was used in several significant presentations.
  • Assisted in offering support to other personnel who were required to access and analyze the SQL database.
  • Worked on Python modules and packages.
  • Used Python scripts to update content in the database and manipulate files (see the sketch after this list).
  • Analyzed the various backup compression tools available and made recommendations.
  • Performed data analysis and data profiling using complex SQL on various sources systems including Oracle and Teradata.
  • Involved with data profiling for multiple sources and answered complex business questions by providing data to business users.
  • Designed and implemented PL/SQL stored procedures, functions, packages, views, cursors, ref cursors, collections, records, object types, database triggers, exception handling, forms, reports, and table partitioning.
  • Wrote T-SQL to implement stored procedures and functions for different tasks.
  • Responsible for creating databases, tables, indexes, unique/check constraints, views, stored procedures, triggers, and rules.
  • Optimized the performance of queries by modifying the existing index system and rebuilding indexes.
  • Coordinated project activities between clients and internal groups and information technology, including project portfolio management and project pipeline planning
  • Worked in close collaboration with the Project Management Office and business users to gather, analyze and document the functional requirements for the project.
  • Responsible for development of workflow analysis, requirement gathering, data governance, data management and data loading.
  • Analyzed and documented data flow from source systems and managed the availability and quality of data.
  • Performed root cause analysis of data discrepancies between different business systems by reviewing Confidential business rules and data models, and provided the analysis to the development/bug-fix team.
  • Wrote queries, stored procedures, functions, PL/SQL packages, and triggers in Oracle, along with reports and scripts.
  • Evaluated existing practices for storing and handling important financial data for compliance; ensured corporate compliance with billing and credit standards, with direct responsibility for accounts receivable and supervision of accounts payable.
  • Setup data governance touch points with key teams to ensure data issues were addressed promptly.
  • Responsible for facilitating UAT (User Acceptance Testing), PPV (Post Production Validation) and maintaining Metadata and Data dictionary.
  • Responsible for source data cleansing, analysis, and reporting using pivot tables, formulas (VLOOKUP and others), data validation, conditional formatting, and graph and chart manipulation in Excel.
  • Actively involved in data modeling for the QRM Mortgage Application migration to Teradata and developed the dimensional model.
  • Developed SQL*Loader control programs and PL/SQL validation scripts for validating data to load data from staging tables to production tables.
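
A minimal sketch of the kind of Python script used above to update database content and produce a file extract, assuming the cx_Oracle driver; the connection details, table, and column names are hypothetical placeholders.

    #!/usr/bin/env python
    # db_update_sketch.py - update stale rows in Oracle and export a summary to CSV
    import csv
    import cx_Oracle

    conn = cx_Oracle.connect("app_user", "app_password", "dbhost:1521/ORCL")  # placeholder DSN
    cur = conn.cursor()

    # Flag records not updated in the last 30 days (placeholder table/columns).
    cur.execute(
        "UPDATE stg_accounts SET status = :new_status WHERE last_update < SYSDATE - 30",
        new_status="STALE",
    )
    conn.commit()

    # Pull a small summary and write it out for reporting.
    cur.execute("SELECT region, COUNT(*) FROM stg_accounts GROUP BY region")
    with open("account_summary.csv", "w") as out:
        writer = csv.writer(out)
        writer.writerow(["region", "account_count"])
        writer.writerows(cur.fetchall())

    cur.close()
    conn.close()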

Environment: Informatica PowerCenter 8.x (Repository Manager, Designer, Workflow Manager, and Workflow Monitor), Agile, Teradata, Oracle 12c, SQL, PL/SQL, Unix Shell Scripts, Python 2.7, MDX/DAX, SAS, PROC SQL, MS Office Tools, MS Project, Windows XP, MS Access, Pivot Tables
