
Hadoop Data Engineer- Developer Resume


Madison, WI

SUMMARY:

  • Over 8 years of diversified IT experience in all phases of analysis, design, development, implementation, administration, and support of Big Data and ETL/Data Warehousing applications.
  • Served in a variety of roles, including Infrastructure Architect, Solutions Engineer, Developer, Analyst, Administrator, and Production Support, across environments spanning ETL tools (Informatica Big Data Edition and Talend), databases (Oracle, SQL Server, Netezza, Teradata), Unix (RHEL, AIX), and scheduling tools (Autosys and Control-M), on both on-premise and AWS cloud platforms.
  • Experienced with multi-node, highly available Hadoop cluster environments, handling all Hadoop environment builds on Hortonworks Data Platform, Cloudera, MapR-FS, and Amazon EMR, including utilities such as HDFS, MapReduce, Oozie, Spark, Hive, Sqoop, Pig, and HBase.
  • Conduct proof-of-concept/pilot implementations of new and evolving software in the Hadoop ecosystem and provide recommendations.
  • Configure and administer lab environments with new installations of MapR/CDH/HDP and other open-source tools and third-party applications.
  • Experience in designing, developing, and implementing Extraction, Transformation, and Load (ETL) techniques on multiple database platforms and operating system environments.
  • Experience in data ingestion and populating a data lake from various file and RDBMS data sources into a Hadoop cluster using MapReduce/Pig/Hive/Sqoop.
  • Experience in building and deploying Talend ETL solutions on Hadoop/Big Data environments.
  • Expertise in implementing Spark applications using Scala and Python for both batch and machine learning workloads.
  • Expertise in developing, implementing, and executing projects in the finance, healthcare, insurance, and other domains.
  • Experience in integrating various relational data sources such as Oracle, Teradata, SQL Server, and mainframes, along with flat files, into a staging area.
  • Experience with Oracle 12c/11g and in writing complex SQL queries, PL/SQL stored procedures, functions, packages, triggers, and materialized views.
  • Expertise in developing and maintaining overall test methodology and strategy, documenting test plans and test cases, executing test cases and test scripts, and performing data validation through unit testing, system and integration testing, and UAT.
  • Experience on implementing projects in Waterfall, Agile and Kanban methodologies.
  • Experience using incident/change management tools like Remedy and ticketing tools like JIRA.
  • Knowledge of reporting tools such as MicroStrategy, OBIEE, and Business Objects, including Universes, standard reports, and ad hoc reports.
  • Created UNIX shell and Python scripts to automate workflows, and used the Autosys/Control-M job schedulers for batch scheduling of those scripts.

TECHNICAL SKILLS:

Data Warehousing Tools: Talend Big Data Studio, Informatica BDE, Power Center 10.x/9.x, Power Exchange 9.x, IDQ, IMM-Metadata Manager

Data Modelling Tools: Erwin Data Modeler r7, MS Visio 2010

Databases: Oracle 12c, Teradata 15, IBM DB2, MS SQL Server 2014, Netezza, Hive, NoSQL, MySQL, HBase

Reporting Tools: MicroStrategy

Scheduling Tools: Autosys, Control-M

RDBMS Utilities: SQL*Plus, SQL*Loader, Toad, PL/SQL Developer, BTEQ, Netezza Workbench

Programming: UNIX Shell Scripting, PL/SQL, Perl, Python, Scala

Operating Systems: UNIX (RHEL, Solaris 10, AIX 7.1), Mainframes, Windows, Apache Hadoop and HDFS

PROFESSIONAL EXPERIENCE:

Confidential, Madison, WI

Environment: Hortonworks Data Platform (HDP 2.7), Ambari, Hive 1.2, Bitbucket, Jenkins, Spark 1.6, Pig 0.16, Grunt shell, PySpark, Tez 0.7, Hue, HCatalog, Oozie

Hadoop Data Engineer- Developer

Responsibilities:

  • Working as a Data Engineer with an advanced analytics team to design, build, validate, and refresh ML predictive models.
  • Responsible for data identification, collection, exploration, and cleaning for modeling; participate in model development and identify additional data sources to augment the ML pipelines.
  • Ingest legacy datasets into HDFS using Sqoop scripts and populate the enterprise data lake by importing tables from Oracle, Greenplum, and mainframe sources, storing them in partitioned Hive tables using ORC with ZLIB compression (a minimal load sketch follows this list).
  • Populated both internal and external Hive tables through the Hive CLI and Beeline, with MySQL as the Hive metastore, using static and dynamic partitions and bucketing on multiple execution engines such as Tez and MapReduce.
  • Collaborate with various stakeholders (domain architects, solution architects, and business analysts) and provide initial training datasets and founding feature sets to data scientists for building machine learning predictive models with PySpark.
  • Score the model outputs for attributes of interest as defined by business stakeholders and apply business rules to threshold the outputs using Hive and Spark (see the thresholding sketch after this list).
  • Architected and implemented Hadoop clusters for various clients in both on-premise and AWS cloud deployments.
  • Experienced in working with structured data using HiveQL, ORC with ZLIB, Hive UDFs, partitions, bucketing, and managed/external tables.
  • Experience working with different HDFS file formats and structures such as Parquet, Avro, and ORC.
  • Knowledge of XML/JSON schema design and validation techniques.
  • Used a PySpark data-quality (DQ) package for profiling Hive datasets with Spark SQL.
  • Experience working with Spark components such as RDD transformations, Spark Core, Spark SQL, and Spark Streaming.
  • Experience implementing column family schemas in HBase.
  • Continuous integration and continuous delivery using Bitbucket and Jenkins.
  • Utilized Oozie workflows to orchestrate all flows on the cluster.
  • Integrate and leverage Geographic Information Systems (GIS) data using Hive and Python.
  • Perform unit/integration testing and user acceptance testing, and productionize the data science models using Autosys.
  • Experience monitoring and troubleshooting issues with Linux memory, CPU, OS, storage and network.
  • Monitoring systems and services; architecture design and implementation of Hadoop deployments, configuration management, backup, and disaster recovery systems and procedures.
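
Below is a minimal PySpark sketch of the ingestion pattern referenced in the list above: a partitioned Hive table stored as ORC with ZLIB compression, loaded with dynamic partitions. The database, table, column, and path names are hypothetical, and the SparkSession API (Spark 2.x+) is used for brevity; on the Spark 1.6 stack listed in the environment the same statements would run through HiveContext.

```python
from pyspark.sql import SparkSession

# Hypothetical names throughout; illustrates the partitioned-ORC load pattern only.
spark = (SparkSession.builder
         .appName("edl-claims-load")
         .enableHiveSupport()
         .getOrCreate())

# Allow dynamic partitioning on the load date.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("""
    CREATE TABLE IF NOT EXISTS edl.claims (
        claim_id  STRING,
        member_id STRING,
        amount    DECIMAL(12,2)
    )
    PARTITIONED BY (load_dt STRING)
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'ZLIB')
""")

# Data staged in HDFS by the Sqoop imports (staging path is hypothetical).
staged = spark.read.parquet("/data/staging/claims")

# insertInto expects the partition column to come last.
(staged.select("claim_id", "member_id", "amount", "load_dt")
       .write
       .mode("append")
       .insertInto("edl.claims"))
```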
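
And a hedged sketch of the scoring/thresholding step: model scores stored in a Hive table are banded against business-defined cutoffs with Spark. The table name, column names, and cutoff values are assumptions for illustration, not the actual business rules.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("score-thresholding")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical business-defined cutoffs.
HIGH_CUTOFF, MEDIUM_CUTOFF = 0.80, 0.50

scores = spark.table("analytics.model_scores")  # hypothetical Hive table

# Apply the thresholding rules as a derived column.
banded = scores.withColumn(
    "risk_band",
    F.when(F.col("score") >= HIGH_CUTOFF, "HIGH")
     .when(F.col("score") >= MEDIUM_CUTOFF, "MEDIUM")
     .otherwise("LOW"))

# Persist the banded output for downstream reporting.
banded.write.mode("overwrite").saveAsTable("analytics.model_scores_banded")
```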

Confidential, Charlotte, NC

Environment: Informatica 10.1 BDE, IDQ, Oracle 12c, RHEL 6.x, Shell (Korn/Bash) scripting, PAC2000, NDM, Autosys R11.3, SVN, Apache Hadoop, HDFS, CDH 5.0, MAPR 5.1, Talend Integration Suite for Big Data 5.6

Hadoop Engineer

Responsibilities:

  • Designed and implemented Hadoop clusters for various clients on both on-premise physical or virtual platforms and cloud deployments.
  • Worked on a multi-node HA Cloudera Hadoop cluster and a 10-node MapR 5.2 cluster on RHEL 5.8.
  • Imported and exported data between Oracle/MySQL and Hive/HDFS using Sqoop over JDBC, specifying mappers and splits.
  • Created Sqoop jobs with incremental loads to populate Hive external tables.
  • Analyzed data using the Hadoop components Hive and Pig; responsible for writing Pig Latin scripts in the Grunt shell to process semi-structured and unstructured web server log data.
  • Worked on performing transformations & actions on RDDs using Spark Core, Spark-SQL, and Spark-Streaming.
  • Executed Spark SQL queries against data in Hive by using the Hive context in Spark (see the Spark SQL sketch after this list).
  • Developed a POC for a Spark Streaming framework to process data from the Kafka messaging system as micro-batches and push the processed data into the NoSQL column-oriented store HBase, using Spark Streaming, Kafka, and Scala (a PySpark analogue is sketched after this list).
  • Developed Big Data jobs leveraging HDFS/Hive for data ingestion from various sources on both the Blaze and Spark frameworks, used Sqoop to import/export data between HDFS and relational databases, and developed Oozie workflows.
  • Ingested data into both internal and external Hive tables through the Hive CLI and Beeline, with MySQL as the Hive metastore, using static and dynamic partitions and bucketing.
  • Developed a proof of concept (POC) for Informatica connectivity to Splice Machine using ODBC/JDBC on Cloudera Hadoop Distribution 5.0.
  • Developed a POC for Informatica PowerExchange for HDFS and Hive on a multi-node Cloudera Hadoop distribution.
  • Extensively used the Informatica Developer tool in BDE to create data ingestion jobs into HDFS and the Hadoop data lake using complex data file objects such as Avro and Parquet, and to evaluate dynamic mapping capabilities.
  • Provided tier-2 infrastructure and application technical support for the Information HUB, WISE Matching Service, and the Metadata Manager Informatica and Oracle platforms as part of the IHUB technical support team.
  • Worked on Manta Flow in conjunction with IMM (Informatica Metadata Manager) to create data lineage diagrams, business glossaries, and metadata models for data governance and impact analysis.
  • Architected the technical refresh project intended to migrate the Informatica platform, Oracle databases, storage (NAS/SAN), and utility tools (Autosys, NDM) in conjunction with the data center alignment.
  • Acted as the ETL liaison with operating system (Unix) engineers and NAS engineers in setting up NAS filers, including allocation, replication, retention, snapshots, and backups on both NetApp and Isilon platforms, and with network engineers in setting up F5 load balancers for applications such as Oracle and NDM on top of VIPs (local) and WIPs (global).
  • Migrated/provisioned new users and groups, added servers to host groups and ACLs using BoKS, and set up single sign-on using Active Directory and LDAP authentication.
  • Configured NDM (Connect:Direct) node names using F5 load balancers, including netmaps and userfiles for both inbound and outbound file/data transmissions.
  • Designed and configured NAS sizing, backups, and replication, and the export of filers to server hosts.
  • Added alias/CNAME updates and IP reassignments for DNS hosts.
  • Remediated vulnerability patches on Linux and Windows servers as part of safety and soundness.
  • Designed and implemented the BCP failover and failback strategy.
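
A minimal sketch of running Spark SQL against Hive through the Hive context, as mentioned in the list above; the database and table names are hypothetical.

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

# Spark 1.x-era entry point for Hive-backed SQL (matches the CDH/MapR stack above).
sc = SparkContext(appName="hive-query")
hc = HiveContext(sc)

# Hypothetical web-server-log table populated by the Pig/Sqoop pipelines.
daily_counts = hc.sql("""
    SELECT load_dt, COUNT(*) AS row_cnt
    FROM edl.web_logs
    GROUP BY load_dt
    ORDER BY load_dt
""")
daily_counts.show()
```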
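
The Kafka-to-HBase streaming POC described above was built in Scala; the sketch below is a rough PySpark analogue under the same micro-batch design, using the DStream Kafka API of that Spark generation and the third-party happybase client as a stand-in for the HBase write. Broker, topic, table, and column-family names are hypothetical.

```python
import uuid

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils


def write_partition(records):
    # Writes one partition of a micro-batch to HBase via the (assumed)
    # happybase client; host, table, and column family are hypothetical.
    import happybase
    conn = happybase.Connection("hbase-gateway")
    table = conn.table("event_stream")
    for key, value in records:
        row_key = key if key else uuid.uuid4().hex
        table.put(row_key.encode("utf-8"),
                  {b"d:payload": value.encode("utf-8")})
    conn.close()


sc = SparkContext(appName="kafka-to-hbase-poc")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Direct (receiver-less) Kafka stream, as in the Spark 1.x/2.x DStream API.
stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "broker1:9092"})

stream.foreachRDD(lambda rdd: rdd.foreachPartition(write_partition))

ssc.start()
ssc.awaitTermination()
```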

Confidential, Birmingham, AL

Environment: Informatica PowerCenter 9.5.1, PowerExchange 9.5.1, Informatica Developer (IDQ and IDE), Metadata Manager (thin client), Data Masking, Talend Open Studio, Oracle 11g, Microsoft SQL Server 2008 R2, Toad for Oracle 12.1, IBM DB2, AIX UNIX 7.1, Red Hat Linux, Shell (Korn/Bash) scripting, Remedy, CDH 4.0, JIRA 6.2, Control-M.

Hadoop Developer

Responsibilities:

  • Involved in gathering and analyzing business requirements and designing the Hadoop stack as per those requirements.
  • Created shell scripts to ingest files from the edge node into HDFS (a Python sketch of this pattern follows the list).
  • Experience handling data in different file formats such as Text, Sequence, Avro, and Parquet, with Snappy compression.
  • Created Hive tables and partitioned tables using static and dynamic partitioning and bucketing.
  • Developed Pig Latin scripts for transformations while processing data in the data lake.
  • Created test case scenarios and test cases, executed the test cases, and thoroughly documented the process to perform functional testing of the application.
  • Worked with systems engineering team to deploy and test new Hadoop environments and expand existing Hadoop clusters
  • Involved in gathering business requirements, establishing functional specifications and translating them to design specifications.
  • Knowledge of dimensional modeling to design and develop the star schema using Erwin.
  • Worked as an administrator creating users, groups, folders, categories, and connections in the repository and assigning access privileges.
  • Developed mappings using Informatica PowerCenter 9.5.1 and PowerExchange to load data from multiple sources and file systems such as COBOL VSAM, flat files, and SEQ files into Oracle tables.
  • Developed PL/SQL procedures using cursors and loops for truncating tables, creating/dropping indexes on tables via target pre-load and post-load strategies to improve session performance in bulk loading, gathering statistics, and archiving table data.
  • Added error-checking code, exception handlers, and user-defined error messages in PL/SQL for better handling of bad data.
  • Extensively used Informatica client tools such as Source Analyzer, Target Designer, Mapping Designer, Mapplet Designer, and Transformation Developer to define source and target definitions and to code the data flow from the source systems to the data warehouse.
  • Created pre-session and post-session UNIX scripts for file delete, rename, archive, and zipping tasks.
  • Responsible for Testing and Validating the Informatica mappings against the pre-defined ETL design standards.
  • Implemented test plans and test cases for unit tests and peer reviews, fixed bugs, and validated mappings.
  • Worked in production support, finalized the scheduling of workflows using the Control-M tool, deployed the workflows to the production repository and server, and supported them through automated shell scripts.
  • Used JIRA for issue and project tracking under the Kanban methodology and for real-time reporting.
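
A hedged Python sketch of the edge-node ingestion and post-session housekeeping patterns mentioned in the list above (push a landed file into HDFS, then gzip and archive the local copy). The paths and directory names are hypothetical, and the original work used Korn/Bash shell scripts; this is an equivalent Python rendering, not the scripts themselves.

```python
import glob
import gzip
import os
import shutil
import subprocess

LANDING_DIR = "/data/edge/landing"   # hypothetical edge-node landing zone
ARCHIVE_DIR = "/data/edge/archive"   # hypothetical local archive
HDFS_TARGET = "/data/raw/incoming"   # hypothetical HDFS directory


def ingest_and_archive():
    """Push landed files to HDFS, then gzip and archive the local copies."""
    for path in glob.glob(os.path.join(LANDING_DIR, "*.dat")):
        # hdfs dfs -put fails if the target file already exists,
        # which protects against double-loading the same file.
        subprocess.run(["hdfs", "dfs", "-put", path, HDFS_TARGET], check=True)

        # Post-load: compress the source and move it to the archive area.
        archived = os.path.join(ARCHIVE_DIR, os.path.basename(path) + ".gz")
        with open(path, "rb") as src, gzip.open(archived, "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(path)


if __name__ == "__main__":
    ingest_and_archive()
```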
