Lead HPC Systems Engineer Resume Profile

San Francisco

Summary:

  • 7 years of High Performance Computing (HPC) system administration experience
  • Excellent knowledge in designing, prototyping and deploying HPC clusters
  • Strong understanding of cluster resource managers, schedulers, clusterware and GPU computing
  • Experience in benchmarking and performance optimization of large-scale HPC systems
  • Experience in installing and managing high performance storage and network interconnects
  • Managed a team of 5 system administrators
  • Strong knowledge of bioinformatics cluster software and sequencing instruments
  • Extensive experience in developing Linux installers for cluster software and OS deployment and automation
  • Experience in creating PRD and RFP documents for projects and evaluating hardware from multiple vendors before purchasing
  • Experience in building computer labs from the ground up, capacity planning and installing racks
  • Experience in purchasing over $30 million worth of equipment from multiple vendors across projects, from specification through negotiation, lease/purchase, and on-site delivery/installation
  • Extensive experience in troubleshooting Linux OS, filesystems, cluster hardware, scripting and GPU computing hardware
  • Ability to create, maintain, and implement scripts in order to reduce administrative efforts
  • Ability to operate in a multi-platform, multi-operating system, multi-component environment utilizing a large number of server builds and configurations.
  • Experienced in project management, with a demonstrated ability to drive for results
  • Excellent interpersonal, communication, customer interaction, documentation skills and decision-making ability

Technical Skills:

RHEL, SUSE, Debian, Ubuntu, PBS, LSF, SGE, MAUI, TORQUE, MOAB, ROCKS, SCYLD, Lustre, GPFS, NFS, DNS, NIS, LDAP, DHCP, MySQL, FTP, IPMI, SNMP, VMware, Apache and Tomcat web servers, Postgres, Kickstart, RPM building, OS install images, shell and Python scripting, RAID setup, high performance storage (DDN, Xanadu, RAID Inc.), 10GigE network interconnects and switches (Ethernet and InfiniBand; Mellanox, Cisco, Dell), Dell 10th, 11th and 12th generation hardware, Nvidia Tesla (C2050, M2075), C410x, NextIO

Professional Experience:

Lead HPC Systems Engineer

Confidential

Environment: Dell 10th, 11th and 12th generation hardware, Penguin Computing hardware, LAN, WAN, DNS, NIS, YP, high performance cluster technologies (ROCKS, Scyld), DHCP, high performance storage (DDN, Xanadu, RAID Inc.), 10 GigE layer 2 routers, RHEL 4/5/6, CentOS 4/5/6, Linux flavors, VMware Server and WS, Ubuntu, Nvidia Tesla (C2050, M2075), NextIO

  • Tech lead for compute design and administration of next-gen DNA sequencing instruments (SOLiD 1, 2, 3, 4 and SOLiD 5500).
  • Evaluated, prototyped, built and supported HPC clusters, servers and workstations for next-gen DNA sequencing instruments (SOLiD, SOLiD 5500) and for the R&D software development and V&V teams.
  • Led and mentored a team of system administrators supporting onsite compute hardware for the R&D software, V&V and system validation teams; also supported field engineers.
  • Developed and supported installers for cluster software (BioScope, LifeScope) to work on heterogeneous HPC clusters spanning various operating systems (RHEL, SUSE, Debian) and resource managers/schedulers (PBS, LSF, SGE, MAUI). Supported these installers on over 100 clusters at remote sites, ranging from 10 to 500 nodes.
  • Provided ongoing Linux and systems administration support for the R&D software development, V&V and Systems Integration teams and for field engineers.
  • Managed and provided system administration support to a genome sequencing center consisting of 20 SOLiD 5500 instruments.
  • Worked with Dell and Nvidia to address hardware issues and developed a diagnostic suite for field engineers.
  • Built and supported numerous clusters for R&D teams, including high performance storage and interconnects (Dell PE blade servers, C410x, DDN GPFS storage, Xanadu Lustre, Mellanox 10GigE).
  • Maintained custom kickstart and YUM repositories for automated and network-based system and software installations on a fleet of Linux based clusters, workstations and servers
  • Developed Python and Bash scripts to monitor system resources for performance tuning, automate data backups, and inventory the systems on the network.
  • Documented as-built installations, configurations, end-to-end system setup procedures, SOPs and troubleshooting guidelines.
  • Provided remote assistance with Network and System core configurations for the HPC Linux clusters, Linux servers and workstations at various customer locations.
  • Built installers using Red Hat Package Manager (RPM, rpmbuild) for CentOS/Red Hat installations and custom installers for other Linux distributions, and managed patch builds and maintenance for various packages.

Sr. Systems Admin / HPC Support Engineer

Confidential

Environment: Dell PowerEdge 1940, Sun Fire servers, LAN, WAN, DNS, NIS, YP, high availability, load balancing, high performance cluster technologies (ROCKS, Scyld), DHCP, SAN, RAID 10 and 5, SCSI storage, Gigabit switches and routers, RHEL 2/3/4/5, CentOS 3/4/5, Solaris 8/9, Linux flavors, VMware Server and WS.

  • Built HPC production clusters using ROCKS and Scyld clusterware and provided tech/production support.
  • Emulated an HPC cluster using VMware for the development team.
  • Integrated 15TB SCSI storage using NFS and Samba over the network for each cluster.
  • Managed user accounts using NIS and YP over LAN, WAN.
  • Extensively used IPMI and SNMP protocols, RAID 5 and 10 storage configurations, and shell scripts to monitor system health, along with Nagios and Dell OpenManage software.
  • Involved in testing and benchmarking clusters for different environments.
  • Monitored and modeled system stability and throughput performance.
  • Maintained Samba, Apache, LDAP, DHCP, mail (Postfix), Tomcat, MySQL and file servers.
  • Worked with IPMI, switched rack PDUs and remote-access UPS units to monitor system health.
  • Duties also included custom scripting and documentation of processes and systems.
  • Created custom rolls for ROCKS OS clusters.
  • Ported ROCKS installation to Scyld clusterware.
  • Implemented Samba and NFS over NAS and SCSI-attached storage.
  • Performed kernel tuning, performance analysis/tuning and troubleshooting.
  • Handled administration, integration and maintenance of all HPC cluster configurations, storage systems, backup systems, and network peripheral devices.
  • Evaluated servers from numerous hardware vendors for HPC and production environments.

System and Network Administrator

Confidential

Environment: IBM xSeries rack servers, BladeCenters, Dell PowerEdge, Sun Fire X2 servers, LAN, WAN, DNS, NIS, high availability, cluster technologies (RHAS, Rocks 4.x, Sun), DHCP server, SAN, Cisco switches and routers, Veritas Volume Manager 4.x, RHEL 2/3/4, CentOS, Solaris 8/9, Linux flavors.

  • Implemented and managed heterogeneous LAN/WAN networks.
  • Performed hardware/software maintenance for colocated servers, including equipment replacement and capacity planning.
  • Created test environments and built new servers based on application requirements, a major task carried out on a weekly basis.
  • Handled new UNIX/Linux server setups (Kickstart/JumpStart), kernel tuning, hardware upgrades, system and software installation, and patch management.
  • Made extensive use of Veritas Volume Manager and NFS.
  • Designed and maintained a high availability cluster of IBM xSeries and IBM BladeCenter servers to provide high-volume, dynamic WWW content with greater than 99% availability.
  • Maintained internal and customer secondary DNS (Linux/BIND, NIS, TCP/IPv4).
  • Administered various systems in the heterogeneous network, including day-to-day maintenance using cron jobs, log checks, shell scripting and performance monitoring.
  • Created and maintained Cisco router access lists / firewall.
  • Communicated, escalated and resolved problems in the integration, staging and production environments.
