Lead HPC Systems Engineer Resume Profile
San Francisco
Summary: |
- 7 years of High Performance Computing (HPC) system administration experience
- Excellent knowledge in designing, prototyping and deploying HPC clusters
- Strong understanding of cluster resource managers, schedulers, clusterware and GPU computing
- Experience in benchmarking and performance optimization of large-scale HPC systems
- Experience in installing and managing high performance storage and network interconnects
- Managed a team of 5 system administrators
- Strong knowledge of bioinformatics cluster software and sequencing instruments
- Extensive experience in developing Linux installers for cluster software and OS deployment and automation
- Experience in creating PRD and RFP documents for projects and evaluating hardware from multiple vendors before purchasing
- Experience in building computer labs from the ground up, capacity planning and installing racks
- Experience in purchasing over 30 million dollars' worth of equipment from various vendors for various projects, from specification through negotiation and lease/purchase to on-site delivery and installation
- Extensive experience in troubleshooting Linux OS, filesystems, cluster hardware, scripting and GPU computing hardware
- Ability to create, maintain, and implement scripts in order to reduce administrative efforts
- Ability to operate in a multi-platform, multi-operating system, multi-component environment utilizing a large number of server builds and configurations.
- Experienced in project management, with a demonstrated ability to drive for results
- Excellent interpersonal, communication, customer interaction and documentation skills, and strong decision-making ability
Technical Skills: |
RHEL, SUSE, Debian, Ubuntu, PBS, LSF, SGE, MAUI, TORQUE, MOAB, ROCKS, Scyld, Lustre, GPFS, NFS, DNS, NIS, LDAP, DHCP, MySQL, FTP, IPMI, SNMP, VMware, Apache and Tomcat web servers, Postgres, kickstart, RPM building, OS install images, shell and Python scripting, RAID setup, high performance storage (DDN, Xanadu, RAID Inc.), 10GigE network interconnects and switches (Ethernet and InfiniBand: Mellanox, Cisco, Dell), Dell 10th, 11th and 12th generation hardware, Nvidia Tesla (C2050, M2075), C410x, NextIO
Professional Experience: |
Lead HPC Systems Engineer, Confidential. Environment: Dell 10th, 11th and 12th generation hardware, Penguin Computing hardware, LAN, WAN, DNS, NIS, YP, high performance cluster technologies (ROCKS, Scyld), DHCP, high performance storage (DDN, Xanadu, RAID Inc.), 10 GigE layer 2 routers, RHEL 4/5/6, CentOS 4/5/6, Linux flavors, VMware Server and WS, Ubuntu, Nvidia Tesla (C2050, M2075), NextIO |
- Tech lead for compute design and administration of next-gen DNA sequencing instruments (SOLiD 1, 2, 3, 4 and SOLiD 5500).
- Evaluated, prototyped, built and supported HPC clusters, servers and workstations for next-gen DNA sequencing instruments (SOLiD, SOLiD 5500) and for the R&D software development and V&V teams.
- Led and mentored a team of system administrators supporting onsite compute hardware for the R&D software, V&V and system validation teams; also supported field engineers.
- Developed and supported installers for cluster software (BioScope, LifeScope) to work on heterogeneous HPC clusters running various operating systems (RHEL, SUSE, Debian) and resource managers/schedulers (PBS, LSF, SGE, MAUI). Supported the software on over 100 different clusters at remote sites, ranging from 10 to 500 nodes.
- Provided continued Linux and systems administration support for R&D software development, V&V and systems integration teams and field engineers.
- Managed and provided system administration support to a genome sequencing center consisting of 20 SOLiD 5500 instruments.
- Worked with Dell and Nvidia to address hardware issues and developed a diagnostic suite for field engineers.
- Built and supported numerous clusters for R&D teams, including high performance storage and interconnects (Dell PE blade servers, C410x, DDN GPFS storage, Xanadu Lustre, Mellanox 10GigE).
- Maintained custom kickstart and YUM repositories for automated, network-based system and software installations on a fleet of Linux-based clusters, workstations and servers.
- Developed Python and bash scripts to monitor system resources for performance tuning, automate data backups and inventory the systems on the network.
- Documented as-built installations, configurations, end-to-end system setup procedures, SOPs and troubleshooting guidelines.
- Provided remote assistance with core network and system configurations for HPC Linux clusters, Linux servers and workstations at various customer locations.
- Built installers using Red Hat Package Manager (RPM, rpmbuild) for CentOS/Red Hat installations and custom installers for other Linux distributions; managed patch builds and maintenance for various packages.
Sr. Systems Admin/HPC Support Engineer, Confidential. Environment: Dell PowerEdge 1940, Sun Fire servers, LAN, WAN, DNS, NIS, YP, high availability, load balancing, high performance cluster technologies (ROCKS, Scyld), DHCP, SAN, RAID 10 and 5 and SCSI storage, Gigabit switches and routers, RHEL 2/3/4/5, CentOS 3/4/5, Solaris 8/9, Linux flavors, VMware Server and WS |
- Built HPC production clusters using ROCKS and Scyld ClusterWare software and provided tech/production support.
- Emulated an HPC cluster using VMware for the development team.
- Integrated 15TB SCSI storage using NFS and Samba over the network for each cluster.
- Managed user accounts using NIS and YP over LAN and WAN.
- Extensively used IPMI and SNMP protocols, RAID 5 and 10 storage configurations, and shell scripts, along with Nagios and Dell OpenManage software, to monitor system health.
- Tested and benchmarked clusters for different environments.
- Monitored and modeled system stability and throughput performance.
- Maintained Samba, Apache, LDAP, DHCP, mail (Postfix), Tomcat, MySQL and file servers.
- Worked with IPMI, switched rack PDUs and remote-access UPS units to monitor system health.
- Duties also included custom scripting and documentation of processes and systems.
- Created custom rolls for ROCKS OS clusters.
- Ported ROCKS installation to Scyld ClusterWare.
- Implemented Samba and NFS over NAS and SCSI-attached storage.
- Performed kernel tuning, performance analysis/tuning and troubleshooting.
- Handled administration, integration and maintenance of all HPC cluster configurations, storage systems, backup systems and network peripheral devices.
- Evaluated servers from numerous hardware vendors for HPC and production environments.
System and Network Administrator, Confidential. Environment: IBM xSeries rack servers, BladeCenters, Dell PowerEdge, Sun Fire X2 servers, LAN, WAN, DNS, NIS, high availability, cluster technologies (RHAS, Rocks 4.x, Sun), DHCP server, SAN, Cisco switches and routers, Veritas Volume Manager 4.x, RHEL 2/3/4, CentOS, Solaris 8/9, Linux flavors |
- Implemented and managed heterogeneous LAN/WAN networks.
- Performed hardware/software maintenance for colocated servers, including equipment replacement and capacity planning.
- Created test environments and built new servers based on application requirements, a major task performed on a weekly basis.
- Set up new UNIX/Linux servers (Kickstart/JumpStart), including kernel tuning, hardware upgrades, system and software installation, and patch management.
- Used Veritas Volume Manager and NFS extensively.
- Designed and maintained a high availability cluster of IBM xSeries and IBM BladeCenter servers to provide high-volume, dynamic WWW content with greater than 99% availability.
- Maintained internal and customer secondary DNS (Linux/BIND, NIS, TCP/IPv4).
- Administered various systems in the heterogeneous network, including day-to-day maintenance using cron jobs, log checks, shell scripting and performance monitoring.
- Created and maintained Cisco router access lists/firewall.
- Communicated, escalated and resolved problems in the integration, staging and production environments.