# Pick legacy or open, not both
# 550 is the latest branch, 535 is a previous version
# legacy driver
#sudo apt-get install --verbose-versions nvidia-headless-550-server -y
# open driver
sudo apt-get install --verbose-versions nvidia-headless-550-server-open -y
# Fabric Manager and tools
sudo apt-get install --verbose-versions --no-install-recommends \
nvtop python3-pip \
nvidia-utils-550-server libnvidia-nscq-550 nvidia-fabricmanager-550 -y
Reboot
As of May 24, 2024, NVIDIA drivers are installed automatically as part of the allocation process.
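Since the drivers may already be present, it can help to check which flavor is installed before running the install commands above. The following is a minimal sketch, not part of the original procedure: `driver_flavor` is a hypothetical helper that reads installed package names on stdin and reports `legacy`, `open`, `both`, or `none`. It assumes the metapackage names follow the `nvidia-headless-<branch>-server[-open]` pattern used above.

```shell
# Hypothetical helper: classify the installed NVIDIA headless driver flavor.
# Reads one package name per line on stdin.
driver_flavor() {
    legacy=0
    open=0
    while IFS= read -r pkg; do
        case "$pkg" in
            nvidia-headless-*-server-open) open=1 ;;
            nvidia-headless-*-server)      legacy=1 ;;
        esac
    done
    if [ "$open" -eq 1 ] && [ "$legacy" -eq 1 ]; then
        echo both      # conflicts with "pick legacy or open, not both"
    elif [ "$open" -eq 1 ]; then
        echo open
    elif [ "$legacy" -eq 1 ]; then
        echo legacy
    else
        echo none
    fi
}
```

One way to feed it is `dpkg -l 'nvidia-headless-*' | awk '$1 == "ii" {print $2}' | driver_flavor`, which passes only fully installed packages. A result of `both` suggests removing one flavor before continuing.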
Test the Basics
# Listing GPUs should work and show 8 devices
ubuntu@vp-h100-gpu-node:~$ nvidia-smi --list-gpus
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-bcb693b8-d838-4d24-9ae8-654973e733b8)
GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-b1d75684-8e00-4d80-aad0-1a688c79340e)
GPU 2: NVIDIA H100 80GB HBM3 (UUID: GPU-1f18b2f0-7783-472f-99f3-ed4c216186dc)
GPU 3: NVIDIA H100 80GB HBM3 (UUID: GPU-c572de08-32e2-43ba-8eed-858bc1907fac)
GPU 4: NVIDIA H100 80GB HBM3 (UUID: GPU-46befe3e-07a0-4d5b-9cdc-553727ac8116)
GPU 5: NVIDIA H100 80GB HBM3 (UUID: GPU-9ec0da39-8cff-4797-bfcc-1088ee552f0e)
GPU 6: NVIDIA H100 80GB HBM3 (UUID: GPU-5cb37f36-08ea-45a2-9077-09be13e2ef0b)
GPU 7: NVIDIA H100 80GB HBM3 (UUID: GPU-7b47282a-2415-4241-bd14-a1c42522776a)
# Fabric state should be Completed with status Success
ubuntu@vp-h100-gpu-node:~$ nvidia-smi -q -i 0 | grep -i -A 2 Fabric
Fabric
State : Completed
Status : Success
# CPU (2x52 cores)
ubuntu@vp-h100-gpu-node:~$ nproc --all
104
# RAM (~1TB)
ubuntu@vp-h100-gpu-node:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           1.0Ti       4.7Gi       995Gi       6.0Mi       7.5Gi       998Gi
Swap:          8.0Gi          0B       8.0Gi
# Storage (6x2.9TB)
ubuntu@vp-h100-gpu-node:~$ lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
nvme1n1     259:1    0  2.9T  0 disk
├─nvme1n1p1 259:2    0  512M  0 part /boot/efi
└─nvme1n1p2 259:3    0  2.9T  0 part /
nvme2n1     259:9    0  2.9T  0 disk
nvme6n1     259:10   0  2.9T  0 disk
nvme3n1     259:11   0  2.9T  0 disk
nvme4n1     259:12   0  2.9T  0 disk
nvme5n1     259:13   0  2.9T  0 disk
On some hosts, /boot/efi and / are mounted on a smaller 500G NVMe drive instead of the first of the six larger 2.9T NVMe disks.
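The GPU and fabric checks above can also be scripted so they are easy to repeat after a reboot. This is a rough sketch under stated assumptions: `count_gpus` and `fabric_ok` are hypothetical helper names, and the parsing assumes `nvidia-smi` output shaped like the samples shown in this section.

```shell
# Count GPU lines in `nvidia-smi --list-gpus` output read from stdin.
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:'
}

# Succeed only if the Fabric block on stdin reports
# State "Completed" and Status "Success".
fabric_ok() {
    awk -F: '
        $1 ~ /^[ \t]*State[ \t]*$/  { state = $2 }
        $1 ~ /^[ \t]*Status[ \t]*$/ { status = $2 }
        END {
            gsub(/^[ \t]+|[ \t]+$/, "", state)
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            exit (state == "Completed" && status == "Success") ? 0 : 1
        }'
}
```

On a node, the helpers could be used as `[ "$(nvidia-smi --list-gpus | count_gpus)" -eq 8 ] && nvidia-smi -q -i 0 | grep -i -A 2 Fabric | fabric_ok && echo "GPU checks passed"`. Keeping the parsing separate from the `nvidia-smi` calls makes it possible to try the helpers against saved output.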
Install CUDA and Optional Tools
# https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
# CUDA toolkit, other meta packages available - see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#meta-packages
sudo apt-get --verbose-versions install cuda-toolkit-12-4 -y
# cuDNN
sudo apt-get install nvidia-cudnn -y
# Docker
sudo apt-get install docker.io nvidia-container-toolkit -y
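After the packages are installed, two follow-up steps are commonly needed: putting the toolkit binaries on `PATH` and registering the NVIDIA runtime with Docker. The sketch below assumes the deb packages placed the toolkit under `/usr/local/cuda-12.4`; `configure_docker_gpu` is a hypothetical wrapper name and is only defined here, not run.

```shell
# Put the CUDA 12.4 binaries (nvcc and friends) on PATH for this shell.
# Assumes the toolkit lives in /usr/local/cuda-12.4.
export PATH=/usr/local/cuda-12.4/bin${PATH:+:${PATH}}

# Register the NVIDIA runtime with Docker and restart the daemon.
# Defined only; call it by hand once on the node.
configure_docker_gpu() {
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker
}
```

With that in place, `nvcc --version` should report release 12.4, and after calling `configure_docker_gpu`, a container started with `sudo docker run --rm --gpus all ubuntu nvidia-smi` should list the same GPUs as the host. Add the `export` line to a shell profile to make it persistent.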
Configure Additional Local Storage
There are 5 unused NVMe drives in your system besides the system drive. Depending on your storage needs, consider software RAID 10 (striped mirrors), RAID 5, or any other method that Ubuntu supports.
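Before building an array, it is worth confirming which drives are actually unused, since device numbering and the location of the system disk vary between hosts. The following is a minimal sketch: `unused_disks` is a hypothetical helper that reads `lsblk -nr -o NAME,TYPE,MOUNTPOINT` output on stdin and prints disks with no mounted partitions. It assumes `lsblk` lists each disk immediately before its partitions. It does not detect disks that already belong to an unmounted RAID or LVM set, so treat the result as a starting point.

```shell
# Print /dev paths of disks that have no mounted filesystem or swap.
# Input: lines of "NAME TYPE [MOUNTPOINT]" in lsblk raw order.
unused_disks() {
    awk '
        $2 == "disk"                { disk = $1; order[++n] = disk }
        $2 == "loop" || $2 == "rom" { disk = "" }     # top-level, not a child of a disk
        NF >= 3 && disk != ""       { used[disk] = 1 } # a mountpoint is present
        END {
            for (i = 1; i <= n; i++)
                if (!(order[i] in used))
                    print "/dev/" order[i]
        }'
}
```

Run it as `lsblk -nr -o NAME,TYPE,MOUNTPOINT | unused_disks`. On the host shown earlier this would be expected to print the five drives other than `nvme1n1`.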
5.5TB RAID 10 - Uses 4 of 5 Disks - Reliable and Fast