First Steps After Placing An Order

NVIDIA Tooling Installation

Driver

# Pick legacy or open, not both
# 550 is the latest branch, 535 is a previous version
# legacy driver
#sudo apt-get install --verbose-versions nvidia-headless-550-server -y

# open driver
sudo apt-get install --verbose-versions nvidia-headless-550-server-open -y

# Fabric Manager and tools
sudo apt-get install --verbose-versions --no-install-recommends \
    nvtop python3-pip \
    nvidia-utils-550-server libnvidia-nscq-550 nvidia-fabricmanager-550 -y

Reboot

As of May 24th, 2024, NVIDIA drivers are installed automatically as part of the allocation process, so the driver steps above should not be needed on newer allocations.

Test the Basics

# Listing GPUs should work and show 8 devices
ubuntu@vp-h100-gpu-node:~$ nvidia-smi --list-gpus
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-bcb693b8-d838-4d24-9ae8-654973e733b8)
GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-b1d75684-8e00-4d80-aad0-1a688c79340e)
GPU 2: NVIDIA H100 80GB HBM3 (UUID: GPU-1f18b2f0-7783-472f-99f3-ed4c216186dc)
GPU 3: NVIDIA H100 80GB HBM3 (UUID: GPU-c572de08-32e2-43ba-8eed-858bc1907fac)
GPU 4: NVIDIA H100 80GB HBM3 (UUID: GPU-46befe3e-07a0-4d5b-9cdc-553727ac8116)
GPU 5: NVIDIA H100 80GB HBM3 (UUID: GPU-9ec0da39-8cff-4797-bfcc-1088ee552f0e)
GPU 6: NVIDIA H100 80GB HBM3 (UUID: GPU-5cb37f36-08ea-45a2-9077-09be13e2ef0b)
GPU 7: NVIDIA H100 80GB HBM3 (UUID: GPU-7b47282a-2415-4241-bd14-a1c42522776a)


# Fabric state should be Completed with status Success
ubuntu@vp-h100-gpu-node:~$ nvidia-smi -q -i 0 | grep -i -A 2 Fabric
    Fabric
        State                             : Completed
        Status                            : Success

# CPU (2x52 cores)
ubuntu@vp-h100-gpu-node:~$ nproc --all
104

# RAM (~1TB)
ubuntu@vp-h100-gpu-node:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           1.0Ti       4.7Gi       995Gi       6.0Mi       7.5Gi       998Gi
Swap:          8.0Gi          0B       8.0Gi

# Storage (6x2.9TB)
ubuntu@vp-h100-gpu-node:~$ lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
nvme1n1     259:1    0  2.9T  0 disk 
├─nvme1n1p1 259:2    0  512M  0 part /boot/efi
└─nvme1n1p2 259:3    0  2.9T  0 part /
nvme2n1     259:9    0  2.9T  0 disk 
nvme6n1     259:10   0  2.9T  0 disk 
nvme3n1     259:11   0  2.9T  0 disk 
nvme4n1     259:12   0  2.9T  0 disk 
nvme5n1     259:13   0  2.9T  0 disk 

On some hosts, /boot/efi and / are mounted on a smaller 500G NVMe drive instead of the first of the six larger 2.9T NVMe drives.
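
The GPU and fabric checks above can be scripted so they are easy to repeat after a reboot or a driver change. The following is a minimal sketch, not an official tool: the function names `count_gpus` and `fabric_ok` are hypothetical, and `fabric_ok` assumes its input is the three-line excerpt produced by the `grep -i -A 2 Fabric` command shown earlier.

```shell
# Count GPU lines in `nvidia-smi --list-gpus` output read from stdin.
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:'
}

# Succeed only if the Fabric excerpt on stdin shows
# State = Completed and Status = Success.
fabric_ok() {
    awk '
        /State/  && /Completed/ { state = 1 }
        /Status/ && /Success/   { status = 1 }
        END { exit !(state && status) }
    '
}

# Usage on a GPU node (commented out so the block can be sourced anywhere):
#   [ "$(nvidia-smi --list-gpus | count_gpus)" -eq 8 ] && echo "GPU count OK"
#   nvidia-smi -q -i 0 | grep -i -A 2 Fabric | fabric_ok && echo "Fabric OK"
```

Because both functions only parse text, they can be tried against saved command output before being run on a live node.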

Install CUDA and Optional Tools

# https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# CUDA toolkit, other meta packages available - see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#meta-packages
sudo apt-get --verbose-versions install cuda-toolkit-12-4 -y

# cuDNN
sudo apt-get install nvidia-cudnn -y

# Docker
sudo apt-get install docker.io nvidia-container-toolkit -y
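
After the toolkit is installed, tools such as `nvcc` are usually not on the default `PATH`. The sketch below assumes the package used the default install prefix `/usr/local/cuda-12.4`; adjust `CUDA_HOME` if your host differs. Adding these lines to `~/.bashrc` is one way to make the change persistent.

```shell
# Assumed default install prefix for the cuda-toolkit-12-4 package.
CUDA_HOME=/usr/local/cuda-12.4

# Prepend the CUDA bin and lib64 directories, keeping any existing entries.
export PATH="$CUDA_HOME/bin${PATH:+:$PATH}"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# On a node with the toolkit installed, this should then report release 12.4:
#   nvcc --version
```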

Configure Additional Local Storage

There are 5 unused NVMe drives in your system besides the system drive. Based on your storage needs, consider software RAID 10 (striped mirrors), RAID 5, or any other method that works on Ubuntu.
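
To compare layouts before committing disks, the raw usable capacity can be estimated from the disk count and size: RAID 10 keeps half of the disks' capacity, RAID 5 gives up one disk to parity. The helper below is an illustrative sketch (the name `raid_usable` is hypothetical). It returns a raw upper bound in whatever unit the disk size is given in; the formatted filesystem will report somewhat less.

```shell
# Print the raw usable capacity for DISKS equal drives of SIZE each.
# Usage: raid_usable LEVEL DISKS SIZE   (LEVEL is raid10 or raid5)
raid_usable() {
    level=$1 disks=$2 size=$3
    awk -v level="$level" -v n="$disks" -v s="$size" 'BEGIN {
        if (level == "raid10" && n >= 4 && n % 2 == 0)
            usable = n / 2 * s          # half the disks hold mirror copies
        else if (level == "raid5" && n >= 3)
            usable = (n - 1) * s        # one disk worth of parity
        else
            exit 1                      # unsupported level or disk count
        printf "%.1f\n", usable
    }'
}

raid_usable raid10 4 2.9   # prints 5.8
raid_usable raid5  5 2.9   # prints 11.6
```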

5.5TB RAID 10 - Uses 4 of 5 Disks - Reliable and Fast

sudo pvcreate /dev/nvme[2345]n1
sudo vgcreate -s 64m raidVG /dev/nvme[2345]n1
sudo lvcreate --type raid10 --name raidLV --extents 100%VG raidVG
sudo mkfs.ext4 /dev/raidVG/raidLV
sudo mkdir -p /raid && sudo mount /dev/raidVG/raidLV /raid
# Append (-a) so the existing entries in /etc/fstab are not overwritten
echo "/dev/raidVG/raidLV   /raid   ext4   defaults   0 0" | sudo tee -a /etc/fstab

11TB RAID 5 - Uses 5 of 5 Disks - Reliable with Slower Writes

sudo pvcreate /dev/nvme[23456]n1
sudo vgcreate -s 64m raidVG /dev/nvme[23456]n1
sudo lvcreate --type raid5 --nosync --stripes 4 --name raidLV --extents 100%VG raidVG
sudo mkfs.ext4 /dev/raidVG/raidLV
sudo mkdir -p /raid && sudo mount /dev/raidVG/raidLV /raid
# Append (-a) so the existing entries in /etc/fstab are not overwritten
echo "/dev/raidVG/raidLV   /raid   ext4   defaults   0 0" | sudo tee -a /etc/fstab
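
/etc/fstab already holds the entries for / and /boot/efi, so the new mount line has to be added to the file rather than written over it, and re-running the steps should not leave duplicate entries. A minimal sketch of an idempotent append follows; the function name `append_once` is hypothetical, and it writes to the file directly, so for /etc/fstab it would need to run in a root shell.

```shell
# Append LINE to FILE only if an identical line is not already present.
# Usage: append_once FILE LINE
append_once() {
    file=$1 line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example, as root (commented out so nothing is modified when sourcing):
#   append_once /etc/fstab "/dev/raidVG/raidLV   /raid   ext4   defaults   0 0"
#   mount -a    # reports an error if an fstab entry cannot be mounted
```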

Alternative NVIDIA Tooling

wget -nv -O- https://lambdalabs.com/install-lambda-stack.sh | sh -
sudo reboot
