InfiniBand Basic Setup
This guide provides a basic setup for enabling InfiniBand on your nodes, ensuring proper network configuration, and running NCCL and Torch distributed training over IB.
Introduction
1. Install Required Packages
1.1 Install Mellanox OFED Drivers and Reboot

    # Download, unpack, and install the OFED bundle, then reboot
    wget -O MLNX_OFED_LINUX-24.07-0.6.1.0-ubuntu22.04-x86_64.tgz https://www.mellanox.com/downloads/ofed/MLNX_OFED-24.07-0.6.1.0/MLNX_OFED_LINUX-24.07-0.6.1.0-ubuntu22.04-x86_64.tgz
    tar xvzf MLNX_OFED_LINUX-24.07-0.6.1.0-ubuntu22.04-x86_64.tgz
    cd MLNX_OFED_LINUX-24.07-0.6.1.0-ubuntu22.04-x86_64
    sudo ./mlnxofedinstall --without-fw-update --add-kernel-support --force
    sudo reboot now

    # After the reboot, confirm that the OFED packages are installed
    # and that the driver modules are loaded
    dpkg -l | grep -i ofed
    ofed_info -s
    lsmod | grep -E 'mlx|ib|rdma'
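The broad `lsmod | grep` pattern also matches unrelated modules whose names contain "ib", so a stricter check can be useful. The following is a minimal sketch, not part of the original procedure: the function name `missing_modules` is hypothetical, and the module names in the usage line (`mlx5_core`, `mlx5_ib`, `ib_core`, `ib_uverbs`) are an assumption about what a ConnectX node normally loads.

```shell
#!/bin/bash
# missing_modules MODULE...
# Read `lsmod` output on stdin and print every requested module name
# that does not appear in the first column. Prints nothing when all
# requested modules are loaded.
missing_modules() {
    local loaded mod
    # First column of lsmod is the module name; the header word
    # "Module" is harmless because it never equals a real module name.
    loaded="$(awk '{ print $1 }')"
    for mod in "$@"; do
        grep -Fqx -- "$mod" <<< "$loaded" || echo "$mod"
    done
}
```

A possible invocation is `lsmod | missing_modules mlx5_core mlx5_ib ib_core ib_uverbs`; any name it prints is a module that still needs to be loaded or investigated.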
1.2 Install Essential Networking Tools

    sudo apt update && sudo apt install -y \
        rdma-core \
        mlnx-tools \
        infiniband-diags \
        ibverbs-utils \
        perftest \
        iproute2 \
        net-tools
1.3 Install CUDA and NCCL Support

    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    sudo apt-get update
    sudo apt install -y \
        cuda-toolkit-12-8 \
        nvidia-cuda-toolkit \
        libnccl2 \
        libnccl-dev
2. Verify InfiniBand Device Availability

    lspci | grep Mellanox
    lsmod | grep mlx
    ibstat
    ls /sys/class/infiniband/
    ibdev2netdev
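On nodes that mix InfiniBand and Ethernet adapters, it can help to list only the devices whose ports run the InfiniBand link layer. The following is a minimal sketch, assuming the usual sysfs layout in which each device exposes `ports/<n>/link_layer`; the function name `list_ib_devices` and its optional root-directory argument are illustrative and not part of any installed tool.

```shell
#!/bin/bash
# list_ib_devices [SYSFS_ROOT]
# Print the name of every RDMA device that has at least one port whose
# link layer is InfiniBand. SYSFS_ROOT defaults to /sys/class/infiniband;
# the argument exists only so the function can be tried against a copy.
list_ib_devices() {
    local root="${1:-/sys/class/infiniband}"
    local dev port
    for dev in "$root"/*; do
        [ -d "$dev" ] || continue
        for port in "$dev"/ports/*; do
            # link_layer contains "InfiniBand" or "Ethernet"
            if [ -r "$port/link_layer" ] && grep -q '^InfiniBand' "$port/link_layer"; then
                basename "$dev"
                break
            fi
        done
    done
}
```

Calling `list_ib_devices` with no argument reads the live sysfs tree and prints one device name per line.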
3. Running a Basic InfiniBand Test
3.1 Interactive Pingpong Test

Server script:

    #!/bin/bash
    echo "Starting InfiniBand pingpong test server..."
    while true; do
        for n in 0 3 4 5 6 9 10 11; do
            ibv_rc_pingpong -s 65536 -d mlx5_$n
        done
    done

Client script:

    #!/bin/bash
    echo -n "Enter the IP address of the server: "
    read SERVER_IP
    if ping -c 1 "$SERVER_IP" > /dev/null 2>&1; then
        echo "$SERVER_IP is reachable. Continuing..."
    else
        echo "Error: $SERVER_IP is not reachable."
        exit 1
    fi
    for n in 0 3 4 5 6 9 10 11; do
        echo "Testing device: mlx5_${n} against ${SERVER_IP}"
        sudo ibv_rc_pingpong -s 65536 -d mlx5_${n} ${SERVER_IP}
    done
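To turn a pingpong run into a pass/fail check, the summary line can be parsed. The following is a minimal sketch, assuming `ibv_rc_pingpong` ends its output with a line of the form `<bytes> bytes in <t> seconds = <rate> Mbit/sec`; the helper names `pingpong_mbit` and `pingpong_at_least` are hypothetical, and the threshold is something the caller chooses, not a value from this guide.

```shell
#!/bin/bash
# pingpong_mbit: read ibv_rc_pingpong output on stdin and print the
# reported throughput in Mbit/sec (one value per summary line).
pingpong_mbit() {
    awk '/Mbit\/sec/ { print $(NF-1) }'
}

# pingpong_at_least MIN: succeed only if stdin contains at least one
# summary line and every reported throughput is >= MIN Mbit/sec.
pingpong_at_least() {
    local min="$1"
    pingpong_mbit | awk -v min="$min" '
        { seen = 1; if ($1 + 0 < min + 0) low = 1 }
        END { exit !(seen && !low) }'
}
```

In the client loop this could be used as `ibv_rc_pingpong -s 65536 -d mlx5_0 "$SERVER_IP" | pingpong_at_least "$MIN_MBIT"`, where `MIN_MBIT` is a variable you define for your fabric.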
4. Running NCCL Tests with InfiniBand

    # Build nccl-tests with MPI support so that mpirun launches one
    # coordinated multi-process job (MPI_HOME matches the Open MPI path used below)
    git clone https://github.com/NVIDIA/nccl-tests.git
    cd nccl-tests
    make -j MPI=1 MPI_HOME=/usr/mpi/gcc/openmpi-4.1.7rc1

    # Run all_reduce_perf across the hosts listed in ./mpihosts
    export LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-4.1.7rc1/lib:$LD_LIBRARY_PATH
    /usr/mpi/gcc/openmpi-4.1.7rc1/bin/mpirun -np 2 -bind-to none --verbose \
        --hostfile ./mpihosts \
        -x LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/libnvvp:/usr/local/lib \
        -x NCCL_IB_HCA=mlx5 \
        -x UCX_NET_DEVICES=mlx5_0:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_9:1,mlx5_10:1,mlx5_11:1 \
        -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 \
        -x NCCL_COLLNET_ENABLE=0 \
        /home/ubuntu/nccl-tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1
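The `mpirun` command relies on two inputs that this guide does not show being created: the `./mpihosts` file and the long `UCX_NET_DEVICES` list. The following is a minimal sketch of helpers for both, assuming the Open MPI hostfile format of one `host slots=N` entry per line; the function names `write_hostfile` and `ucx_net_devices` are hypothetical, and the host names and slot count are yours to supply.

```shell
#!/bin/bash
# write_hostfile FILE SLOTS HOST...
# Write an Open MPI hostfile with one "HOST slots=SLOTS" line per host,
# replacing any existing FILE.
write_hostfile() {
    local file="$1" slots="$2"; shift 2
    local host
    : > "$file"
    for host in "$@"; do
        printf '%s slots=%s\n' "$host" "$slots" >> "$file"
    done
}

# ucx_net_devices PORT DEV...
# Join device names into the comma-separated "device:port" form used
# for UCX_NET_DEVICES in the mpirun command.
ucx_net_devices() {
    local port="$1"; shift
    local out="" dev
    for dev in "$@"; do
        out+="${out:+,}${dev}:${port}"
    done
    printf '%s\n' "$out"
}
```

For example, `ucx_net_devices 1 mlx5_0 mlx5_3` yields `mlx5_0:1,mlx5_3:1`, which can be passed as `-x UCX_NET_DEVICES="$(ucx_net_devices 1 mlx5_0 mlx5_3)"`.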
5. Debugging Common InfiniBand Issues
ibstat
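When a node has many adapters, scanning the full `ibstat` output by eye is error-prone. The following is a minimal sketch that filters it down to ports that are not up, assuming the usual `ibstat` text layout (`CA '<name>'`, `Port <n>:`, `State: <value>`); the function name `inactive_ib_ports` is hypothetical.

```shell
#!/bin/bash
# inactive_ib_ports: read `ibstat` output on stdin and print
# "<device> <port> <state>" for every port whose State is not Active.
# Prints nothing when all ports are Active.
inactive_ib_ports() {
    awk '
        # "CA <quoted name>" starts a new adapter; strip the quotes
        /^CA /                { ca = $2; gsub(/[^A-Za-z0-9_]/, "", ca) }
        # "Port 1:" starts a port section; strip the trailing colon
        /^[ \t]*Port [0-9]+:/ { port = $2; sub(/:$/, "", port) }
        # "State: Active|Down|Initializing|Armed" (not "Physical state:")
        /^[ \t]*State:/       { if ($2 != "Active") print ca, port, $2 }
    '
}
```

Running `ibstat | inactive_ib_ports` on a node then lists only the ports that need attention.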
Conclusion
