InfiniBand Basic Setup
This guide provides a basic setup for enabling InfiniBand on your nodes, ensuring proper network configuration, and running NCCL and Torch distributed training over IB.
⚠️ Warning: The following setup guide is a general reference and may not work for every situation. Please adjust configurations based on your specific environment and workload requirements.
Introduction

InfiniBand (IB) is a high-speed networking technology often used for distributed computing, AI workloads, and high-performance computing (HPC) applications.
1. Install Required Packages
1.1 Install Mellanox OFED Drivers and Reboot

If the MLNX_OFED driver is not installed, follow these steps:
```shell
wget -O MLNX_OFED_LINUX-24.07-0.6.1.0-ubuntu22.04-x86_64.tgz https://www.mellanox.com/downloads/ofed/MLNX_OFED-24.07-0.6.1.0/MLNX_OFED_LINUX-24.07-0.6.1.0-ubuntu22.04-x86_64.tgz
tar xvzf MLNX_OFED_LINUX-24.07-0.6.1.0-ubuntu22.04-x86_64.tgz
cd MLNX_OFED_LINUX-24.07-0.6.1.0-ubuntu22.04-x86_64
sudo ./mlnxofedinstall --without-fw-update --add-kernel-support --force
sudo reboot now
```
Verify installation:
```shell
dpkg -l | grep -i ofed
ofed_info -s
lsmod | grep -E 'mlx|ib|rdma'
```
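The version check above can also be scripted. A minimal sketch that extracts the installed OFED version from `ofed_info -s` output; the sample line piped in below is illustrative, not taken from a real node:

```shell
#!/bin/bash
# Extract the OFED version from `ofed_info -s`-style output.
# `ofed_info -s` typically prints a single line like "MLNX_OFED_LINUX-24.07-0.6.1.0:".
parse_ofed_version() {
  sed -n 's/^MLNX_OFED_LINUX-\([0-9.-]*\):*$/\1/p'
}

# On a real node: ofed_info -s | parse_ofed_version
# Illustrative sample piped through the parser:
echo "MLNX_OFED_LINUX-24.07-0.6.1.0:" | parse_ofed_version
```

On a real node, pipe the live command output instead of the sample string.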
1.2 Install Essential Networking Tools

To install essential tools for InfiniBand communication (some or all may already be installed):
```shell
sudo apt update && sudo apt install -y \
  rdma-core \
  mlnx-tools \
  infiniband-diags \
  ibverbs-utils \
  perftest \
  iproute2 \
  net-tools
```
1.3 Install CUDA and NCCL Support

For CUDA support:
```shell
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt install -y \
  cuda-toolkit-12-8 \
  nvidia-cuda-toolkit \
  libnccl2 \
  libnccl-dev
```
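To confirm the toolkit is usable after installation, the release number can be parsed out of `nvcc --version`. A minimal sketch; the sample line below mirrors a typical `nvcc` output line and is shown only as an illustration:

```shell
#!/bin/bash
# Pull the CUDA release number from `nvcc --version`-style output.
# The final line typically reads: "Cuda compilation tools, release 12.8, V12.8.61"
parse_cuda_release() {
  sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# On a real node: nvcc --version | parse_cuda_release
# Illustrative sample piped through the parser:
echo "Cuda compilation tools, release 12.8, V12.8.61" | parse_cuda_release
```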
2. Verify InfiniBand Device Availability

To check if InfiniBand devices are detected on your system, run:
```shell
lspci | grep Mellanox
```
If detected, check if the IB kernel modules are loaded:
```shell
lsmod | grep mlx
```
To list InfiniBand interfaces:
```shell
ibstat
```
To list available InfiniBand devices:
```shell
ls /sys/class/infiniband/
```
Then map each IB device to its network interface and check link state:
```shell
ibdev2netdev
```
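`ibdev2netdev` typically prints one line per device port, e.g. `mlx5_0 port 1 ==> ib0 (Up)`. The check can be scripted to flag any port whose link is not up; a minimal sketch, run here against a hypothetical two-device sample rather than live output:

```shell
#!/bin/bash
# Warn about any IB port whose link state is not "(Up)" in ibdev2netdev-style input.
check_ib_links() {
  awk '$NF != "(Up)" { print "WARN: " $1 " link state is " $NF }'
}

# On a real node: ibdev2netdev | check_ib_links
# Hypothetical sample with one downed link:
printf '%s\n' \
  'mlx5_0 port 1 ==> ib0 (Up)' \
  'mlx5_3 port 1 ==> ib1 (Down)' | check_ib_links
```

Any `WARN:` line points at a device to investigate before running distributed workloads.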
3. Running a Basic InfiniBand Test
3.1 Interactive Pingpong Test

To verify connectivity between two nodes, use the following interactive scripts:
Server Script:
```shell
#!/bin/bash
echo "Starting InfiniBand pingpong test server..."
while true; do
  for n in 0 3 4 5 6 9 10 11; do
    ibv_rc_pingpong -s 65536 -d mlx5_$n
  done
done
```
How to use: Run this script on the server node to continuously listen for incoming pingpong tests.
Client Script:
```shell
#!/bin/bash
echo -n "Enter the IP address of the server: "
read SERVER_IP
if ping -c 1 "$SERVER_IP" > /dev/null 2>&1; then
  echo "$SERVER_IP is reachable. Continuing..."
else
  echo "Error: $SERVER_IP is not reachable."
  exit 1
fi
for n in 0 3 4 5 6 9 10 11; do
  echo "Testing device: mlx5_${n} against ${SERVER_IP}"
  sudo ibv_rc_pingpong -s 65536 -d mlx5_${n} ${SERVER_IP}
done
```
How to use: Run this script on the client node, enter the server IP when prompted, and it will test IB connectivity.
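A successful `ibv_rc_pingpong` run typically ends with a throughput line such as `8192000 bytes in 0.01 seconds = 5823.82 Mbit/sec`. When looping over many devices it can help to pull out just that figure; a minimal sketch, fed an illustrative sample line rather than live output:

```shell
#!/bin/bash
# Extract the Mbit/sec throughput figure from ibv_rc_pingpong-style output.
pingpong_mbits() {
  awk '/Mbit\/sec/ { print $(NF-1) }'
}

# On a real node: ibv_rc_pingpong -s 65536 -d mlx5_0 $SERVER_IP | pingpong_mbits
# Illustrative sample piped through the parser:
echo "8192000 bytes in 0.01 seconds = 5823.82 Mbit/sec" | pingpong_mbits
```

Comparing the extracted figures across devices makes a slow or misbehaving HCA easy to spot.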
4. Running NCCL Tests with InfiniBand

Clone and build NCCL tests:
```shell
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
# Build with MPI support so the binaries can be launched across nodes via mpirun.
make MPI=1 MPI_HOME=/usr/mpi/gcc/openmpi-4.1.7rc1 -j
```
Run an NCCL all-reduce performance test:
```shell
export LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-4.1.7rc1/lib:$LD_LIBRARY_PATH
/usr/mpi/gcc/openmpi-4.1.7rc1/bin/mpirun -np 2 -bind-to none --verbose \
  --hostfile ./mpihosts \
  -x LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/libnvvp:/usr/local/lib \
  -x NCCL_IB_HCA=mlx5 \
  -x UCX_NET_DEVICES=mlx5_0:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_9:1,mlx5_10:1,mlx5_11:1 \
  -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 \
  -x NCCL_COLLNET_ENABLE=0 \
  /home/ubuntu/nccl-tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1
```
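To confirm the test actually ran over InfiniBand rather than falling back to sockets, add `-x NCCL_DEBUG=INFO` to the `mpirun` command and inspect the log: NCCL prints `NET/IB` lines when the IB transport is selected and `NET/Socket` lines when it falls back. A minimal log check, sketched against a hypothetical log line rather than a real run:

```shell
#!/bin/bash
# Decide from an NCCL debug log whether the IB transport was selected.
check_nccl_transport() {
  if grep -q 'NET/IB' "$1"; then
    echo "NCCL is using InfiniBand"
  elif grep -q 'NET/Socket' "$1"; then
    echo "NCCL fell back to sockets (Ethernet)"
  else
    echo "No NCCL transport lines found in $1"
  fi
}

# Hypothetical sample log line, as an illustration:
echo 'node1:1234:1250 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB' > /tmp/nccl_test.log
check_nccl_transport /tmp/nccl_test.log
```

In a real run, capture the output with `mpirun ... 2>&1 | tee nccl_test.log` and point the function at that file.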
5. Debugging Common InfiniBand Issues
- If `ibdev2netdev` is missing, install `mlnx-tools`.
- If training hangs, ensure MLNX_OFED is installed and all IB devices show "Up".
- If NCCL falls back to Ethernet, verify IB interfaces with `ibstat`.
- Use `lsmod | grep mlx` to verify the IB drivers are loaded.
Conclusion

InfiniBand provides a powerful interconnect for distributed training, but correct configuration is key to ensuring optimal performance. If you encounter issues, confirm driver installation, check network device mappings, and validate NCCL environment variable settings.
For further assistance, please contact CX Team.