InfiniBand Basic Setup
This guide provides a basic setup for enabling InfiniBand on your nodes, ensuring proper network configuration, and running NCCL and Torch distributed training over IB.
⚠️ Warning: The following setup guide is a general reference and may not work for every situation. Please adjust configurations based on your specific environment and workload requirements.
Introduction
InfiniBand (IB) is a high-speed networking technology often used for distributed computing, AI workloads, and high-performance computing (HPC) applications.
1. Verify InfiniBand Device Availability
To check if InfiniBand devices are detected on your system, run:
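One common way to do this is to scan the PCI bus for the adapter. The sketch below assumes lspci (from pciutils) is installed and that the adapter identifies itself with a Mellanox or InfiniBand string; the function name is illustrative, not part of any tool.

```shell
# Assumption: the adapter shows up in lspci with "Mellanox" or "InfiniBand" in its description.
find_ib_pci_devices() {
    if ! command -v lspci >/dev/null 2>&1; then
        echo "lspci not found (install pciutils)"
        return 0
    fi
    # Case-insensitive match; print a hint instead of failing when nothing matches
    lspci | grep -i -E 'mellanox|infiniband' || echo "no InfiniBand adapter found on the PCI bus"
}

find_ib_pci_devices
```

If nothing is listed, the adapter may be absent, disabled in firmware, or not passed through to the node.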
If detected, check if the IB kernel modules are loaded:
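A minimal sketch of that check, assuming the usual Mellanox and RDMA module naming (mlx5_core, mlx5_ib, ib_core, rdma_cm and similar). The helper name is hypothetical; it only filters lsmod output.

```shell
# Print the names of RDMA-related modules from lsmod-style input on stdin.
# Assumption: relevant modules start with mlx<N>_, ib_ or rdma_.
loaded_ib_modules() {
    awk '$1 ~ /^(mlx[0-9]+_|ib_|rdma_)/ { print $1 }'
}

if command -v lsmod >/dev/null 2>&1; then
    lsmod | loaded_ib_modules
else
    echo "lsmod not found"
fi
```

An empty result suggests the drivers are not loaded, which usually points back to the driver installation in section 2.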
To list InfiniBand interfaces:
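The ibstat tool prints one section per adapter and port. The wrapper below is only a sketch; it exists so that a missing tool produces a readable message rather than a shell error.

```shell
show_ib_ports() {
    if ! command -v ibstat >/dev/null 2>&1; then
        echo "ibstat not found"
        return 0
    fi
    # Each port section includes State, Physical state, Rate and Link layer
    ibstat || echo "ibstat could not query any adapter"
}

show_ib_ports
```

A healthy port typically reports State: Active and Physical state: LinkUp.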
If ibstat is not available, install rdma-core (on many distributions ibstat itself ships in the infiniband-diags package):
To list available InfiniBand devices:
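When no InfiniBand tooling is installed yet, the kernel's sysfs tree can also be read directly. This is a sketch under the assumption that the drivers are loaded and expose /sys/class/infiniband; the function name and its optional root argument are illustrative.

```shell
# List every device and port with its link state, read from sysfs.
# The optional argument overrides the sysfs root (useful for testing).
list_ib_port_states() {
    sysfs_root="${1:-/sys/class/infiniband}"
    for state_file in "$sysfs_root"/*/ports/*/state; do
        [ -r "$state_file" ] || continue      # glob did not match: no devices
        port_dir="${state_file%/state}"
        dev_dir="${port_dir%/ports/*}"
        printf '%s port %s: %s\n' "${dev_dir##*/}" "${port_dir##*/}" "$(cat "$state_file")"
    done
}

list_ib_port_states
```

The function prints nothing when no devices are registered with the kernel.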
If ibdev2netdev is missing, install the required package:
Then run:
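The ibdev2netdev tool maps each IB device and port to its network interface and link state. The sketch below runs it and condenses each line; the summarize_ib_links helper is hypothetical and assumes the usual "<ibdev> port <n> ==> <netdev> (<state>)" line layout.

```shell
# Condense ibdev2netdev-style lines into "<ibdev> <netdev> <state>".
summarize_ib_links() {
    awk '{ state = $NF; gsub(/[()]/, "", state); print $1, $5, state }'
}

if command -v ibdev2netdev >/dev/null 2>&1; then
    ibdev2netdev | summarize_ib_links
else
    echo "ibdev2netdev not found"
fi
```

Any port whose state is not Up is worth investigating before running distributed jobs.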
2. Install Required Packages
2.1 Install Mellanox OFED Drivers
If the MLNX_OFED driver is not installed, follow these steps:
Verify installation:
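MLNX_OFED installs an ofed_info utility, and its -s option prints the installed version in short form. A minimal sketch (the wrapper function is illustrative):

```shell
check_ofed() {
    if command -v ofed_info >/dev/null 2>&1; then
        ofed_info -s    # short form: installed MLNX_OFED version string
    else
        echo "ofed_info not found: MLNX_OFED does not appear to be installed"
    fi
}

check_ofed
```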
2.2 Install Essential Networking Tools
To install essential tools for InfiniBand communication:
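Package names vary by distribution, so it can help to first check which tools are already present. The sketch below uses a hypothetical missing_tools helper; the tool list and the package names in the comments are assumptions based on common Debian/Ubuntu and RHEL-family packaging.

```shell
# Print the names of the given commands that are not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Typical packages (assumption, check your distribution):
#   ibstat                        -> infiniband-diags
#   ibv_devinfo, ibv_rc_pingpong  -> ibverbs-utils (libibverbs-utils on RHEL-family systems)
#   ib_write_bw                   -> perftest
missing_tools ibstat ibv_devinfo ibv_rc_pingpong ib_write_bw
```

Anything printed here still needs to be installed through your package manager or through MLNX_OFED.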
2.3 Install CUDA and NCCL Support
For CUDA support:
3. Running a Basic InfiniBand Test
3.1 Interactive Pingpong Test
To verify connectivity between two nodes, use the following interactive scripts:
Server Script:
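A minimal sketch of such a server script, built on ibv_rc_pingpong from the libibverbs utilities. The device name mlx5_0 and port 1 are assumed defaults; substitute the values reported on your node.

```shell
IB_DEV="${IB_DEV:-mlx5_0}"    # assumed device name; check with ibv_devices
IB_PORT="${IB_PORT:-1}"       # assumed port number

run_pingpong_server() {
    if ! command -v ibv_rc_pingpong >/dev/null 2>&1; then
        echo "ibv_rc_pingpong not found"
        return 1
    fi
    echo "Listening for pingpong clients on $IB_DEV port $IB_PORT (Ctrl-C to stop)"
    # Without a host argument ibv_rc_pingpong acts as the server. It serves one
    # client and exits, so loop until a run fails or the user interrupts.
    while ibv_rc_pingpong -d "$IB_DEV" -i "$IB_PORT"; do
        echo "Test finished, waiting for the next client"
    done
}
```

Call run_pingpong_server at the end of the script, or source the file and call it from an interactive shell.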
How to use: Run this script on the server node to continuously listen for incoming pingpong tests.
Client Script:
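A matching client sketch, again assuming ibv_rc_pingpong and the hypothetical defaults mlx5_0 and port 1. It prompts for the server address when none is passed as an argument.

```shell
IB_DEV="${IB_DEV:-mlx5_0}"    # assumed device name; check with ibv_devices
IB_PORT="${IB_PORT:-1}"       # assumed port number

run_pingpong_client() {
    if ! command -v ibv_rc_pingpong >/dev/null 2>&1; then
        echo "ibv_rc_pingpong not found"
        return 1
    fi
    server_ip="$1"
    if [ -z "$server_ip" ]; then
        printf 'Enter server IP: '
        read -r server_ip
    fi
    # Passing a host makes ibv_rc_pingpong connect as the client
    ibv_rc_pingpong -d "$IB_DEV" -i "$IB_PORT" "$server_ip"
}
```

Call run_pingpong_client with no argument to be prompted, or pass the server address directly.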
How to use: Run this script on the client node, enter the server IP when prompted, and it will test IB connectivity.
4. Running NCCL Tests with InfiniBand
Clone and build NCCL tests:
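A sketch of the build, assuming CUDA lives under /usr/local/cuda unless CUDA_HOME says otherwise and that NCCL is installed in a default system location (if it is not, the nccl-tests README describes passing NCCL_HOME to make). The function name is illustrative.

```shell
build_nccl_tests() {
    cuda_home="${CUDA_HOME:-/usr/local/cuda}"
    if [ ! -x "$cuda_home/bin/nvcc" ]; then
        echo "nvcc not found under $cuda_home; set CUDA_HOME"
        return 1
    fi
    # Clone NVIDIA's nccl-tests and build the single-process binaries
    git clone https://github.com/NVIDIA/nccl-tests.git &&
        make -C nccl-tests CUDA_HOME="$cuda_home"
    # For multi-node runs, the README describes building with
    # MPI=1 MPI_HOME=/path/to/mpi
}
```

After a successful build the binaries, including all_reduce_perf, are placed under nccl-tests/build.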
Run an NCCL all-reduce performance test:
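A minimal sketch of such a run. The binary path assumes nccl-tests was built in the current directory, and the commented-out adapter name is an assumption; the environment variables are standard NCCL settings.

```shell
export NCCL_DEBUG=INFO        # log which network transport NCCL selects
export NCCL_IB_DISABLE=0      # keep the InfiniBand transport enabled
# export NCCL_IB_HCA=mlx5_0   # optional: restrict NCCL to one adapter (assumed name)

run_all_reduce() {
    bin="${1:-./nccl-tests/build/all_reduce_perf}"
    if [ ! -x "$bin" ]; then
        echo "all_reduce_perf not found at $bin"
        return 1
    fi
    # Sweep message sizes from 8 bytes to 128 MB, doubling each step, on 1 GPU
    "$bin" -b 8 -e 128M -f 2 -g 1
}
```

Call run_all_reduce to start the sweep. A single-process, single-GPU run like this only confirms that the binary works; traffic crosses the InfiniBand fabric only when ranks on different nodes take part, which usually means an MPI build launched with mpirun across the nodes.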
5. Debugging Common InfiniBand Issues
If ibdev2netdev is missing, install mlnx-tools.
If training hangs, ensure MLNX_OFED is installed and all IB devices show "Up".
If NCCL falls back to Ethernet, verify the IB interfaces with ibstat, as in step 1.
Use lsmod | grep mlx to verify IB drivers are loaded.
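To tell whether NCCL fell back to Ethernet, one option is to run the job with NCCL_DEBUG=INFO and look at the network backend it reports. The helper below is a hypothetical sketch that assumes NCCL's "Using network <name>" log line, where the name is typically IB for InfiniBand and Socket for the TCP fallback.

```shell
# Read NCCL_DEBUG=INFO output on stdin and print the network backend(s) NCCL reported.
nccl_network() {
    sed -n 's/.*NCCL INFO Using network \(.*\)$/\1/p' | sort -u
}

# Example (assumed binary path):
#   NCCL_DEBUG=INFO ./nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 1 2>&1 | nccl_network
```

If the result is Socket, check that NCCL_IB_DISABLE is not set to 1 and that the IB ports are active.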
Conclusion
InfiniBand provides a powerful interconnect for distributed training, but correct configuration is key to ensuring optimal performance. If you encounter issues, confirm driver installation, check network device mappings, and validate NCCL environment variable settings.
For further assistance, please contact the CX Team.