- Two tests can be used to measure the bandwidth over the buses; the first is bandwidthTest:
[root@cld-dfa-gpu-04 ~]# /usr/local/cuda-11.7/extras/demo_suite/bandwidthTest -device=all
[CUDA Bandwidth Test] - Starting...

!!!!!Cumulative Bandwidth to be computed from all the devices !!!!!!

Running on...

 Device 0: NVIDIA A2
 Device 1: NVIDIA A2
 Quick Mode

 Host to Device Bandwidth, 2 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     24012.5

 Device to Host Bandwidth, 2 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     25735.1

 Device to Device Bandwidth, 2 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     319230.6

Result = PASS
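For automated checks it can be convenient to pull the bandwidth figure out of the tool's output. A minimal sketch (`extract_bw` is a hypothetical helper name; the here-doc just replays a fragment of the output above — on the node you would pipe the tool directly):

```shell
#!/bin/sh
# Hypothetical helper: print the Bandwidth(MB/s) column for the 32 MiB
# transfer size reported by bandwidthTest in Quick Mode.
extract_bw() {
    awk '/^ *33554432/ { print $2 }'
}

# Example on a captured fragment of the output; on the host you would run:
#   /usr/local/cuda-11.7/extras/demo_suite/bandwidthTest -device=all | extract_bw
extract_bw <<'EOF'
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     24012.5
EOF
```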
The second is busGrind:
[root@cld-dfa-gpu-04 ~]# /usr/local/cuda-11.7/extras/demo_suite/busGrind -a
Device: 0, NVIDIA A2, pciBusID: 17, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA A2, pciBusID: ca, pciDeviceID: 0, pciDomainID:0
P2P Cliques:
Clique: 0 [0 1]
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between the host and a single device
*************************************************************************
Host/Device Bandwidth Matrix (GB/s), memory=Pinned
Dir\D 0 1
D2H 13.17 13.18
H2D 12.30 12.31
BiDir 22.58 22.58
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between the host and a single device
*************************************************************************
Host/Device Bandwidth Matrix (GB/s), memory=Unpinned
Dir\D 0 1
D2H 8.07 10.03
H2D 11.64 11.70
BiDir 10.83 10.40
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between the host and multiple devices concurrently
*************************************************************************
Concurrent Host/Device Bandwidth Matrix (GB/s), memory=Pinned
Dir\D 0 1 Total
H2D 12.30 12.30 24.60
D2H 13.18 13.18 26.36
BiDir 22.58 22.57 45.15
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between the host and multiple devices concurrently
*************************************************************************
Concurrent Host/Device Bandwidth Matrix (GB/s), memory=Unpinned
Dir\D 0 1 Total
H2D 11.83 11.83 23.66
D2H 3.75 3.73 7.48
BiDir 9.58 9.41 18.99
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between pairs of devices
*************************************************************************
P2P Bandwidth Matrix (GB/s) - Unidirectional, P2P=Enabled
D\D 0 1
0 81.08 11.33
1 11.33 81.58
P2P Bandwidth Matrix (GB/s) - Bidirectional, P2P=Enabled
D\D 0 1
0 81.95 20.81
1 20.81 81.86
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between pairs of devices
*************************************************************************
P2P Bandwidth Matrix (GB/s) - Unidirectional, P2P=Disabled
D\D 0 1
0 81.86 11.36
1 11.40 81.95
P2P Bandwidth Matrix (GB/s) - Bidirectional, P2P=Disabled
D\D 0 1
0 81.71 19.76
1 19.84 81.88
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between pairs of devices running concurrently (assumes devices are paired in order)
*************************************************************************
P2P Concurrent Exchange Bandwidth Matrix - P2P Enabled (GB/s)
Dir\D 0<>1 Total
R2L 11.35 11.35
L2R 11.35 11.35
BiDir 20.85 20.85
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between pairs of devices running concurrently (assumes devices are paired in order)
*************************************************************************
P2P Concurrent Exchange Bandwidth Matrix - P2P Disabled (GB/s)
Dir\D 0<>1 Total
R2L 11.33 11.33
L2R 11.40 11.40
BiDir 19.05 19.05
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth for a 1D exchange across devices running concurrently (assumes devices are ordered ideally)
*************************************************************************
P2P Concurrent 1D Exchange Bandwidth Matrix - P2P Enabled (GB/s)
Dir\D 0 1 Total
R2L 11.35 0.00 11.35
L2R 0.00 11.35 11.35
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth for a 1D exchange across devices running concurrently (assumes devices are ordered ideally)
*************************************************************************
P2P Concurrent 1D Exchange Bandwidth Matrix - P2P Disabled (GB/s)
Dir\D 0 1 Total
R2L 11.36 0.00 11.36
L2R 0.00 11.38 11.38
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth for a cycle exchange across devices running concurrently (assumes devices are ordered ideally)
*************************************************************************
P2P Concurrent 1D Exchange Bandwidth Matrix - P2P Enabled (GB/s)
Dir\D H 0 1 H Total
R2L 7.96 9.08 0.00 13.01 30.05
L2R 13.01 0.00 9.09 7.97 30.06
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth for a cycle exchange across devices running concurrently (assumes devices are ordered ideally)
*************************************************************************
P2P Concurrent 1D Exchange Bandwidth Matrix - P2P Disabled (GB/s)
Dir\D H 0 1 H Total
R2L 11.76 9.84 0.00 11.57 33.18
L2R 11.58 0.00 9.83 11.76 33.18
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth for an all to all across all devices running concurrently
*************************************************************************
P2P All to All Bandwidth Matrix - P2P Enabled (GB/s)
Dir\D 0 1 Total
Sctr 20.81 20.83 41.63
Gthr 20.81 20.80 41.61
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth for an all to all across all devices running concurrently
*************************************************************************
P2P All to All Bandwidth Matrix - P2P Disabled (GB/s)
Dir\D 0 1 Total
Sctr 19.71 19.88 39.59
Gthr 20.23 19.75 39.98
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus latency between the host and a single device
*************************************************************************
Host/Device Latency Matrix (us), memory=Pinned
Dir\D 0 1
D2H 1.45 1.53
H2D 1.42 1.56
BiDir 2.51 2.85
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus latency between the host and a single device
*************************************************************************
Host/Device Latency Matrix (us), memory=Unpinned
Dir\D 0 1
D2H 4.45 5.33
H2D 1.14 1.44
BiDir 5.51 6.67
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus latency between pairs of GPUs
*************************************************************************
P2P Latency Matrix - P2P=Enabled (us)
D\D 0 1
0 0.02 0.02
1 0.02 0.02
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus latency between pairs of GPUs
*************************************************************************
P2P Latency Matrix - P2P=Disabled (us)
D\D 0 1
0 0.02 0.02
1 0.02 0.02
*************************************************************************
*************************************************************************
- Another interesting test is the nbody application, which performs an efficient all-pairs gravitational n-body simulation in CUDA (see https://docs.nvidia.com/cuda/pdf/Demo_Suite.pdf). You can compare the results between CPU only, one GPU, and two or more GPUs with the options shown below:
[root@cld-dfa-gpu-04 ~]# /usr/local/cuda-11.7/extras/demo_suite/nbody -numbodies=10240 -cpu -benchmark
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
> Simulation with CPU
number of bodies = 10240
10240 bodies, total time for 10 iterations: 4511.497 ms
= 0.232 billion interactions per second
= 4.648 single-precision GFLOP/s at 20 flops per interaction

[root@cld-dfa-gpu-04 ~]# /usr/local/cuda-11.7/extras/demo_suite/nbody -numbodies=10240 -numdevices=1 -benchmark
number of CUDA devices  = 1
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6

> Compute 8.6 CUDA device: [NVIDIA A2]
number of bodies = 10240
10240 bodies, total time for 10 iterations: 9.462 ms
= 110.823 billion interactions per second
= 2216.450 single-precision GFLOP/s at 20 flops per interaction

[root@cld-dfa-gpu-04 ~]# /usr/local/cuda-11.7/extras/demo_suite/nbody -numbodies=10240 -numdevices=2 -benchmark
number of CUDA devices  = 2
> Windowed mode
> Simulation data stored in system memory
> Single precision floating point simulation
> 2 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6

> Compute 8.6 CUDA device: [NVIDIA A2]
> Compute 8.6 CUDA device: [NVIDIA A2]
number of bodies = 10240
10240 bodies, total time for 10 iterations: 5.336 ms
= 196.511 billion interactions per second
= 3930.218 single-precision GFLOP/s at 20 flops per interaction
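The summary figures follow directly from the all-pairs cost: interactions = numbodies² × iterations, and 20 flops per interaction, as stated in the output itself. A quick sanity check of the one-GPU run above (small differences from the printed summary come from rounding of the reported time):

```shell
#!/bin/sh
# Recompute the nbody summary from the reported run:
# 10240 bodies, 10 iterations, 9.462 ms on one GPU.
awk 'BEGIN {
    n = 10240; iters = 10; ms = 9.462
    interactions = n * n * iters           # all-pairs interactions
    per_sec = interactions / (ms / 1000)   # interactions per second
    printf "%.3f billion interactions/s\n", per_sec / 1e9
    printf "%.3f GFLOP/s\n", per_sec * 20 / 1e9   # 20 flops per interaction
}'
```

This prints roughly 110.820 billion interactions/s and 2216.394 GFLOP/s, matching the tool's own summary to within rounding.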
- All of the above tests/benchmarks can also be executed in the VMs. In principle this can help debug performance loss in VMs with GPUs by comparing bare-metal and VM results.
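A minimal sketch of such a comparison, assuming the relevant bandwidth figures have already been extracted from the bare-metal and VM runs (the `compare_bw` helper name and the 10% tolerance are arbitrary examples, not NVIDIA recommendations):

```shell
#!/bin/sh
# Compare a bandwidth figure between a bare-metal run and a VM run of the
# same test, flagging drops larger than an arbitrary 10% tolerance.
compare_bw() {
    # $1 = bare-metal value (MB/s), $2 = VM value (MB/s)
    awk -v bm="$1" -v vm="$2" 'BEGIN {
        drop = (bm - vm) / bm * 100
        printf "bare-metal %.1f MB/s, VM %.1f MB/s, drop %.1f%%\n", bm, vm, drop
        exit (drop > 10) ? 1 : 0
    }'
}

# Example with the Host-to-Device figure from the bare-metal run above and a
# hypothetical VM measurement:
compare_bw 24012.5 22800.0
```

The non-zero exit status on excessive drops makes the check easy to wire into a monitoring or acceptance-test script.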