• Two tests can be used to measure the bandwidth across the buses; the first is bandwidthTest:
[root@cld-dfa-gpu-04 ~]# /usr/local/cuda-11.7/extras/demo_suite/bandwidthTest -device=all
[CUDA Bandwidth Test] - Starting...

!!!!!Cumulative Bandwidth to be computed from all the devices !!!!!!

Running on...

 Device 0: NVIDIA A2
 Device 1: NVIDIA A2
 Quick Mode

 Host to Device Bandwidth, 2 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     24012.5

 Device to Host Bandwidth, 2 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     25735.1

 Device to Device Bandwidth, 2 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     319230.6

Result = PASS

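Quick mode measures a single 32 MiB transfer per direction. For automated comparisons (for example, the bare-metal vs VM checks mentioned at the end of this page), the figures can be scraped from the captured output; the parser below is a sketch that assumes the output format shown above.

```python
import re

def parse_bandwidth_test(output: str) -> dict:
    """Extract bandwidth figures (MB/s) from bandwidthTest quick-mode output.

    Returns a dict keyed by section name, e.g.
    {'Host to Device': 24012.5, 'Device to Host': 25735.1, ...}
    """
    results = {}
    section = None
    for line in output.splitlines():
        m = re.match(r"\s*(Host to Device|Device to Host|Device to Device) Bandwidth", line)
        if m:
            section = m.group(1)
            continue
        # Data rows look like: "   33554432                     24012.5"
        m = re.match(r"\s*(\d+)\s+([\d.]+)\s*$", line)
        if m and section:
            results[section] = float(m.group(2))
            section = None
    return results
```

The same approach works for range or shmoo mode output, which prints one data row per transfer size.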
The second is busGrind:

[root@cld-dfa-gpu-04 ~]# /usr/local/cuda-11.7/extras/demo_suite/busGrind -a
Device: 0, NVIDIA A2, pciBusID: 17, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA A2, pciBusID: ca, pciDeviceID: 0, pciDomainID:0
P2P Cliques:
Clique: 0 [0 1]
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between the host and a single device
*************************************************************************
Host/Device Bandwidth Matrix (GB/s), memory=Pinned
   Dir\D       0      1
   D2H     13.17  13.18
   H2D     12.30  12.31
   BiDir   22.58  22.58
*************************************************************************
*************************************************************************


*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between the host and a single device
*************************************************************************
Host/Device Bandwidth Matrix (GB/s), memory=Unpinned
   Dir\D       0      1
   D2H      8.07  10.03
   H2D     11.64  11.70
   BiDir   10.83  10.40
*************************************************************************
*************************************************************************


*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between the host and multiple devices concurrently
*************************************************************************
Concurrent Host/Device Bandwidth Matrix (GB/s), memory=Pinned
   Dir\D       0      1  Total
   H2D     12.30  12.30  24.60
   D2H     13.18  13.18  26.36
   BiDir   22.58  22.57  45.15
*************************************************************************
*************************************************************************


*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between the host and multiple devices concurrently
*************************************************************************
Concurrent Host/Device Bandwidth Matrix (GB/s), memory=Unpinned
   Dir\D       0      1  Total
   H2D     11.83  11.83  23.66
   D2H      3.75   3.73   7.48
   BiDir    9.58   9.41  18.99
*************************************************************************
*************************************************************************


*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between pairs of devices
*************************************************************************
P2P Bandwidth Matrix (GB/s) - Unidirectional, P2P=Enabled
   D\D      0      1
      0  81.08  11.33
      1  11.33  81.58
P2P Bandwidth Matrix (GB/s) - Bidirectional, P2P=Enabled
   D\D      0      1
      0  81.95  20.81
      1  20.81  81.86
*************************************************************************
*************************************************************************


*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between pairs of devices
*************************************************************************
P2P Bandwidth Matrix (GB/s) - Unidirectional, P2P=Disabled
   D\D      0      1
      0  81.86  11.36
      1  11.40  81.95
P2P Bandwidth Matrix (GB/s) - Bidirectional, P2P=Disabled
   D\D      0      1
      0  81.71  19.76
      1  19.84  81.88
*************************************************************************
*************************************************************************


*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between pairs of devices running concurrently (assumes devices are paired in order)
*************************************************************************
P2P Concurrent Exchange Bandwidth Matrix - P2P Enabled (GB/s)
   Dir\D    0<>1  Total
   R2L     11.35  11.35
   L2R     11.35  11.35
   BiDir   20.85  20.85
*************************************************************************
*************************************************************************


*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between pairs of devices running concurrently (assumes devices are paired in order)
*************************************************************************
P2P Concurrent Exchange Bandwidth Matrix - P2P Disabled (GB/s)
   Dir\D    0<>1  Total
   R2L     11.33  11.33
   L2R     11.40  11.40
   BiDir   19.05  19.05
*************************************************************************
*************************************************************************


*************************************************************************
*************************************************************************
Test Description: Bus bandwidth for a 1D exchange across devices running concurrently (assumes devices are ordered ideally)
*************************************************************************
P2P Concurrent 1D Exchange Bandwidth Matrix - P2P Enabled (GB/s)
   Dir\D       0      1  Total
   R2L     11.35   0.00  11.35
   L2R      0.00  11.35  11.35
*************************************************************************
*************************************************************************


*************************************************************************
*************************************************************************
Test Description: Bus bandwidth for a 1D exchange across devices running concurrently (assumes devices are ordered ideally)
*************************************************************************
P2P Concurrent 1D Exchange Bandwidth Matrix - P2P Disabled (GB/s)
   Dir\D       0      1  Total
   R2L     11.36   0.00  11.36
   L2R      0.00  11.38  11.38
*************************************************************************
*************************************************************************


*************************************************************************
*************************************************************************
Test Description: Bus bandwidth for a cycle exchange across devices running concurrently (assumes devices are ordered ideally)
*************************************************************************
P2P Concurrent 1D Exchange Bandwidth Matrix - P2P Enabled (GB/s)
   Dir\D      H      0      1     H  Total
   R2L      7.96   9.08   0.00  13.01  30.05
   L2R     13.01   0.00   9.09   7.97  30.06
*************************************************************************
*************************************************************************


*************************************************************************
*************************************************************************
Test Description: Bus bandwidth for a cycle exchange across devices running concurrently (assumes devices are ordered ideally)
*************************************************************************
P2P Concurrent 1D Exchange Bandwidth Matrix - P2P Disabled (GB/s)
   Dir\D      H      0      1     H  Total
   R2L     11.76   9.84   0.00  11.57  33.18
   L2R     11.58   0.00   9.83  11.76  33.18
*************************************************************************
*************************************************************************


*************************************************************************
*************************************************************************
Test Description: Bus bandwidth for an all to all across all devices running concurrently
*************************************************************************
P2P All to All Bandwidth Matrix - P2P Enabled (GB/s)
  Dir\D       0      1     Total
   Sctr      20.81  20.83  41.63
   Gthr      20.81  20.80  41.61
*************************************************************************
*************************************************************************


*************************************************************************
*************************************************************************
Test Description: Bus bandwidth for an all to all across all devices running concurrently
*************************************************************************
P2P All to All Bandwidth Matrix - P2P Disabled (GB/s)
  Dir\D       0      1     Total
   Sctr      19.71  19.88  39.59
   Gthr      20.23  19.75  39.98
*************************************************************************
*************************************************************************


*************************************************************************
*************************************************************************
Test Description: Bus latency between the host and a single device
*************************************************************************
Host/Device Latency Matrix (us), memory=Pinned
   Dir\D       0      1
   D2H      1.45   1.53
   H2D      1.42   1.56
   BiDir    2.51   2.85
*************************************************************************
*************************************************************************


*************************************************************************
*************************************************************************
Test Description: Bus latency between the host and a single device
*************************************************************************
Host/Device Latency Matrix (us), memory=Unpinned
   Dir\D       0      1
   D2H      4.45   5.33
   H2D      1.14   1.44
   BiDir    5.51   6.67
*************************************************************************
*************************************************************************


*************************************************************************
*************************************************************************
Test Description: Bus latency between pairs of GPUs
*************************************************************************
P2P Latency Matrix - P2P=Enabled (us)
   D\D      0      1
      0   0.02   0.02
      1   0.02   0.02
*************************************************************************
*************************************************************************


*************************************************************************
*************************************************************************
Test Description: Bus latency between pairs of GPUs
*************************************************************************
P2P Latency Matrix - P2P=Disabled (us)
   D\D      0      1
      0   0.02   0.02
      1   0.02   0.02
*************************************************************************
*************************************************************************
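The host↔device figures above (~12–13 GB/s per direction) are consistent with the A2's PCIe interface; assuming the Gen4 x8 link reported in the A2 datasheet (16 GT/s per lane, 128b/130b encoding), a back-of-the-envelope check of the theoretical peak looks like this. Note also that the device-to-device figures barely change between P2P=Enabled and P2P=Disabled here, since with no NVLink the traffic crosses the PCIe fabric in both cases.

```python
def pcie_effective_gbps(gt_per_s: float, lanes: int) -> float:
    """Theoretical effective PCIe bandwidth in GB/s, per direction.

    Accounts for 128b/130b line encoding only; TLP/DLLP protocol
    overhead reduces achievable throughput further.
    """
    # 1 GT/s per lane carries 1 Gbit/s of raw signal; 128/130 of that
    # is payload after encoding; divide by 8 to convert bits to bytes.
    return gt_per_s * lanes * (128 / 130) / 8

# Assumed A2 link: PCIe Gen4 (16 GT/s) x8
peak = pcie_effective_gbps(16, 8)   # about 15.75 GB/s per direction
```

The measured 12.3–13.2 GB/s is roughly 80% of that peak, a typical figure once protocol overhead and DMA setup costs are included.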
  • Another interesting test is the nbody application, which performs an efficient all-pairs gravitational n-body simulation in CUDA (see https://docs.nvidia.com/cuda/pdf/Demo_Suite.pdf). You can compare the results obtained with the CPU only, with one GPU, and with two or more GPUs via the options shown below:
[root@cld-dfa-gpu-04 ~]# /usr/local/cuda-11.7/extras/demo_suite/nbody -numbodies=10240 -cpu -benchmark
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
> Simulation with CPU
number of bodies = 10240
10240 bodies, total time for 10 iterations: 4511.497 ms
= 0.232 billion interactions per second
= 4.648 single-precision GFLOP/s at 20 flops per interaction

[root@cld-dfa-gpu-04 ~]# /usr/local/cuda-11.7/extras/demo_suite/nbody -numbodies=10240 -numdevices=1 -benchmark
number of CUDA devices  = 1
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6

> Compute 8.6 CUDA device: [NVIDIA A2]
number of bodies = 10240
10240 bodies, total time for 10 iterations: 9.462 ms
= 110.823 billion interactions per second
= 2216.450 single-precision GFLOP/s at 20 flops per interaction

[root@cld-dfa-gpu-04 ~]# /usr/local/cuda-11.7/extras/demo_suite/nbody -numbodies=10240 -numdevices=2 -benchmark

number of CUDA devices  = 2
> Windowed mode
> Simulation data stored in system memory
> Single precision floating point simulation
> 2 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6

> Compute 8.6 CUDA device: [NVIDIA A2]
> Compute 8.6 CUDA device: [NVIDIA A2]
number of bodies = 10240
10240 bodies, total time for 10 iterations: 5.336 ms
= 196.511 billion interactions per second
= 3930.218 single-precision GFLOP/s at 20 flops per interaction
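The figures printed by nbody follow directly from the benchmark parameters: each iteration evaluates all N² body pairs, and every interaction is counted as 20 floating-point operations. The reported rates can be reproduced as follows:

```python
def nbody_rates(num_bodies: int, iterations: int, total_time_ms: float):
    """Reproduce nbody's reported figures from its benchmark parameters.

    Returns (billion interactions per second, single-precision GFLOP/s),
    counting 20 flops per body-pair interaction as nbody does.
    """
    interactions = num_bodies ** 2 * iterations
    seconds = total_time_ms / 1000.0
    interactions_per_s = interactions / seconds
    gflops = interactions_per_s * 20 / 1e9
    return interactions_per_s / 1e9, gflops

# Single-GPU run from above: 10240 bodies, 10 iterations, 9.462 ms
billion_ips, gflops = nbody_rates(10240, 10, 9.462)
# gives roughly 110.8 billion interactions/s and 2216 GFLOP/s,
# matching the printed output
```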
  • All of the above tests/benchmarks can also be executed in the VMs. In principle this can help in debugging performance loss in VMs with GPUs, by comparing bare-metal and VM results.
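To make such a bare-metal vs VM comparison systematic, a small helper along these lines (hypothetical, not part of the demo suite) can flag measurements that regress beyond a chosen threshold:

```python
def compare_results(bare_metal: dict, vm: dict, max_loss: float = 0.10) -> dict:
    """Compare VM benchmark figures against bare-metal ones.

    Both dicts map a test name to a bandwidth/throughput figure (higher
    is better). Returns the tests whose VM result is more than `max_loss`
    (as a fraction) below the bare-metal reference.
    """
    regressions = {}
    for name, ref in bare_metal.items():
        if name in vm:
            loss = (ref - vm[name]) / ref
            if loss > max_loss:
                regressions[name] = round(loss, 3)
    return regressions

# Example: bare-metal figures from above vs made-up VM figures
bm = {"H2D (MB/s)": 24012.5, "D2H (MB/s)": 25735.1}
vm = {"H2D (MB/s)": 23800.0, "D2H (MB/s)": 20100.0}
# compare_results(bm, vm) flags only D2H (about 22% loss)
```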