- Two tests can be used to measure the bandwidth over the buses; the first is bandwidthTest:
[root@cld-dfa-gpu-04 ~]# /usr/local/cuda-11.7/extras/demo_suite/bandwidthTest -device=all
[CUDA Bandwidth Test] - Starting...

!!!!!Cumulative Bandwidth to be computed from all the devices !!!!!!

Running on...

 Device 0: NVIDIA A2
 Device 1: NVIDIA A2
 Quick Mode

 Host to Device Bandwidth, 2 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     24012.5

 Device to Host Bandwidth, 2 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     25735.1

 Device to Device Bandwidth, 2 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     319230.6

Result = PASS
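For automated checks it can be convenient to pull the bandwidth figure out of the tool's output. A minimal sketch (`extract_bw` is a hypothetical helper name; the here-doc just replays a fragment of the output above — on the node you would pipe the tool directly):

```shell
#!/bin/sh
# Hypothetical helper: print the Bandwidth(MB/s) column for the 32 MiB
# transfer size reported by bandwidthTest in Quick Mode.
extract_bw() {
    awk '/^ *33554432/ { print $2 }'
}

# Example on a captured fragment of the output; on the host you would run:
#   /usr/local/cuda-11.7/extras/demo_suite/bandwidthTest -device=all | extract_bw
extract_bw <<'EOF'
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     24012.5
EOF
```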
The second is busGrind:
[root@cld-dfa-gpu-04 ~]# /usr/local/cuda-11.7/extras/demo_suite/busGrind -a
Device: 0, NVIDIA A2, pciBusID: 17, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA A2, pciBusID: ca, pciDeviceID: 0, pciDomainID:0
P2P Cliques:
Clique: 0 [0 1]
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between the host and a single device
*************************************************************************
Host/Device Bandwidth Matrix (GB/s), memory=Pinned
Dir\D 0 1
D2H 13.17 13.18
H2D 12.30 12.31
BiDir 22.58 22.58
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between the host and a single device
*************************************************************************
Host/Device Bandwidth Matrix (GB/s), memory=Unpinned
Dir\D 0 1
D2H 8.07 10.03
H2D 11.64 11.70
BiDir 10.83 10.40
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between the host and multiple devices concurrently
*************************************************************************
Concurrent Host/Device Bandwidth Matrix (GB/s), memory=Pinned
Dir\D 0 1 Total
H2D 12.30 12.30 24.60
D2H 13.18 13.18 26.36
BiDir 22.58 22.57 45.15
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between the host and multiple devices concurrently
*************************************************************************
Concurrent Host/Device Bandwidth Matrix (GB/s), memory=Unpinned
Dir\D 0 1 Total
H2D 11.83 11.83 23.66
D2H 3.75 3.73 7.48
BiDir 9.58 9.41 18.99
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between pairs of devices
*************************************************************************
P2P Bandwidth Matrix (GB/s) - Unidirectional, P2P=Enabled
D\D 0 1
0 81.08 11.33
1 11.33 81.58
P2P Bandwidth Matrix (GB/s) - Bidirectional, P2P=Enabled
D\D 0 1
0 81.95 20.81
1 20.81 81.86
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between pairs of devices
*************************************************************************
P2P Bandwidth Matrix (GB/s) - Unidirectional, P2P=Disabled
D\D 0 1
0 81.86 11.36
1 11.40 81.95
P2P Bandwidth Matrix (GB/s) - Bidirectional, P2P=Disabled
D\D 0 1
0 81.71 19.76
1 19.84 81.88
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between pairs of devices running concurrently (assumes devices are paired in order)
*************************************************************************
P2P Concurrent Exchange Bandwidth Matrix - P2P Enabled (GB/s)
Dir\D 0<>1 Total
R2L 11.35 11.35
L2R 11.35 11.35
BiDir 20.85 20.85
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth between pairs of devices running concurrently (assumes devices are paired in order)
*************************************************************************
P2P Concurrent Exchange Bandwidth Matrix - P2P Disabled (GB/s)
Dir\D 0<>1 Total
R2L 11.33 11.33
L2R 11.40 11.40
BiDir 19.05 19.05
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth for a 1D exchange across devices running concurrently (assumes devices are ordered ideally)
*************************************************************************
P2P Concurrent 1D Exchange Bandwidth Matrix - P2P Enabled (GB/s)
Dir\D 0 1 Total
R2L 11.35 0.00 11.35
L2R 0.00 11.35 11.35
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth for a 1D exchange across devices running concurrently (assumes devices are ordered ideally)
*************************************************************************
P2P Concurrent 1D Exchange Bandwidth Matrix - P2P Disabled (GB/s)
Dir\D 0 1 Total
R2L 11.36 0.00 11.36
L2R 0.00 11.38 11.38
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth for a cycle exchange across devices running concurrently (assumes devices are ordered ideally)
*************************************************************************
P2P Concurrent 1D Exchange Bandwidth Matrix - P2P Enabled (GB/s)
Dir\D H 0 1 H Total
R2L 7.96 9.08 0.00 13.01 30.05
L2R 13.01 0.00 9.09 7.97 30.06
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth for a cycle exchange across devices running concurrently (assumes devices are ordered ideally)
*************************************************************************
P2P Concurrent 1D Exchange Bandwidth Matrix - P2P Disabled (GB/s)
Dir\D H 0 1 H Total
R2L 11.76 9.84 0.00 11.57 33.18
L2R 11.58 0.00 9.83 11.76 33.18
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth for an all to all across all devices running concurrently
*************************************************************************
P2P All to All Bandwidth Matrix - P2P Enabled (GB/s)
Dir\D 0 1 Total
Sctr 20.81 20.83 41.63
Gthr 20.81 20.80 41.61
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus bandwidth for an all to all across all devices running concurrently
*************************************************************************
P2P All to All Bandwidth Matrix - P2P Disabled (GB/s)
Dir\D 0 1 Total
Sctr 19.71 19.88 39.59
Gthr 20.23 19.75 39.98
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus latency between the host and a single device
*************************************************************************
Host/Device Latency Matrix (us), memory=Pinned
Dir\D 0 1
D2H 1.45 1.53
H2D 1.42 1.56
BiDir 2.51 2.85
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus latency between the host and a single device
*************************************************************************
Host/Device Latency Matrix (us), memory=Unpinned
Dir\D 0 1
D2H 4.45 5.33
H2D 1.14 1.44
BiDir 5.51 6.67
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus latency between pairs of GPUs
*************************************************************************
P2P Latency Matrix - P2P=Enabled (us)
D\D 0 1
0 0.02 0.02
1 0.02 0.02
*************************************************************************
*************************************************************************
*************************************************************************
*************************************************************************
Test Description: Bus latency between pairs of GPUs
*************************************************************************
P2P Latency Matrix - P2P=Disabled (us)
D\D 0 1
0 0.02 0.02
1 0.02 0.02
*************************************************************************
*************************************************************************
- Another interesting test is the nbody application, which performs an efficient all-pairs gravitational n-body simulation in CUDA (see https://docs.nvidia.com/cuda/pdf/Demo_Suite.pdf). You can compare the results between CPU only, one GPU, and two or more GPUs with the options shown below:
[root@cld-dfa-gpu-04 ~]# /usr/local/cuda-11.7/extras/demo_suite/nbody -numbodies=10240 -cpu -benchmark
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
> Simulation with CPU
number of bodies = 10240
10240 bodies, total time for 10 iterations: 4511.497 ms
= 0.232 billion interactions per second
= 4.648 single-precision GFLOP/s at 20 flops per interaction

[root@cld-dfa-gpu-04 ~]# /usr/local/cuda-11.7/extras/demo_suite/nbody -numbodies=10240 -numdevices=1 -benchmark
number of CUDA devices  = 1
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6

> Compute 8.6 CUDA device: [NVIDIA A2]
number of bodies = 10240
10240 bodies, total time for 10 iterations: 9.462 ms
= 110.823 billion interactions per second
= 2216.450 single-precision GFLOP/s at 20 flops per interaction

[root@cld-dfa-gpu-04 ~]# /usr/local/cuda-11.7/extras/demo_suite/nbody -numbodies=10240 -numdevices=2 -benchmark
number of CUDA devices  = 2
> Windowed mode
> Simulation data stored in system memory
> Single precision floating point simulation
> 2 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6

> Compute 8.6 CUDA device: [NVIDIA A2]
> Compute 8.6 CUDA device: [NVIDIA A2]
number of bodies = 10240
10240 bodies, total time for 10 iterations: 5.336 ms
= 196.511 billion interactions per second
= 3930.218 single-precision GFLOP/s at 20 flops per interaction
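The summary figures follow directly from the all-pairs cost: interactions = numbodies² × iterations, and 20 flops per interaction, as stated in the output itself. A quick sanity check of the one-GPU run above (small differences from the printed summary come from rounding of the reported time):

```shell
#!/bin/sh
# Recompute the nbody summary from the reported run:
# 10240 bodies, 10 iterations, 9.462 ms on one GPU.
awk 'BEGIN {
    n = 10240; iters = 10; ms = 9.462
    interactions = n * n * iters           # all-pairs interactions
    per_sec = interactions / (ms / 1000)   # interactions per second
    printf "%.3f billion interactions/s\n", per_sec / 1e9
    printf "%.3f GFLOP/s\n", per_sec * 20 / 1e9   # 20 flops per interaction
}'
```

This prints roughly 110.820 billion interactions/s and 2216.394 GFLOP/s, matching the tool's own summary to within rounding.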
- All of the above tests/benchmarks can also be executed in the VMs. In principle this can help debug performance loss in VMs with GPUs by comparing bare-metal and VM results.
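A minimal sketch of such a comparison, assuming the relevant bandwidth figures have already been extracted from the bare-metal and VM runs (the `compare_bw` helper name and the 10% tolerance are arbitrary examples, not NVIDIA recommendations):

```shell
#!/bin/sh
# Compare a bandwidth figure between a bare-metal run and a VM run of the
# same test, flagging drops larger than an arbitrary 10% tolerance.
compare_bw() {
    # $1 = bare-metal value (MB/s), $2 = VM value (MB/s)
    awk -v bm="$1" -v vm="$2" 'BEGIN {
        drop = (bm - vm) / bm * 100
        printf "bare-metal %.1f MB/s, VM %.1f MB/s, drop %.1f%%\n", bm, vm, drop
        exit (drop > 10) ? 1 : 0
    }'
}

# Example with the Host-to-Device figure from the bare-metal run above and a
# hypothetical VM measurement:
compare_bw 24012.5 22800.0
```

The non-zero exit status on excessive drops makes the check easy to wire into a monitoring or acceptance-test script.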