This post is short, but an important one. If you develop multi-GPU software for a single machine equipped with several GPUs, you will very likely come across a situation in which you need to transfer data from one GPU to another, or to access the memory of another device from a kernel. The first solution that comes to mind is to transfer the data from the source GPU to CPU-side memory and then from CPU-side memory to the destination GPU. This will of course work, but it is slow compared to what NVIDIA offers with peer access.
NVIDIA peer access is extremely fast. It takes far below a millisecond to transfer a framebuffer in HD resolution, and only a couple of lines of code to make it work. It is performed in two steps: first, peer access has to be enabled; second, the data can be transferred using dedicated functions.
Peer access does not necessarily have to be supported between two CUDA devices. The function cuDeviceCanAccessPeer lets you check whether a device can access a peer device.
int canAccessPeer;
CUresult result = cuDeviceCanAccessPeer(&canAccessPeer, dev, peerDev);
if (result != CUDA_SUCCESS) {
// Handle error.
}
if (canAccessPeer == 0) {
// Device cannot access the peer device.
}
The result value canAccessPeer tells whether the device dev can access the device peerDev; it equals 1 if the device can access the peer device. Note that calling the function this way does not tell whether peerDev can access dev. You have to swap dev and peerDev and call it again to check.
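For completeness, a sketch of such a bidirectional check (reusing dev and peerDev from the snippet above) might look like this:
int devToPeer, peerToDev;
// Check both directions; each may have a different answer.
if (cuDeviceCanAccessPeer(&devToPeer, dev, peerDev) != CUDA_SUCCESS ||
    cuDeviceCanAccessPeer(&peerToDev, peerDev, dev) != CUDA_SUCCESS) {
// Handle error.
}
if (devToPeer == 1 && peerToDev == 1) {
// Peer access is possible in both directions.
}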
Enabling peer access
In order to enable peer access, the context of the source device and the context of the peer device are necessary. Then all that needs to be done is to set the source context (context) as the current one and call the function cuCtxEnablePeerAccess with the peer context:
if (cuCtxSetCurrent(context) != CUDA_SUCCESS) {
// Handle error.
}
if (cuCtxEnablePeerAccess(peerContext, 0) != CUDA_SUCCESS) {
// Handle error.
}
The second argument of cuCtxEnablePeerAccess is reserved for future use and must be zero. Both functions return a CUresult, which should be checked to see whether an error was raised.
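The snippet above assumes that context and peerContext already exist. One way to obtain them, in a minimal sketch that assumes two GPUs at ordinals 0 and 1:
CUdevice dev, peerDev;
CUcontext context, peerContext;
cuInit(0);                             // Initialize the driver API.
cuDeviceGet(&dev, 0);                  // Source device (ordinal 0).
cuDeviceGet(&peerDev, 1);              // Peer device (ordinal 1).
cuCtxCreate(&context, 0, dev);         // Context of the source device.
cuCtxCreate(&peerContext, 0, peerDev); // Context of the peer device.
// Each call returns a CUresult that should be checked as above.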
Peer access gives two possibilities. First, a kernel running on one device can directly access the memory of another device, as sketched below; of course there is some overhead, as the memory transfers still need to be done. Second, memory can be copied between two devices extremely fast. While access from a kernel is straightforward, the next section explains how to copy memory.
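For the first possibility, once peer access is enabled a kernel can simply dereference a pointer that was allocated in the peer context. A minimal, hypothetical sketch (the kernel is compiled and launched in the usual way, e.g. via cuLaunchKernel; peerData is assumed to point to memory allocated on the peer device):
// Hypothetical kernel: once peer access is enabled, a pointer allocated
// in the peer context can be dereferenced directly from this device.
__global__ void readFromPeer(const float *peerData, float *localData, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        localData[i] = peerData[i]; // The read travels over the peer link.
    }
}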
Peer copy
There are two functions dedicated to peer memory copying - synchronous and asynchronous. Both of them take:
- source and destination pointers,
- source and destination contexts,
- size.
The stream argument in the asynchronous version represents a sequence of operations that execute in the order in which they are issued.
CUresult cuMemcpyPeer(
    CUdeviceptr dst, CUcontext dstContext,
    CUdeviceptr src, CUcontext srcContext, size_t size)

CUresult cuMemcpyPeerAsync(
    CUdeviceptr dst, CUcontext dstContext,
    CUdeviceptr src, CUcontext srcContext,
    size_t size, CUstream stream)
Again, the result returned by the functions should be checked for errors.
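A small usage sketch, assuming devPtr and peerPtr are CUdeviceptr buffers of size bytes allocated with cuMemAlloc in context and peerContext respectively, and stream was created with cuStreamCreate (these names are illustrative, not from the snippets above):
// Synchronous copy from the peer device into the source device.
if (cuMemcpyPeer(devPtr, context, peerPtr, peerContext, size) != CUDA_SUCCESS) {
// Handle error.
}
// Asynchronous variant: the copy is ordered within the given stream.
if (cuMemcpyPeerAsync(devPtr, context, peerPtr, peerContext, size, stream)
        != CUDA_SUCCESS) {
// Handle error.
}
if (cuStreamSynchronize(stream) != CUDA_SUCCESS) { // Wait for completion.
// Handle error.
}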
Disabling peer access
As with enabling peer access, disabling it requires the context of the source device and of the peer device. It is performed in the same two steps: set the source context (context) as the current one and call the function cuCtxDisablePeerAccess:
if (cuCtxSetCurrent(context) != CUDA_SUCCESS) {
// Handle error.
}
if (cuCtxDisablePeerAccess(peerContext) != CUDA_SUCCESS) {
// Handle error.
}
Conclusions
As we have shown above, enabling peer access and copying between two CUDA devices is as simple as calling a couple of functions.
GPUltra copies data only between devices sitting in a single machine. Each device renders a framebuffer, which is copied to one designated device; that device merges the framebuffers and passes the result on for further processing.