- Tie up some loose threads from the practicals.
- A taste of more advanced kernel programming.

In the practicals, we’ve been very lazy about error checking.

Instead, the usual idiom is:

```
cudaError_t err = cudaMalloc(...); // or kernel call, kernel<<<...>>>()
if (err != cudaSuccess) {
printf("error: %s\n", cudaGetErrorString(err));
}
```

See Error Handling in the CUDA Runtime API documentation.

In the practicals, we allocated all memory upfront. What about situations where we do not know memory requirements upfront?

- CUDA has
`cudaMallocAsync`

and`cudaFreeAsync`

for streaming allocation of**device memory**. It achieves this with memory pools.

Some alternatives:

In the practicals, we used one stream to enqueue kernels for the GPU. We can in fact use many.

- Multiple streams can execute concurrently, allowing
*concurrent kernel execution*. - When we say that a GPU can do task parallelism, not just data parallelism, this is mostly what we mean.
- Where individual kernels are not large enough to occupy the GPU, launching several concurrently may be.
- Multiple streams can also be used to run on multiple GPUs.

In the practicals, we used

`cudaDeviceSynchronize()`

to wait for the GPU to complete work enqueued in a stream. Events facilitate more fine-grained synchronization.

- We create and destroy events with
`cudaEventCreate()`

and`cudaEventDestroy()`

. - We
*record*an event in a stream with`cudaEventRecord()`

. - We can have the CPU wait for the event with
`cudaEventSynchronize()`

. - We can have another stream wait for the event with
`cudaStreamWaitEvent()`

.

In the practicals, we only looked at kernels that perform simple

transformations, where an operation is applied element-wise to a vector or matrix, and threads do not need to interact.

- We may be interested in kernels that involve interacting threads, such as
*reductions*,*scans*,*sorts*and other such primitive operations. The*shuffle*that we used in the practicals is one such kernel, although we avoided implementing it for GPU. - We can write more complex transformations that involve arbitrary C++ code, including conditionals (
`if`

,`else`

), loops (`for`

,`while`

) and function calls.

When might we need to write our own custom kernel?

This is needed for

automatic marginalizationin Birch.

The problem:

- given a random variable $x$ on the integers $\{0,\ldots,m-1\}$ with a vector of probabilities $\mathbf{p}$ with $p_i = P(x = i)$, and
- another random variable $y$ on the integers $\{0,\ldots,n-1\}$ with a vector of probabilities $\mathbf{q}$ with $q_i = P(y = i)$,
- consider the sum $x + y$ on the integers $\{0, m + n - 2\}$ and compute its vector of probabilities $\mathbf{r}$ with $r_i = P(x + y = i)$.

The basic computation: $r_i = \sum_{j = \max(0,i-n+1)}^{\min (m-1,i)} p_j q_{i-j}.$

We can write this in matrix form as: $\mathbf{r} = \left(\begin{array}{cccccc} p_0 & 0 & \ldots & \ldots & \ldots \\ \vdots & p_0 & 0 & \ldots & \ldots \\ p_{m-1} & \vdots & p_0 & 0 & \ldots \\ 0 & p_{m-1} & \vdots & \ddots & 0 \\ \vdots & 0 & p_{m-1} & \vdots & p_0 \\ \vdots & \vdots & 0 & \ddots & \vdots \\ \vdots & \vdots & \vdots & 0 & p_{m-1} \\ \end{array}\right) \mathbf{q},$ where there are $n$ number of columns.

Or illustrate it for $m = n = 4$:

The numbers are not representing the values of the elements here, but rather the associated $x$ and $y$. We can see that the matrix-vector product accumulates the probabilities of pairs of $x$ and $y$ that sum to the same outcome.

This is the product of a

banded matrixand a vector.

- cuBLAS provides the function family
`gbmv`

for the multiplication of banded matrices and vectors. - Recall from the practicals that we specify a matrix with a base pointer (e.g.
`A`

) and lead (e.g.`ldA`

). -
For this banded matrix we could use a lead of

`-1`

. - Off-band elements are ignored, so it does not matter if they are invalid.

Unfortunately, as with most (all?) BLAS implementations, the lead must be positive and at least as large as the number of rows in the matrix or an error is given. So we’ll need a custom kernel!

The length of $\mathbf{r}$ is $m + n - 1$. We can assign one thread to compute each element of $\mathbf{r}$.

```
__global__ void kernel_enumerate(const int m, const int n, const float* p,
const int incp, const float* q, const int incq, float* r,
const int incr) {
/* element of r for which thread is responsible */
int i = threadIdx.y + blockIdx.y*blockDim.y;
/* sum across elements in each thread */
float result = 0.0f;
for (j = 0; j < n; ++j) {
if (0 <= i - j && i - j < m) {
result += p[(i - j)*incp]*q[j*incq];
}
}
/* write element */
r[i] = result;
}
```

Next issue:Each thread has a different amount of work to do, leading towarp divergence.

- The threads in a warp can execute only one common instruction at a time.
- If threads take different execution paths, those execution paths must be serialized, e.g.
- Threads split between true and false branches of a conditional: first threads on the true branch execute then threads on the false branch execute.
- Threads have different trip counts for a loop: each thread stops after its trip count is reached while others may continue.

- This is called
*warp divergence*.

The fact that warps

candiverge makes programming kernels much easier, but reducing warp divergence will improve performance.

Consider the following transformation of the problem:

- We now use $m$ threads, where each computes two elements of $r$ (besides thread $m - 1$, which only computes one).
- Each thread now has an equal amount of work to do.

```
__global__ void kernel_enumerate(const int m, const int n, const float* p,
const int incp, const float* q, const int incq, float* r,
const int incr) {
/* first element of r for which thread is responsible */
int i = threadIdx.y + blockIdx.y*blockDim.y;
/* sum across elements in each thread */
float result1 = 0.0f, result2 = 0.0f;
for (j = 0; j < n; ++j) {
if (0 <= i - j) {
result1 += p[(i - j)*incp]*q[j*incq];
} else {
result2 += p[(m + i - j)*incp]*q[j*incq];
}
}
/* write elements */
r[i] = result1;
if (i < m - 1) {
r[i + m] = result2;
}
}
```

Next issue:Each thread reads the whole vector $\mathbf{q}$ from device memory.

We can use

shared memoryto read $\mathbf{r}$ from device memory only once per threadblock, i.e. threads in the same block can share a single read.

- Within a thread block, shared memory can be used for communication between threads.
- Shared memory is fast. It takes a chunk of L1 cache, the amount of which can be configured with
`cudaDeviceSetCacheConfig()`

.

```
__global__ void kernel_enumerate(const int m, const int n, const float* p,
const int incp, const float* q, const int incq, float* r,
const int incr) {
__shared__ float* q_shared;
int i = threadIdx.y + blockIdx.y*blockDim.y;
float result1 = 0.0f, result2 = 0.0f;
for (int base = 0; base < n; base += blockDim.y) {
/* read the next batch into shared memory */
int j = threadIdx.y;
q_shared[j] = base + j < n ? q[(base + j)*incq] : 0.0f;
__syncthreads();
/* sum across elements in each thread */
for (j = 0; j < blockDim.y; ++j) {
if (0 <= i - j) {
result1 += p[(base + i - j)*incp]*q_shared[j];
} else {
result2 += p[(base + m + i - j)*incp]*q_shared[j];
}
}
}
/* set the final result */
r[i] = result1;
if (i < m - 1) {
r[i + m] = result2;
}
}
```

When calling the kernel, a third argument is given to specify the amount of shared memory required: `kernel_enumerate<<<grid,block,shared>>>(...)`

.

Next issue:The kernel offers $m$-way parallelism, which may not be enough to occupy the GPU.

The current kernel uses a

threadper two elements of the output, giving $m$-way parallelism. If we use awarpper two elements of the output, we can increase this to $32m$-way parallelism.

Each thread of the warp computes a partial sum, then they interact to compute the total sum. We have a few options for the second step:

- Synchronization functions, the most basic of which is
`__syncthreads()`

. - Fence functions
- Atomic functions
- Warp shuffle functions
- Warp reduce functions

```
__global__ void kernel_enumerate(const int m, const int n,
const float* p, const int incp, const float* q, const int incq, float* r,
const int incr) {
assert(blockDim.x == warpSize && gridDim.x == 1);
__shared__ float* q_shared;
int i = blockIdx.x*blockDim.x;
float result1 = 0.0f, result2 = 0.0f;
for (int base = 0; base < n; base += warpSize*blockDim.y) {
/* read the next batch into shared memory */
int j = threadIdx.y*warpSize + threadIdx.x;
q_shared[j] = base + j < n ? q[(base + j)*incq] : 0.0f;
__syncthreads();
/* sum across elements in each thread */
for (j = threadIdx.x; j < warpSize*blockDim.y; j += warpSize) {
/* read the next batch into shared memory */
if (0 <= i - j) {
result1 += p[(base + i - j)*incp]*q_shared[j];
} else {
result2 += p[(base + m + i - j)*incp]*q_shared[j];
}
}
}
/* sum across threads of warp, using butterfly sum */
for (int k = 16; k >= 1; k /= 2) {
result1 += __shfl_xor_sync(0xffffffff, result1, k, warpSize);
result2 += __shfl_xor_sync(0xffffffff, result2, k, warpSize);
}
/* set the final result, only first thread in each warp */
if (threadIdx.x == 0) {
r[i] = result1;
if (i < m - 1) {
r[i + m] = result2;
}
}
}
```

- Version 0
- We started with a basic implementation of a kernel to enumerate a sum of two discrete random variables.
- Version 1
- We improved performance by reducing warp divergence.
- Version 2
- We improved performance by using shared memory to avoid repeated reads from device memory.
- Version 3
- We increased parallelism by having whole warps, rather than single threads, share the computation of each element of the output.

- If results are different with and without synchronization, there is certainly a problem! You can debug such issues by setting
`CUDA_LAUNCH_BLOCKING=1`

to enable synchronization. - For more focused interventions, insert
`cudaDeviceSynchronize()`

after any kernel launch and print out values for debugging. - Printing those values is much easier if you stick to using managed memory with
`cudaMallocManaged()`

rather than device memory with`cudaMalloc()`

. It avoids the nuisance of copying values back to host memory first. - Using single precision during development can be a good way to detect and debug numerical issues, even if intending to use double precision for actual runs.
- Given lack of extended precision, and especially if tempted by single precision, be careful with numerics:
- Look for opportunities to use
`log-sum-exp`

or`log-softmax`

operations instead of`log(sum(exp(x)))`

and`log(exp(x)/sum(exp(x)))`

. - Look for opportunities to use
`log1p`

and`expm1`

instead of`log(1.0 + x)`

and`exp(x) - 1.0`

. - Avoid large sums.

- Look for opportunities to use

- The CUDA Programming Guide is worth browsing.
- The Nvidia Technical Blog has some good tutorial style articles on more specific topics.
- My work at indii.org and birch.sh.