
Matrix Multiplication On GPU: Part 3, Coding for Speed

Tips and tricks to help the compiler optimize

Nvidia’s nvcc compiler and ptxas assembler do a good job optimizing CUDA kernels, but we cannot expect magic from them. There are coding patterns that can help produce better results. Ideally, these become habits.

This is the third in a series of posts about implementing a CUDA kernel for matrix-matrix multiplication on GPU. It focuses on code-level tips and tricks to get the best performance by facilitating compiler optimizations. Those tips and tricks may translate to other CUDA kernels too. Previous posts in the series provide the high-level algorithm, empirical evaluation, and code, as well as a breakdown of the tiling structure for optimal cache utilization.

The post is structured as a list of tips and tricks that emerged in the course of the kernel development. There are no timing results, as the ideas may or may not improve performance according to the surrounding code and context. They are meant as ideas to try in your own code; if they work, great, and if they don't, try something else. Roughly, they are in order of importance: those earlier in the list are more likely to yield improvements, and larger improvements at that.

We might imagine that the compiler can do some or all of these optimizations for us. It may. My experience is that sometimes it does and sometimes it does not, for reasons that are opaque to me but perhaps not to a compiler engineer. Mostly this is about good habits, or good rules of thumb, that tend to work well.

  1. Organize nested loops around iteration dependencies
  2. Keep static expressions together
  3. Use __restrict__ on pointers
  4. Keep loop indexing simple
  5. Try serial indexing
  6. Futz wicked indexing
  7. Add the --resource-usage option when calling nvcc
  8. Summary

Organize nested loops around iteration dependencies

Nested loops should be arranged to minimize dependencies between iterations of the innermost loops, and so maximize instruction-level parallelism.

Consider the following tight matrix multiplication, where K, N and M are sizes small enough such that the matrices A, B and C fit into registers.

for (int i = 0; i < M; ++i) {
  for (int j = 0; j < N; ++j) {
    for (int k = 0; k < K; ++k) {
      C(i, j) += A(i, k)*B(k, j);
    }
  }
}

The loops can be nested in any of six configurations (ijk, ikj, jik, jki, kij, kji), all of which are logically equivalent but differ in their dependency structure. Iterations of the i and j loops are independent, but the k loop accumulates into C(i, j), so that each of its iterations depends on the previous one. With k as the innermost loop, successive dependent iterations are only a few clock cycles apart, and the processor may stall waiting for the result of the previous iteration before it can continue (recall that instructions can take multiple cycles to complete, and other instructions can execute while they do). Furthermore, as the innermost loop, it runs M*N times.

It is much faster to make the k loop the outermost loop:

for (int k = 0; k < K; ++k) {
  for (int j = 0; j < N; ++j) {
    for (int i = 0; i < M; ++i) {
      C(i, j) += A(i, k)*B(k, j);
    }
  }
}

There is now plenty of independent work between successive iterations of the k loop (the whole of the j and i loops), and as the k loop is now outermost, it runs only once rather than M*N times. The compiler has much more scope to parallelize at the instruction level, and the overall result is faster.
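
For concreteness, here is a minimal sketch of the reordered loops; the function signature, column-major indexing and #pragma unroll directives are assumptions for illustration rather than the kernel's actual code.

// A minimal sketch of the reordered multiply, assuming the function is
// inlined and A, B and C point to thread-local, column-major arrays small
// enough to live entirely in registers. Full unrolling exposes the
// independent multiply-adds of the j and i loops to the scheduler.
template<int M, int N, int K>
__device__ __forceinline__ void small_multiply(const float* A, const float* B,
                                               float* C) {
  #pragma unroll
  for (int k = 0; k < K; ++k) {
    #pragma unroll
    for (int j = 0; j < N; ++j) {
      #pragma unroll
      for (int i = 0; i < M; ++i) {
        C[i + j*M] += A[i + k*M]*B[k + j*K];
      }
    }
  }
}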

Keep static expressions together

If i is a variable whose value is known only at runtime, and M1 and M2 are constexpr with values known at compile time, do write i*(M2/M1) and not i*M2/M1. The former computes M2/M1 at compile time and performs just one integer multiplication at runtime. The latter is parsed as (i*M2)/M1 and performs one integer multiplication and one integer division at runtime. This matters in tight code.

If M1 and M2 are powers of two, integer multiplications and divisions will convert to cheaper bit shifts, but one instruction is still better than two.

Generally, it is best not to assume that a compiler will make optimizations to integer operations beyond bit shifts for powers of two. The potential for overflow often prevents it from doing so.
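
As a small illustration (the values of M1 and M2 here are placeholders, chosen so that M2/M1 divides evenly):

constexpr int M1 = 3;   // hypothetical tile sizes
constexpr int M2 = 12;

__device__ int fast_index(int i) {
  return i*(M2/M1);  // M2/M1 folds to 4 at compile time: one multiplication at runtime
}

__device__ int slow_index(int i) {
  return i*M2/M1;    // parsed as (i*M2)/M1: a multiplication and then a division at runtime
}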

Use __restrict__ on pointers

This is a common compiler extension that corresponds to the restrict type qualifier in C. Its usage is documented elsewhere, but it does seem to make a difference with nvcc.

Use of the --restrict option to nvcc does not seem to achieve the same as qualifying pointers with __restrict__.
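
In practice this means qualifying every pointer parameter of the kernel. A sketch of what a signature might look like (the names here are illustrative, not the kernel's actual interface):

// __restrict__ promises the compiler that A, B and C never alias one another,
// allowing loads to be hoisted and reordered more freely.
__global__ void matmul(const float* __restrict__ A,
                       const float* __restrict__ B,
                       float* __restrict__ C,
                       int M, int N, int K);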

Keep loop indexing simple

A simple loop, with an index that starts at zero and increments by one each time, seems to yield the best performance.

Consider the case where we are striding through an array of length N (known at compile time), starting at index s (only known at runtime), with a stride length of S (known at compile time). Logically, this is just fine:

for (int i = s; i < N; i += S) {
  // ...
}

but this may yield better performance:

for (int j = 0; j < N/S; ++j) {
  int i = s + j*S;
  // ...
}

In the first case, both the dependency of the loop index i on the starting index s and the non-unit stride S can seemingly harm performance, likely because they make it harder for the compiler to determine the trip count and whether the loop can be unrolled.

Be careful, as these two loops are not quite equivalent! If s >= N, the original loop is never entered, while the revised one is. If there is a possibility for s >= N, the code may need further revision.
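
In the common case where s is derived from a thread index, so that 0 <= s < S, and S divides N, the two forms visit exactly the same indices and this concern does not arise. A sketch under those assumptions:

// Strided copy with a simple, zero-based loop index. Assumes 0 <= s < S and
// that S divides N, so every j in [0, N/S) yields a valid index i < N.
template<int N, int S>
__device__ void strided_copy(float* dst, const float* src, int s) {
  #pragma unroll
  for (int j = 0; j < N/S; ++j) {
    int i = s + j*S;
    dst[i] = src[i];
  }
}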

Try serial indexing

We are getting to the obscure end of our tips and tricks, and this one may help or may hinder.

Let x be a two-dimensional array with R rows and C columns, stored in column-major order with a stride of L between columns (sometimes called a lead—thus L). The array may have been declared float x[L*C]. To access the element in the ith row and jth column, we could write x[i + j*L].

Now, to copy this array x to another array y, where both have the same sizes R and C but different strides LX and LY, we could write:

for (int j = 0; j < C; ++j) {
  for (int i = 0; i < R; ++i) {
    y[i + j*LY] = x[i + j*LX];
  }
}

In the case that R equals LY, the elements of y are stored contiguously. Further to this, if R (== LY) is a power of two known at compile time, so that s%R and s/R reduce to cheap bit operations, an alternative serial indexing pattern that may improve performance is:

for (int s = 0; s < R*C; ++s) {
  int i = s%R;
  int j = s/R;
  y[s] = x[i + j*LX];
}

Of course, if LX also equals R then x is also stored contiguously, and the whole copy can be done serially, as though both y and x are one-dimensional arrays.
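
That fully contiguous case is the simplest of all:

// If LX == LY == R, both x and y are contiguous and the copy is purely serial.
for (int s = 0; s < R*C; ++s) {
  y[s] = x[s];
}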

Futz wicked indexing

Even more obscure is leaning into the SASS assembly IADD3 instruction that adds three integers in one hit.

To introduce this, the matrix-matrix multiplication kernel uses union types to offer a float4 view of x, denoted x4, to facilitate 128-bit loads. It looks something like:

template<int R, int C>
union tile {
  float x[R*C];
  float4 x4[R*C/4];
};

When using x4 instead of x, two-dimensional indexing becomes x4[(i + j*L)/4], i.e. just divide the index by four.
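
As a hypothetical usage, with a tile held in shared memory (the sizes and names here are illustrative, and i is assumed to be a multiple of four):

__shared__ tile<64, 64> t;        // hypothetical 64x64 tile, so the lead L is 64
float4 v = t.x4[(i + j*64)/4];    // one 128-bit load of elements i..i+3 of column j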

The hierarchical partitioning of matrices into tiles can lead to some wicked indexing. Prototypical code is something like:

x4[(i0 + i1 + (j0 + j1)*M)/4];

Here, i0 and j0 are indices giving the corner of a tile within the original array, and i1 and j1 index further into that tile. These are indices into the float array x, translated into indices for the float4 array x4 by the division by four.

There are numerous ways of rewriting that indexing formula, and it can be worth futzing around to see which performs best, squeezing out a few extra percent of overall performance.

Instruction-level parallelism plays a role again. Currently the code must compute i0 + i1 + (j0 + j1)*M before dividing by 4 (which will compile to a bit shift—we can certainly expect that of the compiler). Perhaps counterintuitively, the following can be faster:

x4[(i0 + i1)/4 + (j0 + j1)*(M/4)];

A further curiosity is the IADD3 instruction, which adds three integers in a single instruction. So the following may be even faster again, facilitating IADD3 for the final addition of three terms (note that i0/4 + i1/4 equals (i0 + i1)/4 when at least one of them is a multiple of four, as a tile corner index typically is, but not in general):

x4[i0/4 + i1/4 + (j0 + j1)*(M/4)];

Success here may depend on instruction mix and pressure on the various units of the processor (e.g. the integer unit and floating point unit).

Add the --resource-usage option when calling nvcc

When running nvcc, adding the --resource-usage option will output the number of registers used by each kernel as it is compiled. Dramatic and surprising slowdowns are often caused by crossing an occupancy boundary (e.g. 64, 96 or 128 registers per thread), or by spilling registers into memory. The reasons for this can be opaque, but at least --resource-usage makes the diagnosis clear.
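
The option is passed alongside the usual flags, for example (the architecture and file names here are placeholders):

nvcc --resource-usage -arch=sm_80 -c matmul.cu -o matmul.o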

If it is necessary to reduce register use slightly to prevent a spill, try precomputing or recomputing values to reduce the number of live registers required to keep the state of the program.

Summary

This post has presented some code-level tips and tricks to improve the performance of CUDA kernels by revealing optimization opportunities to the compiler that may otherwise not be apparent. Some of the tricks require fiddly work, but even the fiddly work can yield an overall performance improvement of several percentage points.