blog

Foundations of GPU Computing: Practical Exercises #2

Profiling code in Nsight Systems and refactoring to improve performance.

8 min read / 12 Feb 23

In this practical we continue with the example of a deep neural network implemented in C. We learn to profile with Nsight Systems, measure headroom on concurrency, and refactor the code to improve performance.

This practical is part of the course Foundations of GPU Computing.

Not familiar with C?

The practical does not require you to write any C code from scratch, only make specific modifications. Tips are provided along the way.

Prerequisites
Profiling (20 minutes)
Aim (5 minutes)
Headroom (5 minutes)
First attempt: sharing memory (10 minutes)
Second attempt: replicating the training data (10 minutes)
Third attempt: prefetching (10 minutes)
Summary
Possible extensions

Prerequisites

Make sure that your system satisfies the prerequisites given on the main course page. We will use both Visual Studio Code and Nsight Systems in this practical.

Start Visual Studio code and, if you are using a remote system, connect to it. Start Nsight Systems and File > New Project. Click Select target for profiling… and, if you are using a remote system, Configure targets… to connect to it, otherwise select your local system. If you are using a remote system, Nsight Systems will take a minute or two to configure it.

Profiling (20 minutes)

In the previous practical we only looked at overall execution time. To obtain more detailed performance information we can profile the code using Nsight Systems.

Start Nsight Systems, create a new project, and enter the following settings:

Command line with arguments: ./main

Working directory: /home/👤user/gpu-course where 👤user is your username.

Check the boxes:

Collect CPU IP/backtrace samples

Collect CUDA trace, and under it:

Collect UM CPU page faults

Collect UM GPU page faults

Collect cuBLAS trace

You can uncheck all the other boxes—they won’t hurt, but will make for busy reports with a lot of extraneous information for the purposes of this exercise.

Click Start to run the program and collect profiling information.

After a few moments the timeline view should appear.

On the left, expand CUDA GW > Context 1 > Kernels and Threads > main > CUDA API to inspect the activity of the GPU and CPU, respectively.

You can navigate the timeline view using the following basic controls:

To zoom in, click and drag to select an interval, then right-click and choose Zoom into Selection (shortcut: Shift+Z).
To return to the previous zoom level, right-click and choose Undo Zoom (shortcut: Backspace).
To reset the view, right-click and choose Reset Zoom.
To run again, return to the Project tab and click the Start button. This will create a new report in a new tab (it will not overwrite previous reports).

Zoom in until you can make out a repeating pattern like the below.

Diagram

We can understand this as the training and testing loop, implemented in src/main.c:

The CPU loops over minibatches of the training and test data, enqueuing kernels to a stream to perform the forward pass, backward pass, and gradient updates.
Meanwhile, the GPU executes kernels from the stream asynchronously.
Once all kernels are enqueued, the CPU waits for the GPU to finish with a call to cudaDeviceSynchronize().
The CPU shuffles the training data for the next epoch. Because this does not involve any CUDA calls, the work may not be visible in the profile, but we understand that it occurs in this apparent gap.
The next epoch begins, repeating steps 1-4.

Aim (5 minutes)

There is an opportunity to improve performance if we can swap steps 3 and 4, so that the CPU shuffles data for the next epoch while the GPU is still on the training and testing loop for the current epoch (step 2). This should allow us to compress the timeline to the following: Diagram

For smaller models, the time taken for 3 may be longer than the time taken for 4. This is because the time taken for the shuffle is a function of the data size, while the time taken to perform training and testing is a function of both the data size and model size. For such cases, the timeline that we can achieve is more like this: Diagram

Headroom (5 minutes)

We can assess the headroom for such an improvement by disabling the shuffle entirely.

In Visual Studio Code, comment out the line data_shuffle(&d) in src/main.c, then rebuild and rerun. How much does performance improve? Once you have established the headroom, uncomment the line again.

Not familiar with C?

Comment out a line of code by adding // in front of it, or by wrapping the whole line in /* and */.

In the current implementation, we cannot allow the execution of steps 2 and 4 to overlap, as the GPU would be reading the training data while the CPU is modifying it, and the results would be incorrect. Nonetheless, let’s start that way, as it leads to some interesting observations.

In src/main.c, move the following two lines:
/* shuffle data for next time */
data_shuffle(&d);
to above the two lines:
/* finalize loss and report progress */
cudaDeviceSynchronize();
Rebuild with rerun.

In addition to being incorrect, as expected, this also runs considerably slower, which may be surprising. Let’s investigate with Nsight Systems again.

In Nsight Systems, return to the project tab, and click the Start button to collect a new profile. Switch between the tab for the first report and the second report. Can you diagnose what has happened here?

Hint: Expand the UVM page fault group on the left. UVM is Unified Virtual Memory.

We are seeing many more page faults than before. Recall that we are using managed memory, which means that pages of virtual memory swap between main and device memory on demand. A page fault occurs when either the CPU or GPU attempts to access a page not in its memory, triggering the swap. Too much swapping harms performance. On this occasion the swaps occur because the CPU and GPU are accessing the training data (d.X_train) concurrently.

Second attempt: replicating the training data (10 minutes)

To fix both the correctness and performance issue we can proceed as follows:

Maintain two copies of the training data, call them A and B.
On every odd numbered epoch, the GPU reads from A while the CPU shuffles B.
On every even numbered epoch, the CPU reads from B while the GPU shuffles A.

In this way, the CPU and GPU are always operating on different memory to minimize swaps.

Create a copy of the training data. Just before /* start timer */, insert the following code:
float* X_train = NULL;
size_t bytes = d.M*d.N_train*sizeof(float);
cudaMallocManaged((void**)&X_train, bytes, cudaMemAttachGlobal);
cudaMemcpyAsync(X_train, d.X_train, bytes, cudaMemcpyDefault, cudaStreamDefault);
Just after /* clean up */, insert the following code:
cudaFree(X_train);
Rebuild.

The original data is in d.X_train and the copy in X_train. This code allocates memory for X_train and copies d.X_train into it. The second code deallocates the memory.

Just before the data shuffle, swap the X_train and d.X_train pointers:
/* shuffle data for next time */
float* tmp = d.X_train;
d.X_train = X_train;
X_train = tmp;
data_shuffle(&d);
Rebuild and rerun.

The results should now be correct, but may still run slower than before. Let’s investigate with Nsight Systems again.

In Nsight Systems, go to the tab for your project, reconnect if necessary, and click the Start button to collect a new profile. Can you diagnose what has happened here?

We are still seeing many page faults! The issue here is more subtle than before. The CPU and GPU are no longer accessing the same memory simultaneously, but they do share it alternately. On every odd numbered epoch the GPU swaps A back from main to device memory on demand, while the CPU swaps B back from device to main memory on demand (conversely for even numbered epochs). The inefficiency here is that these swaps occur page-by-page, on demand, each time forcing the CPU or GPU to pause.

Third attempt: prefetching (10 minutes)

One way to improve this situation is to advise the CUDA runtime to pre-fetch memory onto CPU or GPU ahead of its use with cudaMemPrefetchAsync(). This allows bulk swaps instead of on-demand.

After data_shuffle(&d), enter the following code:

cudaMemPrefetchAsync(X_train, bytes, cudaCpuDeviceId, cudaStreamDefault);
cudaMemPrefetchAsync(d.X_train, bytes, device, cudaStreamDefault);

Rebuild and rerun.

The program should run considerably faster than the previous version, and a little faster than the original version, depending on hardware and problem size.

Run in Nsight Systems one last time.

You should see fewer page faults than before.

How close did we get to the headroom?

Summary

In this practical we have learnt how to use Nsight Systems to profile a CUDA program, used the timeline view to find a performance opportunity, and updated the code to address it. We should now have a better understanding of CPU and GPU interaction via streaming computation and virtual memory.

Possible extensions

Further features of Nsight Systems: In addition to the timeline view, which provides insight on concurrency and occupancy, Nsight Systems provides information on individual kernel performance. Expanding CUDA HW > Context 1 > Kernels reveals the most expensive kernels on which to focus optimization efforts. Right-clicking a kernel and selecting Show in Events View provides further detail, including execution configuration, timing and memory use information.
Three buffers: There is an even faster way of doing this, where both d.X_train and X_train are allocated as pinned main memory (see cudaMallocHost(), cudaFreeHost()), and a third, call it X_train_device, is allocated as device memory only (see cudaMalloc(), cudaFree()). Prefetch is removed, d.X_train and X_train swap as before, but at the start of each training epoch the active of them copies to X_train_device. The GPU works on X_train_device only. This is faster because it uses a one-way copy from main to device memory, rather than two-way swaps between.
Shuffle on GPU: Ideally, the shuffle would be performed on the GPU as well, which would avoid copying entirely. A parallel permutation function for GPU is nontrivial to implement. As a starting point, pseudorandom numbers can be generated on the GPU with cuRAND.