In this practical we continue with the example of a deep neural network implemented in C. We learn to profile with Nsight Systems, measure headroom on concurrency, and refactor the code to improve performance.
This practical is part of the course Foundations of GPU Computing.
Not familiar with C?
The practical does not require you to write any C code from scratch, only make specific modifications. Tips are provided along the way.
Make sure that your system satisfies the prerequisites given on the main course page. We will use both Visual Studio Code and Nsight Systems in this practical.
Start Visual Studio Code and, if you are using a remote system, connect to it. Start Nsight Systems and select File > New Project. Click Select target for profiling… and, if you are using a remote system, Configure targets… to connect to it; otherwise select your local system. If you are using a remote system, Nsight Systems will take a minute or two to configure it.
In the previous practical we only looked at overall execution time. To obtain more detailed performance information we can profile the code using Nsight Systems.
Start Nsight Systems, create a new project, and enter the following settings:
- Command line with arguments:
- Working directory:
where `user` is your username.
Check the boxes:
- Collect CPU IP/backtrace samples
- Collect CUDA trace, and under it:
- Collect UM CPU page faults
- Collect UM GPU page faults
- Collect cuBLAS trace
You can uncheck all the other boxes. They won't hurt, but they will produce busy reports with a lot of extraneous information for the purposes of this exercise.
Click Start to run the program and collect profiling information.
After a few moments the timeline view should appear.
On the left, expand CUDA HW > Context 1 > Kernels and Threads > main > CUDA API to inspect the activity of the GPU and CPU, respectively.
You can navigate the timeline view using the following basic controls:
Zoom in until you can make out a repeating pattern like the one below.
We can understand this as the training and testing loop, implemented in `src/main.c`:
There is an opportunity to improve performance if we can swap steps 3 and 4, so that the CPU shuffles data for the next epoch while the GPU is still working through the training and testing loop for the current epoch (step 2). This should allow us to compress the timeline to the following:
For smaller models, the time taken for step 3 may be longer than the time taken for step 4. This is because the time taken for the shuffle is a function of the data size alone, while the time taken to perform training and testing is a function of both the data size and the model size. For such cases, the timeline that we can achieve is more like this:
We can assess the headroom for such an improvement by disabling the shuffle entirely.
In Visual Studio Code, comment out the line `data_shuffle(&d);` in `src/main.c`, then rebuild and rerun. How much does performance improve? Once you have established the headroom, uncomment the line again.
Not familiar with C?
Comment out a line of code by adding `//` in front of it, or by wrapping the whole line in `/*` and `*/`.
In the current implementation, we cannot allow the execution of steps 2 and 4 to overlap, as the GPU would be reading the training data while the CPU is modifying it, and the results would be incorrect. Nonetheless, let’s start that way, as it leads to some interesting observations.
In `src/main.c`, move the following two lines:

```c
/* shuffle data for next time */
data_shuffle(&d);
```
to above the two lines:
```c
/* finalize loss and report progress */
cudaDeviceSynchronize();
```
Rebuild and rerun.
In addition to being incorrect, as expected, the program also runs considerably slower, which may be surprising. Let's investigate with Nsight Systems again.
In Nsight Systems, return to the project tab, and click the Start button to collect a new profile. Switch between the tabs for the first and second reports. Can you diagnose what has happened here?
Hint: Expand the UVM page fault group on the left. UVM is Unified Virtual Memory.
We are seeing many more page faults than before. Recall that we are using managed memory, which means that pages of virtual memory swap between main and device memory on demand. A page fault occurs when either the CPU or GPU attempts to access a page not in its memory, triggering the swap. Too much swapping harms performance. On this occasion the swaps occur because the CPU and GPU are accessing the training data (`d.X_train`) at the same time.
To fix both the correctness and the performance issue we can proceed as follows: keep two copies of the training data, call them A and B. On one epoch the GPU trains on A while the CPU shuffles B; on the next, the GPU trains on B while the CPU shuffles A. In this way, the CPU and GPU are always operating on different memory, which minimizes swaps.
Create a copy of the training data. Just before `/* start timer */`, insert the following code:

```c
float* X_train = NULL;
size_t bytes = d.M*d.N_train*sizeof(float);
cudaMallocManaged((void**)&X_train, bytes, cudaMemAttachGlobal);
cudaMemcpyAsync(X_train, d.X_train, bytes, cudaMemcpyDefault, cudaStreamDefault);
```
Just after `/* clean up */`, insert the following code:

```c
cudaFree(X_train);
```
The original data is in `d.X_train` and the copy in `X_train`. The first code snippet allocates memory for `X_train` and copies `d.X_train` into it; the second deallocates that memory.
Just before the data shuffle, swap the two pointers:

```c
/* shuffle data for next time */
float* tmp = d.X_train;
d.X_train = X_train;
X_train = tmp;
data_shuffle(&d);
```
Rebuild and rerun.
The results should now be correct, but may still run slower than before. Let’s investigate with Nsight Systems again.
In Nsight Systems, go to the tab for your project, reconnect if necessary, and click the Start button to collect a new profile. Can you diagnose what has happened here?
We are still seeing many page faults! The issue here is more subtle than before. The CPU and GPU are no longer accessing the same memory simultaneously, but they do share it alternately. On every odd-numbered epoch the GPU swaps A back from main to device memory on demand, while the CPU swaps B back from device to main memory on demand (and conversely for even-numbered epochs). The inefficiency is that these swaps occur page by page, on demand, each time forcing the CPU or GPU to pause.
One way to improve the situation is to advise the CUDA runtime to prefetch memory to the CPU or GPU ahead of its use with `cudaMemPrefetchAsync()`. This allows bulk transfers instead of on-demand, page-by-page swaps.
Just before the pointer swap added in the previous step, enter the following code:

```c
cudaMemPrefetchAsync(X_train, bytes, cudaCpuDeviceId, cudaStreamDefault);
cudaMemPrefetchAsync(d.X_train, bytes, device, cudaStreamDefault);
```
Rebuild and rerun.
The program should run considerably faster than the previous version, and perhaps a little faster than the original, depending on hardware and problem size.
Run in Nsight Systems one last time.
You should see fewer page faults than before.
How close did we get to the headroom?
In this practical we have learnt how to use Nsight Systems to profile a CUDA program, used the timeline view to find a performance opportunity, and updated the code to address it. We should now have a better understanding of CPU and GPU interaction via streaming computation and virtual memory.
As a further challenge, consider the following approach: `d.X_train` and `X_train` are allocated as pinned main memory (see `cudaMallocHost()` and `cudaFreeHost()`), and a third buffer, call it `X_train_device`, is allocated as device memory only (see `cudaMalloc()` and `cudaFree()`). The prefetch is removed, and `d.X_train` and `X_train` swap as before, but at the start of each training epoch the active of the two is copied to `X_train_device`. The GPU works on `X_train_device` only. This is faster because it uses a one-way copy from main to device memory, rather than two-way swaps between the two.
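A pseudocode sketch of the epoch structure under this scheme (buffer names as above; streams, error checks, and other details omitted, and the actual loop in `src/main.c` will differ):

```
/* setup: two pinned host buffers and one device-only buffer */
cudaMallocHost(&d.X_train, bytes)    /* pinned host copy A */
cudaMallocHost(&X_train, bytes)      /* pinned host copy B */
cudaMalloc(&X_train_device, bytes)   /* device-only buffer */

for each epoch:
    /* one-way copy: active host buffer to device buffer */
    cudaMemcpyAsync(X_train_device, d.X_train, bytes, cudaMemcpyHostToDevice, stream)
    launch training and testing kernels reading X_train_device

    /* meanwhile, on the CPU: swap host buffers and shuffle */
    swap(d.X_train, X_train)
    data_shuffle(&d)

    cudaDeviceSynchronize()
```

Pinned host memory keeps the copy to `X_train_device` fast and asynchronous, and because the GPU never touches the host buffers, no demand-paged swaps occur at all.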