Foundations of GPU Computing

A short course with a machine learning flavor.

Lawrence Murray, 13 February 2023 (updated 10 March 2023)

Graphics Processing Units (GPUs) are now a standard feature of the numerical computing landscape, especially so in machine learning. This course teaches foundational concepts of GPU computing: hardware, memory, kernels and streams. It includes practical sessions working with a deep neural network that has been implemented in C with CUDA. The motivation for using C (and not, for example, Python) is to reinforce the foundational concepts, as it forces us to be explicit about each step: each memory allocation, each kernel call, each stream synchronization.

Not familiar with C?

The practicals focus on reading, building, running and profiling code, not writing code. You will not have to write any C code from scratch.
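To give a flavor of that explicitness, here is a minimal sketch of a complete CUDA C program; the `scale` kernel and all names are illustrative, not taken from the course code:

```c
#include <cuda_runtime.h>

/* a trivial kernel: scale each element of x by a */
__global__ void scale(int n, float a, float* x) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) {
    x[i] *= a;
  }
}

int main() {
  int n = 1024;
  float* x = NULL;

  /* each memory allocation is explicit */
  cudaMalloc((void**)&x, n*sizeof(float));

  /* each kernel call is explicit, including its launch configuration
     of grid size and block size */
  scale<<<(n + 255)/256, 256>>>(n, 2.0f, x);

  /* each synchronization is explicit: wait for the kernel to finish */
  cudaDeviceSynchronize();

  cudaFree(x);
  return 0;
}
```

Nothing happens implicitly here: the allocation, the launch configuration, and the synchronization are all visible in the source, which is precisely the point of working in C for this course.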

The course consists of four modules of about one hour each: an introductory lecture, two practical sessions, and a closing lecture.

By way of philosophy, we focus less on individual kernel performance and more on holistic program performance, less on host-device memory copies and more on unified virtual memory, less on coalescence and more on cache efficiency. We believe that this reflects a more modern approach.
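To illustrate the unified virtual memory emphasis: a single managed allocation is accessible from both host and device, with pages migrating on demand rather than via explicit host-device copies. This is a sketch under those assumptions, with error checking omitted and the `add_one` kernel purely illustrative:

```c
#include <cuda_runtime.h>

__global__ void add_one(int n, float* x) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) {
    x[i] += 1.0f;
  }
}

int main() {
  int n = 1024;
  float* x = NULL;

  /* one allocation, visible to both host and device; no cudaMemcpy() */
  cudaMallocManaged((void**)&x, n*sizeof(float), cudaMemAttachGlobal);

  for (int i = 0; i < n; ++i) {
    x[i] = 0.0f;              /* initialize on host */
  }
  add_one<<<(n + 255)/256, 256>>>(n, x);  /* update on device */
  cudaDeviceSynchronize();

  float sum = 0.0f;
  for (int i = 0; i < n; ++i) {
    sum += x[i];              /* read back on host, again no copy */
  }

  cudaFree(x);
  return 0;
}
```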


For the practical sessions you will need access to a machine with an Nvidia GPU and CUDA installed. This machine could be your local laptop or desktop machine if it has a discrete graphics card, or a remote machine to which you have SSH access. If you do not already have access to such a remote machine you can easily set up an instance on a cloud service provider by following this guide.

The GPU should be of the Pascal generation (post-2016) or later, as we make use of more recent innovations in unified virtual memory that are not supported by earlier generations of hardware (Maxwell and before).
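If you are unsure of your GPU's generation, a short device query will tell you; Pascal corresponds to compute capability 6.x. A sketch, with error checking omitted:

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);  /* properties of device 0 */

  /* Pascal is compute capability 6.x; earlier generations (e.g.
     Maxwell, 5.x) lack the demand paging used in these practicals */
  printf("%s: compute capability %d.%d\n", prop.name, prop.major,
      prop.minor);
  if (prop.major < 6) {
    printf("warning: pre-Pascal GPU detected\n");
  }
  return 0;
}
```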

Regardless of where you will run the code—local or remote—you will need the following software installed on your local laptop or desktop machine. Follow the links for instructions on how to install the software and, if relevant, make an SSH connection to a remote machine with a GPU:

While the course assumes use of Visual Studio Code throughout, if you are familiar with SSH you could instead use a terminal and text editor of your choice. Nsight Systems is used for profiling code and, for the purposes of these practicals at least, has no obvious substitute.

Course material

  1. Introductory Lecture
    No code here, just concepts to start. GPU hardware, memory hierarchies, kernel configurations, streaming computation, floating point precision and performance.
  2. Practical Exercises 1
    Hands-on with the C example code. Building and running CUDA programs, single versus double precision performance, vector and matrix kernels, memory management.
  3. Practical Exercises 2
    Profiling code, identifying issues, assessing headroom, improving concurrency.
  4. Closing Lecture
    More streaming computation, and advanced kernel programming.
  5. Example code
    For use during the practicals. Implements a deep neural network with forward and backward passes, plus Adam optimizer for training.

Further resources

  GPU Programming in the Cloud
    A how-to and round-up of cloud service providers. Lawrence Murray, 22 Nov 22.

  Sums of Discrete Random Variables as Banded Matrix Products
    Zero-stride catch and a custom CUDA kernel. Lawrence Murray, 16 Mar 23.
