CUDA (Experimental)¶
With v0.4.3, the HighFM library has experimental support for CUDA. Although the interface remains mostly the same, it is important to highlight some key differences between executing code on the CPU (the host) and on the GPU (the device):
- The host and device have separate memory spaces, i.e., the host cannot directly access data allocated in device memory, and vice versa. The user is therefore responsible for explicitly transferring data from host memory to device memory before any computation on the device, and for retrieving the results from GPU memory back to the host.
- Most operations are executed asynchronously on the device.
- HighFM objects (e.g., matrices and vectors) cannot be passed as arguments to CUDA kernels, nor are their methods available on the device. However, you can still pass a container to a kernel using the compact struct returned by mat_data(). Remember that the dense matrix follows the Column-Major layout.
Moreover, only the following operations are currently supported on the device:
- All BLAS operations (addition, product, norm, etc.)
- Solving a dense linear system using LU or QR
- Solving a triangular linear system
- QR factorization
- SVD
Asynchronous Execution¶
Almost all operations on the GPU are executed asynchronously, i.e., the host launches the kernel on the GPU and then continues the execution of the program without waiting for its completion. All kernel launches are stored in a queue (called a CUDA stream) and executed in order. To wait for all operations in the queue to finish, you can use either sync_stream() (synchronizes only the queue) or sync_device() (synchronizes all queues on the device).
Only the following operations are synchronous (i.e., they imply a barrier at the end of the routine):
- Transferring data from the host to the device or vice versa;
- Copying data between two locations in device memory;
- Calculating the norm of a device matrix or vector;
- Calculating the dot product between two device vectors.
Usage¶
The following example illustrates how to offload the computation to the GPU.
#include <highfm/core.hpp>
#include <highfm/random.hpp>

int main(int argc, char* argv[])
{
    HighFM::initialize();

    int n = 2000;
    double alpha = 0.5;
    double beta = 0.9;

    HighFM::Matrix<double, HighFM::Host> host_A(n, n);
    HighFM::Matrix<double, HighFM::Host> host_B(n, n);
    HighFM::Matrix<double, HighFM::Host> host_C(n, n);

    HighFM::random::DefaultGenerator rng;
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < n; j++)
        {
            host_A(i, j) = rng.normal<double>();
            host_B(i, j) = rng.normal<double>();
            host_C(i, j) = rng.normal<double>();
        }
    }

    HighFM::Matrix<double, HighFM::Device> device_A = host_A;
    HighFM::Matrix<double, HighFM::Device> device_B = host_B;
    HighFM::Matrix<double, HighFM::Device> device_C = host_C;

    device_C = alpha * device_A * device_B + beta * device_C;
    host_C = device_C;

    fmt::print("C = {}\n", host_C);

    HighFM::terminate();
    return EXIT_SUCCESS;
}
Let us examine the code by parts. The program begins by initializing the library and then creating three \(n \times n\) matrices – A, B, and C – on the host.
HighFM::initialize();

int n = 2000;
double alpha = 0.5;
double beta = 0.9;

HighFM::Matrix<double, HighFM::Host> host_A(n, n);
HighFM::Matrix<double, HighFM::Host> host_B(n, n);
HighFM::Matrix<double, HighFM::Host> host_C(n, n);
Then, the program initializes the matrices with random numbers that follow the normal distribution.
HighFM::random::DefaultGenerator rng;
for (int i = 0; i < n; i++)
{
    for (int j = 0; j < n; j++)
    {
        host_A(i, j) = rng.normal<double>();
        host_B(i, j) = rng.normal<double>();
        host_C(i, j) = rng.normal<double>();
    }
}
Next, it creates the matrices on the GPU and then copies the data from the host to the device.
HighFM::Matrix<double, HighFM::Device> device_A = host_A;
HighFM::Matrix<double, HighFM::Device> device_B = host_B;
HighFM::Matrix<double, HighFM::Device> device_C = host_C;
And finally, the program calculates \(C = \alpha AB + \beta C\) (gemm) and then copies the result from the device to the host. Although the gemm operation is executed asynchronously, the data transfer between the host and the device forces a device synchronization, so the call to sync_stream() can be omitted in this case.
device_C = alpha * device_A * device_B + beta * device_C;
host_C = device_C;
The program ends with
fmt::print("C = {}\n", host_C);
HighFM::terminate();
return EXIT_SUCCESS;
to print the result to stdout and then terminate the library before exiting.