General-purpose computing on the GPU¶
Besides rendering, Orka can also be used for general-purpose computing on the GPU (GPGPU) by using compute shaders. Compute shaders do not need a framebuffer; they can read from and write to any buffers and/or images (textures) that have been bound.
A shader program that uses a vertex and fragment shader requires a framebuffer. The default framebuffer is not available unless a window has been created. If the default framebuffer is not to be used, a new framebuffer object must be created and bound instead. This is useful, for example, to attach a texture to the framebuffer object and later save the texture in a KTX file.
Surfaceless context¶
To run compute shaders, a context needs to be created first. If no default framebuffer is needed because there is nothing that needs to be displayed in a window, a surfaceless context can be created instead. This avoids needing a connection to a windowing system such as a Wayland compositor.
To create a surfaceless context, create a context via EGL using the 'device' platform by calling the function Create_Context in package Orka.Contexts.EGL.
The function has a parameter Device that can be used to specify the device that should be used in case your system has multiple GPUs.
See EGL for more information.
Buffers¶
Compute shaders often read from or write to buffers. To make a buffer available, bind it as an SSBO to the index of a binding point. See SSBO on how to bind a buffer.
Work groups¶
When running a compute shader on the GPU, work is divided over multiple threads. The threads are grouped in what is called a workgroup. Both the threads inside a workgroup and the workgroups themselves can be placed on one to three axes. The number of axes is arbitrary and depends on what is considered useful. For example, a compute shader that applies element-wise transformations to a tensor may use a single axis for both the threads and the workgroups, while a shader which applies a matrix multiplication on two tensors may use two axes for both the threads and the workgroups.
A shader can define a fixed number of threads per axis for a workgroup as follows:
layout(local_size_x = 256) in;
The local size should be a power of two, such as 64 or 256, so that it is a multiple of the subgroup size. Threads inside a workgroup are grouped in smaller groups called subgroups, and the threads of a subgroup are executed in lock step. If the local size is not a multiple of the subgroup size, some threads will be inactive, decreasing the occupancy of the shader. Most discrete GPUs have a subgroup size of 32 or 64 threads.
The constant gl_WorkGroupSize contains a vector of the number of threads on each axis.
The variable gl_WorkGroupID contains a vector with the indices of a workgroup (one for each axis).
The identifier of a thread inside a workgroup is given by gl_LocalInvocationID and the identifier of a thread among all workgroups is given by gl_GlobalInvocationID (both are uvec3).
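As an illustration of how these built-in variables are typically used, the following is a minimal sketch of a compute shader that applies an element-wise transformation on a single axis. The block name DataBuffer, the binding index 0, and the doubling operation are assumptions made for this example only; they do not come from Orka.

```glsl
#version 430 core

// 256 threads per workgroup on the X axis (Y and Z default to 1)
layout(local_size_x = 256) in;

// Hypothetical buffer bound as an SSBO at binding point 0
layout(std430, binding = 0) buffer DataBuffer {
    float data[];
};

void main() {
    // Equals gl_WorkGroupID.x * gl_WorkGroupSize.x + gl_LocalInvocationID.x
    uint index = gl_GlobalInvocationID.x;

    // The last workgroup may contain threads past the end of the buffer
    if (index < uint(data.length())) {
        data[index] = 2.0 * data[index];
    }
}
```

The bounds check matters whenever the number of elements is not a multiple of the local size, because the number of dispatched workgroups is then rounded up.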
If a video driver supports the extension ARB_compute_variable_group_size, then variable sized compute shaders may be enabled by declaring in the shader:
#extension GL_ARB_compute_variable_group_size : require
If a compute shader has a variable size, it must be dispatched in a different way than fixed size compute shaders (explained below) and it must declare the layout differently:
layout(local_size_variable) in;
For a variable sized compute shader, the variable gl_LocalGroupSizeARB must be used instead of the constant gl_WorkGroupSize.
Shared data¶
The threads inside a single workgroup can communicate with each other using shared data.
Threads from different workgroups cannot communicate with each other because workgroups are scheduled independently. In the worst case a GPU may process the workgroups one at a time. The only way for threads from different workgroups to communicate is to write data to a buffer, insert a memory barrier after the program has completed, and then launch the program again so that the threads can read the data from the previous launch. If multiple threads wish to write to the same offset in a buffer or a shared variable, atomic functions like atomicAdd should be used.
To create shared data in a shader, use the keyword shared:
shared uint data[gl_WorkGroupSize.x];
When the threads have written their data to a shared variable, they all must execute a memory and computation barrier:
memoryBarrierShared();
barrier();
After all threads have passed the barriers, they can read the shared data.
This process can be repeated using a for loop.
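The write, barrier, read cycle repeated in a loop can be sketched as a sum reduction within a workgroup. This is only an illustrative sketch: the block names InputBuffer and OutputBuffer, the binding indices, and the variable names are assumptions, and the local size is assumed to be a power of two.

```glsl
#version 430 core

layout(local_size_x = 256) in;

// Hypothetical buffers: one value per thread in, one sum per workgroup out
layout(std430, binding = 0) readonly buffer InputBuffer {
    uint values[];
};
layout(std430, binding = 1) writeonly buffer OutputBuffer {
    uint sums[];
};

shared uint scratch[gl_WorkGroupSize.x];

void main() {
    uint local_index  = gl_LocalInvocationID.x;
    uint global_index = gl_GlobalInvocationID.x;

    // Every thread writes one element; out-of-range threads contribute zero
    scratch[local_index] =
        global_index < uint(values.length()) ? values[global_index] : 0u;
    memoryBarrierShared();
    barrier();

    // Halve the number of active threads in each iteration. The loop bounds
    // are the same for all threads, so every thread reaches the barriers.
    for (uint stride = gl_WorkGroupSize.x / 2u; stride > 0u; stride >>= 1u) {
        if (local_index < stride) {
            scratch[local_index] += scratch[local_index + stride];
        }
        memoryBarrierShared();
        barrier();
    }

    // The first thread stores the sum of this workgroup
    if (local_index == 0u) {
        sums[gl_WorkGroupID.x] = scratch[0];
    }
}
```

Note that the barriers sit outside the if block inside the loop, so that threads which skip the addition still execute them.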
The function groupMemoryBarrier can be used for memory transactions involving all types of memory, including buffers, images, and shared variables.
All threads of a workgroup must execute the barriers
It is important to note that all threads of a workgroup must execute a barrier, otherwise the compute shader will hang. Thus the barriers must not be placed inside an if block if the control flow is not uniform.
Voting¶
Threads inside a single subgroup operate in lock step and can also exchange information by voting. This requires the extension ARB_shader_group_vote.
The function anyInvocationARB returns true if and only if at least one thread in the subgroup used true as the input to the function.
The function allInvocationsARB returns true if all threads used the value true.
The function allInvocationsEqualARB returns true if all threads agreed on the boolean value of the first parameter.
However, to use these functions effectively it is often important to know the size of the subgroup. The size of a subgroup can be retrieved from the built-in variable gl_SubGroupSizeARB (a uint) from the extension ARB_shader_ballot.
If this extension is not present, the previously mentioned functions may be useful only to choose between executing a fast algorithm (which works only if certain conditions are met) and a slower one (which works in all cases).
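Choosing between a fast and a slow path with a vote can be sketched as follows. The block name DataBuffer, the binding index, and the two code paths are purely hypothetical and only serve to show where the vote goes; threads past the end of the buffer vote with a neutral value so that they do not force the slow path.

```glsl
#version 430 core
#extension GL_ARB_shader_group_vote : require

layout(local_size_x = 64) in;

// Hypothetical buffer bound as an SSBO at binding point 0
layout(std430, binding = 0) buffer DataBuffer {
    float data[];
};

void main() {
    uint index    = gl_GlobalInvocationID.x;
    bool in_range = index < uint(data.length());

    // Out-of-range threads use 1.0 so they always vote true
    float value = in_range ? data[index] : 1.0;

    if (allInvocationsARB(value > 0.0)) {
        // Fast path: every thread of the subgroup has a positive value,
        // so no per-element check is needed
        if (in_range) {
            data[index] = inversesqrt(value);
        }
    } else {
        // Slow path: works for all inputs
        if (in_range) {
            data[index] = value > 0.0 ? inversesqrt(value) : 0.0;
        }
    }
}
```

Because all threads of a subgroup receive the same result from the vote, the whole subgroup takes the same branch and no divergence occurs at the outer if statement.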
A second way to vote is the function ballotARB from the extension ARB_shader_ballot.
It returns a bitfield of the type uint64_t with a bit set for each thread that provided the value true as the input to the function.
This function is an effective way to exchange a small amount of information
among threads of a subgroup.
Note
See Mesamatrix for a list of extensions supported by each video driver.
Broadcasting¶
The extension ARB_shader_ballot provides two more functions that can be used to broadcast a value to all threads inside a subgroup.
The function readFirstInvocationARB broadcasts and returns the given value of the first active thread to all other active threads in the subgroup.
The function readInvocationARB has a second parameter containing the index of the thread in the subgroup whose value must be broadcast to all other threads.
Limits¶
The hardware places several limits on the number of threads and workgroups.
The maximum number of threads in a workgroup is usually at least 1024.
The exact number can be queried with the function Compute_Work_Group_Size of a Program object. It returns a Dimension_Size_Array, which is an array containing three values for the axes X, Y, and Z.
A workgroup containing a large number of threads decreases the amount of shared data available per thread.
Launching compute shaders¶
To launch a compute shader, first bind the necessary buffers and images, then enable a program containing a compute shader with the procedure Use_Program (see Using a program), and then dispatch it.
To dispatch a fixed size compute shader, execute the procedure Dispatch_Compute from the package GL.Compute:
declare
   --  Number of workgroups needed to cover all elements (rounds up)
   function Groups (Elements, Group_Size : Unsigned_32) return Unsigned_32 is
     (Elements / Group_Size
        + (if Elements mod Group_Size = 0 then 0 else 1));

   use all type Orka.Index_3D;

   Group_Size : Dimension_Size_Array := Program_1.Compute_Work_Group_Size;
   Size_X     : Unsigned_32 := Unsigned_32 (Group_Size (X));
begin
   GL.Compute.Dispatch_Compute
     (X => Groups (Elements => Elements, Group_Size => Size_X));
end;
To dispatch a variable sized compute shader, run procedure Dispatch_Compute_Group_Size:
GL.Compute.Dispatch_Compute_Group_Size
  (Group_Size => (Integer_32 (Size_X), 1, 1),
   X          => Groups (Elements => Elements, Group_Size => Size_X));
Insert a memory barrier if needed
If a buffer was previously modified in any way, make sure to insert a Shader_Storage memory barrier before launching a compute shader.