Compute Particles

This example shows how to transform data in parallel by uploading a buffer of data to the GPU and running a shader on all the items in the buffer. We do this every frame and draw the results to the screen using instanced rendering.

If you look at the code for this example, you may notice that the updateScene function is empty. That's because all the per-frame logic is occurring on the GPU, per-particle, with the following shader:

struct ParticleData
{
    vec4 position;
    vec4 velocity;
    vec4 color;
};

// Particles from previous frame
layout (std430, set = 0, binding = 0) coherent buffer Particles
{
    ParticleData particles[];
} data;

const float particleStep = 0.01;
const float finalCollisionFactor = 0.01;

void main(void)
{
    uint globalId = gl_GlobalInvocationID.x;

    // Retrieve current particle from previous frame
    ParticleData currentParticle = data.particles[globalId];

    // New position = old position + distance traveled over step duration
    currentParticle.position = currentParticle.position + currentParticle.velocity * particleStep;

    // Make acceleration more or less point toward the center of the scene
    vec4 acceleration = normalize(vec4(0.0) - currentParticle.position) * finalCollisionFactor;

    // New velocity = old velocity + acceleration over step duration
    currentParticle.velocity = currentParticle.velocity + acceleration * particleStep;

    // Save updated particle
    data.particles[globalId] = currentParticle;
}

Filename: compute_particles/doc/shadersnippet.comp

Initialization

Our "particles" are a set of triangles that all share the same shape. The data unique to each triangle is its position, velocity, and color. We pack that data into a struct.

struct ParticleData {
    glm::vec4 position;
    glm::vec4 velocity;
    glm::vec4 color;
};

Filename: compute_particles/compute_particles.cpp

You may recognize this from a moment ago. An identical struct was created in the shader so that the GPU can receive this data properly.

Next, we create a buffer of ParticleData. Notice that it has been declared a storage buffer with the StorageBufferBit. A storage buffer can be much larger than a uniform buffer and is writable by the GPU.

            const BufferOptions particlesBufferOptions = {
                .size = ParticlesCount * sizeof(ParticleData),
                .usage = BufferUsageFlagBits::VertexBufferBit | BufferUsageFlagBits::StorageBufferBit,
                .memoryUsage = MemoryUsage::CpuToGpu // So we can map it to CPU address space
            };
            const std::vector<ParticleData> particles = initializeParticles(ParticlesCount);
            m_particleDataBuffer = m_device.createBuffer(particlesBufferOptions, particles.data());

Filename: compute_particles/compute_particles.cpp

We also create a small buffer with the common information between each triangle: the vertices.

            const BufferOptions triangleBufferOptions = {
                .size = 3 * sizeof(Vertex),
                .usage = BufferUsageFlagBits::VertexBufferBit,
                .memoryUsage = MemoryUsage::CpuToGpu
            };

            const float r = 0.08f;
            std::array<Vertex, 3> vertexData;
            vertexData[0] = { { r * std::cos(7.0f * M_PI / 6.0f), -r * std::sin(7.0f * M_PI / 6.0f), 0.0f } }; // Bottom-left
            vertexData[1] = { { r * std::cos(11.0f * M_PI / 6.0f), -r * std::sin(11.0f * M_PI / 6.0f), 0.0f } }; // Bottom-right

Filename: compute_particles/compute_particles.cpp

Next, we begin creating the compute pipeline. The first step is to load in the .comp compute shader, shown in the introduction:

        const auto computeShaderPath = KDGpu::assetPath() + "/shaders/examples/compute_particles/particles.comp.spv";
        auto computeShader = m_device.createShaderModule(KDGpuExample::readShaderFile(computeShaderPath));

Filename: compute_particles/compute_particles.cpp

Then we create bind group layout options and a bind group layout, followed by pipeline layout options and finally the pipeline layout that will hold the storage buffer we created earlier.

        const BindGroupLayoutOptions bindGroupLayoutOptions = {
            .bindings = {{
                .binding = 0,
                .resourceType = ResourceBindingType::StorageBuffer,
                .shaderStages = ShaderStageFlags(ShaderStageFlagBits::ComputeBit)
            }}
        };
        // clang-format on
        const BindGroupLayout bindGroupLayout = m_device.createBindGroupLayout(bindGroupLayoutOptions);

        // Create a pipeline layout (array of bind group layouts)
        const PipelineLayoutOptions pipelineLayoutOptions = {
            .bindGroupLayouts = { bindGroupLayout }
        };
        m_computePipelineLayout = m_device.createPipelineLayout(pipelineLayoutOptions);

Filename: compute_particles/compute_particles.cpp

Now we can instantiate the pipeline elements. First we create the bind group, using a StorageBufferBinding resource in the creation options. Then we create the compute pipeline with KDGpu::ComputePipelineOptions, which is far simpler than the usual graphics pipeline options.

        const BindGroupOptions bindGroupOptions {
            .layout = bindGroupLayout,
            .resources = {{
                .binding = 0,
                .resource = StorageBufferBinding{ .buffer = m_particleDataBuffer }
            }}
        };
        // clang-format on
        m_particleBindGroup = m_device.createBindGroup(bindGroupOptions);

        const ComputePipelineOptions pipelineOptions{
            .layout = m_computePipelineLayout,
            .shaderStage = { .shaderModule = computeShader }
        };

        m_computePipeline = m_device.createComputePipeline(pipelineOptions);

Filename: compute_particles/compute_particles.cpp

The next step is the initialization of the graphics pipeline. This pipeline will be used in a second pass to draw a triangle at a position and color according to the data in the storage buffer. It is initialized in largely the same way as graphics pipelines in previous examples. Let's take a look at the pipeline options' vertex field, which is new.

            .vertex = {
                .buffers = {
                    { .binding = 0, .stride = sizeof(Vertex) },
                    { .binding = 1, .stride = sizeof(ParticleData), .inputRate = VertexRate::Instance }
                },
                .attributes = {
                    { .location = 0, .binding = 0, .format = Format::R32G32B32_SFLOAT }, // Vertex Position
                    { .location = 1, .binding = 1, .format = Format::R32G32B32A32_SFLOAT }, // Particle Position
                    { .location = 2, .binding = 1, .format = Format::R32G32B32A32_SFLOAT, .offset = 2 * sizeof(glm::vec4) } // Particle Color
                }
            },

Filename: compute_particles/compute_particles.cpp

There are two bindings: the common data (the vertices shared by every triangle) and the per-particle data. The first binding has a single vec3 attribute, the vertex position, while the per-particle binding has position and color attributes. Notice that the offset of the particle color is two vec4s! We skip over the velocity field of ParticleData because only the compute shader needs that information.

Additionally, the input rate of the per-particle binding is marked Instance. This tells the GPU to advance to the next item in that buffer only after it has drawn all the vertices of an instance. The default rate, Vertex, would tell the GPU to fetch a different ParticleData for every vertex.

Per-Frame Logic

Each frame, the CPU does nothing but queue new jobs on the GPU. We can achieve this either with a single command buffer and a memory barrier inserted between the compute and rendering commands, or with two command buffers synchronized by semaphores.

Here is what the first approach looks like:

    auto commandRecorder = m_device.createCommandRecorder();
    {
        // Compute
        auto computePass = commandRecorder.beginComputePass();
        computePass.setPipeline(m_computePipeline);
        computePass.setBindGroup(0, m_particleBindGroup);
        constexpr size_t LocalWorkGroupXSize = 256;
        computePass.dispatchCompute(ComputeCommand{ .workGroupX = ParticlesCount / LocalWorkGroupXSize });
        computePass.end();

        // Barrier to ensure the compute pass's SSBO writes have completed
        // before the vertex input stage reads the per-instance vertex attributes

        // clang-format off
        commandRecorder.memoryBarrier(MemoryBarrierOptions {
                .srcStages = PipelineStageFlags(PipelineStageFlagBit::ComputeShaderBit),
                .dstStages = PipelineStageFlags(PipelineStageFlagBit::VertexInputBit),
                .memoryBarriers = {
                            {
                                .srcMask = AccessFlags(AccessFlagBit::ShaderWriteBit),
                                .dstMask = AccessFlags(AccessFlagBit::VertexAttributeReadBit)
                            }
                }
        });
        // clang-format on

        // Render
        m_opaquePassOptions.colorAttachments[0].view = m_swapchainViews.at(m_currentSwapchainImageIndex);
        auto opaquePass = commandRecorder.beginRenderPass(m_opaquePassOptions);
        opaquePass.setPipeline(m_graphicsPipeline);
        opaquePass.setVertexBuffer(0, m_triangleVertexBuffer);
        opaquePass.setVertexBuffer(1, m_particleDataBuffer); // Per instance Data
        opaquePass.draw(DrawCommand{ .vertexCount = 3, .instanceCount = ParticlesCount });
        renderImGuiOverlay(&opaquePass);
        opaquePass.end();
    }
    m_graphicsAndComputeCommands = commandRecorder.finish();

    // Submit Commands
    const SubmitOptions submitOptions = {
        .commandBuffers = { m_graphicsAndComputeCommands },
        .waitSemaphores = { m_presentCompleteSemaphores[m_inFlightIndex] },
        .signalSemaphores = { m_renderCompleteSemaphores[m_inFlightIndex] }
    };
    m_queue.submit(submitOptions);

Filename: compute_particles/compute_particles.cpp

The new command is dispatchCompute. Its argument is the number of workgroups to launch: ParticlesCount divided by the local workgroup size of 256, which is declared in the shader code:

layout (local_size_x = 256) in;

Filename: compute_particles/doc/shadersnippet.comp

A compute dispatch is split into workgroups of local_size_x invocations each; invocations within a workgroup can share local memory and synchronize with one another, though this shader needs neither. A local size of 256 is a common choice because it is a multiple of the wavefront sizes of current GPUs (typically 32 or 64), keeping the hardware fully occupied. With 256 invocations per group, dispatching ParticlesCount / 256 groups runs main() exactly once per particle.

The memory barrier inserted afterwards guarantees that the compute shader has finished before rendering begins. In the memory barrier options, we describe the pipeline stages before and after the barrier using KDGpu::PipelineStageFlagBit. We also create the memory barrier object itself, which holds two bitmasks describing how the preceding commands accessed the buffer and how the succeeding commands will access it. Here, the compute shader was writing (ShaderWriteBit) and the vertex input stage will read the buffer as vertex attributes (VertexAttributeReadBit).

The second available approach (not used by default for this example) is to use two separate command buffers.

    // Compute
    auto computeCommandRecorder = m_device.createCommandRecorder();
    {
        auto computePass = computeCommandRecorder.beginComputePass();
        computePass.setPipeline(m_computePipeline);
        computePass.setBindGroup(0, m_particleBindGroup);
        constexpr size_t LocalWorkGroupXSize = 256;
        computePass.dispatchCompute(ComputeCommand{ .workGroupX = ParticlesCount / LocalWorkGroupXSize });
        computePass.end();
    }
    m_computeCommands = computeCommandRecorder.finish();

    // Render
    auto graphicsCommandRecorder = m_device.createCommandRecorder();
    {
        m_opaquePassOptions.colorAttachments[0].view = m_swapchainViews.at(m_currentSwapchainImageIndex);
        auto opaquePass = graphicsCommandRecorder.beginRenderPass(m_opaquePassOptions);
        opaquePass.setPipeline(m_graphicsPipeline);
        opaquePass.setVertexBuffer(0, m_triangleVertexBuffer);
        opaquePass.setVertexBuffer(1, m_particleDataBuffer); // Per instance Data
        opaquePass.draw(DrawCommand{ .vertexCount = 3, .instanceCount = ParticlesCount });
        renderImGuiOverlay(&opaquePass);
        opaquePass.end();
    }
    m_graphicsCommands = graphicsCommandRecorder.finish();

    // Submit Commands

    // We first submit compute commands
    const SubmitOptions computeSubmitOptions = {
        .commandBuffers = { m_computeCommands },
        .waitSemaphores = { m_presentCompleteSemaphores[m_inFlightIndex] },
        .signalSemaphores = { m_computeSemaphoreComplete }
    };
    m_queue.submit(computeSubmitOptions);

    // Then we submit the graphics commands. We rely on a semaphore to ensure
    // graphics commands don't start prior to the compute commands being completed
    const SubmitOptions graphicsSubmitOptions = {
        .commandBuffers = { m_graphicsCommands },
        .waitSemaphores = { m_computeSemaphoreComplete },
        .signalSemaphores = { m_renderCompleteSemaphores[m_inFlightIndex] }
    };
    m_queue.submit(graphicsSubmitOptions);

Filename: compute_particles/compute_particles.cpp

Note the additional m_computeSemaphoreComplete, which was created at program start with m_device.createGpuSemaphore() and lives until the program ends.

This approach executes exactly the same commands but synchronizes them differently. For more information on semaphores, check out the Hello Triangle Native example.

TODO: maybe benchmark these and talk about performance differences?


Updated on 2023-12-22 at 00:05:36 +0000