Compute Particles

[Screenshot: compute_particles.png]

This example shows how to leverage compute shaders to simulate thousands of particles entirely on the GPU, then render them using instanced draws. All particle updates (position, velocity, color) happen in parallel on the GPU - the CPU does no per-frame particle work. This is fundamental for GPU-driven rendering, physics simulations, and particle systems.

The example uses the KDGpuExample helper API for simplified setup.

Overview

What this example demonstrates:

  • Compute shaders for data-parallel processing
  • Storage buffers (SSBO) for large read-write GPU data
  • Compute-to-graphics pipeline synchronization
  • Instanced rendering with per-instance data
  • GPU work groups and dispatch dimensions

Performance benefit:

  • 1024 particles updated in parallel on GPU
  • Zero CPU work per particle per frame
  • Efficient instance rendering (3 vertices × 1024 instances)
  • Typically orders of magnitude faster than equivalent per-particle updates on the CPU

Vulkan Requirements

  • Vulkan Version: 1.0+
  • Extensions: None (compute shaders are core)
  • Device Features: Compute shader support (universal on modern GPUs)
  • Limits: Max work group size, max compute shared memory

Key Concepts

Compute Shaders:

Compute shaders are general-purpose GPU programs that operate on arbitrary data, not just graphics. Unlike vertex/fragment shaders tied to the graphics pipeline, compute shaders process data in parallel work groups:

layout(local_size_x = 256) in;  // Work group size

void main() {
    uint index = gl_GlobalInvocationID.x;  // Particle index
    // Update particle[index]...
}

Key concepts:

  • Work Group: Batch of invocations (threads) executing together
  • Local Size: Invocations per work group (e.g., 256)
  • Dispatch: Number of work groups to execute (e.g., 1024/256 = 4)
  • Global ID: Unique invocation index across all work groups

For 1024 particles with local_size_x=256:

  • Dispatch 4 work groups
  • Each work group processes 256 particles
  • Total: 4 × 256 = 1024 invocations

Spec: https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VkComputePipelineCreateInfo.html
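The dispatch arithmetic generalizes to particle counts that are not an exact multiple of the work group size; a ceiling division (a hypothetical helper, not part of the example) covers the remainder:

```cpp
#include <cstdint>

// Hypothetical helper: smallest group count with groups * localSize >= items.
constexpr uint32_t workGroupCount(uint32_t items, uint32_t localSize)
{
    return (items + localSize - 1) / localSize; // ceiling division
}

static_assert(workGroupCount(1024, 256) == 4); // exact fit, as in this example
static_assert(workGroupCount(1025, 256) == 5); // one extra group for the remainder
```

When the count is not a multiple of the local size, the last group has idle invocations, so the shader needs a bounds guard such as `if (gl_GlobalInvocationID.x >= particleCount) return;`.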

Storage Buffers (SSBO):

Storage buffers are large, read-write GPU buffers accessible from shaders. Unlike uniform buffers (read-only, size-limited), SSBOs can be:

  • Large: Megabytes to gigabytes (vs ~64KB UBO limit)
  • Writable: Shaders can modify data
  • Structured: Arrays of structs with arbitrary layouts
  • Shared: Written by the compute stage, then read by the graphics stage
struct ParticleData {
    glm::vec4 position;
    glm::vec4 velocity;
    glm::vec4 color;
};

Filename: compute_particles/compute_particles.cpp

This struct exists identically in both C++ and shader code, allowing seamless GPU-CPU data sharing.
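That layout match can be checked at compile time. A minimal sketch, assuming a plain `float` quadruple as a stand-in for glm::vec4 (same 16-byte size) and std430-style vec4 packing:

```cpp
#include <cstddef>

// Stand-in for glm::vec4: four contiguous 32-bit floats, 16 bytes.
struct Vec4 { float x, y, z, w; };

// Mirror of the shared ParticleData struct.
struct ParticleData {
    Vec4 position; // offset 0
    Vec4 velocity; // offset 16
    Vec4 color;    // offset 32
};

// With vec4-only members, the C++ and std430 layouts agree:
// 16-byte member strides, 48 bytes total.
static_assert(sizeof(ParticleData) == 48, "size must match the shader");
static_assert(offsetof(ParticleData, velocity) == 16, "velocity offset");
static_assert(offsetof(ParticleData, color) == 32, "color offset");
```

Structs mixing vec3 and scalar members would not line up this neatly; keeping every member a vec4 sidesteps std430 alignment surprises.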

Instanced Rendering:

After the compute pass updates the particle data, the graphics pass renders one triangle per particle using instancing. Instancing draws the same base geometry (3 vertices) many times, supplying per-instance data (position, color) for each copy:

  • Base geometry: Triangle shape (3 vertices)
  • Per-instance: Particle position/color (1024 instances)
  • Result: 1024 triangles with one draw call

Reference: https://www.khronos.org/opengl/wiki/Vertex_Specification#Instanced_arrays

Implementation

Creating the Storage Buffer:

            const BufferOptions particlesBufferOptions = {
                .size = ParticlesCount * sizeof(ParticleData),
                .usage = BufferUsageFlagBits::VertexBufferBit | BufferUsageFlagBits::StorageBufferBit,
                .memoryUsage = MemoryUsage::CpuToGpu // So we can map it to CPU address space
            };
            const std::vector<ParticleData> particles = initializeParticles(ParticlesCount);
            m_particleDataBuffer = m_device.createBuffer(particlesBufferOptions, particles.data());

Filename: compute_particles/compute_particles.cpp

Key flags:

  • StorageBufferBit: Accessible as SSBO in shaders
  • VertexBufferBit: Also used as vertex buffer for instanced rendering
  • CpuToGpu: CPU initializes data, GPU updates it

Vertex Buffer for Shared Triangle:

            const BufferOptions triangleBufferOptions = {
                .size = 3 * sizeof(Vertex),
                .usage = BufferUsageFlagBits::VertexBufferBit,
                .memoryUsage = MemoryUsage::CpuToGpu
            };

            const float r = 0.08f;
            std::array<Vertex, 3> vertexData;
            vertexData[0] = { { r * std::cos(7.0f * M_PI / 6.0f), -r * std::sin(7.0f * M_PI / 6.0f), 0.0f } }; // Bottom-left
            vertexData[1] = { { r * std::cos(11.0f * M_PI / 6.0f), -r * std::sin(11.0f * M_PI / 6.0f), 0.0f } }; // Bottom-right
            vertexData[2] = { { 0.0f, -r, 0.0f } }; // Top
            m_triangleVertexBuffer = m_device.createBuffer(triangleBufferOptions, vertexData.data());

Filename: compute_particles/compute_particles.cpp

This is the base triangle shape rendered for each particle. All particles share this geometry.

The shader:

  1. Reads particle data (position, velocity, color)
  2. Updates position based on velocity
  3. Bounces particles off screen boundaries
  4. Writes back to same buffer

All 1024 particles processed in parallel! See the compute shader source for implementation details.
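As a CPU-side sketch of those four steps (plain C++, not the example's actual GLSL; the `dt` parameter, field names, and NDC bounce bounds are assumptions):

```cpp
// CPU-side reference of one compute invocation's work: integrate the
// position by the velocity, then reflect the velocity at the screen edge.
struct Particle {
    float position[2]; // x, y in normalized device coordinates
    float velocity[2];
};

void updateParticle(Particle& p, float dt)
{
    for (int axis = 0; axis < 2; ++axis) {
        p.position[axis] += p.velocity[axis] * dt;
        if (p.position[axis] > 1.0f || p.position[axis] < -1.0f)
            p.velocity[axis] = -p.velocity[axis]; // bounce off the boundary
    }
}
```

On the GPU, this function body becomes the compute shader's `main`, with `gl_GlobalInvocationID.x` selecting which particle each invocation updates.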

Loading Compute Shader:

        auto computeShaderPath = KDGpuExample::assetDir().file("shaders/examples/compute_particles/particles.comp.spv");
        auto computeShader = m_device.createShaderModule(KDGpuExample::readShaderFile(computeShaderPath));

Filename: compute_particles/compute_particles.cpp

Compute shaders use the .comp extension and execute on a queue with compute capability (on most hardware, the graphics queue supports compute as well).

Compute Pipeline Setup:

        // Create bind group layout consisting of a single binding holding a SSBO
        const BindGroupLayoutOptions bindGroupLayoutOptions = {
            .bindings = {
                    {
                            .binding = 0,
                            .resourceType = ResourceBindingType::StorageBuffer,
                            .shaderStages = ShaderStageFlags(ShaderStageFlagBits::ComputeBit),
                    },
            },
        };
        m_computeBindGroupLayout = m_device.createBindGroupLayout(bindGroupLayoutOptions);

        // Create a pipeline layout (array of bind group layouts)
        const PipelineLayoutOptions pipelineLayoutOptions = {
            .bindGroupLayouts = { m_computeBindGroupLayout }
        };
        m_computePipelineLayout = m_device.createPipelineLayout(pipelineLayoutOptions);

        // Create a bindGroup to hold the Particles SSBO
        const BindGroupOptions bindGroupOptions{
            .layout = m_computeBindGroupLayout,
            .resources = {
                    {
                            .binding = 0,
                            .resource = StorageBufferBinding{ .buffer = m_particleDataBuffer },
                    },
            },
        };
        m_particleBindGroup = m_device.createBindGroup(bindGroupOptions);

        const ComputePipelineOptions pipelineOptions{
            .layout = m_computePipelineLayout,
            .shaderStage = {
                    .shaderModule = computeShader,
                    // Use a specialization constant to set the local X workgroup size
                    .specializationConstants = {
                            {
                                    .constantId = 0,
                                    .value = 256,
                            },
                    },
            }
        };

        m_computePipeline = m_device.createComputePipeline(pipelineOptions);

Filename: compute_particles/compute_particles.cpp

Note: Compute pipelines are much simpler than graphics pipelines - just shader + layout!

The shader uses a specialization constant for work group size (256), allowing runtime configuration.

Graphics Pipeline with Instancing:

            .vertex = {
                .buffers = {
                    { .binding = 0, .stride = sizeof(Vertex) },
                    { .binding = 1, .stride = sizeof(ParticleData), .inputRate = VertexRate::Instance }
                },
                .attributes = {
                    { .location = 0, .binding = 0, .format = Format::R32G32B32_SFLOAT }, // Vertex Position
                    { .location = 1, .binding = 1, .format = Format::R32G32B32A32_SFLOAT }, // Particle Position
                    { .location = 2, .binding = 1, .format = Format::R32G32B32A32_SFLOAT, .offset = 2 * sizeof(glm::vec4) } // Particle Color
                }
            },

Filename: compute_particles/compute_particles.cpp

Two vertex buffers:

  • Binding 0: Triangle geometry (Vertex rate - advances per vertex)
  • Binding 1: Particle data (Instance rate - advances per instance)

The inputRate = VertexRate::Instance tells the GPU to use one particle data entry per instance, not per vertex.
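That fetch rule can be modeled in a few lines (hypothetical helper names, not KDGpu API): all 3 vertices of instance 57 read particle entry 57, while the triangle attribute still advances per vertex.

```cpp
#include <cstdint>

enum class VertexRate { Vertex, Instance };

// Which buffer element an attribute fetch reads for a given
// (vertex, instance) pair: vertex-rate attributes advance per vertex,
// instance-rate attributes advance per instance.
uint32_t attributeIndex(VertexRate rate, uint32_t vertexIndex, uint32_t instanceIndex)
{
    return rate == VertexRate::Vertex ? vertexIndex : instanceIndex;
}
```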

Dispatching Compute and Memory Barrier:

    auto commandRecorder = m_device.createCommandRecorder();
    {
        // Compute
        auto computePass = commandRecorder.beginComputePass();
        computePass.setPipeline(m_computePipeline);
        computePass.setBindGroup(0, m_particleBindGroup);
        constexpr size_t LocalWorkGroupXSize = 256;
        computePass.dispatchCompute(ComputeCommand{ .workGroupX = ParticlesCount / LocalWorkGroupXSize });
        computePass.end();

        // Barrier to force waiting for the compute pass's SSBO writes to complete
        // before the vertex input stage reads the per-instance vertex attributes
        commandRecorder.memoryBarrier(MemoryBarrierOptions{
                .srcStages = PipelineStageFlags(PipelineStageFlagBit::ComputeShaderBit),
                .dstStages = PipelineStageFlags(PipelineStageFlagBit::VertexInputBit),
                .memoryBarriers = {
                        {
                                .srcMask = AccessFlags(AccessFlagBit::ShaderWriteBit),
                                .dstMask = AccessFlags(AccessFlagBit::VertexAttributeReadBit),
                        },
                },
        });

Filename: compute_particles/compute_particles.cpp

Critical steps:

  1. Dispatch compute: 1024 particles / 256 per work group = 4 work groups
  2. Memory barrier: Ensures compute writes complete before graphics reads
    • srcStages = ComputeShaderBit: Wait for compute to finish
    • dstStages = VertexInputBit: Before vertex shader reads
    • srcMask = ShaderWriteBit: Compute wrote to buffer
    • dstMask = VertexAttributeReadBit: Graphics will read as vertex data

Without the barrier, graphics might read stale particle data (race condition)!

Spec: https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VkMemoryBarrier.html

Instanced Draw:

        // Render
        auto opaquePass = commandRecorder.beginRenderPass(RenderPassCommandRecorderOptions{
                .colorAttachments = {
                        {
                                .view = m_swapchainViews.at(m_currentSwapchainImageIndex),
                                .clearValue = { 0.3f, 0.3f, 0.3f, 1.0f },
                                .finalLayout = TextureLayout::PresentSrc,
                        },
                },
                .depthStencilAttachment = {
                        .view = m_depthTextureView,
                },
        });
        opaquePass.setPipeline(m_graphicsPipeline);
        opaquePass.setVertexBuffer(0, m_triangleVertexBuffer);
        opaquePass.setVertexBuffer(1, m_particleDataBuffer); // Per instance Data
        opaquePass.draw(DrawCommand{ .vertexCount = 3, .instanceCount = ParticlesCount });
        renderImGuiOverlay(&opaquePass);
        opaquePass.end();

Filename: compute_particles/compute_particles.cpp

The GPU runs 3 vertices × 1024 instances = 3072 vertex shader invocations, yet only 3 vertices of geometry (plus the 1024 particle entries) need to be stored.

Performance Notes

Compute Performance:

  • Work Group Size: 256 is good for most GPUs (multiples of 32/64 for warp efficiency)
  • Memory Access: Sequential access pattern (particle[i]) is cache-friendly
  • ALU vs Memory: This shader is compute-bound (simple math), not memory-bound
  • Occupancy: A small local size may limit GPU occupancy - try 512 or 1024 (within the device's maxComputeWorkGroupSize limit)

Graphics Performance:

  • Instancing: 1024 instances in one draw vs 1024 draws = massive CPU savings
  • Vertex Reuse: 3-vertex triangle reused 1024 times (good cache hit rate)
  • Per-Instance Data: Reading from SSBO is fast on modern GPUs
  • Overdraw: Particles may overlap - consider depth sorting or alpha blending

Scaling:

  • Easy to scale to 100K+ particles by increasing particle count
  • Ensure dispatch dimensions stay within device limits (at least 65535 work groups per dimension are guaranteed; see maxComputeWorkGroupCount)
  • For huge counts, consider GPU frustum culling and indirect draws
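When the 1D group count would exceed the per-dimension limit, the dispatch can be folded into two dimensions. A minimal sketch under that assumption (hypothetical helper, not part of the example):

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper: split a 1D work-group count into (x, y) so each
// dimension stays within maxPerDim (Vulkan guarantees at least 65535 per
// dimension). x * y may overshoot the requested count, so the shader must
// bounds-check its flattened index.
std::pair<uint32_t, uint32_t> splitDispatch(uint32_t groups, uint32_t maxPerDim = 65535)
{
    if (groups <= maxPerDim)
        return { groups, 1 };
    const uint32_t y = (groups + maxPerDim - 1) / maxPerDim; // rows needed
    const uint32_t x = (groups + y - 1) / y;                 // groups per row
    return { x, y };
}
```

In the shader, the particle index is then reconstructed from both dimensions, e.g. `index = gl_GlobalInvocationID.y * rowStride + gl_GlobalInvocationID.x`, followed by an `index < particleCount` guard.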

Memory Barriers:

  • Barriers have small cost but are essential for correctness
  • Could optimize with separate compute/graphics queues (advanced)
  • Pipeline barriers are cheaper than full device waitIdle


Updated on 2026-03-31 at 00:02:07 +0000