Hello Triangle with Frame Overlap

This example is visually identical to Hello Triangle but achieves significantly higher performance through proper frame synchronization. By allowing the CPU to prepare the next frame while the GPU renders the previous one, we eliminate idle time on both processors. This is a fundamental technique for real-time rendering applications.

The example uses the KDGpuExample helper API with the AdvancedExampleEngineLayer base class that manages multiple in-flight frames.

Overview

What this example demonstrates:

  • Triple-buffered rendering with independent frames "in-flight"
  • Fence-based CPU/GPU synchronization for frame pacing
  • Per-frame command buffers and synchronization primitives
  • Eliminating GPU bubbles and CPU stalls for maximum throughput

Performance benefit:

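  • Simple blocking: CPU waits for GPU → each processor sits idle roughly half the time when CPU and GPU work are balanced
  • Frame overlap: CPU and GPU work in parallel → both stay busy
  • Typical improvement: 50-100% higher frame rate, with the upper end reached when CPU and GPU costs per frame are similar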

Vulkan Requirements

  • Vulkan Version: 1.0+
  • Extensions: None (core synchronization)
  • Synchronization Primitives: VkFence and VkSemaphore

Key Concepts

The Problem with Blocking:

In Hello Triangle (SimpleExampleEngineLayer), the CPU calls device.waitUntilIdle() after submitting each frame. This creates a timeline like:

Frame 1: CPU prepares | CPU waits | GPU renders | idle
Frame 2:                                          CPU prepares | CPU waits | GPU renders | idle
Frame 3:                                                                                   CPU prepares | ...

The GPU is idle while the CPU prepares, and the CPU is idle while the GPU renders. Both processors run at ~50% utilization.

Triple-Buffered Frame Overlap:

By maintaining multiple frames "in-flight" simultaneously, we overlap CPU and GPU work:

Frame 1: CPU prepares | GPU renders |
Frame 2:                CPU prepares | GPU renders |
Frame 3:                               CPU prepares | GPU renders |
Frame 4:                                              CPU prepares | GPU renders |

The CPU prepares frame N+1 while the GPU renders frame N. Both processors stay busy.
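
The two timelines can be compared with a small back-of-the-envelope model. This is only a sketch: FrameCost, blockingFrameMs, overlappedFrameMs and speedup are hypothetical names, the costs are made-up inputs rather than measurements, and the model ignores presentation and VSync.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-frame costs in milliseconds (illustrative inputs only).
struct FrameCost {
    double cpuMs; // time the CPU spends preparing one frame
    double gpuMs; // time the GPU spends rendering one frame
};

// Blocking: the CPU prepares, then waits for the GPU, so the costs add up.
inline double blockingFrameMs(const FrameCost &c)
{
    assert(c.cpuMs >= 0.0 && c.gpuMs >= 0.0);
    return c.cpuMs + c.gpuMs;
}

// Overlap: in steady state the slower processor sets the frame time.
inline double overlappedFrameMs(const FrameCost &c)
{
    assert(c.cpuMs >= 0.0 && c.gpuMs >= 0.0);
    return std::max(c.cpuMs, c.gpuMs);
}

// Ratio of blocking to overlapped frame time (1.0 means no gain).
inline double speedup(const FrameCost &c)
{
    const double overlapped = overlappedFrameMs(c);
    assert(overlapped > 0.0);
    return blockingFrameMs(c) / overlapped;
}
```

Under this model, equal costs (for example 4 ms on each side) give a speedup of 2, while a lopsided split such as 1 ms CPU and 9 ms GPU gives only about 1.11.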

Why "Double-Buffered"?

We maintain 2 independent sets of resources (buffers, command buffers, fences):

  1. Frame N-1: GPU is rendering
  2. Frame N: CPU is preparing

This ensures we never access resources currently in use by the GPU.

Fences vs Semaphores:

Vulkan provides two synchronization primitives:

  • Fence (VkFence): CPU-GPU synchronization. CPU can wait on a fence to know when GPU work completes.
  • Semaphore (VkSemaphore): GPU-GPU synchronization. GPU waits on semaphores between queue submissions.

This example uses:

  • Per-frame fences: Before reusing a frame slot, the CPU waits on that slot's fence so the frame that last used it is known to have finished
  • Present semaphores: GPU waits for swapchain image acquisition before rendering
  • Render semaphores: Presentation waits for rendering to complete

For more on synchronization: https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VkFence.html

Implementation

AdvancedExampleEngineLayer vs SimpleExampleEngineLayer:

The key difference is removing device.waitUntilIdle() and managing per-frame resources:

  • SimpleExampleEngineLayer: Blocks after every frame, single command buffer
  • AdvancedExampleEngineLayer: Manages multiple in-flight frames, per-frame resources

To see what AdvancedExampleEngineLayer does behind the scenes, study the Hello Triangle Native API example, which implements all of the synchronization manually.

Frame Overlap Synchronization:

The render function uses per-frame indices to manage independent frame resources:

    renderImGuiOverlay(&opaquePass, m_inFlightIndex);
    opaquePass.end();
    m_commandBuffers[m_inFlightIndex] = commandRecorder.finish();

    const SubmitOptions submitOptions = {
        .commandBuffers = { m_commandBuffers[m_inFlightIndex] },
        .waitSemaphores = { m_presentCompleteSemaphores[m_inFlightIndex] }, // Wait for swapchain image acquisition
        .signalSemaphores = { m_renderCompleteSemaphores[m_currentSwapchainImageIndex] },
        .signalFence = m_frameFences[m_inFlightIndex] // Signal Fence once submission and execution is complete
    };

Filename: hello_triangle_overlap/hello_triangle.cpp

Key points:

  • m_inFlightIndex: Current frame slot (0 or 1; also 2 when triple-buffering)
  • m_commandBuffers[m_inFlightIndex]: This frame's command buffer (each frame has its own)
  • m_frameFences[m_inFlightIndex]: Signal this fence when GPU finishes this frame
  • m_presentCompleteSemaphores[m_inFlightIndex]: Wait for swapchain image acquisition
  • m_renderCompleteSemaphores[m_currentSwapchainImageIndex]: Signal when rendering completes

Frame Lifecycle:

Each frame goes through this cycle:

  1. Acquire: Get next swapchain image (signals present semaphore)
  2. Wait: CPU waits on the fence of frame N-2, the frame that last used this slot (ensures that frame has finished)
  3. Record: CPU records command buffer for current frame N
  4. Submit: GPU begins executing commands (waits on present semaphore, signals render semaphore and fence)
  5. Present: Display frame on screen (waits on render semaphore)

By the time we reach frame N, frame N-2 has definitely completed (fence wait), so we can safely reuse its resources.
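
The slot reuse rule can be sketched on the CPU side without any GPU code. The following is a minimal model, not the engine layer's implementation: FrameSlotTracker, WaitTarget and kFramesInFlight are hypothetical names, the slot count of 2 is an assumption, and a plain flag stands in for each per-frame fence.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed slot count for illustration; the engine layer defines its own value.
constexpr std::size_t kFramesInFlight = 2;

// Result of the "Wait" step for one frame.
struct WaitTarget {
    bool needed;         // false while the slot has never been used
    std::uint64_t frame; // frame whose fence guards the slot (valid if needed)
};

// Hypothetical bookkeeping that stands in for the per-frame fences.
class FrameSlotTracker
{
public:
    // Step 2 (Wait): report which earlier frame must have finished before
    // frameNumber may reuse its slot, then mark the slot as free.
    WaitTarget beginFrame(std::uint64_t frameNumber)
    {
        Slot &slot = m_slots[slotOf(frameNumber)];
        const WaitTarget target{ slot.busy, slot.frame };
        slot.busy = false;
        return target;
    }

    // Step 4 (Submit): the slot stays busy until its fence is waited on.
    void submit(std::uint64_t frameNumber)
    {
        Slot &slot = m_slots[slotOf(frameNumber)];
        assert(!slot.busy && "slot reused without waiting on its fence");
        slot.busy = true;
        slot.frame = frameNumber;
    }

    static std::size_t slotOf(std::uint64_t frameNumber)
    {
        return static_cast<std::size_t>(frameNumber % kFramesInFlight);
    }

private:
    struct Slot {
        bool busy = false;
        std::uint64_t frame = 0;
    };
    std::array<Slot, kFramesInFlight> m_slots{};
};
```

With two slots, frames 0 and 1 start without waiting, frame 2 waits for frame 0, and frame 3 waits for frame 1, which matches the N-2 rule described above.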

Resource Management:

Each frame needs independent resources:

// In AdvancedExampleEngineLayer:
std::array<CommandBuffer, FRAMES_IN_FLIGHT> m_commandBuffers;
std::array<Fence, FRAMES_IN_FLIGHT> m_frameFences;
std::array<Semaphore, FRAMES_IN_FLIGHT> m_presentCompleteSemaphores;

// Swapchain images typically have their own semaphores:
std::array<Semaphore, SWAPCHAIN_IMAGES> m_renderCompleteSemaphores;
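
The arrays above are indexed in two different ways: by in-flight slot and by swapchain image. A small sketch of that selection follows. SyncIndices and selectSyncIndices are hypothetical helpers, and the counts of 2 frames in flight and 3 swapchain images are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed sizes for illustration; the real values come from the engine layer
// and from the swapchain that the application creates.
constexpr std::size_t kFramesInFlight = 2;
constexpr std::size_t kSwapchainImages = 3;

// Which element of each per-frame or per-image array a frame uses.
struct SyncIndices {
    std::size_t commandBuffer;            // indexed by in-flight slot
    std::size_t frameFence;               // indexed by in-flight slot
    std::size_t presentCompleteSemaphore; // indexed by in-flight slot
    std::size_t renderCompleteSemaphore;  // indexed by swapchain image
};

// Derive the indices for one frame from its number and the image index
// returned by swapchain acquisition.
inline SyncIndices selectSyncIndices(std::uint64_t frameNumber,
                                     std::size_t acquiredImageIndex)
{
    assert(acquiredImageIndex < kSwapchainImages);
    const std::size_t inFlightIndex =
            static_cast<std::size_t>(frameNumber % kFramesInFlight);
    return { inFlightIndex, inFlightIndex, inFlightIndex, acquiredImageIndex };
}
```

For example, frame 5 with acquired image 2 would use slot 1 for its command buffer, fence and present semaphore, and element 2 of the render semaphores.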

Performance Notes

  • Latency vs Throughput: Double/Triple-buffering increases throughput (FPS) but adds 1-2 frames of input latency. For competitive games, consider double-buffering.
  • CPU-bound vs GPU-bound: Frame overlap helps most when the CPU and GPU costs per frame are comparable. If the application is heavily GPU-bound, overlap hides only the small CPU cost, so FPS barely changes.
  • Frame pacing: Fences prevent unlimited buffering. Without them, CPU could queue dozens of frames, causing massive latency.
  • VSync interaction: With VSync enabled, you may not see FPS improvement but will have smoother frame times.
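
The latency trade-off in the first note can be estimated roughly. This sketch assumes a steady frame time and that each in-flight frame beyond the first adds one frame time of delay; addedLatencyMs is a hypothetical helper, not part of KDGpuExample.

```cpp
#include <cassert>

// Rough estimate of the extra input latency caused by queued frames,
// assuming a constant frame time and (framesInFlight - 1) queued frames.
inline double addedLatencyMs(unsigned framesInFlight, double frameTimeMs)
{
    assert(framesInFlight >= 1 && frameTimeMs >= 0.0);
    return static_cast<double>(framesInFlight - 1) * frameTimeMs;
}
```

At a 16 ms frame time this gives about 16 ms of added latency for two frames in flight and about 32 ms for three, in line with the 1-2 frames quoted above.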

Updated on 2026-03-31 at 00:02:07 +0000