Skip to content

GPU Memory Model

Understanding CUDA memory hierarchy is crucial for optimizing ray tracing performance.

Memory Hierarchy

Memory Types

TypeCapacityLatencyScopeUsage
GlobalGBs400+ cyclesGlobalScene data, output buffer
L2 CacheMBs~30 cyclesAutoCache global memory access
Shared48KB/SM~5 cyclesBlockBVH stack, temporary data
Local256KB/SM~5 cyclesThreadLocal variables
Register255/thread1 cycleThreadComputation intermediates

Coalesced Access

Global memory access should be coalesced so accesses from the same warp land on contiguous addresses:

cpp
// Good: coalesced access
float value = global_array[threadIdx.x];

// Bad: random access
float value = global_array[random_index[threadIdx.x]];

BVH Traversal Memory Optimization

Use shared memory for traversal stack:

cpp
__global__ void render_kernel(...) {
    // Shared memory for BVH stack
    __shared__ int shared_stack[32][64];  // 32 threads, depth 64

    int* my_stack = shared_stack[threadIdx.x];
    // ... traversal code
}

Scene Data Layout

Optimize data layout for cache efficiency:

cpp
// Array of Structures (AoS) - Not recommended
struct Sphere {
    vec3 center;
    float radius;
    Material material;
};
Sphere spheres[N];

// Structure of Arrays (SoA) - Recommended
struct SphereSoA {
    vec3 centers[N];
    float radii[N];
    Material materials[N];
};

Memory Usage Analysis

DataTypeSize Estimate
Sphere ArrayGlobalN × 32 bytes
BVH NodesGlobal(2N-1) × 32 bytes
Ray BufferGlobalW×H × 32 bytes
Output ImageGlobalW×H × 12 bytes
BVH StackShared32 × 64 × 4 bytes

Bandwidth Optimization

Read-Only Data Cache

Use __ldg intrinsic for read-only data:

cpp
// Hint to cache in read-only cache
float value = __ldg(&global_array[idx]);

Vectorized Access

Use vector types for wider memory transactions:

cpp
// Load 128 bits at once
float4 data = reinterpret_cast<float4*>(ptr)[idx];

Memory Pressure Analysis

When occupancy is low due to register pressure:

  1. Reduce register usage: Simplify kernel code
  2. Launch configuration: Adjust block size
  3. Shared memory: Trade shared memory for registers

References

  • [CUDA Programming Guide] NVIDIA
  • [Aila & Karras 2010] "Understanding Ray Traversal Efficiency"

Technical Whitepaper · Built with VitePress