GPU Memory Model

Understanding CUDA memory hierarchy is crucial for optimizing ray tracing performance.

Memory Hierarchy

Memory Types

Type	Capacity	Latency	Scope	Usage
Global	GBs	400+ cycles	Global	Scene data, output buffer
L2 Cache	MBs	~30 cycles	Auto	Cache global memory access
Shared	48KB/SM	~5 cycles	Block	BVH stack, temporary data
Local	256KB/SM	~5 cycles	Thread	Local variables
Register	255/thread	1 cycle	Thread	Computation intermediates

Coalesced Access

Global memory access should be coalesced so accesses from the same warp land on contiguous addresses:

cpp

// Good: coalesced access
float value = global_array[threadIdx.x];

// Bad: random access
float value = global_array[random_index[threadIdx.x]];

BVH Traversal Memory Optimization

Use shared memory for traversal stack:

cpp

__global__ void render_kernel(...) {
    // Shared memory for BVH stack
    __shared__ int shared_stack[32][64];  // 32 threads, depth 64

    int* my_stack = shared_stack[threadIdx.x];
    // ... traversal code
}

Scene Data Layout

Optimize data layout for cache efficiency:

cpp

// Array of Structures (AoS) - Not recommended
struct Sphere {
    vec3 center;
    float radius;
    Material material;
};
Sphere spheres[N];

// Structure of Arrays (SoA) - Recommended
struct SphereSoA {
    vec3 centers[N];
    float radii[N];
    Material materials[N];
};

Memory Usage Analysis

Data	Type	Size Estimate
Sphere Array	Global	N × 32 bytes
BVH Nodes	Global	(2N-1) × 32 bytes
Ray Buffer	Global	W×H × 32 bytes
Output Image	Global	W×H × 12 bytes
BVH Stack	Shared	32 × 64 × 4 bytes

Bandwidth Optimization

Read-Only Data Cache

Use __ldg intrinsic for read-only data:

cpp

// Hint to cache in read-only cache
float value = __ldg(&global_array[idx]);

Vectorized Access

Use vector types for wider memory transactions:

cpp

// Load 128 bits at once
float4 data = reinterpret_cast<float4*>(ptr)[idx];

Memory Pressure Analysis

When occupancy is low due to register pressure:

Reduce register usage: Simplify kernel code
Launch configuration: Adjust block size
Shared memory: Trade shared memory for registers

References

[CUDA Programming Guide] NVIDIA
[Aila & Karras 2010] "Understanding Ray Traversal Efficiency"

GPU Memory Model ​

Memory Hierarchy ​

Memory Types ​

Coalesced Access ​

BVH Traversal Memory Optimization ​

Scene Data Layout ​

Memory Usage Analysis ​

Bandwidth Optimization ​

Read-Only Data Cache ​

Vectorized Access ​

Memory Pressure Analysis ​

References ​