GPU Memory Model
Understanding CUDA memory hierarchy is crucial for optimizing ray tracing performance.
Memory Hierarchy
Memory Types
| Type | Capacity | Latency | Scope | Usage |
|---|---|---|---|---|
| Global | GBs | 400+ cycles | Global | Scene data, output buffer |
| L2 Cache | MBs | ~30 cycles | Auto | Cache global memory access |
| Shared | 48KB/SM | ~5 cycles | Block | BVH stack, temporary data |
| Local | 256KB/SM | ~5 cycles | Thread | Local variables |
| Register | 255/thread | 1 cycle | Thread | Computation intermediates |
Coalesced Access
Global memory access should be coalesced so accesses from the same warp land on contiguous addresses:
cpp
// Good: coalesced access
float value = global_array[threadIdx.x];
// Bad: random access
float value = global_array[random_index[threadIdx.x]];BVH Traversal Memory Optimization
Use shared memory for traversal stack:
cpp
__global__ void render_kernel(...) {
// Shared memory for BVH stack
__shared__ int shared_stack[32][64]; // 32 threads, depth 64
int* my_stack = shared_stack[threadIdx.x];
// ... traversal code
}Scene Data Layout
Optimize data layout for cache efficiency:
cpp
// Array of Structures (AoS) - Not recommended
struct Sphere {
vec3 center;
float radius;
Material material;
};
Sphere spheres[N];
// Structure of Arrays (SoA) - Recommended
struct SphereSoA {
vec3 centers[N];
float radii[N];
Material materials[N];
};Memory Usage Analysis
| Data | Type | Size Estimate |
|---|---|---|
| Sphere Array | Global | N × 32 bytes |
| BVH Nodes | Global | (2N-1) × 32 bytes |
| Ray Buffer | Global | W×H × 32 bytes |
| Output Image | Global | W×H × 12 bytes |
| BVH Stack | Shared | 32 × 64 × 4 bytes |
Bandwidth Optimization
Read-Only Data Cache
Use __ldg intrinsic for read-only data:
cpp
// Hint to cache in read-only cache
float value = __ldg(&global_array[idx]);Vectorized Access
Use vector types for wider memory transactions:
cpp
// Load 128 bits at once
float4 data = reinterpret_cast<float4*>(ptr)[idx];Memory Pressure Analysis
When occupancy is low due to register pressure:
- Reduce register usage: Simplify kernel code
- Launch configuration: Adjust block size
- Shared memory: Trade shared memory for registers
References
- [CUDA Programming Guide] NVIDIA
- [Aila & Karras 2010] "Understanding Ray Traversal Efficiency"