CUDA Optimization
GPU optimization techniques used in Mini-ImagePipe.
Memory Hierarchy
Memory Types
| Memory | Location | Speed | Scope |
|---|---|---|---|
| Global | DRAM | ~800 GB/s | All threads |
| Shared | On-chip | ~2 TB/s | Block |
| Constant | Cache | ~1 TB/s | All threads |
| Registers | On-chip | ~10 TB/s | Thread |
Shared Memory Tiling
For convolution operations, we load tile + halo into shared memory:
Tile Configuration
cpp
// For 5×5 kernel with 16×16 tiles
constexpr int TILE = 16;
constexpr int HALO = 2; // (kernel_size - 1) / 2
__shared__ float sharedMem[TILE + 2*HALO][TILE + 2*HALO];Halo Region Loading
+------------------+
| Halo (top) |
+--+------------+--+
|H | Tile | H|
|a | Data | a|
|l | | l|
|o | | o|
+--+------------+--+
| Halo (bottom) |
+------------------+Constant Memory
Sobel kernels are stored in constant memory for broadcast:
cpp
__constant__ float c_sobelX[9] = {-1, 0, 1, -2, 0, 2, -1, 0, 1};
__constant__ float c_sobelY[9] = {-1, -2, -1, 0, 0, 0, 1, 2, 1};
// All threads in a warp access the same address → single memory fetchStream Concurrency
Independent Execution
cpp
// Stream 0: Operator A
kernelA<<<grid, block, 0, stream0>>>(...);
cudaEventRecord(eventA, stream0);
// Stream 1: Operator B (independent)
kernelB<<<grid, block, 0, stream1>>>(...);
// Stream 0: Operator C (depends on A)
cudaStreamWaitEvent(stream0, eventA);
kernelC<<<grid, block, 0, stream0>>>(...);Occupancy Optimization
Thread Block Sizing
cpp
// Target: maximize occupancy
// Consider: registers per thread, shared memory per block
dim3 blockSize(16, 16); // 256 threads per block
dim3 gridSize((width + 15) / 16, (height + 15) / 16);Register Pressure
cpp
// Reduce register usage for higher occupancy
__launch_bounds__(256, 4) // 256 threads, min 4 blocks per SM
__global__ void myKernel(...) {
// ...
}Performance Guidelines
- Coalesced access: Ensure threads access consecutive memory addresses
- Shared memory bank conflicts: Pad arrays to avoid bank conflicts
- Divergent warps: Minimize branching within warps
- Memory transfers: Use pinned memory for async transfers
References
- CUDA Best Practices Guide
- Harris, M. (2007). "Optimizing Parallel Reduction in CUDA."