CUDA Optimization

GPU optimization techniques used in Mini-ImagePipe.

Memory Hierarchy

Memory Types

Memory	Location	Speed	Scope
Global	DRAM	~800 GB/s	All threads
Shared	On-chip	~2 TB/s	Block
Constant	Cache	~1 TB/s	All threads
Registers	On-chip	~10 TB/s	Thread

Shared Memory Tiling

For convolution operations, we load tile + halo into shared memory:

Tile Configuration

cpp

// For 5×5 kernel with 16×16 tiles
constexpr int TILE = 16;
constexpr int HALO = 2;  // (kernel_size - 1) / 2

__shared__ float sharedMem[TILE + 2*HALO][TILE + 2*HALO];

Halo Region Loading

+------------------+
|  Halo (top)      |
+--+------------+--+
|H |   Tile     | H|
|a |   Data     | a|
|l |            | l|
|o |            | o|
+--+------------+--+
|  Halo (bottom)   |
+------------------+

Constant Memory

Sobel kernels are stored in constant memory for broadcast:

cpp

__constant__ float c_sobelX[9] = {-1, 0, 1, -2, 0, 2, -1, 0, 1};
__constant__ float c_sobelY[9] = {-1, -2, -1, 0, 0, 0, 1, 2, 1};

// All threads in a warp access the same address → single memory fetch

Stream Concurrency

Independent Execution

cpp

// Stream 0: Operator A
kernelA<<<grid, block, 0, stream0>>>(...);
cudaEventRecord(eventA, stream0);

// Stream 1: Operator B (independent)
kernelB<<<grid, block, 0, stream1>>>(...);

// Stream 0: Operator C (depends on A)
cudaStreamWaitEvent(stream0, eventA);
kernelC<<<grid, block, 0, stream0>>>(...);

Occupancy Optimization

Thread Block Sizing

cpp

// Target: maximize occupancy
// Consider: registers per thread, shared memory per block

dim3 blockSize(16, 16);  // 256 threads per block
dim3 gridSize((width + 15) / 16, (height + 15) / 16);

Register Pressure

cpp

// Reduce register usage for higher occupancy
__launch_bounds__(256, 4)  // 256 threads, min 4 blocks per SM
__global__ void myKernel(...) {
    // ...
}

Performance Guidelines

Coalesced access: Ensure threads access consecutive memory addresses
Shared memory bank conflicts: Pad arrays to avoid bank conflicts
Divergent warps: Minimize branching within warps
Memory transfers: Use pinned memory for async transfers

References

CUDA Best Practices Guide
Harris, M. (2007). "Optimizing Parallel Reduction in CUDA."

CUDA Optimization ​

Memory Hierarchy ​

Memory Types ​

Shared Memory Tiling ​

Tile Configuration ​

Halo Region Loading ​

Constant Memory ​

Stream Concurrency ​

Independent Execution ​

Occupancy Optimization ​

Thread Block Sizing ​

Register Pressure ​

Performance Guidelines ​

References ​

CUDA Optimization

Memory Hierarchy

Memory Types

Shared Memory Tiling

Tile Configuration

Halo Region Loading

Constant Memory

Stream Concurrency

Independent Execution

Occupancy Optimization

Thread Block Sizing

Register Pressure

Performance Guidelines

References