CUDA Streams

Asynchronous execution using CUDA streams.

Overview

CUDA streams enable overlapping computation and data transfer for higher throughput.

Stream Basics

A CUDA stream is a sequence of operations that execute in order:

cpp

cudaStream_t stream;
cudaStreamCreate(&stream);

// Async operations
cudaMemcpyAsync(d_dst, h_src, size, cudaMemcpyHostToDevice, stream);
kernel<<<grid, block, 0, stream>>>(...);
cudaMemcpyAsync(h_dst, d_src, size, cudaMemcpyDeviceToHost, stream);

cudaStreamDestroy(stream);

Multi-Stream Pipeline

Mini-OpenCV uses multiple streams for pipelining:

cpp

class StreamManager {
public:
    cudaStream_t getStream();
    void synchronize();
    
private:
    std::vector<cudaStream_t> streams_;
    size_t nextStream_ = 0;
};

PipelineProcessor

The PipelineProcessor class handles multi-stream execution:

cpp

PipelineProcessor processor(4);  // 4 streams

// Process multiple images concurrently
for (auto& image : images) {
    auto stream = processor.getStream();
    processor.gaussianBlur(image, 5, 1.5f, stream);
}

processor.synchronize();

Performance Benefits

Scenario	Single Stream	Multi-Stream	Improvement
10 images	50 ms	35 ms	30%
100 images	500 ms	280 ms	44%

Best Practices

Use 2-4 streams: More streams have diminishing returns
Batch similar operations: Group same kernel calls
Synchronize at boundaries: Only sync when necessary

Next Steps

Design Decisions - Why we chose this approach
Benchmarks - Performance data

CUDA Streams ​

Overview ​

Stream Basics ​

Multi-Stream Pipeline ​

PipelineProcessor ​

Performance Benefits ​

Best Practices ​

Next Steps ​