Architecture Overview
Mini-ImagePipe is organized as a DAG-native runtime stack that prioritizes deterministic execution, explicit control over memory behavior, and transparent GPU utilization.
Layered runtime model
The architecture can be read in four layers:
- Pipeline builds and validates execution plans.
- TaskGraph represents dependency topology and computes topological order.
- DAGScheduler maps ready tasks to CUDA streams with event-based synchronization.
- MemoryManager supplies pooled pinned/device allocations and per-task workspace.
Core runtime responsibilities
| Component | Primary responsibility | Key implementation hooks |
|---|---|---|
| Pipeline | Graph assembly, input wiring, buffer lifecycle | addOperator(), connect(), setInput(), execute() |
| TaskGraph | DAG validation and order computation | validate(), getTopologicalOrder(), areIndependent() |
| DAGScheduler | Stream assignment, sync insertion, failure propagation | execute(), setErrorCallback(), internal insertSynchronization() |
| MemoryManager | Pinned/device pooling, async allocation mode, workspace management | allocateDevice(), allocatePinned(), allocateWorkspace() |
Execution semantics
1) Graph validation and ordering
Before launch, Pipeline::execute() verifies graph validity and computes a topological order. If validation fails, execution exits with cudaErrorInvalidValue.
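```cpp
// Validate the graph before any work is launched.
TaskGraph& graph = pipeline.getTaskGraph();
if (!graph.validate()) {
    return cudaErrorInvalidValue;
}
// Tasks are then visited in dependency order.
auto order = graph.getTopologicalOrder();
```
2) Input and output contract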
Each task consumes one or more ImageBuffer inputs and produces one ImageBuffer output. Task-local dimensions are derived from upstream outputs and operator output-dimension logic.
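
The way task-local dimensions follow from upstream outputs can be sketched as below. This is a minimal illustration, not the project's code: `Dims`, `OutputDimsFn`, `propagateDims`, `halve`, and `toGray` are hypothetical names, and the sketch assumes a linear chain where each operator has exactly one input.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for the shape metadata an ImageBuffer would carry.
struct Dims {
    int width;
    int height;
    int channels;
};

// Hypothetical per-operator rule mapping input dimensions to output dimensions.
using OutputDimsFn = std::function<Dims(const Dims&)>;

// Walks a linear chain of operators and returns the dimensions of every
// intermediate output, starting from the pipeline input.
std::vector<Dims> propagateDims(const Dims& input,
                                const std::vector<OutputDimsFn>& chain) {
    assert(input.width > 0 && input.height > 0 && input.channels > 0);
    std::vector<Dims> outputs;
    Dims current = input;
    for (const auto& rule : chain) {
        current = rule(current);  // each task sees its upstream output
        outputs.push_back(current);
    }
    return outputs;
}

// Example rules: a 2x downscale and a conversion to a single channel.
inline Dims halve(const Dims& in) { return {in.width / 2, in.height / 2, in.channels}; }
inline Dims toGray(const Dims& in) { return {in.width, in.height, 1}; }
```

Calling `propagateDims` with an input of 1920x1080x3 and the chain `{halve, toGray}` produces one `Dims` entry per task, so buffer sizes for every intermediate output are known before any kernel is launched.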
3) Stream assignment and synchronization
DAGScheduler attempts to spread independent tasks across streams. For cross-stream dependency edges, it records CUDA events on producer streams and waits on consumer streams.
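
The following sketch models the idea in plain C++ so it runs without a GPU. It assumes a round-robin assignment policy and task ids numbered 0..n-1; neither is confirmed to match DAGScheduler, and `assignStreams`, `findSyncPoints`, and `SyncPoint` are hypothetical names. In a CUDA build, each reported sync point would typically correspond to a `cudaEventRecord` on the producer's stream followed by a `cudaStreamWaitEvent` on the consumer's stream.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// A dependency edge: `second` consumes the output of `first`.
using Edge = std::pair<int, int>;

struct SyncPoint {
    int producer;  // task whose stream records the event
    int consumer;  // task whose stream waits on the event
};

// Assumed policy: tasks arrive in topological order, carry ids 0..n-1,
// and are assigned round-robin to `numStreams` streams.
std::vector<int> assignStreams(const std::vector<int>& topoOrder, int numStreams) {
    assert(numStreams > 0);
    std::vector<int> streamOf(topoOrder.size(), 0);
    int next = 0;
    for (int task : topoOrder) {
        assert(task >= 0 && static_cast<std::size_t>(task) < streamOf.size());
        streamOf[task] = next;
        next = (next + 1) % numStreams;
    }
    return streamOf;
}

// Work queued on one stream already runs in order, so only edges that
// cross streams need an event record/wait pair.
std::vector<SyncPoint> findSyncPoints(const std::vector<Edge>& edges,
                                      const std::vector<int>& streamOf) {
    std::vector<SyncPoint> syncs;
    for (const auto& e : edges) {
        if (streamOf[e.first] != streamOf[e.second]) {
            syncs.push_back({e.first, e.second});
        }
    }
    return syncs;
}
```

For a diamond graph (0 feeds 1 and 2, both feed 3) on two streams, this policy places tasks 0 and 2 on one stream and tasks 1 and 3 on the other, so only the edges 0 to 1 and 2 to 3 require events.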
4) Failure propagation
If one task fails, its dependent descendants are marked FAILED. Independent branches are not force-killed and continue to run, so partial progress stays visible in complex DAGs.
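
A minimal sketch of this propagation rule is a breadth-first walk over the failed task's descendants. Only the FAILED state comes from the text above; `TaskState`, its other values, and `propagateFailure` are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <queue>
#include <vector>

enum class TaskState { PENDING, COMPLETED, FAILED };

// adjacency[i] lists the tasks that directly depend on task i.
// Marks `failedTask` and every transitive dependent as FAILED and
// leaves all other tasks untouched.
void propagateFailure(int failedTask,
                      const std::vector<std::vector<int>>& adjacency,
                      std::vector<TaskState>& states) {
    assert(failedTask >= 0 && failedTask < static_cast<int>(adjacency.size()));
    assert(states.size() == adjacency.size());
    std::queue<int> frontier;
    states[failedTask] = TaskState::FAILED;
    frontier.push(failedTask);
    while (!frontier.empty()) {
        int task = frontier.front();
        frontier.pop();
        for (int child : adjacency[task]) {
            if (states[child] != TaskState::FAILED) {
                states[child] = TaskState::FAILED;  // visit each descendant once
                frontier.push(child);
            }
        }
    }
}
```

With two chains, 0 to 1 to 2 and 3 to 4, a failure in task 1 marks tasks 1 and 2 as FAILED while tasks 0, 3, and 4 keep their state.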
Memory architecture
Mini-ImagePipe uses a pooled model for both host and device memory:
- Pinned host buffers for async transfer paths.
- Device pool reuse for intermediate outputs.
- Per-task workspace bundles to avoid transient allocations in hot paths.
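
The reuse idea behind the pooled model can be sketched with a size-bucketed free list. This is an assumption-laden illustration rather than MemoryManager's implementation: `BufferPool` is a hypothetical class, it matches sizes exactly, and `std::malloc`/`std::free` stand in for the device or pinned-host allocator so the example runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// Size-bucketed free-list pool (hypothetical; host memory stands in for
// device or pinned allocations).
class BufferPool {
public:
    ~BufferPool() {
        // Only blocks that were returned to the pool are freed here.
        for (auto& bucket : free_) {
            for (void* p : bucket.second) std::free(p);
        }
    }

    // Returns a cached block of exactly `bytes` if one is free,
    // otherwise falls back to a fresh allocation.
    void* acquire(std::size_t bytes) {
        assert(bytes > 0);
        auto it = free_.find(bytes);
        if (it != free_.end() && !it->second.empty()) {
            void* p = it->second.back();
            it->second.pop_back();
            ++hits_;
            return p;
        }
        ++misses_;
        return std::malloc(bytes);
    }

    // Returns a block to the pool instead of freeing it.
    void release(void* p, std::size_t bytes) { free_[bytes].push_back(p); }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    std::map<std::size_t, std::vector<void*>> free_;
    std::size_t hits_ = 0;
    std::size_t misses_ = 0;
};
```

Acquiring 1024 bytes, releasing the block, and acquiring 1024 bytes again returns the same pointer and counts as a pool hit, which is the behavior that keeps transient allocations out of hot paths.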
Design constraints and known behavior
- The runtime currently models each task as producing one primary output buffer.
- In fork-join topologies, how a join task handles multiple inputs depends on what the operator implementation supports.
- Deterministic ordering is preserved by explicit topological sorting plus dependency synchronization.
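
One way to make a topological order reproducible is Kahn's algorithm with a fixed tie-break among ready nodes. The sketch below picks the lowest node id first; whether TaskGraph uses this particular tie-break is an assumption, and `deterministicTopoOrder` is a hypothetical name.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Returns a topological order of nodes 0..n-1, or an empty vector if the
// graph contains a cycle. Ties between ready nodes are broken by lowest id,
// so the result does not depend on the order in which edges were added.
std::vector<int> deterministicTopoOrder(int n,
                                        const std::vector<std::pair<int, int>>& edges) {
    assert(n >= 0);
    std::vector<std::vector<int>> children(n);
    std::vector<int> inDegree(n, 0);
    for (const auto& e : edges) {
        children[e.first].push_back(e.second);
        ++inDegree[e.second];
    }
    // Min-heap of ready nodes: the smallest id is always scheduled first.
    std::priority_queue<int, std::vector<int>, std::greater<int>> ready;
    for (int v = 0; v < n; ++v) {
        if (inDegree[v] == 0) ready.push(v);
    }
    std::vector<int> order;
    while (!ready.empty()) {
        int v = ready.top();
        ready.pop();
        order.push_back(v);
        for (int c : children[v]) {
            if (--inDegree[c] == 0) ready.push(c);
        }
    }
    if (static_cast<int>(order.size()) != n) order.clear();  // cycle detected
    return order;
}
```

Because a cycle leaves some nodes with a nonzero in-degree, the same pass doubles as a validity check: an empty result for a non-empty graph means the input is not a DAG.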
Why this architecture stands out
Compared to ad-hoc CUDA pipelines, the value proposition is not only throughput. It is structured control over execution semantics:
- Dependency-aware scheduling instead of manually ordered kernel chains.
- Explicit memory lifecycle and reuse policy.
- Testable graph semantics via property-based tests.
- Documentation that links implementation choices to references and reproducible metrics.
Further reading
- DAG Scheduling — ordering and synchronization details
- Memory Management — allocator behavior and workspace model
- CUDA Optimization — operator-level optimization patterns
- Performance Analysis — benchmark interpretation and trade-offs