Architecture Overview

Mini-ImagePipe is organized as a DAG-native runtime stack that prioritizes deterministic execution, explicit control over memory behavior, and visibility into GPU utilization.

Layered runtime model

The architecture can be read in four layers:

  1. Pipeline builds and validates execution plans.
  2. TaskGraph represents dependency topology and computes topological order.
  3. DAGScheduler maps ready tasks to CUDA streams with event-based synchronization.
  4. MemoryManager supplies pooled pinned/device allocations and per-task workspace.

Core runtime responsibilities

| Component | Primary responsibility | Key implementation hooks |
| --- | --- | --- |
| Pipeline | Graph assembly, input wiring, buffer lifecycle | addOperator(), connect(), setInput(), execute() |
| TaskGraph | DAG validation and order computation | validate(), getTopologicalOrder(), areIndependent() |
| DAGScheduler | Stream assignment, sync insertion, failure propagation | execute(), setErrorCallback(), internal insertSynchronization() |
| MemoryManager | Pinned/device pooling, async allocation mode, workspace management | allocateDevice(), allocatePinned(), allocateWorkspace() |

Execution semantics

1) Graph validation and ordering

Before launch, Pipeline::execute() verifies graph validity and computes a topological order. If validation fails, execution exits with cudaErrorInvalidValue.

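cpp
// Excerpt of the pre-launch check: reject an invalid graph before any work is launched.
TaskGraph& graph = pipeline.getTaskGraph();
if (!graph.validate()) {
    return cudaErrorInvalidValue;
}
// Dependency-respecting order in which tasks are handed to the scheduler.
auto order = graph.getTopologicalOrder();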
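
To make the validate-then-order step concrete, here is a minimal sketch using Kahn's algorithm. It is an illustration under assumptions, not the TaskGraph implementation: the free function `topologicalOrder`, the integer task ids, and the edge-list representation are all hypothetical, and the source does not say which algorithm TaskGraph uses internally.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for TaskGraph: tasks are 0..taskCount-1 and each edge
// is a (producer, consumer) pair. Returns a topological order, or std::nullopt
// when the graph contains a cycle and is therefore not a valid DAG.
std::optional<std::vector<int>> topologicalOrder(
    int taskCount, const std::vector<std::pair<int, int>>& edges) {
    assert(taskCount >= 0);
    std::vector<std::vector<int>> consumers(taskCount);
    std::vector<int> pendingInputs(taskCount, 0);
    for (const auto& edge : edges) {
        consumers[edge.first].push_back(edge.second);
        ++pendingInputs[edge.second];
    }

    // Min-heap: ties between ready tasks resolve by lowest id, so the same
    // graph always yields the same order.
    std::priority_queue<int, std::vector<int>, std::greater<int>> ready;
    for (int t = 0; t < taskCount; ++t) {
        if (pendingInputs[t] == 0) ready.push(t);
    }

    std::vector<int> order;
    while (!ready.empty()) {
        const int task = ready.top();
        ready.pop();
        order.push_back(task);
        for (int next : consumers[task]) {
            if (--pendingInputs[next] == 0) ready.push(next);
        }
    }

    // Tasks that never became ready sit on a cycle.
    if (order.size() != static_cast<std::size_t>(taskCount)) return std::nullopt;
    return order;
}
```

A caller would treat `std::nullopt` the way the excerpt above treats a failed `validate()`: return an error before launching anything.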

2) Input and output contract

Each task consumes one or more ImageBuffer inputs and produces one ImageBuffer output. Task-local dimensions are derived from upstream outputs and operator output-dimension logic.
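
One way to picture dimension derivation is a single pass over the tasks in topological order, where each task applies its own output-dimension rule to the dimensions of its upstream outputs. The sketch below assumes hypothetical `Dims` and `TaskSpec` types and two example rules; none of these names come from the Mini-ImagePipe API, and real ImageBuffer metadata likely carries more than width and height.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

struct Dims {
    int width;
    int height;
};

// Hypothetical task record: which tasks feed it and how it maps input
// dimensions to output dimensions.
struct TaskSpec {
    std::vector<int> inputs;  // upstream task indices; empty means "reads the pipeline input"
    std::function<Dims(const std::vector<Dims>&)> outputDims;
};

// Resolves every task's output dimensions. Tasks are assumed to be stored in
// topological order, so each upstream index is smaller than the task's own.
std::vector<Dims> resolveDims(const std::vector<TaskSpec>& tasks, Dims pipelineInput) {
    std::vector<Dims> resolved;
    resolved.reserve(tasks.size());
    for (std::size_t t = 0; t < tasks.size(); ++t) {
        std::vector<Dims> in;
        if (tasks[t].inputs.empty()) in.push_back(pipelineInput);
        for (int src : tasks[t].inputs) {
            // An upstream task must already have been resolved.
            assert(src >= 0 && static_cast<std::size_t>(src) < t);
            in.push_back(resolved[src]);
        }
        resolved.push_back(tasks[t].outputDims(in));
    }
    return resolved;
}

// Two example output-dimension rules.
Dims sameAsFirstInput(const std::vector<Dims>& in) { return in.front(); }
Dims halveFirstInput(const std::vector<Dims>& in) {
    return Dims{in.front().width / 2, in.front().height / 2};
}
```

Each task still yields exactly one `Dims` value, mirroring the one-output-per-task contract described above.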

3) Stream assignment and synchronization

DAGScheduler attempts to spread independent tasks across streams. For each dependency edge that crosses streams, it records a CUDA event on the producer's stream and makes the consumer's stream wait on that event before the consumer runs.
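
The planning side of this can be sketched without touching the GPU. The code below assumes a simple heuristic that is not taken from DAGScheduler: a task continues on the stream of a producer that no other consumer has claimed, and otherwise takes the next stream round-robin. Every edge whose endpoints land on different streams is reported as needing an event. `SyncEdge`, `StreamPlan`, and `planStreams` are hypothetical names; in a real runtime each reported edge would correspond to a `cudaEventRecord` on the producer's stream followed by a `cudaStreamWaitEvent` on the consumer's stream.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical plan types; not part of the Mini-ImagePipe API.
struct SyncEdge {
    int producer;  // its stream would record an event (cudaEventRecord)
    int consumer;  // its stream would wait on that event (cudaStreamWaitEvent)
};

struct StreamPlan {
    std::vector<int> streamOf;    // stream index chosen for each task
    std::vector<SyncEdge> syncs;  // cross-stream edges that need an event
};

// `order` must be a topological order over tasks 0..order.size()-1 and
// `edges` holds (producer, consumer) pairs.
StreamPlan planStreams(const std::vector<int>& order,
                       const std::vector<std::pair<int, int>>& edges,
                       int streamCount) {
    assert(streamCount > 0);
    const std::size_t n = order.size();
    std::vector<std::vector<int>> producers(n);
    for (const auto& edge : edges) producers[edge.second].push_back(edge.first);

    StreamPlan plan;
    plan.streamOf.assign(n, -1);
    std::vector<bool> continued(n, false);  // has a consumer claimed this task's stream?
    int nextStream = 0;

    for (int task : order) {
        int chosen = -1;
        // Prefer a producer's stream: work submitted to one stream runs in
        // submission order, so that edge needs no event.
        for (int p : producers[task]) {
            if (!continued[p]) {
                chosen = plan.streamOf[p];
                continued[p] = true;
                break;
            }
        }
        if (chosen < 0) {  // no free producer stream: pick the next one round-robin
            chosen = nextStream;
            nextStream = (nextStream + 1) % streamCount;
        }
        plan.streamOf[task] = chosen;
        for (int p : producers[task]) {
            if (plan.streamOf[p] != chosen) plan.syncs.push_back(SyncEdge{p, task});
        }
    }
    return plan;
}
```

For a diamond graph (0 feeds 1 and 2, both feed 3) with two streams, this heuristic keeps 0, 1, and 3 on one stream, places 2 on the other, and reports the edges 0→2 and 2→3 as the ones that need events.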

4) Failure propagation

If a task fails, all of its dependent descendants are marked FAILED. Independent branches are not cancelled and continue to run, so partial progress remains visible in complex DAGs.
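
The propagation rule amounts to a graph traversal from the failed task over its outgoing edges. The following is a minimal sketch under assumptions: the `TaskState` enum (only FAILED is named in the text above), the adjacency-list layout, and the `propagateFailure` function are hypothetical and only illustrate the rule.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical task states; only FAILED is named in the text above.
enum class TaskState { PENDING, COMPLETED, FAILED };

// Marks `failedTask` and every transitive descendant as FAILED.
// `consumers[t]` lists the tasks that depend directly on task t.
// Tasks that are not reachable from `failedTask` keep their state.
void propagateFailure(int failedTask,
                      const std::vector<std::vector<int>>& consumers,
                      std::vector<TaskState>& states) {
    assert(failedTask >= 0 &&
           static_cast<std::size_t>(failedTask) < states.size());
    assert(consumers.size() == states.size());

    std::vector<int> stack{failedTask};
    states[failedTask] = TaskState::FAILED;
    while (!stack.empty()) {
        const int task = stack.back();
        stack.pop_back();
        for (int next : consumers[task]) {
            // Skip descendants already marked, so shared descendants are visited once.
            if (states[next] != TaskState::FAILED) {
                states[next] = TaskState::FAILED;
                stack.push_back(next);
            }
        }
    }
}
```

Because the traversal only follows edges out of the failed task, sibling branches and already completed ancestors are left untouched.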

Memory architecture

Mini-ImagePipe uses a pooled model for both host and device memory:

  • Pinned host buffers for async transfer paths.
  • Device pool reuse for intermediate outputs.
  • Per-task workspace bundles to avoid transient allocations in hot paths.
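
The reuse idea behind the pools can be sketched with a size-keyed free list. This is an illustration only: `BlockPool` is a hypothetical class, and `std::malloc`/`std::free` stand in for the real allocators (a device pool would presumably wrap `cudaMalloc` and a pinned pool `cudaMallocHost`). The actual MemoryManager bucketing and eviction policy are not described in this overview.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// Hypothetical size-keyed pool. Released blocks are cached and handed back
// to later requests of the same size instead of being freed.
class BlockPool {
public:
    BlockPool() = default;
    BlockPool(const BlockPool&) = delete;
    BlockPool& operator=(const BlockPool&) = delete;

    ~BlockPool() {
        // Only cached blocks are freed here; blocks still held by callers are not tracked.
        for (auto& entry : free_) {
            for (void* block : entry.second) std::free(block);
        }
    }

    void* acquire(std::size_t bytes) {
        assert(bytes > 0);
        auto it = free_.find(bytes);
        if (it != free_.end() && !it->second.empty()) {
            void* block = it->second.back();  // reuse a cached block of this size
            it->second.pop_back();
            ++reuseCount_;
            return block;
        }
        return std::malloc(bytes);  // stand-in for the real device or pinned allocator
    }

    // `bytes` must match the size passed to acquire() for this block.
    void release(void* block, std::size_t bytes) { free_[bytes].push_back(block); }

    std::size_t reuseCount() const { return reuseCount_; }

private:
    std::map<std::size_t, std::vector<void*>> free_;
    std::size_t reuseCount_ = 0;
};
```

Intermediate outputs in an image pipeline often share a small set of sizes, which is why exact-size reuse is a plausible first approximation; a production pool would typically round sizes into buckets and cap the cached total.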

Design constraints and known behavior

  • The runtime currently models each task as producing a single primary output buffer.
  • In fork-join topologies, how a join task combines multiple inputs depends on what the operator implementation supports.
  • Deterministic ordering is preserved by explicit topological sorting plus dependency synchronization.

Why this architecture stands out

Compared to ad-hoc CUDA pipelines, the value proposition is not only throughput. It is structured control over execution semantics:

  • Dependency-aware scheduling instead of manually ordered kernel chains.
  • Explicit memory lifecycle and reuse policy.
  • Testable graph semantics via property-based tests.
  • Documentation that links implementation choices to references and reproducible metrics.

Further reading

Released under the MIT License.