Architecture Overview

Mini-ImagePipe is organized as a DAG-native runtime stack that prioritizes deterministic execution, explicit control over memory behavior, and transparent GPU utilization.

Layered runtime model

The architecture can be read in four layers:

Pipeline builds and validates execution plans.
TaskGraph represents dependency topology and computes topological order.
DAGScheduler maps ready tasks to CUDA streams with event-based synchronization.
MemoryManager supplies pooled pinned/device allocations and per-task workspace.

Core runtime responsibilities

Component	Primary responsibility	Key implementation hooks
`Pipeline`	Graph assembly, input wiring, buffer lifecycle	`addOperator()`, `connect()`, `setInput()`, `execute()`
`TaskGraph`	DAG validation and order computation	`validate()`, `getTopologicalOrder()`, `areIndependent()`
`DAGScheduler`	Stream assignment, sync insertion, failure propagation	`execute()`, `setErrorCallback()`, internal `insertSynchronization()`
`MemoryManager`	Pinned/device pooling, async allocation mode, workspace management	`allocateDevice()`, `allocatePinned()`, `allocateWorkspace()`

Execution semantics

1) Graph validation and ordering

Before launch, Pipeline::execute() verifies graph validity and computes a topological order. If validation fails, execute() returns cudaErrorInvalidValue before any task is launched.

cpp

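// Validation must pass before any task is launched.
TaskGraph& graph = pipeline.getTaskGraph();
if (!graph.validate()) {
    // An invalid graph is reported as a CUDA error code (cudaError_t).
    return cudaErrorInvalidValue;
}
// Dependency-respecting order used for the rest of execution.
auto order = graph.getTopologicalOrder();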
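
To make the ordering step concrete, here is a minimal sketch of Kahn's algorithm, one common way to compute a topological order and detect cycles in the same pass. It is not taken from the TaskGraph implementation: the `Adjacency` alias, the `topologicalOrder` function, and the lowest-id tie-breaking rule are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// consumersOf[u] lists the tasks that depend on task u.
// Hypothetical layout, not the project's TaskGraph.
using Adjacency = std::vector<std::vector<std::size_t>>;

// Kahn's algorithm. Fills `order` and returns true for a valid DAG;
// returns false if a cycle prevents a complete ordering.
bool topologicalOrder(const Adjacency& consumersOf,
                      std::vector<std::size_t>& order) {
    const std::size_t n = consumersOf.size();
    std::vector<std::size_t> pending(n, 0);  // unmet dependencies per task
    for (const auto& consumers : consumersOf) {
        for (std::size_t v : consumers) {
            assert(v < n && "edge points at an unknown task");
            ++pending[v];
        }
    }

    // A min-heap makes tie-breaking deterministic: lowest task id first.
    std::priority_queue<std::size_t, std::vector<std::size_t>,
                        std::greater<std::size_t>> ready;
    for (std::size_t t = 0; t < n; ++t) {
        if (pending[t] == 0) ready.push(t);
    }

    order.clear();
    while (!ready.empty()) {
        const std::size_t u = ready.top();
        ready.pop();
        order.push_back(u);
        for (std::size_t v : consumersOf[u]) {
            if (--pending[v] == 0) ready.push(v);
        }
    }
    return order.size() == n;  // unordered leftovers indicate a cycle
}
```

In this sketch a failed ordering plays the role that a failed `validate()` plays in the excerpt above: the caller would stop and report an error instead of launching work.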

2) Input and output contract

Each task consumes one or more ImageBuffer inputs and produces one ImageBuffer output. A task's output dimensions are derived from the dimensions of its upstream outputs, using the operator's output-dimension logic.
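
The contract above can be illustrated with a small dimension-propagation sketch. The `Dims` and `TaskSpec` types and the `propagateDims` function are hypothetical stand-ins (the real ImageBuffer and operator interfaces are not shown in this document), and the sketch assumes tasks are already listed in topological order.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

struct Dims {
    int width;
    int height;
    int channels;
};

// Hypothetical task description: which upstream tasks feed it, and how the
// operator maps input dimensions to its single output's dimensions.
struct TaskSpec {
    std::vector<std::size_t> inputs;  // empty means "reads the pipeline input"
    std::function<Dims(const std::vector<Dims>&)> outputDims;
};

// Walks tasks in topological order and derives one output size per task.
std::vector<Dims> propagateDims(const std::vector<TaskSpec>& tasks,
                                Dims pipelineInput) {
    std::vector<Dims> outputs;
    outputs.reserve(tasks.size());
    for (std::size_t i = 0; i < tasks.size(); ++i) {
        std::vector<Dims> in;
        if (tasks[i].inputs.empty()) in.push_back(pipelineInput);
        for (std::size_t src : tasks[i].inputs) {
            assert(src < i && "inputs must come from earlier tasks");
            in.push_back(outputs[src]);
        }
        outputs.push_back(tasks[i].outputDims(in));
    }
    return outputs;
}
```

For example, a task whose `outputDims` halves the width and height of its first input would turn a 640x480 upstream output into a 320x240 buffer, and every task downstream of it would size its own output from that result.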

3) Stream assignment and synchronization

DAGScheduler attempts to spread independent tasks across streams. For each dependency edge that crosses streams, it records a CUDA event on the producer's stream and makes the consumer's stream wait on that event.
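
The planning side of this step can be sketched without touching the GPU. The example below assigns streams round-robin over a topological order and collects the edges that would need an event. The round-robin policy and the `StreamPlan`, `SyncEdge`, and `planStreams` names are assumptions for illustration only; the actual DAGScheduler policy may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// consumersOf[u] lists the tasks that depend on task u (hypothetical layout).
using Adjacency = std::vector<std::vector<std::size_t>>;

struct SyncEdge {
    std::size_t producer;
    std::size_t consumer;
};

struct StreamPlan {
    std::vector<int> streamOf;        // stream index chosen for each task
    std::vector<SyncEdge> syncEdges;  // dependency edges that cross streams
};

StreamPlan planStreams(const Adjacency& consumersOf,
                       const std::vector<std::size_t>& order,
                       int streamCount) {
    assert(streamCount > 0);
    assert(order.size() == consumersOf.size());

    StreamPlan plan;
    plan.streamOf.assign(consumersOf.size(), 0);
    // Naive round-robin over the topological order (an assumed policy).
    const std::size_t streams = static_cast<std::size_t>(streamCount);
    for (std::size_t i = 0; i < order.size(); ++i) {
        plan.streamOf[order[i]] = static_cast<int>(i % streams);
    }

    // Same-stream edges are ordered by the stream itself. Cross-stream edges
    // need an event: cudaEventRecord on the producer's stream, then
    // cudaStreamWaitEvent on the consumer's stream.
    for (std::size_t u = 0; u < consumersOf.size(); ++u) {
        for (std::size_t v : consumersOf[u]) {
            if (plan.streamOf[u] != plan.streamOf[v]) {
                plan.syncEdges.push_back({u, v});
            }
        }
    }
    return plan;
}
```

Separating the plan from the CUDA calls keeps the synchronization decisions inspectable on a machine without a GPU.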

4) Failure propagation

If one task fails, its dependent descendants are marked FAILED. Independent branches are not force-killed, so partial progress stays visible in complex DAGs.
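
A breadth-first walk over consumer edges is one straightforward way to get this behavior. The sketch below is illustrative only: the `TaskState` enum and the `propagateFailure` function are hypothetical names, and the scheduler's real state machine may track more states.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// consumersOf[u] lists the tasks that depend on task u (hypothetical layout).
using Adjacency = std::vector<std::vector<std::size_t>>;

enum class TaskState { PENDING, COMPLETED, FAILED };

// Marks the failed task and everything reachable from it as FAILED.
// Tasks that are not descendants keep whatever state they already had.
void propagateFailure(const Adjacency& consumersOf,
                      std::size_t failedTask,
                      std::vector<TaskState>& state) {
    assert(state.size() == consumersOf.size());
    assert(failedTask < state.size());

    state[failedTask] = TaskState::FAILED;
    std::queue<std::size_t> frontier;
    frontier.push(failedTask);
    while (!frontier.empty()) {
        const std::size_t u = frontier.front();
        frontier.pop();
        for (std::size_t v : consumersOf[u]) {
            if (state[v] != TaskState::FAILED) {  // visit each task once
                state[v] = TaskState::FAILED;
                frontier.push(v);
            }
        }
    }
}
```

With edges 0→1, 0→2, and 1→3, a failure in task 1 marks task 3 as FAILED while task 2 stays untouched and can still run.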

Memory architecture

Mini-ImagePipe uses a pooled model for both host and device memory:

Pinned host buffers for async transfer paths.
Device pool reuse for intermediate outputs.
Per-task workspace bundles to avoid transient allocations in hot paths.
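
The reuse idea behind the list above can be sketched as a size-bucketed free list. This is not the MemoryManager implementation: `BufferPool` and its methods are hypothetical, the power-of-two bucketing is an assumed policy, and `std::malloc`/`std::free` stand in for the device or pinned allocator (for example `cudaMalloc` or `cudaMallocHost`) so the example runs on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_map>
#include <vector>

class BufferPool {
public:
    BufferPool() = default;
    BufferPool(const BufferPool&) = delete;
    BufferPool& operator=(const BufferPool&) = delete;

    ~BufferPool() {
        for (auto& bucket : free_) {
            for (void* block : bucket.second) std::free(block);
        }
        for (auto& entry : live_) std::free(entry.first);
    }

    // Returns a block of at least `bytes`, reusing a released one if possible.
    void* acquire(std::size_t bytes) {
        if (bytes == 0 || bytes > kMaxBlockBytes) return nullptr;
        const std::size_t capacity = roundUpToPowerOfTwo(bytes);
        std::vector<void*>& bucket = free_[capacity];
        void* block = nullptr;
        if (!bucket.empty()) {
            block = bucket.back();  // hot path: no backend allocation
            bucket.pop_back();
        } else {
            block = std::malloc(capacity);  // stand-in for the real allocator
            if (block == nullptr) return nullptr;
            ++backendAllocations_;
        }
        live_[block] = capacity;
        return block;
    }

    // Returns a block to its bucket instead of freeing it.
    void release(void* block) {
        auto it = live_.find(block);
        assert(it != live_.end() && "block was not handed out by this pool");
        free_[it->second].push_back(block);
        live_.erase(it);
    }

    std::size_t backendAllocations() const { return backendAllocations_; }

private:
    static constexpr std::size_t kMaxBlockBytes = std::size_t{1} << 30;

    static std::size_t roundUpToPowerOfTwo(std::size_t bytes) {
        std::size_t capacity = 1;
        while (capacity < bytes) capacity <<= 1;
        return capacity;
    }

    std::unordered_map<std::size_t, std::vector<void*>> free_;  // by capacity
    std::unordered_map<void*, std::size_t> live_;  // blocks currently in use
    std::size_t backendAllocations_ = 0;
};
```

Acquiring 1000 bytes, releasing the block, and then acquiring 900 bytes returns the same block, because both requests round up to the 1024-byte bucket. The trade-off of this bucketing is some internal fragmentation in exchange for avoiding allocator calls on the hot path.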

Design constraints and known behavior

The runtime currently models each task with a primary output buffer.
In fork-join topologies, how a multi-input operator combines its inputs depends on what that operator's implementation supports.
Deterministic ordering is preserved by explicit topological sorting plus dependency synchronization.

Why this architecture stands out

Compared to ad-hoc CUDA pipelines, the value proposition is not only throughput. It is structured control over execution semantics:

Dependency-aware scheduling instead of manually ordered kernel chains.
Explicit memory lifecycle and reuse policy.
Testable graph semantics via property-based tests.
Documentation that links implementation choices to references and reproducible metrics.
