throughput-engine-v2 Design
Summary
This phase upgrades the runtime from a correctness-first execution model to a throughput-oriented engine. The implementation remains additive: the legacy execution path stays available, while new configuration flags and runtime branches enable async allocation, graph replay, and fixed-shape batch execution when the workload qualifies.
Design Decisions
1. Async allocator is capability-driven
MemoryManager gains an explicit device allocator mode instead of silently changing behavior.
Callers can request legacy pooled allocation or stream-ordered async allocation. If async
allocation is requested but not supported by the runtime, the manager falls back to the legacy
allocator while still exposing the effective mode for diagnostics and tests.
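The request/fallback behavior above can be sketched as follows. This is a minimal illustration, not the project's actual interface: AllocatorMode, MemoryManagerSketch, and the asyncSupported flag are hypothetical names, and the flag stands in for whatever capability query the runtime really provides.

```cpp
#include <string>

// Hypothetical enum: the two allocation modes named in the design text.
enum class AllocatorMode { LegacyPooled, StreamOrderedAsync };

// Hypothetical stand-in for MemoryManager. It only models mode selection;
// no device memory is allocated here.
class MemoryManagerSketch {
public:
    // asyncSupported is an assumption: it represents the result of a
    // runtime capability check performed by the caller or the manager.
    MemoryManagerSketch(AllocatorMode requested, bool asyncSupported)
        : requested_(requested),
          effective_(requested == AllocatorMode::StreamOrderedAsync && !asyncSupported
                         ? AllocatorMode::LegacyPooled   // fall back, do not fail
                         : requested) {}

    AllocatorMode requestedMode() const { return requested_; }
    // The mode actually in use, exposed for diagnostics and tests.
    AllocatorMode effectiveMode() const { return effective_; }
    bool fellBack() const { return requested_ != effective_; }

private:
    AllocatorMode requested_;
    AllocatorMode effective_;
};

// Human-readable mode name for log output.
inline std::string toString(AllocatorMode mode) {
    switch (mode) {
        case AllocatorMode::LegacyPooled:       return "legacy-pooled";
        case AllocatorMode::StreamOrderedAsync: return "stream-ordered-async";
    }
    return "unknown";
}
```

Keeping the requested and effective modes separate lets a test assert on the fallback explicitly instead of inferring it from allocation behavior.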
2. Graph replay is opt-in and workload-scoped
DAGScheduler gains a graph-execution mode that is only used for stable workloads. Graph replay is
valid when:
- the topology is unchanged,
- the task count and dependency shape are unchanged,
- input/output dimensions remain stable,
- operators do not require recapture.
When any of those conditions is not met, the scheduler falls back to direct task submission and marks the graph state dirty, so that the next eligible execution recaptures the graph rather than replaying a stale one.
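One way to picture the eligibility check is a small state machine that compares the current workload against the one that was captured. Everything in this sketch is hypothetical (WorkloadSignature, GraphStateSketch, the hash fields, and the Path enum), and it assumes that the first eligible run captures the graph and that a dirty state triggers a recapture on the next eligible run.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical summary of a workload; the fields mirror the four conditions.
struct WorkloadSignature {
    std::uint64_t topologyHash;       // identifies the DAG topology
    std::size_t taskCount;            // number of tasks
    std::uint64_t dependencyHash;     // identifies the dependency shape
    std::vector<std::size_t> ioDims;  // input/output dimensions
    bool operatorsNeedRecapture;      // any operator demands a recapture
};

enum class Path { DirectSubmit, Capture, Replay };

// Hypothetical model of the scheduler's graph state.
class GraphStateSketch {
public:
    Path choose(const WorkloadSignature& sig) {
        if (sig.operatorsNeedRecapture) {
            dirty_ = true;                 // not eligible for graph execution
            return Path::DirectSubmit;
        }
        if (!hasCapture_ || dirty_) {
            captured_ = sig;               // assumption: capture on first eligible run
            hasCapture_ = true;
            dirty_ = false;
            return Path::Capture;
        }
        if (sameShape(captured_, sig)) {
            return Path::Replay;           // stable workload: reuse the graph
        }
        dirty_ = true;                     // workload changed: fall back once
        return Path::DirectSubmit;
    }

    bool isDirty() const { return dirty_; }

private:
    static bool sameShape(const WorkloadSignature& a, const WorkloadSignature& b) {
        return a.topologyHash == b.topologyHash && a.taskCount == b.taskCount &&
               a.dependencyHash == b.dependencyHash && a.ioDims == b.ioDims;
    }

    WorkloadSignature captured_{};
    bool hasCapture_ = false;
    bool dirty_ = false;
};
```

In this sketch a changed workload costs one directly submitted execution before the graph is recaptured. Whether the real scheduler recaptures immediately or waits for the workload to stabilize is not stated in this document.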
3. Batch execution uses runtime metadata, not repeated execute()
Pipeline::executeBatch() is reworked to construct a single batch execution context instead of
calling execute() once per frame. ImageBuffer
batch metadata (batchSize, batchStride) becomes authoritative for fixed-shape batch execution.
Operators may still implement batch handling internally in a simple loop, but the pipeline/runtime
must invoke each node once per batch execution rather than once per frame.
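The once-per-batch contract can be illustrated with a minimal sketch. ImageBufferSketch, BatchContext, NodeSketch, and runBatch are hypothetical stand-ins, and the sketch assumes that batchStride is measured in elements and that frames are tightly packed, so one frame occupies exactly batchStride elements.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for ImageBuffer with its batch metadata.
struct ImageBufferSketch {
    std::vector<float> data;
    std::size_t batchSize;    // number of frames in the batch
    std::size_t batchStride;  // assumption: elements between consecutive frames
};

// Hypothetical batch execution context, built once per executeBatch() call.
struct BatchContext {
    ImageBufferSketch* buffer;
};

// Hypothetical node that scales every element by a gain factor.
class NodeSketch {
public:
    explicit NodeSketch(float gain) : gain_(gain) {}

    // Invoked once per batch; the per-frame loop lives inside the operator.
    void executeBatch(BatchContext& ctx) {
        ++invocations_;
        ImageBufferSketch& buf = *ctx.buffer;
        for (std::size_t frame = 0; frame < buf.batchSize; ++frame) {
            float* framePtr = buf.data.data() + frame * buf.batchStride;
            for (std::size_t i = 0; i < buf.batchStride; ++i) {
                framePtr[i] *= gain_;
            }
        }
    }

    int invocations() const { return invocations_; }

private:
    float gain_;
    int invocations_ = 0;
};

// The pipeline builds one context and calls each node exactly once.
inline void runBatch(std::vector<NodeSketch>& nodes, ImageBufferSketch& buffer) {
    BatchContext ctx{&buffer};
    for (NodeSketch& node : nodes) {
        node.executeBatch(ctx);
    }
}
```

The invocation counter exists only to make the contract testable: after a batch of N frames, every node reports one invocation, not N.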
4. Profiling seam becomes concrete
The existing profiling seam is upgraded from a placeholder to real execution markers around:
- scheduler execution,
- graph capture/replay transitions,
- per-task execution boundaries.
This phase keeps the instrumentation lightweight and build-optional.
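A common way to keep such markers lightweight and build-optional is an RAII scope object behind a macro that compiles to nothing when profiling is disabled. The sketch below assumes a hypothetical ENGINE_ENABLE_PROFILING switch and records markers in an in-memory log; a real build would presumably forward them to the project's profiling backend instead.

```cpp
#include <string>
#include <vector>

// Hypothetical build switch; define it as 0 to compile the markers out.
#ifndef ENGINE_ENABLE_PROFILING
#define ENGINE_ENABLE_PROFILING 1
#endif

// In-memory marker sink, used here only so the sketch is observable.
inline std::vector<std::string>& profileLog() {
    static std::vector<std::string> log;
    return log;
}

// Emits a begin marker on construction and an end marker on destruction.
class ScopedMarker {
public:
    explicit ScopedMarker(const char* name) : name_(name) {
        profileLog().push_back("begin:" + name_);
    }
    ~ScopedMarker() { profileLog().push_back("end:" + name_); }
    ScopedMarker(const ScopedMarker&) = delete;
    ScopedMarker& operator=(const ScopedMarker&) = delete;

private:
    std::string name_;
};

#define ENGINE_PP_CAT2(a, b) a##b
#define ENGINE_PP_CAT(a, b) ENGINE_PP_CAT2(a, b)

#if ENGINE_ENABLE_PROFILING
// A unique variable name per line allows several markers in one function.
#define ENGINE_PROFILE_SCOPE(name) \
    ScopedMarker ENGINE_PP_CAT(engineProfileScope_, __LINE__)(name)
#else
#define ENGINE_PROFILE_SCOPE(name) ((void)0)
#endif

// Hypothetical scheduler loop with markers around the whole run and each task.
inline void runSchedulerSketch(int taskCount) {
    ENGINE_PROFILE_SCOPE("scheduler");
    for (int i = 0; i < taskCount; ++i) {
        ENGINE_PROFILE_SCOPE("task");
        // task body would execute here
    }
}
```

The same macro could wrap graph capture and replay transitions. Because the disabled form expands to an empty expression, call sites need no conditional compilation of their own.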
5. Benchmark/GPU validation stays lightweight
This phase does not introduce a new third-party benchmark framework. Instead, it adds a lightweight benchmark target (or benchmark mode) built with existing project tooling, and it updates CI and documentation so that validation on GPU-capable machines has a clear, explicit path.