throughput-engine-v2 Design

Summary

This phase upgrades the runtime from a correctness-first execution model to a throughput-oriented engine. The implementation remains additive: the legacy execution path stays available, while new configuration flags and runtime branches enable async allocation, graph replay, and fixed-shape batch execution when the workload qualifies.

Design Decisions

1. Async allocator is capability-driven

MemoryManager gains an explicit device allocator mode instead of silently changing behavior. Callers can request legacy pooled allocation or stream-ordered async allocation. If async allocation is requested but not supported by the runtime, the manager falls back to the legacy allocator while still exposing the effective mode for diagnostics and tests.

2. Graph replay is opt-in and workload-scoped

DAGScheduler gains a graph-execution mode that is used only for stable workloads. Graph replay is valid only when all of the following hold:

  • the topology is unchanged,
  • the task count and dependency shape are unchanged,
  • input/output dimensions remain stable,
  • operators do not require recapture.

When those conditions are not met, the scheduler falls back to direct task submission and marks the graph state dirty for the next eligible execution.

3. Batch execution uses runtime metadata, not repeated execute()

Pipeline::executeBatch() is reworked to construct a single batch execution context. ImageBuffer batch metadata (batchSize, batchStride) becomes authoritative for fixed-shape batch execution. Operators may still implement batch handling internally in a simple loop, but the pipeline/runtime must invoke each node once per batch execution rather than once per frame.

4. Profiling seam becomes concrete

The existing profiling seam is upgraded from a placeholder to emit real execution markers around:

  • scheduler execution,
  • graph capture/replay transitions,
  • per-task execution boundaries.

This phase keeps the instrumentation lightweight and build-optional.

5. Benchmark/GPU validation stays lightweight

This phase does not introduce a new third-party benchmark framework. Instead, it adds a lightweight benchmark target or benchmark mode using existing project tooling and updates CI/documentation so GPU-capable validation has a clear, explicit path.