throughput-engine-v2
Why
runtime-foundation-v2 fixed correctness and introduced the operator/runtime contract needed for
future execution upgrades, but Mini-ImagePipe still behaves like a correctness-first pipeline rather
than a throughput-first engine.
Today:
- `executeBatch()` still loops over `execute()` once per frame,
- the scheduler still submits work task-by-task from the CPU,
- device allocation is stream-aware at the API layer but still backed by the legacy `cudaMalloc` pool implementation,
- the profiling seam is compiled in but not instrumented,
- CI can build the project but does not provide a clear path for GPU validation or throughput measurement.
These gaps limit the project’s ability to reduce launch overhead, improve steady-state throughput, and validate performance-oriented changes safely.
What Changes
This change introduces throughput engine v2 with:
- async-ready device allocation controls in `MemoryManager`,
- scheduler-side graph capture/replay for stable workloads,
- a true fixed-shape batch execution path in `Pipeline`,
- lightweight benchmark and GPU-validation scaffolding,
- concrete profiling markers around scheduler execution.
Impact
- Preserves the existing public API while making throughput features opt-in.
- Converts batch execution from an API placeholder into a real runtime path.
- Prepares the codebase for later CV-CUDA, TensorRT, and production GPU validation work.