throughput-engine-v2

Why

runtime-foundation-v2 fixed correctness issues and introduced the operator/runtime contract needed for future execution upgrades, but Mini-ImagePipe still behaves like a correctness-first pipeline rather than a throughput-first engine.

Today:

  • executeBatch() still calls execute() in a loop, once per frame,
  • the scheduler still submits work task by task from the CPU,
  • device allocation is stream-aware at the API layer but still backed by the legacy cudaMalloc pool implementation,
  • the profiling seam is compiled but not instrumented,
  • CI can build the project but does not provide a clear GPU validation or throughput measurement path.

These gaps limit the project’s ability to reduce launch overhead, improve steady-state throughput, and validate performance-oriented changes safely.

What Changes

This change introduces throughput engine v2 with:

  1. async-ready device allocation controls in MemoryManager,
  2. scheduler-side graph capture/replay for stable workloads,
  3. a true fixed-shape batch execution path in Pipeline,
  4. lightweight benchmark and GPU-validation scaffolding,
  5. concrete profiling markers around scheduler execution.

Impact

  • Preserves the existing public API while making throughput features opt-in.
  • Converts batch execution from an API placeholder into a real runtime path.
  • Prepares the codebase for later CV-CUDA, TensorRT, and production GPU validation work.