throughput-engine-v2

Why

runtime-foundation-v2 fixed correctness issues and introduced the operator/runtime contract needed for future execution upgrades, but Mini-ImagePipe still behaves like a correctness-first pipeline rather than a throughput-first engine.

Today:

  • executeBatch() still calls execute() in a loop, once per frame,
  • the scheduler still submits work task by task from the CPU,
  • device allocation is stream-aware at the API layer but still backed by the legacy cudaMalloc pool implementation,
  • the profiling seam is compiled but not instrumented,
  • CI can build the project but does not provide a clear GPU validation or throughput measurement path.

These gaps limit the project’s ability to reduce launch overhead, improve steady-state throughput, and validate performance-oriented changes safely.

What Changes

This change introduces throughput engine v2 with:

  1. async-ready device allocation controls in MemoryManager,
  2. scheduler-side graph capture/replay for stable workloads,
  3. a true fixed-shape batch execution path in Pipeline,
  4. lightweight benchmark and GPU-validation scaffolding,
  5. concrete profiling markers around scheduler execution.

Impact

  • Preserves the existing public API while making throughput features opt-in.
  • Converts batch execution from an API placeholder into a real runtime path.
  • Prepares the codebase for later CV-CUDA, TensorRT, and production GPU validation work.