ADDED Requirements

Requirement: Async-Ready Device Allocation

The memory manager SHALL expose a configurable device allocator mode so that stream-ordered async allocation can be enabled for throughput-oriented workloads without changing pipeline call sites.

Scenarios

Scenario: Effective allocator mode

  • WHEN async allocation is requested but not supported by the current CUDA runtime
  • THEN the memory manager SHALL fall back to a supported allocator mode and SHALL report the effective mode explicitly

Scenario: Stream-ordered allocation

  • WHEN async allocation is enabled and supported
  • THEN device allocations and frees SHALL use stream-ordered allocator APIs
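
The two allocator scenarios above can be sketched as a small mode-resolution step. This is a minimal sketch, not the project's implementation: `AllocatorMode`, `AllocatorStatus`, and `resolveAllocatorMode` are hypothetical names, and runtime support is passed in as a plain flag so the logic can run without a GPU. In a real build that flag would typically come from querying `cudaDevAttrMemoryPoolsSupported` through `cudaDeviceGetAttribute`, and the async mode would route to `cudaMallocAsync` and `cudaFreeAsync`.

```cpp
#include <cassert>
#include <string>

// Hypothetical names for illustration; the spec does not define them.
enum class AllocatorMode { Sync, Async };

struct AllocatorStatus {
    AllocatorMode requested;  // what configuration asked for
    AllocatorMode effective;  // what the memory manager will actually use
    bool fellBack;            // true when effective differs from requested
};

// poolsSupported is an assumption standing in for a runtime capability query
// (for example cudaDeviceGetAttribute with cudaDevAttrMemoryPoolsSupported).
inline AllocatorStatus resolveAllocatorMode(AllocatorMode requested,
                                            bool poolsSupported) {
    AllocatorMode effective = requested;
    if (requested == AllocatorMode::Async && !poolsSupported) {
        effective = AllocatorMode::Sync;  // explicit, reportable fallback
    }
    return {requested, effective, effective != requested};
}

// Human-readable form for logs or diagnostics.
inline std::string toString(AllocatorMode mode) {
    return mode == AllocatorMode::Async ? "async" : "sync";
}
```

Keeping both the requested and the effective mode in one status object lets call sites stay unchanged while diagnostics can still show that a fallback occurred.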

Requirement: Scheduler Graph Replay

The scheduler SHALL support CUDA Graph capture and replay for stable workloads so that repeated pipeline executions can reduce CPU launch overhead.

Scenarios

Scenario: Reusing a captured graph

  • WHEN the same stable workload executes repeatedly with graph mode enabled
  • THEN the scheduler SHALL reuse the captured graph instead of recapturing every run

Scenario: Invalidating a captured graph

  • WHEN the workload's shape or topology changes
  • THEN the scheduler SHALL invalidate the previously captured graph before the next replay
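
The reuse and invalidation scenarios above amount to a cache keyed on the workload's shape and topology. The following is a minimal host-only sketch under stated assumptions: `WorkloadKey` and `GraphCache` are hypothetical names, the hashes are assumed to be computed elsewhere, and the capture and replay steps are represented only by counters. In a CUDA build those steps would typically map to `cudaStreamBeginCapture` / `cudaStreamEndCapture`, `cudaGraphInstantiate`, `cudaGraphLaunch`, and `cudaGraphExecDestroy` on invalidation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical key: how shape and topology are hashed is an assumption.
struct WorkloadKey {
    std::uint64_t shapeHash;
    std::uint64_t topologyHash;
    bool operator==(const WorkloadKey& other) const {
        return shapeHash == other.shapeHash &&
               topologyHash == other.topologyHash;
    }
};

// Hypothetical single-entry cache for one captured graph.
class GraphCache {
public:
    // Returns true when the cached graph was replayed,
    // false when the graph had to be captured (or recaptured) first.
    bool run(const WorkloadKey& key) {
        if (hasGraph_ && cached_ == key) {
            ++replays_;  // stand-in for launching the instantiated graph
            return true;
        }
        // Shape or topology changed (or nothing captured yet):
        // invalidate the old graph before anything is replayed.
        hasGraph_ = false;
        cached_ = key;   // stand-in for stream capture plus instantiation
        hasGraph_ = true;
        ++captures_;
        return false;
    }

    int captures() const { return captures_; }
    int replays() const { return replays_; }

private:
    bool hasGraph_ = false;
    WorkloadKey cached_{};
    int captures_ = 0;
    int replays_ = 0;
};
```

With this shape, three runs of the same key produce one capture and two replays, and a run with a different shape hash triggers exactly one recapture.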

Requirement: Fixed-Shape Batch Execution

The pipeline SHALL execute fixed-shape batches as a single runtime batch context so that batch execution is not implemented as a thin loop over single-frame execution.

Scenarios

Scenario: Batch metadata propagation

  • WHEN executeBatch() is called with a fixed-shape batch
  • THEN operators SHALL receive batch metadata through ImageBuffer

Scenario: One invocation per node

  • WHEN the runtime executes a fixed-shape batch
  • THEN each node SHALL execute once per batch context instead of once per frame
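
The difference between a batch context and a per-frame loop can be illustrated with a small host-only sketch. Everything here is an assumption for illustration: `ImageBufferSketch` is a stand-in and not the project's actual `ImageBuffer`, and `BatchInfo`, `ScaleNode`, and `executeBatchSketch` are hypothetical names. The point is only that each node is invoked once and reads the batch size from the buffer it receives.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical batch metadata carried by the buffer.
struct BatchInfo {
    std::size_t batchSize;    // number of frames in the batch
    std::size_t frameStride;  // elements between consecutive frames
};

// Stand-in for ImageBuffer; the real type's fields are not known here.
struct ImageBufferSketch {
    std::vector<float> data;
    std::size_t width;
    std::size_t height;
    BatchInfo batch;
};

// Hypothetical operator that scales every pixel of every frame.
struct ScaleNode {
    float factor;
    int invocations;  // counts calls to execute(), for illustration

    void execute(ImageBufferSketch& buf) {
        ++invocations;
        // On a GPU this loop would typically be one launch covering the
        // whole batch rather than one launch per frame.
        for (std::size_t f = 0; f < buf.batch.batchSize; ++f) {
            float* frame = buf.data.data() + f * buf.batch.frameStride;
            for (std::size_t i = 0; i < buf.width * buf.height; ++i) {
                frame[i] *= factor;
            }
        }
    }
};

// Each node runs exactly once per batch context.
inline void executeBatchSketch(std::vector<ScaleNode>& nodes,
                               ImageBufferSketch& buf) {
    for (ScaleNode& node : nodes) {
        node.execute(buf);
    }
}
```

A thin per-frame loop would instead call every node `batchSize` times, which is the behavior this requirement rules out.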

Requirement: Throughput Validation Tooling

The project SHALL provide benchmark and GPU-validation entry points so that throughput-oriented changes can be measured and validated in engineering workflows.

Scenarios

Scenario: Benchmark entry point

  • WHEN a developer needs to measure throughput
  • THEN the repository SHALL provide a supported benchmark target or benchmark mode
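
A benchmark mode could be built around a small timing helper like the one below. This is a sketch under stated assumptions rather than the repository's benchmark target: `ThroughputResult` and `measureThroughput` are hypothetical names, and the workload is any callable. For GPU work the callable would need to synchronize before returning (for example with `cudaStreamSynchronize`), otherwise only launch time is measured.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical result record for one benchmark run.
struct ThroughputResult {
    std::size_t iterations;
    std::size_t totalFrames;
    double seconds;
    double framesPerSecond;  // 0.0 if the elapsed time was not measurable
};

// Runs warmup iterations untimed, then times the measured iterations.
template <typename Fn>
ThroughputResult measureThroughput(Fn&& runOnce,
                                   std::size_t warmup,
                                   std::size_t iterations,
                                   std::size_t framesPerIteration) {
    for (std::size_t i = 0; i < warmup; ++i) {
        runOnce();
    }

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        runOnce();
    }
    const auto stop = std::chrono::steady_clock::now();

    const double seconds =
        std::chrono::duration<double>(stop - start).count();
    const std::size_t frames = iterations * framesPerIteration;
    const double fps =
        seconds > 0.0 ? static_cast<double>(frames) / seconds : 0.0;
    return {iterations, frames, seconds, fps};
}
```

Separating warmup from measured iterations keeps one-time costs, such as graph capture or allocator pool growth, out of the reported throughput.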

Scenario: GPU validation path

  • WHEN CI or local development runs on GPU-capable infrastructure
  • THEN the project SHALL provide an explicit GPU validation path rather than relying only on best-effort CPU-only execution