# Architecture Guide

## Overview
Hetero-Paged-Infer implements a heterogeneous computing architecture where CPU and GPU collaborate efficiently for LLM inference. The system separates control flow (CPU) from compute-heavy operations (GPU) to maximize resource utilization.
## Design Principles

- **CPU Responsibilities:** Tokenization, request scheduling, KV Cache page management, batch preparation
- **GPU Responsibilities:** Attention computation, matrix operations, token generation
- **Memory Efficiency:** PagedAttention avoids the memory waste caused by padding and fragmentation (illustrated below)
- **Throughput Optimization:** Continuous batching maximizes GPU utilization
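To put numbers on the memory-efficiency principle, here is a back-of-the-envelope comparison. The block size of 16 matches the diagrams later in this guide; the maximum sequence length and request length are assumed values for illustration only:

```rust
// Back-of-the-envelope only: sizes are assumptions, not measurements.
fn main() {
    let block_size: u32 = 16;    // tokens per KV Cache block (see diagrams below)
    let max_seq_len: u32 = 2048; // hypothetical fixed preallocation length
    let actual_len: u32 = 100;   // tokens one sequence actually uses

    // Fixed preallocation wastes every slot beyond the actual length.
    let padded_waste = max_seq_len - actual_len; // 1948 slots

    // Paged allocation reserves ceil(actual_len / block_size) blocks,
    // wasting at most block_size - 1 slots in the final block.
    let blocks = (actual_len + block_size - 1) / block_size; // 7 blocks
    let paged_waste = blocks * block_size - actual_len;      // 12 slots

    println!("padded waste: {padded_waste}, paged waste: {paged_waste}");
}
```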
## System Architecture

```
┌─────────────────────────────────────────────────────────────────────────┐
│                             InferenceEngine                             │
├─────────────────────────────────────────────────────────────────────────┤
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────────────────┐     │
│  │  Tokenizer   │   │  Scheduler   │   │    KV Cache Manager      │     │
│  │    (CPU)     │   │    (CPU)     │   │         (CPU)            │     │
│  │ Encode/Decode│   │  (Prefill/   │   │  (BlockPool/PageTable)   │     │
│  │              │   │   Decode)    │   │                          │     │
│  └──────┬───────┘   └──────┬───────┘   └───────────┬──────────────┘     │
│         │                  │                       │                    │
│         │           ┌──────▼───────┐               │                    │
│         │           │ Batch Builder│◄──────────────┘                    │
│         │           │    (CPU)     │                                    │
│         │           └──────┬───────┘                                    │
│  ───────┼──────────────────┼──────────────────────────────────────────  │
│         │           ┌──────▼───────┐                                    │
│         │           │ GPU Executor │                                    │
│         │           │  (CUDA/GPU)  │                                    │
│         │           └──────┬───────┘                                    │
│         │                  │                                            │
│         │           ┌──────▼───────┐                                    │
│         │           │   KV Cache   │                                    │
│         │           │ (GPU Memory) │                                    │
│         │           └──────────────┘                                    │
└─────────────────────────────────────────────────────────────────────────┘
```
## Core Components

### 1. InferenceEngine
The main orchestrator that coordinates all components:
```rust
pub struct InferenceEngine {
    config: EngineConfig,
    tokenizer: Box<dyn TokenizerTrait>,             // CPU: encode/decode text
    scheduler: Box<dyn SchedulerTrait>,             // CPU: continuous batching
    kv_cache_manager: Box<dyn KVCacheManagerTrait>, // CPU: page-table bookkeeping
    gpu_executor: Box<dyn GPUExecutorTrait>,        // GPU: batch execution
}
```
Responsibilities:

- Request submission and lifecycle management
- Step-by-step execution loop (sketched below)
- Error recovery strategies
- Metrics collection
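As a rough mental model of that loop, the sketch below shows the schedule → execute → update cycle with placeholder types. The method names (`schedule`, `execute`, `update`) and the `Engine` struct are illustrative, not the crate's actual API:

```rust
// Minimal sketch of the engine step loop; all names here are stand-ins.
struct Batch;   // placeholder for a scheduled batch
struct Output;  // placeholder for GPU results

struct Engine { pending: usize } // number of queued requests (simplified)

impl Engine {
    /// One engine step: schedule on CPU, execute on GPU, update state on CPU.
    fn step(&mut self) -> Option<Output> {
        let batch = self.schedule()?;   // CPU: pick prefill/decode work
        let out = self.execute(&batch); // GPU: run the batch
        self.update(&out);              // CPU: append tokens, retire finished seqs
        Some(out)
    }

    fn schedule(&mut self) -> Option<Batch> {
        if self.pending == 0 { None } else { Some(Batch) }
    }
    fn execute(&self, _batch: &Batch) -> Output { Output }
    fn update(&mut self, _out: &Output) { self.pending -= 1; }
}

fn main() {
    let mut engine = Engine { pending: 2 };
    while engine.step().is_some() {} // run until all requests complete
}
```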
### 2. Scheduler

Implements Continuous Batching with decode-priority scheduling.

Key Features:

- Separate queues for prefill and decode requests
- Dynamic batch formation
- Memory pressure awareness
- Request state transitions

Scheduling Algorithm (see the sketch after this list):

1. Prioritize decode requests (lower latency for in-flight requests)
2. Fill remaining batch capacity with prefill requests
3. Respect memory thresholds and batch size limits
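A hedged sketch of decode-priority batch formation follows; the type and field names are illustrative, and the real scheduler's internals may differ:

```rust
// Sketch: form a batch by draining the decode queue first, then prefill.
struct Seq { id: u64, tokens_this_step: u32 }

fn drain_into(
    queue: &mut Vec<Seq>, batch: &mut Vec<Seq>, tokens: &mut u32,
    max_batch_size: usize, max_total_tokens: u32,
) {
    while batch.len() < max_batch_size {
        // Stop when the next candidate would exceed the token budget.
        let fits = queue.first()
            .map_or(false, |s| *tokens + s.tokens_this_step <= max_total_tokens);
        if !fits { break; }
        let s = queue.remove(0);
        *tokens += s.tokens_this_step;
        batch.push(s);
    }
}

fn form_batch(
    decode_q: &mut Vec<Seq>, prefill_q: &mut Vec<Seq>,
    max_batch_size: usize, max_total_tokens: u32,
) -> Vec<Seq> {
    let (mut batch, mut tokens) = (Vec::new(), 0);
    // 1. Decode first: lowest latency for in-flight requests.
    drain_into(decode_q, &mut batch, &mut tokens, max_batch_size, max_total_tokens);
    // 2. Fill leftover capacity with prefill work.
    drain_into(prefill_q, &mut batch, &mut tokens, max_batch_size, max_total_tokens);
    batch
}

fn main() {
    let mut decode = vec![Seq { id: 0, tokens_this_step: 1 }];
    let mut prefill = vec![Seq { id: 1, tokens_this_step: 128 }];
    let batch = form_batch(&mut decode, &mut prefill, 8, 256);
    assert_eq!(batch[0].id, 0); // the decode request was scheduled first
}
```

Keeping decode requests at the head of the batch bounds per-token latency for in-flight sequences, while prefill work soaks up whatever capacity remains.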
### 3. KV Cache Manager

Implements PagedAttention memory management; a minimal allocation sketch follows the page-table example below.

Core Concepts:

- **Physical Blocks**: Contiguous GPU memory regions
- **Logical Blocks**: Virtual blocks mapped via page table
- **Block Pool**: Free-list management for physical blocks
Memory Layout:
```
┌───────────────────────────────────────────────────────────────┐
│                    GPU KV Cache Memory Pool                    │
├───────────────────────────────────────────────────────────────┤
│   Block 0   │   Block 1   │   Block 2   │  ...  │  Block N-1  │
│ ┌─────────┐ │ ┌─────────┐ │ ┌─────────┐ │       │ ┌─────────┐ │
│ │ K[0:16] │ │ │ K[0:16] │ │ │ K[0:16] │ │       │ │ K[0:16] │ │
│ │ V[0:16] │ │ │ V[0:16] │ │ │ V[0:16] │ │       │ │ V[0:16] │ │
│ └─────────┘ │ └─────────┘ │ └─────────┘ │       │ └─────────┘ │
└───────────────────────────────────────────────────────────────┘
```
Page Table Example:
```
Sequence 0:                   Sequence 1:
┌────────────────────┐        ┌────────────────────┐
│ Logical → Physical │        │ Logical → Physical │
├────────────────────┤        ├────────────────────┤
│    0    →    3     │        │    0    →    1     │
│    1    →    7     │        │    1    →    5     │
│    2    →   12     │        │    2    →    9     │
└────────────────────┘        └────────────────────┘
```
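The following is a minimal sketch of how a free-list block pool plus per-sequence page tables could implement the mapping above. The `BlockPool` shape shown here is an assumption; the real manager also tracks usage statistics:

```rust
// Sketch of page-table-based block allocation over a simple free list.
struct BlockPool {
    free: Vec<u32>,             // physical block indices still available
    page_tables: Vec<Vec<u32>>, // per-sequence: logical index -> physical index
}

impl BlockPool {
    fn new(total_blocks: u32) -> Self {
        Self { free: (0..total_blocks).rev().collect(), page_tables: Vec::new() }
    }

    /// Allocate the next logical block for `seq` from the free list.
    fn append_block(&mut self, seq: usize) -> Option<u32> {
        let phys = self.free.pop()?; // None signals memory pressure
        self.page_tables[seq].push(phys);
        Some(phys)
    }

    /// Return all of a finished sequence's blocks to the pool.
    fn free_sequence(&mut self, seq: usize) {
        self.free.append(&mut self.page_tables[seq]);
    }
}

fn main() {
    let mut pool = BlockPool::new(16);
    pool.page_tables.push(Vec::new()); // register sequence 0
    let phys = pool.append_block(0).unwrap();
    println!("logical block 0 of seq 0 -> physical block {phys}");
    pool.free_sequence(0);
}
```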
### 4. GPU Executor

Abstracts GPU computation (currently a mock implementation).
Interface:
```rust
pub trait GPUExecutorTrait {
    /// Run a batch (prefill or decode) and return the generated tokens.
    fn execute(&mut self, batch: &ExecutionBatch) -> ExecutionOutput;
    /// Record a CUDA Graph for a fixed decode batch size.
    fn capture_decode_graph(&mut self, batch_size: u32);
    /// Replay a previously captured graph for low-overhead decode steps.
    fn execute_graph(&mut self, batch: &ExecutionBatch) -> ExecutionOutput;
}
```
Features (planned):

- PagedAttention kernel
- CUDA Graph capture for the decode phase
- FP16/BF16 computation with FP32 accumulation
- Fused operations for memory-bandwidth optimization
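To make the trait concrete, here is a trivial mock in the spirit of the current implementation status (see the table at the end of this guide). The shapes of `ExecutionBatch` and `ExecutionOutput` are trimmed-down assumptions, and the trait is repeated so the sketch compiles standalone:

```rust
// Stand-in types: the doc does not show ExecutionOutput's definition.
struct ExecutionBatch { seq_lens: Vec<u32> }
struct ExecutionOutput { next_tokens: Vec<u32> }

trait GPUExecutorTrait {
    fn execute(&mut self, batch: &ExecutionBatch) -> ExecutionOutput;
    fn capture_decode_graph(&mut self, batch_size: u32);
    fn execute_graph(&mut self, batch: &ExecutionBatch) -> ExecutionOutput;
}

struct MockExecutor;

impl GPUExecutorTrait for MockExecutor {
    fn execute(&mut self, batch: &ExecutionBatch) -> ExecutionOutput {
        // Emit a dummy token for every sequence in the batch.
        ExecutionOutput { next_tokens: vec![0; batch.seq_lens.len()] }
    }
    fn capture_decode_graph(&mut self, _batch_size: u32) {
        // No-op: a real executor would record a CUDA Graph here.
    }
    fn execute_graph(&mut self, batch: &ExecutionBatch) -> ExecutionOutput {
        self.execute(batch) // mock has no graph to replay; just re-run
    }
}

fn main() {
    let mut exec = MockExecutor;
    let out = exec.execute(&ExecutionBatch { seq_lens: vec![5, 9] });
    assert_eq!(out.next_tokens.len(), 2);
}
```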
### 5. Tokenizer
Trait-based tokenization interface:
```rust
pub trait TokenizerTrait {
    fn encode(&self, text: &str) -> Vec<u32>;
    fn decode(&self, tokens: &[u32]) -> String;
    fn vocab_size(&self) -> u32;
    fn bos_token_id(&self) -> u32;
    fn eos_token_id(&self) -> u32;
    fn pad_token_id(&self) -> u32;
}
```
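The status table at the end of this guide describes the current tokenizer as a simple character-level implementation. Here is a hedged sketch of what such an implementation could look like; the token-ID scheme (codepoint shifted past three reserved special IDs) is an assumption, and the trait is repeated so the example compiles standalone:

```rust
trait TokenizerTrait {
    fn encode(&self, text: &str) -> Vec<u32>;
    fn decode(&self, tokens: &[u32]) -> String;
    fn vocab_size(&self) -> u32;
    fn bos_token_id(&self) -> u32;
    fn eos_token_id(&self) -> u32;
    fn pad_token_id(&self) -> u32;
}

struct CharTokenizer;

// Reserve IDs 0..=2 for PAD/BOS/EOS, then shift codepoints up by 3.
const SPECIALS: u32 = 3;

impl TokenizerTrait for CharTokenizer {
    fn encode(&self, text: &str) -> Vec<u32> {
        text.chars().map(|c| c as u32 + SPECIALS).collect()
    }
    fn decode(&self, tokens: &[u32]) -> String {
        tokens.iter()
            .filter(|&&t| t >= SPECIALS)                  // skip special tokens
            .filter_map(|&t| char::from_u32(t - SPECIALS))
            .collect()
    }
    // Loose upper bound on the codepoint space; a sketch, not a tight count.
    fn vocab_size(&self) -> u32 { char::MAX as u32 + 1 + SPECIALS }
    fn pad_token_id(&self) -> u32 { 0 }
    fn bos_token_id(&self) -> u32 { 1 }
    fn eos_token_id(&self) -> u32 { 2 }
}

fn main() {
    let tok = CharTokenizer;
    let ids = tok.encode("hi");
    assert_eq!(tok.decode(&ids), "hi"); // round-trip
}
```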
## Inference Flow
```
┌─────────┐   ┌───────────┐   ┌───────────┐   ┌───────────┐   ┌──────────┐
│ Request │──▶│ Tokenize  │──▶│ Schedule  │──▶│ Execute   │──▶│ Decode   │
│ Input   │   │ (CPU)     │   │ (CPU)     │   │ (GPU)     │   │ (CPU)    │
└─────────┘   └───────────┘   └───────────┘   └───────────┘   └──────────┘
                                    │               │
                                    │    ┌──────────┘
                                    ▼    ▼
                             ┌───────────────┐
                             │ KV Cache Mgr  │
                             │ (CPU)         │
                             └───────────────┘
```
Detailed Steps (condensed into code after this list):

1. **Request Submission**
   - Client submits text input with generation parameters
   - Tokenizer encodes text to token IDs
   - Request queued in scheduler's pending queue
2. **Scheduling**
   - Scheduler selects requests for the next batch
   - Decode requests prioritized over prefill
   - Batch size and token limits enforced
   - Memory pressure checks performed
3. **KV Cache Allocation**
   - New sequences allocate logical blocks
   - Physical blocks mapped via page table
   - Block allocations tracked for memory stats
4. **GPU Execution**
   - Batch data prepared (input IDs, positions, block tables)
   - GPU kernel executed (prefill or decode)
   - Output tokens generated
5. **State Updates**
   - Sequence states updated
   - Completed sequences removed
   - New tokens decoded to text
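Condensed into code, one pass through these stages might look as follows. Every function is a toy stand-in for the corresponding component, included only to make the data handoffs concrete:

```rust
// Toy stand-ins for each stage; none of these are the crate's real API.
fn tokenize(text: &str) -> Vec<u32> {
    text.bytes().map(u32::from).collect()              // 1. CPU: encode
}
fn schedule(tokens: Vec<u32>) -> Vec<u32> { tokens }   // 2. CPU: queue + batch
fn execute(batch: &[u32]) -> u32 {                     // 3. "GPU": one new token
    batch.iter().sum::<u32>() % 128
}
fn decode(token: u32) -> char {                        // 4. CPU: detokenize
    char::from(token as u8)
}

fn main() {
    let tokens = tokenize("hello");
    let batch = schedule(tokens);
    let next = execute(&batch);
    println!("generated token {} -> {:?}", next, decode(next));
    // 5. State updates: the engine would now append `next` to the sequence
    //    and either continue decoding or mark the request Completed.
}
```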
## State Machine
```
        ┌─────────────┐
        │   Pending   │  (Waiting for scheduling)
        └──────┬──────┘
               │ schedule()
               ▼
        ┌─────────────┐
┌───────│   Prefill   │  (Processing input tokens)
│       └──────┬──────┘
│              │ Prefill complete
│              ▼
│       ┌─────────────┐
│       │   Decode    │◄────┐  (Generating tokens)
│       └──────┬──────┘     │
│              │            │ Generate next token
│              ├────────────┘
│              │ EOS or max_tokens
│              ▼
│       ┌─────────────┐
└──────▶│  Completed  │  (Done)
        └─────────────┘
```
State Transitions:
| From | To | Trigger |
|---|---|---|
| Pending | Prefill | Scheduled for first execution |
| Prefill | Decode | All input tokens processed |
| Decode | Decode | Token generated, not finished |
| Decode | Completed | EOS token or max_tokens reached |
| Any | Failed | Error during execution |
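The transition table maps naturally onto an enum and a transition function. The `RequestState` name matches the field in the `Request` struct below, but the variants and this helper are a sketch, not the crate's definition:

```rust
// Sketch of the state machine as Rust types; error handling (Any -> Failed)
// is omitted for brevity.
#[derive(Debug, PartialEq)]
enum RequestState { Pending, Prefill, Decode, Completed, Failed }

fn advance(state: RequestState, generated: u32, max_tokens: u32, eos: bool) -> RequestState {
    use RequestState::*;
    match state {
        Pending => Prefill,  // scheduled for first execution
        Prefill => Decode,   // all input tokens processed
        Decode if eos || generated >= max_tokens => Completed,
        Decode => Decode,    // token generated, keep going
        terminal => terminal, // Completed and Failed never leave their state
    }
}

fn main() {
    assert_eq!(advance(RequestState::Pending, 0, 16, false), RequestState::Prefill);
    assert_eq!(advance(RequestState::Decode, 16, 16, false), RequestState::Completed);
}
```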
## Data Structures

### Request
```rust
pub struct Request {
    pub id: u64,
    pub input_tokens: Vec<u32>,  // prompt after tokenization
    pub output_tokens: Vec<u32>, // tokens generated so far
    pub max_tokens: u32,
    pub temperature: f32,        // sampling temperature
    pub top_p: f32,              // nucleus sampling threshold
    pub state: RequestState,
}
```
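The `temperature` and `top_p` fields suggest standard temperature-scaled nucleus sampling. The sketch below shows the usual technique (with a greedy pick standing in for the final random draw, to avoid a dependency); it is not necessarily this crate's exact sampler:

```rust
// Standard temperature + top-p (nucleus) selection over raw logits.
fn sample(logits: &[f32], temperature: f32, top_p: f32) -> usize {
    // 1. Temperature-scale, then softmax (max-subtracted for stability).
    let scaled: Vec<f32> = logits.iter().map(|l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();

    // 2. Sort token indices by probability, descending.
    let mut idx: Vec<usize> = (0..exps.len()).collect();
    idx.sort_by(|&a, &b| exps[b].partial_cmp(&exps[a]).unwrap());

    // 3. Keep the smallest prefix whose cumulative mass reaches top_p.
    let mut cum = 0.0;
    let mut nucleus = Vec::new();
    for &i in &idx {
        nucleus.push(i);
        cum += exps[i] / sum;
        if cum >= top_p { break; }
    }
    nucleus[0] // a real sampler draws randomly from the nucleus
}

fn main() {
    let logits = [2.0, 1.0, 0.1];
    assert_eq!(sample(&logits, 1.0, 0.9), 0);
}
```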
### Sequence
```rust
pub struct Sequence {
    pub seq_id: u64,
    pub request: Request,
    pub logical_blocks: Vec<LogicalBlock>, // mapped to physical blocks via page table
    pub num_computed_tokens: u32,          // prefill/decode progress marker
}
```
### ExecutionBatch

```rust
pub struct ExecutionBatch {
    pub input_tokens: Vec<u32>,      // flattened tokens for all sequences
    pub positions: Vec<u32>,         // per-token position within its sequence
    pub seq_lens: Vec<u32>,          // one entry per sequence
    pub block_tables: Vec<Vec<u32>>, // per-sequence physical block IDs
    pub is_prefill: Vec<bool>,       // phase flag per sequence
}
```
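For intuition about the flattened layout, here is a hand-built batch mixing one prefill and one decode sequence. The struct is repeated so the example compiles standalone; all values are made up, with a block size of 16 assumed (matching the page-table example earlier):

```rust
struct ExecutionBatch {
    input_tokens: Vec<u32>,
    positions: Vec<u32>,
    seq_lens: Vec<u32>,
    block_tables: Vec<Vec<u32>>,
    is_prefill: Vec<bool>,
}

fn main() {
    let batch = ExecutionBatch {
        // Seq A (prefill): 3 prompt tokens. Seq B (decode): 1 new token.
        input_tokens: vec![11, 12, 13, /* seq B */ 99],
        // Positions restart per sequence; seq B is at its 17th slot.
        positions: vec![0, 1, 2, 16],
        seq_lens: vec![3, 17],
        // Per-sequence page tables: logical block i -> physical block ID.
        block_tables: vec![vec![3], vec![1, 5]],
        is_prefill: vec![true, false],
    };
    assert_eq!(batch.positions.len(), batch.input_tokens.len());
    assert!(batch.is_prefill[0]);
    println!(
        "{} tokens, {} seqs, seq B holds blocks {:?}",
        batch.input_tokens.len(), batch.seq_lens.len(), batch.block_tables[1]
    );
}
```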
## Design Properties

The architecture maintains these correctness properties:

- **Request ID Uniqueness**: all sequence IDs are unique
- **Batch Constraints**: sequence count ≤ max_batch_size, total tokens ≤ max_total_tokens
- **Decode Priority**: decode requests are scheduled before prefill when capacity allows
- **Block Count Invariant**: used_blocks + free_blocks == total_blocks (checked in the sketch below)
- **Memory Pressure Response**: new prefill requests are rejected when the memory threshold is exceeded
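A natural place to enforce the block-count invariant is a debug assertion run after every allocation and free; the counter struct below is a stand-in for the real pool's statistics:

```rust
// Sketch: block accounting check as a debug-only assertion.
struct PoolStats { used: u32, free: u32, total: u32 }

fn check_invariant(s: &PoolStats) {
    // Property: used_blocks + free_blocks == total_blocks must hold
    // after every allocation and every free.
    debug_assert_eq!(s.used + s.free, s.total, "block accounting leak");
}

fn main() {
    check_invariant(&PoolStats { used: 5, free: 11, total: 16 });
}
```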
## Current Implementation Status
| Component | Status | Description |
|---|---|---|
| Scheduler | ✅ Complete | Full continuous batching with decode priority |
| KV Cache Manager | ✅ Complete | PagedAttention with block pool |
| Inference Engine | ✅ Complete | Orchestration with error handling |
| Tokenizer | ✅ Complete | Simple character-level implementation |
| GPU Executor | ⚠️ Mock | Interface defined, mock implementation |
| CUDA Kernels | ❌ Planned | Real kernel implementation pending |
For API details, see API.md. For configuration options, see CONFIGURATION.md.