Configuration Guide

Complete configuration reference for Hetero-Paged-Infer.

Configuration Methods

1. Command-Line Arguments

./hetero-infer \
  --block-size 16 \
  --max-num-blocks 1024 \
  --max-batch-size 32 \
  --memory-threshold 0.9 \
  --input "Hello" \
  --max-tokens 100

2. Configuration File

Create config.json:

{
  "block_size": 16,
  "max_num_blocks": 1024,
  "max_batch_size": 32,
  "max_num_seqs": 256,
  "max_model_len": 2048,
  "max_total_tokens": 4096,
  "memory_threshold": 0.9
}

Use with:

./hetero-infer --config config.json --input "Hello"

3. Environment Variables

export HETERO_BLOCK_SIZE=16
export HETERO_MAX_BLOCKS=1024
export HETERO_MEMORY_THRESHOLD=0.9

4. Programmatic

use hetero_infer::EngineConfig;

let config = EngineConfig {
    block_size: 16,
    max_num_blocks: 2048,
    max_batch_size: 64,
    ..Default::default()
};
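
A config.json can also be loaded from code. The sketch below is illustrative only: it assumes EngineConfig implements serde's Deserialize with field names matching the JSON keys, which the crate may or may not provide out of the box.

// Illustrative sketch: assumes EngineConfig implements serde::Deserialize
// with fields matching the keys in config.json. Verify against the crate.
use std::fs;
use hetero_infer::EngineConfig;

fn load_config(path: &str) -> Result<EngineConfig, Box<dyn std::error::Error>> {
    let raw = fs::read_to_string(path)?;                     // read config.json
    let config: EngineConfig = serde_json::from_str(&raw)?;  // map JSON keys to fields
    Ok(config)
}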

Configuration Reference

Block Size

{
  "block_size": 16
}
  • Default: 16
  • Range: 1-128
  • Impact: Sets the number of tokens per physical block, trading fragmentation against metadata overhead (see below)
Size   Fragmentation   Metadata   Best For
8      Low             High       Short sequences
16     Medium          Medium     General use
32     Higher          Low        Long sequences
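
To make the trade-off concrete, here is a small stand-alone sketch (blocks_needed is illustrative, not an engine function) that computes how many blocks a sequence occupies and how many token slots its last block leaves unused:

// Illustrative helper, not part of the engine API: blocks a sequence
// occupies and unused token slots in its final, partially filled block.
fn blocks_needed(seq_len: usize, block_size: usize) -> (usize, usize) {
    let blocks = (seq_len + block_size - 1) / block_size; // round up to whole blocks
    let waste = blocks * block_size - seq_len;            // unused slots in the last block
    (blocks, waste)
}

fn main() {
    // A 100-token sequence: block_size 8 needs 13 blocks and wastes 4 slots;
    // block_size 32 needs only 4 blocks but wastes 28 slots.
    println!("{:?}", blocks_needed(100, 8));  // (13, 4)
    println!("{:?}", blocks_needed(100, 32)); // (4, 28)
}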

Maximum Blocks

{
  "max_num_blocks": 1024
}

Memory calculation:

Total Memory = max_num_blocks × block_size × layers × heads × head_dim × 2 × bytes_per_element

The factor of 2 accounts for storing both keys and values; bytes_per_element is 2 for FP16.

Example (FP16):

1024 × 16 × 32 × 32 × 128 × 2 × 2 = 8,589,934,592 bytes ≈ 8.6 GB
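
The same arithmetic as a stand-alone sketch (kv_cache_bytes is illustrative, not an engine function):

// Stand-alone sketch of the memory formula above; not an engine API.
fn kv_cache_bytes(max_num_blocks: u64, block_size: u64, layers: u64,
                  heads: u64, head_dim: u64, bytes_per_elem: u64) -> u64 {
    // The factor of 2 accounts for storing both keys and values.
    max_num_blocks * block_size * layers * heads * head_dim * 2 * bytes_per_elem
}

fn main() {
    let bytes = kv_cache_bytes(1024, 16, 32, 32, 128, 2); // FP16 = 2 bytes per element
    println!("{:.1} GB", bytes as f64 / 1e9);             // prints "8.6 GB"
}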

Batch Configuration

{
  "max_batch_size": 32,
  "max_num_seqs": 256,
  "max_total_tokens": 4096
}
Parameter          Default   Description
max_batch_size     32        Sequences per batch
max_num_seqs       256       Concurrent sequences
max_total_tokens   4096      Tokens per batch
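
These limits can also be set programmatically. The field names for max_num_seqs and max_total_tokens below are assumed to mirror the JSON keys (only max_batch_size appears in the earlier EngineConfig example), so verify them against the crate:

use hetero_infer::EngineConfig;

// max_num_seqs and max_total_tokens field names are assumed from the JSON
// keys; check the crate's EngineConfig definition before relying on them.
let config = EngineConfig {
    max_batch_size: 32,     // sequences scheduled into one batch
    max_num_seqs: 256,      // sequences the engine tracks concurrently
    max_total_tokens: 4096, // token budget per batch
    ..Default::default()
};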

Memory Settings

{
  "max_model_len": 2048,
  "memory_threshold": 0.9
}
Parameter          Default   Range     Description
max_model_len      2048      -         Max sequence length
memory_threshold   0.9       0.0-1.0   Admission control threshold
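
The idea behind memory_threshold, sketched as a simple admission check (illustrative only, not the engine's scheduler code): a new sequence is admitted only while projected block usage stays at or below the threshold.

// Illustrative admission check; not the engine's actual scheduler logic.
fn admit(used_blocks: usize, needed_blocks: usize,
         max_num_blocks: usize, memory_threshold: f64) -> bool {
    let projected = (used_blocks + needed_blocks) as f64 / max_num_blocks as f64;
    projected <= memory_threshold
}

fn main() {
    // With 1024 blocks and a 0.9 threshold, a request needing 40 blocks is
    // admitted at 800 used blocks but rejected at 890 (930 / 1024 ≈ 0.91).
    assert!(admit(800, 40, 1024, 0.9));
    assert!(!admit(890, 40, 1024, 0.9));
}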

Generation Parameters

Temperature

GenerationParams {
    temperature: 1.0,  // 0.0 = greedy, >1.0 = more random
}
Value   Behavior
0.0     Greedy decoding
0.7     Focused
1.0     Balanced
1.5     Creative
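
What temperature does mechanically, as a minimal stand-alone sketch (not the engine's sampler): logits are divided by the temperature before softmax, and 0.0 short-circuits to greedy argmax.

// Minimal temperature-scaling sketch; not the engine's sampler.
fn apply_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    if temperature == 0.0 {
        // Greedy: all probability mass on the highest-logit token.
        let best = logits.iter().enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .map(|(i, _)| i)
            .unwrap();
        return (0..logits.len()).map(|i| if i == best { 1.0 } else { 0.0 }).collect();
    }
    // Softmax over logits / temperature: T < 1.0 sharpens, T > 1.0 flattens.
    let scaled: Vec<f32> = logits.iter().map(|l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}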

Top-p (Nucleus Sampling)

GenerationParams {
    top_p: 0.9,  // Sample from top 90% probability mass
}
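
Mechanically, nucleus sampling keeps the smallest set of highest-probability tokens whose cumulative mass reaches top_p, then renormalizes over that set. A minimal sketch (not the engine's sampler):

// Minimal nucleus (top-p) filter over a probability distribution; not the
// engine's sampler. Returns the surviving (token_index, probability) pairs,
// renormalized to sum to 1.
fn top_p_filter(probs: &[f32], top_p: f32) -> Vec<(usize, f32)> {
    let mut indexed: Vec<(usize, f32)> = probs.iter().cloned().enumerate().collect();
    indexed.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()); // highest probability first
    let mut kept = Vec::new();
    let mut cumulative = 0.0_f32;
    for (i, p) in indexed {
        kept.push((i, p));
        cumulative += p;
        if cumulative >= top_p {
            break; // smallest set reaching the target probability mass
        }
    }
    let total: f32 = kept.iter().map(|(_, p)| p).sum();
    kept.into_iter().map(|(i, p)| (i, p / total)).collect()
}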

Max Tokens

GenerationParams {
    max_tokens: 100,
}

Configuration Presets

Low Latency (Interactive)

{
  "block_size": 16,
  "max_num_blocks": 512,
  "max_batch_size": 8,
  "max_num_seqs": 64,
  "max_model_len": 1024,
  "max_total_tokens": 1024,
  "memory_threshold": 0.8
}

High Throughput (Batch Processing)

{
  "block_size": 32,
  "max_num_blocks": 4096,
  "max_batch_size": 128,
  "max_num_seqs": 1024,
  "max_model_len": 4096,
  "max_total_tokens": 16384,
  "memory_threshold": 0.95
}

Memory Constrained

{
  "block_size": 16,
  "max_num_blocks": 256,
  "max_batch_size": 16,
  "max_num_seqs": 128,
  "max_model_len": 1024,
  "max_total_tokens": 2048,
  "memory_threshold": 0.85
}

Long Context

{
  "block_size": 32,
  "max_num_blocks": 2048,
  "max_batch_size": 16,
  "max_num_seqs": 64,
  "max_model_len": 8192,
  "max_total_tokens": 8192,
  "memory_threshold": 0.9
}

Validation Rules

Parameter          Rule           Error
block_size         > 0            Must be positive
max_num_blocks     > 0            Must be positive
max_batch_size     > 0            Must be positive
max_model_len      ≥ block_size   Context must fit blocks
memory_threshold   0.0-1.0        Must be fraction
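
The rules above as an illustrative validation routine (a sketch only; the field names beyond those shown in the programmatic example and the exact error strings are assumptions, not the engine's real validation code):

use hetero_infer::EngineConfig;

// Illustrative check of the validation rules above; max_model_len and
// memory_threshold field names are assumed from the JSON keys.
fn validate(config: &EngineConfig) -> Result<(), String> {
    if config.block_size == 0 {
        return Err("block_size must be positive".into());
    }
    if config.max_num_blocks == 0 {
        return Err("max_num_blocks must be positive".into());
    }
    if config.max_batch_size == 0 {
        return Err("max_batch_size must be positive".into());
    }
    if config.max_model_len < config.block_size {
        return Err("max_model_len must be at least block_size".into());
    }
    if !(0.0..=1.0).contains(&config.memory_threshold) {
        return Err("memory_threshold must be between 0.0 and 1.0".into());
    }
    Ok(())
}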

Monitoring Configuration

// Enable metrics
let config = EngineConfig {
    enable_metrics: true,
    metrics_port: 9090,
    ..Default::default()
};

Performance Tuning

GPU Optimization

{
  "cuda_graph": true,
  "fp16": true,
  "kv_cache_dtype": "fp16"
}

CPU Optimization

{
  "num_threads": 8,
  "pin_memory": true
}

Next: Architecture Overview