Configuration Guide
Complete configuration reference for Hetero-Paged-Infer.
Configuration Methods
1. Command-Line Arguments
./hetero-infer \
--block-size 16 \
--max-num-blocks 1024 \
--max-batch-size 32 \
--memory-threshold 0.9 \
--input "Hello" \
--max-tokens 100
2. Configuration File
Create config.json:
{
  "block_size": 16,
  "max_num_blocks": 1024,
  "max_batch_size": 32,
  "max_num_seqs": 256,
  "max_model_len": 2048,
  "max_total_tokens": 4096,
  "memory_threshold": 0.9
}
Use with:
3. Environment Variables
4. Programmatic
use hetero_infer::EngineConfig;

// Override selected fields; all others keep their default values.
let config = EngineConfig {
    block_size: 16,
    max_num_blocks: 2048,
    max_batch_size: 64,
    ..Default::default()
};
Configuration Reference
Block Size
- Default: 16
- Range: 1-128
- Impact: Number of tokens per physical block
| Block size | Fragmentation | Metadata overhead | Best for |
|---|---|---|---|
| 8 | Low | High | Short sequences |
| 16 | Medium | Medium | General use |
| 32 | High | Low | Long sequences |
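The fragmentation column can be made concrete with a little arithmetic. The following is a minimal sketch, assuming each sequence is allocated whole blocks so that only its last block can be partially filled. The function names are illustrative and not part of the hetero_infer API.

```rust
/// Number of physical blocks needed to hold `seq_len` tokens.
/// Uses ceiling division; assumes whole-block allocation.
fn blocks_needed(seq_len: usize, block_size: usize) -> usize {
    assert!(block_size > 0, "block_size must be positive");
    (seq_len + block_size - 1) / block_size
}

/// Token slots that are allocated but unused in the final block.
fn wasted_slots(seq_len: usize, block_size: usize) -> usize {
    blocks_needed(seq_len, block_size) * block_size - seq_len
}
```

For a 100-token sequence, block sizes 8, 16, and 32 need 13, 7, and 4 blocks and leave 4, 12, and 28 slots unused. Larger blocks mean fewer blocks to track but more waste per sequence.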
Maximum Blocks
Memory calculation:
Example (FP16):
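A minimal sketch of how the footprint of a paged cache is commonly estimated, assuming the blocks hold a transformer KV cache that stores keys and values for every layer. The `ModelDims` struct and its values are hypothetical and are not taken from Hetero-Paged-Infer; substitute the dimensions of the model actually being served.

```rust
/// Hypothetical model dimensions (assumption, not from Hetero-Paged-Infer).
struct ModelDims {
    num_layers: u64,
    num_kv_heads: u64,
    head_dim: u64,
    bytes_per_elem: u64, // 2 for FP16
}

/// Bytes for one block: keys and values (factor 2) for every layer,
/// KV head, and token slot in the block.
fn bytes_per_block(block_size: u64, m: &ModelDims) -> u64 {
    2 * m.num_layers * m.num_kv_heads * m.head_dim * m.bytes_per_elem * block_size
}

/// Total bytes reserved when `max_num_blocks` blocks are allocated.
fn kv_cache_bytes(max_num_blocks: u64, block_size: u64, m: &ModelDims) -> u64 {
    max_num_blocks * bytes_per_block(block_size, m)
}
```

Under these assumptions, a model with 32 layers, 32 KV heads, and a head dimension of 128 in FP16 uses 8 MiB per 16-token block, so 1024 blocks reserve 8 GiB.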
Batch Configuration
| Parameter | Default | Description |
|---|---|---|
| max_batch_size | 32 | Maximum sequences per batch |
| max_num_seqs | 256 | Maximum concurrent sequences |
| max_total_tokens | 4096 | Maximum total tokens per batch |
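The interaction between max_batch_size and max_total_tokens can be sketched as a greedy batch fill. This is an illustration under the assumption that sequences are admitted in queue order until either limit would be exceeded; the engine's actual scheduler may behave differently. `BatchLimits` and `fill_batch` are hypothetical names.

```rust
/// Limits mirroring two of the parameters in the table above.
struct BatchLimits {
    max_batch_size: usize,
    max_total_tokens: usize,
}

/// Greedily admit waiting sequences (given by token count, in queue order)
/// until either limit would be exceeded. Returns how many were admitted.
fn fill_batch(waiting: &[usize], limits: &BatchLimits) -> usize {
    let mut tokens = 0;
    let mut admitted = 0;
    for &len in waiting {
        if admitted == limits.max_batch_size || tokens + len > limits.max_total_tokens {
            break;
        }
        tokens += len;
        admitted += 1;
    }
    admitted
}
```

With the defaults (32 sequences, 4096 tokens), many short sequences hit the sequence limit first, while a few long sequences hit the token limit first.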
Memory Settings
| Parameter | Default | Range | Description |
|---|---|---|---|
| max_model_len | 2048 | - | Maximum sequence length in tokens |
| memory_threshold | 0.9 | 0.0-1.0 | Admission control threshold |
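One plausible reading of memory_threshold is a cap on the fraction of blocks in use. The sketch below assumes that interpretation: a new sequence is admitted only if block utilization after admission stays at or below the threshold. The function `can_admit` is hypothetical and the engine's real admission logic may differ.

```rust
/// Returns true if admitting a sequence that needs `needed_blocks`
/// keeps block utilization at or below `memory_threshold`.
fn can_admit(
    used_blocks: usize,
    needed_blocks: usize,
    max_num_blocks: usize,
    memory_threshold: f64,
) -> bool {
    let after = (used_blocks + needed_blocks) as f64;
    after <= memory_threshold * max_num_blocks as f64
}
```

With 1024 blocks and a threshold of 0.9, the cap is 921.6 blocks, so a request that would bring usage to 920 blocks is admitted and one that would bring it to 922 is not.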
Generation Parameters
Temperature
| Value | Behavior |
|---|---|
| 0.0 | Greedy decoding |
| 0.7 | Focused |
| 1.0 | Balanced |
| 1.5 | Creative |
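The behaviors in the table follow from how temperature rescales logits before sampling. A minimal sketch of the standard technique, assuming a temperature of 0.0 is handled as a plain argmax; the function names are illustrative and not part of the hetero_infer API.

```rust
/// Softmax over logits divided by `temperature` (must be > 0).
/// Lower temperatures concentrate probability on the top token.
fn softmax_with_temperature(logits: &[f64], temperature: f64) -> Vec<f64> {
    assert!(temperature > 0.0, "use argmax for temperature 0.0");
    // Subtract the maximum logit for numerical stability.
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Greedy decoding: index of the largest logit (0 for an empty slice).
fn argmax(logits: &[f64]) -> usize {
    let mut best = 0;
    for (i, &l) in logits.iter().enumerate() {
        if l > logits[best] {
            best = i;
        }
    }
    best
}
```

For the same logits, a temperature of 0.7 gives the top token a larger share of the probability mass than 1.5 does, which is why lower values read as more focused.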
Top-p (Nucleus Sampling)
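Nucleus sampling, as generally defined, restricts sampling to the smallest set of tokens whose cumulative probability reaches top-p. The sketch below illustrates that general technique only; `top_p_filter` is a hypothetical name and nothing here is taken from the engine's implementation.

```rust
/// Keep the smallest set of most-likely tokens whose cumulative probability
/// reaches `top_p`, then renormalize the kept probabilities.
/// Returns (token_index, probability) pairs, most likely first.
fn top_p_filter(probs: &[f64], top_p: f64) -> Vec<(usize, f64)> {
    let mut indexed: Vec<(usize, f64)> = probs.iter().cloned().enumerate().collect();
    // Sort by probability, descending.
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));

    let mut cumulative = 0.0;
    let mut kept = Vec::new();
    for (i, p) in indexed {
        kept.push((i, p));
        cumulative += p;
        if cumulative >= top_p {
            break;
        }
    }

    let total: f64 = kept.iter().map(|&(_, p)| p).sum();
    kept.into_iter().map(|(i, p)| (i, p / total)).collect()
}
```

A sampler would then draw the next token from the returned pairs instead of the full distribution.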
Max Tokens
Configuration Presets
Low Latency (Interactive)
{
  "block_size": 16,
  "max_num_blocks": 512,
  "max_batch_size": 8,
  "max_num_seqs": 64,
  "max_model_len": 1024,
  "max_total_tokens": 1024,
  "memory_threshold": 0.8
}
High Throughput (Batch Processing)
{
  "block_size": 32,
  "max_num_blocks": 4096,
  "max_batch_size": 128,
  "max_num_seqs": 1024,
  "max_model_len": 4096,
  "max_total_tokens": 16384,
  "memory_threshold": 0.95
}
Memory Constrained
{
  "block_size": 16,
  "max_num_blocks": 256,
  "max_batch_size": 16,
  "max_num_seqs": 128,
  "max_model_len": 1024,
  "max_total_tokens": 2048,
  "memory_threshold": 0.85
}
Long Context
{
  "block_size": 32,
  "max_num_blocks": 2048,
  "max_batch_size": 16,
  "max_num_seqs": 64,
  "max_model_len": 8192,
  "max_total_tokens": 8192,
  "memory_threshold": 0.9
}
Validation Rules
| Parameter | Rule | Error |
|---|---|---|
| block_size | > 0 | Must be positive |
| max_num_blocks | > 0 | Must be positive |
| max_batch_size | > 0 | Must be positive |
| max_model_len | ≥ block_size | Context must fit blocks |
| memory_threshold | 0.0-1.0 | Must be fraction |
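The rules in the table can be expressed as a single validation pass. This is a sketch only: `ConfigSketch` is a standalone stand-in covering just the validated fields, not the real `EngineConfig`, and the error strings simply reuse the table's wording.

```rust
/// Standalone stand-in for the validated fields (hypothetical type;
/// the real `EngineConfig` may have more fields and other types).
struct ConfigSketch {
    block_size: usize,
    max_num_blocks: usize,
    max_batch_size: usize,
    max_model_len: usize,
    memory_threshold: f64,
}

/// Check each rule in table order and report the first violation.
fn validate(c: &ConfigSketch) -> Result<(), String> {
    if c.block_size == 0 {
        return Err("block_size: Must be positive".into());
    }
    if c.max_num_blocks == 0 {
        return Err("max_num_blocks: Must be positive".into());
    }
    if c.max_batch_size == 0 {
        return Err("max_batch_size: Must be positive".into());
    }
    if c.max_model_len < c.block_size {
        return Err("max_model_len: Context must fit blocks".into());
    }
    // A NaN threshold also fails this range check.
    if !(0.0..=1.0).contains(&c.memory_threshold) {
        return Err("memory_threshold: Must be fraction".into());
    }
    Ok(())
}
```

Running a check like this before constructing the engine surfaces a bad value as a readable error rather than a failure later during allocation.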
Monitoring Configuration
// Enable metrics
let config = EngineConfig {
    enable_metrics: true,
    metrics_port: 9090,
    ..Default::default()
};
Performance Tuning
GPU Optimization
CPU Optimization