
# Hetero-Paged-Infer

**High-Performance LLM Inference Engine**

*PagedAttention + Continuous Batching*

[Get Started](setup/quickstart.md){ .md-button .md-button--primary }
[GitHub](https://github.com/LessUp/hetero-paged-infer){ .md-button }

## Features

| Feature | Description |
|---|---|
| PagedAttention | Block-based KV cache, <5% memory waste |
| Continuous Batching | Dynamic prefill/decode scheduling |
| Production Ready | Error handling, metrics, monitoring |
| Well Tested | 135 tests (unit, property, integration) |
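
To make the "block-based KV cache" row concrete, here is a minimal sketch of how a paged allocator can hand out fixed-size blocks on demand. `BlockAllocator`, `ensure_capacity`, and `release` are hypothetical names chosen for illustration; they are not taken from this project's API, and the real implementation may differ.

```rust
/// Illustrative paged KV-cache allocator (hypothetical, not the project's API).
/// Physical memory is split into fixed-size blocks; each sequence keeps a
/// block table that maps its logical blocks to physical block indices.
#[derive(Debug)]
pub struct BlockAllocator {
    block_size: usize,       // tokens stored per block
    free_blocks: Vec<usize>, // indices of unused physical blocks
}

impl BlockAllocator {
    pub fn new(num_blocks: usize, block_size: usize) -> Self {
        assert!(block_size > 0, "block_size must be non-zero");
        // Reversed so that pop() hands out block 0 first.
        Self { block_size, free_blocks: (0..num_blocks).rev().collect() }
    }

    /// Blocks required to hold `num_tokens` tokens (ceiling division).
    pub fn blocks_needed(&self, num_tokens: usize) -> usize {
        (num_tokens + self.block_size - 1) / self.block_size
    }

    /// Grow `block_table` until it can hold `num_tokens` tokens.
    /// Returns false and changes nothing if the pool has too few free blocks.
    pub fn ensure_capacity(&mut self, block_table: &mut Vec<usize>, num_tokens: usize) -> bool {
        let extra = self.blocks_needed(num_tokens).saturating_sub(block_table.len());
        if extra > self.free_blocks.len() {
            return false;
        }
        for _ in 0..extra {
            // Safe: we just checked that at least `extra` blocks are free.
            block_table.push(self.free_blocks.pop().unwrap());
        }
        true
    }

    /// Return all blocks of a finished sequence to the pool.
    pub fn release(&mut self, block_table: &mut Vec<usize>) {
        self.free_blocks.extend(block_table.drain(..));
    }

    pub fn free_count(&self) -> usize {
        self.free_blocks.len()
    }
}
```

Because a sequence only ever holds whole blocks, the unused slots are confined to its last block. That is the reason a block-based layout can keep memory waste low.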

## Quick Start

```bash
git clone https://github.com/LessUp/hetero-paged-infer.git
cd hetero-paged-infer
cargo build --release
./target/release/hetero-infer --input "Hello, world!" --max-tokens 50
```

## Performance

| Method | Memory Waste | Throughput |
|---|---|---|
| Static Allocation | ~40-60% | Baseline |
| PagedAttention | <5% | +50% |
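
The memory-waste column can be reasoned about with simple arithmetic. The sketch below assumes that static allocation reserves `max_seq_len` slots per sequence, while paged allocation rounds each sequence up to a whole number of blocks. The function names and the inputs in the usage note are illustrative assumptions, not measurements from this project.

```rust
/// Fraction of reserved KV-cache slots that hold no token.
fn waste_fraction(used: usize, reserved: usize) -> f64 {
    if reserved == 0 {
        return 0.0;
    }
    // saturating_sub guards against inputs where used > reserved.
    reserved.saturating_sub(used) as f64 / reserved as f64
}

/// Static allocation: every sequence reserves `max_seq_len` slots up front.
fn static_waste(seq_lens: &[usize], max_seq_len: usize) -> f64 {
    let used: usize = seq_lens.iter().sum();
    waste_fraction(used, seq_lens.len() * max_seq_len)
}

/// Paged allocation: every sequence reserves just enough whole blocks.
fn paged_waste(seq_lens: &[usize], block_size: usize) -> f64 {
    assert!(block_size > 0, "block_size must be non-zero");
    let used: usize = seq_lens.iter().sum();
    let reserved: usize = seq_lens
        .iter()
        .map(|&n| (n + block_size - 1) / block_size * block_size)
        .sum();
    waste_fraction(used, reserved)
}
```

For example, with hypothetical sequence lengths of 900, 1400, and 700 tokens, `static_waste(&lens, 2048)` is about 0.51 and `paged_waste(&lens, 16)` is about 0.008. These inputs are made up to show the calculation and are not benchmark results.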

## Architecture

```text
┌─────────────────────────────────────────────────┐
│            InferenceEngine (CPU)                 │
│  ┌──────────┐  ┌──────────┐  ┌──────────────┐   │
│  │Tokenizer │  │Scheduler │  │ KV Cache Mgr │   │
│  └────┬─────┘  └────┬─────┘  └──────┬───────┘   │
│       └─────────────┼───────────────┘           │
├─────────────────────┼───────────────────────────┤
│               ┌─────▼─────┐                      │
│               │    GPU    │  Executor + Memory   │
│               └───────────┘                      │
└─────────────────────────────────────────────────┘
```
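
As a rough illustration of what the Scheduler box does under continuous batching, the sketch below retires finished requests at every step and immediately fills the freed slots with waiting ones. All types and names here (`Request`, `Phase`, `Scheduler::step`) are hypothetical and simplified; the project's actual scheduler is not shown on this page and may work differently.

```rust
use std::collections::VecDeque;

#[derive(Debug)]
struct Request {
    id: u32,
    generated: usize,      // tokens decoded so far
    max_new_tokens: usize, // stop after this many decoded tokens
}

#[derive(Debug, PartialEq)]
enum Phase {
    Prefill, // first pass over the prompt
    Decode,  // one new token for an already-running request
}

struct Scheduler {
    waiting: VecDeque<Request>,
    running: Vec<Request>,
    max_batch: usize,
}

impl Scheduler {
    fn new(max_batch: usize) -> Self {
        Self { waiting: VecDeque::new(), running: Vec::new(), max_batch }
    }

    fn submit(&mut self, request: Request) {
        self.waiting.push_back(request);
    }

    /// Plan one engine iteration and return the work items for its batch.
    fn step(&mut self) -> Vec<(u32, Phase)> {
        // 1. Retire requests that finished in an earlier step, freeing slots.
        self.running.retain(|r| r.generated < r.max_new_tokens);

        // 2. Every request still running decodes one more token.
        let mut batch = Vec::new();
        for r in &mut self.running {
            r.generated += 1;
            batch.push((r.id, Phase::Decode));
        }

        // 3. Fill any free slots with waiting requests (prefill), so the
        //    batch does not have to drain before new work is admitted.
        while self.running.len() < self.max_batch {
            match self.waiting.pop_front() {
                Some(r) => {
                    batch.push((r.id, Phase::Prefill));
                    self.running.push(r);
                }
                None => break,
            }
        }
        batch
    }
}
```

In this simplification a prefill step does not count toward `generated`, and the batch limit is a request count; a production scheduler would more likely budget by tokens and by free KV-cache blocks.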

## Documentation

- Setup: Installation and configuration
- Architecture: System design
- API Reference: API documentation
- Deployment: Production deployment
- Development: Contributing