Production Ready v0.1.0

High-Performance LLM Inference Engine

PagedAttention and continuous batching in Rust, with <5% KV cache memory waste and 50% higher throughput than static allocation.

<5% Memory Waste
+50% Throughput
121 Tests Passed

Built for Production LLM Serving

Modern inference techniques implemented in Rust for maximum performance and reliability.

🧠

PagedAttention

Block-based KV Cache management with on-demand allocation. Eliminates memory fragmentation and enables efficient memory sharing.
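
To make on-demand block allocation concrete, here is a minimal sketch of a free-list block pool and a per-sequence block table. The names `SketchBlockPool` and `SketchBlockTable` and their methods are hypothetical and chosen for illustration; they are not taken from the project's API, and a real KV cache manager would also handle sharing and reference counts.

```rust
/// Hypothetical fixed-size pool of KV cache blocks, tracked as a free list of block ids.
#[derive(Debug)]
pub struct SketchBlockPool {
    free: Vec<usize>,
}

impl SketchBlockPool {
    pub fn new(num_blocks: usize) -> Self {
        // Reversed so that `pop` hands out block 0 first.
        Self { free: (0..num_blocks).rev().collect() }
    }

    /// Returns a free block id, or `None` when the pool is exhausted.
    pub fn allocate(&mut self) -> Option<usize> {
        self.free.pop()
    }

    /// Returns a block to the pool so another sequence can reuse it.
    pub fn release(&mut self, block: usize) {
        self.free.push(block);
    }

    pub fn free_blocks(&self) -> usize {
        self.free.len()
    }
}

/// Hypothetical per-sequence block table: logical block index -> physical block id.
#[derive(Debug)]
pub struct SketchBlockTable {
    block_size: usize,
    blocks: Vec<usize>,
    num_tokens: usize,
}

impl SketchBlockTable {
    pub fn new(block_size: usize) -> Self {
        assert!(block_size > 0, "block_size must be positive");
        Self { block_size, blocks: Vec::new(), num_tokens: 0 }
    }

    /// Records one more token. A new block is allocated only when every
    /// existing block is full. Returns `false` if the pool has no block left.
    pub fn append_token(&mut self, pool: &mut SketchBlockPool) -> bool {
        if self.num_tokens == self.blocks.len() * self.block_size {
            match pool.allocate() {
                Some(block) => self.blocks.push(block),
                None => return false,
            }
        }
        self.num_tokens += 1;
        true
    }

    pub fn blocks(&self) -> &[usize] {
        &self.blocks
    }

    /// Unused slots in the last block; always smaller than `block_size`.
    pub fn wasted_slots(&self) -> usize {
        self.blocks.len() * self.block_size - self.num_tokens
    }

    /// Releases every block back to the pool when the sequence finishes.
    pub fn free_all(&mut self, pool: &mut SketchBlockPool) {
        for block in self.blocks.drain(..) {
            pool.release(block);
        }
        self.num_tokens = 0;
    }
}
```

In this sketch a sequence never reserves memory for tokens it has not produced, so the only unused slots are those at the tail of its last block.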

Continuous Batching

Dynamic prefill/decode scheduling with priority awareness. Maximizes GPU utilization while maintaining low latency.
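
One way to picture priority-aware prefill/decode scheduling is a per-step token budget: running sequences decode first, then waiting requests are admitted by priority while their prompts fit. The sketch below assumes that policy; `SketchScheduler`, `SketchRequest`, `SketchBatch`, and the `token_budget` parameter are hypothetical names, not the project's scheduler.

```rust
/// Hypothetical request record; a higher `priority` is admitted first.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SketchRequest {
    pub id: u32,
    pub priority: u8,
    pub prompt_tokens: usize,
    /// Tokens still to generate after prefill.
    pub remaining_decode: usize,
}

/// Sequence ids scheduled for one engine step.
#[derive(Debug, PartialEq, Eq)]
pub struct SketchBatch {
    pub decode: Vec<u32>,
    pub prefill: Vec<u32>,
}

pub struct SketchScheduler {
    waiting: Vec<SketchRequest>,
    running: Vec<SketchRequest>,
    token_budget: usize,
}

impl SketchScheduler {
    pub fn new(token_budget: usize) -> Self {
        Self { waiting: Vec::new(), running: Vec::new(), token_budget }
    }

    pub fn submit(&mut self, request: SketchRequest) {
        self.waiting.push(request);
    }

    /// Builds the batch for one step under the per-step token budget.
    pub fn step(&mut self) -> SketchBatch {
        // 1. Retire finished sequences so their capacity is free this step.
        self.running.retain(|r| r.remaining_decode > 0);

        // 2. Each running sequence decodes one token (one budget unit each).
        let mut budget = self.token_budget;
        let mut decode = Vec::new();
        for r in self.running.iter_mut() {
            if budget == 0 {
                break;
            }
            r.remaining_decode -= 1;
            budget -= 1;
            decode.push(r.id);
        }

        // 3. Admit waiting requests, highest priority first, while the whole
        //    prompt fits in the leftover budget. The sort is stable, so equal
        //    priorities keep their submission order.
        let mut waiting = std::mem::take(&mut self.waiting);
        waiting.sort_by(|a, b| b.priority.cmp(&a.priority));
        let mut prefill = Vec::new();
        for r in waiting {
            if r.prompt_tokens <= budget {
                budget -= r.prompt_tokens;
                prefill.push(r.id);
                self.running.push(r);
            } else {
                self.waiting.push(r);
            }
        }

        SketchBatch { decode, prefill }
    }
}
```

Because admission happens on every step rather than once per batch, a new request can join as soon as a running one leaves budget unused, which is the core idea behind continuous batching.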

🛡️

Memory Pressure Awareness

Configurable OOM prevention with graceful degradation. Production-ready error handling and monitoring.
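
As an illustration of configurable OOM prevention, the sketch below classifies KV cache usage against two watermarks and maps each level to a progressively more conservative action. The type names, the watermark fields, and the three-level policy are assumptions made for this example; the project's actual configuration may differ.

```rust
/// Hypothetical pressure levels derived from block usage.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchPressure {
    Normal,
    High,
    Critical,
}

/// Hypothetical scheduler reaction to each pressure level.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchAction {
    /// Accept new prefills and keep decoding.
    AdmitNew,
    /// Stop admitting new requests; only running sequences advance.
    DecodeOnly,
    /// Evict or pause a running sequence to reclaim blocks.
    Preempt,
}

/// Watermarks are fractions of total blocks in use (assumed fields).
#[derive(Debug, Clone, Copy)]
pub struct SketchMemoryConfig {
    pub high_watermark: f64,
    pub critical_watermark: f64,
}

impl SketchMemoryConfig {
    pub fn classify(&self, used_blocks: usize, total_blocks: usize) -> SketchPressure {
        // An empty pool cannot serve anything, so treat it as critical.
        if total_blocks == 0 {
            return SketchPressure::Critical;
        }
        let usage = used_blocks as f64 / total_blocks as f64;
        if usage >= self.critical_watermark {
            SketchPressure::Critical
        } else if usage >= self.high_watermark {
            SketchPressure::High
        } else {
            SketchPressure::Normal
        }
    }

    /// Degrades gracefully: throttle admission first, preempt only as a last resort.
    pub fn action(&self, used_blocks: usize, total_blocks: usize) -> SketchAction {
        match self.classify(used_blocks, total_blocks) {
            SketchPressure::Normal => SketchAction::AdmitNew,
            SketchPressure::High => SketchAction::DecodeOnly,
            SketchPressure::Critical => SketchAction::Preempt,
        }
    }
}
```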

🔧

Modular Architecture

Trait-based abstractions for easy customization. Clean separation between CPU scheduler and GPU executor.
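
The scheduler/executor split can be sketched with a single trait boundary. In the example below, `SketchExecutor`, `EchoExecutor`, and `SketchEngine` are hypothetical names used only to show the pattern: the engine depends on the trait, so a GPU backend or a test double can be swapped in without touching scheduling code.

```rust
/// Hypothetical executor boundary: takes scheduled sequence ids and returns
/// one `(sequence_id, token_id)` pair per sequence.
pub trait SketchExecutor {
    fn execute(&mut self, sequence_ids: &[u32]) -> Vec<(u32, u32)>;
}

/// CPU-only stand-in that fabricates a token id per sequence, useful in tests.
#[derive(Debug, Default)]
pub struct EchoExecutor {
    pub calls: usize,
}

impl SketchExecutor for EchoExecutor {
    fn execute(&mut self, sequence_ids: &[u32]) -> Vec<(u32, u32)> {
        self.calls += 1;
        sequence_ids.iter().map(|&id| (id, id + 1000)).collect()
    }
}

/// Engine generic over the executor; it knows nothing about the device.
pub struct SketchEngine<E: SketchExecutor> {
    executor: E,
}

impl<E: SketchExecutor> SketchEngine<E> {
    pub fn new(executor: E) -> Self {
        Self { executor }
    }

    /// Forwards a non-empty batch to the executor; skips the call otherwise.
    pub fn step(&mut self, scheduled: &[u32]) -> Vec<(u32, u32)> {
        if scheduled.is_empty() {
            return Vec::new();
        }
        self.executor.execute(scheduled)
    }

    pub fn executor(&self) -> &E {
        &self.executor
    }
}
```

Using a generic parameter rather than a boxed trait object keeps dispatch static, which fits the zero-cost framing; a `Box<dyn SketchExecutor>` would work equally well if the backend must be chosen at runtime.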

🧪

Comprehensive Testing

121 tests including unit, property-based, and integration tests. Property tests verify critical invariants.

🚀

Rust Performance

Zero-cost abstractions with memory safety. No GC pauses, predictable performance for serving workloads.

Memory Efficiency Comparison

See how PagedAttention compares to traditional allocation strategies.

Static Allocation: ~40-60% waste
Dynamic Allocation: ~20-30% waste (+20% throughput)
PagedAttention: <5% waste (+50% throughput)
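
The gap between static and paged allocation follows from simple arithmetic, sketched below. The functions and the example sequence lengths are hypothetical and chosen for illustration; they are not benchmark results from this project.

```rust
/// Returns `(wasted_slots, reserved_slots)` when every sequence reserves
/// `max_len` token slots up front, regardless of its real length.
pub fn static_waste(seq_lens: &[usize], max_len: usize) -> (usize, usize) {
    let reserved = seq_lens.len() * max_len;
    let used: usize = seq_lens.iter().map(|&len| len.min(max_len)).sum();
    (reserved - used, reserved)
}

/// Returns `(wasted_slots, reserved_slots)` when each sequence holds only as
/// many fixed-size blocks as it needs, so waste is confined to the last block.
pub fn paged_waste(seq_lens: &[usize], block_size: usize) -> (usize, usize) {
    assert!(block_size > 0, "block_size must be positive");
    let used: usize = seq_lens.iter().sum();
    let reserved: usize = seq_lens
        .iter()
        // Round each length up to a whole number of blocks.
        .map(|&len| (len + block_size - 1) / block_size * block_size)
        .sum();
    (reserved - used, reserved)
}
```

For three illustrative sequences of 100, 900, and 400 tokens, `static_waste(&[100, 900, 400], 1024)` gives 1672 wasted slots out of 3072 reserved (about 54%), while `paged_waste(&[100, 900, 400], 16)` gives 24 out of 1424 (under 2%). Each sequence wastes fewer than `block_size` slots under paging, whatever its length.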

Get Started in Minutes

Install and run your first inference with just a few commands.

bash
# Clone the repository
$ git clone https://github.com/LessUp/hetero-paged-infer.git
$ cd hetero-paged-infer

# Build in release mode
$ cargo build --release

# Run inference
$ ./target/release/hetero-infer --input "Hello, world!" --max-tokens 50

Learn More

Explore our comprehensive documentation to get the most out of Hetero-Paged-Infer.

🚀

Quick Start Guide

Get up and running with step-by-step installation and first run instructions.

🏗️

Architecture

Deep dive into the system design, components, and design principles.

📚

API Reference

Complete API documentation with examples and use cases.

🖥️

Production Deploy

Best practices for deploying to production environments.