PagedAttention + Continuous Batching in Rust. Achieve <5% memory waste and 50% higher throughput.
Modern inference techniques implemented in Rust for maximum performance and reliability.
Block-based KV Cache management with on-demand allocation. Eliminates memory fragmentation and enables efficient memory sharing.
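As a minimal sketch of what block-based, on-demand KV cache allocation looks like, here is a hypothetical free-list allocator in plain Rust. The names (`BlockAllocator`, `BLOCK_SIZE`) and the block size are assumptions for illustration, not this crate's actual API:

```rust
const BLOCK_SIZE: usize = 16; // tokens per physical KV block (assumed)

struct BlockAllocator {
    free_blocks: Vec<usize>, // indices of free physical blocks
}

impl BlockAllocator {
    fn new(num_blocks: usize) -> Self {
        Self { free_blocks: (0..num_blocks).collect() }
    }

    /// Allocate just enough blocks to hold `num_tokens` of KV state;
    /// blocks are claimed on demand, so at most one block per sequence
    /// is partially filled.
    fn allocate(&mut self, num_tokens: usize) -> Option<Vec<usize>> {
        let needed = (num_tokens + BLOCK_SIZE - 1) / BLOCK_SIZE;
        if self.free_blocks.len() < needed {
            return None; // let the caller preempt or queue instead of OOM-ing
        }
        Some(self.free_blocks.split_off(self.free_blocks.len() - needed))
    }

    /// Return a finished sequence's blocks to the pool. Because all
    /// blocks are the same size, freeing never fragments the pool.
    fn free(&mut self, blocks: Vec<usize>) {
        self.free_blocks.extend(blocks);
    }
}

fn main() {
    let mut alloc = BlockAllocator::new(8);
    let seq = alloc.allocate(40).unwrap(); // 40 tokens -> 3 blocks of 16
    assert_eq!(seq.len(), 3);
    alloc.free(seq);
    assert_eq!(alloc.free_blocks.len(), 8);
}
```

With uniform blocks, waste is bounded to less than one block's worth of tokens per live sequence, which is where the low-memory-waste guarantee of paged allocation comes from.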
Dynamic prefill/decode scheduling with priority awareness. Maximizes GPU utilization while maintaining low latency.
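The scheduling idea above can be sketched as one continuous-batching step: decode tokens for already-running sequences are batched first (keeping in-flight latency low), then waiting prefills are admitted while a per-step token budget remains. All names and the budget policy here are assumptions for the sketch, not the crate's actual scheduler:

```rust
use std::collections::VecDeque;

#[derive(Debug)]
enum Phase { Prefill(usize), Decode } // Prefill carries the prompt length

struct Scheduler {
    running: Vec<usize>,               // sequence ids in decode phase
    waiting: VecDeque<(usize, usize)>, // (sequence id, prompt length)
    token_budget: usize,               // max tokens per forward pass
}

impl Scheduler {
    /// Build the batch for one engine step.
    fn schedule(&mut self) -> Vec<(usize, Phase)> {
        let mut batch = Vec::new();
        let mut used = 0;
        // Each decode step costs one token; running sequences go first.
        for &id in &self.running {
            if used + 1 > self.token_budget { break; }
            used += 1;
            batch.push((id, Phase::Decode));
        }
        // Admit waiting prefills while budget remains.
        while let Some(&(id, len)) = self.waiting.front() {
            if used + len > self.token_budget { break; }
            used += len;
            self.waiting.pop_front();
            self.running.push(id);
            batch.push((id, Phase::Prefill(len)));
        }
        batch
    }
}

fn main() {
    let mut s = Scheduler {
        running: vec![0, 1],
        waiting: VecDeque::from([(2, 6), (3, 50)]),
        token_budget: 10,
    };
    let batch = s.schedule();
    // 2 decode tokens + a 6-token prefill fit; the 50-token prompt waits.
    assert_eq!(batch.len(), 3);
    assert_eq!(s.waiting.len(), 1);
}
```

Mixing both phases into every step is what keeps the GPU busy: decode-only batches underutilize compute, while prefill-only batches stall in-flight requests.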
Configurable OOM prevention with graceful degradation. Production-ready error handling and monitoring.
Trait-based abstractions for easy customization. Clean separation between CPU scheduler and GPU executor.
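A rough sketch of the kind of trait split described above: a CPU-side scheduler decides what runs next, a device-side executor runs it, and the engine is generic over both. Trait and type names here are illustrative, not the crate's public API:

```rust
/// Decides which sequences run in the next step (CPU side).
trait BatchScheduler {
    fn next_batch(&mut self) -> Vec<u64>; // sequence ids to run
}

/// Executes a batch on the device (GPU side).
trait ModelExecutor {
    fn execute(&mut self, batch: &[u64]) -> Vec<u32>; // one token per sequence
}

/// Generic over both sides, so either can be swapped independently
/// (e.g. a mock executor in tests, a priority scheduler in production).
struct Engine<S: BatchScheduler, E: ModelExecutor> {
    scheduler: S,
    executor: E,
}

impl<S: BatchScheduler, E: ModelExecutor> Engine<S, E> {
    fn step(&mut self) -> Vec<u32> {
        let batch = self.scheduler.next_batch();
        self.executor.execute(&batch)
    }
}

// Minimal stand-in implementations to show the wiring.
struct FifoScheduler { queue: Vec<u64> }
impl BatchScheduler for FifoScheduler {
    fn next_batch(&mut self) -> Vec<u64> { std::mem::take(&mut self.queue) }
}

struct MockExecutor;
impl ModelExecutor for MockExecutor {
    fn execute(&mut self, batch: &[u64]) -> Vec<u32> {
        batch.iter().map(|&id| id as u32 + 1).collect() // fake token ids
    }
}

fn main() {
    let mut engine = Engine {
        scheduler: FifoScheduler { queue: vec![7, 8] },
        executor: MockExecutor,
    };
    assert_eq!(engine.step(), vec![8, 9]);
}
```

Because the generics are resolved at compile time, this separation costs nothing at runtime while keeping each side testable in isolation.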
121 tests across unit, property-based, and integration suites. Property tests verify critical invariants.
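To illustrate what a property test for an allocator checks, here is the idea hand-rolled in plain Rust (no test crate): run many pseudo-random allocate/free sequences and assert an invariant that must always hold. The `Pool` type and the specific invariant, conservation of blocks with no double-allocation, are assumptions for this sketch:

```rust
use std::collections::HashSet;

// Minimal free-list pool standing in for the real allocator under test.
struct Pool { free: Vec<usize> }
impl Pool {
    fn alloc(&mut self, n: usize) -> Option<Vec<usize>> {
        if self.free.len() < n { return None; }
        Some(self.free.split_off(self.free.len() - n))
    }
    fn release(&mut self, blocks: Vec<usize>) { self.free.extend(blocks); }
}

fn main() {
    let total = 32;
    let mut pool = Pool { free: (0..total).collect() };
    let mut live: Vec<Vec<usize>> = Vec::new();
    let mut rng: u64 = 42; // deterministic xorshift stream of operations

    for _ in 0..10_000 {
        rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
        if rng % 2 == 0 {
            // try to allocate 1..=5 blocks for a new sequence
            if let Some(b) = pool.alloc((rng >> 8) as usize % 5 + 1) {
                live.push(b);
            }
        } else if !live.is_empty() {
            // free a random live sequence
            let i = (rng >> 16) as usize % live.len();
            pool.release(live.swap_remove(i));
        }

        // Invariants: block count is conserved and no block appears twice.
        let mut seen: HashSet<usize> = pool.free.iter().copied().collect();
        for seq in &live { seen.extend(seq.iter().copied()); }
        let in_use: usize = live.iter().map(|s| s.len()).sum();
        assert_eq!(pool.free.len() + in_use, total);
        assert_eq!(seen.len(), total);
    }
}
```

A property-testing crate adds input shrinking and strategy combinators on top of this pattern, but the core discipline is the same: state an invariant, then throw generated operation sequences at it.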
Zero-cost abstractions with memory safety. No GC pauses, predictable performance for serving workloads.
See how PagedAttention compares to traditional allocation strategies.
Install and run your first inference with just a few commands.
Explore our comprehensive documentation to get the most out of Hetero-Paged-Infer.
Get up and running with step-by-step installation and first run instructions.
Deep dive into the system architecture, components, and design principles.
Complete API documentation with examples and use cases.
Best practices for deploying to production environments.