Deployment Guide
Overview
This guide covers deploying Hetero-Paged-Infer in production environments, including system requirements, build instructions, and operational best practices.
System Requirements
Minimum Requirements
| Component | Requirement |
|---|---|
| OS | Linux (Ubuntu 20.04+ recommended) |
| CPU | x86_64 with AVX2 support |
| Memory | 8 GB RAM |
| GPU | NVIDIA GPU with CUDA 11.x+ (optional) |
| Rust | 1.70+ (2021 edition) |
| Git | 2.25+ |
Recommended for Production
| Component | Recommendation |
|---|---|
| OS | Ubuntu 22.04 LTS |
| CPU | 16+ cores |
| Memory | 32 GB RAM |
| GPU | NVIDIA A100 / H100 / RTX 4090 |
| CUDA | 12.x |
| Storage | NVMe SSD for models |
Installation
1. Install Rust
# Using rustup
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"
# Verify installation
rustc --version # Should be 1.70+
2. Install CUDA (Optional)
For GPU acceleration:
# Ubuntu 22.04 example
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-12-1
# Verify
nvcc --version
nvidia-smi
3. Clone and Build
# Clone repository
git clone https://github.com/LessUp/hetero-paged-infer.git
cd hetero-paged-infer
# Build release version
cargo build --release
# Run tests
cargo test --release
# Binary location
./target/release/hetero-infer --help
Running the Application
Basic Usage
# Basic inference
./target/release/hetero-infer \
  --input "Hello, world!" \
  --max-tokens 50
# With custom parameters
./target/release/hetero-infer \
  --input "Tell me a story" \
  --max-tokens 200 \
  --temperature 0.8 \
  --top-p 0.95
Using Configuration File
Create production.json:
{
  "block_size": 16,
  "max_num_blocks": 2048,
  "max_batch_size": 64,
  "max_num_seqs": 512,
  "max_model_len": 4096,
  "max_total_tokens": 8192,
  "memory_threshold": 0.9
}
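To sanity-check these values, it can help to work out how much KV cache they imply. The arithmetic below is only a sketch: it assumes each block holds block_size tokens, as is usual for paged KV caches, and that a sequence of max_model_len tokens needs enough blocks to cover it. CONFIGURATION.md remains the authority on what each field actually means.

```shell
# Values copied from production.json above.
block_size=16
max_num_blocks=2048
max_model_len=4096

# Assumption: total KV cache capacity is block_size tokens per block.
kv_tokens=$((block_size * max_num_blocks))

# Blocks needed by one sequence at the maximum length (rounded up).
blocks_per_max_seq=$(( (max_model_len + block_size - 1) / block_size ))

# How many maximum-length sequences fit in the cache at once.
max_len_seqs=$((max_num_blocks / blocks_per_max_seq))

echo "KV cache capacity: ${kv_tokens} tokens"
echo "Blocks per max-length sequence: ${blocks_per_max_seq}"
echo "Max-length sequences that fit: ${max_len_seqs}"
```

Under these assumptions the example config gives 32768 token slots, 256 blocks per maximum-length sequence, and room for 8 such sequences at a time.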
Run with configuration:
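A minimal sketch, assuming the `--config` flag shown in the systemd unit later in this guide accepts the path to production.json. The guard is there only to give a clear message when the binary has not been built yet.

```shell
# Assumption: --config takes the path to the JSON file created above.
BIN=./target/release/hetero-infer
CONFIG=production.json

if [ -x "$BIN" ]; then
  "$BIN" --config "$CONFIG"
else
  echo "hetero-infer not found at $BIN; run 'cargo build --release' first" >&2
fi
```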
Production Deployment
Systemd Service
Create /etc/systemd/system/hetero-infer.service:
[Unit]
Description=Hetero-Paged-Infer Inference Engine
After=network.target
[Service]
Type=simple
User=hetero
Group=hetero
WorkingDirectory=/opt/hetero-paged-infer
ExecStart=/opt/hetero-paged-infer/target/release/hetero-infer \
  --config /etc/hetero-infer/config.json
Restart=always
RestartSec=5
# Resource limits
LimitNOFILE=65536
# Environment
Environment=RUST_LOG=info
Environment=RUST_BACKTRACE=1
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable hetero-infer
sudo systemctl start hetero-infer
sudo systemctl status hetero-infer
Docker Deployment
Create Dockerfile:
# Build stage: compile the release binary
FROM rust:1.75-bullseye AS builder
WORKDIR /app
COPY . .
RUN cargo build --release
# Runtime stage: slim image with only the binary and its config
FROM debian:bullseye-slim
RUN apt-get update && apt-get install -y \
    ca-certificates \
    libssl1.1 \
    && rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/target/release/hetero-infer /usr/local/bin/
COPY --from=builder /app/config.example.json /etc/hetero-infer/config.json
# Run as an unprivileged user
USER nobody
EXPOSE 8080
ENTRYPOINT ["hetero-infer"]
CMD ["--config", "/etc/hetero-infer/config.json"]
Build and run:
# Build image
docker build -t hetero-infer:latest .
# Run container
docker run -d \
  --name hetero-infer \
  --gpus all \
  -v /path/to/config.json:/etc/hetero-infer/config.json \
  -p 8080:8080 \
  hetero-infer:latest
Kubernetes Deployment
Create deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hetero-infer
  labels:
    app: hetero-infer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hetero-infer
  template:
    metadata:
      labels:
        app: hetero-infer
    spec:
      containers:
        - name: hetero-infer
          image: hetero-infer:latest
          ports:
            - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "32Gi"
              cpu: "8"
            requests:
              memory: "16Gi"
              cpu: "4"
          volumeMounts:
            - name: config
              mountPath: /etc/hetero-infer
      volumes:
        - name: config
          configMap:
            name: hetero-infer-config
---
apiVersion: v1
kind: Service
metadata:
  name: hetero-infer
spec:
  selector:
    app: hetero-infer
  ports:
    - port: 8080
      targetPort: 8080
  type: LoadBalancer
Deploy:
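A minimal sketch of the deploy step using standard kubectl commands. The manifest mounts a ConfigMap named hetero-infer-config, so it has to exist before the pod can start; creating it from production.json in the current directory is an assumption about where the config lives. The function name `deploy_hetero_infer` is hypothetical.

```shell
# Hypothetical helper; call it with kubectl pointed at the target cluster.
deploy_hetero_infer() {
  # The Deployment mounts this ConfigMap at /etc/hetero-infer, and the
  # container reads /etc/hetero-infer/config.json, hence the key name.
  # Assumption: the config file is production.json in the current directory.
  kubectl create configmap hetero-infer-config \
    --from-file=config.json=production.json || return 1

  # Create the Deployment and the Service from the manifest above.
  kubectl apply -f deployment.yaml || return 1

  # Wait until the rollout completes or fails.
  kubectl rollout status deployment/hetero-infer || return 1

  # Show the Service, including its external IP once one is assigned.
  kubectl get service hetero-infer
}
```

Run `deploy_hetero_infer` from the directory that contains deployment.yaml. Scheduling the pod also requires a node that advertises the nvidia.com/gpu resource.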
Monitoring
Log Levels
Set the log level with the RUST_LOG environment variable:
# Error only
RUST_LOG=error ./hetero-infer
# Warning and above
RUST_LOG=warn ./hetero-infer
# Info (default)
RUST_LOG=info ./hetero-infer
# Debug (verbose)
RUST_LOG=debug ./hetero-infer
# Trace (very verbose)
RUST_LOG=trace ./hetero-infer
Metrics (Future)
Planned metrics endpoints:
- GET /metrics: Prometheus-compatible metrics
  - Request rate, latency, batch size
  - Memory utilization, queue depth
  - GPU utilization, temperature
Health Checks
Current status check via exit codes:
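A minimal sketch of an exit-code check, assuming the usual convention that an exit status of 0 means success. `check_health` is a hypothetical helper, and `--help` is used only because it is the one side-effect-free invocation shown in this guide; substitute whatever command your deployment treats as a probe.

```shell
# Hypothetical helper: run the given binary with --help and map its exit
# status to a one-word result. Assumes exit status 0 means success.
check_health() {
  if "$1" --help >/dev/null 2>&1; then
    echo "healthy"
  else
    echo "unhealthy"
    return 1
  fi
}

# Example: probe the release binary built earlier in this guide.
check_health ./target/release/hetero-infer || echo "health check failed" >&2
```

For the systemd deployment described above, `systemctl is-active --quiet hetero-infer` gives the same kind of exit-code signal for the running service.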
Performance Optimization
CPU Optimization
- CPU Affinity
- NUMA Awareness
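A sketch of both techniques using the standard Linux tools taskset (from util-linux) and numactl. The helper names are hypothetical, and the core list and node number in the usage note are placeholders; choose values that match the output of lscpu on the target host.

```shell
# Hypothetical helpers; neither is part of hetero-infer.

# Pin a command to a fixed set of cores so the scheduler cannot migrate it.
# Usage: run_pinned <core-list> <command> [args...]
run_pinned() {
  cores=$1
  shift
  taskset -c "$cores" "$@"
}

# Keep both the threads and the memory allocations on one NUMA node.
# Usage: run_on_node <node> <command> [args...]
run_on_node() {
  node=$1
  shift
  numactl --cpunodebind="$node" --membind="$node" "$@"
}
```

For example, `run_pinned 0-7 ./target/release/hetero-infer --config production.json` restricts the engine to cores 0 through 7. On multi-socket hosts it usually makes sense to pick the NUMA node the GPU is attached to, which `nvidia-smi topo -m` reports.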
GPU Optimization
- Persistent Mode
- GPU Clock Settings
- ECC Memory
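A sketch of how these settings can be inspected and changed with nvidia-smi. The helper names are hypothetical, the clock range is left to the caller because valid values depend on the GPU model, and the commands that change state need root.

```shell
# Hypothetical helpers; they wrap real nvidia-smi options.

# Enable persistence mode and print the information needed for the
# other two settings.
inspect_gpu() {
  # Keep the driver initialized between jobs to avoid start-up latency.
  sudo nvidia-smi -pm 1 || return 1
  # List the clock values this GPU supports before locking anything.
  nvidia-smi -q -d SUPPORTED_CLOCKS
  # Report the current and pending ECC mode.
  nvidia-smi -q -d ECC
}

# Lock the GPU clocks to a range given in MHz.
# Usage: lock_gpu_clocks <min-mhz> <max-mhz>
lock_gpu_clocks() {
  sudo nvidia-smi -lgc "$1,$2"
}

# Undo lock_gpu_clocks.
reset_gpu_clocks() {
  sudo nvidia-smi -rgc
}
```

ECC can be toggled with `nvidia-smi -e 0` or `nvidia-smi -e 1` on GPUs that support it; the change takes effect after the next reboot. Whether disabling it is acceptable is a reliability decision, not only a performance one.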
Memory Optimization
- Huge Pages
- Memory Limits
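A sketch of the host-side commands involved, using standard Linux and systemd tooling. The helper names are hypothetical and the sizes are left to the caller. Whether hetero-infer itself benefits from explicitly reserved huge pages is an assumption to verify by measurement.

```shell
# Hypothetical helpers; sizes are the caller's choice, not recommendations.

# Show explicit huge page counters and the transparent huge page mode.
show_hugepages() {
  grep -i '^huge' /proc/meminfo
  cat /sys/kernel/mm/transparent_hugepage/enabled
}

# Reserve a number of explicit huge pages of the default size.
# Usage: reserve_hugepages <count>
reserve_hugepages() {
  sudo sysctl -w vm.nr_hugepages="$1"
}

# Cap the memory of the systemd service defined earlier in this guide.
# Usage: limit_service_memory <size, e.g. 32G>
limit_service_memory() {
  sudo systemctl set-property hetero-infer.service MemoryMax="$1"
}
```

In Kubernetes the equivalent cap is the resources.limits.memory field already present in the manifest above.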
Troubleshooting
Build Issues
| Issue | Solution |
|---|---|
| linker not found | Install build-essential: sudo apt install build-essential |
| CUDA not found | Set the CUDA_HOME environment variable (for example, /usr/local/cuda) |
| proptest fails | Run the tests single-threaded: cargo test -- --test-threads=1 |
Runtime Issues
| Issue | Solution |
|---|---|
| OOM errors | Reduce max_num_blocks or max_batch_size |
| Slow inference | Increase max_batch_size, enable CUDA Graphs |
| Request rejected | Check memory_threshold, reduce concurrent requests |
| GPU not used | Verify CUDA installation, check nvidia-smi |
Debug Mode
# Enable backtrace
RUST_BACKTRACE=1 ./hetero-infer ...
# Full backtrace
RUST_BACKTRACE=full ./hetero-infer ...
# Debug logging
RUST_LOG=debug ./hetero-infer ...
Security Considerations
- Run as non-root user
- Restrict file permissions
- Network isolation
  - Use firewall rules
  - Deploy behind reverse proxy
- Enable TLS for API endpoints
- Resource limits
  - Set memory limits
  - Configure CPU quotas
  - Limit GPU access
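The first two items can be sketched with standard Linux account and permission commands. The account name hetero matches the User= and Group= lines of the systemd unit earlier in this guide; the helper name and the exact permission bits are assumptions to adapt to local policy.

```shell
# Hypothetical helper; run once as an administrator on the target host.
harden_install() {
  # Dedicated system account and group with no home directory or login
  # shell, matching User=hetero and Group=hetero in the systemd unit.
  sudo useradd --system --user-group --no-create-home \
    --shell /usr/sbin/nologin hetero || return 1

  # Config is owned by root and readable by the service group only.
  sudo chown -R root:hetero /etc/hetero-infer || return 1
  sudo chmod 750 /etc/hetero-infer || return 1
  sudo chmod 640 /etc/hetero-infer/config.json
}
```

With these permissions the service can read its configuration but cannot modify it, and other local users cannot read it at all.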
Backup and Recovery
Configuration Backup
# Backup configuration
sudo tar czf hetero-infer-config-$(date +%Y%m%d).tar.gz \
  /etc/hetero-infer/
# Automated backup via cron (crontab entry; runs daily at 02:00)
0 2 * * * tar czf /backup/hetero-infer-config-$(date +\%Y\%m\%d).tar.gz /etc/hetero-infer/
Log Rotation
Create /etc/logrotate.d/hetero-infer:
/var/log/hetero-infer/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    create 0640 hetero hetero
}
For API details, see API.md. For configuration options, see CONFIGURATION.md.