Deployment Guide¶

Overview¶

This guide covers deploying Hetero-Paged-Infer in production environments, including system requirements, build instructions, and operational best practices.

System Requirements¶

Minimum Requirements¶

Component	Requirement
OS	Linux (Ubuntu 20.04+ recommended)
CPU	x86_64 with AVX2 support
Memory	8 GB RAM
GPU	NVIDIA GPU with CUDA 11.x+ (optional)
Rust	1.70+ (2021 edition)
Git	2.25+

Recommended for Production¶

Component	Recommendation
OS	Ubuntu 22.04 LTS
CPU	16+ cores
Memory	32 GB RAM
GPU	NVIDIA A100 / H100 / RTX 4090
CUDA	12.x
Storage	NVMe SSD for models

Installation¶

1. Install Rust¶

# Using rustup
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env

# Verify installation
rustc --version  # Should be 1.70+

2. Install CUDA (Optional)¶

For GPU acceleration:

# Ubuntu 22.04 example
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-12-1

# Verify
nvcc --version
nvidia-smi

3. Clone and Build¶

# Clone repository
git clone https://github.com/LessUp/hetero-paged-infer.git
cd hetero-paged-infer

# Build release version
cargo build --release

# Run tests
cargo test --release

# Binary location
./target/release/hetero-infer --help

Running the Application¶

Basic Usage¶

# Basic inference
./target/release/hetero-infer \
  --input "Hello, world!" \
  --max-tokens 50

# With custom parameters
./target/release/hetero-infer \
  --input "Tell me a story" \
  --max-tokens 200 \
  --temperature 0.8 \
  --top-p 0.95

Using Configuration File¶

Create production.json:

{
  "block_size": 16,
  "max_num_blocks": 2048,
  "max_batch_size": 64,
  "max_num_seqs": 512,
  "max_model_len": 4096,
  "max_total_tokens": 8192,
  "memory_threshold": 0.9
}

Run with configuration:

./target/release/hetero-infer \
  --config production.json \
  --input "Hello, world!"

Production Deployment¶

Systemd Service¶

Create /etc/systemd/system/hetero-infer.service:

[Unit]
Description=Hetero-Paged-Infer Inference Engine
After=network.target

[Service]
Type=simple
User=hetero
Group=hetero
WorkingDirectory=/opt/hetero-paged-infer
ExecStart=/opt/hetero-paged-infer/target/release/hetero-infer \
  --config /etc/hetero-infer/config.json
Restart=always
RestartSec=5

# Resource limits
LimitNOFILE=65536

# Environment
Environment=RUST_LOG=info
Environment=RUST_BACKTRACE=1

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable hetero-infer
sudo systemctl start hetero-infer
sudo systemctl status hetero-infer

Docker Deployment¶

Create Dockerfile:

FROM rust:1.75-bullseye as builder

WORKDIR /app
COPY . .
RUN cargo build --release

FROM debian:bullseye-slim

RUN apt-get update && apt-get install -y \
    ca-certificates \
    libssl1.1 \
    && rm -rf /var/lib/apt/lists/*

COPY --from=builder /app/target/release/hetero-infer /usr/local/bin/
COPY --from=builder /app/config.example.json /etc/hetero-infer/config.json

USER nobody
EXPOSE 8080

ENTRYPOINT ["hetero-infer"]
CMD ["--config", "/etc/hetero-infer/config.json"]

Build and run:

# Build image
docker build -t hetero-infer:latest .

# Run container
docker run -d \
  --name hetero-infer \
  --gpus all \
  -v /path/to/config.json:/etc/hetero-infer/config.json \
  -p 8080:8080 \
  hetero-infer:latest

Kubernetes Deployment¶

Create deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hetero-infer
  labels:
    app: hetero-infer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hetero-infer
  template:
    metadata:
      labels:
        app: hetero-infer
    spec:
      containers:
      - name: hetero-infer
        image: hetero-infer:latest
        ports:
        - containerPort: 8080
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
            cpu: "8"
          requests:
            memory: "16Gi"
            cpu: "4"
        volumeMounts:
        - name: config
          mountPath: /etc/hetero-infer
      volumes:
      - name: config
        configMap:
          name: hetero-infer-config
---
apiVersion: v1
kind: Service
metadata:
  name: hetero-infer
spec:
  selector:
    app: hetero-infer
  ports:
  - port: 8080
    targetPort: 8080
  type: LoadBalancer

Deploy:

kubectl apply -f deployment.yaml

Monitoring¶

Log Levels¶

Set via RUST_LOG environment variable:

# Error only
RUST_LOG=error ./hetero-infer

# Warning and above
RUST_LOG=warn ./hetero-infer

# Info (default)
RUST_LOG=info ./hetero-infer

# Debug (verbose)
RUST_LOG=debug ./hetero-infer

# Trace (very verbose)
RUST_LOG=trace ./hetero-infer

Metrics (Future)¶

Planned metrics endpoints:

GET /metrics - Prometheus-compatible metrics
Request rate, latency, batch size
Memory utilization, queue depth
GPU utilization, temperature

Health Checks¶

Current status check via exit codes:

# In systemd service
ExecStartPre=/path/to/hetero-infer --version

Performance Optimization¶

CPU Optimization¶

CPU Affinity
```
taskset -c 0-7 ./hetero-infer
```

NUMA Awareness

numactl --cpunodebind=0 --membind=0 ./hetero-infer

GPU Optimization¶

Persistent Mode
```
sudo nvidia-smi -pm 1
```
GPU Clock Settings
```
sudo nvidia-smi -ac 877,1530
```

ECC Memory

# Disable ECC for better performance (if acceptable)
sudo nvidia-smi -e 0

Memory Optimization¶

Huge Pages

# Enable transparent huge pages
echo always > /sys/kernel/mm/transparent_hugepage/enabled

Memory Limits

# In systemd service
MemoryMax=64G
MemorySwapMax=0

Troubleshooting¶

Build Issues¶

Issue	Solution
`linker not found`	Install build-essential: `sudo apt install build-essential`
`CUDA not found`	Set `CUDA_HOME` environment variable
`proptest fails`	Run with `--test-threads=1`

Runtime Issues¶

Issue	Solution
OOM errors	Reduce `max_num_blocks` or `max_batch_size`
Slow inference	Increase `max_batch_size`, enable CUDA Graphs
Request rejected	Check memory_threshold, reduce concurrent requests
GPU not used	Verify CUDA installation, check `nvidia-smi`

Debug Mode¶

# Enable backtrace
RUST_BACKTRACE=1 ./hetero-infer ...

# Full backtrace
RUST_BACKTRACE=full ./hetero-infer ...

# Debug logging
RUST_LOG=debug ./hetero-infer ...

Security Considerations¶

Run as non-root user
```
useradd -r -s /bin/false hetero
```

Restrict file permissions

chmod 750 /opt/hetero-paged-infer
chmod 640 /etc/hetero-infer/config.json

Network isolation
Use firewall rules
Deploy behind reverse proxy
Enable TLS for API endpoints
Resource limits
Set memory limits
Configure CPU quotas
Limit GPU access

Backup and Recovery¶

Configuration Backup¶

# Backup configuration
sudo tar czf hetero-infer-config-$(date +%Y%m%d).tar.gz \
  /etc/hetero-infer/

# Automated backup via cron
0 2 * * * tar czf /backup/hetero-infer-config-$(date +\%Y\%m\%d).tar.gz /etc/hetero-infer/

Log Rotation¶

Create /etc/logrotate.d/hetero-infer:

/var/log/hetero-infer/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    create 0640 hetero hetero
}

For API details, see API.md. For configuration options, see CONFIGURATION.md.