Design Decisions

This document records the current architectural decisions that still matter after the repository slimdown. It intentionally keeps only decisions that directly affect the CUDA implementation or the public integration surface.

ADR-001: Restrict `head_dim` to 32, 64, and 128

Context
The kernels rely on compile-time tiling choices to keep shared-memory usage and register pressure predictable.

Decision
Support only head_dim values 32, 64, and 128, and dispatch to explicit template instantiations.

Consequences

Maximum compiler optimization and predictable shared-memory sizing
Simpler host-side validation and narrower test surface
Unsupported model variants must pad or adapt externally
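
The dispatch described above can be sketched on the host side roughly as follows. This is a minimal illustration, not the project's actual code: `Status`, `launch_attention`, `dispatch_attention`, and the tile height `kTileRows` are hypothetical names and values chosen for the example, and the launcher only reports a shared-memory size instead of launching a kernel.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical status codes; the project's real names may differ.
enum class Status { kOk = 0, kUnsupportedHeadDim = 1 };

// Stand-in for a kernel launcher. In real code HeadDim would size the
// shared-memory tiles at compile time; here it only reports that size.
template <int HeadDim>
Status launch_attention(std::size_t* smem_bytes_out) {
    static_assert(HeadDim == 32 || HeadDim == 64 || HeadDim == 128,
                  "only head_dim 32, 64, and 128 are instantiated");
    constexpr int kTileRows = 64;  // assumed tile height, for illustration only
    *smem_bytes_out =
        static_cast<std::size_t>(kTileRows) * HeadDim * sizeof(float);
    return Status::kOk;
}

// Runtime-to-compile-time dispatch: each case names one explicit instantiation.
Status dispatch_attention(int head_dim, std::size_t* smem_bytes_out) {
    assert(smem_bytes_out != nullptr);
    switch (head_dim) {
        case 32:  return launch_attention<32>(smem_bytes_out);
        case 64:  return launch_attention<64>(smem_bytes_out);
        case 128: return launch_attention<128>(smem_bytes_out);
        default:  return Status::kUnsupportedHeadDim;  // caller must pad or adapt
    }
}
```

A caller would pass the model's runtime `head_dim` and treat any status other than `Status::kOk` as a signal to pad or adapt the input externally, as the consequences above note.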

ADR-002: Use FP32 accumulation for FP16 paths

Context
Attention score accumulation and online softmax are numerically fragile in half precision.

Decision
Store inputs/outputs as half where needed, but perform internal accumulation and normalization in float.

Consequences

Better numerical stability for long sequences
Slightly higher internal register/shared-memory pressure
Stable behavior across FP32 and FP16 entry points
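
To make the accumulation policy concrete, here is a minimal host-side sketch of an online softmax whose running maximum and running sum are always `float`, whatever the storage type. It is an illustration under assumptions, not the kernel itself: `online_softmax` and `StorageT` are hypothetical names, `StorageT` stands in for a half-precision type such as CUDA's `__half`, and inputs are assumed to be finite.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// StorageT stands in for the storage type (for example __half in CUDA).
// All running state is kept in float regardless of StorageT.
template <typename StorageT>
std::vector<StorageT> online_softmax(const std::vector<StorageT>& scores) {
    float running_max = -std::numeric_limits<float>::infinity();
    float running_sum = 0.0f;
    for (const StorageT& s : scores) {
        const float x = static_cast<float>(s);
        const float new_max = std::fmax(running_max, x);
        // Rescale the previous sum to the new maximum, then add the new term.
        running_sum = running_sum * std::exp(running_max - new_max) +
                      std::exp(x - new_max);
        running_max = new_max;
    }
    // With finite inputs, the term at the maximum contributes exp(0) = 1.
    assert(scores.empty() || running_sum > 0.0f);

    std::vector<StorageT> out(scores.size());
    for (std::size_t i = 0; i < scores.size(); ++i) {
        const float p =
            std::exp(static_cast<float>(scores[i]) - running_max) / running_sum;
        out[i] = static_cast<StorageT>(p);  // convert back only when storing
    }
    return out;
}
```

Because every exponent is taken relative to the running maximum, large scores such as 1000 and 1001 do not overflow, and conversion to the storage type happens only when the normalized result is written out.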

ADR-003: Keep the Python integration surface at the C ABI level

Context
The project is intentionally lightweight and should remain easy to embed without adding large binding frameworks.

Decision
Expose a C ABI and document ctypes-style usage rather than adopting pybind11 or a custom runtime layer.

Consequences

Smaller dependency surface
Easier integration from multiple languages
Less ergonomic runtime type checking than higher-level binding frameworks
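
As a rough illustration of what a C ABI entry point of this kind can look like, the sketch below declares a function with `extern "C"` linkage, plain pointer and `int` parameters, and integer status codes. Everything named here is hypothetical: `attn_forward_f32_example`, its parameter list, and the `ATTN_*` codes are invented for the example and are not the project's exported symbols. The body only validates arguments; the real kernel launch is omitted.

```cpp
extern "C" {

// Hypothetical status codes returned across the ABI boundary.
enum {
    ATTN_OK = 0,
    ATTN_ERR_NULL_POINTER = 1,
    ATTN_ERR_BAD_SHAPE = 2,
    ATTN_ERR_BAD_HEAD_DIM = 3
};

// Hypothetical entry point: only C-compatible types cross the boundary,
// and errors are reported as integer codes rather than C++ exceptions.
int attn_forward_f32_example(const float* q, const float* k, const float* v,
                             float* out, int batch, int heads, int seq_len,
                             int head_dim) {
    if (q == nullptr || k == nullptr || v == nullptr || out == nullptr) {
        return ATTN_ERR_NULL_POINTER;
    }
    if (batch <= 0 || heads <= 0 || seq_len <= 0) {
        return ATTN_ERR_BAD_SHAPE;
    }
    if (head_dim != 32 && head_dim != 64 && head_dim != 128) {
        return ATTN_ERR_BAD_HEAD_DIM;
    }
    // A real implementation would launch the kernel here.
    return ATTN_OK;
}

}  // extern "C"
```

Since a ctypes-style caller gets no automatic type checking, a function like this has to validate its arguments itself, and the caller has to declare argument and return types by hand. That is the ergonomic cost listed above.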

ADR-004: Fix the core thread-block shape at 128 threads

Context
Kernel occupancy, warp assignment, and shared-memory tiling have all been tuned around a 128-thread block shape.

Decision
Keep blockDim.x = 128 and continue to pair it with explicit launch configuration checks.

Consequences

Predictable scheduling and occupancy characteristics
Fewer tuning dimensions to maintain
Any future change requires retuning and revalidation across kernels
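
An explicit launch configuration check could look something like the following host-side sketch. It is written under assumptions: `Dim3` is a stand-in for CUDA's `dim3` so the example compiles without the CUDA toolkit, the names `kBlockThreads`, `kWarpsPerBlock`, and `is_valid_block` are hypothetical, and the requirement that `y` and `z` equal 1 is assumed rather than stated in the decision.

```cpp
// Stand-in for CUDA's dim3 so the sketch compiles without the CUDA toolkit.
struct Dim3 {
    unsigned x, y, z;
    constexpr Dim3(unsigned x_ = 1, unsigned y_ = 1, unsigned z_ = 1)
        : x(x_), y(y_), z(z_) {}
};

constexpr unsigned kWarpSize = 32;
constexpr unsigned kBlockThreads = 128;  // the fixed block shape from this ADR
constexpr unsigned kWarpsPerBlock = kBlockThreads / kWarpSize;

// Compile-time guards: changing kBlockThreads forces these to be revisited.
static_assert(kBlockThreads % kWarpSize == 0,
              "block size must be a whole number of warps");
static_assert(kWarpsPerBlock == 4,
              "warp assignment in this sketch assumes four warps per block");

// Returns true only for the one block shape the kernels are tuned for.
// Requiring y == 1 and z == 1 is an assumption made for this example.
constexpr bool is_valid_block(const Dim3& block) {
    return block.x == kBlockThreads && block.y == 1 && block.z == 1;
}
```

Keeping the constant and its checks in one place means a future change to the block shape fails loudly at compile time or launch time, which is where the retuning and revalidation noted above would begin.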

ADR-005: Prefer explicit repository docs over external process frameworks

Context
The repository previously accumulated overlapping AI/process frameworks that duplicated guidance and drifted from the actual codebase.

Decision
Keep contributor guidance in the repository’s native docs (README, CONTRIBUTING, GitHub workflow files, and VitePress pages) and remove external control layers.

Consequences

Lower maintenance overhead
Fewer contradictory instructions
Less ceremony when updating docs and workflows
