Changelog | 变更日志

All notable changes to this project will be documented in this file.

本项目的所有显著变更都将记录在此文件中。

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

格式基于 Keep a Changelog,本项目遵循语义化版本。


Unreleased | 未发布

Added | 新增

  • Complete bilingual (EN/ZH) documentation structure | 完整的中英文双语文档结构
  • Root-level CHANGELOG.md with bilingual support | 根目录双语 CHANGELOG.md
  • Professional documentation index and navigation | 专业的文档索引和导航
  • CLAUDE.md, .github/copilot-instructions.md, .github/workflows/copilot-setup-steps.yml | 新增 CLAUDE.md、.github/copilot-instructions.md、.github/workflows/copilot-setup-steps.yml
  • .editorconfig, .clangd, and tracked .vscode/ recommendations | 新增 .editorconfig、.clangd 与可跟踪的 .vscode/ 推荐配置

Changed | 变更

  • OpenSpec is now the only active workflow language in root governance docs | 根治理文档现统一以 OpenSpec 作为唯一主动工作流语言
  • README, documentation hub, and reference pages were tightened to reduce duplication | 收敛 README、文档入口页与参考页,减少重复内容
  • CI now fails on meaningful pre-commit problems instead of hiding them | CI 现在会对真实的 pre-commit 问题直接失败,而不是隐藏失败
  • CMakePresets.json no longer injects unused CUDA variables into CPU-only smoke runs | CMakePresets.json 不再在 CPU smoke 场景中注入无效 CUDA 变量

Fixed | 修复

  • Rewrote docs/en/guides/architecture.md in English (it previously contained Chinese content) | 将 docs/en/guides/architecture.md 重写为英文版本(原内容为中文)
  • Completed docs/zh/getting-started/installation.md Chinese translation | 补全 docs/zh/getting-started/installation.md 中文翻译
  • Translated docs/zh/getting-started/troubleshooting.md to Chinese | 翻译 docs/zh/getting-started/troubleshooting.md 为中文
  • Translated docs/zh/examples/README.md to Chinese | 翻译 docs/zh/examples/README.md 为中文

Removed | 移除

  • Deleted changelog/ directory (consolidated into CHANGELOG.md) | 删除 changelog/ 目录(内容已合并到 CHANGELOG.md)
  • Removed redundant pre-commit CI job (kept format-check) | 移除冗余的 pre-commit CI job(保留 format-check)

Technical | 技术改进

  • Added sm_70 (V100) support to CMakePresets release build | CMakePresets release 构建添加 sm_70 (V100) 支持
  • Enhanced .clangd with multi-directory fallback (build/dev, build/release) | 增强 .clangd 配置,支持多目录 fallback(build/dev、build/release)
  • Updated GitHub repository metadata (description, topics, homepage) | 更新 GitHub 仓库元数据(描述、标签、主页)

3.0.0 - 2026-04-16 | v3.0.0 - 2026年4月16日

Changed | 变更

  • Documentation Reconstruction | 文档重构
    • Reorganized docs/ into bilingual structure (en/, zh/) | 将 docs/ 重组为双语结构(en/、zh/)
    • Professional documentation landing page | 专业的文档首页
    • Comprehensive bilingual navigation | 全面的双语导航

2.0.0 - 2026-03-09 | v2.0.0 - 2026年3月9日

Fixed | 修复

  • MemoryPool lifecycle bug (Critical) | MemoryPool 生命周期错误(严重)

    • clear() was erasing tracking for in-use blocks | clear() 删除了正在使用块的跟踪
    • deallocate() left stale entries | deallocate() 留下过期条目
    • Added freed_sizes_ map for proper pool state management | 添加 freed_sizes_ 映射以正确管理池状态
  • atomicMin/atomicMax for negative floats (Critical) | 负浮点数的 atomicMin/atomicMax(严重)

    • compute_quant_params_kernel gave incorrect results for negative values | compute_quant_params_kernel 对负值给出错误结果
    • Replaced with CAS-based atomic float min/max | 替换为基于 CAS 的原子浮点 min/max
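
The CAS-based replacement noted above can be illustrated on the host with std::atomic<float>. This is a minimal sketch, not the project's kernel code: the names atomic_min_float, atomic_max_float, float_bits, and reduce_min are hypothetical, and the changelog does not state the exact cause of the original bug. The float_bits helper shows one common pitfall that would produce the reported symptom: when float bit patterns are compared as signed integers, negative values order in reverse.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

// Hypothetical helper (not the project's API): atomically lowers `target`
// to `value` when `value` is smaller, using a compare-and-swap retry loop.
inline void atomic_min_float(std::atomic<float>& target, float value) {
    float observed = target.load(std::memory_order_relaxed);
    while (value < observed &&
           !target.compare_exchange_weak(observed, value,
                                         std::memory_order_relaxed)) {
        // On failure, compare_exchange_weak reloads `observed`; the loop
        // condition then re-checks whether an update is still needed.
    }
}

// Hypothetical helper: same pattern for the maximum.
inline void atomic_max_float(std::atomic<float>& target, float value) {
    float observed = target.load(std::memory_order_relaxed);
    while (value > observed &&
           !target.compare_exchange_weak(observed, value,
                                         std::memory_order_relaxed)) {
    }
}

// Reinterprets a float's bit pattern as a signed 32-bit integer, to show
// why integer min/max on raw bits misorders negative floats.
inline std::int32_t float_bits(float x) {
    static_assert(sizeof(float) == sizeof(std::int32_t),
                  "sketch assumes a 32-bit float");
    std::int32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    return bits;
}

// Sequential demo: folds every element into a shared minimum. In a kernel,
// many threads would call atomic_min_float on the same target concurrently.
inline float reduce_min(const std::vector<float>& values) {
    std::atomic<float> result{std::numeric_limits<float>::infinity()};
    for (float v : values) {
        atomic_min_float(result, v);
    }
    return result.load();
}
```

Because the loop compares the values as floats rather than as raw bits, the sign of the operands does not matter. The device-side equivalent would typically wrap atomicCAS on the reinterpreted 32-bit word around the same comparison.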

Added | 新增

  • core/warp_utils.hpp: Shared warp-level reduction primitives | 共享线程束级归约原语
    • warp_reduce_max/sum/min | warp_reduce_max/sum/min
    • warp_broadcast | warp_broadcast
    • block_reduce_sum/max | block_reduce_sum/max
  • detail::fill_kernel: GPU-side fill kernel for Tensor::fill | 用于 Tensor::fill 的 GPU 端填充内核
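
The warp-level reductions listed for core/warp_utils.hpp are commonly built on a shuffle-down tree. The changelog does not show the implementation, so the following is only a host-side emulation of that pattern under the assumption of a 32-lane warp. Warp, shuffle_down, warp_reduce_sum_emulated, and warp_reduce_max_emulated are hypothetical names used for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;  // assumed warp width
using Warp = std::array<float, kWarpSize>;

// Emulates one shuffle-down step: every lane reads the value held by
// lane + offset. Lanes whose source would be out of range keep their own
// value, which mirrors the behaviour of CUDA's __shfl_down_sync.
inline Warp shuffle_down(const Warp& lanes, std::size_t offset) {
    Warp out = lanes;
    for (std::size_t lane = 0; lane + offset < kWarpSize; ++lane) {
        out[lane] = lanes[lane + offset];
    }
    return out;
}

// Tree reduction: halve the offset each step; lane 0 ends with the sum.
inline float warp_reduce_sum_emulated(Warp lanes) {
    for (std::size_t offset = kWarpSize / 2; offset > 0; offset /= 2) {
        const Warp shifted = shuffle_down(lanes, offset);
        for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
            lanes[lane] += shifted[lane];
        }
    }
    return lanes[0];
}

// Same tree, combining with max instead of addition.
inline float warp_reduce_max_emulated(Warp lanes) {
    for (std::size_t offset = kWarpSize / 2; offset > 0; offset /= 2) {
        const Warp shifted = shuffle_down(lanes, offset);
        for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
            lanes[lane] = std::max(lanes[lane], shifted[lane]);
        }
    }
    return lanes[0];
}
```

Only lane 0 is guaranteed to hold the full result after the five steps. Upper lanes accumulate partial or duplicated values, which is why a broadcast step is usually needed when every lane requires the result.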

Changed | 变更

  • FlashAttention kernel rewrite | FlashAttention 内核重写
    • Moved output accumulator from per-thread registers to shared memory | 将输出累加器从每线程寄存器移动到共享内存
    • Reduced register pressure from 256 bytes/thread | 将寄存器压力从 256 字节/线程降低
    • Cooperative tile loading | 协作瓦片加载
    • Reduced default block sizes | 减少默认块大小
  • normalization.hpp no longer depends on softmax.hpp | normalization.hpp 不再依赖 softmax.hpp
  • Tensor::fill now uses a GPU kernel instead of host-memory roundtrip | Tensor::fill 现在使用 GPU 内核而非主机内存往返

1.1.0 - 2026-01-08 | v1.1.0 - 2026年1月8日

Fixed | 修复

  • Python bindings CMake configuration | Python 绑定的 CMake 配置
    • Fixed src/python_ops/CMakeLists.txt referencing non-existent source files | 修复 src/python_ops/CMakeLists.txt 引用不存在的源文件
    • tensor_bindings.cpp / kernel_bindings.cpp | tensor_bindings.cpp / kernel_bindings.cpp
  • CUDA-optional builds | 可选 CUDA 构建
    • CMake now gracefully handles environments without CUDA Toolkit | CMake 现在优雅处理没有 CUDA Toolkit 的环境
    • Tests, benchmarks, and Python bindings are disabled automatically when CUDA is unavailable | 未检测到 CUDA 时自动禁用测试、基准和 Python 绑定

1.0.1 - 2025-02-13 | v1.0.1 - 2025年2月13日

Added | 新增

  • Project infrastructure files | 项目基础设施文件
    • .gitignore for CUDA/Python/IDE rules | .gitignore 用于 CUDA/Python/IDE 规则
    • .editorconfig for unified code formatting | .editorconfig 用于统一代码格式
  • Standardized badges in README | README 中的标准化徽章
    • License, CUDA, C++17/20, CMake, Python | 许可证、CUDA、C++17/20、CMake、Python

Changed | 变更

  • Changelog files restructured into changelog/ directory | 变更日志文件重组到 changelog/ 目录

1.0.0 - 2024-01-01 | v1.0.0 - 2024年1月1日

Added | 新增

  • GEMM Kernels | GEMM 内核

    • Naive GEMM implementation | 朴素 GEMM 实现
    • Tiled GEMM with shared memory optimization | 带共享内存优化的平铺 GEMM
    • Double-buffered GEMM for latency hiding | 用于延迟隐藏的双缓冲 GEMM
    • Tensor Core GEMM using WMMA API (CUDA 11.0+) | 使用 WMMA API 的张量核心 GEMM(CUDA 11.0+)
  • Attention Kernels | 注意力内核

    • FlashAttention-style fused attention kernel | FlashAttention 风格的融合注意力内核
    • Memory-efficient attention computation | 内存高效注意力计算
    • RoPE (Rotary Positional Embeddings) kernel | RoPE(旋转位置嵌入)内核
    • Simplified PagedAttention kernel | 简化版 PagedAttention 内核
    • MoE (Mixture of Experts) router kernel | MoE(专家混合)路由器内核
  • Normalization Kernels | 归一化内核

    • LayerNorm implementation | LayerNorm 实现
    • RMSNorm implementation | RMSNorm 实现
    • BatchNorm implementation | BatchNorm 实现
    • Softmax with online algorithm | 使用在线算法的 Softmax
  • Convolution Kernels | 卷积内核

    • Naive 2D convolution | 朴素二维卷积
    • Im2Col-based convolution | 基于 Im2Col 的卷积
    • Depthwise separable convolution | 深度可分离卷积
  • Sparse Operations | 稀疏操作

    • CSR and CSC sparse matrix formats | CSR 和 CSC 稀疏矩阵格式
    • Sparse Matrix-Vector multiplication (SpMV) | 稀疏矩阵-向量乘法 (SpMV)
    • Sparse Matrix-Matrix multiplication (SpMM) | 稀疏矩阵-矩阵乘法 (SpMM)
  • Elementwise Operations | 逐元素操作

    • Fused elementwise kernel support | 融合逐元素内核支持
    • Common activation functions (ReLU, GELU, SiLU, LeakyReLU, ELU, Swish) | 常用激活函数(ReLU、GELU、SiLU、LeakyReLU、ELU、Swish)
  • Operator Fusion & Quantization | 算子融合与量化

    • Fused Bias+GELU epilogue | 融合的 Bias+GELU 尾处理(epilogue)
    • INT8 quantization support | INT8 量化支持
    • FP8 quantization support (CUDA 12.0+) | FP8 量化支持(CUDA 12.0+)
  • Memory Management | 内存管理

    • Memory pool for efficient GPU memory allocation | 用于高效 GPU 内存分配的内存池
    • Aligned vector for CPU-side data | 用于 CPU 端数据的对齐向量
    • Tensor abstraction with automatic memory management | 自动内存管理的张量抽象
  • Python Bindings | Python 绑定

    • pybind11-based Python interface | 基于 pybind11 的 Python 接口
    • NumPy array interoperability | NumPy 数组互操作性
  • Testing | 测试

    • Unit tests for all kernel implementations | 所有内核实现的单元测试
    • Correctness validation against reference implementations | 与参考实现的正确性验证
  • Benchmarks | 基准测试

    • GEMM performance benchmarks | GEMM 性能基准
    • Attention kernel benchmarks | 注意力内核基准
    • Convolution benchmarks | 卷积基准
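
The "Softmax with online algorithm" entry under Normalization Kernels refers to a single-pass scheme that tracks a running maximum and a rescaled running sum together. A minimal host-side sketch follows, assuming finite inputs. online_softmax is a hypothetical name and this is not the project's kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical reference: numerically stable softmax whose normaliser is
// computed in one pass over the input (assumes all inputs are finite).
inline std::vector<float> online_softmax(const std::vector<float>& x) {
    float running_max = -std::numeric_limits<float>::infinity();
    float running_sum = 0.0f;
    for (float v : x) {
        const float new_max = std::max(running_max, v);
        // Rescale the sum accumulated so far to the new maximum, then add
        // the contribution of the current element.
        running_sum = running_sum * std::exp(running_max - new_max) +
                      std::exp(v - new_max);
        running_max = new_max;
    }
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::exp(x[i] - running_max) / running_sum;
    }
    return out;
}
```

The rescaling step is what lets the maximum and the sum share one loop. A conventional implementation needs a separate pass to find the maximum before it can accumulate the exponentials.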

Dependencies | 依赖

  • CUDA Toolkit 11.0+ (12.x recommended) | CUDA Toolkit 11.0+(推荐 12.x)
  • CMake 3.20+ | CMake 3.20+
  • C++17 compatible compiler | 兼容 C++17 的编译器
  • pybind11 (for Python bindings) | pybind11(用于 Python 绑定)

Version History Summary | 版本历史汇总

Version | Date | Description | 描述
3.0.0 | 2026-04-16 | Bilingual documentation, CHANGELOG professionalization | 双语文档,CHANGELOG 专业化
2.0.0 | 2026-03-09 | Critical bug fixes, architecture improvements | 关键错误修复,架构改进
1.1.0 | 2026-01-08 | Build system fixes for CUDA-optional environments | 可选 CUDA 环境的构建系统修复
1.0.1 | 2025-02-13 | Project infrastructure improvements | 项目基础设施改进
1.0.0 | 2024-01-01 | Initial release | 初始发布

Released under the Apache 2.0 License.