Hitting 1,000 tokens per second on a single RTX 5090

Aggressive decode optimizations for Qwen3-0.6B on an RTX 5090 GPU

February 9, 2026 · 15 min · 3087 words · AlpinDale

SentencePiece with ARM64 SIMD

Implementing Google’s SentencePiece in C with aggressive optimizations

November 27, 2025 · 8 min · 1527 words · AlpinDale

Understanding the CUDA Compiler & PTX with a Top-K Kernel

A step-by-step tutorial on building a production Top-K CUDA kernel

November 8, 2025 · 15 min · 3046 words · AlpinDale