Hitting 1,000 tokens per second on a single RTX 5090
Aggressive decode optimizations for Qwen3-0.6B on an RTX 5090 GPU
Implementing Google’s SentencePiece in C with aggressive optimizations
Implementing Tiktoken in C, with SIMD
A step-by-step tutorial on building a production Top-K CUDA kernel.
Concise mapping of PyTorch op schema notation to idiomatic C++ signatures.
Learning Vulkan compute shaders the hard way: from assuming ‘device and queue = basically home’ to understanding why the compiler won’t hold your hand.