Technical

Hitting 1,000 tokens per second on a single RTX 5090

Aggressive decode optimizations for Qwen3-0.6B on an RTX 5090 GPU

SentencePiece with ARM64 SIMD

Implementing Google’s SentencePiece in C with aggressive optimizations

Tiktoken with ARM64 SIMD

Implementing Tiktoken in C with ARM64 SIMD

Understanding the CUDA Compiler & PTX with a Top-K Kernel

A step-by-step tutorial on building a production Top-K CUDA kernel.

PyTorch Op Registration: Schema Notation to C++ (libtorch)

Concise mapping of PyTorch op schema notation to idiomatic C++ signatures.

My Journey through Vulkan, Part I

Learning Vulkan compute shaders the hard way: from assuming ‘device and queue = basically home’ to understanding why the compiler won’t hold your hand.