Implementing Google’s SentencePiece in C with aggressive optimizations
Tiktoken with ARM64 SIMD
Implementing Tiktoken in C, with SIMD.
Understanding the CUDA Compiler & PTX with a Top-K Kernel
A step-by-step tutorial on building a production Top-K CUDA kernel.
PyTorch Op Registration: Schema Notation to C++ (libtorch)
Concise mapping of PyTorch op schema notation to idiomatic C++ signatures.
My Journey through Vulkan, Part I
Learning Vulkan compute shaders the hard way: from assuming ‘device and queue = basically home’ to understanding why the compiler won’t hold your hand.