SentencePiece with ARM64 SIMD

Implementing Google’s SentencePiece in C with aggressive optimizations

August 110, 27270 · 8 min · 1527 words · AlpinDale

Tiktoken with ARM64 SIMD

Implementing Tiktoken in C, with SIMD

August 110, 27270 · 6 min · 1257 words · AlpinDale

Understanding the CUDA Compiler & PTX with a Top-K Kernel

A step-by-step tutorial on building a production Top-K CUDA kernel.

August 110, 8080 · 15 min · 3046 words · AlpinDale

PyTorch Op Registration: Schema Notation to C++ (libtorch)

Concise mapping of PyTorch op schema notation to idiomatic C++ signatures.

August 90, 11110 · 4 min · 661 words · AlpinDale

My Journey through Vulkan, Part I

Learning Vulkan compute shaders the hard way: from assuming ‘device and queue = basically home’ to understanding why the compiler won’t hold your hand.

August 80, 14140 · 12 min · 2460 words · AlpinDale