Performance Comparison


After implementing tiktoken in C with SIMD, the next target is Google’s SentencePiece. It powers T5, Gemma, Llama 1 and 2, and several other modern open-source LLMs.

Tiktoken wasn’t that hard, but this one was harder. Much harder.

Results

Text Type         Python sentencepiece   Our C + SIMD   Speedup
Short English     1.23 µs                0.12 µs        10.3x
Medium English    2.87 µs                0.41 µs        7.0x
Long English      7.59 µs                1.21 µs        6.3x
Code              6.29 µs                0.65 µs        9.7x

That’s 108 million characters per second for English text. CJK text is even faster at 595 MB/s for Japanese, because most characters map to <unk> tokens, allowing early termination.

I reckon half the performance issues upstream come from Python itself, but that’s neither here nor there.


Part 1: Understanding SentencePiece

SentencePiece is fundamentally different from tiktoken. Instead of regex-based pre-tokenization followed by BPE, its default Unigram mode (the one T5 ships with, and the one implemented here) uses a Unigram language model and the Viterbi algorithm.

The Unigram Model

Each vocabulary piece has a probability score (stored as log probability). Tokenization finds the most likely sequence of pieces:

P("hello world") = P("▁hello") × P("▁world")
                 = exp(-5.2) × exp(-4.8)

The ▁ character (Unicode U+2581) marks word boundaries, essentially replacing spaces (assuming space-normalized input as per typical usage). Because scores are stored as log probabilities, multiplying piece probabilities amounts to adding their scores, which is exactly what the Viterbi recurrence below does.

The Viterbi Algorithm

Given a string, we build a “lattice” of all possible tokenizations and find the path with the highest total score:

"hello" at each position:
  pos 0: "▁h" (score -8.1), "▁he" (score -6.2), "▁hel" (score -7.5), "▁hello" (score -5.2)
  pos 1: "e" (score -4.1), "el" (score -5.8), "ell" (score -6.2), "ello" (score -5.9)
  ...

The Viterbi algorithm finds the globally optimal path through dynamic programming:

// best_score[] is initialised to -INFINITY, except best_score[0] = 0
for (int pos = 0; pos < num_chars; pos++) {
    for each piece starting at pos {
        int end = pos + piece_length;
        float new_score = best_score[pos] + piece_score;
        if (new_score > best_score[end]) {
            best_score[end] = new_score;
            best_prev[end] = pos;
        }
    }
}
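
The forward pass only fills in scores and back-pointers; recovering the actual tokens is a short backward walk. A minimal sketch of that step, assuming a best_id[] array that records which piece was chosen at each end position (such arrays appear in Part 5; the function and parameter names here are illustrative):

// Walk the back-pointers from the end of the string to position 0,
// collecting the chosen piece at each step, then reverse into
// left-to-right order.
static int viterbi_backtrack(const int *best_prev, const int *best_id,
                             int num_chars, uint32_t *out_ids) {
    int count = 0;
    for (int pos = num_chars; pos > 0; pos = best_prev[pos]) {
        out_ids[count++] = (uint32_t)best_id[pos];   // piece ending at pos
    }
    // Reverse in place: we collected the pieces back-to-front.
    for (int i = 0, j = count - 1; i < j; i++, j--) {
        uint32_t tmp = out_ids[i];
        out_ids[i] = out_ids[j];
        out_ids[j] = tmp;
    }
    return count;  // number of tokens emitted
}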

The critical question: how do we efficiently find “all pieces starting at pos”?


Part 2: The Naive Implementation

My first implementation was:

  1. Parse the protobuf model file
  2. Store vocabulary in a hash table
  3. At each position, try every possible substring and look it up:

for (int pos = 0; pos < len; pos++) {
    for (int end = pos + 1; end <= len && end - pos <= MAX_PIECE_LEN; end++) {
        int id = lookup_piece(text + pos, end - pos);  // Hash lookup
        if (id >= 0) {
            update_viterbi(pos, end, piece_score[id]);
        }
    }
}

This is O(n × max_piece_len × hash_lookup). For a 32K vocabulary with average piece length of 4 bytes, we’re doing 128 hash lookups per character; hardly ideal.

Initial performance was 45 µs for a 44-character sentence. That’s 1 million chars/sec; very slow.


Part 3: The Trie

The first major insight was that we don’t need to try every substring. We need to find all vocab pieces that are prefixes of the remaining text.

A trie (prefix tree) does exactly this. One traversal finds all matches:

typedef struct {
    int32_t children[256];  // One child per byte value
    int32_t piece_id;       // -1 if not a complete piece
    float score;
} TrieNode;
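
Construction of this trie isn’t shown above; here is a minimal insertion sketch against that node layout (capacity management and the helper names are assumptions):

#include <string.h>  /* memset */

// Insert one vocabulary piece, creating nodes along the way.
// New nodes start with all children set to -1 and piece_id = -1.
static void trie_insert(TrieNode *nodes, int32_t *num_nodes,
                        const uint8_t *piece, int len,
                        int32_t piece_id, float score) {
    int32_t node = 0;  // root is node 0
    for (int i = 0; i < len; i++) {
        uint8_t c = piece[i];
        if (nodes[node].children[c] < 0) {
            int32_t fresh = (*num_nodes)++;  // assumes enough capacity was reserved
            memset(nodes[fresh].children, 0xFF, sizeof(nodes[fresh].children));  // all -1
            nodes[fresh].piece_id = -1;
            nodes[fresh].score = 0.0f;
            nodes[node].children[c] = fresh;
        }
        node = nodes[node].children[c];
    }
    nodes[node].piece_id = piece_id;  // mark a complete piece
    nodes[node].score = score;
}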

Now the inner loop becomes:

int32_t node = 0;  // start at the trie root
// `start` is the byte offset of the current Viterbi position `pos`;
// `end` is the byte length of the text.
for (int byte_pos = start; byte_pos < end; byte_pos++) {
    uint8_t c = text[byte_pos];
    node = trie[node].children[c];
    if (node < 0) break;  // No more prefixes
    
    if (trie[node].piece_id >= 0) {
        // Found a valid piece!
        update_viterbi(pos, byte_pos + 1, trie[node].score);
    }
}

One traversal, all matches found. Complexity drops to O(n × max_piece_len).

Result: 8.5 µs (5.3x faster)

But there’s a big problem: memory. Each node stores 256 int32_t child indices (1 KB) plus a piece id and score; with 80,000+ nodes, that’s roughly 83 MB just for the trie.


Part 4: The Double-Array Trie

Looking for a solution, I took another look at Google’s sentencepiece implementation. It appears to use a double-array trie for exactly this.

This was promising, and far better than what I had. Instead of storing 256 child indices per node, we use four parallel arrays:

typedef struct {
    int32_t *base;   // Base offset for children
    int32_t *check;  // Verifies parent relationship
    int32_t *value;  // Piece ID at this node
    float *score;    // Score at this node
} DoubleArrayTrie;

Navigation works like this:

int32_t next = base[node] + char_byte;
if (check[next] == node) {
    // Valid transition
    node = next;
} else {
    // Dead end
}
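
Plugging that transition into the Viterbi inner loop gives a common-prefix search over the double-array. A sketch against the struct above (variable names mirror the simple-trie loop from Part 3; `text_len` and the terminal-node convention are assumptions):

// Enumerate every vocabulary piece that is a prefix of text[start..text_len),
// following base/check transitions one byte at a time.
int32_t node = 0;  // root
for (int byte_pos = start; byte_pos < text_len; byte_pos++) {
    int32_t next = dat->base[node] + (uint8_t)text[byte_pos];
    // (bounds check of `next` against the array size omitted)
    if (dat->check[next] != node) break;   // no piece continues with this byte
    node = next;
    if (dat->value[node] >= 0) {
        // A complete piece ends at byte_pos: feed it to the Viterbi update.
        update_viterbi(pos, byte_pos + 1, dat->score[node]);
    }
}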

The trick is finding base values during construction such that children don’t collide. This requires very careful allocation:

static int32_t dat_find_base(DoubleArrayTrie *dat, uint8_t *labels, int count) {
    int32_t base = 1;
    while (1) {
        bool ok = true;
        for (int i = 0; i < count; i++) {
            int32_t pos = base + labels[i];
            // (array growth / bounds checks omitted for brevity)
            if (dat->check[pos] >= 0) {  // Position taken
                ok = false;
                break;
            }
        }
        if (ok) return base;
        base++;
    }
}

Critical bug I encountered: when building recursively, you must claim all child positions before recursing into any of them. Otherwise, grandchildren can steal positions meant for siblings.

// WRONG: claim and recurse one at a time
for (int i = 0; i < count; i++) {
    check[base + labels[i]] = node;
    build_recursive(children[i]);  // May steal sibling positions!
}

// CORRECT: claim all, then recurse
for (int i = 0; i < count; i++) {
    check[base + labels[i]] = node;
}
for (int i = 0; i < count; i++) {
    build_recursive(children[i]);
}

Memory Reduction

Result: 2 MB of memory (a 41x reduction), with almost identical performance (about 1 ns versus 1.9 ns).


Part 5: Eliminating Allocations

Profiling revealed another hotspot: malloc/free in the Viterbi loop. Each encode allocated three arrays:

float *best_scores = malloc(num_chars * sizeof(float));
int *best_prev = malloc(num_chars * sizeof(int));
int *best_id = malloc(num_chars * sizeof(int));
// ... encode ...
free(best_scores);
free(best_prev);
free(best_id);

Solution: stack allocation with a reasonable maximum.

#define VITERBI_MAX_CHARS 8192

float best_scores[VITERBI_MAX_CHARS];
int best_prev[VITERBI_MAX_CHARS];
int best_id[VITERBI_MAX_CHARS];

For texts longer than 8K characters (rare), we fall back to heap allocation; see the sketch below. Why is this rare? In my use case (and, I suspect, most others), we tokenize messages one at a time and sum the counts, optionally in batches. Messages rarely exceed 8K characters, and when they do you can either split them across multiple calls or just take the ~4 µs allocation hit.
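
A sketch of that fallback pattern, reusing the stack buffers above and a one-shot heap allocation for oversized inputs (not necessarily how the real code is structured):

// Prefer the fixed stack buffers; fall back to a single heap allocation
// only when the input exceeds VITERBI_MAX_CHARS.
float *scores = best_scores;
int   *prev   = best_prev;
int   *ids    = best_id;
int    on_heap = num_chars > VITERBI_MAX_CHARS;

if (on_heap) {
    scores = malloc(num_chars * sizeof(float));
    prev   = malloc(num_chars * sizeof(int));
    ids    = malloc(num_chars * sizeof(int));
}

/* ... Viterbi forward pass and backtracking using scores/prev/ids ... */

if (on_heap) {
    free(scores);
    free(prev);
    free(ids);
}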

Result: 4.2 µs (2x faster from this alone)


Part 6: First-Byte Pruning

Not all bytes can start long pieces. If the longest piece starting with byte 0xE3 is 6 bytes, there’s no point checking beyond that:

uint8_t first_byte_max_len[256];  // Precomputed during model load

for (int pos = 0; pos < num_chars; pos++) {
    uint8_t first = text[byte_positions[pos]];
    int max_len = first_byte_max_len[first];  // Early termination hint
    
    // Trie traversal with early exit
    int32_t node = 0;  // root of the trie
    for (int len = 0; len < max_len && node >= 0; len++) {
        // ...
    }
}
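
The table itself is cheap to build once at model-load time. A sketch, where piece_str, piece_len, and num_pieces are assumed names for the parsed vocabulary:

// For each possible first byte, record the length of the longest
// vocabulary piece starting with it; 0 means no piece starts there.
// Piece lengths fit in a byte: the longest pieces are a few dozen bytes.
memset(first_byte_max_len, 0, sizeof(first_byte_max_len));
for (int i = 0; i < num_pieces; i++) {
    if (piece_len[i] == 0) continue;
    uint8_t first = (uint8_t)piece_str[i][0];
    if (piece_len[i] > first_byte_max_len[first]) {
        first_byte_max_len[first] = (uint8_t)piece_len[i];
    }
}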

This helps most for CJK text, where many bytes have no valid pieces at all.

Result: 2.1 µs (another 2x)


Part 7: SIMD Hash

The hash table is still used for piece_to_id lookups during decoding. We reused the CRC32 approach from tiktoken:

_simd_hash_bytes_arm64:
    mov     w2, #0xFFFFFFFF
.loop8:
    cmp     x1, #8
    b.lt    .loop1
    ldr     x3, [x0], #8
    crc32x  w2, w2, x3      ; 8 bytes in one cycle
    sub     x1, x1, #8
    b       .loop8

This matters more during model loading (hashing 32K pieces) than encoding, but every microsecond counts.
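
For reference, roughly the same loop can be written with the ARM ACLE CRC32 intrinsics instead of hand-written assembly. This is a sketch, not the project’s code; the byte-at-a-time tail and the final inversion follow the usual CRC32 convention and are assumptions here:

#include <arm_acle.h>   // __crc32d / __crc32b (build with -march=armv8-a+crc)
#include <stdint.h>
#include <string.h>

static uint32_t crc32_hash(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    while (len >= 8) {
        uint64_t chunk;
        memcpy(&chunk, data, 8);      // unaligned-safe 8-byte load
        crc = __crc32d(crc, chunk);   // same instruction the assembly uses
        data += 8;
        len  -= 8;
    }
    while (len--) {
        crc = __crc32b(crc, *data++); // tail: one byte at a time
    }
    return ~crc;
}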


The Optimization Journey


Stage                     Latency (44 chars)   Speedup
Naive implementation      45.0 µs              1x
Hash table + SIMD hash    8.5 µs               5x
Stack allocation          4.2 µs               11x
First-byte pruning        2.1 µs               21x
Trie-based prefix match   0.45 µs              100x
Double-array trie         0.41 µs              110x

CJK Performance

Something unexpected: Japanese and Chinese text tokenize faster than English:

Language   Throughput   Why?
English    107 MB/s     Normal vocabulary coverage
Japanese   595 MB/s     Most chars → <unk>, fast trie termination
Chinese    535 MB/s     Same reason
Korean     402 MB/s     Same reason

The T5 vocabulary has limited CJK coverage. When a character isn’t in the vocabulary, the trie traversal terminates immediately and we emit <unk>. This is actually the best case for performance: no backtracking and no complex path-finding.
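
In the Viterbi loop this shows up as a simple fallback: if no piece starts at a position, the character is covered by a single <unk> token, typically with a strong penalty score so real pieces win whenever they exist. A sketch, where matched_any_piece, char_len (byte length of the current character), and unk_score are assumed names:

// No piece matched at `pos`: bridge the lattice with one <unk> token
// covering this character so the Viterbi path never gets stuck.
if (!matched_any_piece) {
    update_viterbi(pos, pos + char_len, unk_score);  // heavily penalised
}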


Complexity Analysis

Viterbi Complexity

The truly naive approach, checking every candidate substring against every vocabulary entry, has complexity O(n² × V), where V is the vocabulary size: for each of n positions there are O(n) candidate substrings, and each is compared against V pieces.

With the trie, complexity becomes O(n × L) where L is max piece length (typically 16-32 bytes). The trie traversal finds all matches in a single pass.

For a 1000-character input:

  • Naive: 1,000,000 × 32,000 = 32 billion operations
  • Trie: 1,000 × 32 = 32,000 operations

That’s a million-fold reduction.


Memory Layout

Component            Size
Vocabulary pieces    ~1.5 MB
Double-array trie    2.0 MB
Hash table           256 KB
Score array          128 KB
Total                ~4 MB

Compare that to the original 83 MB with the simple trie. Load time is under 4 seconds, most of it spent parsing the protobuf model file.


Try It Yourself

The implementation is part of sillytui:

#include "tokenizer/sentencepiece.h"

SentencePieceProcessor sp;
sp_init(&sp);
sp_load(&sp, "model.spiece");

uint32_t ids[1024];
int n = sp_encode(&sp, "Hello, world!", ids, 1024);
// n = 4, ids = [▁Hello, ,, ▁world, !]

char *text = sp_decode(&sp, ids, n);
// text = "Hello, world!"

sp_free(&sp);

SIMD is used automatically on ARM64. Other platforms get portable C fallbacks.

Benchmarks performed on Apple M4 Silicon. Performance may vary on other hardware.