Concepts

This page is a first-hour guide to the concepts you need for reliable tokenization and alignment workflows in KeemenaSubwords.

Where This Fits

Typical Julia LLM preprocessing split:

  • KeemenaPreprocessing: produces normalized clean_text.
  • KeemenaSubwords: turns text into token pieces and 1-based token ids.

Recommended integration flow:

  1. Obtain clean_text from your preprocessing pipeline.
  2. tokenization_text = tokenization_view(tokenizer, clean_text).
  3. encode_result(tokenizer, tokenization_text; assume_normalized=true, return_offsets=true, return_masks=true, ...).

Token Pieces Vs Token Ids

  • tokenize(tok, text) returns token pieces (Vector{String}).
  • encode(tok, text; ...) returns token ids (Vector{Int}).
  • decode(tok, ids) maps ids back to text.

using KeemenaSubwords

tok = load_tokenizer(:core_bpe_en)
text = "Hello world"

pieces = tokenize(tok, text)
ids = encode(tok, text; add_special_tokens=true)
decoded = decode(tok, ids)

(; pieces, ids, decoded)
(pieces = ["<unk>", "world</w>"], ids = [3, 1, 29, 4], decoded = "<unk>world")

pieces and ids differ in length here because encode was called with add_special_tokens=true, which frames the sequence with special ids, while tokenize returns only the raw pieces.

KeemenaSubwords uses 1-based token ids.

Convert to 0-based ids only when you need parity with external tooling:

ids_zero_based = ids .- 1        # 0-based ids for external tooling parity
ids_julia = ids_zero_based .+ 1  # back to KeemenaSubwords' 1-based ids

Tokenizer Families Supported

  • BPE (classic)
    • Typical format symbols: :bpe
    • Byte-level: no
    • Offset implication: spanful offsets are expected to be string-safe in normal usage.
  • ByteBPE
    • Typical format symbols: :bytebpe
    • Byte-level: yes
    • Offset implication: offsets are valid codeunit spans, but may not always be safe Julia string slice boundaries on multibyte text.
  • WordPiece
    • Typical format symbols: :wordpiece, :wordpiece_vocab
    • Byte-level: no
    • Offset implication: spanful offsets are expected to be string-safe in normal usage.
  • Unigram TSV
    • Typical format symbols: :unigram, :unigram_tsv
    • Byte-level: no
    • Offset implication: spanful offsets are expected to be string-safe in normal usage.
  • SentencePiece model
    • Typical format symbols: :sentencepiece_model
    • Byte-level: usually no
    • Offset implication: spanful offsets are expected to be string-safe for standard SentencePiece pipelines.
  • tiktoken
    • Typical format symbols: :tiktoken
    • Byte-level: yes
    • Offset implication: use the same byte-level caveat as ByteBPE.
  • HF tokenizer.json
    • Typical format symbols: :hf_tokenizer_json
    • Byte-level: depends on configured pipeline components
    • Offset implication: when ByteLevel components are present, apply the same byte-level caveat as ByteBPE.
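The byte-level caveat above can be checked in plain Julia: a half-open, 1-based codeunit span is a safe String slice boundary only if both of its ends land on character starts. A minimal sketch (the helper name span_is_string_safe is illustrative, not part of the package API):

```julia
# For byte-level tokenizers (ByteBPE, tiktoken, HF ByteLevel pipelines),
# a codeunit span can start or end inside a multibyte character. This
# pure-Julia check reports whether a half-open, 1-based codeunit span is
# also a safe String slice boundary.
function span_is_string_safe(text::AbstractString, span::Tuple{Int,Int})
    start, stop = span
    start == 0 && stop == 0 && return true   # (0, 0) sentinel carries no span
    ok(i) = i == ncodeunits(text) + 1 || isvalid(text, i)
    return ok(start) && ok(stop)
end

span_is_string_safe("Hello world", (7, 12))  # true: plain ASCII
span_is_string_safe("héllo", (2, 3))         # false: splits the 2-byte 'é'
```

When a span is not string-safe, decode the raw bytes instead of slicing the String directly: String(codeunits(text)[start:stop-1]) never throws, even on a span that splits a character.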

Special Tokens And add_special_tokens

Inspect special token mappings and common ids:

using KeemenaSubwords

tok = load_tokenizer(:core_wordpiece_en)

specials = special_tokens(tok)
ids = (
    bos = try bos_id(tok) catch; nothing end,
    eos = try eos_id(tok) catch; nothing end,
    pad = try pad_id(tok) catch; nothing end,
    unk = try unk_id(tok) catch; nothing end,
)

(; specials, ids)
(specials = Dict(:pad => 1, :sep => 4, :unk => 2, :mask => 5, :cls => 3), ids = (bos = nothing, eos = nothing, pad = 1, unk = 2))

add_special_tokens=true asks the tokenizer/post-processor to insert framework specials (for example BOS/EOS or CLS/SEP).

Offset behavior:

  • Inserted specials: special_tokens_mask[i] == 1 and offsets[i] == (0, 0).
  • Specials that appear in the input text as matched added tokens: special_tokens_mask[i] == 1, but offsets can still be real spans into the input text.
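As a sketch of how to tell these two cases apart, you can split special-token positions on the (0, 0) sentinel (split_specials is a hypothetical helper, not a package function):

```julia
# Split special-token positions into framework-inserted specials (sentinel
# offset (0, 0)) and specials matched as real spans in the input text.
function split_specials(special_tokens_mask, offsets)
    inserted, matched = Int[], Int[]
    for i in eachindex(special_tokens_mask)
        special_tokens_mask[i] == 1 || continue
        push!(offsets[i] == (0, 0) ? inserted : matched, i)
    end
    return (; inserted, matched)
end

# Using the mask and offsets from the encode_result example later on
# this page: positions 1 and 4 were inserted, position 2 was matched.
split_specials([1, 1, 0, 1], [(0, 0), (1, 6), (7, 12), (0, 0)])
# (inserted = [1, 4], matched = [2])
```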

Structured Encoding And Offsets

Use encode_result when you need ids plus offsets and masks in one object (TokenizationResult).

using KeemenaSubwords

tok = load_tokenizer(:core_sentencepiece_unigram_en)
clean_text = "Hello world"
tokenization_text = tokenization_view(tok, clean_text)

result = encode_result(
    tok,
    tokenization_text;
    assume_normalized=true,
    add_special_tokens=true,
    return_offsets=true,
    return_masks=true,
)

(
    ids = result.ids,
    tokens = result.tokens,
    offsets = result.offsets,
    attention_mask = result.attention_mask,
    special_tokens_mask = result.special_tokens_mask,
    metadata = result.metadata,
)
(ids = [2, 1, 5, 3], tokens = ["<s>", "<unk>", "▁world", "</s>"], offsets = [(0, 0), (1, 6), (7, 12), (0, 0)], attention_mask = [1, 1, 1, 1], special_tokens_mask = [1, 1, 0, 1], metadata = (format = :sentencepiece, model_name = "core_sentencepiece_unigram_en", add_special_tokens = true, assume_normalized = true, offsets_coordinates = :utf8_codeunits, offsets_reference = :input_text))

High-level offset contract:

  • Coordinate system: UTF-8 codeunits.
  • Index base: 1.
  • Span style: half-open [start, stop).
  • Sentinel for no-span tokens: (0, 0).
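In plain Julia terms, a contract span (start, stop) maps to the inclusive codeunit range start:stop-1, and to (start - 1, stop - 1) when external tooling expects 0-based half-open spans. A small illustration using the "Hello world" example above:

```julia
text = "Hello world"
span = (7, 12)                       # the "▁world" token from the example above

r = span[1]:span[2]-1                # half-open → inclusive codeunit range 7:11
piece = String(codeunits(text)[r])   # "world"

zero_based = (span[1] - 1, span[2] - 1)  # (6, 11): 0-based, still half-open
```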

For the full contract and helper APIs, see Normalization and Offsets Contract. For worked alignment walkthroughs, see Offsets Alignment Examples. For batching and padding recipes, see Structured Outputs and Batching.

Recommended KeemenaPreprocessing alignment pattern:

tokenization_text = tokenization_view(tok, clean_text)
result = encode_result(
    tok,
    tokenization_text;
    assume_normalized=true,
    return_offsets=true,
    return_masks=true,
    add_special_tokens=true,
)
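Once you have the result, spanful tokens can be aligned back to tokenization_text by skipping the (0, 0) sentinel and slicing codeunits. A sketch over the documented token/offset vectors (aligned_pairs is an illustrative helper, not a package function):

```julia
# Pair each spanful token with the text its half-open codeunit span covers;
# tokens carrying the (0, 0) sentinel (inserted specials) are skipped.
function aligned_pairs(text, tokens, offsets)
    pairs = Tuple{String,String}[]
    for (t, span) in zip(tokens, offsets)
        span == (0, 0) && continue
        push!(pairs, (String(t), String(codeunits(text)[span[1]:span[2]-1])))
    end
    return pairs
end

# Data taken from the encode_result example above:
aligned_pairs("Hello world",
              ["<s>", "<unk>", "▁world", "</s>"],
              [(0, 0), (1, 6), (7, 12), (0, 0)])
# [("<unk>", "Hello"), ("▁world", "world")]
```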

Model Registry And Caching

Use the registry APIs to discover models and the cache APIs to avoid reloading tokenizers repeatedly:

available_models(shipped=true)
describe_model(:core_bpe_en)
prefetch_models([:core_bpe_en, :core_wordpiece_en, :core_sentencepiece_unigram_en])

tok = get_tokenizer_cached(:core_bpe_en)

# Clear long-lived cached tokenizer instances when needed
# (for example to release memory or force a fresh reload).
clear_tokenizer_cache!()

Loading And Exporting

Export APIs:

  • export_tokenizer(tokenizer, out_dir; format=...)
  • save_tokenizer(tokenizer, out_dir; format=...)

If you export with format=:hf_tokenizer_json, KeemenaSubwords writes tokenizer.json for HF-compatible fast tokenizer loading. Current scope details (for example companion config files) are documented in Tokenizer Formats and Required Files.

Placeholder examples below require local paths or gated access and are intentionally non-executable in docs:

# local path placeholder (non-executable)
tok = load_tokenizer("/path/to/tokenizer.model"; format=:tiktoken)

# gated install placeholder (non-executable)
install_model!(:llama3_8b_tokenizer; token=ENV["HF_TOKEN"])