# Concepts
This page is a first-hour guide to the concepts you need for reliable tokenization and alignment workflows in KeemenaSubwords.
## Where This Fits
Typical Julia LLM preprocessing split:
- `KeemenaPreprocessing` produces normalized `clean_text`.
- `KeemenaSubwords` turns text into token pieces and 1-based token ids.
Recommended integration flow:
1. `clean_text = ...` from your preprocessing pipeline.
2. `tokenization_text = tokenization_view(tokenizer, clean_text)`.
3. `encode_result(tokenizer, tokenization_text; assume_normalized=true, return_offsets=true, return_masks=true, ...)`.
## Token Pieces vs Token Ids
- `tokenize(tok, text)` returns token pieces (`Vector{String}`).
- `encode(tok, text; ...)` returns token ids (`Vector{Int}`).
- `decode(tok, ids)` maps ids back to text.
```julia
using KeemenaSubwords

tok = load_tokenizer(:core_bpe_en)
text = "Hello world"

pieces = tokenize(tok, text)
ids = encode(tok, text; add_special_tokens=true)
decoded = decode(tok, ids)

(; pieces, ids, decoded)
```

```
(pieces = ["<unk>", "world</w>"], ids = [3, 1, 29, 4], decoded = "<unk>world")
```

KeemenaSubwords uses 1-based token ids.
Convert to 0-based ids only when you need parity with external tooling:
```julia
ids_zero_based = ids .- 1
ids_julia = ids_zero_based .+ 1
```

## Tokenizer Families Supported
- BPE (classic)
  - Typical format symbols: `:bpe`
  - Byte-level: no
  - Offset implication: spanful offsets are expected to be string-safe in normal usage.
- ByteBPE
  - Typical format symbols: `:bytebpe`
  - Byte-level: yes
  - Offset implication: offsets are valid codeunit spans, but may not always be safe Julia string slice boundaries on multibyte text.
- WordPiece
  - Typical format symbols: `:wordpiece`, `:wordpiece_vocab`
  - Byte-level: no
  - Offset implication: spanful offsets are expected to be string-safe in normal usage.
- Unigram TSV
  - Typical format symbols: `:unigram`, `:unigram_tsv`
  - Byte-level: no
  - Offset implication: spanful offsets are expected to be string-safe in normal usage.
- SentencePiece model
  - Typical format symbols: `:sentencepiece_model`
  - Byte-level: usually no
  - Offset implication: spanful offsets are expected to be string-safe for standard SentencePiece pipelines.
- tiktoken
  - Typical format symbols: `:tiktoken`
  - Byte-level: yes
  - Offset implication: use the same byte-level caveat as ByteBPE.
- HF tokenizer.json
  - Typical format symbols: `:hf_tokenizer_json`
  - Byte-level: depends on configured pipeline components
  - Offset implication: when ByteLevel components are present, use byte-level offset caveats.
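The byte-level caveat above can be seen with plain Julia strings. This sketch uses only Base functions (no tokenizer involved): a codeunit span that ends inside a multibyte character is a legal byte span but not a legal string slice.

```julia
text = "héllo"              # 'é' occupies codeunits 2 and 3

# Codeunit 3 is the middle of a multibyte character, so it is not a
# valid string index; `text[1:3]` would throw a StringIndexError.
isvalid(text, 3)            # false

# Slicing at the byte level is always legal for codeunit spans:
bytes = codeunits(text)[1:3]    # the bytes of "hé"
```

This is why byte-level families (ByteBPE, tiktoken, byte-level HF pipelines) need byte slicing rather than string indexing when you consume their offsets.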
## Special Tokens and `add_special_tokens`
Inspect special token mappings and common ids:
```julia
using KeemenaSubwords

tok = load_tokenizer(:core_wordpiece_en)
specials = special_tokens(tok)
ids = (
    bos = try bos_id(tok) catch; nothing end,
    eos = try eos_id(tok) catch; nothing end,
    pad = try pad_id(tok) catch; nothing end,
    unk = try unk_id(tok) catch; nothing end,
)

(; specials, ids)
```

```
(specials = Dict(:pad => 1, :sep => 4, :unk => 2, :mask => 5, :cls => 3), ids = (bos = nothing, eos = nothing, pad = 1, unk = 2))
```

`add_special_tokens=true` asks the tokenizer/post-processor to insert framework specials (for example BOS/EOS or CLS/SEP).
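The effect can be sketched by encoding the same text with and without specials. This is a sketch, not package output: it assumes `add_special_tokens=false` is accepted, and which specials appear depends on the loaded model's post-processor.

```julia
using KeemenaSubwords

tok = load_tokenizer(:core_wordpiece_en)

ids_plain  = encode(tok, "Hello world"; add_special_tokens=false)
ids_framed = encode(tok, "Hello world"; add_special_tokens=true)

# For a WordPiece-style post-processor, ids_framed typically wraps
# ids_plain in the CLS and SEP ids inspected above.
```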
Offset behavior:
- Inserted specials:
special_tokens_mask[i] == 1andoffsets[i] == (0, 0). - Specials that appear in the input text as matched added tokens:
special_tokens_mask[i] == 1, but offsets can still be real spans into the input text.
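Combining the two rules, inserted specials can be told apart from matched specials by pairing the mask with the `(0, 0)` sentinel. A minimal sketch with illustrative mask and offset values for a four-token encoding:

```julia
# Illustrative values: token 2 is a special matched in the input
# (mask == 1 with a real span), tokens 1 and 4 were inserted.
offsets             = [(0, 0), (1, 6), (7, 12), (0, 0)]
special_tokens_mask = [1, 1, 0, 1]

# True only for specials the post-processor inserted.
inserted = [m == 1 && off == (0, 0)
            for (m, off) in zip(special_tokens_mask, offsets)]
# inserted == [true, false, false, true]
```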
## Structured Encoding and Offsets
Use `encode_result` when you need ids plus offsets and masks in one object (a `TokenizationResult`).
```julia
using KeemenaSubwords

tok = load_tokenizer(:core_sentencepiece_unigram_en)
clean_text = "Hello world"
tokenization_text = tokenization_view(tok, clean_text)

result = encode_result(
    tok,
    tokenization_text;
    assume_normalized=true,
    add_special_tokens=true,
    return_offsets=true,
    return_masks=true,
)

(
    ids = result.ids,
    tokens = result.tokens,
    offsets = result.offsets,
    attention_mask = result.attention_mask,
    special_tokens_mask = result.special_tokens_mask,
    metadata = result.metadata,
)
```

```
(ids = [2, 1, 5, 3], tokens = ["<s>", "<unk>", "▁world", "</s>"], offsets = [(0, 0), (1, 6), (7, 12), (0, 0)], attention_mask = [1, 1, 1, 1], special_tokens_mask = [1, 1, 0, 1], metadata = (format = :sentencepiece, model_name = "core_sentencepiece_unigram_en", add_special_tokens = true, assume_normalized = true, offsets_coordinates = :utf8_codeunits, offsets_reference = :input_text))
```

High-level offset contract:
- Coordinate system: UTF-8 codeunits.
- Index base: 1.
- Span style: half-open `[start, stop)`.
- Sentinel for no-span tokens: `(0, 0)`.
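Under this contract, recovering a token's surface text means slicing codeunits, not characters. A minimal sketch in plain Base Julia, using the `"Hello world"` span for `▁world` from the example above:

```julia
text = "Hello world"
span = (7, 12)          # half-open, 1-based UTF-8 codeunit span

start, stop = span
if span != (0, 0)       # skip the no-span sentinel
    # stop is exclusive, so slice codeunits start:stop-1.
    piece = String(codeunits(text)[start:stop-1])   # "world"
end
```

On ASCII text `text[start:stop-1]` gives the same result, but the codeunit form stays correct for byte-level tokenizers on multibyte input.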
For the full contract and helper APIs, see Normalization and Offsets Contract. For worked alignment walkthroughs, see Offsets Alignment Examples. For batching and padding recipes, see Structured Outputs and Batching.
Recommended KeemenaPreprocessing alignment pattern:
```julia
tokenization_text = tokenization_view(tok, clean_text)
result = encode_result(
    tok,
    tokenization_text;
    assume_normalized=true,
    return_offsets=true,
    return_masks=true,
    add_special_tokens=true,
)
```

## Model Registry and Caching
Use the registry APIs to discover models and the cache APIs to avoid reloading tokenizers repeatedly:
```julia
available_models(shipped=true)
describe_model(:core_bpe_en)
prefetch_models([:core_bpe_en, :core_wordpiece_en, :core_sentencepiece_unigram_en])

tok = get_tokenizer_cached(:core_bpe_en)

# Clear long-lived cached tokenizer instances when needed
# (for example to release memory or force a fresh reload).
clear_tokenizer_cache!()
```

## Loading and Exporting
Export APIs:

- `export_tokenizer(tokenizer, out_dir; format=...)`
- `save_tokenizer(tokenizer, out_dir; format=...)`
If you export with `format=:hf_tokenizer_json`, KeemenaSubwords writes `tokenizer.json` for HF-compatible fast tokenizer loading. Current scope details (for example companion config files) are documented in Tokenizer Formats and Required Files.
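A sketch of the export call above; the output directory name is illustrative:

```julia
using KeemenaSubwords

tok = load_tokenizer(:core_bpe_en)

# "exported_tokenizer" is an illustrative output directory.
export_tokenizer(tok, "exported_tokenizer"; format=:hf_tokenizer_json)
```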
Placeholder examples below require local paths or gated access and are intentionally non-executable in docs:
```julia
# local path placeholder (non-executable)
tok = load_tokenizer("/path/to/tokenizer.model"; format=:tiktoken)

# gated install placeholder (non-executable)
install_model!(:llama3_8b_tokenizer; token=ENV["HF_TOKEN"])
```