Tokenizer Formats and Required Files
`detect_tokenizer_format(path)` and `load_tokenizer(path; format=:auto)` use the same detection rules. You can always override detection with `format=...`.
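For example (a minimal sketch; the placeholder paths are illustrative and this assumes the functions above are exported from the KeemenaSubwords module):

```julia
using KeemenaSubwords

# Ask the detector what it would infer for a file or directory.
fmt = detect_tokenizer_format("/path/to/model_dir")      # e.g. :bpe_gpt2

# Auto-detection is the default; pass `format=...` to force a specific loader.
tok_auto   = load_tokenizer("/path/to/model_dir")                         # uses format=:auto
tok_forced = load_tokenizer("/path/to/tokenizer.model"; format=:tiktoken) # override detection
```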
:hf_tokenizer_json
- Format symbol: `:hf_tokenizer_json`
- Accepted input: file or directory
- Required files: `tokenizer.json`
- Canonical named spec key: `path` (alias: `tokenizer_json`)
- Recommended loader: `load_hf_tokenizer_json("/path/to/tokenizer.json")`
- Offset behavior: depends on pipeline components. If `ByteLevel` is configured, apply byte-level offset caveats.
:bpe_gpt2
- Format symbol: `:bpe_gpt2`
- Accepted input: file pair or directory
- Required files: `vocab.json` + `merges.txt`
- Canonical named spec keys: `vocab_json`, `merges_txt`
- Recommended loader: `load_bpe_gpt2("/path/to/vocab.json", "/path/to/merges.txt")`
- Offset behavior: byte-level family. Offsets are codeunit spans and may not always be safe Julia string slice boundaries on multibyte text.
:bpe_encoder
- Format symbol: `:bpe_encoder`
- Accepted input: file pair or directory
- Required files: `encoder.json` + `vocab.bpe`
- Canonical named spec keys: `encoder_json`, `vocab_bpe`
- Recommended loader: `load_bpe_encoder("/path/to/encoder.json", "/path/to/vocab.bpe")`
- Offset behavior: byte-level family with the same slicing caveat as `:bpe_gpt2`.
:bpe
- Format symbol: `:bpe`
- Accepted input: directory or `vocab.txt` file
- Required files: `vocab.txt` + `merges.txt`
- Canonical named spec keys: `vocab`, `merges` (tuple/spec)
- Recommended loader: `load_bpe("/path/to/model_dir")`
- Offset behavior: non-byte-level in standard usage; spanful offsets are expected to be string-safe.
:bytebpe
- Format symbol: `:bytebpe`
- Accepted input: directory or file pair
- Required files: `vocab.txt` + `merges.txt`
- Canonical named spec keys: `vocab`, `merges` (tuple/spec)
- Recommended loader: `load_bytebpe("/path/to/model_dir")`
- Offset behavior: byte-level family. Use `try_span_substring` with a `span_codeunits` fallback for non-boundary spans (see the sketch after this list).
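The sketch below uses only the Julia standard library (not the `try_span_substring`/`span_codeunits` helpers themselves) to show why a byte-level span may not be a valid string slice and needs a codeunit fallback:

```julia
text = "café au lait"        # 'é' occupies codeunits 4-5
span = (4, 5)                # hypothetical 1-based, half-open codeunit span: codeunit 4 only

# A span is a safe string slice only when both ends land on character boundaries.
starts_ok = isvalid(text, span[1])
ends_ok   = span[2] > ncodeunits(text) || isvalid(text, span[2])

piece = if starts_ok && ends_ok
    SubString(text, span[1], prevind(text, span[2]))   # character-safe substring
else
    codeunits(text)[span[1]:span[2] - 1]                # byte-exact fallback
end
```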
:wordpiece / :wordpiece_vocab
- Format symbols: `:wordpiece`, `:wordpiece_vocab`
- Accepted input: file or directory
- Required files: `vocab.txt`
- Canonical named spec key: `path` (alias: `vocab_txt`)
- Recommended loader: `load_wordpiece("/path/to/vocab.txt")`
- Offset behavior: non-byte-level; spanful offsets are expected to be string-safe in normal usage.
:sentencepiece_model
- Format symbol: `:sentencepiece_model`
- Accepted input: file or directory
- Required files: standard SentencePiece binary `.model`/`.model.v3`, or Keemena text-exported model files (`spm.model`, `spiece.model`, `tokenizer.model`, `tokenizer.model.v3`, `sentencepiece.bpe.model`)
- Canonical named spec key: `path` (alias: `model_file`)
- Recommended loader: `load_sentencepiece("/path/to/tokenizer.model"; kind=:auto)`
- Offset behavior: usually non-byte-level. Spanful offsets are expected to be string-safe for standard SentencePiece pipelines.
:tiktoken
- Format symbol: `:tiktoken`
- Accepted input: file or directory
- Required files: `*.tiktoken` or a tiktoken-text `tokenizer.model`
- Canonical named spec key: `path` (alias: `encoding_file`)
- Recommended loader: `load_tiktoken("/path/to/o200k_base.tiktoken")`
- Offset behavior: byte-level family with non-boundary span cases on multibyte text.
:unigram / :unigram_tsv
- Format symbols: `:unigram`, `:unigram_tsv`
- Accepted input: file or directory
- Required files: `unigram.tsv`
- Canonical named spec key: `path` (alias: `unigram_tsv`)
- Recommended loader: `load_unigram("/path/to/unigram.tsv")`
- Offset behavior: non-byte-level; spanful offsets are expected to be string-safe in normal usage.
Exporting Hugging Face tokenizer.json
Use the HF export target to write tokenizer.json from any supported KeemenaSubwords tokenizer:
```julia
export_tokenizer(tokenizer, "out_dir"; format=:hf_tokenizer_json)

# equivalent via save_tokenizer
save_tokenizer(tokenizer, "out_dir"; format=:hf_tokenizer_json)
```

Reload in Julia:

```julia
reloaded = load_tokenizer("out_dir"; format=:hf_tokenizer_json)
```

Load the same file in Python Fast tokenizers:

```python
from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast(tokenizer_file="out_dir/tokenizer.json")
```

Current scope note:

- `tokenizer_config.json` and `special_tokens_map.json` are not emitted yet.
- `TemplateProcessing` is emitted in canonical HF JSON shape (`single`/`pair` object items and an object-map `special_tokens`) for better external HF compatibility.
- Byte-level export writes explicit `ByteLevel` options (`add_prefix_space=false`, `trim_offsets=false`, `use_regex=false`) for Keemena ByteBPE interoperability, so Python/Rust HF loaders do not silently fall back to different defaults.
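A quick sanity check after exporting is to reload the artifact and compare encodings of a sample text. This is only a sketch: the `encode(tokenizer, text)` call used for the comparison is an assumption, not an API documented in this section.

```julia
mktempdir() do dir
    export_tokenizer(tokenizer, dir; format=:hf_tokenizer_json)
    reloaded = load_tokenizer(dir; format=:hf_tokenizer_json)

    sample = "Exported and reloaded tokenizers should agree: café, 東京."
    # `encode(tok, text)` is assumed here; substitute the package's actual encode entry point.
    @assert encode(tokenizer, sample) == encode(reloaded, sample)
end
```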
BERT Components in tokenizer.json
KeemenaSubwords now supports the common Hugging Face BERT pipeline components:
- `normalizer.type = "BertNormalizer"`
- `pre_tokenizer.type = "BertPreTokenizer"`
Offsets for these pipelines follow the same package contract:
- offsets are computed against `tokenization_view(tokenizer, text)` (the tokenizer-normalized text),
- inserted post-processor specials keep the sentinel `(0, 0)`,
- spanful offsets remain 1-based UTF-8 codeunit half-open spans.
The KeemenaPreprocessing integration remains: `tokenization_text = tokenization_view(tokenizer, clean_text)`, then `encode_result(tokenizer, tokenization_text; assume_normalized=true, return_offsets=true, return_masks=true)`.
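A sketch of that hand-off (assuming `clean_text` is the cleaned text produced upstream by KeemenaPreprocessing):

```julia
# Normalize against the tokenizer's own view of the text, then encode without re-normalizing.
tokenization_text = tokenization_view(tokenizer, clean_text)

result = encode_result(tokenizer, tokenization_text;
                       assume_normalized = true,
                       return_offsets    = true,
                       return_masks      = true)
```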
Detection Notes
- Directory preference order:
  - `tokenizer.json`
  - `vocab.json` + `merges.txt`
  - `encoder.json` + `vocab.bpe`
  - SentencePiece model filenames
  - `*.tiktoken`
- `.model` files are sniffed:
  - tiktoken-like text (`<base64> <int>`) => `:tiktoken`
  - binary SentencePiece protobuf payload or Keemena text SentencePiece payload => `:sentencepiece_model`
- LLaMA3-style files often use `tokenizer.model` containing tiktoken text. Use an explicit override when needed.

```julia
load_tokenizer("/path/to/tokenizer.model"; format=:tiktoken)
load_tokenizer("/path/to/tokenizer.model"; format=:sentencepiece_model)
```

Byte-level behavior
- Byte-level tokenizers:
  - `:bytebpe`
  - `:bpe_gpt2` / `:bpe_encoder`
  - `:tiktoken`
  - some `:hf_tokenizer_json` pipelines when `ByteLevel` is configured.
- Non-byte-level tokenizers:
  - `:wordpiece`
  - `:sentencepiece_model`
  - `:unigram`.
Round-trip expectations:
- Byte-level families are generally robust for arbitrary UTF-8 input and usually satisfy stable `decode(encode(text))`.
- WordPiece/SentencePiece/Unigram operate on learned subword vocabularies; they are deterministic, but unknown-token fallback can reduce exact round-trip fidelity depending on vocab coverage.
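As an illustration of these expectations (a sketch only: the tokenizer variables are hypothetical, and the `encode`/`decode` calls assume tokenizer-first methods matching the shorthand above):

```julia
text = "mixed-script input: naïve, 東京, 🚀"

# Byte-level families (e.g. :bytebpe, :tiktoken) are expected to reproduce arbitrary UTF-8 exactly.
@assert decode(bytebpe_tok, encode(bytebpe_tok, text)) == text

# Vocab-based families (e.g. :wordpiece) may map uncovered characters to an unknown token,
# so the decoded text is not guaranteed to match the input exactly.
roundtrip = decode(wordpiece_tok, encode(wordpiece_tok, text))
roundtrip == text || @info "lossy round trip due to vocabulary coverage"
```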