Loading Tokenizers From Local Paths

Use the explicit loader functions when you already know which files and which format you have. Use load_tokenizer(path; format=:auto) only when you want the format to be detected automatically.

Named-spec convention:

  • use path as the canonical key for single-file formats;
  • keep the format-specific key pairs for multi-file formats (vocab_json + merges_txt, encoder_json + vocab_bpe);
  • the backward-compatible aliases (vocab_txt, model_file, encoding_file, tokenizer_json) are still accepted.
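
The convention above can be pictured as a small normalization step that renames a single-file alias to path and leaves multi-file specs alone. The sketch below is only an illustration of that idea: canonicalize_spec and PATH_ALIASES are hypothetical names, not part of the package, and the real loader may resolve aliases differently.

```julia
# Hypothetical alias list, copied from the convention above.
# The package's own alias handling may differ.
const PATH_ALIASES = (:vocab_txt, :model_file, :encoding_file, :tokenizer_json)

# Return a NamedTuple in which a single-file alias key is renamed to `path`.
# Specs that already use `path`, and multi-file specs, are returned unchanged.
function canonicalize_spec(spec::NamedTuple)
    haskey(spec, :path) && return spec
    for alias in PATH_ALIASES
        if haskey(spec, alias)
            kept = filter(k -> k != alias, keys(spec))   # all keys except the alias
            rest = NamedTuple{kept}(spec)                # select the remaining fields
            return merge(rest, (path = spec[alias],))    # append the canonical key
        end
    end
    return spec
end

legacy = (format = :wordpiece_vocab, vocab_txt = "/path/to/vocab.txt")
println(canonicalize_spec(legacy))
```

Writing new specs with path directly avoids relying on any such renaming.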

1) GPT-2 / RoBERTa style BPE (vocab.json + merges.txt)

tok = load_bpe_gpt2("/path/to/vocab.json", "/path/to/merges.txt")

# equivalent named spec
load_tokenizer((format=:bpe_gpt2, vocab_json="/path/to/vocab.json", merges_txt="/path/to/merges.txt"))

2) OpenAI encoder variant (encoder.json + vocab.bpe)

tok = load_bpe_encoder("/path/to/encoder.json", "/path/to/vocab.bpe")

# equivalent named spec
load_tokenizer((format=:bpe_encoder, encoder_json="/path/to/encoder.json", vocab_bpe="/path/to/vocab.bpe"))

3) Classic BPE / Byte-level BPE (vocab.txt + merges.txt)

classic = load_bpe("/path/to/model_dir")
byte_level = load_bytebpe("/path/to/model_dir")
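
Because these two loaders take a directory rather than individual files, it can help to confirm that both expected files are present before loading. The helper below is a hypothetical pre-flight check, not a package function; it assumes the file names are exactly vocab.txt and merges.txt, as in the heading above.

```julia
# Hypothetical helper, not part of the package: list the expected files
# that are absent from a model directory.
function missing_files(dir::AbstractString, required = ("vocab.txt", "merges.txt"))
    return [name for name in required if !isfile(joinpath(dir, name))]
end

# Demonstration on a temporary directory that contains only vocab.txt.
dir = mktempdir()
touch(joinpath(dir, "vocab.txt"))
println(missing_files(dir))
```

An empty result means both files were found and the directory can be passed to load_bpe or load_bytebpe.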

4) WordPiece (vocab.txt)

wp = load_wordpiece("/path/to/vocab.txt"; continuation_prefix="##")

# register via canonical key
register_local_model!(
    :my_wordpiece,
    (format=:wordpiece_vocab, path="/path/to/vocab.txt");
    description="local WordPiece",
)

5) SentencePiece (.model, .model.v3, sentencepiece.bpe.model)

load_sentencepiece accepts either:

  • standard SentencePiece binary model files,
  • or Keemena text-exported SentencePiece files (same filename patterns).

sp_auto = load_sentencepiece("/path/to/tokenizer.model"; kind=:auto)
sp_uni = load_sentencepiece("/path/to/spm.model"; kind=:unigram)
sp_bpe = load_sentencepiece("/path/to/tokenizer.model.v3"; kind=:bpe)

register_local_model!(:my_sp, (format=:sentencepiece_model, path="/path/to/tokenizer.model"))

6) tiktoken (*.tiktoken or text tokenizer.model)

tt = load_tiktoken("/path/to/o200k_base.tiktoken")
llama3_style = load_tokenizer("/path/to/tokenizer.model"; format=:tiktoken)

register_local_model!(:my_tiktoken, (format=:tiktoken, path="/path/to/tokenizer.model"))

7) Hugging Face tokenizer.json

hf = load_hf_tokenizer_json("/path/to/tokenizer.json")

register_local_model!(:my_hf, (format=:hf_tokenizer_json, path="/path/to/tokenizer.json"))

8) Generic auto-detect + override

auto_tok = load_tokenizer("/path/to/model_dir")
forced = load_tokenizer("/path/to/tokenizer.model"; format=:tiktoken)
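
The override matters because some file names fit more than one format: this section lists tokenizer.model under both SentencePiece and tiktoken. The sketch below makes that ambiguity concrete using only the file-name patterns named in this section. candidate_formats is a hypothetical function, and the package's actual auto-detection is not shown here and may inspect file contents as well.

```julia
# Illustrative only: list the format symbols whose file-name patterns,
# as described in this section, match the given path.
function candidate_formats(path::AbstractString)
    name = basename(path)
    candidates = Symbol[]
    name == "tokenizer.json" && push!(candidates, :hf_tokenizer_json)
    endswith(name, ".tiktoken") && push!(candidates, :tiktoken)
    if endswith(name, ".model") || endswith(name, ".model.v3")
        push!(candidates, :sentencepiece_model)
    end
    # A text tokenizer.model may also be a tiktoken file.
    name == "tokenizer.model" && push!(candidates, :tiktoken)
    return candidates
end

println(candidate_formats("/path/to/tokenizer.model"))
println(candidate_formats("/path/to/o200k_base.tiktoken"))
```

When more than one candidate comes back, passing format explicitly removes the guesswork.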

9) Explicit FilesSpec objects

spec = FilesSpec(
    format=:bpe_gpt2,
    vocab_json="/path/to/vocab.json",
    merges_txt="/path/to/merges.txt",
)
tok = load_tokenizer(spec)
register_local_model!(:my_bpe, spec; description="explicit file spec")