Troubleshooting

Offset contract sync check fails

The check julia --project=. tools/sync_offset_contract.jl --check enforces that notes/OffsetContract.md (source) and docs/src/normalization_offsets_contract.md (generated target) match after newline normalization.

Do not hand-edit docs/src/normalization_offsets_contract.md. Edit notes/OffsetContract.md, then regenerate the target and re-run the check:

julia --project=. tools/sync_offset_contract.jl
julia --project=. tools/sync_offset_contract.jl --check
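
For intuition about what "match after newline normalization" means, here is a minimal sketch in Julia. It assumes that normalization only maps CRLF and lone CR line endings to LF. The helper names normalize_newlines and files_in_sync are hypothetical and are not taken from tools/sync_offset_contract.jl, whose actual rules may differ.

```julia
# Hypothetical helpers illustrating a newline-insensitive comparison.
# Assumption: "normalization" only maps CRLF and lone CR to LF.
normalize_newlines(s::AbstractString) = replace(replace(s, "\r\n" => "\n"), "\r" => "\n")

# Compare two files by content, ignoring line-ending differences.
function files_in_sync(source_path::AbstractString, target_path::AbstractString)
    return normalize_newlines(read(source_path, String)) ==
           normalize_newlines(read(target_path, String))
end
```

Under this assumption, a target that differs from the source only in line endings still counts as in sync, so a failing check would point to a real content difference.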

Auto-detect picked the wrong format

Force the format explicitly:

load_tokenizer("/path/to/tokenizer.model"; format=:tiktoken)
load_tokenizer("/path/to/tokenizer.model"; format=:sentencepiece_model)

Use detection helpers to inspect first:

detect_tokenizer_format("/path/to/model_dir")
detect_tokenizer_files("/path/to/model_dir")

Missing merges.txt

GPT-2 style BPE needs two files, in either of these pairings:

  • vocab.json + merges.txt
  • encoder.json + vocab.bpe

Use explicit loaders for clearer errors:

load_bpe_gpt2("/path/to/vocab.json", "/path/to/merges.txt")
load_bpe_encoder("/path/to/encoder.json", "/path/to/vocab.bpe")
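
Before calling a loader, it can help to confirm which pairing is actually present on disk. The following sketch uses only Julia's standard library. find_bpe_pair and BPE_FILE_PAIRS are hypothetical names introduced for illustration and are not part of KeemenaSubwords.

```julia
# Hypothetical pre-flight check: find a complete GPT-2 style BPE file pair.
const BPE_FILE_PAIRS = [("vocab.json", "merges.txt"), ("encoder.json", "vocab.bpe")]

function find_bpe_pair(dir::AbstractString)
    for (vocab_name, merges_name) in BPE_FILE_PAIRS
        vocab_path = joinpath(dir, vocab_name)
        merges_path = joinpath(dir, merges_name)
        # Only accept a pairing when both of its files exist.
        if isfile(vocab_path) && isfile(merges_path)
            return (vocab_path, merges_path)
        end
    end
    return nothing  # no complete pair found in this directory
end
```

A result of nothing means neither pairing is complete, which usually indicates that one of the two files was not downloaded.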

tokenizer.model is not SentencePiece

Some models (notably LLaMA3-style releases) ship a tiktoken-format text file under the name tokenizer.model.

KeemenaSubwords sniffs .model files:

  • tiktoken-like text lines => :tiktoken
  • binary / SentencePiece-like payload => :sentencepiece_model

If needed, override manually:

load_tokenizer("/path/to/tokenizer.model"; format=:tiktoken)
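
To check by hand which kind of file you have, a rough heuristic along these lines may help. This is an illustrative sketch, not the detection code used by KeemenaSubwords. It assumes that a tiktoken file is plain ASCII text whose lines have the form "<base64 token> <integer rank>", and the name looks_like_tiktoken is hypothetical.

```julia
# Hypothetical heuristic, NOT the library's detection code.
# Assumption: a tiktoken file is plain ASCII text whose lines look like
# "<base64 token> <integer rank>".
function looks_like_tiktoken(path::AbstractString; nbytes::Integer=4096)
    # Read at most `nbytes` bytes from the start of the file.
    bytes = open(io -> read(io, nbytes), path)
    isempty(bytes) && return false
    # Reject anything with bytes outside printable ASCII, LF, or CR.
    is_text_byte(b) = b == 0x0a || b == 0x0d || 0x20 <= b <= 0x7e
    all(is_text_byte, bytes) || return false
    # Inspect only the first line; later lines may be cut off by the byte limit.
    first_line = strip(first(split(String(bytes), '\n')))
    fields = split(first_line)
    return length(fields) == 2 && all(isdigit, fields[2])
end
```

If this returns true for a file that auto-detection treated as SentencePiece, forcing format=:tiktoken as shown above is the likely fix.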

Gated model key fails to load

If load_tokenizer(:llama3_8b_tokenizer) reports that the model is not installed, install it first:

install_model!(:llama3_8b_tokenizer; token=ENV["HF_TOKEN"])

You must have already accepted the upstream license terms, and the token must have access to the gated model.
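
Note that ENV["HF_TOKEN"] raises a KeyError when the variable is unset, which can be mistaken for a loading problem. A small guard like the following sketch gives a clearer message. require_env is a hypothetical helper, not part of KeemenaSubwords; the env keyword exists only so the function can be exercised without touching the real environment.

```julia
# Hypothetical helper: fetch a required environment variable with a clear error.
function require_env(name::AbstractString; env=ENV)
    # Treat a missing variable and a blank one the same way.
    value = strip(get(env, name, ""))
    isempty(value) && error("Environment variable $name is not set or is empty")
    return String(value)
end

# Usage with the call from this section (requires KeemenaSubwords):
# install_model!(:llama3_8b_tokenizer; token=require_env("HF_TOKEN"))
```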