# Troubleshooting

## Offset contract sync check fails
The check `julia --project=. tools/sync_offset_contract.jl --check` enforces that `notes/OffsetContract.md` (source) and `docs/src/normalization_offsets_contract.md` (generated target) match after newline normalization.
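What "match after newline normalization" amounts to can be sketched in a few lines of plain Julia. This is an illustrative reconstruction only; `tools/sync_offset_contract.jl` is the authoritative implementation, and the function name below is made up for the sketch:

```julia
# Sketch of a newline-normalized file comparison (illustrative only;
# tools/sync_offset_contract.jl is the authoritative implementation).
normalize_newlines(s::AbstractString) = replace(s, "\r\n" => "\n")

# Read both files, normalize CRLF to LF, and compare the results.
function contract_in_sync(source_path::AbstractString, target_path::AbstractString)
    src = normalize_newlines(read(source_path, String))
    tgt = normalize_newlines(read(target_path, String))
    return src == tgt
end
```

Under this reading, files that differ only in line endings pass the check; any content difference fails it.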
Do not hand-edit `docs/src/normalization_offsets_contract.md`. Edit `notes/OffsetContract.md`, then run:

```shell
julia --project=. tools/sync_offset_contract.jl
julia --project=. tools/sync_offset_contract.jl --check
```

## Auto-detect picked the wrong format
Force the format explicitly:
```julia
load_tokenizer("/path/to/tokenizer.model"; format=:tiktoken)
load_tokenizer("/path/to/tokenizer.model"; format=:sentencepiece_model)
```

Use the detection helpers to inspect first:
```julia
detect_tokenizer_format("/path/to/model_dir")
detect_tokenizer_files("/path/to/model_dir")
```

## Missing merges.txt
For GPT-2-style BPE you need both files of one pair:

- `vocab.json` + `merges.txt`, or
- `encoder.json` + `vocab.bpe`
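A quick pre-flight check can show which pair (if either) a model directory actually contains. `gpt2_bpe_pair` below is a hypothetical helper written for this page, not a KeemenaSubwords API:

```julia
# Hypothetical pre-flight check (not a KeemenaSubwords API): report which
# GPT-2 BPE file pair, if any, a model directory contains.
function gpt2_bpe_pair(dir::AbstractString)
    has(f) = isfile(joinpath(dir, f))
    has("vocab.json") && has("merges.txt") && return ("vocab.json", "merges.txt")
    has("encoder.json") && has("vocab.bpe") && return ("encoder.json", "vocab.bpe")
    return nothing  # no complete pair: a loader call would fail
end
```

If this returns `nothing`, one half of a pair is missing and any BPE loader will fail.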
Use explicit loaders for clearer errors:
```julia
load_bpe_gpt2("/path/to/vocab.json", "/path/to/merges.txt")
load_bpe_encoder("/path/to/encoder.json", "/path/to/vocab.bpe")
```

## tokenizer.model is not SentencePiece
Some models (notably LLaMA3-style releases) ship tiktoken text in a file named `tokenizer.model`. KeemenaSubwords sniffs `.model` files:

- tiktoken-like text lines => `:tiktoken`
- binary / SentencePiece-like payload => `:sentencepiece_model`
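The sniffing decision can be approximated with plain Julia. This is a rough reconstruction, not the package's actual heuristic: tiktoken rank files are text with one `<base64 token> <integer rank>` pair per line, while SentencePiece models are binary protobuf payloads that typically contain NUL bytes.

```julia
# Approximate format sniff for a `.model` file (illustrative; the package's
# real heuristic may differ).
function sniff_model_format(path::AbstractString)
    bytes = read(path)
    # Binary protobuf payloads almost always contain NUL bytes.
    any(==(0x00), bytes) && return :sentencepiece_model
    lines = filter(!isempty, split(String(bytes), '\n'))
    # Tiktoken lines look like "<base64 token> <integer rank>".
    looks_tiktoken(l) = (p = split(l, ' '); length(p) == 2 && tryparse(Int, p[2]) !== nothing)
    return all(looks_tiktoken, lines) ? :tiktoken : :sentencepiece_model
end
```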
If needed, override manually:

```julia
load_tokenizer("/path/to/tokenizer.model"; format=:tiktoken)
```

## Gated model key fails to load
If `load_tokenizer(:llama3_8b_tokenizer)` reports that the model is not installed, install it first:

```julia
install_model!(:llama3_8b_tokenizer; token=ENV["HF_TOKEN"])
```

You must have accepted the upstream license terms and have valid access.
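Note that `ENV["HF_TOKEN"]` throws a bare `KeyError` when the variable is unset. A guarded lookup fails with a clearer message; `hf_token` below is a small helper written for this page, not a package API:

```julia
# Fail early with a readable message when HF_TOKEN is unset, instead of a
# bare KeyError from ENV["HF_TOKEN"].
function hf_token()
    tok = get(ENV, "HF_TOKEN", "")
    isempty(tok) && error("Set HF_TOKEN to a Hugging Face access token with rights to the gated repo.")
    return tok
end

# Usage: install_model!(:llama3_8b_tokenizer; token=hf_token())
```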