Built-In Models

using KeemenaSubwords

available_models()
available_models(format=:tiktoken)
available_models(format=:bpe_gpt2)
available_models(format=:hf_tokenizer_json)
available_models(family=:qwen)
available_models(family=:mistral)
available_models(distribution=:artifact_public)
available_models(distribution=:installable_gated)
available_models(shipped=true)

describe_model(:core_bpe_en)
describe_model(:core_wordpiece_en)
describe_model(:core_sentencepiece_unigram_en)
describe_model(:tiktoken_o200k_base)
describe_model(:openai_gpt2_bpe)
describe_model(:bert_base_uncased_wordpiece)
describe_model(:bert_base_multilingual_cased_wordpiece)
describe_model(:t5_small_sentencepiece_unigram)
describe_model(:mistral_v1_sentencepiece)
describe_model(:mistral_v3_sentencepiece)
describe_model(:phi2_bpe)
describe_model(:qwen2_5_bpe)
describe_model(:roberta_base_bpe)
describe_model(:xlm_roberta_base_sentencepiece_bpe)
describe_model(:llama3_8b_tokenizer)
recommended_defaults_for_llms()

model_path(:core_bpe_en)
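The keyword filters above can be pictured as predicates over registry rows. The following is a minimal sketch and not the package's implementation: `sketch_registry` and `sketch_available_models` are hypothetical names, the rows are copied from a few inventory entries on this page, and the rule that several keywords combine with logical AND is an assumption.

```julia
# Hypothetical stand-in for the registry: a few rows copied from the inventory
# on this page. The real registry's storage layout is not shown in this document.
sketch_registry = [
    (key = :core_bpe_en,              format = :bpe,                 family = :core,    distribution = :shipped),
    (key = :tiktoken_o200k_base,      format = :tiktoken,            family = :openai,  distribution = :artifact_public),
    (key = :openai_gpt2_bpe,          format = :bpe_gpt2,            family = :openai,  distribution = :artifact_public),
    (key = :qwen2_5_bpe,              format = :hf_tokenizer_json,   family = :qwen,    distribution = :artifact_public),
    (key = :mistral_v1_sentencepiece, format = :sentencepiece_model, family = :mistral, distribution = :artifact_public),
    (key = :llama3_8b_tokenizer,      format = :hf_tokenizer_json,   family = :llama,   distribution = :installable_gated),
]

# Keep an entry when every filter that was supplied (i.e. not `nothing`) matches.
# Combining filters with AND is an assumption about the real API.
function sketch_available_models(registry; format = nothing, family = nothing, distribution = nothing)
    matches(entry) =
        (format === nothing || entry.format == format) &&
        (family === nothing || entry.family == family) &&
        (distribution === nothing || entry.distribution == distribution)
    return [entry.key for entry in registry if matches(entry)]
end

# Filter by format only, then by family and distribution together.
by_format = sketch_available_models(sketch_registry; format = :hf_tokenizer_json)
by_family = sketch_available_models(sketch_registry; family = :openai, distribution = :artifact_public)
```

With no keywords the sketch returns every key, which mirrors the bare `available_models()` call at the top of this section.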

The inventory below is generated from the same registry used by available_models() and describe_model(...).

Generated from the registry by `tools/syncreadmemodels.jl` (excluding `:userlocal` entries).

`bpe` / `core`

:core_bpe_en
- Distribution: shipped
- License: MIT
- Upstream: in-repo/core @ in-repo:core
- Expected files: vocab.txt, merges.txt
- Description: Tiny built-in English classic BPE model (vocab.txt + merges.txt).

`bpe_gpt2` / `openai`

:openai_gpt2_bpe
- Distribution: artifact_public
- License: MIT
- Upstream: openaipublic/gpt-2 @ openaipublic:gpt-2/encodings/main
- Expected files: vocab.json + merges.txt, encoder.json + vocab.bpe
- Description: OpenAI GPT-2 byte-level BPE assets (encoder.json + vocab.bpe).

`bpe_gpt2` / `phi`

:phi2_bpe
- Distribution: artifact_public
- License: MIT
- Upstream: microsoft/phi-2 @ huggingface:microsoft/phi-2@810d367871c1d460086d9f82db8696f2e0a0fcd0
- Expected files: vocab.json + merges.txt, encoder.json + vocab.bpe
- Description: Microsoft Phi-2 GPT2-style tokenizer files (vocab.json + merges.txt).

`bpe_gpt2` / `roberta`

:roberta_base_bpe
- Distribution: artifact_public
- License: MIT
- Upstream: FacebookAI/roberta-base @ huggingface:FacebookAI/roberta-base@e2da8e2f811d1448a5b465c236feacd80ffbac7b
- Expected files: vocab.json + merges.txt, encoder.json + vocab.bpe
- Description: RoBERTa-base byte-level BPE tokenizer files (vocab.json + merges.txt).

`hf_tokenizer_json` / `llama`

:llama3_8b_tokenizer
- Distribution: installable_gated
- License: Llama-3.1-Community-License
- Upstream: meta-llama/Meta-Llama-3-8B-Instruct @ huggingface:meta-llama/Meta-Llama-3-8B-Instruct@main
- Expected files: tokenizer.json (preferred), vocab.json + merges.txt (fallback)
- Description: Meta Llama 3 8B tokenizer (gated; install with install_model!).

`hf_tokenizer_json` / `qwen`

:qwen2_5_bpe
- Distribution: artifact_public
- License: Apache-2.0
- Upstream: Qwen/Qwen2.5-7B @ huggingface:Qwen/Qwen2.5-7B@d149729398750b98c0af14eb82c78cfe92750796
- Expected files: tokenizer.json (preferred), vocab.json + merges.txt (fallback)
- Description: Qwen2.5 BPE tokenizer assets (tokenizer.json with vocab/merges fallback).

`sentencepiece_model` / `core`

:core_sentencepiece_unigram_en
- Distribution: shipped
- License: MIT
- Upstream: in-repo/core @ in-repo:core
- Expected files: spm.model / tokenizer.model / tokenizer.model.v3 / sentencepiece.bpe.model
- Description: Tiny built-in SentencePiece Unigram model (.model).

`sentencepiece_model` / `llama`

:llama2_tokenizer
- Distribution: installable_gated
- License: Llama-2-Community-License
- Upstream: meta-llama/Llama-2-7b-hf @ huggingface:meta-llama/Llama-2-7b-hf@main
- Expected files: spm.model / tokenizer.model / tokenizer.model.v3 / sentencepiece.bpe.model
- Description: Meta Llama 2 tokenizer (gated; install with install_model!).

`sentencepiece_model` / `mistral`

:mistral_v1_sentencepiece
- Distribution: artifact_public
- License: Apache-2.0
- Upstream: mistralai/Mixtral-8x7B-Instruct-v0.1 @ huggingface:mistralai/Mixtral-8x7B-Instruct-v0.1@eba92302a2861cdc0098cc54bc9f17cb2c47eb61
- Expected files: spm.model / tokenizer.model / tokenizer.model.v3 / sentencepiece.bpe.model
- Description: Mistral/Mixtral tokenizer.model SentencePiece model.

:mistral_v3_sentencepiece
- Distribution: artifact_public
- License: Apache-2.0
- Upstream: mistralai/Mistral-7B-Instruct-v0.3 @ huggingface:mistralai/Mistral-7B-Instruct-v0.3@c170c708c41dac9275d15a8fff4eca08d52bab71
- Expected files: spm.model / tokenizer.model / tokenizer.model.v3 / sentencepiece.bpe.model
- Description: Mistral-7B-Instruct-v0.3 tokenizer.model.v3 SentencePiece model.

`sentencepiece_model` / `t5`

:t5_small_sentencepiece_unigram
- Distribution: artifact_public
- License: Apache-2.0
- Upstream: google-t5/t5-small @ huggingface:google-t5/t5-small@df1b051c49625cf57a3d0d8d3863ed4d13564fe4
- Expected files: spm.model / tokenizer.model / tokenizer.model.v3 / sentencepiece.bpe.model
- Description: Hugging Face google-t5/t5-small SentencePiece model (Unigram).

`sentencepiece_model` / `xlm_roberta`

:xlm_roberta_base_sentencepiece_bpe
- Distribution: artifact_public
- License: MIT
- Upstream: FacebookAI/xlm-roberta-base @ huggingface:FacebookAI/xlm-roberta-base@e73636d4f797dec63c3081bb6ed5c7b0bb3f2089
- Expected files: spm.model / tokenizer.model / tokenizer.model.v3 / sentencepiece.bpe.model
- Description: XLM-RoBERTa-base sentencepiece.bpe.model file.

`tiktoken` / `openai`

:tiktoken_cl100k_base
- Distribution: artifact_public
- License: MIT
- Upstream: openaipublic/encodings @ openaipublic:encodings/cl100k_base.tiktoken
- Expected files: *.tiktoken or tokenizer.model (tiktoken text)
- Description: OpenAI tiktoken cl100k_base encoding.

:tiktoken_o200k_base
- Distribution: artifact_public
- License: MIT
- Upstream: openaipublic/encodings @ openaipublic:encodings/o200k_base.tiktoken
- Expected files: *.tiktoken or tokenizer.model (tiktoken text)
- Description: OpenAI tiktoken o200k_base encoding.

:tiktoken_p50k_base
- Distribution: artifact_public
- License: MIT
- Upstream: openaipublic/encodings @ openaipublic:encodings/p50k_base.tiktoken
- Expected files: *.tiktoken or tokenizer.model (tiktoken text)
- Description: OpenAI tiktoken p50k_base encoding.

:tiktoken_r50k_base
- Distribution: artifact_public
- License: MIT
- Upstream: openaipublic/encodings @ openaipublic:encodings/r50k_base.tiktoken
- Expected files: *.tiktoken or tokenizer.model (tiktoken text)
- Description: OpenAI tiktoken r50k_base encoding.

`wordpiece_vocab` / `bert`

:bert_base_multilingual_cased_wordpiece
- Distribution: artifact_public
- License: Apache-2.0
- Upstream: google-bert/bert-base-multilingual-cased @ huggingface:google-bert/bert-base-multilingual-cased@3f076fdb1ab68d5b2880cb87a0886f315b8146f8
- Expected files: vocab.txt
- Description: Hugging Face bert-base-multilingual-cased WordPiece vocabulary.

:bert_base_uncased_wordpiece
- Distribution: artifact_public
- License: Apache-2.0
- Upstream: bert-base-uncased @ huggingface:bert-base-uncased@86b5e0934494bd15c9632b12f734a8a67f723594
- Expected files: vocab.txt
- Description: Hugging Face bert-base-uncased WordPiece vocabulary.

`wordpiece_vocab` / `core`

:core_wordpiece_en
- Distribution: shipped
- License: MIT
- Upstream: in-repo/core @ in-repo:core
- Expected files: vocab.txt
- Description: Tiny built-in English WordPiece model.

describe_model(key) includes provenance metadata such as license, family, distribution, upstream_repo, upstream_ref, and upstream_files.
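One use of this metadata is checking whether a model is pinned to a fixed upstream revision. The sketch below works on plain named tuples that reuse the field names listed above, with values copied from the inventory. The actual return type of `describe_model` is not specified here, so the record shape, the helper names `provenance_line` and `is_pinned`, and the mapping of `upstream_repo` and `upstream_ref` onto the two halves of the "Upstream" line are all assumptions.

```julia
# Hypothetical records using the provenance field names listed above.
# Values are copied from the inventory entries for :qwen2_5_bpe and
# :llama3_8b_tokenizer; the real describe_model return type may differ.
qwen_info = (
    license = "Apache-2.0",
    family = :qwen,
    distribution = :artifact_public,
    upstream_repo = "Qwen/Qwen2.5-7B",
    upstream_ref = "huggingface:Qwen/Qwen2.5-7B@d149729398750b98c0af14eb82c78cfe92750796",
)
llama_info = (
    license = "Llama-3.1-Community-License",
    family = :llama,
    distribution = :installable_gated,
    upstream_repo = "meta-llama/Meta-Llama-3-8B-Instruct",
    upstream_ref = "huggingface:meta-llama/Meta-Llama-3-8B-Instruct@main",
)

# Rebuild the "repo @ ref" text shown on the "Upstream" lines of the inventory.
provenance_line(info) = string(info.upstream_repo, " @ ", info.upstream_ref)

# Treat a ref as pinned when it ends in a 40-character hexadecimal commit hash.
is_pinned(info) = occursin(r"@[0-9a-f]{40}$", info.upstream_ref)

println(provenance_line(qwen_info))
println("qwen pinned: ", is_pinned(qwen_info), ", llama pinned: ", is_pinned(llama_info))
```

In the inventory on this page, the `artifact_public` Hugging Face entries carry commit hashes while the two gated Llama entries point at `main`, which is the distinction `is_pinned` picks up.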

Built-in models resolve from artifact paths when the artifact is present. Only the tiny :core_* assets have in-repo fallback model files.
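The resolution order described above can be sketched as a small function. This illustrates the stated rule only and is not the package's code: `sketch_resolve`, its arguments, and the directory layout under `repo_dir` are hypothetical.

```julia
# Hypothetical resolver illustrating the rule above:
#   1. use the artifact directory when it exists on disk;
#   2. otherwise fall back to in-repo files, but only for :core_* keys;
#   3. otherwise report that the model is unavailable.
function sketch_resolve(key::Symbol, artifact_dir::Union{Nothing,String}, repo_dir::String)
    if artifact_dir !== nothing && isdir(artifact_dir)
        return artifact_dir
    elseif startswith(String(key), "core_")
        # Assumed layout: one sub-directory per core model inside the repo.
        return joinpath(repo_dir, String(key))
    else
        error("model $key has no artifact on disk and no in-repo fallback")
    end
end

# A temporary directory stands in for a downloaded artifact.
artifact = mktempdir()
from_artifact = sketch_resolve(:qwen2_5_bpe, artifact, "models")
from_fallback = sketch_resolve(:core_bpe_en, nothing, "models")
```

Under this rule a non-core model with a missing artifact is an error rather than a silent fallback, which is one reason to fetch artifacts ahead of time.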

prefetch_models(recommended_defaults_for_llms())