# Built-In Models

```julia
using KeemenaSubwords

# List registered models, optionally filtered by format, family,
# distribution, or whether the files ship with the package:
available_models()
available_models(format=:tiktoken)
available_models(format=:bpe_gpt2)
available_models(format=:hf_tokenizer_json)
available_models(family=:qwen)
available_models(family=:mistral)
available_models(distribution=:artifact_public)
available_models(distribution=:installable_gated)
available_models(shipped=true)

# Inspect individual models:
describe_model(:core_bpe_en)
describe_model(:core_wordpiece_en)
describe_model(:core_sentencepiece_unigram_en)
describe_model(:tiktoken_o200k_base)
describe_model(:openai_gpt2_bpe)
describe_model(:bert_base_uncased_wordpiece)
describe_model(:bert_base_multilingual_cased_wordpiece)
describe_model(:t5_small_sentencepiece_unigram)
describe_model(:mistral_v1_sentencepiece)
describe_model(:mistral_v3_sentencepiece)
describe_model(:phi2_bpe)
describe_model(:qwen2_5_bpe)
describe_model(:roberta_base_bpe)
describe_model(:xlm_roberta_base_sentencepiece_bpe)
describe_model(:llama3_8b_tokenizer)

# Recommended defaults and on-disk resolution:
recommended_defaults_for_llms()
model_path(:core_bpe_en)
```

The inventory below is generated from the same registry used by `available_models()` and `describe_model(...)`.
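The listing and inspection calls can also be combined; a hypothetical sketch, assuming the keyword filters shown above may be passed together (this section does not confirm that they compose):

```julia
using KeemenaSubwords

# Print metadata for every public tiktoken-format model.
for key in available_models(format=:tiktoken, distribution=:artifact_public)
    describe_model(key)
end
```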
_Generated from the registry by `tools/sync_readme_models.jl` (excluding `:user_local` entries)._
### bpe / core

- **`:core_bpe_en`**
  - Distribution: shipped
  - License: MIT
  - Upstream: `in-repo/core@in-repo:core`
  - Expected files: vocab.txt, merges.txt
  - Description: Tiny built-in English classic BPE model (vocab.txt + merges.txt).

### bpe_gpt2 / openai

- **`:openai_gpt2_bpe`**
  - Distribution: artifact_public
  - License: MIT
  - Upstream: `openaipublic/gpt-2@openaipublic:gpt-2/encodings/main`
  - Expected files: vocab.json + merges.txt, encoder.json + vocab.bpe
  - Description: OpenAI GPT-2 byte-level BPE assets (encoder.json + vocab.bpe).

### bpe_gpt2 / phi

- **`:phi2_bpe`**
  - Distribution: artifact_public
  - License: MIT
  - Upstream: `microsoft/phi-2@huggingface:microsoft/phi-2@810d367871c1d460086d9f82db8696f2e0a0fcd0`
  - Expected files: vocab.json + merges.txt, encoder.json + vocab.bpe
  - Description: Microsoft Phi-2 GPT2-style tokenizer files (vocab.json + merges.txt).

### bpe_gpt2 / roberta

- **`:roberta_base_bpe`**
  - Distribution: artifact_public
  - License: MIT
  - Upstream: `FacebookAI/roberta-base@huggingface:FacebookAI/roberta-base@e2da8e2f811d1448a5b465c236feacd80ffbac7b`
  - Expected files: vocab.json + merges.txt, encoder.json + vocab.bpe
  - Description: RoBERTa-base byte-level BPE tokenizer files (vocab.json + merges.txt).

### hf_tokenizer_json / llama

- **`:llama3_8b_tokenizer`**
  - Distribution: installable_gated
  - License: Llama-3.1-Community-License
  - Upstream: `meta-llama/Meta-Llama-3-8B-Instruct@huggingface:meta-llama/Meta-Llama-3-8B-Instruct@main`
  - Expected files: tokenizer.json (preferred), vocab.json + merges.txt (fallback)
  - Description: Meta Llama 3 8B tokenizer (gated; install with install_model!).

### hf_tokenizer_json / qwen

- **`:qwen2_5_bpe`**
  - Distribution: artifact_public
  - License: Apache-2.0
  - Upstream: `Qwen/Qwen2.5-7B@huggingface:Qwen/Qwen2.5-7B@d149729398750b98c0af14eb82c78cfe92750796`
  - Expected files: tokenizer.json (preferred), vocab.json + merges.txt (fallback)
  - Description: Qwen2.5 BPE tokenizer assets (tokenizer.json with vocab/merges fallback).

### sentencepiece_model / core

- **`:core_sentencepiece_unigram_en`**
  - Distribution: shipped
  - License: MIT
  - Upstream: `in-repo/core@in-repo:core`
  - Expected files: spm.model / tokenizer.model / tokenizer.model.v3 / sentencepiece.bpe.model
  - Description: Tiny built-in SentencePiece Unigram model (.model).

### sentencepiece_model / llama

- **`:llama2_tokenizer`**
  - Distribution: installable_gated
  - License: Llama-2-Community-License
  - Upstream: `meta-llama/Llama-2-7b-hf@huggingface:meta-llama/Llama-2-7b-hf@main`
  - Expected files: spm.model / tokenizer.model / tokenizer.model.v3 / sentencepiece.bpe.model
  - Description: Meta Llama 2 tokenizer (gated; install with install_model!).

### sentencepiece_model / mistral

- **`:mistral_v1_sentencepiece`**
  - Distribution: artifact_public
  - License: Apache-2.0
  - Upstream: `mistralai/Mixtral-8x7B-Instruct-v0.1@huggingface:mistralai/Mixtral-8x7B-Instruct-v0.1@eba92302a2861cdc0098cc54bc9f17cb2c47eb61`
  - Expected files: spm.model / tokenizer.model / tokenizer.model.v3 / sentencepiece.bpe.model
  - Description: Mistral/Mixtral tokenizer.model SentencePiece model.
- **`:mistral_v3_sentencepiece`**
  - Distribution: artifact_public
  - License: Apache-2.0
  - Upstream: `mistralai/Mistral-7B-Instruct-v0.3@huggingface:mistralai/Mistral-7B-Instruct-v0.3@c170c708c41dac9275d15a8fff4eca08d52bab71`
  - Expected files: spm.model / tokenizer.model / tokenizer.model.v3 / sentencepiece.bpe.model
  - Description: Mistral-7B-Instruct-v0.3 tokenizer.model.v3 SentencePiece model.

### sentencepiece_model / t5

- **`:t5_small_sentencepiece_unigram`**
  - Distribution: artifact_public
  - License: Apache-2.0
  - Upstream: `google-t5/t5-small@huggingface:google-t5/t5-small@df1b051c49625cf57a3d0d8d3863ed4d13564fe4`
  - Expected files: spm.model / tokenizer.model / tokenizer.model.v3 / sentencepiece.bpe.model
  - Description: Hugging Face google-t5/t5-small SentencePiece model (Unigram).

### sentencepiece_model / xlm_roberta

- **`:xlm_roberta_base_sentencepiece_bpe`**
  - Distribution: artifact_public
  - License: MIT
  - Upstream: `FacebookAI/xlm-roberta-base@huggingface:FacebookAI/xlm-roberta-base@e73636d4f797dec63c3081bb6ed5c7b0bb3f2089`
  - Expected files: spm.model / tokenizer.model / tokenizer.model.v3 / sentencepiece.bpe.model
  - Description: XLM-RoBERTa-base sentencepiece.bpe.model file.

### tiktoken / openai

- **`:tiktoken_cl100k_base`**
  - Distribution: artifact_public
  - License: MIT
  - Upstream: `openaipublic/encodings@openaipublic:encodings/cl100k_base.tiktoken`
  - Expected files: *.tiktoken or tokenizer.model (tiktoken text)
  - Description: OpenAI tiktoken cl100k_base encoding.
- **`:tiktoken_o200k_base`**
  - Distribution: artifact_public
  - License: MIT
  - Upstream: `openaipublic/encodings@openaipublic:encodings/o200k_base.tiktoken`
  - Expected files: *.tiktoken or tokenizer.model (tiktoken text)
  - Description: OpenAI tiktoken o200k_base encoding.
- **`:tiktoken_p50k_base`**
  - Distribution: artifact_public
  - License: MIT
  - Upstream: `openaipublic/encodings@openaipublic:encodings/p50k_base.tiktoken`
  - Expected files: *.tiktoken or tokenizer.model (tiktoken text)
  - Description: OpenAI tiktoken p50k_base encoding.
- **`:tiktoken_r50k_base`**
  - Distribution: artifact_public
  - License: MIT
  - Upstream: `openaipublic/encodings@openaipublic:encodings/r50k_base.tiktoken`
  - Expected files: *.tiktoken or tokenizer.model (tiktoken text)
  - Description: OpenAI tiktoken r50k_base encoding.

### wordpiece_vocab / bert

- **`:bert_base_multilingual_cased_wordpiece`**
  - Distribution: artifact_public
  - License: Apache-2.0
  - Upstream: `google-bert/bert-base-multilingual-cased@huggingface:google-bert/bert-base-multilingual-cased@3f076fdb1ab68d5b2880cb87a0886f315b8146f8`
  - Expected files: vocab.txt
  - Description: Hugging Face bert-base-multilingual-cased WordPiece vocabulary.
- **`:bert_base_uncased_wordpiece`**
  - Distribution: artifact_public
  - License: Apache-2.0
  - Upstream: `bert-base-uncased@huggingface:bert-base-uncased@86b5e0934494bd15c9632b12f734a8a67f723594`
  - Expected files: vocab.txt
  - Description: Hugging Face bert-base-uncased WordPiece vocabulary.

### wordpiece_vocab / core

- **`:core_wordpiece_en`**
  - Distribution: shipped
  - License: MIT
  - Upstream: `in-repo/core@in-repo:core`
  - Expected files: vocab.txt
  - Description: Tiny built-in English WordPiece model.
`describe_model(key)` includes provenance metadata such as `license`, `family`, `distribution`, `upstream_repo`, `upstream_ref`, and `upstream_files`.
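Those fields make it straightforward to audit licenses and provenance across the whole registry. A hypothetical sketch, assuming `describe_model` returns a record whose fields are accessible under the names listed above (the exact return type is not shown in this section):

```julia
using KeemenaSubwords

# Print a one-line provenance summary per registered model.
for key in available_models()
    info = describe_model(key)
    println(key, "  ", info.license, "  ", info.distribution,
            "  ", info.upstream_repo, "@", info.upstream_ref)
end
```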
Built-in models resolve from their artifact paths when those are present; only the tiny `:core_*` assets fall back to model files shipped in the repository.
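In practice this means path resolution works offline for the shipped `:core_*` models, while artifact-backed models resolve once their artifact has been fetched. A hypothetical sketch, assuming `model_path` returns a filesystem path as in `model_path(:core_bpe_en)` above:

```julia
using KeemenaSubwords

# :core_* models ship in-repo, so this resolves without any download;
# artifact-backed keys resolve only after their artifact is present.
core = model_path(:core_bpe_en)
println(ispath(core) ? "resolved: $core" : "missing")
```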
To download the recommended public models ahead of time:

```julia
prefetch_models(recommended_defaults_for_llms())
```
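Prefetching covers the public artifacts; the gated entries (`distribution=:installable_gated`, i.e. `:llama2_tokenizer` and `:llama3_8b_tokenizer`) must be installed explicitly, per their registry descriptions. A hypothetical sketch, assuming `install_model!` accepts the registry key (its exact signature is not shown in this section):

```julia
using KeemenaSubwords

# Gated upstreams require accepting the upstream license on Hugging Face
# before the files can be downloaded.
install_model!(:llama3_8b_tokenizer)
model_path(:llama3_8b_tokenizer)   # resolves once installed
```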