# KeemenaSubwords.jl

Downstream of KeemenaPreprocessing.jl.

KeemenaSubwords provides Julia-native loaders and tokenization primitives for:

- classic BPE
- byte-level BPE
- WordPiece
- SentencePiece
- tiktoken
- Hugging Face tokenizer.json
## Start Here

If you are new to the package, start with Concepts for the core contracts and first-hour workflows.

- Token ids are 1-based in KeemenaSubwords.
- Offsets are UTF-8 codeunit half-open spans: `[start, stop)`.
- Byte-level tokenizers can emit offsets that are valid codeunit spans but not always safe Julia string slice boundaries on multibyte text.
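The last two contracts can be illustrated with plain Base Julia (no KeemenaSubwords APIs involved), using a string with one multibyte character:

```julia
# Minimal sketch of the offsets contract: spans are half-open over UTF-8
# codeunits, and a codeunit position is not always a valid string index.
s  = "héllo"            # 'é' occupies codeunits 2-3 in UTF-8
cu = codeunits(s)

# The half-open span [1, 2) covers codeunit 1 only: the ASCII 'h'.
start, stop = 1, 2
piece = String(cu[start:stop-1])   # "h"

# Codeunit 2 starts 'é', so it is a valid string index;
# codeunit 3 falls inside 'é', so s[3:3] would throw.
isvalid(s, 2)   # true
isvalid(s, 3)   # false
```

Converting a byte-level span to a string slice therefore calls for an `isvalid` (or `thisind`) check before indexing into the original string.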
## Quick Start

```julia
using KeemenaSubwords

tok    = load_tokenizer(:core_bpe_en)
pieces = tokenize(tok, "hello world")
ids    = encode(tok, "hello world"; add_special_tokens=true)
text   = decode(tok, ids)
```

## Model Discovery
```julia
available_models()
available_models(distribution=:artifact_public)
available_models(distribution=:installable_gated)
describe_model(:qwen2_5_bpe)
recommended_defaults_for_llms()
```

## Key Workflows
```julia
# local path auto-detection
load_tokenizer("/path/to/model_dir")

# explicit loaders
load_bpe_gpt2("/path/to/vocab.json", "/path/to/merges.txt")
load_sentencepiece("/path/to/tokenizer.model")
load_tiktoken("/path/to/tokenizer.model")

# gated install flow
install_model!(:llama3_8b_tokenizer; token=ENV["HF_TOKEN"])
```
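The gated install above indexes `ENV` directly, which raises a bare `KeyError` when `HF_TOKEN` is unset. A small guard (plain Base Julia; the helper name `hf_token` is our own, not part of the package) fails with a clearer message:

```julia
# Hypothetical helper: resolve the Hugging Face token from the environment,
# erroring with an actionable message when it is missing or empty.
function hf_token()
    tok = get(ENV, "HF_TOKEN", "")
    isempty(tok) && error("Set HF_TOKEN to a Hugging Face access token before installing gated models.")
    return tok
end
```

With the guard in place, the install call becomes `install_model!(:llama3_8b_tokenizer; token=hf_token())`.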
## Quick Guide Recipes

Choose your path:
- Quick handlers (1-2 line workflows)
- Pretrained tokenizer recipes
- Training recipes (experimental)
- Structured outputs and batching (training-ready tensors)
- Offsets alignment worked examples
- Tokenizer formats and required files
- Installable gated models
- LLM cookbook
## Documentation Map
- Concepts
- Quick guide recipes
- Structured outputs and batching (go-to for training-ready tensors)
- Built-in model inventory
- Normalization and offsets contract
- Offsets alignment worked examples
- Training (experimental)
- Format contracts
- Local path recipes
- LLM cookbook (model selection, installs, and interop)
- Installable gated models
- Troubleshooting
- API reference
## KeemenaPreprocessing Integration

```julia
using KeemenaPreprocessing
using KeemenaSubwords

tok    = load_tokenizer(:core_bpe_en)
cfg    = PreprocessConfiguration(tokenizer_name = keemena_callable(tok))
bundle = preprocess_corpus(["hello world"]; config=cfg)
```

See API reference for explicit loader APIs and the full exported reference.