KeemenaSubwords.jl

A downstream companion to KeemenaPreprocessing.jl.

KeemenaSubwords provides Julia-native loaders and tokenization primitives for:

  • classic BPE,
  • byte-level BPE,
  • WordPiece,
  • SentencePiece,
  • tiktoken,
  • Hugging Face tokenizer.json.

Start Here

If you are new to the package, start with Concepts for the core contracts and first-hour workflows.

  • Token ids are 1-based in KeemenaSubwords.
  • Offsets are UTF-8 codeunit half-open spans: [start, stop).
  • Byte-level tokenizers can emit offsets that are valid codeunit spans but not always safe Julia string slice boundaries on multibyte text.
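The offset contract above can be demonstrated with pure Base Julia (no KeemenaSubwords calls): a half-open codeunit span [start, stop) selects bytes start..stop-1, and on multibyte text a codeunit position can be in range yet fall mid-character, so it is not a safe String slice boundary.

```julia
# Illustrating the offset contract with Base Julia only.
s = "héllo"                  # 'é' occupies codeunits 2 and 3
@assert ncodeunits(s) == 6

# Half-open span [1, 4) covers codeunits 1..3, i.e. "hé":
start, stop = 1, 4
piece = String(codeunits(s)[start:stop-1])
@assert piece == "hé"

# Codeunit 3 is the second byte of 'é': a valid codeunit position,
# but not a character boundary, so s[3] would throw StringIndexError.
@assert !isvalid(s, 3)
```

This is why byte-level offsets should be materialized via `codeunits` rather than direct `String` indexing when the text may contain multibyte characters.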

Quick Start

using KeemenaSubwords

tok = load_tokenizer(:core_bpe_en)                            # bundled classic BPE model
pieces = tokenize(tok, "hello world")                         # subword pieces as strings
ids = encode(tok, "hello world"; add_special_tokens=true)     # 1-based token ids, with special tokens
text = decode(tok, ids)                                       # round-trip back to text

Model Discovery

available_models()                                  # all registered model ids
available_models(distribution=:artifact_public)     # models distributed as public artifacts
available_models(distribution=:installable_gated)   # gated models that require installation
describe_model(:qwen2_5_bpe)                        # metadata for a single model
recommended_defaults_for_llms()                     # suggested defaults for common LLMs

Key Workflows

# local path auto-detection
load_tokenizer("/path/to/model_dir")

# explicit loaders
load_bpe_gpt2("/path/to/vocab.json", "/path/to/merges.txt")
load_sentencepiece("/path/to/tokenizer.model")
load_tiktoken("/path/to/tokenizer.model")

# gated install flow
install_model!(:llama3_8b_tokenizer; token=ENV["HF_TOKEN"])
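The gated install above reads an access token from `ENV["HF_TOKEN"]` and will fail if that variable is unset. A minimal sketch (using only the `install_model!` call shown above) that guards against a missing token:

```julia
using KeemenaSubwords

# Guarded gated install: install_model! needs a Hugging Face token,
# so skip with a warning rather than erroring when HF_TOKEN is unset.
if haskey(ENV, "HF_TOKEN")
    install_model!(:llama3_8b_tokenizer; token=ENV["HF_TOKEN"])
else
    @warn "HF_TOKEN not set; skipping gated model install"
end
```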


KeemenaPreprocessing Integration

using KeemenaPreprocessing
using KeemenaSubwords

tok = load_tokenizer(:core_bpe_en)
cfg = PreprocessConfiguration(tokenizer_name = keemena_callable(tok))  # wrap tok as a callable tokenizer
bundle = preprocess_corpus(["hello world"]; config=cfg)

See the API reference for the explicit loader APIs and the full list of exports.