# Integration With KeemenaPreprocessing

KeemenaSubwords tokenizers are callable, so they satisfy KeemenaPreprocessing's callable-tokenizer contract and can be plugged directly into a preprocessing pipeline.

```julia
using KeemenaPreprocessing
using KeemenaSubwords

tokenizer = load_tokenizer(:core_bpe_en)

cfg = PreprocessConfiguration(tokenizer_name = keemena_callable(tokenizer))
bundle = preprocess_corpus(["hello world", "hello keemena"]; config = cfg)

# KeemenaPreprocessing stores callable levels under Symbol(typeof(tokenizer))
lvl = level_key(tokenizer)
subword_corpus = get_corpus(bundle, lvl)
```
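Because the tokenizer is callable, it can also be invoked on a string directly, outside the pipeline. A minimal sketch; the concrete return shape noted in the comment is an assumption for illustration, not a documented guarantee:

```julia
using KeemenaSubwords

tokenizer = load_tokenizer(:core_bpe_en)

# The callable contract: the tokenizer object itself acts as the
# tokenization function. The exact return type (e.g. a vector of
# subword strings or ids) is assumed here for illustration.
pieces = tokenizer("hello keemena")
```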

For the normalization/offsets alignment contract (`clean_text` -> `tokenization_text` -> `encode_result(...; assume_normalized=true)`), see Normalization and Offsets Contract.

For onboarding context and token/id semantics, see Concepts.

Alignment rule: use `tokenization_text = tokenization_view(tokenizer, clean_text)`, then call `encode_result(tokenizer, tokenization_text; assume_normalized=true, return_offsets=true, return_masks=true, ...)`. Word and subword offsets must both be interpreted in the same `tokenization_text` coordinate system.
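Spelled out as code, the rule reads as the sketch below. The function names and keyword arguments are the ones the rule itself names; how the returned result is consumed afterwards is left open:

```julia
using KeemenaSubwords

tokenizer  = load_tokenizer(:core_bpe_en)
clean_text = "hello keemena"

# Step 1: project clean_text into the tokenizer's coordinate system.
tokenization_text = tokenization_view(tokenizer, clean_text)

# Step 2: encode against tokenization_text. assume_normalized=true tells the
# encoder not to re-normalize, so every returned offset indexes into
# tokenization_text -- the same coordinate system that word offsets must use.
result = encode_result(tokenizer, tokenization_text;
                       assume_normalized = true,
                       return_offsets    = true,
                       return_masks      = true)
```

Interpreting subword offsets against `clean_text` instead of `tokenization_text` is the failure mode this rule guards against: any normalization that changes character positions would silently shift every span.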

See Offsets Alignment Examples for a worked subword-to-word mapping tutorial.