Subwords via KeemenaPreprocessing
This guide explains how to use subwords from the perspective of a real project, not only from an API surface perspective.
In plain terms, KeemenaPreprocessing lets you keep your usual corpus-cleaning and bundling flow, while still getting modern subword tokenization via KeemenaSubwords.
If you are deciding between multiple integration styles, start with the chooser below and then jump to the relevant section.
PreprocessConfiguration(subword = SubwordOptions(...)) supports two canonical modes:
- :tokenizer_native - keep tokenizer-native ids.
- :bundle_reindexed - segment with tokenizer pieces, then rebuild ids/vocab inside KeemenaPreprocessing.
Compatibility aliases:
- :native -> :tokenizer_native
- :corpus -> :bundle_reindexed
- :tokenizer -> :tokenizer_native
- :bundle -> :bundle_reindexed
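The alias table can be read as a simple lookup. The following is a minimal sketch of that mapping in plain Julia; MODE_ALIASES, CANONICAL_MODES and canonical_mode are hypothetical names used only for illustration and are not part of the KeemenaPreprocessing API, which may resolve aliases differently internally.

```julia
# Hypothetical illustration of the alias table above; not package code.
const CANONICAL_MODES = (:tokenizer_native, :bundle_reindexed)

const MODE_ALIASES = Dict(
    :native    => :tokenizer_native,
    :tokenizer => :tokenizer_native,
    :corpus    => :bundle_reindexed,
    :bundle    => :bundle_reindexed,
)

# Return the canonical mode for a canonical name or one of its aliases.
function canonical_mode(mode::Symbol)
    mode in CANONICAL_MODES && return mode
    haskey(MODE_ALIASES, mode) && return MODE_ALIASES[mode]
    throw(ArgumentError("unknown subword mode: $(repr(mode))"))
end
```

Under this reading, mode = :corpus and mode = :bundle_reindexed select the same behavior, so existing configurations that use the aliases do not need to change.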
Both modes store subword artifacts (offsets/masks/metadata) in bundle extras.
Level naming depends on integration path:
- First-party subword path (subword = SubwordOptions(...)) defaults to level :subword.
- Legacy callable bridge (tokenizer_name = keemena_callable(tokenizer)) keeps the primary stream under :word.
Which call path should you use?
Most users should start with the first row, then move to the others only if they need more direct tokenizer control.
| User goal | Recommended call path | Why |
|---|---|---|
| I want one package for preprocessing + subwords (default user path). | using KeemenaPreprocessing then preprocess_corpus(...; config = PreprocessConfiguration(subword = SubwordOptions(...))) | Single import, corpus bundle outputs, stable accessor helpers. |
| I want integrated preprocessing but explicit tokenizer control. | using KeemenaPreprocessing + using KeemenaSubwords, then pass SubwordOptions(source = tokenizer, ...) | Keep bundle workflow while controlling tokenizer loading/selection directly. |
| I want full manual tokenizer workflows without bundle/vocabulary assembly. | using KeemenaSubwords then load_tokenizer(...) + encode_result(...) / quick_encode_batch(...) | Lowest-level control for model-facing tokenization pipelines. |
| I already have a callable-tokenizer integration and want to keep it unchanged. | PreprocessConfiguration(tokenizer_name = keemena_callable(tokenizer)) | Backward-compatible legacy bridge; primary level remains :word. |
Mode guidance inside the first-party subword path:
- Use mode = :tokenizer_native for pretrained-tokenizer-compatible ids.
- Use mode = :bundle_reindexed when KeemenaPreprocessing should own ids/vocab.
- Use cleaning_profile = :subword_cooperative (aliases: :subword_safe, :cooperative) when you want conservative cleaning before tokenizer normalization.
Typical user scenarios
If you are fine-tuning or evaluating a pretrained model and need tokenizer id compatibility, use:
subword = SubwordOptions(..., mode = :tokenizer_native)
You keep ids exactly as the tokenizer defines them, which makes it safer to feed outputs into model code that expects that id space.
If you are building a corpus-specific vocabulary and want KeemenaPreprocessing to own the id mapping, use:
subword = SubwordOptions(..., mode = :bundle_reindexed)
In this mode, KeemenaSubwords still does segmentation, but the final ids and vocabulary are rebuilt by KeemenaPreprocessing (honoring settings like minimum_token_frequency and special_tokens).
If you need full low-level control over tokenizers (for example, standalone batch encoding utilities without bundle assembly), use KeemenaSubwords directly.
Mode choice in plain language
:tokenizer_native
Choose this when your downstream stack already "speaks tokenizer-native ids".
You get:
- ids that match direct KeemenaSubwords.encode_result(...),
- offsets/masks stored in bundle extras,
- easier parity checks between integrated and direct-tokenizer workflows.
:bundle_reindexed
Choose this when your downstream stack should use a corpus-owned vocabulary inside KeemenaPreprocessing.
You get:
- tokenizer-quality segmentation,
- bundle-owned vocabulary + ids,
- frequency filtering and special-token policy controlled by PreprocessConfiguration.
One-package usage
using KeemenaPreprocessing
docs = [
    "Subword tokenization helps with rare compounds and morphology.",
    "Tokenizer-native ids are preserved for downstream model compatibility.",
]
cfg = PreprocessConfiguration(
    cleaning_profile = :subword_cooperative,
    subword = SubwordOptions(
        source = :core_bpe_en,          # model key, path, FilesSpec, NamedTuple, or tokenizer object
        mode = :tokenizer_native,       # keep tokenizer-native ids
        level_name = :subword,          # default
        add_special_tokens = true,
        apply_tokenization_view = true,
        return_offsets = true,
        return_masks = true,
    ),
)
bundle = preprocess_corpus(docs; config = cfg)
ids = get_token_ids(bundle, :subword)
offsets = get_subword_offsets(bundle)
attn = get_subword_attention_mask(bundle)
ttype = get_subword_token_type_ids(bundle)
specials = get_subword_special_tokens_mask(bundle)
meta = get_subword_metadata(bundle)

For bundle-owned ids:
cfg = PreprocessConfiguration(
    cleaning_profile = :subword_cooperative,
    subword = SubwordOptions(
        source = :core_bpe_en,
        mode = :bundle_reindexed,       # rebuild ids/vocab inside KeemenaPreprocessing
        level_name = :subword,
    ),
    minimum_token_frequency = 2,
)

Subwords with streaming (large corpora)
If you like the one-package subword API but your corpus is large, you can use the streaming entrypoints with the same cfg.
using KeemenaPreprocessing
cfg = PreprocessConfiguration(
    cleaning_profile = :subword_cooperative,
    subword = SubwordOptions(
        source = :core_bpe_en,
        mode = :tokenizer_native,       # or :bundle_reindexed
        level_name = :subword,
    ),
)
# 1) Channel API
ch = preprocess_corpus_streaming("data/*"; cfg = cfg, chunk_tokens = 250_000)
# 2) Materialized chunk vector
chunks = preprocess_corpus_streaming_chunks("data/*"; cfg = cfg, chunk_tokens = 250_000)
# 3) Single merged bundle
merged = preprocess_corpus_streaming_full("data/*"; cfg = cfg, chunk_tokens = 250_000)
ids = get_token_ids(merged, :subword)

If you pass vocab = ... with mode = :tokenizer_native, that vocabulary must exactly match the tokenizer-native id mapping (same token id order and special-token ids). Use vocab = nothing to auto-build the canonical native vocabulary. When accepted, mapping/special ids are normalized to the canonical tokenizer-native mapping while preserving your provided token_frequencies.
Practical guidance:
- Use :tokenizer_native when downstream code expects tokenizer-native ids.
- Use :bundle_reindexed when you want corpus-owned ids/vocabulary with minimum_token_frequency and your special-token policy.
- All subword artifact helpers (get_subword_offsets, masks, metadata) work for streaming bundles as well.
Integrated workflow with explicit tokenizer control
Use this when you want KeemenaPreprocessing bundles, but you also want to control tokenizer loading directly (for example pinned local files or a preloaded tokenizer object):
using KeemenaPreprocessing
using KeemenaSubwords
tok = load_tokenizer(:core_bpe_en)
cfg = PreprocessConfiguration(
    cleaning_profile = :subword_cooperative,
    subword = SubwordOptions(
        source = tok,
        mode = :tokenizer_native,
        level_name = :subword,
    ),
)
bundle = preprocess_corpus(["A realistic paragraph goes here."]; config = cfg)

KeemenaSubwords-only workflow (no bundle layer)
If you do not need PreprocessBundle, you can call KeemenaSubwords directly:
using KeemenaSubwords
tok = load_tokenizer(:core_bpe_en)
result = encode_result(tok, "Some text"; return_offsets = true, return_masks = true)

This path is useful when you want raw tokenizer-centric workflows without corpus vocabulary construction.
Cleaning cooperation with subwords
cleaning_profile = :subword_cooperative is the recommended profile for most subword model workflows.
With subword !== nothing, this profile keeps the existing cleaning pipeline but disables aggressive edits that often hurt tokenizer-native parity:
- lowercase = false
- strip_accents = false
- remove_punctuation = false
- replace_urls = false
- replace_emails = false
- replace_numbers = false
The rest of the hygiene pipeline (control-character handling, whitespace normalization, etc.) still runs as configured.
If you want the original behavior, set cleaning_profile = :classic. If cooperative cleaning is requested without subword, configuration falls back to :classic with a warning.
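The profile rules above can be modeled in a few lines of plain Julia. This is only an illustrative sketch of the documented behavior, not the package's implementation: resolve_cleaning, COOPERATIVE_PROFILES and COOPERATIVE_OVERRIDES are hypothetical names, and reporting aliases under the canonical name :subword_cooperative is an assumption.

```julia
# Hypothetical model of the documented profile rules; not package code.
const COOPERATIVE_PROFILES = (:subword_cooperative, :subword_safe, :cooperative)

# Flags the guide lists as disabled under the cooperative profile.
const COOPERATIVE_OVERRIDES = (
    lowercase = false,
    strip_accents = false,
    remove_punctuation = false,
    replace_urls = false,
    replace_emails = false,
    replace_numbers = false,
)

# `flags` holds the user's cleaning flags; `has_subword` says whether
# a subword configuration was supplied.
function resolve_cleaning(profile::Symbol, has_subword::Bool, flags::NamedTuple)
    if profile in COOPERATIVE_PROFILES
        if has_subword
            # Cooperative overrides win over the user's flags.
            return (profile = :subword_cooperative, flags = merge(flags, COOPERATIVE_OVERRIDES))
        end
        @warn "cooperative cleaning requested without subword; falling back to :classic"
        return (profile = :classic, flags = flags)
    end
    return (profile = profile, flags = flags)
end
```

For example, resolve_cleaning(:cooperative, true, (lowercase = true,)) yields flags with lowercase = false, while the same call with has_subword = false keeps lowercase = true and reports :classic.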
What gets stored where
- bundle.levels[:subword]:
  - token ids (tokenizer-native in :tokenizer_native, bundle-reindexed in :bundle_reindexed),
  - level vocabulary matching the selected mode,
  - document offsets for token slicing.
  - structural offsets beyond document_offsets are not currently populated on the subword corpus (byte_offsets, character_offsets, word_offsets, sentence_offsets, paragraph_offsets remain nothing).
- bundle.extras (via accessor helpers):
  - per-document subword offsets,
  - attention masks,
  - token type ids,
  - special token masks,
  - tokenizer metadata and tokenization texts.
Alignment note:
- Subword-only bundles do not auto-create byte/character/word alignment maps.
- Streaming merged bundles rebuild alignments only when a :word level is present.
get_subword_metadata(bundle).tokenization_texts stores the tokenizer-view text used as the offset reference (tokenizer-normalized text).
Use accessors instead of reading extras directly:
get_subword_offsets(bundle)
get_subword_attention_mask(bundle)
get_subword_token_type_ids(bundle)
get_subword_special_tokens_mask(bundle)
get_subword_metadata(bundle)

Source forms for SubwordOptions(source = ...)
source accepts:
- Symbol model key (e.g. :core_bpe_en)
- local path String
- NamedTuple model/file specification
- KeemenaSubwords.FilesSpec
- preloaded KeemenaSubwords.AbstractSubwordTokenizer
format is forwarded for path-based loading (for example format = :hf_tokenizer_json).
Realistic paragraph checks (recommended pattern)
For production-like validation, run pipeline parity on a multi-sentence paragraph:
- Clean raw docs with clean_documents.
- Run preprocess_corpus on raw docs and cleaned docs.
- Assert token ids and offsets match between both runs.
- Compare each document's ids/offsets/masks against direct KeemenaSubwords.encode_result(...) output.
This is the same pattern used in package tests for long-text subword coverage and alignment sanity.
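When a parity assertion fails, it helps to know where the two runs diverge. Below is a small plain-Julia helper for that; first_mismatch is a hypothetical name, not a function from either package, and it assumes ordinary 1-based vectors whose elements can be compared with ==.

```julia
# Hypothetical helper: index of the first position where two sequences
# differ, or `nothing` when they are identical. Assumes 1-based vectors.
function first_mismatch(a::AbstractVector, b::AbstractVector)
    n = min(length(a), length(b))
    for i in 1:n
        a[i] == b[i] || return i
    end
    # Equal on the shared prefix: they differ only if the lengths differ.
    return length(a) == length(b) ? nothing : n + 1
end
```

With raw_bundle and clean_bundle standing for the two preprocess_corpus results from the steps above, first_mismatch(get_token_ids(raw_bundle, :subword), get_token_ids(clean_bundle, :subword)) should return nothing, and the same check can be applied to get_subword_offsets(raw_bundle) and get_subword_offsets(clean_bundle), assuming those accessors return vectors.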
Legacy callable bridge (still supported)
If you want full manual control through a callable tokenizer adapter, the existing path remains valid:
using KeemenaPreprocessing
using KeemenaSubwords
tok = load_tokenizer(:core_bpe_en)
cfg = PreprocessConfiguration(tokenizer_name = keemena_callable(tok))
bundle = preprocess_corpus(["hello world"]; config = cfg)

In this legacy callable path, the primary token stream remains under :word.