Training (Experimental)
Training support is currently experimental and intentionally separated from the pretrained tokenizer loading/encoding workflows.
Available now:
- train_bpe(...)
- train_bytebpe(...)
- train_unigram(...)
- train_wordpiece(...)
- train_wordpiece_result(...)
- train_sentencepiece(...)
- train_sentencepiece_result(...)
- train_hf_bert_wordpiece(...)
- train_hf_bert_wordpiece_result(...)
- train_hf_roberta_bytebpe(...)
- train_hf_roberta_bytebpe_result(...)
- train_hf_gpt2_bytebpe(...)
- train_hf_gpt2_bytebpe_result(...)
- save_training_bundle(result, out_dir; ...)
- load_training_bundle(out_dir)
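A minimal quick-start sketch. The vocab_size and min_frequency keywords are assumed here to match the trainers documented below; adjust them to your corpus.
using KeemenaSubwords

corpus = ["hello world", "hello there"]

# Train a small character-level BPE model and encode with it.
tok = train_bpe(corpus; vocab_size=64, min_frequency=1)
ids = encode(tok, "hello world")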
Training API
KeemenaSubwords.Training.train_bpe — Function
Train a character-level BPE tokenizer.
KeemenaSubwords.Training.train_bpe_result — Function
Train a character-level BPE tokenizer and return model artifacts.
KeemenaSubwords.Training.train_bytebpe — Function
Train a byte-level BPE tokenizer.
KeemenaSubwords.Training.train_bytebpe_result — Function
Train a byte-level BPE tokenizer and return model artifacts.
KeemenaSubwords.Training.train_unigram — Function
High-level Unigram training entry point.
KeemenaSubwords.Training.train_unigram_result — Function
Train a Unigram tokenizer and return model artifacts.
KeemenaSubwords.Training.train_wordpiece — Function
Train a WordPiece tokenizer.
KeemenaSubwords.Training.train_wordpiece_result — Function
Train a WordPiece tokenizer and return model artifacts.
KeemenaSubwords.Training.train_sentencepiece — Function
Train a SentencePiece tokenizer.
KeemenaSubwords.Training.train_sentencepiece_result — Function
Train a SentencePiece tokenizer and return model artifacts.
KeemenaSubwords.Training.train_hf_bert_wordpiece — Function
train_hf_bert_wordpiece(corpus; kwargs...) -> HuggingFaceJSONTokenizer
Train a BERT-style WordPiece tokenizer and return a HuggingFaceJSONTokenizer pipeline composed of:
- BertNormalizer
- BertPreTokenizer
- BertProcessing (CLS/SEP insertion)
- WordPiece decoder
Special token behavior:
- add_special_tokens=true inserts [CLS] and [SEP] via post-processing.
- Special tokens present verbatim in input text can also be matched via HF added_tokens patterns.
KeemenaPreprocessing integration:
tokenization_text = tokenization_view(tokenizer, clean_text)
encode_result(tokenizer, tokenization_text; assume_normalized=true, return_offsets=true, return_masks=true)
Export/reload flow:
export_tokenizer(tokenizer, out_dir; format=:hf_tokenizer_json)
load_hf_tokenizer_json(joinpath(out_dir, "tokenizer.json"))
KeemenaSubwords.Training.train_hf_bert_wordpiece_result — Function
train_hf_bert_wordpiece_result(corpus; kwargs...) -> TrainingResult{HuggingFaceJSONTokenizer,BertWordPieceTrainingConfig,BertWordPieceTrainingArtifacts}
Train a BERT-style WordPiece tokenizer and return:
- tokenizer::HuggingFaceJSONTokenizer
- config::BertWordPieceTrainingConfig
- artifacts::BertWordPieceTrainingArtifacts
The returned tokenizer includes BertNormalizer, BertPreTokenizer, BertProcessing, and WordPiece decoding, with special tokens exported as HF added_tokens for deterministic save/reload parity.
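A short sketch of the result-returning variant. Access via result.tokenizer, result.config, and result.artifacts is assumed from the field list above, and save_training_bundle (described further below) is used on the returned TrainingResult.
using KeemenaSubwords

corpus = ["Hello, world!", "Café naïve façade"]

result = train_hf_bert_wordpiece_result(corpus; vocab_size=128, min_frequency=1)
result.tokenizer    # HuggingFaceJSONTokenizer
result.config       # BertWordPieceTrainingConfig
result.artifacts    # BertWordPieceTrainingArtifacts

save_training_bundle(result, "out_hf_bert_bundle")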
KeemenaSubwords.Training.train_hf_roberta_bytebpe — Function
train_hf_roberta_bytebpe(corpus; kwargs...) -> HuggingFaceJSONTokenizer
Train a RoBERTa-style ByteLevel BPE tokenizer and return a HuggingFaceJSONTokenizer pipeline composed of:
- ByteLevel pre-tokenization
- RobertaProcessing (<s> ... </s> insertion)
- ByteLevel decoding
Special token behavior:
- add_special_tokens=true inserts BOS/EOS via RobertaProcessing.
- Special tokens present verbatim in input text can be matched via HF added_tokens patterns.
- By default the preset enables HF-style ByteLevel settings: use_regex=true, add_prefix_space=true, and trim_offsets=true.
KeemenaPreprocessing integration:
tokenization_text = tokenization_view(tokenizer, clean_text)
encode_result(tokenizer, tokenization_text; assume_normalized=true, return_offsets=true, return_masks=true)
Export/reload flow:
export_tokenizer(tokenizer, out_dir; format=:hf_tokenizer_json)
load_hf_tokenizer_json(joinpath(out_dir, "tokenizer.json"))
KeemenaSubwords.Training.train_hf_roberta_bytebpe_result — Function
train_hf_roberta_bytebpe_result(corpus; kwargs...) -> TrainingResult{HuggingFaceJSONTokenizer,RobertaByteBPETrainingConfig,RobertaByteBPETrainingArtifacts}
Train a RoBERTa-style ByteLevel BPE tokenizer and return:
- tokenizer::HuggingFaceJSONTokenizer
- config::RobertaByteBPETrainingConfig
- artifacts::RobertaByteBPETrainingArtifacts
The returned tokenizer wraps an inner trained ByteBPETokenizer and preserves an HF-native pipeline (ByteLevel pre-tokenizer/decoder + RobertaProcessing).
KeemenaSubwords.Training.train_hf_gpt2_bytebpe — Function
train_hf_gpt2_bytebpe(corpus; kwargs...) -> HuggingFaceJSONTokenizer
Train a GPT-2 style ByteLevel BPE tokenizer and return a HuggingFaceJSONTokenizer pipeline composed of:
- No-op normalizer
- ByteLevel pre-tokenization
- ByteLevel post-processing (no BOS/EOS insertion)
- ByteLevel decoding
Special token behavior:
- By default, this preset uses a single special token: special_tokens=Dict(:unk => "<|endoftext|>").
- add_special_tokens=true does not change ids by default because GPT-2 style ByteLevel pipelines do not auto-insert BOS/EOS.
- Special tokens present verbatim in input text can still be matched through HF added_tokens patterns.
KeemenaPreprocessing integration:
tokenization_text = tokenization_view(tokenizer, clean_text)
encode_result(tokenizer, tokenization_text; assume_normalized=true, return_offsets=true, return_masks=true)
Export/reload flow:
export_tokenizer(tokenizer, out_dir; format=:hf_tokenizer_json)
load_hf_tokenizer_json(joinpath(out_dir, "tokenizer.json"))
KeemenaSubwords.Training.train_hf_gpt2_bytebpe_result — Function
train_hf_gpt2_bytebpe_result(corpus; kwargs...) -> TrainingResult{HuggingFaceJSONTokenizer,GPT2ByteBPETrainingConfig,GPT2ByteBPETrainingArtifacts}
Train a GPT-2 style ByteLevel BPE tokenizer and return:
- tokenizer::HuggingFaceJSONTokenizer
- config::GPT2ByteBPETrainingConfig
- artifacts::GPT2ByteBPETrainingArtifacts
The returned tokenizer wraps an inner trained ByteBPETokenizer and preserves an HF-native ByteLevel pipeline. By default, exported HF JSON uses model.unk_token = null for GPT-2 compatibility, while the internal Julia base tokenizer still uses a concrete unknown token string.
KeemenaSubwords.Training.write_training_manifest — Function
write_training_manifest(outdir, manifest)
Write a TrainingManifestV1 to outdir/keemena_training_manifest.json.
KeemenaSubwords.Training.read_training_manifest — Function
read_training_manifest(outdir) -> TrainingManifestV1
Read outdir/keemena_training_manifest.json.
KeemenaSubwords.Training.save_training_bundle — Function
save_training_bundle(result, outdir; export_format=:auto, overwrite=false)
Export a trained tokenizer result and write a deterministic v1 manifest into outdir so the bundle can be reloaded later with load_training_bundle.
KeemenaSubwords.Training.load_training_bundle — Function
load_training_bundle(outdir) -> AbstractSubwordTokenizer
Load a tokenizer bundle previously written by save_training_bundle.
HF BERT WordPiece Preset
using KeemenaSubwords
corpus = [
"Hello, world!",
"Café naïve façade",
"你好 世界",
]
tok = train_hf_bert_wordpiece(
corpus;
vocab_size=128,
min_frequency=1,
lowercase=true,
strip_accents=nothing,
handle_chinese_chars=true,
clean_text=true,
)
export_tokenizer(tok, "out_hf_bert"; format=:hf_tokenizer_json)
reloaded = load_hf_tokenizer_json("out_hf_bert/tokenizer.json")HF RoBERTa ByteBPE Preset
using KeemenaSubwords
corpus = [
"hello world",
"hello, world!",
"café costs 5 euros",
]
tok = train_hf_roberta_bytebpe(
corpus;
vocab_size=384,
min_frequency=1,
)
export_tokenizer(tok, "out_hf_roberta"; format=:hf_tokenizer_json)
reloaded = load_hf_tokenizer_json("out_hf_roberta/tokenizer.json")RoBERTa preset defaults are chosen for HF-style ByteLevel behavior:
use_regex=trueapplies GPT-2 ByteLevel regex splitting.add_prefix_space=truematches RoBERTa-style leading-space handling.trim_offsets=truetrims span edges for whitespace while preserving the offsets contract: non-span specials use sentinel(0,0), while trimmed real tokens may become empty but remain in-bounds spans like(k,k)(never sentinel).
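A minimal sketch of that offsets contract, assuming tok from the RoBERTa example above; the exact field layout of the returned structure is not shown on this page, so inspect res interactively.
res = encode_result(tok, "hello world "; return_offsets=true, return_masks=true)
# With trim_offsets=true, any inserted specials such as <s>/</s> carry the
# sentinel offset (0, 0), while a real token whose span trims down to pure
# whitespace collapses to an empty but in-bounds span like (k, k).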
HF GPT-2 ByteBPE Preset
using KeemenaSubwords
corpus = [
"Hello my friend, how is your day going?",
"café 🙂",
]
tok = train_hf_gpt2_bytebpe(
corpus;
vocab_size=384,
min_frequency=1,
)
export_tokenizer(tok, "out_hf_gpt2"; format=:hf_tokenizer_json)
reloaded = load_hf_tokenizer_json("out_hf_gpt2/tokenizer.json")Note on pretokenizer
pretokenizeris used only during training to split input text into units for frequency counts.- Trained tokenizers do not persist or apply the training
pretokenizerat runtime. - For consistent behavior, apply equivalent preprocessing upstream (for example via KeemenaPreprocessing) before calling
encode/encode_result. - ByteBPE exports as
vocab.txt + merges.txt; when reloading exported files, useformat=:bytebpeif format auto-detection is ambiguous.
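A minimal sketch of the note above. The exact signature of the pretokenizer keyword is not documented on this page, so the lowercasing splitter below is an assumption; the runtime side simply mirrors the same normalization before encode.
using KeemenaSubwords

corpus = ["Hello World", "HELLO world again"]

# Assumed signature: a function from a string to its counting units.
tok = train_wordpiece(corpus; vocab_size=64, min_frequency=1,
                      pretokenizer = s -> split(lowercase(s)))

# The trained tokenizer does not re-apply that pretokenizer at runtime, so
# mirror the normalization upstream before encoding.
ids = encode(tok, lowercase("Hello World"))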
Training Bundles
using KeemenaSubwords
corpus = ["hello world", "café costs 5"]
result = train_wordpiece_result(corpus; vocab_size=96, min_frequency=1)
save_training_bundle(result, "out_bundle")
reloaded = load_training_bundle("out_bundle")
encode(reloaded, "hello café"; add_special_tokens=false)save_training_bundle writes exported tokenizer files plus keemena_training_manifest.json, so reload does not require remembering loader kwargs. Offsets behavior remains unchanged and compatible with tokenization_view(...) + encode_result(...; assume_normalized=true).
Current behavior:
- SentencePiece training supports both model_type=:unigram and model_type=:bpe.
- Unigram training defaults to SentencePiece-style whitespace_marker="▁" so multi-word text can round-trip through decode(encode(...)) (see the sketch below).
- If whitespace_marker="", runtime Unigram tokenization is still word-split, so decoding may collapse spaces in multi-word text (for example "hello world" -> "helloworld").
The pretrained-tokenizer APIs (load_tokenizer, tokenize, encode, encode_result, decode) remain stable and independent of the training codepaths.