Training (Experimental)

Training support is currently experimental and intentionally separated from the pretrained tokenizer loading/encoding workflows.

Available now:

  • train_bpe(...)
  • train_bytebpe(...)
  • train_unigram(...)
  • train_wordpiece(...)
  • train_wordpiece_result(...)
  • train_sentencepiece(...)
  • train_sentencepiece_result(...)
  • train_hf_bert_wordpiece(...)
  • train_hf_bert_wordpiece_result(...)
  • train_hf_roberta_bytebpe(...)
  • train_hf_roberta_bytebpe_result(...)
  • train_hf_gpt2_bytebpe(...)
  • train_hf_gpt2_bytebpe_result(...)
  • save_training_bundle(result, out_dir; ...)
  • load_training_bundle(out_dir)

Training API

KeemenaSubwords.Training.train_hf_bert_wordpiece (Function)
train_hf_bert_wordpiece(corpus; kwargs...) -> HuggingFaceJSONTokenizer

Train a BERT-style WordPiece tokenizer and return a HuggingFaceJSONTokenizer pipeline composed of:

  • BertNormalizer
  • BertPreTokenizer
  • BertProcessing (CLS/SEP insertion)
  • WordPiece decoder

Special token behavior:

  • add_special_tokens=true inserts [CLS] and [SEP] via post-processing.
  • Special tokens present verbatim in input text can also be matched via HF added_tokens patterns.

KeemenaPreprocessing integration:

  • tokenization_text = tokenization_view(tokenizer, clean_text)
  • encode_result(tokenizer, tokenization_text; assume_normalized=true, return_offsets=true, return_masks=true)

Export/reload flow:

  • export_tokenizer(tokenizer, out_dir; format=:hf_tokenizer_json)
  • load_hf_tokenizer_json(joinpath(out_dir, "tokenizer.json"))
KeemenaSubwords.Training.train_hf_bert_wordpiece_result (Function)
train_hf_bert_wordpiece_result(corpus; kwargs...) ->
    TrainingResult{HuggingFaceJSONTokenizer,BertWordPieceTrainingConfig,BertWordPieceTrainingArtifacts}

Train a BERT-style WordPiece tokenizer and return:

  • tokenizer::HuggingFaceJSONTokenizer
  • config::BertWordPieceTrainingConfig
  • artifacts::BertWordPieceTrainingArtifacts

The returned tokenizer includes BertNormalizer, BertPreTokenizer, BertProcessing, and WordPiece decoding, with special tokens exported as HF added_tokens for deterministic save/reload parity.
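A minimal usage sketch of the result-returning variant (the vocab_size/min_frequency kwargs mirror the preset examples below; the exact result field names follow the bullet list above and should be treated as assumptions):

```julia
using KeemenaSubwords

corpus = ["hello world", "hello there"]

# Train and keep the full result, not just the tokenizer.
result = train_hf_bert_wordpiece_result(corpus; vocab_size=128, min_frequency=1)

tok = result.tokenizer    # HuggingFaceJSONTokenizer
cfg = result.config       # BertWordPieceTrainingConfig
art = result.artifacts    # BertWordPieceTrainingArtifacts

# The tokenizer can then be used like any pretrained HF-JSON tokenizer.
ids = encode(tok, "hello world"; add_special_tokens=true)
```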

KeemenaSubwords.Training.train_hf_roberta_bytebpe (Function)
train_hf_roberta_bytebpe(corpus; kwargs...) -> HuggingFaceJSONTokenizer

Train a RoBERTa-style ByteLevel BPE tokenizer and return a HuggingFaceJSONTokenizer pipeline composed of:

  • ByteLevel pre-tokenization
  • RobertaProcessing (<s> ... </s> insertion)
  • ByteLevel decoding

Special token behavior:

  • add_special_tokens=true inserts BOS/EOS via RobertaProcessing.
  • Special tokens present verbatim in input text can be matched via HF added_tokens patterns.
  • By default the preset enables HF-style ByteLevel settings: use_regex=true, add_prefix_space=true, and trim_offsets=true.

KeemenaPreprocessing integration:

  • tokenization_text = tokenization_view(tokenizer, clean_text)
  • encode_result(tokenizer, tokenization_text; assume_normalized=true, return_offsets=true, return_masks=true)

Export/reload flow:

  • export_tokenizer(tokenizer, out_dir; format=:hf_tokenizer_json)
  • load_hf_tokenizer_json(joinpath(out_dir, "tokenizer.json"))
KeemenaSubwords.Training.train_hf_roberta_bytebpe_result (Function)
train_hf_roberta_bytebpe_result(corpus; kwargs...) ->
    TrainingResult{HuggingFaceJSONTokenizer,RobertaByteBPETrainingConfig,RobertaByteBPETrainingArtifacts}

Train a RoBERTa-style ByteLevel BPE tokenizer and return:

  • tokenizer::HuggingFaceJSONTokenizer
  • config::RobertaByteBPETrainingConfig
  • artifacts::RobertaByteBPETrainingArtifacts

The returned tokenizer wraps an inner trained ByteBPETokenizer and preserves an HF-native pipeline (ByteLevel pre-tokenizer/decoder + RobertaProcessing).

KeemenaSubwords.Training.train_hf_gpt2_bytebpe (Function)
train_hf_gpt2_bytebpe(corpus; kwargs...) -> HuggingFaceJSONTokenizer

Train a GPT-2 style ByteLevel BPE tokenizer and return a HuggingFaceJSONTokenizer pipeline composed of:

  • No-op normalizer
  • ByteLevel pre-tokenization
  • ByteLevel post-processing (no BOS/EOS insertion)
  • ByteLevel decoding

Special token behavior:

  • By default, this preset uses a single special token: special_tokens=Dict(:unk => "<|endoftext|>").
  • add_special_tokens=true does not change ids by default because GPT-2 style ByteLevel pipelines do not auto-insert BOS/EOS.
  • Special tokens present verbatim in input text can still be matched through HF added_tokens patterns.

KeemenaPreprocessing integration:

  • tokenization_text = tokenization_view(tokenizer, clean_text)
  • encode_result(tokenizer, tokenization_text; assume_normalized=true, return_offsets=true, return_masks=true)

Export/reload flow:

  • export_tokenizer(tokenizer, out_dir; format=:hf_tokenizer_json)
  • load_hf_tokenizer_json(joinpath(out_dir, "tokenizer.json"))
KeemenaSubwords.Training.train_hf_gpt2_bytebpe_result (Function)
train_hf_gpt2_bytebpe_result(corpus; kwargs...) ->
    TrainingResult{HuggingFaceJSONTokenizer,GPT2ByteBPETrainingConfig,GPT2ByteBPETrainingArtifacts}

Train a GPT-2 style ByteLevel BPE tokenizer and return:

  • tokenizer::HuggingFaceJSONTokenizer
  • config::GPT2ByteBPETrainingConfig
  • artifacts::GPT2ByteBPETrainingArtifacts

The returned tokenizer wraps an inner trained ByteBPETokenizer and preserves an HF-native ByteLevel pipeline. By default, exported HF JSON uses model.unk_token = null for GPT-2 compatibility, while the internal Julia base tokenizer still uses a concrete unknown-token string.

KeemenaSubwords.Training.save_training_bundle (Function)
save_training_bundle(result, outdir; export_format=:auto, overwrite=false)

Export a trained tokenizer result and write a deterministic v1 manifest into outdir so the bundle can be reloaded later with load_training_bundle.


HF BERT WordPiece Preset

using KeemenaSubwords

corpus = [
    "Hello, world!",
    "Café naïve façade",
    "你好 世界",
]

tok = train_hf_bert_wordpiece(
    corpus;
    vocab_size=128,
    min_frequency=1,
    lowercase=true,
    strip_accents=nothing,
    handle_chinese_chars=true,
    clean_text=true,
)

export_tokenizer(tok, "out_hf_bert"; format=:hf_tokenizer_json)
reloaded = load_hf_tokenizer_json("out_hf_bert/tokenizer.json")
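Continuing the preset example above, the two-step KeemenaPreprocessing integration from the docstring can be sketched as follows (here clean_text stands for text already cleaned upstream; the exact shape of the returned result is not shown):

```julia
clean_text = "hello, world!"

# Project cleaned text into the tokenizer's expected input form.
tokenization_text = tokenization_view(tok, clean_text)

# Encode without re-normalizing, requesting offsets and masks.
res = encode_result(tok, tokenization_text;
                    assume_normalized=true,
                    return_offsets=true,
                    return_masks=true)
```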

HF RoBERTa ByteBPE Preset

using KeemenaSubwords

corpus = [
    "hello world",
    "hello, world!",
    "café costs 5 euros",
]

tok = train_hf_roberta_bytebpe(
    corpus;
    vocab_size=384,
    min_frequency=1,
)

export_tokenizer(tok, "out_hf_roberta"; format=:hf_tokenizer_json)
reloaded = load_hf_tokenizer_json("out_hf_roberta/tokenizer.json")

RoBERTa preset defaults are chosen for HF-style ByteLevel behavior:

  • use_regex=true applies GPT-2 ByteLevel regex splitting.
  • add_prefix_space=true matches RoBERTa-style leading-space handling.
  • trim_offsets=true trims span edges for whitespace while preserving the offsets contract: non-span specials use sentinel (0,0), while trimmed real tokens may become empty but remain in-bounds spans like (k,k) (never sentinel).
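The offsets contract above can be exercised through encode_result (a sketch using the tok trained in the preset example; the layout of the returned offsets is deliberately left unspecified here):

```julia
res = encode_result(tok, tokenization_view(tok, "hello world");
                    assume_normalized=true, return_offsets=true)

# Under the contract described above:
#   - non-span specials such as <s> and </s> report the sentinel span (0, 0)
#   - a real token trimmed down to whitespace reports an in-bounds empty
#     span (k, k), never the sentinel
```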

HF GPT-2 ByteBPE Preset

using KeemenaSubwords

corpus = [
    "Hello my friend, how is your day going?",
    "café 🙂",
]

tok = train_hf_gpt2_bytebpe(
    corpus;
    vocab_size=384,
    min_frequency=1,
)

export_tokenizer(tok, "out_hf_gpt2"; format=:hf_tokenizer_json)
reloaded = load_hf_tokenizer_json("out_hf_gpt2/tokenizer.json")
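ByteLevel pipelines are byte-lossless, so a decode round-trip makes a reasonable smoke test for the freshly trained tokenizer (a sketch; it assumes no special tokens are inserted, which matches this preset's defaults):

```julia
text = "Hello my friend"
ids = encode(tok, text; add_special_tokens=false)

# ByteLevel decoding should recover the original bytes exactly.
roundtrip = decode(tok, ids)
```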

Note on pretokenizer

  • pretokenizer is used only during training to split input text into units for frequency counts.
  • Trained tokenizers do not persist or apply the training pretokenizer at runtime.
  • For consistent behavior, apply equivalent preprocessing upstream (for example via KeemenaPreprocessing) before calling encode/encode_result.
  • ByteBPE exports as vocab.txt + merges.txt; when reloading exported files, use format=:bytebpe if format auto-detection is ambiguous.
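Since the training pretokenizer is not persisted, consistency reduces to applying the same preprocessing at train time and at encode time. A sketch, where lowercase stands in for whatever upstream cleaning (for example KeemenaPreprocessing) you actually use:

```julia
using KeemenaSubwords

# Stand-in for an upstream cleaning step; substitute your real pipeline.
preprocess(s) = lowercase(s)

corpus = ["Hello World", "Hello there"]
tok = train_bpe(map(preprocess, corpus); vocab_size=96, min_frequency=1)

# Apply the identical preprocessing before encoding at runtime.
ids = encode(tok, preprocess("Hello World"); add_special_tokens=false)
```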

Training Bundles

using KeemenaSubwords

corpus = ["hello world", "café costs 5"]
result = train_wordpiece_result(corpus; vocab_size=96, min_frequency=1)

save_training_bundle(result, "out_bundle")
reloaded = load_training_bundle("out_bundle")

encode(reloaded, "hello café"; add_special_tokens=false)

save_training_bundle writes exported tokenizer files plus keemena_training_manifest.json, so reload does not require remembering loader kwargs. Offsets behavior remains unchanged and compatible with tokenization_view(...) + encode_result(...; assume_normalized=true).

Current behavior:

  • SentencePiece training supports both model_type=:unigram and model_type=:bpe.
  • Unigram training defaults to SentencePiece-style whitespace_marker="▁" so multi-word text can round-trip through decode(encode(...)).
  • If whitespace_marker="", runtime Unigram tokenization is still word-split, so decoding may collapse spaces in multi-word text (for example "hello world" -> "helloworld").
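The marker behavior can be seen directly in a decode round-trip (a sketch; the kwargs mirror the other trainers on this page, and the collapsed-space output is the example given above):

```julia
using KeemenaSubwords

corpus = ["hello world", "hello there world"]

# Default SentencePiece-style marker: multi-word text round-trips.
sp = train_sentencepiece(corpus; model_type=:unigram, vocab_size=96)
a = decode(sp, encode(sp, "hello world"; add_special_tokens=false))   # "hello world"

# Empty marker: runtime tokenization is word-split, so spaces collapse.
sp0 = train_unigram(corpus; vocab_size=96, whitespace_marker="")
b = decode(sp0, encode(sp0, "hello world"; add_special_tokens=false)) # "helloworld"
```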

The pretrained-tokenizer APIs (load_tokenizer, tokenize, encode, encode_result, decode) remain stable and independent from training codepaths.