Training (Experimental)

Training support is currently experimental and intentionally separated from the pretrained tokenizer loading/encoding workflows.

Available now:

  • train_bpe(...)
  • train_bytebpe(...)
  • train_unigram(...)
  • train_wordpiece(...)
  • train_wordpiece_result(...)
  • train_sentencepiece(...)
  • train_sentencepiece_result(...)
  • train_hf_bert_wordpiece(...)
  • train_hf_bert_wordpiece_result(...)
  • train_hf_roberta_bytebpe(...)
  • train_hf_roberta_bytebpe_result(...)
  • train_hf_gpt2_bytebpe(...)
  • train_hf_gpt2_bytebpe_result(...)
  • save_training_bundle(result, out_dir; ...)
  • load_training_bundle(out_dir)

Training API

KeemenaSubwords.Training.train_hf_bert_wordpiece (Function)
train_hf_bert_wordpiece(corpus; kwargs...) -> HuggingFaceJSONTokenizer

Train a BERT-style WordPiece tokenizer and return a HuggingFaceJSONTokenizer pipeline composed of:

  • BertNormalizer
  • BertPreTokenizer
  • BertProcessing (CLS/SEP insertion)
  • WordPiece decoder

Special token behavior:

  • add_special_tokens=true inserts [CLS] and [SEP] via post-processing.
  • Special tokens present verbatim in input text can also be matched via HF added_tokens patterns.

KeemenaPreprocessing integration:

  • tokenization_text = tokenization_view(tokenizer, clean_text)
  • encode_result(tokenizer, tokenization_text; assume_normalized=true, return_offsets=true, return_masks=true)

Export/reload flow:

  • export_tokenizer(tokenizer, out_dir; format=:hf_tokenizer_json)
  • load_hf_tokenizer_json(joinpath(out_dir, "tokenizer.json"))
KeemenaSubwords.Training.train_hf_bert_wordpiece_result (Function)
train_hf_bert_wordpiece_result(corpus; kwargs...) ->
    TrainingResult{HuggingFaceJSONTokenizer,BertWordPieceTrainingConfig,BertWordPieceTrainingArtifacts}

Train a BERT-style WordPiece tokenizer and return:

  • tokenizer::HuggingFaceJSONTokenizer
  • config::BertWordPieceTrainingConfig
  • artifacts::BertWordPieceTrainingArtifacts

The returned tokenizer includes BertNormalizer, BertPreTokenizer, BertProcessing, and WordPiece decoding, with special tokens exported as HF added_tokens for deterministic save/reload parity.
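A minimal usage sketch of the result-returning variant (the vocab_size/min_frequency kwargs mirror the preset examples below; the exact result field names follow the bullet list above and should be treated as assumptions):

```julia
using KeemenaSubwords

corpus = ["hello world", "hello there"]

# Train and keep the full result, not just the tokenizer.
result = train_hf_bert_wordpiece_result(corpus; vocab_size=128, min_frequency=1)

tok = result.tokenizer    # HuggingFaceJSONTokenizer
cfg = result.config       # BertWordPieceTrainingConfig
art = result.artifacts    # BertWordPieceTrainingArtifacts

# The tokenizer can then be used like any pretrained HF-JSON tokenizer.
ids = encode(tok, "hello world"; add_special_tokens=true)
```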

KeemenaSubwords.Training.train_hf_roberta_bytebpe (Function)
train_hf_roberta_bytebpe(corpus; kwargs...) -> HuggingFaceJSONTokenizer

Train a RoBERTa-style ByteLevel BPE tokenizer and return a HuggingFaceJSONTokenizer pipeline composed of:

  • ByteLevel pre-tokenization
  • RobertaProcessing (<s> ... </s> insertion)
  • ByteLevel decoding

Special token behavior:

  • add_special_tokens=true inserts BOS/EOS via RobertaProcessing.
  • Special tokens present verbatim in input text can be matched via HF added_tokens patterns.
  • By default the preset enables HF-style ByteLevel settings: use_regex=true, add_prefix_space=true, and trim_offsets=true.

KeemenaPreprocessing integration:

  • tokenization_text = tokenization_view(tokenizer, clean_text)
  • encode_result(tokenizer, tokenization_text; assume_normalized=true, return_offsets=true, return_masks=true)

Export/reload flow:

  • export_tokenizer(tokenizer, out_dir; format=:hf_tokenizer_json)
  • load_hf_tokenizer_json(joinpath(out_dir, "tokenizer.json"))
KeemenaSubwords.Training.train_hf_roberta_bytebpe_result (Function)
train_hf_roberta_bytebpe_result(corpus; kwargs...) ->
    TrainingResult{HuggingFaceJSONTokenizer,RobertaByteBPETrainingConfig,RobertaByteBPETrainingArtifacts}

Train a RoBERTa-style ByteLevel BPE tokenizer and return:

  • tokenizer::HuggingFaceJSONTokenizer
  • config::RobertaByteBPETrainingConfig
  • artifacts::RobertaByteBPETrainingArtifacts

The returned tokenizer wraps an inner trained ByteBPETokenizer and preserves an HF-native pipeline (ByteLevel pre-tokenizer/decoder + RobertaProcessing).

KeemenaSubwords.Training.train_hf_gpt2_bytebpe (Function)
train_hf_gpt2_bytebpe(corpus; kwargs...) -> HuggingFaceJSONTokenizer

Train a GPT-2 style ByteLevel BPE tokenizer and return a HuggingFaceJSONTokenizer pipeline composed of:

  • No-op normalizer
  • ByteLevel pre-tokenization
  • ByteLevel post-processing (no BOS/EOS insertion)
  • ByteLevel decoding

Special token behavior:

  • By default, this preset uses a single special token: special_tokens=Dict(:unk => "<|endoftext|>").
  • add_special_tokens=true does not change ids by default because GPT-2 style ByteLevel pipelines do not auto-insert BOS/EOS.
  • Special tokens present verbatim in input text can still be matched through HF added_tokens patterns.

KeemenaPreprocessing integration:

  • tokenization_text = tokenization_view(tokenizer, clean_text)
  • encode_result(tokenizer, tokenization_text; assume_normalized=true, return_offsets=true, return_masks=true)

Export/reload flow:

  • export_tokenizer(tokenizer, out_dir; format=:hf_tokenizer_json)
  • load_hf_tokenizer_json(joinpath(out_dir, "tokenizer.json"))
KeemenaSubwords.Training.train_hf_gpt2_bytebpe_result (Function)
train_hf_gpt2_bytebpe_result(corpus; kwargs...) ->
    TrainingResult{HuggingFaceJSONTokenizer,GPT2ByteBPETrainingConfig,GPT2ByteBPETrainingArtifacts}

Train a GPT-2 style ByteLevel BPE tokenizer and return:

  • tokenizer::HuggingFaceJSONTokenizer
  • config::GPT2ByteBPETrainingConfig
  • artifacts::GPT2ByteBPETrainingArtifacts

The returned tokenizer wraps an inner trained ByteBPETokenizer and preserves an HF-native ByteLevel pipeline. By default, exported HF JSON uses model.unk_token = null for GPT-2 compatibility, while the internal Julia base tokenizer still uses a concrete unknown-token string.

KeemenaSubwords.Training.save_training_bundle (Function)
save_training_bundle(result, outdir; export_format=:auto, overwrite=false)

Export a trained tokenizer result and write a deterministic v1 manifest into outdir so the bundle can be reloaded later with load_training_bundle.


HF BERT WordPiece Preset

using KeemenaSubwords

corpus = [
    "Hello, world!",
    "Café naïve façade",
    "你好 世界",
]

tok = train_hf_bert_wordpiece(
    corpus;
    vocab_size=128,
    min_frequency=1,
    lowercase=true,
    strip_accents=nothing,
    handle_chinese_chars=true,
    clean_text=true,
)

export_tokenizer(tok, "out_hf_bert"; format=:hf_tokenizer_json)
reloaded = load_hf_tokenizer_json("out_hf_bert/tokenizer.json")
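Continuing the preset example above, the two-step KeemenaPreprocessing integration from the docstring can be sketched as follows (here clean_text stands for text already cleaned upstream; the exact shape of the returned result is not shown):

```julia
clean_text = "hello, world!"

# Project cleaned text into the tokenizer's expected input form.
tokenization_text = tokenization_view(tok, clean_text)

# Encode without re-normalizing, requesting offsets and masks.
res = encode_result(tok, tokenization_text;
                    assume_normalized=true,
                    return_offsets=true,
                    return_masks=true)
```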

HF RoBERTa ByteBPE Preset

using KeemenaSubwords

corpus = [
    "hello world",
    "hello, world!",
    "café costs 5 euros",
]

tok = train_hf_roberta_bytebpe(
    corpus;
    vocab_size=384,
    min_frequency=1,
)

export_tokenizer(tok, "out_hf_roberta"; format=:hf_tokenizer_json)
reloaded = load_hf_tokenizer_json("out_hf_roberta/tokenizer.json")

RoBERTa preset defaults are chosen for HF-style ByteLevel behavior:

  • use_regex=true applies GPT-2 ByteLevel regex splitting.
  • add_prefix_space=true matches RoBERTa-style leading-space handling.
  • trim_offsets=true trims span edges for whitespace while preserving the offsets contract: non-span specials use sentinel (0,0), while trimmed real tokens may become empty but remain in-bounds spans like (k,k) (never sentinel).
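The offsets contract above can be exercised through encode_result (a sketch using the tok trained in the preset example; the layout of the returned offsets is deliberately left unspecified here):

```julia
res = encode_result(tok, tokenization_view(tok, "hello world");
                    assume_normalized=true, return_offsets=true)

# Under the contract described above:
#   - non-span specials such as <s> and </s> report the sentinel span (0, 0)
#   - a real token trimmed down to whitespace reports an in-bounds empty
#     span (k, k), never the sentinel
```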

HF GPT-2 ByteBPE Preset

using KeemenaSubwords

corpus = [
    "Hello my friend, how is your day going?",
    "café 🙂",
]

tok = train_hf_gpt2_bytebpe(
    corpus;
    vocab_size=384,
    min_frequency=1,
)

export_tokenizer(tok, "out_hf_gpt2"; format=:hf_tokenizer_json)
reloaded = load_hf_tokenizer_json("out_hf_gpt2/tokenizer.json")
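ByteLevel pipelines are byte-lossless, so a decode round-trip makes a reasonable smoke test for the freshly trained tokenizer (a sketch; it assumes no special tokens are inserted, which matches this preset's defaults):

```julia
text = "Hello my friend"
ids = encode(tok, text; add_special_tokens=false)

# ByteLevel decoding should recover the original bytes exactly.
roundtrip = decode(tok, ids)
```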

Note on pretokenizer

  • pretokenizer is used only during training to split input text into units for frequency counts.
  • Trained tokenizers do not persist or apply the training pretokenizer at runtime.
  • For consistent behavior, apply equivalent preprocessing upstream (for example via KeemenaPreprocessing) before calling encode/encode_result.
  • ByteBPE exports as vocab.txt + merges.txt; when reloading exported files, use format=:bytebpe if format auto-detection is ambiguous.
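Since the training pretokenizer is not persisted, consistency reduces to applying the same preprocessing at train time and at encode time. A sketch, where lowercase stands in for whatever upstream cleaning (for example KeemenaPreprocessing) you actually use:

```julia
using KeemenaSubwords

# Stand-in for an upstream cleaning step; substitute your real pipeline.
preprocess(s) = lowercase(s)

corpus = ["Hello World", "Hello there"]
tok = train_bpe(map(preprocess, corpus); vocab_size=96, min_frequency=1)

# Apply the identical preprocessing before encoding at runtime.
ids = encode(tok, preprocess("Hello World"); add_special_tokens=false)
```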

Training Bundles

using KeemenaSubwords

corpus = ["hello world", "café costs 5"]
result = train_wordpiece_result(corpus; vocab_size=96, min_frequency=1)

save_training_bundle(result, "out_bundle")
reloaded = load_training_bundle("out_bundle")

encode(reloaded, "hello café"; add_special_tokens=false)

save_training_bundle writes exported tokenizer files plus keemena_training_manifest.json, so reload does not require remembering loader kwargs. Offsets behavior remains unchanged and compatible with tokenization_view(...) + encode_result(...; assume_normalized=true).

Current behavior:

  • SentencePiece training supports both model_type=:unigram and model_type=:bpe.
  • Unigram training defaults to SentencePiece-style whitespace_marker="▁" so multi-word text can round-trip through decode(encode(...)).
  • If whitespace_marker="", runtime Unigram tokenization is still word-split, so decoding may collapse spaces in multi-word text (for example "hello world" -> "helloworld").
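The marker behavior can be seen directly in a decode round-trip (a sketch; the kwargs mirror the other trainers on this page, and the collapsed-space output is the example given above):

```julia
using KeemenaSubwords

corpus = ["hello world", "hello there world"]

# Default SentencePiece-style marker: multi-word text round-trips.
sp = train_sentencepiece(corpus; model_type=:unigram, vocab_size=96)
a = decode(sp, encode(sp, "hello world"; add_special_tokens=false))   # "hello world"

# Empty marker: runtime tokenization is word-split, so spaces collapse.
sp0 = train_unigram(corpus; vocab_size=96, whitespace_marker="")
b = decode(sp0, encode(sp0, "hello world"; add_special_tokens=false)) # "helloworld"
```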

The pretrained-tokenizer APIs (load_tokenizer, tokenize, encode, encode_result, decode) remain stable and independent from training codepaths.