Quick Guide Recipes
This page is a choose-your-path entry point for the most common KeemenaSubwords workflows. Each recipe is written as a guided walkthrough and links to deeper docs when you need detail.
Core invariants:
- Token ids in KeemenaSubwords are 1-based.
- Offsets are 1-based UTF-8 codeunit half-open spans `[start, stop)`. `(0, 0)` is the no-span sentinel (see the illustration after this list).
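As a plain-Julia illustration of the span convention (this is not a package API): the span `(1, 6)` over `"hello world"` selects codeunits 1 through 5.

```julia
s = "hello world"
start, stop = 1, 6                  # half-open span [start, stop)
String(codeunits(s)[start:stop-1])  # "hello"
```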
Recommended pipeline contract: `clean_text` comes from preprocessing, then `tokenization_text = tokenization_view(tokenizer, clean_text)`, then `encode_result(tokenizer, tokenization_text; assume_normalized=true, return_offsets=true, return_masks=true, ...)`.
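The same contract as a minimal runnable sketch, using the shipped `:core_bpe_en` model from the recipes below:

```julia
using KeemenaSubwords

tokenizer = load_tokenizer(:core_bpe_en)
clean_text = "hello world"                                    # from your preprocessing
tokenization_text = tokenization_view(tokenizer, clean_text)  # tokenizer-coordinate text
result = encode_result(tokenizer, tokenization_text;
                       assume_normalized=true, return_offsets=true, return_masks=true)
```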
Quick Handlers (1-2 line workflows)
Pick one quick handler and run it. You do not need all of them. Each handler returns extra outputs on purpose, so you can ignore fields you do not need now. If you want to customize behavior, the detailed recipes below walk through each step.
Pick one:
- One string -> token ids (numbers) and token pieces (readable chunks) with `quick_tokenize`
- Many strings -> per-string results (no padding) with `quick_encode_batch`
- Many strings -> padded training tensors and causal labels with `quick_causal_lm_batch`
- Train -> save bundle -> reload -> sanity check with `quick_train_bundle`
One string to tokens with quick_tokenize
- You have: one input `String`, for example `"hello world"`.
- You get: token ids (`Int` numbers), token pieces (readable chunks), and decoded text as a round-trip sanity check.
- Use it for: quick model-input prep, inspection, and early alignment checks.
Super simple:
```julia
using KeemenaSubwords
let out = quick_tokenize(:core_bpe_en, "hello world")
    (pieces=out.token_pieces, ids=out.token_ids, decoded=out.decoded_text)
end
```
```
(pieces = ["<s>", "hello</w>", "world</w>", "</s>"], ids = [3, 24, 29, 4], decoded = "hello world")
```

Peek inside:

```julia
using KeemenaSubwords
let out = quick_tokenize(:core_bpe_en, "hello world hello")
    (offsets_reference=out.metadata.offsets_reference, offsets=out.offsets)
end
```
```
(offsets_reference = :input_text, offsets = [(0, 0), (1, 6), (7, 12), (13, 18), (0, 0)])
```

Returns: `token_pieces`, `token_ids`, `decoded_text`, `tokenization_text`, `offsets`, `attention_mask`, `token_type_ids`, `special_tokens_mask`, and `metadata`. It returns offsets and masks by default, and you can ignore them unless you are doing alignment or training.
Common knobs:
- `add_special_tokens`: include start and end tokens used by many models.
- `return_offsets`: include where each token came from in the string.
- `return_masks`: include 0/1 masks that are useful for training.
- `apply_tokenization_view`: run the tokenizer-specific normalization view before encoding.
Many strings to per-sequence results with quick_encode_batch
- You have: `Vector{String}`, for example `["hello world", "hello"]`.
- You get: one structured result per string, plus per-string sequence lengths.
- Use it for: batched preprocessing before you decide how to pad or collate.
Super simple:
```julia
using KeemenaSubwords
quick_encode_batch(:core_wordpiece_en, ["hello world", "hello"]).sequence_lengths
```
```
2-element Vector{Int64}:
 4
 3
```

Peek inside:

```julia
using KeemenaSubwords
let out = quick_encode_batch(:core_wordpiece_en, ["hello world hello", "hello world"])
    (tokens=first(out.results[1].tokens, 6), ids=first(out.results[1].ids, 6))
end
```
```
(tokens = ["[CLS]", "hello", "world", "hello", "[SEP]"], ids = [3, 6, 7, 6, 4])
```

Returns: `tokenization_texts`, `results`, and `sequence_lengths`. Full structured per-string outputs live at `.results`.
Common knobs:
- `add_special_tokens`: include model-specific start and end tokens.
- `return_offsets`: keep alignment spans for each token.
- `return_masks`: keep 0/1 masks for each sequence.
- `apply_tokenization_view`: normalize each input into tokenizer coordinates first.
Go deeper: P5, Structured Outputs and Batching.
Many strings to training tensors with quick_causal_lm_batch
- You have: `Vector{String}` for a training batch.
- You get: padded `ids`, `attention_mask`, and `labels` matrices shaped `(seq_len, batch)`.
- Use it for: next-token training pipelines that need dense tensors.
Super simple:
```julia
using KeemenaSubwords
let out = quick_causal_lm_batch(:core_wordpiece_en, ["hello world", "hello"])
    (ids_size=size(out.ids), labels_size=size(out.labels), pad_token_id=out.pad_token_id)
end
```
```
(ids_size = (4, 2), labels_size = (4, 2), pad_token_id = 1)
```

Peek inside:

```julia
using KeemenaSubwords
let out = quick_causal_lm_batch(:core_wordpiece_en, ["hello world hello", "hello world"])
    (ids_col_1=out.ids[:, 1], labels_col_1=out.labels[:, 1], ignore_index=out.ignore_index)
end
```
```
(ids_col_1 = [3, 6, 7, 6, 4], labels_col_1 = [6, 7, 6, 4, -100], ignore_index = -100)
```

Returns: `ids`, `attention_mask`, `labels`, `token_type_ids`, `special_tokens_mask`, `tokenization_texts`, `sequence_lengths`, `pad_token_id`, `ignore_index`, and `zero_based`. `labels` are the next-token targets. Padding and the last real token are set to `ignore_index`.
Common knobs:
- `ignore_index`: value used where loss should be ignored.
- `zero_based`: convert labels to 0-based ids for external consumers.
- `pad_to_multiple_of`: round sequence length up for kernel-friendly shapes (see the sketch below).
- `add_special_tokens`: include model boundary tokens before collation.
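A minimal sketch of the `pad_to_multiple_of` knob; the resulting length of 8 is an assumption based on the knob's description above (the example batch pads to length 4 by default):

```julia
using KeemenaSubwords

# Round the padded sequence length up to a multiple of 8 for kernel-friendly shapes.
batch = quick_causal_lm_batch(:core_wordpiece_en, ["hello world", "hello"];
                              pad_to_multiple_of=8)
size(batch.ids, 1)  # expected: a multiple of 8
```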
Go deeper: P6, Structured Outputs and Batching.
Train a tokenizer bundle with quick_train_bundle
- You have: a small local corpus (`Vector{String}`).
- You get: saved bundle files and a reloadable tokenizer with sanity encode/decode outputs.
- Use it for: local tokenizer training and reproducible reload for experiments.
Super simple:
```julia
using KeemenaSubwords
quick_train_bundle(["hello world", "hello tokenizer", "world tokenizer"]; vocab_size=48, min_frequency=1).bundle_files
```
```
2-element Vector{String}:
 "keemena_training_manifest.json"
 "vocab.txt"
```

Peek inside:

```julia
using KeemenaSubwords
let out = quick_train_bundle(["alpha beta", "beta gamma", "alpha gamma"]; vocab_size=40, min_frequency=1)
    (sanity_ids=out.sanity_encoded_ids, sanity_decoded=out.sanity_decoded_text)
end
```
```
(sanity_ids = [1, 1], sanity_decoded = "[UNK] [UNK]")
```

Returns: `bundle_directory`, `bundle_files`, `tokenizer`, `training_summary`, `sanity_encoded_ids`, and `sanity_decoded_text`. This writes a small tokenizer bundle to disk so you can reload it later without remembering training settings.
Common knobs:
- `vocab_size`: target vocabulary size.
- `min_frequency`: minimum token frequency to keep.
- `sanity_text`: string used for the encode/decode sanity check.
- `overwrite`: allow reusing an existing bundle directory.
Go deeper: Training (experimental), Formats.
How to use this page
Use this 3-step mental model:
- Choose a tokenizer source: shipped registry key, local files, or gated install.
- Choose output shape: single input with `encode_result`, or many inputs with `encode_batch_result` plus collation.
- Choose metadata: offsets for alignment tasks, masks for training tasks, or both.
Choose a recipe by goal
- I just want token ids from text -> P1
- I need offsets for alignment -> P3
- I have many texts and want a batch -> P5
- I need training-ready tensors and causal labels -> P6
- I need Python interop -> P7
- I want to load local files -> P8
- I need a gated tokenizer -> P9
- I want to train a tokenizer -> T1
Pretrained tokenizer recipes (common)
P1: Load a shipped tokenizer and encode or decode
- You have: `text::String`, for example `"hello world"`.
- You want: token pieces (`Vector{String}`), token ids (`Vector{Int}`), and a decoded string.
- Objective: quickly verify end-to-end tokenization behavior on a built-in model, with the option to inspect offsets and masks in the same call.
- Steps:
  - Call `quick_tokenize(:core_bpe_en, text)` for a one-call output bundle.
  - Read `token_pieces`, `token_ids`, and `decoded_text`.
  - If needed, pass options like `add_special_tokens=false` or `return_offsets=false`.
```julia
using KeemenaSubwords
text = "hello world"
quick_output = quick_tokenize(
    :core_bpe_en,
    text;
    add_special_tokens=true,
    return_offsets=true,
    return_masks=true,
)
(
    token_pieces=quick_output.token_pieces,
    token_ids=quick_output.token_ids,
    decoded_text=quick_output.decoded_text,
    offsets_reference=quick_output.metadata.offsets_reference,
)
```
```
(token_pieces = ["<s>", "hello</w>", "world</w>", "</s>"], token_ids = [3, 24, 29, 4], decoded_text = "hello world", offsets_reference = :input_text)
```

- What you should see:
  - `token_pieces` is a vector of strings.
  - `token_ids` is a vector of integers.
  - `decoded_text` is a string and should be close to the input text for covered vocabulary.
- Concerns and setup notes:
  - `tokenize` returns readable pieces, `encode` returns integer ids, and `decode` maps ids back to text (a sketch follows this recipe's notes).
  - `quick_tokenize` uses `encode_result` internally, so offsets and masks are available by default.
  - `add_special_tokens=true` includes model specials (useful for model input); set `false` for raw spans.
  - Ids are always 1-based in this package.
- Next: if you need model selection, go to P2. If you need offsets, go to P3.
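A minimal sketch of that `tokenize` / `encode` / `decode` split, assuming `tokenize` accepts the same `(tokenizer, text)` arguments as `encode`:

```julia
using KeemenaSubwords

tokenizer = load_tokenizer(:core_bpe_en)
pieces = tokenize(tokenizer, "hello world")   # readable subword pieces
ids = encode(tokenizer, "hello world")        # 1-based integer ids
round_trip = decode(tokenizer, ids)           # ids mapped back to text
```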
What quick_tokenize does under the hood
```julia
using KeemenaSubwords
tokenizer = load_tokenizer(:core_bpe_en)
text = "hello world"
tokenization_text = tokenization_view(tokenizer, text)
result = encode_result(
    tokenizer,
    tokenization_text;
    assume_normalized=true,
    add_special_tokens=true,
    return_offsets=true,
    return_masks=true,
)
(
    token_pieces=result.tokens,
    token_ids=result.ids,
    decoded_text=decode(tokenizer, result.ids),
)
```
```
(token_pieces = ["<s>", "hello</w>", "world</w>", "</s>"], token_ids = [3, 24, 29, 4], decoded_text = "hello world")
```

P2: Discover models and inspect metadata
- You have: no tokenizer picked yet.
- You want: a shortlist of candidates with provenance and defaults.
- Objective: choose a safe model key and understand where it comes from.
- Steps:
  - Call `available_models(shipped=true)` for built-in keys.
  - Call `recommended_defaults_for_llms()` for practical default candidates.
  - Call `describe_model(key)` for provenance, distribution, and file expectations.
```julia
using KeemenaSubwords
shipped_model_keys = available_models(shipped=true)
recommended_keys = recommended_defaults_for_llms()
core_wordpiece_info = describe_model(:core_wordpiece_en)
preview_count = min(length(shipped_model_keys), 8)
shipped_preview = shipped_model_keys[1:preview_count]
(
    shipped_preview=shipped_preview,
    recommended_keys=recommended_keys,
    core_wordpiece=(
        format=core_wordpiece_info.format,
        distribution=core_wordpiece_info.distribution,
        description=core_wordpiece_info.description,
    ),
)
```
```
(shipped_preview = [:bert_base_multilingual_cased_wordpiece, :bert_base_uncased_wordpiece, :core_bpe_en, :core_sentencepiece_unigram_en, :core_wordpiece_en, :mistral_v1_sentencepiece, :mistral_v3_sentencepiece, :openai_gpt2_bpe], recommended_keys = [:tiktoken_cl100k_base, :tiktoken_o200k_base, :mistral_v3_sentencepiece, :phi2_bpe, :qwen2_5_bpe, :roberta_base_bpe, :xlm_roberta_base_sentencepiece_bpe], core_wordpiece = (format = :wordpiece_vocab, distribution = :shipped, description = "Tiny built-in English WordPiece model."))
```

- What you should see:
  - `shipped_preview` contains local-ready model keys.
  - `recommended_keys` contains practical LLM defaults.
  - `describe_model` returns structured metadata (format, distribution, description, upstream info).
- Concerns and setup notes:
  - `shipped` means tiny models included in the repository.
  - `artifact_public` means downloadable public artifacts.
  - `installable_gated` means the install flow requires credentials and license acceptance.
  - This recipe is offline-safe because it does not download.
- Next: model inventory details are in Built-In Models. For install flows, go to P9.
P3: Get ids plus offsets plus masks for alignment
- You have: `clean_text::String` from preprocessing.
- You want: a `TokenizationResult` with ids, offsets, and masks.
- Objective: get alignment-ready metadata in one call.
- Steps:
  - Call `load_tokenizer(...)`.
  - Call `tokenization_view(tokenizer, clean_text)` to get tokenizer-coordinate text.
  - Call `encode_result(...; assume_normalized=true, return_offsets=true, return_masks=true)`.
  - Read `result.metadata.offsets_reference` to confirm what string offsets are relative to.
```julia
using KeemenaSubwords
tokenizer = load_tokenizer(:core_sentencepiece_unigram_en)
clean_text = "Hello, world! Offsets demo."
tokenization_text = tokenization_view(tokenizer, clean_text)
result = encode_result(
    tokenizer,
    tokenization_text;
    assume_normalized=true,
    add_special_tokens=true,
    return_offsets=true,
    return_masks=true,
)
@assert result.offsets !== nothing
@assert result.special_tokens_mask !== nothing
preview_rows = [
    (
        token_index=i,
        token=result.tokens[i],
        offset=result.offsets[i],
        is_special=result.special_tokens_mask[i] == 1,
    )
    for i in 1:min(length(result.ids), 12)
]
(
    offsets_reference=result.metadata.offsets_reference,
    token_count=length(result.ids),
    preview_rows=preview_rows,
)
```
```
(offsets_reference = :input_text, token_count = 6, preview_rows = @NamedTuple{token_index::Int64, token::String, offset::Tuple{Int64, Int64}, is_special::Bool}[(token_index = 1, token = "<s>", offset = (0, 0), is_special = 1), (token_index = 2, token = "<unk>", offset = (1, 7), is_special = 1), (token_index = 3, token = "<unk>", offset = (8, 14), is_special = 1), (token_index = 4, token = "<unk>", offset = (15, 22), is_special = 1), (token_index = 5, token = "<unk>", offset = (23, 28), is_special = 1), (token_index = 6, token = "</s>", offset = (0, 0), is_special = 1)])
```

- What you should see:
  - `result.offsets` and `result.special_tokens_mask` are present.
  - `offsets_reference` is `:input_text` because `assume_normalized=true` and you passed `tokenization_text`.
  - Preview rows include offsets and special-token flags per token.
- Concerns and setup notes:
  - Use `assume_normalized=true` only when the input text is already `tokenization_view(...)` output.
  - Offsets are relative to whatever `offsets_reference` says.
  - Inserted specials often have `(0, 0)` sentinel offsets.
- Next: for offset semantics go to Normalization and Offsets Contract. For alignment algorithms go to Offsets Alignment Examples.
P4: Span text extraction from offsets (safe slicing)
- You have: token offsets from a `TokenizationResult`.
- You want: readable text snippets per token span.
- Objective: debug and inspect token alignment quickly.
- Steps:
  - Build `tokenization_text` with `tokenization_view`.
  - Produce offsets with `encode_result(...; return_offsets=true)`.
  - Call `try_span_substring(tokenization_text, offset)` for each offset.
```julia
using KeemenaSubwords
tokenizer = load_tokenizer(:core_sentencepiece_unigram_en)
tokenization_text = tokenization_view(tokenizer, "Hello, world! Offsets demo.")
result = encode_result(
    tokenizer,
    tokenization_text;
    assume_normalized=true,
    add_special_tokens=true,
    return_offsets=true,
    return_masks=true,
)
@assert result.offsets !== nothing
span_preview = [
    (
        token_index=i,
        token=result.tokens[i],
        offset=result.offsets[i],
        span_text=try_span_substring(tokenization_text, result.offsets[i]),
    )
    for i in 1:min(length(result.ids), 12)
]
span_preview
```
```
6-element Vector{@NamedTuple{token_index::Int64, token::String, offset::Tuple{Int64, Int64}, span_text::String}}:
 (token_index = 1, token = "<s>", offset = (0, 0), span_text = "")
 (token_index = 2, token = "<unk>", offset = (1, 7), span_text = "Hello,")
 (token_index = 3, token = "<unk>", offset = (8, 14), span_text = "world!")
 (token_index = 4, token = "<unk>", offset = (15, 22), span_text = "Offsets")
 (token_index = 5, token = "<unk>", offset = (23, 28), span_text = "demo.")
 (token_index = 6, token = "</s>", offset = (0, 0), span_text = "")
```

- What you should see:
  - Most spanful offsets produce `span_text::String`.
  - Sentinel or empty spans produce `""`.
  - In byte-level cases, `try_span_substring` may return `nothing` for non-boundary spans.
- Concerns and setup notes:
  - `try_span_substring` returns `nothing` when boundaries are not valid Julia string boundaries.
  - If you need bytes regardless of boundaries, use `span_codeunits(tokenization_text, offset)` (a sketch of this fallback follows this recipe's notes).
  - Keep the extraction text and the offset coordinate text consistent (`tokenization_text`).
- Next: go to Offsets Alignment Examples for overlap mapping and span-label workflows.
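A minimal sketch of the byte fallback mentioned above, combining `try_span_substring` with `span_codeunits`; the helper name `span_text_or_bytes` is made up for this example:

```julia
using KeemenaSubwords

# Hypothetical helper: prefer a readable substring, fall back to raw bytes
# when the span does not land on valid Julia string boundaries.
function span_text_or_bytes(tokenization_text::AbstractString, offset::Tuple{Int,Int})
    snippet = try_span_substring(tokenization_text, offset)
    snippet !== nothing && return snippet             # boundary-safe span
    return span_codeunits(tokenization_text, offset)  # raw bytes otherwise
end
```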
P5: Batch encode multiple sequences (no padding yet)
- You have: many texts (`Vector{String}`).
- You want: `Vector{TokenizationResult}` with one structured output per input sequence.
- Objective: prepare data for later collation while preserving per-sequence metadata.
- Steps:
  - Normalize each input with `tokenization_view`.
  - Call `encode_batch_result(...)` with offsets and masks enabled.
  - Inspect per-sequence lengths before padding.
```julia
using KeemenaSubwords
tokenizer = load_tokenizer(:core_wordpiece_en)
clean_texts = ["hello world", "hello", "world hello world"]
tokenization_texts = [tokenization_view(tokenizer, clean_text) for clean_text in clean_texts]
batch_results = encode_batch_result(
    tokenizer,
    tokenization_texts;
    assume_normalized=true,
    add_special_tokens=true,
    return_offsets=true,
    return_masks=true,
)
sequence_lengths = [length(result.ids) for result in batch_results]
(
    sequence_lengths=sequence_lengths,
    has_variable_lengths=length(unique(sequence_lengths)) > 1,
)
```
```
(sequence_lengths = [4, 3, 5], has_variable_lengths = true)
```

- What you should see:
  - `batch_results` is a vector, not a matrix.
  - Sequence lengths can differ.
  - Each element still has its own ids, masks, and optional offsets.
- Concerns and setup notes:
  - No padding is applied automatically.
  - This is intentional: you can choose task-specific collation later.
- Next: for padding and training tensors, go to P6 and Structured Outputs and Batching.
P6: Padding plus labels for training (pointer recipe)
- You have: many input texts (`Vector{String}`), or a precomputed `Vector{TokenizationResult}` if you split the steps yourself.
- You want: padded `(seq_len, batch)` matrices and causal LM labels.
- Objective: build training-ready tensors with explicit padding and masking behavior.
- Steps:
  - Call `quick_causal_lm_batch(...)` to run encode, collate, and label-shift in one call.
  - Read `ids`, `attention_mask`, and `labels`.
  - Use `zero_based=true` only for external consumers that expect 0-based labels.
```julia
using KeemenaSubwords
training_batch = quick_causal_lm_batch(
    :core_wordpiece_en,
    ["hello world", "hello"];
    add_special_tokens=true,
    return_offsets=false,
    ignore_index=-100,
    zero_based=false,
)
@assert size(training_batch.ids) == size(training_batch.attention_mask)
@assert size(training_batch.labels) == size(training_batch.ids)
(
    ids_size=size(training_batch.ids),
    labels_size=size(training_batch.labels),
    ignore_index_count=count(==(-100), training_batch.labels),
    pad_token_id=training_batch.pad_token_id,
)
```
```
(ids_size = (4, 2), labels_size = (4, 2), ignore_index_count = 3, pad_token_id = 1)
```

- What you should see:
  - `ids`, `attention_mask`, and `labels` all share the same matrix shape.
  - `ignore_index_count` is positive (padding and final-token masking).
  - `pad_token_id` is inferred from the tokenizer `pad_id` or `eos_id`.
- Concerns and setup notes:
  - KeemenaSubwords ids are 1-based.
  - `ignore_index=-100` is the common causal LM training convention.
  - The final valid token in each sequence should remain ignored.
- Next: go to Structured Outputs and Batching for fuller collation, causal labels, and block packing.
What quick_causal_lm_batch does under the hood
```julia
using KeemenaSubwords
tokenizer = load_tokenizer(:core_wordpiece_en)
input_texts = ["hello world", "hello"]
batch_encoding = quick_encode_batch(
    tokenizer,
    input_texts;
    add_special_tokens=true,
    return_offsets=false,
    return_masks=true,
)
collated = collate_padded_batch(batch_encoding.results; tokenizer=tokenizer)
labels = causal_lm_labels(collated.ids, collated.attention_mask; ignore_index=-100, zero_based=false)
(
    sequence_lengths=batch_encoding.sequence_lengths,
    ids_size=size(collated.ids),
    labels_size=size(labels),
)
```
```
(sequence_lengths = [4, 3], ids_size = (4, 2), labels_size = (4, 2))
```

P7: Export to Hugging Face tokenizer.json for Python
- You have: a tokenizer loaded in Julia.
- You want: a `tokenizer.json` file that Python can load.
- Objective: share identical tokenization rules across Julia and Python.
- Steps:
  - Load or train a tokenizer in Julia.
  - Call `export_tokenizer(...; format=:hf_tokenizer_json)`.
  - Load the emitted `tokenizer.json` in Python using `PreTrainedTokenizerFast`.
```julia
using KeemenaSubwords
tokenizer = load_tokenizer(:core_wordpiece_en)
output_directory = mktempdir()
export_tokenizer(tokenizer, output_directory; format=:hf_tokenizer_json)
isfile(joinpath(output_directory, "tokenizer.json"))
```
```
true
```

Python usage (non-executable in Documenter):

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="out_tokenizer/tokenizer.json")
```

- What you should see:
  - Julia writes `tokenizer.json` to the output directory.
  - Python loads that same file with `PreTrainedTokenizerFast`.
- Concerns and setup notes:
  - The exported file captures the tokenizer pipeline behavior supported by the package.
  - Keep format contracts in mind when sharing across runtimes.
- Next: details are in Tokenizer Formats and Required Files and LLM Cookbook.
P8: Load from a local path (auto-detect plus override)
- You have: local tokenizer files on disk.
- You want: a loaded tokenizer without guessing format details manually.
- Objective: use auto-detection when it works, and explicit overrides when needed.
- Steps:
  - Try `load_tokenizer("/path/to/model_dir")` first.
  - If the format is ambiguous, set `format=:...` explicitly.
  - Validate with a tiny `encode` or `tokenize` call.
Decision tree:
- If auto-detect works, use `load_tokenizer(path)`.
- If ambiguous or incorrect, use `load_tokenizer(path; format=:...)`.
```julia
# non-executable path placeholders
tokenizer_auto = load_tokenizer("/path/to/model_dir")
tokenizer_tiktoken = load_tokenizer("/path/to/tokenizer.model"; format=:tiktoken)
tokenizer_sentencepiece = load_tokenizer("/path/to/tokenizer.model"; format=:sentencepiece_model)
```

- What you should see:
  - Auto-detect succeeds for common directory layouts.
  - An explicit override resolves ambiguous `.model` cases.
- Concerns and setup notes:
  - `format` selects a file contract (required filenames and parsing rules).
  - The placeholder paths here are non-executable in docs.
- Next: go to Loading Tokenizers From Local Paths and Tokenizer Formats and Required Files.
P9: Install and load a gated model
- You have: model access credentials and accepted upstream license terms.
- You want: installed local assets for a gated model key.
- Objective: run a reproducible install-then-load workflow.
- Steps:
  - Set an HF token (for example in `ENV["HF_TOKEN"]`).
  - Call `install_model!(...; token=ENV["HF_TOKEN"])`.
  - Call `load_tokenizer(:model_key)` after the install.
```julia
# non-executable gated workflow
install_model!(:llama3_8b_tokenizer; token=ENV["HF_TOKEN"])
tokenizer = load_tokenizer(:llama3_8b_tokenizer)
```

- What you should see:
  - The install step fetches and stores assets for the key.
  - The load step then resolves locally by model key.
- Concerns and setup notes:
  - You must accept the upstream license terms before access is granted.
  - Keep secrets in environment variables, not source files.
- Next: go to Installable Gated Models and LLM Cookbook.
Training recipes (experimental)
Training is usually appropriate when you need domain adaptation, research control over vocabulary behavior, or constrained deployments that need custom tokenizer assets. Training APIs are experimental and may evolve faster than pretrained loading and encoding APIs.
T1: Train a tiny WordPiece tokenizer, save, reload, and encode
- You have: a small in-memory corpus (`Vector{String}`).
- You want: a trained tokenizer bundle you can reload reproducibly.
- Objective: run a fully local training round trip without network access.
- Steps:
  - Call `quick_train_bundle(:wordpiece, corpus; ...)`.
  - Read `bundle_files` and `tokenizer`.
  - Confirm sanity ids and decoded text from the returned fields.
```julia
using KeemenaSubwords
training_corpus = [
    "hello world",
    "hello tokenizer",
    "world of subwords",
]
training_output = quick_train_bundle(
    :wordpiece,
    training_corpus;
    vocab_size=64,
    min_frequency=1,
    sanity_text="hello world",
)
(
    bundle_directory=training_output.bundle_directory,
    bundle_files=training_output.bundle_files,
    sanity_encoded_ids=training_output.sanity_encoded_ids,
    sanity_decoded_text=training_output.sanity_decoded_text,
)
```
```
(bundle_directory = "/tmp/jl_lugS1s", bundle_files = ["keemena_training_manifest.json", "vocab.txt"], sanity_encoded_ids = [57, 56], sanity_decoded_text = "hello world")
```

- What you should see:
  - `bundle_files` includes tokenizer exports and the training manifest.
  - The returned tokenizer has already been reloaded from the bundle.
  - The workflow is deterministic for the same corpus and config.
- Concerns and setup notes:
  - The bundle gives reproducible reload without remembering loader kwargs.
  - `quick_train_bundle` currently supports `:wordpiece` plus the HF preset trainer symbols.
  - Avoid asserting exact token ids across different configs.
- Next: full API and preset coverage are in Training (experimental).
What quick_train_bundle does under the hood
```julia
using KeemenaSubwords
training_corpus = ["hello world", "hello tokenizer", "world tokenizer"]
training_result = train_wordpiece_result(training_corpus; vocab_size=48, min_frequency=1)
bundle_directory = mktempdir()
save_training_bundle(training_result, bundle_directory; overwrite=true)
reloaded_tokenizer = load_training_bundle(bundle_directory)
encoded_ids = encode(reloaded_tokenizer, "hello world"; add_special_tokens=false)
decoded_text = decode(reloaded_tokenizer, encoded_ids)
(
    bundle_files=sort(readdir(bundle_directory)),
    encoded_ids=encoded_ids,
    decoded_text=decoded_text,
)
```
```
(bundle_files = ["keemena_training_manifest.json", "vocab.txt"], encoded_ids = [44, 45], decoded_text = "hello world")
```

T2: Train HF BERT WordPiece preset and export tokenizer.json
- You have: text suited to BERT-style tokenization behavior.
- You want: a BERT preset tokenizer and optional HF export.
- Objective: use familiar BERT normalization and pretokenization defaults.
- Steps:
  - Call `train_hf_bert_wordpiece(corpus; ...)`.
  - Export with `export_tokenizer(...; format=:hf_tokenizer_json)`.
```julia
# non-executable training preset sketch
training_corpus = ["Hello, world!", "Tokenizer training example"]
tokenizer = train_hf_bert_wordpiece(training_corpus; vocab_size=128, min_frequency=1)
export_tokenizer(tokenizer, "out_hf_bert"; format=:hf_tokenizer_json)
```

- What you should see:
  - Training returns a WordPiece tokenizer configured for BERT-style behavior.
  - Export creates `tokenizer.json` for HF interop.
- Concerns and setup notes:
  - Presets are convenient defaults, not strict replication of every upstream variant.
- Next: see Training (experimental) and Tokenizer Formats and Required Files.
T3: Train HF RoBERTa or GPT-2 ByteBPE preset
- You have: corpus data for byte-level subword training.
- You want: a byte-level preset tokenizer in RoBERTa or GPT-2 style.
- Objective: use byte-level presets for robust coverage of arbitrary UTF-8 input.
- Steps:
  - Call `train_hf_roberta_bytebpe(...)` or `train_hf_gpt2_bytebpe(...)`.
  - Export with `export_tokenizer(...; format=:hf_tokenizer_json)` if needed.
```julia
# non-executable training preset sketch
training_corpus = ["hello world", "cafe costs 5"]
tokenizer = train_hf_roberta_bytebpe(training_corpus; vocab_size=384, min_frequency=1)
export_tokenizer(tokenizer, "out_hf_roberta"; format=:hf_tokenizer_json)
```

- What you should see:
  - Training returns a byte-level tokenizer preset.
  - Exported artifacts are reloadable in Julia and usable in compatible external tools.
- Concerns and setup notes:
  - Byte-level offsets still follow the package contract.
  - Span boundaries may not always be safe Julia string boundaries on multibyte text.
- Next: offset rules are in Normalization and Offsets Contract and examples are in Offsets Alignment Examples.
Other options (short list)
- Cache tokenizers for repeated use with `get_tokenizer_cached(...)` and clear the cache with `clear_tokenizer_cache!()` (see the sketch after this list).
- Use explicit loaders when file contracts are known, for example `load_bpe_gpt2`, `load_sentencepiece`, and `load_tiktoken`.
- Convert to 0-based ids only when an external consumer requires it: `ids_zero_based = token_ids .- 1`.
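A minimal caching sketch, assuming `get_tokenizer_cached` accepts the same registry keys as `load_tokenizer`:

```julia
using KeemenaSubwords

# Repeated lookups reuse the cached instance instead of re-reading model files.
tokenizer_first = get_tokenizer_cached(:core_bpe_en)
tokenizer_again = get_tokenizer_cached(:core_bpe_en)

# Drop cached tokenizers, for example to free memory between experiments.
clear_tokenizer_cache!()
```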