API Reference
Explicit Loader APIs
KeemenaSubwords.load_bpe — Function
Load a BPE tokenizer from either a directory (vocab.txt + merges.txt) or a vocab file path.
Load a BPE tokenizer from explicit vocab + merges paths.
KeemenaSubwords.load_bytebpe — Function
Load a byte-level BPE tokenizer from a directory (vocab.txt + merges.txt) or vocab path.
Load a byte-level BPE tokenizer from explicit vocab + merges paths.
KeemenaSubwords.load_bpe_gpt2 — Function
Load GPT-2 / RoBERTa style BPE from vocab.json + merges.txt.
Example: load_bpe_gpt2("/path/to/vocab.json", "/path/to/merges.txt")
KeemenaSubwords.load_bpe_encoder — Function
Load GPT-2 encoder variant from encoder.json + vocab.bpe.
Example: load_bpe_encoder("/path/to/encoder.json", "/path/to/vocab.bpe")
KeemenaSubwords.load_unigram — Function
Load a Unigram tokenizer from unigram.tsv (file or directory).
Expected format (tab-separated): token<TAB>score[<TAB>special_symbol]
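As an illustrative sketch (the tokens, scores, and path below are made up), a minimal unigram.tsv and its loading could look like this:

```julia
using KeemenaSubwords

# Hypothetical unigram.tsv contents (token<TAB>score[<TAB>special_symbol]):
#   <unk>	0.0	unk
#   ▁the	-2.31
#   ing	-3.05
tok = load_unigram("/path/to/unigram.tsv")  # also accepts a directory containing unigram.tsv
pieces = tokenize(tok, "the thing")         # deterministic Viterbi segmentation into pieces
```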
KeemenaSubwords.load_wordpiece — Function
Load a WordPiece tokenizer from a vocab file path or a directory containing vocab.txt.
Examples:
load_wordpiece("/path/to/vocab.txt")
load_wordpiece("/path/to/model_dir")
KeemenaSubwords.load_sentencepiece — Function
Load a SentencePiece .model file.
Supported inputs:
- standard SentencePiece binary protobuf .model / .model.v3 payloads
- Keemena text-exported model files:
  - key/value lines (type=unigram|bpe, whitespace_marker=▁, unk_token=<unk>)
  - piece rows: piece<TAB>token<TAB>score[<TAB>special_symbol]
  - bpe merge rows (for type=bpe): merge<TAB>left<TAB>right
Examples:
load_sentencepiece("/path/to/tokenizer.model"; kind=:auto)
load_sentencepiece("/path/to/tokenizer.model.v3"; kind=:bpe)
KeemenaSubwords.load_tiktoken — Function
Load a tiktoken encoding file (*.tiktoken).
The expected format is line-based: <base64_token_bytes><space><rank> where ranks are non-negative integers.
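For illustration, a single line of this format can be decoded with nothing but the Julia standard library; the entry below is made up, and the package's own loader should be preferred in practice:

```julia
using Base64

line = "aGVsbG8= 31373"                  # hypothetical entry: Base64 of "hello", rank 31373
b64, rank_str = split(line, ' ')
token_bytes = base64decode(String(b64))  # raw UInt8 bytes of the token
rank = parse(Int, rank_str)              # non-negative integer rank
@assert String(token_bytes) == "hello"
```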
Examples:
load_tiktoken("/path/to/o200k_base.tiktoken")
load_tiktoken("/path/to/tokenizer.model") (when the file contains tiktoken text lines)
KeemenaSubwords.load_hf_tokenizer_json — Function
Load a Hugging Face tokenizer.json tokenizer in pure Julia.
Expected files:
- tokenizer.json directly, or
- a directory containing tokenizer.json.
Examples:
load_hf_tokenizer_json("/path/to/tokenizer.json")
load_hf_tokenizer_json("/path/to/model_dir")
KeemenaSubwords.load_tokenizer — Function
Load tokenizer by built-in model name.
Load tokenizer from file system path.
Common format contracts:
- :hf_tokenizer_json -> tokenizer.json
- :bpe_gpt2 -> vocab.json + merges.txt
- :bpe_encoder -> encoder.json + vocab.bpe
- :wordpiece / :wordpiece_vocab -> vocab.txt
- :sentencepiece_model -> *.model / *.model.v3 / sentencepiece.bpe.model
- :tiktoken -> *.tiktoken or tiktoken-text tokenizer.model
Examples:
load_tokenizer("/path/to/model_dir")
load_tokenizer("/path/to/tokenizer.model"; format=:tiktoken)
load_tokenizer("/path/to/tokenizer.json"; format=:hf_tokenizer_json)
Load tokenizer from explicit (vocab_path, merges_path) tuple.
This tuple form is for classic BPE/byte-level BPE (vocab.txt + merges.txt) or explicit JSON-pair loaders (vocab.json + merges.txt, encoder.json + vocab.bpe) when accompanied by format.
Load tokenizer from a named specification.
Examples:
(format=:wordpiece, path="/.../vocab.txt")
(format=:hf_tokenizer_json, path="/.../tokenizer.json")
(format=:unigram, path="/.../unigram.tsv")
(format=:bpe_gpt2, vocab_json="/.../vocab.json", merges_txt="/.../merges.txt")
(format=:bpe_encoder, encoder_json="/.../encoder.json", vocab_bpe="/.../vocab.bpe")
(format=:wordpiece, vocab_txt="/.../vocab.txt") (alias)
(format=:sentencepiece_model, model_file="/.../tokenizer.model") (alias)
(format=:tiktoken, encoding_file="/.../o200k_base.tiktoken") (alias)
(format=:hf_tokenizer_json, tokenizer_json="/.../tokenizer.json") (alias)
(format=:unigram, unigram_tsv="/.../unigram.tsv") (alias)
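A sketch of the named-specification form in use (all paths are placeholders):

```julia
using KeemenaSubwords

# Explicit two-file GPT-2 style BPE specification:
tok = load_tokenizer((format=:bpe_gpt2,
                      vocab_json="/path/to/vocab.json",
                      merges_txt="/path/to/merges.txt"))

# Single-path specification via an alias keyword:
wp = load_tokenizer((format=:wordpiece, vocab_txt="/path/to/vocab.txt"))
```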
Load tokenizer from a FilesSpec.
KeemenaSubwords.detect_tokenizer_format — Function
Detect tokenizer format from a local file or directory.
Returns one of symbols such as :hf_tokenizer_json, :bpe_gpt2, :bpe_encoder, :sentencepiece_model, :tiktoken, :wordpiece, :bpe, or :unigram.
Examples:
detect_tokenizer_format("/path/to/model_dir")
detect_tokenizer_format("/path/to/tokenizer.model")
KeemenaSubwords.detect_tokenizer_files — Function
Inspect a tokenizer directory and return detected candidate files.
Example: detect_tokenizer_files("/path/to/model_dir")
Structured encoding and file-spec APIs are also part of the public surface: TokenizationResult, FilesSpec, normalize, tokenization_view, requires_tokenizer_normalization, offsets_coordinate_system, offsets_index_base, offsets_span_style, offsets_sentinel, has_span, has_nonempty_span, span_ncodeunits, span_codeunits, is_valid_string_boundary, try_span_substring, offsets_are_nonoverlapping, validate_offsets_contract, assert_offsets_contract, encode_result, encode_batch_result.
Quick Handler APIs
KeemenaSubwords.quick_tokenize — Function
quick_tokenize(tokenizer_or_source, input_text; kwargs...) -> NamedTuple
High-level one-call wrapper for common single-text tokenization workflows.
This helper applies the recommended offsets pipeline by default:
tokenization_text = tokenization_view(tokenizer, input_text)
encode_result(tokenizer, tokenization_text; assume_normalized=true, ...)
Supported inputs:
quick_tokenize(tokenizer::AbstractSubwordTokenizer, input_text; ...)
quick_tokenize(source::Symbol, input_text; format=nothing, prefetch=true, ...)
quick_tokenize(source::AbstractString, input_text; format=nothing, prefetch=true, ...)
Keyword arguments:
add_special_tokens::Bool=true
apply_tokenization_view::Bool=true
return_offsets::Bool=true
return_masks::Bool=true
Returns a NamedTuple with keys:
token_pieces, token_ids, decoded_text, tokenization_text, offsets, attention_mask, token_type_ids, special_tokens_mask, metadata
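A usage sketch (the model key :gpt2 is illustrative; check available_models() for the actual built-in names):

```julia
using KeemenaSubwords

res = quick_tokenize(:gpt2, "Hello world")
res.token_pieces   # Vector{String} of subword pieces
res.token_ids      # corresponding 1-based token IDs
res.offsets        # half-open codeunit spans into res.tokenization_text
```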
KeemenaSubwords.quick_encode_batch — Function
quick_encode_batch(tokenizer_or_source, input_texts; kwargs...) -> NamedTuple
High-level wrapper for batch structured encoding.
By default, each input text is first converted with tokenization_view so offsets and alignment metadata are anchored to tokenizer-coordinate text.
Keyword arguments:
add_special_tokens::Bool=true
apply_tokenization_view::Bool=true
return_offsets::Bool=true
return_masks::Bool=true
format::Union{Nothing,Symbol}=nothing (source overloads only)
prefetch::Bool=true (source overloads only)
Returns a NamedTuple with keys:
tokenization_texts, results, sequence_lengths, metadata
KeemenaSubwords.collate_padded_batch — Function
collate_padded_batch(results; tokenizer=nothing, pad_token_id=nothing, kwargs...) -> NamedTuple
Collate a Vector{TokenizationResult} into dense (sequence_length, batch_size) matrices.
Returns:
ids::Matrix{Int}
attention_mask::Matrix{Int}
token_type_ids::Matrix{Int}
special_tokens_mask::Matrix{Int}
sequence_lengths::Vector{Int}
pad_token_id::Int
pad_side::Symbol
Padding behavior:
- ids are filled with pad_token_id.
- attention_mask uses 1 for valid tokens and 0 for padding.
- token_type_ids defaults to 0 where missing.
- special_tokens_mask defaults to 0 on valid tokens where missing and uses 1 on padding positions.
Pad token selection:
- If pad_token_id is provided, it is used directly.
- Otherwise pad_id(tokenizer) is used when available.
- Otherwise eos_id(tokenizer) is used when available.
- If none are available, throws an ArgumentError.
Optional keyword arguments:
pad_to_multiple_of::Union{Nothing,Int}=nothing
pad_side::Symbol=:right (only right padding is currently supported)
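A sketch of encoding a batch and collating it into dense matrices (the model key and pad_token_id are illustrative):

```julia
using KeemenaSubwords

batch  = quick_encode_batch(:gpt2, ["short text", "a somewhat longer input text"])
padded = collate_padded_batch(batch.results; pad_token_id=1)
padded.ids               # (sequence_length, batch_size) matrix, padded with 1
padded.attention_mask    # 1 on valid tokens, 0 on padding
padded.sequence_lengths  # per-sequence unpadded lengths
```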
KeemenaSubwords.causal_lm_labels — Function
causal_lm_labels(ids, attention_mask; ignore_index=-100, zero_based=false) -> Matrix{Int}
Build next-token labels for causal language modeling from padded ids and attention_mask matrices shaped (sequence_length, batch_size).
For each sequence column:
- valid non-final positions receive the next valid token id,
- the final valid position receives ignore_index,
- padding positions receive ignore_index.
When zero_based=true, subtracts 1 from all non-ignored labels to support consumers that expect 0-based ids.
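A toy illustration of these labeling rules, with hand-built (sequence_length, batch_size) matrices:

```julia
using KeemenaSubwords

ids            = [5 7; 6 8; 2 0]   # column 2 is padded at its last position
attention_mask = [1 1; 1 1; 1 0]

labels = causal_lm_labels(ids, attention_mask)
# Under the rules above (ignore_index = -100) this should give:
#   column 1: [6, 2, -100]    (shift by one; final valid position ignored)
#   column 2: [8, -100, -100] (final valid position and padding ignored)
```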
KeemenaSubwords.quick_causal_lm_batch — Function
quick_causal_lm_batch(tokenizer_or_source, input_texts; kwargs...) -> NamedTuple
One-call helper for training-ready causal LM tensors.
Pipeline:
quick_encode_batch(...; return_masks=true)
collate_padded_batch(...)
causal_lm_labels(...)
Keyword arguments:
add_special_tokens::Bool=true
apply_tokenization_view::Bool=true
return_offsets::Bool=false
pad_token_id::Union{Nothing,Int}=nothing
pad_to_multiple_of::Union{Nothing,Int}=nothing
pad_side::Symbol=:right
ignore_index::Int=-100
zero_based::Bool=false
format::Union{Nothing,Symbol}=nothing (source overloads only)
prefetch::Bool=true (source overloads only)
Returns a NamedTuple with keys:
ids, attention_mask, labels, token_type_ids, special_tokens_mask, tokenization_texts, sequence_lengths, pad_token_id, ignore_index, zero_based
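A one-call sketch (model key illustrative):

```julia
using KeemenaSubwords

out = quick_causal_lm_batch(:gpt2, ["first example", "a second, longer example"])
out.ids             # padded (sequence_length, batch_size) input ids
out.labels          # next-token labels, ignore_index (-100) on ignored positions
out.attention_mask  # 1 on valid tokens, 0 on padding
```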
KeemenaSubwords.quick_train_bundle — Function
quick_train_bundle(trainer, corpus; kwargs...) -> NamedTuple
High-level training round-trip helper:
- Train with a selected trainer.
- Save a training bundle with save_training_bundle.
- Reload with load_training_bundle.
- Run a sanity encode/decode pass.
Supported trainer symbols:
:wordpiece, :hf_bert_wordpiece, :hf_roberta_bytebpe, :hf_gpt2_bytebpe
Convenience overload:
quick_train_bundle(corpus; kwargs...) defaults to trainer=:wordpiece.
Keyword arguments:
bundle_directory::Union{Nothing,AbstractString}=nothing
overwrite::Bool=true
export_format::Symbol=:auto
sanity_text::AbstractString="hello world"
- plus trainer-specific keywords forwarded to the selected train_*_result.
Returns a NamedTuple with keys:
bundle_directory, bundle_files, training_summary, tokenizer, sanity_encoded_ids, sanity_decoded_text
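A minimal round-trip sketch (the corpus and keyword values are illustrative):

```julia
using KeemenaSubwords

corpus = ["hello world", "hello there", "worldly words"]
out = quick_train_bundle(:wordpiece, corpus; sanity_text="hello world")
out.bundle_directory     # where the bundle was saved
out.tokenizer            # tokenizer reloaded from the saved bundle
out.sanity_decoded_text  # decode from the sanity encode/decode pass
```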
Registry and Installation APIs
KeemenaSubwords.available_models — Function
List available built-in model names.
KeemenaSubwords.describe_model — Function
Describe a built-in model.
KeemenaSubwords.model_path — Function
Resolve built-in model name to on-disk path.
KeemenaSubwords.prefetch_models — Function
Ensure artifact-backed built-in models are present on disk.
Returns a dictionary of key => is_available.
KeemenaSubwords.register_local_model! — Function
Register a local tokenizer path under a symbolic key and persist it in the cache registry.
Register local model files by explicit specification.
Register local model files from a FilesSpec.
KeemenaSubwords.install_model! — Function
Install an installable-gated tokenizer into the user cache and register it by key.
KeemenaSubwords.install_llama2_tokenizer! — Function
Install the gated LLaMA 2 tokenizer files into local cache and register them.
This is a convenience wrapper over install_model!(:llama2_tokenizer; ...).
KeemenaSubwords.install_llama3_8b_tokenizer! — Function
Install the gated LLaMA 3 8B tokenizer files into local cache and register them.
This is a convenience wrapper over install_model!(:llama3_8b_tokenizer; ...).
KeemenaSubwords.download_hf_files — Function
Download selected files from a Hugging Face repository revision into cache.
This helper is opt-in and useful for user-managed / gated tokenizers.
KeemenaSubwords.recommended_defaults_for_llms — Function
Recommended built-in keys for LLM-oriented default prefetching.
register_external_model! remains available as a deprecated compatibility alias; prefer register_local_model! in new code.
Full Exported API
KeemenaSubwords.AbstractSubwordTokenizer — Type
Abstract parent type for all subword tokenizers.
Tokenizers are callable and support: tokenizer(text::AbstractString) -> Vector{String}.
KeemenaSubwords.FilesSpec — Type
Structured file specification for local tokenizer loading/registration.
Use path for single-file formats and explicit pairs for multi-file formats.
KeemenaSubwords.SubwordVocabulary — Type
Vocabulary container with forward/reverse lookup and special token IDs.
IDs are 1-based.
KeemenaSubwords.TokenizationResult — Type
Structured tokenization output for downstream pipelines.
Offset contract:
- coordinate unit: UTF-8 codeunits.
- index base: 1.
- span style: half-open [start, stop).
- valid bounds for spanful tokens: 1 <= start <= stop <= ncodeunits(text) + 1.
- sentinel for tokens without source-text spans: (0, 0).
- inserted post-processor specials use sentinel offsets.
- present-in-text special added tokens keep real spans, and may still have special_tokens_mask[i] == 1.
special_tokens_mask marks special-token identity; offsets determine span participation.
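The contract can be exercised with the span helpers; in this sketch the model key and the exact argument order of try_span_substring are assumptions:

```julia
using KeemenaSubwords

tok  = load_tokenizer(:gpt2)                # illustrative model key
text = "hello world"
res  = encode_result(tok, text; return_offsets=true)

for off in res.offsets
    has_span(off) || continue               # skip (0, 0) sentinel offsets
    s = try_span_substring(text, off)       # nothing if not on string boundaries
    s === nothing || println(s)
end
```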
KeemenaSubwords.TokenizerMetadata — Type
Common metadata for tokenizer instances.
KeemenaSubwords.assert_offsets_contract — Method
Assert offsets satisfy the package offset contract.
Throws ArgumentError on first contract violation. With require_string_boundaries=true, non-empty spans must start/end on valid Julia string boundaries.
KeemenaSubwords.asset_status — Method
Return prefetch status for a single model key.
KeemenaSubwords.available_models — Method
List available built-in model names.
KeemenaSubwords.bos_id — Method
BOS token ID if available.
KeemenaSubwords.bos_id — Method
BOS token ID if available.
KeemenaSubwords.bos_id — Method
BOS token ID if available.
KeemenaSubwords.bos_id — Method
BOS token ID if available.
KeemenaSubwords.bos_id — Method
Return beginning-of-sequence token ID if available.
KeemenaSubwords.bos_id — Method
BOS token ID if available.
KeemenaSubwords.bos_id — Method
BOS token ID if available.
KeemenaSubwords.bos_id — Method
BOS token ID if available.
KeemenaSubwords.cached_tokenizers — Method
List cache keys for in-session cached tokenizers.
KeemenaSubwords.causal_lm_labels — Method
causal_lm_labels(ids, attention_mask; ignore_index=-100, zero_based=false) -> Matrix{Int}
Build next-token labels for causal language modeling from padded ids and attention_mask matrices shaped (sequence_length, batch_size).
For each sequence column:
- valid non-final positions receive the next valid token id,
- the final valid position receives ignore_index,
- padding positions receive ignore_index.
When zero_based=true, subtracts 1 from all non-ignored labels to support consumers that expect 0-based ids.
KeemenaSubwords.clear_tokenizer_cache! — Method
Clear the in-session tokenizer cache used by one-call convenience APIs.
KeemenaSubwords.collate_padded_batch — Method
collate_padded_batch(results; tokenizer=nothing, pad_token_id=nothing, kwargs...) -> NamedTuple
Collate a Vector{TokenizationResult} into dense (sequence_length, batch_size) matrices.
Returns:
ids::Matrix{Int}
attention_mask::Matrix{Int}
token_type_ids::Matrix{Int}
special_tokens_mask::Matrix{Int}
sequence_lengths::Vector{Int}
pad_token_id::Int
pad_side::Symbol
Padding behavior:
- ids are filled with pad_token_id.
- attention_mask uses 1 for valid tokens and 0 for padding.
- token_type_ids defaults to 0 where missing.
- special_tokens_mask defaults to 0 on valid tokens where missing and uses 1 on padding positions.
Pad token selection:
- If pad_token_id is provided, it is used directly.
- Otherwise pad_id(tokenizer) is used when available.
- Otherwise eos_id(tokenizer) is used when available.
- If none are available, throws an ArgumentError.
Optional keyword arguments:
pad_to_multiple_of::Union{Nothing,Int}=nothing
pad_side::Symbol=:right (only right padding is currently supported)
KeemenaSubwords.decode — Method
One-call decode by tokenizer path/directory.
KeemenaSubwords.decode — Method
Decode token IDs into text.
KeemenaSubwords.decode — Method
Decode token IDs to text.
KeemenaSubwords.decode — Method
Decode byte-level BPE IDs back to text.
KeemenaSubwords.decode — Method
Decode SentencePiece IDs back to text.
KeemenaSubwords.decode — Method
One-call decode by model key.
KeemenaSubwords.decode — Method
Decode tiktoken rank IDs to text.
KeemenaSubwords.decode — Method
Decode unigram token IDs back to text.
KeemenaSubwords.decode — Method
Decode WordPiece token IDs back into text.
KeemenaSubwords.describe_model — Method
Describe a built-in model.
KeemenaSubwords.detect_tokenizer_files — Method
Inspect a tokenizer directory and return detected candidate files.
Example: detect_tokenizer_files("/path/to/model_dir")
KeemenaSubwords.detect_tokenizer_format — Method
Detect tokenizer format from a local file or directory.
Returns one of symbols such as :hf_tokenizer_json, :bpe_gpt2, :bpe_encoder, :sentencepiece_model, :tiktoken, :wordpiece, :bpe, or :unigram.
Examples:
detect_tokenizer_format("/path/to/model_dir")
detect_tokenizer_format("/path/to/tokenizer.model")
KeemenaSubwords.download_hf_files — Method
Download selected files from a Hugging Face repository revision into cache.
This helper is opt-in and useful for user-managed / gated tokenizers.
KeemenaSubwords.encode — Method
One-call encode by tokenizer path/directory.
KeemenaSubwords.encode — Method
Encode text into token IDs.
KeemenaSubwords.encode — Method
Encode text to token IDs.
KeemenaSubwords.encode — Method
Encode text to byte-level BPE IDs.
KeemenaSubwords.encode — Method
Encode text to SentencePiece IDs.
KeemenaSubwords.encode — Method
One-call encode by model key.
KeemenaSubwords.encode — Method
Encode text into tiktoken rank IDs (1-based in this package).
KeemenaSubwords.encode — Method
Encode text to unigram token IDs.
KeemenaSubwords.encode — Method
Encode text to WordPiece token IDs.
KeemenaSubwords.encode_batch_result — Method
Batch variant of encode_result.
KeemenaSubwords.encode_result — Function
Encode text and return a structured TokenizationResult.
Key keyword arguments:
- assume_normalized::Bool=false: when true, tokenizer intrinsic normalization is skipped and offsets are computed against the exact provided text.
- return_offsets::Bool=false: include token-level offsets when available.
- return_masks::Bool=false: include attention/token-type/special-token masks.
Offset note:
- Offsets use the package-wide 1-based UTF-8 codeunit half-open convention.
- assume_normalized changes whether intrinsic normalization runs; it does not change the offset coordinate system.
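A sketch of the recommended offsets pipeline with encode_result (model key illustrative):

```julia
using KeemenaSubwords

tok  = load_tokenizer(:gpt2)                   # illustrative model key
text = tokenization_view(tok, "Héllo  world")  # tokenizer-coordinate text
res  = encode_result(tok, text; assume_normalized=true,
                     return_offsets=true, return_masks=true)
# Offsets in res.offsets are 1-based UTF-8 codeunit half-open spans into `text`.
```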
KeemenaSubwords.encode_result — Method
One-call structured encode by tokenizer path/directory.
KeemenaSubwords.encode_result — Method
One-call structured encode by model key.
KeemenaSubwords.eos_id — Method
EOS token ID if available.
KeemenaSubwords.eos_id — Method
EOS token ID if available.
KeemenaSubwords.eos_id — Method
EOS token ID if available.
KeemenaSubwords.eos_id — Method
EOS token ID if available.
KeemenaSubwords.eos_id — Method
Return end-of-sequence token ID if available.
KeemenaSubwords.eos_id — Method
EOS token ID if available.
KeemenaSubwords.eos_id — Method
EOS token ID if available.
KeemenaSubwords.eos_id — Method
EOS token ID if available.
KeemenaSubwords.export_tokenizer — Method
Export tokenizer to external formats.
Supported format values:
:internal, :bpe / :bpe_gpt2, :wordpiece_vocab, :unigram_tsv, :sentencepiece_model, :hf_tokenizer_json
KeemenaSubwords.get_tokenizer_cached — Method
Return a cached tokenizer for a model key or path, loading and caching on first use.
KeemenaSubwords.has_nonempty_span — Method
Return true when an offset carries a non-empty source-text span.
KeemenaSubwords.has_span — Method
Return true when an offset carries a real source-text span.
KeemenaSubwords.id_to_token — Method
Reverse token lookup.
KeemenaSubwords.id_to_token — Method
Reverse token lookup.
KeemenaSubwords.id_to_token — Method
Reverse token lookup.
KeemenaSubwords.id_to_token — Method
Reverse token lookup.
KeemenaSubwords.id_to_token — Method
Map ID to token string.
KeemenaSubwords.id_to_token — Method
Reverse token lookup.
KeemenaSubwords.id_to_token — Method
Reverse token lookup.
KeemenaSubwords.id_to_token — Method
Reverse token lookup.
KeemenaSubwords.install_llama2_tokenizer! — Method
Install the gated LLaMA 2 tokenizer files into local cache and register them.
This is a convenience wrapper over install_model!(:llama2_tokenizer; ...).
KeemenaSubwords.install_llama3_8b_tokenizer! — Method
Install the gated LLaMA 3 8B tokenizer files into local cache and register them.
This is a convenience wrapper over install_model!(:llama3_8b_tokenizer; ...).
KeemenaSubwords.install_model! — Method
Install an installable-gated tokenizer into the user cache and register it by key.
KeemenaSubwords.is_valid_string_boundary — Method
Return whether idx is a valid Julia string boundary for text.
This includes the exclusive end boundary ncodeunits(text) + 1.
KeemenaSubwords.keemena_callable — Method
Return a function compatible with KeemenaPreprocessing's callable tokenizer contract.
KeemenaSubwords.level_key — Method
Level key used by KeemenaPreprocessing for callable tokenizers.
KeemenaSubwords.load_bpe — Method
Load a BPE tokenizer from explicit vocab + merges paths.
KeemenaSubwords.load_bpe — Method
Load a BPE tokenizer from either a directory (vocab.txt + merges.txt) or a vocab file path.
KeemenaSubwords.load_bpe_encoder — Method
Load GPT-2 encoder variant from encoder.json + vocab.bpe.
Example: load_bpe_encoder("/path/to/encoder.json", "/path/to/vocab.bpe")
KeemenaSubwords.load_bpe_gpt2 — Method
Load GPT-2 / RoBERTa style BPE from vocab.json + merges.txt.
Example: load_bpe_gpt2("/path/to/vocab.json", "/path/to/merges.txt")
KeemenaSubwords.load_bytebpe — Method
Load a byte-level BPE tokenizer from explicit vocab + merges paths.
KeemenaSubwords.load_bytebpe — Method
Load a byte-level BPE tokenizer from a directory (vocab.txt + merges.txt) or vocab path.
KeemenaSubwords.load_hf_tokenizer_json — Method
Load a Hugging Face tokenizer.json tokenizer in pure Julia.
Expected files:
- tokenizer.json directly, or
- a directory containing tokenizer.json.
Examples:
load_hf_tokenizer_json("/path/to/tokenizer.json")
load_hf_tokenizer_json("/path/to/model_dir")
KeemenaSubwords.load_sentencepiece — Method
Load a SentencePiece .model file.
Supported inputs:
- standard SentencePiece binary protobuf .model / .model.v3 payloads
- Keemena text-exported model files:
  - key/value lines (type=unigram|bpe, whitespace_marker=▁, unk_token=<unk>)
  - piece rows: piece<TAB>token<TAB>score[<TAB>special_symbol]
  - bpe merge rows (for type=bpe): merge<TAB>left<TAB>right
Examples:
load_sentencepiece("/path/to/tokenizer.model"; kind=:auto)
load_sentencepiece("/path/to/tokenizer.model.v3"; kind=:bpe)
KeemenaSubwords.load_tiktoken — Method
Load a tiktoken encoding file (*.tiktoken).
The expected format is line-based: <base64_token_bytes><space><rank> where ranks are non-negative integers.
Examples:
load_tiktoken("/path/to/o200k_base.tiktoken")
load_tiktoken("/path/to/tokenizer.model") (when the file contains tiktoken text lines)
KeemenaSubwords.load_tokenizer — Method
Load tokenizer from file system path.
Common format contracts:
- :hf_tokenizer_json -> tokenizer.json
- :bpe_gpt2 -> vocab.json + merges.txt
- :bpe_encoder -> encoder.json + vocab.bpe
- :wordpiece / :wordpiece_vocab -> vocab.txt
- :sentencepiece_model -> *.model / *.model.v3 / sentencepiece.bpe.model
- :tiktoken -> *.tiktoken or tiktoken-text tokenizer.model
Examples:
load_tokenizer("/path/to/model_dir")
load_tokenizer("/path/to/tokenizer.model"; format=:tiktoken)
load_tokenizer("/path/to/tokenizer.json"; format=:hf_tokenizer_json)
KeemenaSubwords.load_tokenizer — Method
Load tokenizer from a FilesSpec.
KeemenaSubwords.load_tokenizer — Method
Load tokenizer from a named specification.
Examples:
(format=:wordpiece, path="/.../vocab.txt")
(format=:hf_tokenizer_json, path="/.../tokenizer.json")
(format=:unigram, path="/.../unigram.tsv")
(format=:bpe_gpt2, vocab_json="/.../vocab.json", merges_txt="/.../merges.txt")
(format=:bpe_encoder, encoder_json="/.../encoder.json", vocab_bpe="/.../vocab.bpe")
(format=:wordpiece, vocab_txt="/.../vocab.txt") (alias)
(format=:sentencepiece_model, model_file="/.../tokenizer.model") (alias)
(format=:tiktoken, encoding_file="/.../o200k_base.tiktoken") (alias)
(format=:hf_tokenizer_json, tokenizer_json="/.../tokenizer.json") (alias)
(format=:unigram, unigram_tsv="/.../unigram.tsv") (alias)
KeemenaSubwords.load_tokenizer — Method
Load tokenizer by built-in model name.
KeemenaSubwords.load_tokenizer — Method
Load tokenizer from explicit (vocab_path, merges_path) tuple.
This tuple form is for classic BPE/byte-level BPE (vocab.txt + merges.txt) or explicit JSON-pair loaders (vocab.json + merges.txt, encoder.json + vocab.bpe) when accompanied by format.
KeemenaSubwords.load_unigram — Method
Load a Unigram tokenizer from unigram.tsv (file or directory).
Expected format (tab-separated): token<TAB>score[<TAB>special_symbol]
KeemenaSubwords.load_wordpiece — Method
Load a WordPiece tokenizer from a vocab file path or a directory containing vocab.txt.
Examples:
load_wordpiece("/path/to/vocab.txt")
load_wordpiece("/path/to/model_dir")
KeemenaSubwords.model_info — Method
Return model metadata.
KeemenaSubwords.model_info — Method
Tokenizer metadata.
KeemenaSubwords.model_info — Method
Tokenizer metadata.
KeemenaSubwords.model_info — Method
Tokenizer metadata.
KeemenaSubwords.model_info — Method
Tokenizer metadata.
KeemenaSubwords.model_info — Method
Tokenizer metadata.
KeemenaSubwords.model_info — Method
Tokenizer metadata.
KeemenaSubwords.model_path — Method
Resolve built-in model name to on-disk path.
KeemenaSubwords.normalize — Method
Return tokenizer intrinsic normalization output.
This does not perform pipeline-level preprocessing. Tokenizers without intrinsic normalization return text unchanged.
KeemenaSubwords.normalize_text — Method
Normalize text using an optional user-provided callable.
KeemenaSubwords.offsets_are_nonoverlapping — Method
Return whether participating offsets are non-overlapping in sequence order.
Participating offsets satisfy:
- not sentinel when ignore_sentinel=true
- not empty when ignore_empty=true
For participating offsets, this enforces next.start >= prev.stop.
KeemenaSubwords.offsets_coordinate_system — Method
Offset coordinate system for TokenizationResult.offsets.
Offsets are UTF-8 codeunit indices with half-open span convention [start, stop).
KeemenaSubwords.offsets_index_base — Method
Offset index base for TokenizationResult.offsets.
Offsets are 1-based codeunit indices.
KeemenaSubwords.offsets_sentinel — Method
Sentinel used for tokens without a source-text span.
KeemenaSubwords.offsets_span_style — Method
Offset span style.
TokenizationResult.offsets use half-open spans [start, stop).
KeemenaSubwords.pad_id — Method
Padding token ID if available.
KeemenaSubwords.pad_id — Method
Padding token ID if available.
KeemenaSubwords.pad_id — Method
Padding token ID if available.
KeemenaSubwords.pad_id — Method
Padding token ID if available.
KeemenaSubwords.pad_id — Method
Return padding-token ID if available.
KeemenaSubwords.pad_id — Method
Padding token ID if available.
KeemenaSubwords.pad_id — Method
Padding token ID if available.
KeemenaSubwords.pad_id — Method
Padding token ID if available.
KeemenaSubwords.prefetch_models — Function
Ensure artifact-backed built-in models are present on disk.
Returns a dictionary of key => is_available.
KeemenaSubwords.prefetch_models_status — Function
Return detailed prefetch status for built-in model keys.
Each value includes:
available::Bool
method::Symbol (:artifact, :fallback_download, :already_present, or :failed)
path::Union{Nothing,String}
error::Union{Nothing,String}
KeemenaSubwords.print_asset_status — Method
Print a compact prefetch status line for one model key.
KeemenaSubwords.quick_causal_lm_batch — Method
quick_causal_lm_batch(tokenizer_or_source, input_texts; kwargs...) -> NamedTuple
One-call helper for training-ready causal LM tensors.
Pipeline:
quick_encode_batch(...; return_masks=true)
collate_padded_batch(...)
causal_lm_labels(...)
Keyword arguments:
add_special_tokens::Bool=true
apply_tokenization_view::Bool=true
return_offsets::Bool=false
pad_token_id::Union{Nothing,Int}=nothing
pad_to_multiple_of::Union{Nothing,Int}=nothing
pad_side::Symbol=:right
ignore_index::Int=-100
zero_based::Bool=false
format::Union{Nothing,Symbol}=nothing (source overloads only)
prefetch::Bool=true (source overloads only)
Returns a NamedTuple with keys:
ids, attention_mask, labels, token_type_ids, special_tokens_mask, tokenization_texts, sequence_lengths, pad_token_id, ignore_index, zero_based
KeemenaSubwords.quick_encode_batch — Method
quick_encode_batch(tokenizer_or_source, input_texts; kwargs...) -> NamedTuple
High-level wrapper for batch structured encoding.
By default, each input text is first converted with tokenization_view so offsets and alignment metadata are anchored to tokenizer-coordinate text.
Keyword arguments:
add_special_tokens::Bool=true
apply_tokenization_view::Bool=true
return_offsets::Bool=true
return_masks::Bool=true
format::Union{Nothing,Symbol}=nothing (source overloads only)
prefetch::Bool=true (source overloads only)
Returns a NamedTuple with keys:
tokenization_texts, results, sequence_lengths, metadata
KeemenaSubwords.quick_tokenize — Method
quick_tokenize(tokenizer_or_source, input_text; kwargs...) -> NamedTuple
High-level one-call wrapper for common single-text tokenization workflows.
This helper applies the recommended offsets pipeline by default:
tokenization_text = tokenization_view(tokenizer, input_text)
encode_result(tokenizer, tokenization_text; assume_normalized=true, ...)
Supported inputs:
quick_tokenize(tokenizer::AbstractSubwordTokenizer, input_text; ...)
quick_tokenize(source::Symbol, input_text; format=nothing, prefetch=true, ...)
quick_tokenize(source::AbstractString, input_text; format=nothing, prefetch=true, ...)
Keyword arguments:
add_special_tokens::Bool=true
apply_tokenization_view::Bool=true
return_offsets::Bool=true
return_masks::Bool=true
Returns a NamedTuple with keys:
token_pieces, token_ids, decoded_text, tokenization_text, offsets, attention_mask, token_type_ids, special_tokens_mask, metadata
KeemenaSubwords.quick_train_bundle — Method
quick_train_bundle(trainer, corpus; kwargs...) -> NamedTuple
High-level training round-trip helper:
- Train with a selected trainer.
- Save a training bundle with save_training_bundle.
- Reload with load_training_bundle.
- Run a sanity encode/decode pass.
Supported trainer symbols:
:wordpiece, :hf_bert_wordpiece, :hf_roberta_bytebpe, :hf_gpt2_bytebpe
Convenience overload:
quick_train_bundle(corpus; kwargs...) defaults to trainer=:wordpiece.
Keyword arguments:
bundle_directory::Union{Nothing,AbstractString}=nothing
overwrite::Bool=true
export_format::Symbol=:auto
sanity_text::AbstractString="hello world"
- plus trainer-specific keywords forwarded to the selected train_*_result.
Returns a NamedTuple with keys:
bundle_directory, bundle_files, training_summary, tokenizer, sanity_encoded_ids, sanity_decoded_text
KeemenaSubwords.recommended_defaults_for_llms — Method
Recommended built-in keys for LLM-oriented default prefetching.
KeemenaSubwords.register_external_model! — Method
Deprecated alias kept for compatibility. Use register_local_model! instead.
KeemenaSubwords.register_local_model! — Method
Register a local tokenizer path under a symbolic key and persist it in the cache registry.
KeemenaSubwords.register_local_model! — Method
Register local model files from a FilesSpec.
KeemenaSubwords.register_local_model! — Method
Register local model files by explicit specification.
KeemenaSubwords.requires_tokenizer_normalization — Method
Whether this tokenizer defines intrinsic normalization that can change text.
KeemenaSubwords.save_tokenizer — Method
Save tokenizer to a canonical on-disk format.
format=:internal chooses a tokenizer-family specific default:
WordPieceTokenizer -> vocab.txt
BPETokenizer / ByteBPETokenizer -> vocab.txt + merges.txt
UnigramTokenizer -> unigram.tsv
SentencePieceTokenizer -> spm.model
KeemenaSubwords.span_codeunits — Method
Return the offset span as UTF-8 codeunits.
Sentinel and empty spans return UInt8[]. Invalid or out-of-bounds spans also return UInt8[] to keep this helper non-throwing for downstream inspection.
KeemenaSubwords.span_ncodeunits — Method
Return span length measured in UTF-8 codeunits.
Sentinel and empty spans return 0.
KeemenaSubwords.special_tokens — Method
Return special token IDs keyed by symbol.
KeemenaSubwords.special_tokens — Method
Special token IDs.
KeemenaSubwords.special_tokens — Method
Special token IDs.
KeemenaSubwords.special_tokens — Method
Special token IDs.
KeemenaSubwords.special_tokens — Method
Special token IDs.
KeemenaSubwords.special_tokens — Method
Special token IDs.
KeemenaSubwords.special_tokens — Method
Special token IDs.
KeemenaSubwords.token_to_id — Method
Forward token lookup.
KeemenaSubwords.token_to_id — Method
Forward token lookup.
KeemenaSubwords.token_to_id — Method
Forward token lookup.
KeemenaSubwords.token_to_id — Method
Forward token lookup.
KeemenaSubwords.token_to_id — Method
Map token string to ID, falling back to :unk.
KeemenaSubwords.token_to_id — Method
Forward token lookup.
KeemenaSubwords.token_to_id — Method
Forward token lookup.
KeemenaSubwords.token_to_id — Method
Forward token lookup.
KeemenaSubwords.tokenization_view — Method
Canonical tokenizer text view used for subword offsets/alignment.
KeemenaSubwords.tokenize — Method
One-call tokenize by tokenizer path/directory.
KeemenaSubwords.tokenize — Method
Tokenize text into subword pieces.
KeemenaSubwords.tokenize — Method
Tokenize with classic BPE merges.
KeemenaSubwords.tokenize — Method
Tokenize text by first mapping bytes to unicode symbols, then applying BPE merges.
KeemenaSubwords.tokenize — Method
Tokenize text with SentencePiece wrapper behavior.
KeemenaSubwords.tokenize — Method
One-call tokenize by model key.
KeemenaSubwords.tokenize — Method
Tokenize text into b64:<...> token pieces.
KeemenaSubwords.tokenize — Method
Tokenize text using deterministic Viterbi segmentation.
KeemenaSubwords.tokenize — Method
Greedy longest-match WordPiece tokenization.
KeemenaSubwords.try_span_substring — Method
Attempt to return a substring for a half-open codeunit span [start, stop).
Sentinel and empty spans return "". If span boundaries are not valid Julia string boundaries, this returns nothing. This helper never throws.
KeemenaSubwords.unk_id — Method
Unknown token ID.
KeemenaSubwords.unk_id — Method
Unknown token ID.
KeemenaSubwords.unk_id — Method
Unknown token ID.
KeemenaSubwords.unk_id — Method
Unknown token ID.
KeemenaSubwords.unk_id — Method
Return unknown-token ID.
KeemenaSubwords.unk_id — Method
Unknown token ID.
KeemenaSubwords.unk_id — Method
Unknown token ID.
KeemenaSubwords.unk_id — Method
Unknown token ID.
KeemenaSubwords.validate_offsets_contract — Method
Validate offsets against the package offset contract.
Returns true when all offsets satisfy bounds/sentinel invariants. With require_string_boundaries=true, non-empty spans must also start/end on valid Julia string boundaries.
KeemenaSubwords.vocab_size — Method
Vocabulary size.
KeemenaSubwords.vocab_size — Method
Vocabulary size.
KeemenaSubwords.vocab_size — Method
Vocabulary size.
KeemenaSubwords.vocab_size — Method
Vocabulary size.
KeemenaSubwords.vocab_size — Method
Vocabulary size.
KeemenaSubwords.vocab_size — Method
Vocabulary size.
KeemenaSubwords.vocab_size — Method
Vocabulary size.
KeemenaSubwords.vocab_size — Method
Vocabulary size.