API Reference
Explicit Loader APIs
KeemenaSubwords.load_bpe — Function
Load a BPE tokenizer from either a directory (vocab.txt + merges.txt) or a vocab file path.
Load a BPE tokenizer from explicit vocab + merges paths.
KeemenaSubwords.load_bytebpe — Function
Load a byte-level BPE tokenizer from a directory (vocab.txt + merges.txt) or vocab path.
Load a byte-level BPE tokenizer from explicit vocab + merges paths.
KeemenaSubwords.load_bpe_gpt2 — Function
Load GPT-2 / RoBERTa style BPE from vocab.json + merges.txt.
Example: load_bpe_gpt2("/path/to/vocab.json", "/path/to/merges.txt")
KeemenaSubwords.load_bpe_encoder — Function
Load GPT-2 encoder variant from encoder.json + vocab.bpe.
Example: load_bpe_encoder("/path/to/encoder.json", "/path/to/vocab.bpe")
KeemenaSubwords.load_unigram — Function
Load a Unigram tokenizer from unigram.tsv (file or directory).
Expected format (tab-separated): token<TAB>score[<TAB>special_symbol]
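As an illustrative sketch (the tokens, scores, and path below are made up), a minimal unigram.tsv and its loading could look like this:

```julia
using KeemenaSubwords

# Hypothetical unigram.tsv contents (token<TAB>score[<TAB>special_symbol]):
#   <unk>	0.0	unk
#   ▁the	-2.31
#   ing	-3.05
tok = load_unigram("/path/to/unigram.tsv")  # also accepts a directory containing unigram.tsv
pieces = tokenize(tok, "the thing")         # deterministic Viterbi segmentation into pieces
```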
KeemenaSubwords.load_wordpiece — Function
Load a WordPiece tokenizer from a vocab file path or a directory containing vocab.txt.
Examples:
load_wordpiece("/path/to/vocab.txt")
load_wordpiece("/path/to/model_dir")
KeemenaSubwords.load_sentencepiece — Function
Load a SentencePiece .model file.
Supported inputs:
- standard SentencePiece binary protobuf .model / .model.v3 payloads
- Keemena text-exported model files:
  - key/value lines (type=unigram|bpe, whitespace_marker=▁, unk_token=<unk>)
  - piece rows: piece<TAB>token<TAB>score[<TAB>special_symbol]
  - bpe merge rows (for type=bpe): merge<TAB>left<TAB>right
Examples:
load_sentencepiece("/path/to/tokenizer.model"; kind=:auto)
load_sentencepiece("/path/to/tokenizer.model.v3"; kind=:bpe)
KeemenaSubwords.load_tiktoken — Function
Load a tiktoken encoding file (*.tiktoken).
The expected format is line-based: <base64_token_bytes><space><rank> where ranks are non-negative integers.
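For illustration, a single line of this format can be decoded with nothing but the Julia standard library; the entry below is made up, and the package's own loader should be preferred in practice:

```julia
using Base64

line = "aGVsbG8= 31373"                  # hypothetical entry: Base64 of "hello", rank 31373
b64, rank_str = split(line, ' ')
token_bytes = base64decode(String(b64))  # raw UInt8 bytes of the token
rank = parse(Int, rank_str)              # non-negative integer rank
@assert String(token_bytes) == "hello"
```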
Examples:
load_tiktoken("/path/to/o200k_base.tiktoken")
load_tiktoken("/path/to/tokenizer.model") (when the file contains tiktoken text lines)
KeemenaSubwords.load_hf_tokenizer_json — Function
Load a Hugging Face tokenizer.json tokenizer in pure Julia.
Expected files:
- tokenizer.json directly, or
- a directory containing tokenizer.json.
Examples:
load_hf_tokenizer_json("/path/to/tokenizer.json")
load_hf_tokenizer_json("/path/to/model_dir")
KeemenaSubwords.load_tokenizer — Function
Load tokenizer by built-in model name.
Load tokenizer from file system path.
Common format contracts:
- :hf_tokenizer_json -> tokenizer.json
- :bpe_gpt2 -> vocab.json + merges.txt
- :bpe_encoder -> encoder.json + vocab.bpe
- :wordpiece / :wordpiece_vocab -> vocab.txt
- :sentencepiece_model -> *.model / *.model.v3 / sentencepiece.bpe.model
- :tiktoken -> *.tiktoken or tiktoken-text tokenizer.model
Examples:
load_tokenizer("/path/to/model_dir")
load_tokenizer("/path/to/tokenizer.model"; format=:tiktoken)
load_tokenizer("/path/to/tokenizer.json"; format=:hf_tokenizer_json)
Load tokenizer from explicit (vocab_path, merges_path) tuple.
This tuple form is for classic BPE/byte-level BPE (vocab.txt + merges.txt) or explicit JSON-pair loaders (vocab.json + merges.txt, encoder.json + vocab.bpe) when accompanied by format.
Load tokenizer from a named specification.
Examples:
(format=:wordpiece, path="/.../vocab.txt")
(format=:hf_tokenizer_json, path="/.../tokenizer.json")
(format=:unigram, path="/.../unigram.tsv")
(format=:bpe_gpt2, vocab_json="/.../vocab.json", merges_txt="/.../merges.txt")
(format=:bpe_encoder, encoder_json="/.../encoder.json", vocab_bpe="/.../vocab.bpe")
(format=:wordpiece, vocab_txt="/.../vocab.txt") (alias)
(format=:sentencepiece_model, model_file="/.../tokenizer.model") (alias)
(format=:tiktoken, encoding_file="/.../o200k_base.tiktoken") (alias)
(format=:hf_tokenizer_json, tokenizer_json="/.../tokenizer.json") (alias)
(format=:unigram, unigram_tsv="/.../unigram.tsv") (alias)
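A sketch of the named-specification form in use (all paths are placeholders):

```julia
using KeemenaSubwords

# Explicit two-file GPT-2 style BPE specification:
tok = load_tokenizer((format=:bpe_gpt2,
                      vocab_json="/path/to/vocab.json",
                      merges_txt="/path/to/merges.txt"))

# Single-path specification via an alias keyword:
wp = load_tokenizer((format=:wordpiece, vocab_txt="/path/to/vocab.txt"))
```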
Load tokenizer from a FilesSpec.
KeemenaSubwords.detect_tokenizer_format — Function
Detect tokenizer format from a local file or directory.
Returns one of symbols such as :hf_tokenizer_json, :bpe_gpt2, :bpe_encoder, :sentencepiece_model, :tiktoken, :wordpiece, :bpe, or :unigram.
Examples:
detect_tokenizer_format("/path/to/model_dir")
detect_tokenizer_format("/path/to/tokenizer.model")
KeemenaSubwords.detect_tokenizer_files — Function
Inspect a tokenizer directory and return detected candidate files.
Example: detect_tokenizer_files("/path/to/model_dir")
Structured encoding and file-spec APIs are also part of the public surface: TokenizationResult, FilesSpec, normalize, tokenization_view, requires_tokenizer_normalization, offsets_coordinate_system, offsets_index_base, offsets_span_style, offsets_sentinel, has_span, has_nonempty_span, span_ncodeunits, span_codeunits, is_valid_string_boundary, try_span_substring, offsets_are_nonoverlapping, validate_offsets_contract, assert_offsets_contract, encode_result, encode_batch_result.
Quick Handler APIs
KeemenaSubwords.quick_tokenize — Function
quick_tokenize(tokenizer_or_source, input_text; kwargs...) -> NamedTuple
High-level one-call wrapper for common single-text tokenization workflows.
This helper applies the recommended offsets pipeline by default:
tokenization_text = tokenization_view(tokenizer, input_text)
encode_result(tokenizer, tokenization_text; assume_normalized=true, ...)
Supported inputs:
quick_tokenize(tokenizer::AbstractSubwordTokenizer, input_text; ...)
quick_tokenize(source::Symbol, input_text; format=nothing, prefetch=true, ...)
quick_tokenize(source::AbstractString, input_text; format=nothing, prefetch=true, ...)
Keyword arguments:
add_special_tokens::Bool=true
apply_tokenization_view::Bool=true
return_offsets::Bool=true
return_masks::Bool=true
Returns a NamedTuple with keys:
token_pieces, token_ids, decoded_text, tokenization_text, offsets, attention_mask, token_type_ids, special_tokens_mask, metadata
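A usage sketch (the model key :gpt2 is illustrative; check available_models() for the actual built-in names):

```julia
using KeemenaSubwords

res = quick_tokenize(:gpt2, "Hello world")
res.token_pieces   # Vector{String} of subword pieces
res.token_ids      # corresponding 1-based token IDs
res.offsets        # half-open codeunit spans into res.tokenization_text
```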
KeemenaSubwords.quick_encode_batch — Function
quick_encode_batch(tokenizer_or_source, input_texts; kwargs...) -> NamedTuple
High-level wrapper for batch structured encoding.
By default, each input text is first converted with tokenization_view so offsets and alignment metadata are anchored to tokenizer-coordinate text.
Keyword arguments:
add_special_tokens::Bool=true
apply_tokenization_view::Bool=true
return_offsets::Bool=true
return_masks::Bool=true
format::Union{Nothing,Symbol}=nothing (source overloads only)
prefetch::Bool=true (source overloads only)
Returns a NamedTuple with keys:
tokenization_texts, results, sequence_lengths, metadata
KeemenaSubwords.collate_padded_batch — Function
collate_padded_batch(results; tokenizer=nothing, pad_token_id=nothing, kwargs...) -> NamedTuple
Collate a Vector{TokenizationResult} into dense (sequence_length, batch_size) matrices.
Returns:
ids::Matrix{Int}
attention_mask::Matrix{Int}
token_type_ids::Matrix{Int}
special_tokens_mask::Matrix{Int}
sequence_lengths::Vector{Int}
pad_token_id::Int
pad_side::Symbol
Padding behavior:
- ids are filled with pad_token_id.
- attention_mask uses 1 for valid tokens and 0 for padding.
- token_type_ids defaults to 0 where missing.
- special_tokens_mask defaults to 0 on valid tokens where missing and uses 1 on padding positions.
Pad token selection:
- If pad_token_id is provided, it is used directly.
- Otherwise pad_id(tokenizer) is used when available.
- Otherwise eos_id(tokenizer) is used when available.
- If none are available, throws an ArgumentError.
Optional keyword arguments:
pad_to_multiple_of::Union{Nothing,Int}=nothing
pad_side::Symbol=:right (only right padding is currently supported)
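A sketch of encoding a batch and collating it into dense matrices (the model key and pad_token_id are illustrative):

```julia
using KeemenaSubwords

batch  = quick_encode_batch(:gpt2, ["short text", "a somewhat longer input text"])
padded = collate_padded_batch(batch.results; pad_token_id=1)
padded.ids               # (sequence_length, batch_size) matrix, padded with 1
padded.attention_mask    # 1 on valid tokens, 0 on padding
padded.sequence_lengths  # per-sequence unpadded lengths
```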
KeemenaSubwords.causal_lm_labels — Function
causal_lm_labels(ids, attention_mask; ignore_index=-100, zero_based=false) -> Matrix{Int}
Build next-token labels for causal language modeling from padded ids and attention_mask matrices shaped (sequence_length, batch_size).
For each sequence column:
- valid non-final positions receive the next valid token id,
- the final valid position receives ignore_index,
- padding positions receive ignore_index.
When zero_based=true, subtracts 1 from all non-ignored labels to support consumers that expect 0-based ids.
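A toy illustration of these labeling rules, with hand-built (sequence_length, batch_size) matrices:

```julia
using KeemenaSubwords

ids            = [5 7; 6 8; 2 0]   # column 2 is padded at its last position
attention_mask = [1 1; 1 1; 1 0]

labels = causal_lm_labels(ids, attention_mask)
# Under the rules above (ignore_index = -100) this should give:
#   column 1: [6, 2, -100]    (shift by one; final valid position ignored)
#   column 2: [8, -100, -100] (final valid position and padding ignored)
```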
KeemenaSubwords.quick_causal_lm_batch — Function
quick_causal_lm_batch(tokenizer_or_source, input_texts; kwargs...) -> NamedTuple
One-call helper for training-ready causal LM tensors.
Pipeline:
quick_encode_batch(...; return_masks=true)
collate_padded_batch(...)
causal_lm_labels(...)
Keyword arguments:
add_special_tokens::Bool=true
apply_tokenization_view::Bool=true
return_offsets::Bool=false
pad_token_id::Union{Nothing,Int}=nothing
pad_to_multiple_of::Union{Nothing,Int}=nothing
pad_side::Symbol=:right
ignore_index::Int=-100
zero_based::Bool=false
format::Union{Nothing,Symbol}=nothing (source overloads only)
prefetch::Bool=true (source overloads only)
Returns a NamedTuple with keys:
ids, attention_mask, labels, token_type_ids, special_tokens_mask, tokenization_texts, sequence_lengths, pad_token_id, ignore_index, zero_based
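A one-call sketch (model key illustrative):

```julia
using KeemenaSubwords

out = quick_causal_lm_batch(:gpt2, ["first example", "a second, longer example"])
out.ids             # padded (sequence_length, batch_size) input ids
out.labels          # next-token labels, ignore_index (-100) on ignored positions
out.attention_mask  # 1 on valid tokens, 0 on padding
```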
KeemenaSubwords.quick_train_bundle — Function
quick_train_bundle(trainer, corpus; kwargs...) -> NamedTuple
High-level training round-trip helper:
- Train with a selected trainer.
- Save a training bundle with save_training_bundle.
- Reload with load_training_bundle.
- Run a sanity encode/decode pass.
Supported trainer symbols:
:wordpiece, :hf_bert_wordpiece, :hf_roberta_bytebpe, :hf_gpt2_bytebpe
Convenience overload:
quick_train_bundle(corpus; kwargs...) defaults to trainer=:wordpiece.
Keyword arguments:
bundle_directory::Union{Nothing,AbstractString}=nothing
overwrite::Bool=true
export_format::Symbol=:auto
sanity_text::AbstractString="hello world"
- plus trainer-specific keywords forwarded to the selected train_*_result.
Returns a NamedTuple with keys:
bundle_directory, bundle_files, training_summary, tokenizer, sanity_encoded_ids, sanity_decoded_text
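A minimal round-trip sketch (the corpus and keyword values are illustrative):

```julia
using KeemenaSubwords

corpus = ["hello world", "hello there", "worldly words"]
out = quick_train_bundle(:wordpiece, corpus; sanity_text="hello world")
out.bundle_directory     # where the bundle was saved
out.tokenizer            # tokenizer reloaded from the saved bundle
out.sanity_decoded_text  # decode from the sanity encode/decode pass
```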
Registry and Installation APIs
KeemenaSubwords.available_models — Function
List available built-in model names.
KeemenaSubwords.describe_model — Function
Describe a built-in model.
KeemenaSubwords.model_path — Function
Resolve built-in model name to on-disk path.
KeemenaSubwords.prefetch_models — Function
Ensure artifact-backed built-in models are present on disk.
Returns a dictionary of key => is_available.
KeemenaSubwords.register_local_model! — Function
Register a local tokenizer path under a symbolic key and persist it in the cache registry.
Register local model files by explicit specification.
Register local model files from a FilesSpec.
KeemenaSubwords.install_model! — Function
Install an installable-gated tokenizer into the user cache and register it by key.
KeemenaSubwords.install_llama2_tokenizer! — Function
Install the gated LLaMA 2 tokenizer files into local cache and register them.
This is a convenience wrapper over install_model!(:llama2_tokenizer; ...).
KeemenaSubwords.install_llama3_8b_tokenizer! — Function
Install the gated LLaMA 3 8B tokenizer files into local cache and register them.
This is a convenience wrapper over install_model!(:llama3_8b_tokenizer; ...).
KeemenaSubwords.download_hf_files — Function
Download selected files from a Hugging Face repository revision into cache.
This helper is opt-in and useful for user-managed / gated tokenizers.
KeemenaSubwords.recommended_defaults_for_llms — Function
Recommended built-in keys for LLM-oriented default prefetching.
register_external_model! remains available as a deprecated compatibility alias; prefer register_local_model! in new code.
Full Exported API
KeemenaSubwords.AbstractSubwordTokenizer — Type
Abstract parent type for all subword tokenizers.
Tokenizers are callable and support: tokenizer(text::AbstractString) -> Vector{String}.
KeemenaSubwords.FilesSpec — Type
Structured file specification for local tokenizer loading/registration.
Use path for single-file formats and explicit pairs for multi-file formats.
KeemenaSubwords.SubwordVocabulary — Type
Vocabulary container with forward/reverse lookup and special token IDs.
IDs are 1-based.
KeemenaSubwords.TokenizationResult — Type
Structured tokenization output for downstream pipelines.
Offset contract:
- coordinate unit: UTF-8 codeunits.
- index base: 1.
- span style: half-open [start, stop).
- valid bounds for spanful tokens: 1 <= start <= stop <= ncodeunits(text) + 1.
- sentinel for tokens without source-text spans: (0, 0).
- inserted post-processor specials use sentinel offsets.
- present-in-text special added tokens keep real spans, and may still have special_tokens_mask[i] == 1.
special_tokens_mask marks special-token identity; offsets determine span participation.
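The contract can be exercised with the span helpers; in this sketch the model key and the exact argument order of try_span_substring are assumptions:

```julia
using KeemenaSubwords

tok  = load_tokenizer(:gpt2)                # illustrative model key
text = "hello world"
res  = encode_result(tok, text; return_offsets=true)

for off in res.offsets
    has_span(off) || continue               # skip (0, 0) sentinel offsets
    s = try_span_substring(text, off)       # nothing if not on string boundaries
    s === nothing || println(s)
end
```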
KeemenaSubwords.TokenizerMetadata — Type
Common metadata for tokenizer instances.
KeemenaSubwords.assert_offsets_contract — Method
Assert offsets satisfy the package offset contract.
Throws ArgumentError on first contract violation. With require_string_boundaries=true, non-empty spans must start/end on valid Julia string boundaries.
KeemenaSubwords.asset_status — Method
Return prefetch status for a single model key.
KeemenaSubwords.available_models — Method
List available built-in model names.
KeemenaSubwords.bos_id — Method
BOS token ID if available.
KeemenaSubwords.bos_id — Method
BOS token ID if available.
KeemenaSubwords.bos_id — Method
BOS token ID if available.
KeemenaSubwords.bos_id — Method
BOS token ID if available.
KeemenaSubwords.bos_id — Method
Return beginning-of-sequence token ID if available.
KeemenaSubwords.bos_id — Method
BOS token ID if available.
KeemenaSubwords.bos_id — Method
BOS token ID if available.
KeemenaSubwords.bos_id — Method
BOS token ID if available.
KeemenaSubwords.cached_tokenizers — Method
List cache keys for in-session cached tokenizers.
KeemenaSubwords.causal_lm_labels — Method
causal_lm_labels(ids, attention_mask; ignore_index=-100, zero_based=false) -> Matrix{Int}
Build next-token labels for causal language modeling from padded ids and attention_mask matrices shaped (sequence_length, batch_size).
For each sequence column:
- valid non-final positions receive the next valid token id,
- the final valid position receives ignore_index,
- padding positions receive ignore_index.
When zero_based=true, subtracts 1 from all non-ignored labels to support consumers that expect 0-based ids.
KeemenaSubwords.clear_tokenizer_cache! — Method
Clear the in-session tokenizer cache used by one-call convenience APIs.
KeemenaSubwords.collate_padded_batch — Method
collate_padded_batch(results; tokenizer=nothing, pad_token_id=nothing, kwargs...) -> NamedTuple
Collate a Vector{TokenizationResult} into dense (sequence_length, batch_size) matrices.
Returns:
ids::Matrix{Int}
attention_mask::Matrix{Int}
token_type_ids::Matrix{Int}
special_tokens_mask::Matrix{Int}
sequence_lengths::Vector{Int}
pad_token_id::Int
pad_side::Symbol
Padding behavior:
- ids are filled with pad_token_id.
- attention_mask uses 1 for valid tokens and 0 for padding.
- token_type_ids defaults to 0 where missing.
- special_tokens_mask defaults to 0 on valid tokens where missing and uses 1 on padding positions.
Pad token selection:
- If pad_token_id is provided, it is used directly.
- Otherwise pad_id(tokenizer) is used when available.
- Otherwise eos_id(tokenizer) is used when available.
- If none are available, throws an ArgumentError.
Optional keyword arguments:
pad_to_multiple_of::Union{Nothing,Int}=nothing
pad_side::Symbol=:right (only right padding is currently supported)
KeemenaSubwords.decode — Method
One-call decode by tokenizer path/directory.
KeemenaSubwords.decode — Method
Decode token IDs into text.
KeemenaSubwords.decode — Method
Decode token IDs to text.
KeemenaSubwords.decode — Method
Decode byte-level BPE IDs back to text.
KeemenaSubwords.decode — Method
Decode SentencePiece IDs back to text.
KeemenaSubwords.decode — Method
One-call decode by model key.
KeemenaSubwords.decode — Method
Decode tiktoken rank IDs to text.
KeemenaSubwords.decode — Method
Decode unigram token IDs back to text.
KeemenaSubwords.decode — Method
Decode WordPiece token IDs back into text.
KeemenaSubwords.describe_model — Method
Describe a built-in model.
KeemenaSubwords.detect_tokenizer_files — Method
Inspect a tokenizer directory and return detected candidate files.
Example: detect_tokenizer_files("/path/to/model_dir")
KeemenaSubwords.detect_tokenizer_format — Method
Detect tokenizer format from a local file or directory.
Returns one of symbols such as :hf_tokenizer_json, :bpe_gpt2, :bpe_encoder, :sentencepiece_model, :tiktoken, :wordpiece, :bpe, or :unigram.
Examples:
detect_tokenizer_format("/path/to/model_dir")
detect_tokenizer_format("/path/to/tokenizer.model")
KeemenaSubwords.download_hf_files — Method
Download selected files from a Hugging Face repository revision into cache.
This helper is opt-in and useful for user-managed / gated tokenizers.
KeemenaSubwords.encode — Method
One-call encode by tokenizer path/directory.
KeemenaSubwords.encode — Method
Encode text into token IDs.
KeemenaSubwords.encode — Method
Encode text to token IDs.
KeemenaSubwords.encode — Method
Encode text to byte-level BPE IDs.
KeemenaSubwords.encode — Method
Encode text to SentencePiece IDs.
KeemenaSubwords.encode — Method
One-call encode by model key.
KeemenaSubwords.encode — Method
Encode text into tiktoken rank IDs (1-based in this package).
KeemenaSubwords.encode — Method
Encode text to unigram token IDs.
KeemenaSubwords.encode — Method
Encode text to WordPiece token IDs.
KeemenaSubwords.encode_batch_result — Method
Batch variant of encode_result.
KeemenaSubwords.encode_result — Function
Encode text and return a structured TokenizationResult.
Key keyword arguments:
- assume_normalized::Bool=false: when true, tokenizer intrinsic normalization is skipped and offsets are computed against the exact provided text.
- return_offsets::Bool=false: include token-level offsets when available.
- return_masks::Bool=false: include attention/token-type/special-token masks.
Offset note:
- Offsets use the package-wide 1-based UTF-8 codeunit half-open convention.
- assume_normalized changes whether intrinsic normalization runs; it does not change the offset coordinate system.
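A sketch of the recommended offsets pipeline with encode_result (model key illustrative):

```julia
using KeemenaSubwords

tok  = load_tokenizer(:gpt2)                   # illustrative model key
text = tokenization_view(tok, "Héllo  world")  # tokenizer-coordinate text
res  = encode_result(tok, text; assume_normalized=true,
                     return_offsets=true, return_masks=true)
# Offsets in res.offsets are 1-based UTF-8 codeunit half-open spans into `text`.
```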
KeemenaSubwords.encode_result — Method
One-call structured encode by tokenizer path/directory.
KeemenaSubwords.encode_result — Method
One-call structured encode by model key.
KeemenaSubwords.eos_id — Method
EOS token ID if available.
KeemenaSubwords.eos_id — Method
EOS token ID if available.
KeemenaSubwords.eos_id — Method
EOS token ID if available.
KeemenaSubwords.eos_id — Method
EOS token ID if available.
KeemenaSubwords.eos_id — Method
Return end-of-sequence token ID if available.
KeemenaSubwords.eos_id — Method
EOS token ID if available.
KeemenaSubwords.eos_id — Method
EOS token ID if available.
KeemenaSubwords.eos_id — Method
EOS token ID if available.
KeemenaSubwords.export_tokenizer — Method
Export tokenizer to external formats.
Supported format values:
:internal, :bpe / :bpe_gpt2, :wordpiece_vocab, :unigram_tsv, :sentencepiece_model, :hf_tokenizer_json
KeemenaSubwords.get_tokenizer_cached — Method
Return a cached tokenizer for a model key or path, loading and caching on first use.
KeemenaSubwords.has_nonempty_span — Method
Return true when an offset carries a non-empty source-text span.
KeemenaSubwords.has_span — Method
Return true when an offset carries a real source-text span.
KeemenaSubwords.id_to_token — Method
Reverse token lookup.
KeemenaSubwords.id_to_token — Method
Reverse token lookup.
KeemenaSubwords.id_to_token — Method
Reverse token lookup.
KeemenaSubwords.id_to_token — Method
Reverse token lookup.
KeemenaSubwords.id_to_token — Method
Map ID to token string.
KeemenaSubwords.id_to_token — Method
Reverse token lookup.
KeemenaSubwords.id_to_token — Method
Reverse token lookup.
KeemenaSubwords.id_to_token — Method
Reverse token lookup.
KeemenaSubwords.install_llama2_tokenizer! — Method
Install the gated LLaMA 2 tokenizer files into local cache and register them.
This is a convenience wrapper over install_model!(:llama2_tokenizer; ...).
KeemenaSubwords.install_llama3_8b_tokenizer! — Method
Install the gated LLaMA 3 8B tokenizer files into local cache and register them.
This is a convenience wrapper over install_model!(:llama3_8b_tokenizer; ...).
KeemenaSubwords.install_model! — Method
Install an installable-gated tokenizer into the user cache and register it by key.
KeemenaSubwords.is_valid_string_boundary — Method
Return whether idx is a valid Julia string boundary for text.
This includes the exclusive end boundary ncodeunits(text) + 1.
KeemenaSubwords.keemena_callable — Method
Return a function compatible with KeemenaPreprocessing's callable tokenizer contract.
KeemenaSubwords.level_key — Method
Level key used by KeemenaPreprocessing for callable tokenizers.
KeemenaSubwords.load_bpe — Method
Load a BPE tokenizer from explicit vocab + merges paths.
KeemenaSubwords.load_bpe — Method
Load a BPE tokenizer from either a directory (vocab.txt + merges.txt) or a vocab file path.
KeemenaSubwords.load_bpe_encoder — Method
Load GPT-2 encoder variant from encoder.json + vocab.bpe.
Example: load_bpe_encoder("/path/to/encoder.json", "/path/to/vocab.bpe")
KeemenaSubwords.load_bpe_gpt2 — Method
Load GPT-2 / RoBERTa style BPE from vocab.json + merges.txt.
Example: load_bpe_gpt2("/path/to/vocab.json", "/path/to/merges.txt")
KeemenaSubwords.load_bytebpe — Method
Load a byte-level BPE tokenizer from explicit vocab + merges paths.
KeemenaSubwords.load_bytebpe — Method
Load a byte-level BPE tokenizer from a directory (vocab.txt + merges.txt) or vocab path.
KeemenaSubwords.load_hf_tokenizer_json — Method
Load a Hugging Face tokenizer.json tokenizer in pure Julia.
Expected files:
- tokenizer.json directly, or
- a directory containing tokenizer.json.
Examples:
load_hf_tokenizer_json("/path/to/tokenizer.json")
load_hf_tokenizer_json("/path/to/model_dir")
KeemenaSubwords.load_sentencepiece — Method
Load a SentencePiece .model file.
Supported inputs:
- standard SentencePiece binary protobuf .model / .model.v3 payloads
- Keemena text-exported model files:
  - key/value lines (type=unigram|bpe, whitespace_marker=▁, unk_token=<unk>)
  - piece rows: piece<TAB>token<TAB>score[<TAB>special_symbol]
  - bpe merge rows (for type=bpe): merge<TAB>left<TAB>right
Examples:
load_sentencepiece("/path/to/tokenizer.model"; kind=:auto)
load_sentencepiece("/path/to/tokenizer.model.v3"; kind=:bpe)
KeemenaSubwords.load_tiktoken — Method
Load a tiktoken encoding file (*.tiktoken).
The expected format is line-based: <base64_token_bytes><space><rank> where ranks are non-negative integers.
Examples:
load_tiktoken("/path/to/o200k_base.tiktoken")
load_tiktoken("/path/to/tokenizer.model") (when the file contains tiktoken text lines)
KeemenaSubwords.load_tokenizer — Method
Load tokenizer from file system path.
Common format contracts:
- :hf_tokenizer_json -> tokenizer.json
- :bpe_gpt2 -> vocab.json + merges.txt
- :bpe_encoder -> encoder.json + vocab.bpe
- :wordpiece / :wordpiece_vocab -> vocab.txt
- :sentencepiece_model -> *.model / *.model.v3 / sentencepiece.bpe.model
- :tiktoken -> *.tiktoken or tiktoken-text tokenizer.model
Examples:
load_tokenizer("/path/to/model_dir")
load_tokenizer("/path/to/tokenizer.model"; format=:tiktoken)
load_tokenizer("/path/to/tokenizer.json"; format=:hf_tokenizer_json)
KeemenaSubwords.load_tokenizer — Method
Load tokenizer from a FilesSpec.
KeemenaSubwords.load_tokenizer — Method
Load tokenizer from a named specification.
Examples:
(format=:wordpiece, path="/.../vocab.txt")
(format=:hf_tokenizer_json, path="/.../tokenizer.json")
(format=:unigram, path="/.../unigram.tsv")
(format=:bpe_gpt2, vocab_json="/.../vocab.json", merges_txt="/.../merges.txt")
(format=:bpe_encoder, encoder_json="/.../encoder.json", vocab_bpe="/.../vocab.bpe")
(format=:wordpiece, vocab_txt="/.../vocab.txt") (alias)
(format=:sentencepiece_model, model_file="/.../tokenizer.model") (alias)
(format=:tiktoken, encoding_file="/.../o200k_base.tiktoken") (alias)
(format=:hf_tokenizer_json, tokenizer_json="/.../tokenizer.json") (alias)
(format=:unigram, unigram_tsv="/.../unigram.tsv") (alias)
KeemenaSubwords.load_tokenizer — Method
Load tokenizer by built-in model name.
KeemenaSubwords.load_tokenizer — Method
Load tokenizer from explicit (vocab_path, merges_path) tuple.
This tuple form is for classic BPE/byte-level BPE (vocab.txt + merges.txt) or explicit JSON-pair loaders (vocab.json + merges.txt, encoder.json + vocab.bpe) when accompanied by format.
KeemenaSubwords.load_unigram — Method
Load a Unigram tokenizer from unigram.tsv (file or directory).
Expected format (tab-separated): token<TAB>score[<TAB>special_symbol]
KeemenaSubwords.load_wordpiece — Method
Load a WordPiece tokenizer from a vocab file path or a directory containing vocab.txt.
Examples:
load_wordpiece("/path/to/vocab.txt")
load_wordpiece("/path/to/model_dir")
KeemenaSubwords.model_info — Method
Return model metadata.
KeemenaSubwords.model_info — Method
Tokenizer metadata.
KeemenaSubwords.model_info — Method
Tokenizer metadata.
KeemenaSubwords.model_info — Method
Tokenizer metadata.
KeemenaSubwords.model_info — Method
Tokenizer metadata.
KeemenaSubwords.model_info — Method
Tokenizer metadata.
KeemenaSubwords.model_info — Method
Tokenizer metadata.
KeemenaSubwords.model_path — Method
Resolve built-in model name to on-disk path.
KeemenaSubwords.normalize — Method
Return tokenizer intrinsic normalization output.
This does not perform pipeline-level preprocessing. Tokenizers without intrinsic normalization return text unchanged.
KeemenaSubwords.normalize_text — Method
Normalize text using an optional user-provided callable.
KeemenaSubwords.offsets_are_nonoverlapping — Method
Return whether participating offsets are non-overlapping in sequence order.
Participating offsets satisfy:
- not sentinel when ignore_sentinel=true
- not empty when ignore_empty=true
For participating offsets, this enforces next.start >= prev.stop.
KeemenaSubwords.offsets_coordinate_system — Method
Offset coordinate system for TokenizationResult.offsets.
Offsets are UTF-8 codeunit indices with half-open span convention [start, stop).
KeemenaSubwords.offsets_index_base — Method
Offset index base for TokenizationResult.offsets.
Offsets are 1-based codeunit indices.
KeemenaSubwords.offsets_sentinel — Method
Sentinel used for tokens without a source-text span.
KeemenaSubwords.offsets_span_style — Method
Offset span style.
TokenizationResult.offsets use half-open spans [start, stop).
KeemenaSubwords.pad_id — Method
Padding token ID if available.
KeemenaSubwords.pad_id — Method
Padding token ID if available.
KeemenaSubwords.pad_id — Method
Padding token ID if available.
KeemenaSubwords.pad_id — Method
Padding token ID if available.
KeemenaSubwords.pad_id — Method
Return padding-token ID if available.
KeemenaSubwords.pad_id — Method
Padding token ID if available.
KeemenaSubwords.pad_id — Method
Padding token ID if available.
KeemenaSubwords.pad_id — Method
Padding token ID if available.
KeemenaSubwords.prefetch_models — Function
Ensure artifact-backed built-in models are present on disk.
Returns a dictionary of key => is_available.
KeemenaSubwords.prefetch_models_status — Function
Return detailed prefetch status for built-in model keys.
Each value includes:
available::Bool
method::Symbol (:artifact, :fallback_download, :already_present, or :failed)
path::Union{Nothing,String}
error::Union{Nothing,String}
KeemenaSubwords.print_asset_status — Method
Print a compact prefetch status line for one model key.
KeemenaSubwords.quick_causal_lm_batch — Method
quick_causal_lm_batch(tokenizer_or_source, input_texts; kwargs...) -> NamedTuple
One-call helper for training-ready causal LM tensors.
Pipeline:
quick_encode_batch(...; return_masks=true)
collate_padded_batch(...)
causal_lm_labels(...)
Keyword arguments:
add_special_tokens::Bool=true
apply_tokenization_view::Bool=true
return_offsets::Bool=false
pad_token_id::Union{Nothing,Int}=nothing
pad_to_multiple_of::Union{Nothing,Int}=nothing
pad_side::Symbol=:right
ignore_index::Int=-100
zero_based::Bool=false
format::Union{Nothing,Symbol}=nothing (source overloads only)
prefetch::Bool=true (source overloads only)
Returns a NamedTuple with keys:
ids, attention_mask, labels, token_type_ids, special_tokens_mask, tokenization_texts, sequence_lengths, pad_token_id, ignore_index, zero_based
KeemenaSubwords.quick_encode_batch — Method
quick_encode_batch(tokenizer_or_source, input_texts; kwargs...) -> NamedTuple
High-level wrapper for batch structured encoding.
By default, each input text is first converted with tokenization_view so offsets and alignment metadata are anchored to tokenizer-coordinate text.
Keyword arguments:
add_special_tokens::Bool=true
apply_tokenization_view::Bool=true
return_offsets::Bool=true
return_masks::Bool=true
format::Union{Nothing,Symbol}=nothing (source overloads only)
prefetch::Bool=true (source overloads only)
Returns a NamedTuple with keys:
tokenization_texts, results, sequence_lengths, metadata
KeemenaSubwords.quick_tokenize — Method
quick_tokenize(tokenizer_or_source, input_text; kwargs...) -> NamedTuple
High-level one-call wrapper for common single-text tokenization workflows.
This helper applies the recommended offsets pipeline by default:
tokenization_text = tokenization_view(tokenizer, input_text)
encode_result(tokenizer, tokenization_text; assume_normalized=true, ...)
Supported inputs:
quick_tokenize(tokenizer::AbstractSubwordTokenizer, input_text; ...)
quick_tokenize(source::Symbol, input_text; format=nothing, prefetch=true, ...)
quick_tokenize(source::AbstractString, input_text; format=nothing, prefetch=true, ...)
Keyword arguments:
add_special_tokens::Bool=true
apply_tokenization_view::Bool=true
return_offsets::Bool=true
return_masks::Bool=true
Returns a NamedTuple with keys:
token_pieces, token_ids, decoded_text, tokenization_text, offsets, attention_mask, token_type_ids, special_tokens_mask, metadata
KeemenaSubwords.quick_train_bundle — Method
quick_train_bundle(trainer, corpus; kwargs...) -> NamedTuple
High-level training round-trip helper:
- Train with a selected trainer.
- Save a training bundle with save_training_bundle.
- Reload with load_training_bundle.
- Run a sanity encode/decode pass.
Supported trainer symbols:
:wordpiece, :hf_bert_wordpiece, :hf_roberta_bytebpe, :hf_gpt2_bytebpe
Convenience overload:
quick_train_bundle(corpus; kwargs...) defaults to trainer=:wordpiece.
Keyword arguments:
bundle_directory::Union{Nothing,AbstractString}=nothing
overwrite::Bool=true
export_format::Symbol=:auto
sanity_text::AbstractString="hello world"
- plus trainer-specific keywords forwarded to the selected train_*_result.
Returns a NamedTuple with keys:
bundle_directory, bundle_files, training_summary, tokenizer, sanity_encoded_ids, sanity_decoded_text
KeemenaSubwords.recommended_defaults_for_llms — Method
Recommended built-in keys for LLM-oriented default prefetching.
KeemenaSubwords.register_external_model! — Method
Deprecated alias kept for compatibility. Use register_local_model! instead.
KeemenaSubwords.register_local_model! — Method
Register a local tokenizer path under a symbolic key and persist it in the cache registry.
KeemenaSubwords.register_local_model! — Method
Register local model files from a FilesSpec.
KeemenaSubwords.register_local_model! — Method
Register local model files by explicit specification.
KeemenaSubwords.requires_tokenizer_normalization — Method
Whether this tokenizer defines intrinsic normalization that can change text.
KeemenaSubwords.save_tokenizer — Method
Save tokenizer to a canonical on-disk format.
format=:internal chooses a tokenizer-family specific default:
WordPieceTokenizer -> vocab.txt
BPETokenizer / ByteBPETokenizer -> vocab.txt + merges.txt
UnigramTokenizer -> unigram.tsv
SentencePieceTokenizer -> spm.model
KeemenaSubwords.span_codeunits — Method
Return the offset span as UTF-8 codeunits.
Sentinel and empty spans return UInt8[]. Invalid or out-of-bounds spans also return UInt8[] to keep this helper non-throwing for downstream inspection.
KeemenaSubwords.span_ncodeunits — Method
Return span length measured in UTF-8 codeunits.
Sentinel and empty spans return 0.
KeemenaSubwords.special_tokens — Method
Return special token IDs keyed by symbol.
KeemenaSubwords.special_tokens — Method
Special token IDs.
KeemenaSubwords.special_tokens — Method
Special token IDs.
KeemenaSubwords.special_tokens — Method
Special token IDs.
KeemenaSubwords.special_tokens — Method
Special token IDs.
KeemenaSubwords.special_tokens — Method
Special token IDs.
KeemenaSubwords.special_tokens — Method
Special token IDs.
KeemenaSubwords.token_to_id — Method
Forward token lookup.
KeemenaSubwords.token_to_id — Method
Forward token lookup.
KeemenaSubwords.token_to_id — Method
Forward token lookup.
KeemenaSubwords.token_to_id — Method
Forward token lookup.
KeemenaSubwords.token_to_id — Method
Map token string to ID, falling back to :unk.
KeemenaSubwords.token_to_id — Method
Forward token lookup.
KeemenaSubwords.token_to_id — Method
Forward token lookup.
KeemenaSubwords.token_to_id — Method
Forward token lookup.
KeemenaSubwords.tokenization_view — Method
Canonical tokenizer text view used for subword offsets/alignment.
KeemenaSubwords.tokenize — Method
One-call tokenize by tokenizer path/directory.
KeemenaSubwords.tokenize — Method
Tokenize text into subword pieces.
KeemenaSubwords.tokenize — Method
Tokenize with classic BPE merges.
KeemenaSubwords.tokenize — Method
Tokenize text by first mapping bytes to unicode symbols, then applying BPE merges.
KeemenaSubwords.tokenize — Method
Tokenize text with SentencePiece wrapper behavior.
KeemenaSubwords.tokenize — Method
One-call tokenize by model key.
KeemenaSubwords.tokenize — Method
Tokenize text into b64:<...> token pieces.
KeemenaSubwords.tokenize — Method
Tokenize text using deterministic Viterbi segmentation.
KeemenaSubwords.tokenize — Method
Greedy longest-match WordPiece tokenization.
KeemenaSubwords.try_span_substring — Method
Attempt to return a substring for a half-open codeunit span [start, stop).
Sentinel and empty spans return "". If span boundaries are not valid Julia string boundaries, this returns nothing. This helper never throws.
KeemenaSubwords.unk_id — Method
Unknown token ID.
KeemenaSubwords.unk_id — Method
Unknown token ID.
KeemenaSubwords.unk_id — Method
Unknown token ID.
KeemenaSubwords.unk_id — Method
Unknown token ID.
KeemenaSubwords.unk_id — Method
Return unknown-token ID.
KeemenaSubwords.unk_id — Method
Unknown token ID.
KeemenaSubwords.unk_id — Method
Unknown token ID.
KeemenaSubwords.unk_id — Method
Unknown token ID.
KeemenaSubwords.validate_offsets_contract — Method
Validate offsets against the package offset contract.
Returns true when all offsets satisfy bounds/sentinel invariants. With require_string_boundaries=true, non-empty spans must also start/end on valid Julia string boundaries.
KeemenaSubwords.vocab_size — Method
Vocabulary size.
KeemenaSubwords.vocab_size — Method
Vocabulary size.
KeemenaSubwords.vocab_size — Method
Vocabulary size.
KeemenaSubwords.vocab_size — Method
Vocabulary size.
KeemenaSubwords.vocab_size — Method
Vocabulary size.
KeemenaSubwords.vocab_size — Method
Vocabulary size.
KeemenaSubwords.vocab_size — Method
Vocabulary size.
KeemenaSubwords.vocab_size — Method
Vocabulary size.