API Reference

Explicit Loader APIs

KeemenaSubwords.load_bpe (Function)

Load a BPE tokenizer from a directory (vocab.txt + merges.txt), from a single vocab file path, or from explicit vocab + merges paths.
KeemenaSubwords.load_bytebpe (Function)

Load a byte-level BPE tokenizer from a directory (vocab.txt + merges.txt), from a single vocab path, or from explicit vocab + merges paths.
KeemenaSubwords.load_bpe_gpt2 (Function)

Load a GPT-2 / RoBERTa style BPE tokenizer from vocab.json + merges.txt.

Example: load_bpe_gpt2("/path/to/vocab.json", "/path/to/merges.txt")
KeemenaSubwords.load_unigram (Function)

Load a Unigram tokenizer from unigram.tsv (a file path or a directory containing it).

Expected format (tab-separated): token<TAB>score[<TAB>special_symbol]
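The row format can be illustrated with a short parsing sketch; the sample rows and helper code below are ours, not part of the package.

```julia
# Parse illustrative unigram.tsv rows: token<TAB>score[<TAB>special_symbol].
# A sketch of the file format, not the package's own parser.
rows = "▁hello\t-3.2\n<unk>\t0.0\tspecial"
entries = map(split(rows, '\n')) do line
    fields = split(line, '\t')
    (token = String(fields[1]),
     score = parse(Float64, fields[2]),
     special = length(fields) == 3)   # optional third column marks specials
end
entries[1]   # (token = "▁hello", score = -3.2, special = false)
```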
KeemenaSubwords.load_wordpiece (Function)

Load a WordPiece tokenizer from a vocab file path or a directory containing vocab.txt.

Examples:

  • load_wordpiece("/path/to/vocab.txt")
  • load_wordpiece("/path/to/model_dir")
KeemenaSubwords.load_sentencepiece (Function)

Load a SentencePiece .model file.

Supported inputs:

  • standard SentencePiece binary protobuf .model/.model.v3 payloads
  • Keemena text-exported model files:
    • key/value lines (type=unigram|bpe, whitespace_marker=▁, unk_token=<unk>)
    • piece rows: piece<TAB>token<TAB>score[<TAB>special_symbol]
    • bpe merge rows (for type=bpe): merge<TAB>left<TAB>right

Examples:

  • load_sentencepiece("/path/to/tokenizer.model"; kind=:auto)
  • load_sentencepiece("/path/to/tokenizer.model.v3"; kind=:bpe)
KeemenaSubwords.load_tiktoken (Function)

Load a tiktoken encoding file (*.tiktoken).

The expected format is line-based: <base64_token_bytes><space><rank>, where ranks are non-negative integers.

Examples:

  • load_tiktoken("/path/to/o200k_base.tiktoken")
  • load_tiktoken("/path/to/tokenizer.model") (when the file contains tiktoken text lines)
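The line format can be sketched with a minimal parser; the helper name and sample line below are ours, for illustration only.

```julia
using Base64

# Parse one tiktoken-format line: "<base64_token_bytes><space><rank>".
# Illustrative sketch of the format, not the package's loader.
function parse_tiktoken_line(line::AbstractString)
    b64, rank = split(strip(line), ' ')
    return base64decode(b64), parse(Int, rank)
end

token_bytes, rank = parse_tiktoken_line("aGVsbG8= 42")
String(token_bytes)   # "hello"
rank                  # 42
```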
KeemenaSubwords.load_hf_tokenizer_json (Function)

Load a Hugging Face tokenizer.json tokenizer in pure Julia.

Expected files:

  • tokenizer.json directly, or
  • a directory containing tokenizer.json.

Examples:

  • load_hf_tokenizer_json("/path/to/tokenizer.json")
  • load_hf_tokenizer_json("/path/to/model_dir")
KeemenaSubwords.load_tokenizer (Function)

Load a tokenizer by built-in model name.

Load a tokenizer from a file system path.

Common format contracts:

  • :hf_tokenizer_json -> tokenizer.json
  • :bpe_gpt2 -> vocab.json + merges.txt
  • :bpe_encoder -> encoder.json + vocab.bpe
  • :wordpiece / :wordpiece_vocab -> vocab.txt
  • :sentencepiece_model -> *.model / *.model.v3 / sentencepiece.bpe.model
  • :tiktoken -> *.tiktoken or tiktoken-text tokenizer.model

Examples:

  • load_tokenizer("/path/to/model_dir")
  • load_tokenizer("/path/to/tokenizer.model"; format=:tiktoken)
  • load_tokenizer("/path/to/tokenizer.json"; format=:hf_tokenizer_json)

Load a tokenizer from an explicit (vocab_path, merges_path) tuple.

This tuple form serves classic BPE / byte-level BPE (vocab.txt + merges.txt), or the explicit JSON-pair loaders (vocab.json + merges.txt, encoder.json + vocab.bpe) when a format keyword is also given.

Load a tokenizer from a named specification.

Examples:

  • (format=:wordpiece, path="/.../vocab.txt")
  • (format=:hf_tokenizer_json, path="/.../tokenizer.json")
  • (format=:unigram, path="/.../unigram.tsv")
  • (format=:bpe_gpt2, vocab_json="/.../vocab.json", merges_txt="/.../merges.txt")
  • (format=:bpe_encoder, encoder_json="/.../encoder.json", vocab_bpe="/.../vocab.bpe")
  • (format=:wordpiece, vocab_txt="/.../vocab.txt") (alias)
  • (format=:sentencepiece_model, model_file="/.../tokenizer.model") (alias)
  • (format=:tiktoken, encoding_file="/.../o200k_base.tiktoken") (alias)
  • (format=:hf_tokenizer_json, tokenizer_json="/.../tokenizer.json") (alias)
  • (format=:unigram, unigram_tsv="/.../unigram.tsv") (alias)

Load a tokenizer from a FilesSpec.
KeemenaSubwords.detect_tokenizer_format (Function)

Detect the tokenizer format of a local file or directory.

Returns one of the format symbols, such as :hf_tokenizer_json, :bpe_gpt2, :bpe_encoder, :sentencepiece_model, :tiktoken, :wordpiece, :bpe, or :unigram.

Examples:

  • detect_tokenizer_format("/path/to/model_dir")
  • detect_tokenizer_format("/path/to/tokenizer.model")

Structured encoding and file-spec APIs are also part of the public surface:

  • TokenizationResult, FilesSpec
  • normalize, tokenization_view, requires_tokenizer_normalization
  • offsets_coordinate_system, offsets_index_base, offsets_span_style, offsets_sentinel
  • has_span, has_nonempty_span, span_ncodeunits, span_codeunits
  • is_valid_string_boundary, try_span_substring
  • offsets_are_nonoverlapping, validate_offsets_contract, assert_offsets_contract
  • encode_result, encode_batch_result

Quick Handler APIs

KeemenaSubwords.quick_tokenize (Function)
quick_tokenize(tokenizer_or_source, input_text; kwargs...) -> NamedTuple

High-level one-call wrapper for common single-text tokenization workflows.

This helper applies the recommended offsets pipeline by default:

  1. tokenization_text = tokenization_view(tokenizer, input_text)
  2. encode_result(tokenizer, tokenization_text; assume_normalized=true, ...)

Supported inputs:

  • quick_tokenize(tokenizer::AbstractSubwordTokenizer, input_text; ...)
  • quick_tokenize(source::Symbol, input_text; format=nothing, prefetch=true, ...)
  • quick_tokenize(source::AbstractString, input_text; format=nothing, prefetch=true, ...)

Keyword arguments:

  • add_special_tokens::Bool=true
  • apply_tokenization_view::Bool=true
  • return_offsets::Bool=true
  • return_masks::Bool=true

Returns a NamedTuple with keys:

  • token_pieces
  • token_ids
  • decoded_text
  • tokenization_text
  • offsets
  • attention_mask
  • token_type_ids
  • special_tokens_mask
  • metadata
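A minimal end-to-end sketch, assuming a local WordPiece vocab file (the path below is a placeholder, not a shipped asset):

```julia
using KeemenaSubwords

# Hypothetical local vocab path; any WordPiece vocab.txt works here.
tok = load_wordpiece("/path/to/vocab.txt")

out = quick_tokenize(tok, "Hello, world!")
out.token_ids     # Vector of token ids (specials included by default)
out.offsets       # 1-based UTF-8 codeunit, half-open spans
out.metadata      # pipeline details
```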
KeemenaSubwords.quick_encode_batch (Function)
quick_encode_batch(tokenizer_or_source, input_texts; kwargs...) -> NamedTuple

High-level wrapper for batch structured encoding.

By default, each input text is first converted with tokenization_view so offsets and alignment metadata are anchored to tokenizer-coordinate text.

Keyword arguments:

  • add_special_tokens::Bool=true
  • apply_tokenization_view::Bool=true
  • return_offsets::Bool=true
  • return_masks::Bool=true
  • format::Union{Nothing,Symbol}=nothing (source overloads only)
  • prefetch::Bool=true (source overloads only)

Returns a NamedTuple with keys:

  • tokenization_texts
  • results
  • sequence_lengths
  • metadata
KeemenaSubwords.collate_padded_batch (Function)
collate_padded_batch(results; tokenizer=nothing, pad_token_id=nothing, kwargs...) -> NamedTuple

Collate Vector{TokenizationResult} into dense (sequence_length, batch_size) matrices.

Returns:

  • ids::Matrix{Int}
  • attention_mask::Matrix{Int}
  • token_type_ids::Matrix{Int}
  • special_tokens_mask::Matrix{Int}
  • sequence_lengths::Vector{Int}
  • pad_token_id::Int
  • pad_side::Symbol

Padding behavior:

  • ids are filled with pad_token_id.
  • attention_mask uses 1 for valid tokens and 0 for padding.
  • token_type_ids defaults to 0 where missing.
  • special_tokens_mask defaults to 0 on valid tokens where missing and uses 1 on padding positions.

Pad token selection:

  • If pad_token_id is provided, it is used directly.
  • Otherwise pad_id(tokenizer) is used when available.
  • Otherwise eos_id(tokenizer) is used when available.
  • If none are available, throws an ArgumentError.

Optional keyword arguments:

  • pad_to_multiple_of::Union{Nothing,Int}=nothing
  • pad_side::Symbol=:right (only right padding is currently supported)
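The padding rules above can be mirrored by a self-contained sketch (simplified to ids and attention_mask only; the helper name is ours):

```julia
# Right-pad variable-length id sequences into (sequence_length, batch_size)
# matrices, mirroring the padding semantics described above (simplified sketch).
function pad_right(sequences::Vector{Vector{Int}}, pad_token_id::Int)
    seq_len = maximum(length, sequences)
    ids  = fill(pad_token_id, seq_len, length(sequences))
    mask = zeros(Int, seq_len, length(sequences))
    for (j, seq) in enumerate(sequences)
        ids[1:length(seq), j] .= seq
        mask[1:length(seq), j] .= 1     # 1 for valid tokens, 0 for padding
    end
    return ids, mask
end

ids, mask = pad_right([[5, 6, 7], [8]], 0)
# ids == [5 8; 6 0; 7 0]; mask == [1 1; 1 0; 1 0]
```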
KeemenaSubwords.causal_lm_labels (Function)
causal_lm_labels(ids, attention_mask; ignore_index=-100, zero_based=false) -> Matrix{Int}

Build next-token labels for causal language modeling from padded ids and attention_mask matrices shaped (sequence_length, batch_size).

For each sequence column:

  • valid non-final positions receive the next valid token id,
  • the final valid position receives ignore_index,
  • padding positions receive ignore_index.

When zero_based=true, subtracts 1 from all non-ignored labels to support consumers that expect 0-based ids.

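For one padded column, the rules reduce to a short sketch (our helper name, not the package function):

```julia
# Next-token labels for a single padded sequence column, following the rules
# above: shift within valid positions, ignore_index everywhere else.
function shift_labels(ids::Vector{Int}, mask::Vector{Int}; ignore_index::Int=-100)
    labels = fill(ignore_index, length(ids))
    valid = findall(==(1), mask)
    for k in 1:length(valid)-1
        labels[valid[k]] = ids[valid[k + 1]]   # next valid token id
    end
    return labels    # final valid position and padding stay at ignore_index
end

shift_labels([11, 12, 13, 0], [1, 1, 1, 0])
# -> [12, 13, -100, -100]
```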
KeemenaSubwords.quick_causal_lm_batch (Function)
quick_causal_lm_batch(tokenizer_or_source, input_texts; kwargs...) -> NamedTuple

One-call helper for training-ready causal LM tensors.

Pipeline:

  1. quick_encode_batch(...; return_masks=true)
  2. collate_padded_batch(...)
  3. causal_lm_labels(...)

Keyword arguments:

  • add_special_tokens::Bool=true
  • apply_tokenization_view::Bool=true
  • return_offsets::Bool=false
  • pad_token_id::Union{Nothing,Int}=nothing
  • pad_to_multiple_of::Union{Nothing,Int}=nothing
  • pad_side::Symbol=:right
  • ignore_index::Int=-100
  • zero_based::Bool=false
  • format::Union{Nothing,Symbol}=nothing (source overloads only)
  • prefetch::Bool=true (source overloads only)

Returns a NamedTuple with keys:

  • ids
  • attention_mask
  • labels
  • token_type_ids
  • special_tokens_mask
  • tokenization_texts
  • sequence_lengths
  • pad_token_id
  • ignore_index
  • zero_based
KeemenaSubwords.quick_train_bundle (Function)
quick_train_bundle(trainer, corpus; kwargs...) -> NamedTuple

High-level training round-trip helper:

  1. Train with a selected trainer.
  2. Save a training bundle with save_training_bundle.
  3. Reload with load_training_bundle.
  4. Run a sanity encode/decode pass.

Supported trainer symbols:

  • :wordpiece
  • :hf_bert_wordpiece
  • :hf_roberta_bytebpe
  • :hf_gpt2_bytebpe

Convenience overload:

  • quick_train_bundle(corpus; kwargs...) defaults to trainer=:wordpiece.

Keyword arguments:

  • bundle_directory::Union{Nothing,AbstractString}=nothing
  • overwrite::Bool=true
  • export_format::Symbol=:auto
  • sanity_text::AbstractString="hello world"
  • plus trainer-specific keywords forwarded to the selected train_*_result.

Returns a NamedTuple with keys:

  • bundle_directory
  • bundle_files
  • training_summary
  • tokenizer
  • sanity_encoded_ids
  • sanity_decoded_text

Registry and Installation APIs

KeemenaSubwords.download_hf_files (Function)

Download selected files from a Hugging Face repository revision into the cache.

This helper is opt-in and is useful for user-managed / gated tokenizers.

register_external_model! remains available as a deprecated compatibility alias; prefer register_local_model! in new code.

Full Exported API

KeemenaSubwords.FilesSpec (Type)

Structured file specification for local tokenizer loading/registration.

Use path for single-file formats and explicit pairs for multi-file formats.

KeemenaSubwords.TokenizationResult (Type)

Structured tokenization output for downstream pipelines.

Offset contract:

  • coordinate unit: UTF-8 codeunits.
  • index base: 1.
  • span style: half-open [start, stop).
  • valid bounds for spanful tokens: 1 <= start <= stop <= ncodeunits(text) + 1.
  • sentinel for tokens without source-text spans: (0, 0).
  • inserted post-processor specials use sentinel offsets.
  • present-in-text special added tokens keep real spans, and may still have special_tokens_mask[i] == 1.
  • special_tokens_mask marks special-token identity; offsets determine span participation.
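The bounds and sentinel rules can be expressed as a one-function sketch (a simplified check that ignores string-boundary validity; the helper name is ours):

```julia
# Bounds/sentinel check for one half-open, 1-based UTF-8 codeunit span,
# per the contract above (string-boundary validity is not checked here).
function span_ok(text::AbstractString, start::Int, stop::Int)
    (start, stop) == (0, 0) && return true    # sentinel: no source-text span
    return 1 <= start <= stop <= ncodeunits(text) + 1
end

span_ok("héllo", 1, 4)   # true: "hé" occupies codeunits 1:3 ('é' is 2 codeunits)
span_ok("héllo", 0, 0)   # true: sentinel
span_ok("héllo", 5, 9)   # false: stop exceeds ncodeunits + 1 == 7
```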
KeemenaSubwords.assert_offsets_contract (Method)

Assert offsets satisfy the package offset contract.

Throws ArgumentError on first contract violation. With require_string_boundaries=true, non-empty spans must start/end on valid Julia string boundaries.

KeemenaSubwords.causal_lm_labels (Method)

Build next-token labels for causal language modeling from padded ids and attention_mask matrices (documented under Quick Handler APIs above).
KeemenaSubwords.collate_padded_batch (Method)

Collate a Vector{TokenizationResult} into dense (sequence_length, batch_size) matrices (documented above).
KeemenaSubwords.detect_tokenizer_format (Method)

Detect the tokenizer format of a local file or directory (documented above).
KeemenaSubwords.encode_result (Function)

Encode text and return a structured TokenizationResult.

Key keyword arguments:

  • assume_normalized::Bool=false: when true, tokenizer intrinsic normalization is skipped and offsets are computed against the exact provided text.
  • return_offsets::Bool=false: include token-level offsets when available.
  • return_masks::Bool=false: include attention/token-type/special-token masks.

Offset note:

  • Offsets use the package-wide 1-based UTF-8 codeunit half-open convention.
  • assume_normalized changes whether intrinsic normalization runs; it does not change the offset coordinate system.
KeemenaSubwords.export_tokenizer (Method)

Export tokenizer to external formats.

Supported format values:

  • :internal
  • :bpe / :bpe_gpt2
  • :wordpiece_vocab
  • :unigram_tsv
  • :sentencepiece_model
  • :hf_tokenizer_json
KeemenaSubwords.load_hf_tokenizer_json (Method)

Load a Hugging Face tokenizer.json tokenizer in pure Julia (documented above).
KeemenaSubwords.load_sentencepiece (Method)

Load a SentencePiece .model file (documented above).
KeemenaSubwords.load_tiktoken (Method)

Load a tiktoken encoding file (documented above).
KeemenaSubwords.load_tokenizer (Method)

Load a tokenizer from a file system path (documented above).
KeemenaSubwords.load_tokenizer (Method)

Load a tokenizer from a named specification (documented above).
KeemenaSubwords.load_tokenizer (Method)

Load a tokenizer from an explicit (vocab_path, merges_path) tuple (documented above).
KeemenaSubwords.load_unigram (Method)

Load a Unigram tokenizer from unigram.tsv (documented above).
KeemenaSubwords.load_wordpiece (Method)

Load a WordPiece tokenizer from a vocab file or directory (documented above).
KeemenaSubwords.normalize (Method)

Return tokenizer intrinsic normalization output.

This does not perform pipeline-level preprocessing. Tokenizers without intrinsic normalization return text unchanged.

KeemenaSubwords.offsets_are_nonoverlapping (Method)

Return whether participating offsets are non-overlapping in sequence order.

Participating offsets satisfy:

  • not sentinel when ignore_sentinel=true
  • not empty when ignore_empty=true

For participating offsets, this enforces next.start >= prev.stop.
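As a sketch of this rule over (start, stop) tuples (our helper, not the package method):

```julia
# Non-overlap check over half-open (start, stop) spans in sequence order.
# Sentinel and empty spans do not participate, mirroring the rule above.
function nonoverlapping(offsets::Vector{Tuple{Int,Int}})
    prev_stop = 0
    for (start, stop) in offsets
        (start, stop) == (0, 0) && continue   # sentinel
        start == stop && continue             # empty span
        start >= prev_stop || return false    # next.start >= prev.stop
        prev_stop = stop
    end
    return true
end

nonoverlapping([(1, 3), (3, 5), (0, 0), (5, 8)])   # true (spans may touch)
nonoverlapping([(1, 4), (3, 5)])                   # false (overlap)
```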

KeemenaSubwords.prefetch_models_status (Function)

Return detailed prefetch status for built-in model keys.

Each value includes:

  • available::Bool
  • method::Symbol (:artifact, :fallback_download, :already_present, or :failed)
  • path::Union{Nothing,String}
  • error::Union{Nothing,String}
KeemenaSubwords.quick_causal_lm_batch (Method)

One-call helper for training-ready causal LM tensors (documented under Quick Handler APIs above).
KeemenaSubwords.quick_encode_batch (Method)

High-level wrapper for batch structured encoding (documented above).
KeemenaSubwords.quick_tokenize (Method)

High-level one-call wrapper for single-text tokenization workflows (documented above).
KeemenaSubwords.quick_train_bundle (Method)

High-level training round-trip helper (documented above).
KeemenaSubwords.save_tokenizer (Method)

Save tokenizer to a canonical on-disk format.

format=:internal chooses a tokenizer-family specific default:

  • WordPieceTokenizer -> vocab.txt
  • BPETokenizer / ByteBPETokenizer -> vocab.txt + merges.txt
  • UnigramTokenizer -> unigram.tsv
  • SentencePieceTokenizer -> spm.model
KeemenaSubwords.span_codeunits (Method)

Return the offset span as UTF-8 codeunits.

Sentinel and empty spans return UInt8[]. Invalid or out-of-bounds spans also return UInt8[] to keep this helper non-throwing for downstream inspection.

KeemenaSubwords.try_span_substring (Method)

Attempt to return a substring for a half-open codeunit span [start, stop).

Sentinel and empty spans return "". If span boundaries are not valid Julia string boundaries, this returns nothing. This helper never throws.
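A simplified sketch of this behavior over half-open codeunit spans (our helper name, not the package implementation):

```julia
# Substring extraction for a half-open codeunit span [start, stop):
# "" for sentinel/empty spans, nothing for invalid string boundaries.
function try_substr(text::String, start::Int, stop::Int)
    (start, stop) == (0, 0) && return ""      # sentinel span
    start == stop && return ""                # empty span
    boundary_ok = isvalid(text, start) &&
                  (stop == ncodeunits(text) + 1 || isvalid(text, stop))
    boundary_ok || return nothing
    return text[start:prevind(text, stop)]
end

try_substr("héllo", 1, 4)   # "hé" ('é' spans codeunits 2:3)
try_substr("héllo", 3, 3)   # "" (empty span)
try_substr("héllo", 1, 3)   # nothing (stop lands inside 'é')
```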

KeemenaSubwords.validate_offsets_contract (Method)

Validate offsets against the package offset contract.

Returns true when all offsets satisfy bounds/sentinel invariants. With require_string_boundaries=true, non-empty spans must also start/end on valid Julia string boundaries.
