<!-- Source of truth: notes/OffsetContract.md. Synced to docs/src/normalizationoffsetscontract.md by tools/syncoffsetcontract.jl. Edit source then run the sync tool. -->
# Normalization and Offsets Contract
This document is the canonical contract for normalization and offsets behavior in KeemenaSubwords when integrated with KeemenaPreprocessing.
For step-by-step usage patterns, see Offsets Alignment Examples.
## Normalization Ownership
- Pipeline normalization (KeemenaPreprocessing): produces `clean_text`.
- Tokenizer intrinsic normalization (KeemenaSubwords): produces `tokenization_text`.
Use:

```julia
tokenization_text = tokenization_view(tokenizer, clean_text)
```

Then call:

```julia
result = encode_result(
    tokenizer,
    tokenization_text;
    assume_normalized=true,
    return_offsets=true,
    return_masks=true,
    add_special_tokens=true,
)
```

When `assume_normalized=true`, KeemenaSubwords must not re-run tokenizer intrinsic normalization.
## Offset Convention
- Coordinate unit: UTF-8 codeunits
- Index base: 1-based
- Span style: half-open `[start, stop)`
- Valid bounds for spanful offsets: `1 <= start <= stop <= ncodeunits(text) + 1`
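As a small worked illustration of the convention, using only Base Julia (the sample string and spans below are hypothetical, not produced by a tokenizer):

```julia
text = "héllo"                  # 'é' occupies 2 UTF-8 codeunits
@assert ncodeunits(text) == 6

# Half-open, 1-based codeunit spans: "h" is (1, 2); "é" is (2, 4).
span_e = (2, 4)

# Span length in codeunits is stop - start.
@assert span_e[2] - span_e[1] == 2

# The exclusive upper bound for a span covering the whole text:
span_full = (1, ncodeunits(text) + 1)
@assert span_full == (1, 7)
```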
Programmatic helpers:

- `offsets_coordinate_system() == :utf8_codeunits`
- `offsets_index_base() == 1`
- `offsets_span_style() == :half_open`
- `offsets_sentinel() == (0, 0)`
- `has_span(offset)` is true iff `offset != (0, 0)`
- `has_nonempty_span(offset)` is true iff the offset is spanful and `stop > start`
- `span_ncodeunits(offset)` returns the span length in codeunits (`0` for sentinel/empty)
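The predicate helpers above can be pictured with the following minimal sketch; these are standalone re-implementations for illustration, not the package's exported definitions:

```julia
const SENTINEL = (0, 0)

# Sketch of the predicate semantics described by the contract.
has_span(offset) = offset != SENTINEL
has_nonempty_span(offset) = has_span(offset) && offset[2] > offset[1]
span_ncodeunits(offset) = has_nonempty_span(offset) ? offset[2] - offset[1] : 0

@assert !has_span((0, 0))
@assert has_span((3, 3)) && !has_nonempty_span((3, 3))  # spanful but empty
@assert span_ncodeunits((2, 4)) == 2
@assert span_ncodeunits((0, 0)) == 0
```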
## Sentinel and Special Tokens
Sentinel:

- `(0, 0)` means "no source-text span".
- In a 1-based scheme, `(0, 0)` is out-of-range and unambiguous.
Special token semantics:

- Inserted special tokens (TemplateProcessing/post-processor inserted):
  - `special_tokens_mask[i] == 1`
  - `offsets[i] == (0, 0)`
- Special tokens matched from user text as added tokens:
  - `special_tokens_mask[i] == 1`
  - `offsets[i]` is a real span into the input text (`offsets[i] != (0, 0)`)
Important:

- `special_tokens_mask` marks special-token identity.
- Span participation is determined by offsets/sentinel, not by the mask alone.
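A hedged sketch of the distinction, using hypothetical encode output (the vectors below are made up; `has_nonempty_span` is re-implemented inline for self-containment):

```julia
# Hypothetical encode output: token 1 was inserted by the post-processor,
# token 3 is a special token matched verbatim in the user text.
offsets = [(0, 0), (1, 6), (7, 12)]
special_tokens_mask = [1, 0, 1]

has_nonempty_span(o) = o != (0, 0) && o[2] > o[1]

# Dropping every mask == 1 token would lose index 3, which has a real span.
# Selecting by offsets keeps it and drops only the sentinel.
participating = [i for i in eachindex(offsets) if has_nonempty_span(offsets[i])]
@assert participating == [2, 3]
```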
## Alignment Rule for KeemenaPreprocessing
The canonical coordinate system is `tokenization_text`.
KeemenaPreprocessing should:

- Produce `clean_text` via pipeline normalization.
- Produce `tokenization_text = tokenization_view(tokenizer, clean_text)`.
- Call `encode_result(tokenizer, tokenization_text; assume_normalized=true, ...)`.
- Compute both word offsets and subword offsets on `tokenization_text`.
- Ignore sentinel offsets `(0, 0)` during span alignment.
- Not drop all `mask == 1` tokens blindly; present-in-text special tokens may have real spans.
Recommended span participation policy:

- A token participates in span alignment iff `has_nonempty_span(offset)`.
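The participation policy can be sketched as follows, with hypothetical word and subword spans on `tokenization_text` (the spans and the containment-based mapping are illustrative, not the package's alignment algorithm):

```julia
has_nonempty_span(o) = o != (0, 0) && o[2] > o[1]

word_spans    = [(1, 6), (7, 12)]                       # half-open codeunit spans
subword_spans = [(0, 0), (1, 4), (4, 6), (7, 12), (0, 0)]

# Map each participating subword to the word span that contains it;
# sentinel/empty offsets are skipped rather than dropped by mask.
word_of_subword = Dict{Int,Int}()
for (i, s) in enumerate(subword_spans)
    has_nonempty_span(s) || continue
    for (w, ws) in enumerate(word_spans)
        if ws[1] <= s[1] && s[2] <= ws[2]
            word_of_subword[i] = w
            break
        end
    end
end
@assert word_of_subword == Dict(2 => 1, 3 => 1, 4 => 2)
```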
## Downstream-Safe Span Inspection
Offsets are codeunit spans. Do not assume they are always valid Julia string slicing boundaries, especially for byte-level tokenizers on multibyte text.
Use these helpers for robust downstream handling:

- `span_codeunits(text, offset)`:
  - returns `UInt8[]` for sentinel/empty spans,
  - returns the exact byte slice for non-empty spans.
- `try_span_substring(text, offset)`:
  - returns `""` for sentinel/empty spans,
  - returns a `String` only when both boundaries are valid Julia string boundaries,
  - returns `nothing` otherwise.
- `is_valid_string_boundary(text, idx)` can be used to inspect boundary validity.
- `offsets_are_nonoverlapping(offsets; ignore_sentinel=true, ignore_empty=true)` validates a downstream non-overlap invariant.
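A minimal standalone sketch of the described behavior, assuming Base Julia only (these are illustrative re-implementations with `sketch_` prefixes, not the exported helpers):

```julia
# A codeunit index is a valid string boundary if it starts a character
# or is the one-past-the-end position.
boundary_ok(text, i) = i == ncodeunits(text) + 1 || isvalid(text, i)

function sketch_span_codeunits(text, offset)
    s, e = offset
    (offset == (0, 0) || e <= s) && return UInt8[]
    return collect(codeunits(text)[s:e-1])     # exact byte slice, always safe
end

function sketch_try_span_substring(text, offset)
    s, e = offset
    (offset == (0, 0) || e <= s) && return ""
    (boundary_ok(text, s) && boundary_ok(text, e)) || return nothing
    return String(codeunits(text)[s:e-1])
end

@assert sketch_span_codeunits("é", (1, 2)) == [0xc3]       # mid-codepoint: bytes OK
@assert sketch_try_span_substring("é", (1, 2)) === nothing  # but unsafe to slice
@assert sketch_try_span_substring("é", (1, 3)) == "é"
@assert sketch_span_codeunits("x", (0, 0)) == UInt8[]
```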
Boundary Validity Expectations By Tokenizer Family
- Non-byte-level tokenizers (for example WordPiece, SentencePiece, Unigram TSV, and non-byte-level HF tokenizer.json pipelines) are expected to produce spanful offsets that land on valid Julia string boundaries.
- Byte-level tokenizers (for example ByteBPE and HF ByteLevel pretokenizers) may produce non-boundary offsets on multibyte Unicode inputs.
Downstream interpretation:

- When `try_span_substring(text, offset)` returns `nothing` for a byte-level multibyte case, treat this as expected "unsafe to slice" behavior.
- Use `span_codeunits(text, offset)` for byte-accurate span inspection regardless of string-boundary validity.
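For instance, on a multibyte input (Base Julia only; the per-byte spans are a hypothetical byte-level tokenizer output, not produced by any specific model):

```julia
text = "é"                   # two codeunits: 0xc3, 0xa9
# A byte-level tokenizer may emit the per-byte spans (1, 2) and (2, 3).
@assert !isvalid(text, 2)    # index 2 is mid-codepoint: not a string boundary
@assert codeunits(text)[1:1] == [0xc3]   # byte-accurate inspection still works
```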
## Strict Validator Helpers (Maintainer Debugging)
For tokenizer development and regression debugging, KeemenaSubwords also exposes strict contract validators:
- `validate_offsets_contract(text, offsets; require_string_boundaries=false)`: returns a `Bool` without throwing.
- `assert_offsets_contract(text, offsets; require_string_boundaries=false)`: throws an `ArgumentError` on the first violation with a targeted message.
Use `require_string_boundaries=true` when validating string-level tokenizers, where spanful offsets are expected to land on valid Julia string boundaries.
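The core checks such validators enforce can be sketched as follows (an illustrative standalone function, not the package implementation):

```julia
function sketch_validate(text, offsets; require_string_boundaries=false)
    hi = ncodeunits(text) + 1
    for (s, e) in offsets
        (s, e) == (0, 0) && continue           # sentinel offsets are always legal
        1 <= s <= e <= hi || return false      # 1-based half-open bounds
        if require_string_boundaries
            ok = (s == hi || isvalid(text, s)) && (e == hi || isvalid(text, e))
            ok || return false
        end
    end
    return true
end

@assert sketch_validate("abc", [(1, 4), (0, 0)])
@assert !sketch_validate("abc", [(0, 3)])   # a 0 start is only legal as (0, 0)
@assert !sketch_validate("é", [(1, 2)]; require_string_boundaries=true)
```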