Alignment helpers
Many downstream tasks need to project annotations or errors between tokenisation levels, e.g. "Which word does byte 123 belong to?". For the sentinel and monotonicity rules that alignment assumes, see Offsets: sentinel conventions by level.
KeemenaPreprocessing encapsulates these look-ups in a small record:
struct CrossMap
    source_level      :: Symbol       # :byte, :character, ...
    destination_level :: Symbol       # :word, ...
    alignment         :: Vector{Int}  # 1-based destination index per source
end
For every source token index i, you obtain the corresponding destination token as map.alignment[i].
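To make the lookup concrete, here is a minimal sketch that uses a local stand-in for the record. CrossMapSketch and the sample data are assumptions made for illustration; they are not the package's own type or real corpus output.

```julia
# Local stand-in mirroring the fields of the record above (hypothetical name).
struct CrossMapSketch
    source_level      :: Symbol
    destination_level :: Symbol
    alignment         :: Vector{Int}
end

# Hypothetical data: 7 bytes spread over 3 words.
b2w = CrossMapSketch(:byte, :word, [1, 1, 1, 2, 2, 3, 3])

# Forward lookup: which word does byte 4 belong to?
word_of_byte_4 = b2w.alignment[4]

# Reverse lookup: which bytes make up word 2?
bytes_of_word_2 = findall(==(2), b2w.alignment)
```

The forward direction is a plain index operation; the reverse direction scans the vector, so a workload that needs many reverse lookups may want to build an inverse table once.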
Low-level constructors
| Function | Produces | Preconditions |
|---|---|---|
| alignment_byte_to_word(byte_c, word_c) | :byte -> :word | Each corpus has byte_offsets / word_offsets and both share the same span. |
| alignment_char_to_word(char_c, word_c) | :character -> :word | Ditto, but via character offsets. |
| alignment_byte_to_char(byte_c, char_c) | :byte -> :character | Ditto. |
Example:
b2w = alignment_byte_to_word(byte_corpus, word_corpus)
word_of_42nd_byte = b2w.alignment[42]
Errors: an ArgumentError is thrown if
- the required offset vectors are missing, or
- the two corpora cover different spans (byte_offsets[end] != word_offsets[end]).
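The two error conditions above can be mirrored in a small pre-flight check. This is a sketch under stated assumptions, not the package's implementation: check_alignable and alignment_problem are hypothetical helpers, and a missing offset vector is assumed to be represented as nothing.

```julia
# Hypothetical pre-flight check; assumes a missing offset vector is `nothing`.
function check_alignable(byte_offsets, word_offsets)
    if byte_offsets === nothing || word_offsets === nothing
        throw(ArgumentError("required offset vectors are missing"))
    end
    if byte_offsets[end] != word_offsets[end]
        throw(ArgumentError("byte and word corpora cover different spans"))
    end
    return true
end

# Returns the error message, or nothing when the check passes.
function alignment_problem(byte_offsets, word_offsets)
    try
        check_alignable(byte_offsets, word_offsets)
        return nothing
    catch err
        err isa ArgumentError || rethrow()
        return err.msg
    end
end

ok      = alignment_problem([1, 2, 3, 4], [1, 3, 4])  # same final offset
trimmed = alignment_problem([1, 2, 3, 4], [1, 3])     # word side ends early
absent  = alignment_problem(nothing, [1, 3, 4])       # no byte offsets at all
```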
Bundle-level helpers
_ensure_lower_levels!(bundle)
bundle = _ensure_lower_levels!(bundle)
If the bundle has a :word level and that word corpus already stores character_offsets and/or byte_offsets, this function:
- synthesises dummy :character / :byte corpora (token ids are filled with <UNK>),
- adds them as levels (vocabulary = 1-token dummy),
- leaves existing levels untouched,
- returns the same bundle (mutated in place).
Idempotent: calling it again is a no-op.
build_alignments!(bundle; pairs = [(:byte,:word), ...])
Creates each requested CrossMap if, and only if, both of its levels exist and the map is not already present.
build_alignments!(bundle) # default three maps
build_alignments!(bundle; pairs=[(:character,:word)])
build_ensure_alignments!(bundle)
One-stop convenience:
build_ensure_alignments!(bundle)
- Calls _ensure_lower_levels!,
- Calls build_alignments! with the default trio,
- Returns the bundle (mutated).
Typical workflow
bund = preprocess_corpus("alice.txt", config = cfg)
# guarantee byte/char levels + alignments
build_ensure_alignments!(bund)
word_of_first_byte = bund.alignments[(:byte,:word)].alignment[1]
Inside the high-level pipelines:
- preprocess_corpus creates all three levels + alignments by default.
- preprocess_corpus_streaming / _chunks / _full call build_ensure_alignments! for every chunk (and again after merging), so the resulting bundles are always fully aligned.
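Once a byte-to-word alignment vector is available, projecting a byte-level annotation onto words is a short operation. The sketch below works on a plain Vector{Int} with hypothetical data; words_in_byte_span is an illustrative helper, not a package function.

```julia
# Hypothetical alignment vector (what a :byte -> :word map's alignment
# field might hold): 7 bytes spread over 3 words.
byte_to_word = [1, 1, 1, 2, 2, 3, 3]

# Hypothetical helper: the words touched by a 1-based, inclusive byte range.
function words_in_byte_span(alignment::Vector{Int}, span::UnitRange{Int})
    checkbounds(alignment, span)      # BoundsError if the span leaves the corpus
    return unique(alignment[span])    # distinct word indices, in order of appearance
end

# An annotation covering bytes 3 to 5 touches words 1 and 2.
words = words_in_byte_span(byte_to_word, 3:5)
```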
Sentinel assumptions
Offset vectors must satisfy:
issorted(offsets) == true
first(offsets) in (0,1) # leading sentinel
last(offsets) >= n_tokens # trailing sentinel
alignment_* functions interpret every index in offsets[i] : offsets[i+1]-1 as belonging to token i.
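The interpretation rule above can be sketched as a small function that expands an offset vector into an alignment vector. This is an illustration, not the package's code: alignment_from_offsets is a hypothetical name, and the sketch assumes the 1-based variant of the convention, where offsets[1] == 1 and offsets[end] is one past the last source index.

```julia
# Hypothetical helper, not part of KeemenaPreprocessing.
# Assumes offsets[i] is the first source index of token i (1-based) and
# offsets[end] is a trailing sentinel one past the last source index.
function alignment_from_offsets(offsets::Vector{Int})
    issorted(offsets) || throw(ArgumentError("offsets must be sorted"))
    length(offsets) >= 2 || throw(ArgumentError("need at least one token plus a sentinel"))
    first(offsets) == 1 || throw(ArgumentError("this sketch expects offsets to start at 1"))

    n_source  = last(offsets) - 1
    alignment = Vector{Int}(undef, n_source)
    for i in 1:length(offsets)-1
        # every source index in offsets[i] : offsets[i+1]-1 belongs to token i
        for j in offsets[i]:(offsets[i+1] - 1)
            alignment[j] = i
        end
    end
    return alignment
end

# Three tokens covering source indices 1:3, 4:5 and 6:7.
alignment = alignment_from_offsets([1, 4, 6, 8])
```

Because the offsets are sorted and start at 1, the token ranges tile 1:n_source without gaps, so every slot of the result is written exactly once.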
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| ArgumentError: byte and word corpora cover different span | Word corpus was trimmed during cleaning but byte corpus was not. | Re-run the pipeline; byte and word corpora must originate from the same cleaned text. |
| KeyError when accessing bundle.alignments[(src,dst)] | Map not built (levels missing or function not called). | Call build_ensure_alignments!(bundle) or ensure both levels exist before build_alignments!. |
| Dummy vocabularies contain only <UNK> | Expected: lower levels are placeholders used solely for alignment. | No action needed. |
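For the KeyError case, a defensive lookup avoids the exception when a map may not have been built. The sketch below assumes the alignments container behaves like a Dict keyed by (source, destination) tuples, and uses a plain Dict of vectors as a stand-in; lookup_alignment is a hypothetical helper.

```julia
# Stand-in for bundle.alignments: a plain Dict keyed by (source, destination).
alignments = Dict{Tuple{Symbol,Symbol},Vector{Int}}(
    (:byte, :word) => [1, 1, 1, 2, 2, 3, 3],
)

# Hypothetical helper: returns nothing instead of raising a KeyError.
lookup_alignment(maps, src::Symbol, dst::Symbol) = get(maps, (src, dst), nothing)

b2w = lookup_alignment(alignments, :byte, :word)        # present
c2w = lookup_alignment(alignments, :character, :word)   # not built, so nothing
```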
Helper signatures (for reference)
alignment_byte_to_word(byte_c::Corpus, word_c::Corpus) -> CrossMap
alignment_char_to_word(char_c::Corpus, word_c::Corpus) -> CrossMap
alignment_byte_to_char(byte_c::Corpus, char_c::Corpus) -> CrossMap
_ensure_lower_levels!(bundle::PreprocessBundle) -> PreprocessBundle
build_alignments!(bundle::PreprocessBundle; pairs = ...) -> PreprocessBundle
build_ensure_alignments!(bundle::PreprocessBundle) -> PreprocessBundle
Once you have called build_ensure_alignments! on a bundle, that bundle is guaranteed to contain the canonical :byte -> :character -> :word chain.
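One consequence of having the full chain is that maps can be composed by indexing. The sketch below uses hypothetical alignment vectors rather than real bundle contents, and shows that chaining :byte -> :character with :character -> :word yields a :byte -> :word vector.

```julia
# Hypothetical data: 7 bytes, 6 characters (character 2 spans two bytes), 2 words.
byte_to_char = [1, 2, 2, 3, 4, 5, 6]
char_to_word = [1, 1, 1, 2, 2, 2]

# Composition: look up each byte's character, then that character's word.
byte_to_word = char_to_word[byte_to_char]
```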