Offset Vectors & Segmentation Levels

Keemena stores every corpus as a flat vector of token-ids plus one or more offset vectors that mark the boundaries of higher-level units (bytes -> characters -> words -> sentences -> paragraphs -> documents).

Understanding these vectors lets you

  • slice substrings for data augmentation,
  • project annotations between levels,
  • and validate or extend the pipeline.

Anatomy of an offset vector

offsets = [s0, s1, ..., sn]        # length = n_tokens   or   n_tokens + 1
                                   # 1-based indices into corpus.token_ids
entrymeaning
s0leading sentinel - 0 or 1 (optional)
s1inclusive start index of token i
sntrailing sentinel - n_tokens or n_tokens+1 (optional)
start = offsets[i]
stop  = offsets[i+1] - 1           # inclusive index range of token i

validate_offsets guarantees

  • issorted(offsets, lt = <)
  • first(offsets) in (0, 1) (if a leading sentinel exists)
  • last(offsets) >= n_tokens
  • length(offsets) >= n_tokens

When sentinel recording is disabled for a level the vector length equals n_tokens and the sentinel checks are skipped.


Sentinel conventions by level

Level symbolDefault sentinel styleTypical unit
:byte[0 ... n]UTF-8 byte
:character[0 ... n]Unicode scalar
:word[1 ... n+1]whitespace / tokenizer word
:sentence[1 ... n+1]heuristic sentence
:paragraph[1 ... n+1]blank-line span
:document[1 ... n+1]source document

Trailing sentinels may be either the last token index n_tokens (inclusive style) or n_tokens + 1 (exclusive style). The streaming merge helper accepts both and deduplicates them, so every offset vector in a merged bundle ends with exactly one sentinel.


Mapping symbols -> vectors

Keemena keeps an internal lookup table

KeemenaPreprocessing.LEVEL_TO_OFFSETS_FIELD

that translates a segmentation symbol to the corresponding field name inside Corpus:

SymbolCorpus field
:byte:byte_offsets
:character:character_offsets
:word:word_offsets
:sentence:sentence_offsets
:paragraph:paragraph_offsets
:document:document_offsets
field = KeemenaPreprocessing.LEVEL_TO_OFFSETS_FIELD[:sentence]
sent  = getfield(corpus, field)          # Vector{Int} or `nothing`

Advanced only The table is accessible but not exported; ordinary users do not modify it directly. To register a new level use add_level!(bundle, :my_level, lb) which both validates offsets and updates the lookup table behind the scenes.


Cross-level alignment

When two levels share the same span (e.g. bytes & characters) Keemena derives a CrossMap:

cm = alignment_byte_to_word(byte_corp, word_corp)
dst_word_idx = cm.alignment[src_byte_idx]     # O(1) lookup

build_ensure_alignments! automatically adds three canonical maps to every bundle:

(:byte      , :character)
(:byte      , :word)
(:character , :word)

Practical snippets

Extract raw text for the 42-nd word

wc   = bundle.levels[:word].corpus
span = wc.word_offsets[42] : wc.word_offsets[43] - 1
raw  = String(codeunits(bundle.extras.raw_text)[span])

Sentence lengths (words per sentence)

wc  = bundle.levels[:word].corpus
snt = wc.sentence_offsets                       # requires record_sentence_offsets=true
lengths = diff(snt)                             # Vector{Int}

Shuffle paragraph spans

pc = bundle.levels[:paragraph].corpus
spans = [pc.paragraph_offsets[i] : pc.paragraph_offsets[i+1] - 1
         for i in 1:length(pc.paragraph_offsets)-1]
shuffle!(spans)

Add a custom level (advanced workflow)

#  build monotone offset vector (leading 1, trailing n+1)
my_offs = [1, 8, 15, 22, n_tokens + 1]

#  clone an existing corpus and attach new offsets
corpus = deepcopy(bundle.levels[:word].corpus)
setfield!(corpus, :my_offsets, my_offs)

#  wrap & insert; add_level! validates and registers lookup entry
my_lvl = LevelBundle(corpus,
                     bundle.levels[:word].vocabulary)
add_level!(bundle, :my_level, my_lvl)

Troubleshooting

SymptomLikely causeFix
'offsets define k segments but corpus has n tokens'duplicate or missing trailing sentinelregenerate offsets or let streaming merge rebuild them
'Offsets must be strictly increasing'offsets edited out of ordersort or recreate
Alignment length mismatchcorpora built from different cleaned textre-process both levels in the same pipeline

Following these rules keeps all built-in helpers—slicing utilities, streaming merge, alignment builders—working seamlessly and lets your custom levels interoperate with the rest of KeemenaPreprocessing.