Offset Vectors & Segmentation Levels
Keemena stores every corpus as a flat vector of token-ids plus one or more offset vectors that mark the boundaries of higher-level units (bytes -> characters -> words -> sentences -> paragraphs -> documents).
Understanding these vectors lets you
- slice substrings for data augmentation,
- project annotations between levels,
- and validate or extend the pipeline.
Anatomy of an offset vector
```
offsets = [s0, s1, ..., sn]   # length = n_tokens or n_tokens + 1
                              # 1-based indices into corpus.token_ids
```
| entry | meaning |
|---|---|
| s0 | leading sentinel - 0 or 1 (optional) |
| si | inclusive start index of token i |
| sn | trailing sentinel - n_tokens or n_tokens+1 (optional) |
```julia
start = offsets[i]
stop  = offsets[i+1] - 1   # inclusive index range of token i
```
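The start/stop rule above can be tried on toy data. The following is a minimal sketch that assumes a 1-based offset vector with an exclusive trailing sentinel (`n_tokens + 1`); `unit_span` is a hypothetical helper written for this example, not a KeemenaPreprocessing function.

```julia
# Hypothetical helper (not part of KeemenaPreprocessing): inclusive index
# range of unit i, given an offset vector that ends with a trailing sentinel.
unit_span(offsets::AbstractVector{<:Integer}, i::Integer) =
    offsets[i]:(offsets[i + 1] - 1)

# Toy data: 6 tokens grouped into 3 units; the last entry is n_tokens + 1.
token_ids = [11, 12, 13, 14, 15, 16]
offsets   = [1, 3, 4, 7]

# One range per unit; the trailing sentinel only closes the last unit.
spans = [unit_span(offsets, i) for i in 1:length(offsets) - 1]

# Slice the flat token vector with each range.
units = [token_ids[s] for s in spans]
```

With an inclusive-style trailing sentinel (`n_tokens`), the last range would need its own handling, because `offsets[i+1] - 1` would then stop one token short.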
`validate_offsets` guarantees:
- `issorted(offsets, lt = <)`
- `first(offsets) in (0, 1)` (if a leading sentinel exists)
- `last(offsets) >= n_tokens`
- `length(offsets) >= n_tokens`
When sentinel recording is disabled for a level, the vector length equals `n_tokens` and the sentinel checks are skipped.
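To make the listed conditions concrete, here is a rough stand-alone sketch of the same checks. `check_offsets` and its `sentinels` keyword are hypothetical names for illustration; the real `validate_offsets` may differ in signature and may throw errors rather than return a Boolean. The sketch also assumes that "sentinel checks" covers both the `first` and the `last` condition.

```julia
# Hypothetical re-statement of the listed checks; not the library function.
function check_offsets(offsets::AbstractVector{<:Integer}, n_tokens::Integer;
                       sentinels::Bool = true)
    issorted(offsets) || return false              # non-decreasing order
    length(offsets) >= n_tokens || return false    # at least one entry per token
    if sentinels && !isempty(offsets)
        # Assumption: both sentinel conditions are skipped together
        # when sentinel recording is disabled.
        first(offsets) in (0, 1) || return false
        last(offsets) >= n_tokens || return false
    end
    return true
end

ok_default   = check_offsets([1, 2, 3, 4], 3)
bad_order    = check_offsets([1, 3, 2, 4], 3)
bad_leading  = check_offsets([2, 3, 4], 3)
no_sentinels = check_offsets([2, 3, 4], 3; sentinels = false)
too_short    = check_offsets([1, 2], 3)
```

Returning `false` instead of throwing keeps the sketch easy to use in quick experiments.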
Sentinel conventions by level
| Level symbol | Default sentinel style | Typical unit |
|---|---|---|
| :byte | [0 ... n] | UTF-8 byte |
| :character | [0 ... n] | Unicode scalar |
| :word | [1 ... n+1] | whitespace / tokenizer word |
| :sentence | [1 ... n+1] | heuristic sentence |
| :paragraph | [1 ... n+1] | blank-line span |
| :document | [1 ... n+1] | source document |
Trailing sentinels may be either the last token index n_tokens (inclusive style) or n_tokens + 1 (exclusive style). The streaming merge helper accepts both and deduplicates them, so every offset vector in a merged bundle ends with exactly one sentinel.
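The idea of ending a merged vector with exactly one sentinel can be illustrated with two small chunks. This is only a sketch under stated assumptions: both chunks use a leading `1` and an exclusive trailing sentinel, and `merge_offsets` is a hypothetical function, not the library's streaming merge helper.

```julia
# Hypothetical sketch, not the library's streaming merge helper.
# Both inputs are assumed to start at 1 and end with n_tokens + 1.
function merge_offsets(a::Vector{Int}, b::Vector{Int})
    n_a = last(a) - 1             # number of tokens covered by the first chunk
    merged = a[1:end-1]           # drop a's trailing sentinel
    append!(merged, b .+ n_a)     # shift b so it continues right after a
    return merged
end

a = [1, 3, 5]    # 4 tokens: units 1:2 and 3:4, sentinel 5
b = [1, 2, 4]    # 3 tokens: units 1:1 and 2:3, sentinel 4

merged = merge_offsets(a, b)
```

After shifting, the first entry of `b` lands on the same index as the dropped sentinel of `a`, so only one copy of that boundary remains and the merged vector ends with a single sentinel at `n_tokens + 1 = 8`.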
Mapping symbols -> vectors
Keemena keeps an internal lookup table, `KeemenaPreprocessing.LEVEL_TO_OFFSETS_FIELD`, that translates a segmentation symbol to the corresponding field name inside `Corpus`:
| Symbol | Corpus field |
|---|---|
| :byte | :byte_offsets |
| :character | :character_offsets |
| :word | :word_offsets |
| :sentence | :sentence_offsets |
| :paragraph | :paragraph_offsets |
| :document | :document_offsets |
```julia
field = KeemenaPreprocessing.LEVEL_TO_OFFSETS_FIELD[:sentence]
sent  = getfield(corpus, field)   # Vector{Int} or `nothing`
```
Advanced only: the table is accessible but not exported; ordinary users do not modify it directly. To register a new level, use `add_level!(bundle, :my_level, lb)`, which both validates offsets and updates the lookup table behind the scenes.
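The lookup-then-`getfield` pattern can be tried without the package installed. In the sketch below, `ToyCorpus`, `TOY_LEVEL_TO_FIELD`, and `offsets_for` are stand-ins invented for illustration; the real `Corpus` type and lookup table have more fields and entries.

```julia
# Stand-ins for illustration only; not the package's types or table.
struct ToyCorpus
    token_ids::Vector{Int}
    word_offsets::Union{Vector{Int}, Nothing}
    sentence_offsets::Union{Vector{Int}, Nothing}
end

const TOY_LEVEL_TO_FIELD = Dict(
    :word     => :word_offsets,
    :sentence => :sentence_offsets,
)

# Return the offsets stored for `level`, or `nothing` when the level is
# unknown or its offsets were not recorded.
function offsets_for(corpus::ToyCorpus, level::Symbol)
    field = get(TOY_LEVEL_TO_FIELD, level, nothing)
    field === nothing && return nothing
    return getfield(corpus, field)
end

corpus = ToyCorpus([5, 6, 7, 8], [1, 3, 5], nothing)

word_offs = offsets_for(corpus, :word)
sent_offs = offsets_for(corpus, :sentence)    # level known, not recorded
para_offs = offsets_for(corpus, :paragraph)   # level not in the toy table
```

Using `get` with a default avoids a `KeyError` for unregistered levels, which is convenient when probing a bundle whose levels are not known in advance.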
Cross-level alignment
When two levels share the same span (e.g. bytes & characters) Keemena derives a CrossMap:
```julia
cm = alignment_byte_to_word(byte_corp, word_corp)
dst_word_idx = cm.alignment[src_byte_idx]   # O(1) lookup
```
`build_ensure_alignments!` automatically adds three canonical maps to every bundle:
```julia
(:byte, :character)
(:byte, :word)
(:character, :word)
```
Practical snippets
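Build a byte-to-character map by hand (illustrative sketch)

To see what an alignment vector of the kind used in `cm.alignment[src_byte_idx]` might contain, the sketch below builds one directly from a string using only Base Julia. `byte_to_char_alignment` is a hypothetical helper and makes no claim about how Keemena constructs its own maps; it assumes valid UTF-8 input.

```julia
# Hypothetical helper: map every byte index of `s` to the index of the
# character that contains it. Assumes `s` is valid UTF-8.
function byte_to_char_alignment(s::String)
    alignment = Vector{Int}(undef, ncodeunits(s))
    char_idx = 0
    for b in 1:ncodeunits(s)
        # isvalid(s, b) is true at the first byte of each character
        isvalid(s, b) && (char_idx += 1)
        alignment[b] = char_idx
    end
    return alignment
end

# "naïve": the ï (U+00EF) occupies two bytes in UTF-8.
alignment = byte_to_char_alignment("na\u00efve")
```

Once the vector is built, looking up the character for any byte is a single index operation, which matches the O(1) lookup shown for `cm.alignment`.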
Extract raw text for the 42nd word
```julia
wc   = bundle.levels[:word].corpus
span = wc.word_offsets[42] : wc.word_offsets[43] - 1
raw  = String(codeunits(bundle.extras.raw_text)[span])
```
Sentence lengths (words per sentence)
```julia
wc      = bundle.levels[:word].corpus
snt     = wc.sentence_offsets   # requires record_sentence_offsets=true
lengths = diff(snt)             # Vector{Int}
```
Shuffle paragraph spans
```julia
using Random   # provides shuffle!

pc    = bundle.levels[:paragraph].corpus
spans = [pc.paragraph_offsets[i] : pc.paragraph_offsets[i+1] - 1
         for i in 1:length(pc.paragraph_offsets)-1]
shuffle!(spans)
```
Add a custom level (advanced workflow)
```julia
# build monotone offset vector (leading 1, trailing n+1);
# n_tokens is the number of tokens in the corpus
my_offs = [1, 8, 15, 22, n_tokens + 1]
# clone an existing corpus and attach new offsets
corpus = deepcopy(bundle.levels[:word].corpus)
# note: setfield! only works if the corpus type is mutable and declares this field
setfield!(corpus, :my_offsets, my_offs)
# wrap & insert; add_level! validates and registers lookup entry
my_lvl = LevelBundle(corpus,
                     bundle.levels[:word].vocabulary)
add_level!(bundle, :my_level, my_lvl)
```
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| 'offsets define k segments but corpus has n tokens' | duplicate or missing trailing sentinel | regenerate offsets or let streaming merge rebuild them |
| 'Offsets must be strictly increasing' | offsets edited out of order | sort or recreate |
| Alignment length mismatch | corpora built from different cleaned text | re-process both levels in the same pipeline |
Following these rules keeps all built-in helpers—slicing utilities, streaming merge, alignment builders—working seamlessly and lets your custom levels interoperate with the rest of KeemenaPreprocessing.