Quick Start

The preprocess_corpus wrapper lets you go from raw text -> cleaned, tokenised, aligned, and fully-packaged PreprocessBundle in one line. Its only required argument is sources (strings, file paths, URLs, or any iterable that mixes them). Everything else is optional.

You pass → What happens:

  • No config= and no keywords: a fresh PreprocessConfiguration is created with all documented defaults.
  • Keyword overrides but no config=: a fresh configuration is built from the defaults plus your overrides.
  • config = cfg object: that exact configuration is used; keyword overrides are forbidden (to avoid ambiguity).

Minimal 'hello bundle' (all defaults)

using KeemenaPreprocessing

bund = preprocess_corpus("my_corpus.txt")

word_vocabulary = bund.levels[:word].vocabulary
@show length(word_vocabulary.id_to_token_strings)

Single in-memory string

raw = """
      It was a dark stormy night,
      And we see Sherlock Holmes.
      """

bund = preprocess_corpus(raw) # treats `raw` as a document

Multiple strings (small ad-hoc corpus)

docs = [
    "Mary had a little lamb.",
    "Humpty-Dumpty sat on a wall.",
    """
    Roses are red,
    violets are blue
    """
]

bund = preprocess_corpus(docs;
                         tokenizer_name = :whitespace,
                         minimum_token_frequency = 2)

Multiple file paths

sources = ["data/alice.txt",
           "data/time_machine.txt",
           "/var/corpora/news_2024.txt"]

bund = preprocess_corpus(sources; lowercase = false)

Directories in sources are silently skipped; mixing paths and raw strings is fine.


Remote URLs

urls = [
    "https://www.gutenberg.org/files/11/11-0.txt",   # Alice
    "https://www.gutenberg.org/files/35/35-0.txt"    # Time Machine
]

bund = preprocess_corpus(urls;
                         tokenizer_name = :unicode,
                         record_sentence_offsets = true)

Zero-configuration byte-level tokenisation

cfg  = byte_cfg()                 # shorthand helper
bund = preprocess_corpus("binary_corpus.bin"; config = cfg)

Saving and loading bundles

cfg   = PreprocessConfiguration(minimum_token_frequency = 5)
bund1 = preprocess_corpus("my_corpus.txt";
                          config  = cfg,
                          save_to = "corpus.jld2")

bund2 = load_preprocess_bundle("corpus.jld2")

Interoperability

using KeemenaPreprocessing

# 1) Load a preprocessed bundle
bundle = load_preprocess_bundle("corpus.jld2")

# 2) Choose a segmentation level for modeling (e.g., words)
word_corpus  = get_corpus(bundle, :word)      # -> Corpus
vocabulary   = bundle.levels[:word].vocabulary

# 3) Get the token ids as a single flat vector (all documents concatenated)
token_ids = word_corpus.token_ids             # Vector{Int32} (or Int)

# 4) Split token ids by document using the document offset vector
#    (offsets follow the "[1 ... n+1]" sentinel style at word-level)
document_offsets = word_corpus.document_offsets
document_ranges = (document_offsets[i]:(document_offsets[i+1]-1)
                   for i in 1:length(document_offsets)-1)
document_token_views = [view(token_ids, r) for r in document_ranges]

# 5) Debug / data inspection: map a handful of ids back to strings
first_20_strings = map(id -> vocabulary.id_to_token_strings[id], token_ids[1:20])

# 6) Word -> raw-text span (useful for highlighting model outputs)
#    (See [Offsets: sentinel conventions by level](@ref offsets_sentinels).)
word_index = 42
start_ix   = word_corpus.word_offsets[word_index]
stop_ix    = word_corpus.word_offsets[word_index + 1] - 1
raw_span   = String(codeunits(bundle.extras.raw_text)[start_ix:stop_ix])

# 7) Byte -> word (project low-level artifacts back to words)
build_ensure_alignments!(bundle)  # ensure canonical :byte->:word map exists
byte_to_word = bundle.alignments[(:byte, :word)].alignment
word_of_byte_123 = byte_to_word[123]
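The offset arithmetic in steps 4 and 6 can be checked on toy data without the package. A minimal sketch, assuming only the sentinel convention described here (offsets sorted, last entry == n_tokens + 1); the data below is made up for illustration:

```julia
# Toy flat token ids for 3 "documents" (8 word tokens total)
token_ids        = [11, 12, 13, 21, 22, 31, 32, 33]
document_offsets = [1, 4, 6, 9]          # sentinel: last == 8 + 1

# Split the flat vector into per-document views (no copying)
document_token_views = [view(token_ids, document_offsets[i]:document_offsets[i+1]-1)
                        for i in 1:length(document_offsets)-1]

# Byte spans work the same way on raw text; codeunits handles multi-byte
# characters correctly (each Greek letter below is 2 bytes in UTF-8)
raw_text          = "αβγ"
char_byte_offsets = [1, 3, 5, 7]         # byte start of each character, sentinel 7
second_char = String(codeunits(raw_text)[char_byte_offsets[2]:char_byte_offsets[3]-1])
```

Here `document_token_views[2]` is the view `[21, 22]` and `second_char` is `"β"`; the same index arithmetic applies to the real bundle arrays.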

Alignments and CrossMap

Every time you call preprocess_corpus (streaming or not), the helper build_ensure_alignments! adds deterministic mappings between all recorded segmentation levels:

  • Offset arrays, e.g. bundle.levels[:word].corpus.sentence_offsets.
  • CrossMap: sparse look-up tables linking byte -> char -> word -> sentence indices.

Inspecting offsets

wc = get_corpus(bund, :word)     # word-level Corpus
@show wc.sentence_offsets[1:10]  # sentinel-terminated, always sorted

Byte -> word mapping for a single token

btw = bund.alignments[(:byte, :word)]     # CrossMap
byte_ix = 12345
word_ix = btw.alignment[byte_ix]          # constant-time lookup
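To see why the lookup is constant-time, a CrossMap-style table can be sketched in plain Julia from sentinel word byte-offsets. This is a dense variant built on toy data for illustration, not the package's internal layout:

```julia
# Word-start byte offsets with sentinel (last == n_bytes + 1): 3 words over 11 bytes
word_byte_offsets = [1, 4, 9, 12]

# Dense byte -> word table: one entry per byte, so lookups afterwards are O(1)
byte_to_word = Vector{Int}(undef, word_byte_offsets[end] - 1)
for w in 1:length(word_byte_offsets)-1
    byte_to_word[word_byte_offsets[w]:word_byte_offsets[w+1]-1] .= w
end

byte_to_word[5]   # byte 5 falls inside word 2
```

The table costs one Int per byte; a sparse representation trades that memory for a binary search per lookup.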

Convenience helpers

word_ix = alignment_byte_to_word(bund, byte_ix)
char_ix = alignment_byte_to_char(bund, byte_ix)

These helpers are thin wrappers over CrossMap, but keep your code independent of the underlying representation.

Working with multiple segmentation levels

The pipeline can record byte, character, word, sentence, paragraph, and document offsets simultaneously. Just enable the flags you need in the configuration:

using KeemenaPreprocessing

cfg = PreprocessConfiguration(
          tokenizer_name            = :unicode,    # word-ish tokens
          record_byte_offsets       = true,
          record_character_offsets  = true,
          record_word_offsets       = true,
          record_sentence_offsets   = true,
          record_paragraph_offsets  = true,
          record_document_offsets   = true)

bund = preprocess_corpus("demo.txt"; config = cfg)

byte_corp = get_corpus(bund, :byte)         # each token is a UInt8
char_corp = get_corpus(bund, :character)    # Unicode code points
word_corp = get_corpus(bund, :word)         # words / graphemes
sent_offs = word_corp.sentence_offsets     # sentinel-terminated
para_offs = word_corp.paragraph_offsets
doc_offs  = word_corp.document_offsets

@show (byte_corp.token_ids[1:10],
       char_corp.token_ids[1:10],
       word_corp.token_ids[1:10])

By default every offset array is sorted and sentinel-terminated (last entry == n_tokens + 1), so it is safe to binary-search into them, e.g. with searchsortedlast.
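Because of that invariant, locating the segment that contains a given token is a single binary search. A self-contained sketch with toy offsets:

```julia
# 12 word tokens in 3 sentences; sentinel: last entry == 12 + 1
sentence_offsets = [1, 4, 9, 13]

# searchsortedlast returns the index of the last offset <= word_ix,
# which is exactly the 1-based sentence number for that word
sentence_of(word_ix) = searchsortedlast(sentence_offsets, word_ix)

sentence_of(1), sentence_of(5), sentence_of(12)   # -> (1, 2, 3)
```

The same one-liner works against any of the recorded offset arrays (paragraphs, documents, ...), since they all share the sentinel convention.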


Supplying a custom tokenizer function

Any callable f(::AbstractString) -> Vector{<:AbstractString} can replace the built-ins. Below we split on whitespace and the ASCII dash "-" character:

using KeemenaPreprocessing

function dash_whitespace_tok(text::AbstractString)
    return split(text, r"[ \t\n\r\-]+", keepempty = false)
end

cfg = PreprocessConfiguration(
          tokenizer_name           = dash_whitespace_tok,    # <- callable
          minimum_token_frequency  = 2,
          record_word_offsets      = true)

docs  = ["state-of-the-art models excel",   # note the dashes
         "art-of-war is timeless"]

bund   = preprocess_corpus(docs; config = cfg)

# Inspect the custom tokenisation
wc         = get_corpus(bund, :word)
word_vocab = bund.levels[:word].vocabulary
@show map(tid -> word_vocab.id_to_token_strings[tid], wc.token_ids)

If you want to plug in an existing tokenizer package (WordTokenizers.jl, BytePairEncoding.jl, etc.), see Using Existing Tokenizers. If you only need the built-in tokenizers, see Built-in tokenizers.

Tips for custom tokenisers

  • Return type: Vector{<:AbstractString} (no UInt8).
  • No trimming: if you want empty tokens preserved, call with preserve_empty_tokens = true.
  • Offsets: only the :byte and :char levels need special handling; CrossMap takes care of higher levels automatically.
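A quick way to sanity-check a custom tokeniser against the return-type guideline, using plain Julia (the tokeniser below is the same hypothetical splitter used above):

```julia
dash_whitespace_tok(text::AbstractString) = split(text, r"[ \t\n\r\-]+"; keepempty = false)

toks = dash_whitespace_tok("state-of-the-art models")

# split returns Vector{SubString{String}}, which satisfies Vector{<:AbstractString}
toks isa Vector{<:AbstractString}   # true
```

Returning substrings (rather than freshly allocated Strings) is fine and avoids copying; the pipeline only requires the AbstractString element type.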

Pitfalls

  • Passing config= and keyword overrides. Symptom: ErrorException ("Pass either config= or per-field keywords, not both"). Fix: pick one method, never both.
  • record_paragraph_offsets = true with preserve_newlines = false. Symptom: a warning, and paragraphs are not recorded. Fix: enable preserve_newlines (done automatically, with a warning).
  • Unsupported tokenizer_name symbol. Symptom: AssertionError. Fix: see Built-in tokenizers.

Why JLD2 (and when to use something else)

KeemenaPreprocessing uses JLD2 for the convenience helpers save_preprocess_bundle and load_preprocess_bundle. JLD2 is a pure-Julia serialization format that can store arbitrary Julia structs efficiently and produces files compatible with the HDF5 spec.

The PreprocessBundle itself is just a plain Julia object, so you are not locked into JLD2: if your workflow needs memory-mapped arrays, indexed random access, or cross-language IO, you can write the bundle (or just its large arrays) using a different storage backend.
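As one concrete alternative, Julia's stdlib Serialization module can persist the large arrays directly. A sketch with a hypothetical stand-in array; note that, unlike JLD2/HDF5, this format is Julia-version specific and not readable from other languages:

```julia
using Serialization

# Hypothetical stand-in for a bundle's large token-id array
token_ids = Int32[3, 1, 4, 1, 5, 9, 2, 6]

path = tempname()
open(io -> serialize(io, token_ids), path, "w")   # write
restored = open(deserialize, path)                # read back

restored == token_ids   # true
```

For memory-mapped access, Mmap.mmap over a raw Int32 file is another stdlib-only option.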