Public API
KeemenaPreprocessing.TOKENIZERS — Constant
TOKENIZERS
A constant tuple of Symbols listing the names of built-in tokenizers that can be passed to the tokenizer_name keyword of PreprocessConfiguration.
Currently supported values are
- :whitespace - split on Unicode whitespace;
- :unicode - iterate user-perceived graphemes (eachgrapheme);
- :byte - treat the text as raw bytes (byte-level models);
- :char - split on individual UTF-8 code units.
You may also supply any callable that implements mytokens = f(string) in place of one of these symbols.
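For instance, a minimal custom tokenizer is just a function from a string to a vector of tokens (the comma_tokenizer name below is illustrative, not part of the package):

```julia
# Hypothetical custom tokenizer: any callable f(string) -> tokens qualifies.
comma_tokenizer(s::AbstractString) = String.(split(s, ','))

comma_tokenizer("a,b,c")    # ["a", "b", "c"]
```

Passing such a callable as tokenizer_name would replace the built-in tokenisation step.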
KeemenaPreprocessing.Corpus — Type
Corpus
Flat, memory-efficient container that stores an entire corpus of token-ids together with optional hierarchical offset tables that recover the original structure (documents → paragraphs → sentences → words → characters → bytes).
Every offset vector records the starting index (1-based, inclusive) of each unit inside token_ids. The final entry therefore equals length(token_ids)+1, making range retrieval convenient via view(token_ids, offsets[i] : offsets[i+1]-1).
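A small standalone illustration of this sentinel convention (toy ids, not produced by the package):

```julia
# Two documents concatenated; the final offset entry = length(token_ids) + 1.
token_ids        = [11, 12, 13, 21, 22]
document_offsets = [1, 4, 6]

# Recover document 2 without copying:
doc2 = view(token_ids, document_offsets[2] : document_offsets[3] - 1)
collect(doc2)    # [21, 22]
```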
Fields
| field | type | always present? | description |
|---|---|---|---|
token_ids | Vector{Int} | ✓ | Concatenated token identifiers returned by the vocabulary. |
document_offsets | Vector{Int} | ✓ | Start positions of each document (outermost level). |
paragraph_offsets | Union{Vector{Int},Nothing} | cfg-dependent | Paragraph starts within each document when record_paragraph_offsets=true. |
sentence_offsets | Union{Vector{Int},Nothing} | cfg-dependent | Sentence boundaries when record_sentence_offsets=true. |
word_offsets | Union{Vector{Int},Nothing} | cfg-dependent | Word boundaries when record_word_offsets=true. |
character_offsets | Union{Vector{Int},Nothing} | cfg-dependent | Unicode-character spans when record_character_offsets=true. |
byte_offsets | Union{Vector{Int},Nothing} | cfg-dependent | Byte-level spans when record_byte_offsets=true. |
Example
# assume `corp` is a Corpus produced by preprocess_corpus
doc1_range = corp.document_offsets[1] : corp.document_offsets[2]-1
doc1_token_ids = view(corp.token_ids, doc1_range)
if corp.sentence_offsets !== nothing
first_sentence = view(corp.token_ids,
corp.sentence_offsets[1] : corp.sentence_offsets[2]-1)
end

The presence or absence of each optional offsets vector is determined entirely by the corresponding record_*_offsets flags in PreprocessConfiguration.
KeemenaPreprocessing.CrossMap — Type
CrossMap
Alignment table that links two segmentation levels of the same corpus (e.g. bytes -> characters, characters -> words, words -> sentences).
For every unit in the destination level the alignment vector stores the 1-based index into the source offsets at which that unit begins. This allows constant-time projection of any span expressed in destination units back to the finer-grained source sequence.
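As a standalone sketch of that constant-time projection (toy numbers, assuming the sentinel-terminated layout described under Fields):

```julia
# 3 sentences spanning 11 words; alignment[i] is the first word of sentence i.
alignment = [1, 4, 9, 12]    # sentinel entry: 11 words + 1

# Words belonging to sentence 2:
sentence   = 2
word_range = alignment[sentence] : alignment[sentence+1] - 1    # 4:8
```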
Fields
- source_level :: Symbol - name of the finer level (must match a key in bundle.levels; typically :byte, :char, :word, :sentence, or :paragraph).
- destination_level :: Symbol - name of the coarser level whose boundaries are encoded.
- alignment :: Vector{Int} - length = N_destination + 1. alignment[i] is the starting source-level offset of destination element i; the extra sentinel entry alignment[end] = N_source + 1 lets you slice with alignment[i] : alignment[i+1]-1 without bounds checks.
Example
# map words ⇒ sentences
m = CrossMap(:word, :sentence, sent2word_offsets)
first_sentence_word_ids = alignment_view(m, 1)  # helper returning a view

The constructor is trivial and performs no validation; pipelines are expected to guarantee consistency when emitting CrossMap objects.
KeemenaPreprocessing.CrossMap — Method
CrossMap(src, dst, align)

Shorthand outer constructor that builds a CrossMap while materialising the alignment vector as Vector{Int}.
Arguments
- src::Symbol - identifier of the source (finer-grained) level (e.g. :char, :word).
- dst::Symbol - identifier of the destination (coarser) level (e.g. :word, :sentence).
- align::AbstractVector{<:Integer} - offset array mapping every destination unit to its starting position in the source sequence. Any integer-typed vector is accepted; it is copied into a dense Vector{Int} to guarantee contiguous storage and type stability inside the resulting CrossMap.
Returns
A CrossMap(src, dst, Vector{Int}(align)).
Example
cm = CrossMap(:char, :word, UInt32[1, 5, 9, 14])
@assert cm.alignment isa Vector{Int}
KeemenaPreprocessing.LevelBundle — Type
LevelBundle
Self-contained pairing of a Corpus and its companion Vocabulary. A LevelBundle represents one segmentation level (e.g. words, characters, or bytes) produced by the preprocessing pipeline. By storing both objects side-by-side it guarantees that every token_id found in corpus.token_ids is valid according to vocabulary.
Fields
- corpus :: Corpus - all token-ids plus optional offset tables describing the structure of the text at this level.
- vocabulary :: Vocabulary - bidirectional mapping between token strings and the integer ids used in corpus.token_ids.
Integrity checks
The inner constructor performs two runtime validations:
- Range check - the largest token-id must not exceed length(vocabulary.id_to_token_strings).
- Lower bound - all token-ids must be >= 1 (id 0 is never legal).
Violations raise an informative ArgumentError, catching mismatches early.
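The two checks can be sketched in isolation as follows (a standalone approximation; the real validation lives in the inner constructor):

```julia
# Standalone sketch of the LevelBundle integrity checks described above.
function check_token_ids(token_ids::Vector{Int}, vocabulary_size::Int)
    maximum(token_ids) <= vocabulary_size ||
        throw(ArgumentError("token-id exceeds vocabulary size $vocabulary_size"))
    minimum(token_ids) >= 1 ||
        throw(ArgumentError("token-ids must be >= 1 (id 0 is never legal)"))
    return true
end

check_token_ids([1, 3, 2], 3)    # true
# check_token_ids([0, 1], 3)     # would throw ArgumentError
```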
Example
word_corpus = Corpus(word_ids, doc_offs, nothing, sent_offs, word_offs,
nothing, nothing)
word_vocab = build_vocabulary(words; minimum_token_frequency = 2)
word_bundle = LevelBundle(word_corpus, word_vocab)
nb_tokens = length(word_bundle.vocabulary.id_to_token_strings)
@info "bundle contains $nb_tokens unique tokens"
KeemenaPreprocessing.PipelineMetadata — Type
PipelineMetadata
Compact header bundled with every artefact produced by KeemenaPreprocessing. It records the exact pipeline settings and the version of the on-disk schema so that data can be re-processed, inspected, or migrated safely.
Fields
- configuration::PreprocessConfiguration - the full set of cleaning, tokenisation, vocabulary, and offset-recording options that generated the artefact. Storing this ensures strict reproducibility.
- schema_version::VersionNumber - the version of the bundle file format (not the Julia package). Increment the major component when breaking changes are introduced so that loaders can detect incompatibilities and perform migrations or raise errors.
Example
cfg = PreprocessConfiguration(strip_html_tags = true)
meta = PipelineMetadata(cfg, v"1.0.0")
@info "tokeniser:" meta.configuration.tokenizer_name
@assert meta.schema_version >= v"1.0.0"
KeemenaPreprocessing.PipelineMetadata — Method
PipelineMetadata() -> PipelineMetadata

Convenience constructor that returns a metadata header with
- the default PreprocessConfiguration() (all keyword arguments left at their documented defaults); and
- the current bundle schema version v"1.0.0".
Handy for rapid prototyping or unit tests when you do not need to customise the pipeline but still require a valid PipelineMetadata object.
Identical to:
PipelineMetadata(PreprocessConfiguration(), v"1.0.0")
KeemenaPreprocessing.PreprocessBundle — Type
PreprocessBundle{ExtraT}
Top-level artefact emitted by preprocess_corpus (or the streaming variant). A bundle contains everything required to feed a downstream model or to reload a corpus without re-running the expensive preprocessing pipeline.
Type parameter
ExtraT - arbitrary payload for user-defined information (e.g. feature matrices, clustering assignments, language tags). Use Nothing when no extras are needed.
Fields
| field | type | description |
|---|---|---|
levels | Dict{Symbol,LevelBundle} | Mapping from segmentation level name (:byte, :char, :word, :sentence, :paragraph, …) to the corresponding LevelBundle. |
metadata | PipelineMetadata | Reproducibility header (configuration + schema version). |
alignments | Dict{Tuple{Symbol,Symbol},CrossMap} | Pair-wise offset projections between levels, keyed as (source, destination) (e.g. (:char, :word)). |
extras | ExtraT | Optional user payload carried alongside the core data. |
Typical workflow
bund = preprocess_corpus(files; strip_html_tags=true)
# inspect vocabulary
word_vocab = bund.levels[:word].vocabulary
println("vocabulary size: ", length(word_vocab.id_to_token_strings))
# project a sentence span back to character offsets
cm = bund.alignments[(:char, :sentence)]
first_sentence_char_span = cm.alignment[1] : cm.alignment[2]-1

The bundle struct itself is immutable; helper functions such as add_level! (which extends the levels dictionary in place) and with_extras are provided by the package for adding further levels or extras.
KeemenaPreprocessing.PreprocessBundle — Method
PreprocessBundle(levels; metadata = PipelineMetadata(),
alignments = Dict{Tuple{Symbol,Symbol},CrossMap}(),
                 extras = nothing) -> PreprocessBundle

Outer constructor that validates and assembles the individual artefacts generated by KeemenaPreprocessing into a single PreprocessBundle.
Required argument
levels::Dict{Symbol,<:LevelBundle} - at least one segmentation level (keyed by level name such as :word or :char).
Optional keyword arguments
| keyword | default | purpose |
|---|---|---|
metadata | PipelineMetadata() | Configuration & schema header. |
alignments | empty Dict | Maps (source,destination) -> CrossMap. |
extras | nothing | User-supplied payload propagated unchanged. |
Runtime checks
- Non-empty levels.
- For each (lvl, lb) in levels, run validate_offsets(lb.corpus, lvl) to ensure internal offset consistency.
- For every supplied alignment (src, dst) => cm:
  - both src and dst must exist in levels;
  - length(cm.alignment) == length(levels[src].corpus.token_ids);
  - cm.source_level == src;
  - cm.destination_level == dst.
Any violation throws an informative ArgumentError.
Returns
A fully-validated PreprocessBundle{typeof(extras)} containing: Dict(levels), metadata, Dict(alignments), and extras.
Example
word_bundle = LevelBundle(word_corpus, word_vocab)
char_bundle = LevelBundle(char_corpus, char_vocab)
bund = PreprocessBundle(Dict(:word=>word_bundle, :char=>char_bundle);
                        alignments = Dict((:char,:word)=>char2word_map))
KeemenaPreprocessing.PreprocessBundle — Method
PreprocessBundle(; metadata = PipelineMetadata(), extras = nothing) -> PreprocessBundle

Convenience constructor that produces an empty PreprocessBundle:
- levels = Dict{Symbol,LevelBundle}()
- alignments = Dict{Tuple{Symbol,Symbol},CrossMap}()
- metadata = metadata (defaults to PipelineMetadata())
- extras = extras (defaults to nothing)

Useful when you want to build a bundle incrementally, for example loading individual levels from disk or generating them in separate jobs, while still attaching a common metadata header or arbitrary user payload.
bund = PreprocessBundle() # blank skeleton
bund = merge(bund, load_word_level("word.jld")) # pseudo-code for adding data

The returned object's type parameter is inferred from extras so that any payload, including complex structs, can be stored without further boilerplate.
KeemenaPreprocessing.PreprocessConfiguration — Method
PreprocessConfiguration(; kwargs...) -> PreprocessConfiguration

Create a fully-specified preprocessing configuration.
All keyword arguments are optional; sensible defaults are provided so that cfg = PreprocessConfiguration() already yields a working pipeline. Options are grouped below by the stage they affect.
Cleaning stage toggles
| keyword | default | purpose |
|---|---|---|
lowercase | true | Convert letters to lower-case. |
strip_accents | true | Remove combining accent marks. |
remove_control_characters | true | Drop Unicode Cc/Cf code-points. |
remove_punctuation | true | Strip punctuation & symbol characters. |
normalise_whitespace | true | Collapse consecutive whitespace. |
remove_zero_width_chars | true | Remove zero-width joiners, etc. |
preserve_newlines | true | Keep explicit line breaks. |
collapse_spaces | true | Collapse runs of spaces/tabs. |
trim_edges | true | Strip leading/trailing whitespace. |
URL, e-mail & numbers
| keyword | default | purpose |
|---|---|---|
replace_urls | true | Replace URLs with url_sentinel. |
replace_emails | true | Replace e-mails with mail_sentinel. |
keep_url_scheme | false | Preserve http:// / https:// prefix. |
url_sentinel | "<URL>" | Token inserted for each URL. |
mail_sentinel | "<EMAIL>" | Token inserted for each e-mail. |
replace_numbers | false | Replace numbers with number_sentinel. |
number_sentinel | "<NUM>" | Token used when replacing numbers. |
keep_number_decimal | false | Preserve decimal part. |
keep_number_sign | false | Preserve ± sign. |
keep_number_commas | false | Preserve thousands separators. |
Mark-up & HTML
| keyword | default | purpose |
|---|---|---|
strip_markdown | false | Remove Markdown formatting. |
preserve_md_code | true | Keep fenced/inline code while stripping. |
strip_html_tags | false | Remove HTML/XML tags. |
html_entity_decode | true | Decode &, ", etc. |
Emoji & Unicode
| keyword | default | purpose |
|---|---|---|
emoji_handling | :keep | :keep, :remove, or :sentinel. |
emoji_sentinel | "<EMOJI>" | Used when emoji_handling == :sentinel. |
squeeze_repeat_chars | false | Limit repeated character runs. |
max_char_run | 3 | Maximum run length when squeezing. |
map_confusables | false | Map visually-confusable chars. |
unicode_normalisation_form | :none | :NFC, :NFD, :NFKC, :NFKD, or :none. |
map_unicode_punctuation | false | Replace Unicode punctuation with ASCII. |
Tokenisation
| keyword | default | purpose |
|---|---|---|
tokenizer_name | :whitespace | One of TOKENIZERS or a callable. |
preserve_empty_tokens | false | Keep zero-length tokens. |
Vocabulary construction
| keyword | default | purpose |
|---|---|---|
minimum_token_frequency | 1 | Discard rarer tokens / map to <UNK>. |
special_tokens | Dict(:unk=>"<UNK>", :pad=>"<PAD>") | Role ⇒ literal mapping. |
Offset recording
| keyword | default | purpose |
|---|---|---|
record_byte_offsets | false | Record byte-level spans. |
record_character_offsets | false | Record Unicode-char offsets. |
record_word_offsets | true | Record word offsets. |
record_sentence_offsets | true | Record sentence offsets. |
record_paragraph_offsets | false | Record paragraph offsets (forces preserve_newlines = true). |
record_document_offsets | true | Record document offsets. |
Returns
A fully-initialised PreprocessConfiguration instance. Invalid combinations raise AssertionError (e.g. unsupported tokenizer) and certain settings emit warnings when they imply other flags (e.g. paragraph offsets -> preserve_newlines).
See also: TOKENIZERS and byte_cfg for a pre-canned byte-level configuration.
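Example (hedged; uses only keywords documented in the tables above):

```julia
cfg = PreprocessConfiguration()                    # all defaults

cfg = PreprocessConfiguration(replace_numbers          = true,
                              tokenizer_name           = :unicode,
                              record_character_offsets = true,
                              minimum_token_frequency  = 2)
```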
KeemenaPreprocessing.Vocabulary — Type
Vocabulary
Immutable lookup table produced by build_vocabulary that maps between integer token-ids and the string literals that appear in a corpus.
Fields
- id_to_token_strings::Vector{String} - position i holds the canonical surface form of token-id i (vocab.id_to_token_strings[id] → "word").
- token_to_id_map::Dict{String,Int} - fast reverse mapping from token string to its integer id (vocab.token_to_id_map["word"] → id). Look-ups fall back to the <UNK> id when the string is absent.
- token_frequencies::Vector{Int} - corpus counts aligned with id_to_token_strings (token_frequencies[id] gives the raw frequency of that token).
- special_tokens::Dict{Symbol,Int} - reserved ids for sentinel symbols such as :unk, :pad, :bos, :eos, … Keys are roles (Symbol); values are the corresponding integer ids.
Usage example
vocab = build_vocabulary(tokens; minimum_token_frequency = 3)
@info "UNK id: " vocab.special_tokens[:unk]
@info "«hello» id:" vocab.token_to_id_map["hello"]
@info "id → token:" vocab.id_to_token_strings[42]
KeemenaPreprocessing._Alignment.alignment_byte_to_word — Function
alignment_byte_to_word(byte_c, word_c) -> CrossMap

Construct a byte -> word CrossMap that projects each byte index in byte_c onto the word index in word_c that contains it.
Preconditions
- byte_c must have a non-nothing byte_offsets vector (checked via the private helper _require_offsets).
- word_c must have a non-nothing word_offsets vector.
- Both corpora must span the same token range: byte_offsets[end] == word_offsets[end]; otherwise an ArgumentError is thrown.
Arguments
| name | type | description |
|---|---|---|
byte_c | Corpus | Corpus tokenised at the byte level. |
word_c | Corpus | Corpus tokenised at the word level. |
Algorithm
- Retrieve the sentinel-terminated offset vectors bo = byte_c.byte_offsets and wo = word_c.word_offsets.
- Allocate b2w = Vector{Int}(undef, n_bytes) where n_bytes = length(bo) - 1.
- For each word index w_idx, fill the slice wo[w_idx] : wo[w_idx+1]-1 with w_idx, thereby assigning every byte position to the word that begins at wo[w_idx].
- Return CrossMap(:byte, :word, b2w).
The output vector has length n_bytes (no sentinel) because every byte token receives one word identifier.
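The fill loop above can be sketched standalone (toy offsets; not the package's actual implementation):

```julia
# Word offsets with sentinel: 3 words spanning 8 bytes.
wo      = [1, 4, 6, 9]
n_bytes = wo[end] - 1
b2w     = Vector{Int}(undef, n_bytes)

for w_idx in 1:length(wo)-1
    b2w[wo[w_idx] : wo[w_idx+1]-1] .= w_idx   # every byte -> its word index
end

b2w    # [1, 1, 1, 2, 2, 3, 3, 3]
```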
Returns
A CrossMap whose fields are:
source_level == :byte
destination_level == :word
alignment :: Vector{Int}   # length = n_bytes

Errors
- ArgumentError if either corpus lacks the necessary offsets.
- ArgumentError when the overall spans differ.
Example
b2w = alignment_byte_to_word(byte_corpus, word_corpus)
word_index_of_42nd_byte = b2w.alignment[42]
KeemenaPreprocessing._Assemble.assemble_bundle — Function
assemble_bundle(tokens, offsets, vocab, cfg) -> PreprocessBundle

Convert the token-level artefacts produced by tokenize_and_segment into a minimal yet fully valid PreprocessBundle. The function:
- Projects each token to its integer id using vocab; unknown strings are mapped to the :unk special (throws if the vocabulary lacks one).
- Packs the id sequence together with the requested offset tables into a Corpus.
- Wraps that corpus and its vocabulary in a LevelBundle whose key is inferred from cfg.tokenizer_name:

| tokenizer_name value | level symbol stored |
|---|---|
| :byte | :byte |
| :char | :character |
| :unicode, :whitespace | :word |
| Function | Symbol(typeof(fn)) |
| any other Symbol | same symbol |

- Builds a default PipelineMetadata header (PipelineMetadata(cfg, v"1.0.0")).
- Returns a PreprocessBundle containing exactly one level, empty alignments, and extras = nothing.
Arguments
| name | type | description |
|---|---|---|
tokens | Vector{<:Union{String,UInt8}} | Flattened token stream. |
offsets | Dict{Symbol,Vector{Int}} | Start indices for each recorded level (as returned by tokenize_and_segment). |
vocab | Vocabulary | Token <-> id mapping (must contain :unk). |
cfg | PreprocessConfiguration | Determines the level key and special-token requirements. |
Returns
PreprocessBundle with
bundle.levels == Dict(level_key => LevelBundle(corpus, vocab))
bundle.metadata == PipelineMetadata(cfg, v"1.0.0")
bundle.alignments == Dict{Tuple{Symbol,Symbol},CrossMap}() # empty
bundle.extras == nothing

Errors
- Throws ArgumentError if vocab lacks the :unk special.
- Propagates any error raised by the inner constructors of Corpus or LevelBundle (e.g. offset inconsistencies).
Example
tokens, offs = tokenize_and_segment(docs, cfg)
vocab = build_vocabulary(tokens; cfg = cfg)
bund = assemble_bundle(tokens, offs, vocab, cfg)
@info keys(bund.levels)   # (:word,) for whitespace tokenizer
KeemenaPreprocessing._BundleIO.load_preprocess_bundle — Function
load_preprocess_bundle(path; format = :jld2) -> PreprocessBundle

Load a previously-saved PreprocessBundle from disk.
The function currently understands the JLD2 wire-format written by save_preprocess_bundle. It performs a lightweight header check to ensure the on-disk bundle version is not newer than the library version linked at run-time, helping you avoid silent incompatibilities after package upgrades.
Arguments
| name | type | description |
|---|---|---|
path | AbstractString | File name (relative or absolute) pointing to the bundle on disk. |
format | Symbol (keyword) | Serialization format. Only :jld2 is accepted—any other value raises an error. |
Returns
PreprocessBundle - the exact object originally passed to save_preprocess_bundle, including all levels, alignments, metadata, and extras.
Errors
- ArgumentError - if path does not exist.
- ArgumentError - if format ≠ :jld2.
- ErrorException - when the bundle's persisted version is newer than the library's internal _BUNDLE_VERSION, signalling that your local code may be too old to read the file safely.
Example
bund = load_preprocess_bundle("artifacts/train_bundle.jld2")
@info "levels available: $(keys(bund.levels))"
KeemenaPreprocessing._BundleIO.save_preprocess_bundle — Function
save_preprocess_bundle(bundle, path; format = :jld2, compress = false) -> String

Persist a PreprocessBundle to disk and return the absolute file path written.
Currently the only supported format is :jld2; an error is raised for any other value.
Arguments
| name | type | description |
|---|---|---|
bundle | PreprocessBundle | Object produced by preprocess_corpus. |
path | AbstractString | Destination file name (relative or absolute). Parent directories are created automatically. |
format | Symbol (keyword) | Serialization format. Must be :jld2. |
compress | Bool (keyword) | When false (default) the JLD2 file is written uncompressed for fastest write speed; set to true to enable zlib compression. |
File structure
The JLD2 file stores three top-level keys
| key | value |
|---|---|
"__bundle_version__" | String denoting the package's internal bundle spec. |
"__schema_version__" | string(bundle.metadata.schema_version) |
"bundle" | The full PreprocessBundle instance. |
These headers enable future schema migrations or compatibility checks.
Returns
String - absolute path of the file just written.
Example
p = save_preprocess_bundle(bund, "artifacts/train_bundle.jld2"; compress = false)
@info "bundle saved to $p"
KeemenaPreprocessing._Cleaning.clean_documents — Function
clean_documents(docs, cfg) → Vector{String}

Apply the text-cleaning stage of the Keemena pipeline to every document in docs according to the options held in cfg (a PreprocessConfiguration). The returned vector has the same length and order as docs.
Arguments
| name | type | description |
|---|---|---|
docs | Vector{String} | Raw, unprocessed documents. |
cfg | PreprocessConfiguration | Cleaning directives (lower-casing, URL replacement, emoji handling, …). |
Processing steps
The function runs a fixed sequence of transformations, each guarded by the corresponding flag in cfg:
- Unicode normalisation - normalize_unicode (unicode_normalisation_form).
- HTML stripping - strip_html plus entity decoding (strip_html_tags, html_entity_decode).
- Markdown stripping - strip_markdown (strip_markdown, preserve_md_code).
- Repeated-character squeezing - squeeze_char_runs (squeeze_repeat_chars, max_char_run).
- Unicode confusable mapping - normalize_confusables (map_confusables).
- Emoji handling - _rewrite_emojis (emoji_handling, emoji_sentinel).
- Number replacement - replace_numbers (replace_numbers, plus the keep_* sub-flags and number_sentinel).
- Unicode-to-ASCII punctuation mapping - map_unicode_punctuation (map_unicode_punctuation).
- URL / e-mail replacement - replace_urls_emails (replace_urls, replace_emails, url_sentinel, mail_sentinel, keep_url_scheme).
- Lower-casing - lowercase (lowercase).
- Accent stripping - _strip_accents (strip_accents).
- Control-character removal - regex replace with _CTRL_RE (remove_control_characters).
- Whitespace normalisation - normalize_whitespace (normalise_whitespace, remove_zero_width_chars, collapse_spaces, trim_edges, preserve_newlines). Falls back to strip when only trim_edges is requested.
- Punctuation removal - regex replace with _PUNCT_RE (remove_punctuation).
Every transformation returns a new string; the original input remains unchanged.
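Two of these transformations can be approximated with plain regexes (a rough standalone sketch; not the package's internal helpers):

```julia
# Whitespace normalisation: collapse runs of spaces/tabs, trim the edges.
collapse_ws(s) = strip(replace(s, r"[ \t]+" => " "))

# Repeated-character squeezing: cap any character run at max_run repeats.
squeeze_runs(s; max_run = 3) =
    replace(s, Regex("(.)\\1{$(max_run),}") => m -> m[1]^max_run)

collapse_ws("a   b\t\tc ")    # "a b c"
squeeze_runs("yesssss!")      # "yesss!"
```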
Returns
Vector{String} — cleaned documents ready for tokenisation.
Example
cfg = PreprocessConfiguration(strip_html_tags = true,
replace_urls = true)
clean = clean_documents(["Visit https://example.com!"], cfg)
@info clean[1]   # -> "Visit <URL>"
KeemenaPreprocessing._Vocabulary.build_vocabulary — Function
build_vocabulary(tokens::Vector{String}; cfg::PreprocessConfiguration) -> Vocabulary
build_vocabulary(freqs::Dict{String,Int}; cfg::PreprocessConfiguration) -> Vocabulary
build_vocabulary(stream::Channel{Vector{String}};
                 cfg::PreprocessConfiguration,
                 chunk_size::Int = 500_000) -> Vocabulary

Construct a Vocabulary from token data that may be held entirely in memory, pre-counted, or streamed in batches.
Method overview
- Vector method - accepts a flat vector of token strings.
- Dict method - accepts a dictionary that maps each token string to its corpus frequency.
- Streaming method - accepts a channel that yields token-vector batches so you can build a vocabulary without ever loading the whole corpus at once.
All three methods share the same counting, filtering, and ID-assignment logic; they differ only in how token data are supplied.
Shared argument
cfg - a PreprocessConfiguration that provides:
- minimum_token_frequency;
- the initial special_tokens;
- dynamic sentence markers when record_sentence_offsets is true.
Additional arguments
- tokens - vector of token strings.
- freqs - dictionary from token string to integer frequency.
- stream - channel that produces vectors of token strings.
- chunk_size - number of tokens to buffer before flushing counts (streaming method only).
Processing steps
- Seed specials - copy the special tokens from cfg and insert <BOS>/<EOS> if sentence offsets are recorded.
- Count tokens - accumulate frequencies from the provided data source.
- Filter - discard tokens occurring fewer times than cfg.minimum_token_frequency.
- Assign IDs - assign IDs to specials first (alphabetical order for reproducibility), then to remaining tokens sorted by descending frequency and finally lexicographic order.
- Return - a deterministic Vocabulary containing the token-to-id map, id-to-token table, and token frequencies.
Examples
# From a token vector
tokens = ["the", "red", "fox", ...]
vocab = build_vocabulary(tokens; cfg = config)
# From pre-computed counts
counts = Dict("the" => 523_810, "fox" => 1_234)
vocab = build_vocabulary(counts; cfg = config)
# Streaming large corpora
ch = Channel{Vector{String}}(8) do c
for path in corpus_paths
put!(c, tokenize(read(path, String)))
end
end
vocab = build_vocabulary(ch; cfg = config, chunk_size = 100_000)
KeemenaPreprocessing.add_level! — Method
add_level!(bundle, level, lb) -> PreprocessBundle

Mutating helper that inserts a new LevelBundle lb into bundle.levels under key level. The routine:
- Guards against duplicates - throws an error if level already exists.
- Validates the offsets inside lb.corpus for consistency with the supplied level via validate_offsets.
- Stores the bundle and returns the same bundle instance so the call can be chained.
Be aware that add_level! modifies its first argument in place; if you require an unmodified bundle, keep a copy before calling.
Arguments
| name | type | description |
|---|---|---|
bundle | PreprocessBundle | Target bundle to extend. |
level | Symbol | Identifier for the new segmentation level (e.g. :char, :word). |
lb | LevelBundle | Data + vocabulary for that level. |
Returns
The same bundle, now containing level => lb.
Errors
ArgumentErrorif a level with the same name already exists.- Propagates any error raised by
validate_offsetswhenlb.corpusis inconsistent.
Example
char_bundle = LevelBundle(char_corp, char_vocab)
add_level!(bund, :character, char_bundle)
@assert has_level(bund, :character)
KeemenaPreprocessing.byte_cfg — Method
byte_cfg(; kwargs...) -> PreprocessConfiguration

Shorthand constructor that returns a PreprocessConfiguration pre-configured for byte-level tokenisation.
The wrapper fixes the following fields
- tokenizer_name = :byte
- record_byte_offsets = true
- record_character_offsets = false
- record_word_offsets = false
while forwarding every other keyword argument to PreprocessConfiguration. Use it when building byte-level language-model corpora but still needing the full flexibility to tweak cleaning, vocabulary, or segmentation options:
cfg = byte_cfg(strip_html_tags = true,
               minimum_token_frequency = 5)
KeemenaPreprocessing.get_corpus — Method
get_corpus(bundle, level) -> Corpus

Retrieve the Corpus object for segmentation level level from a PreprocessBundle.
This is equivalent to get_level(bundle, level).corpus and is provided as a convenience helper when you only need the sequence of token-ids and offset tables rather than the whole LevelBundle.
Arguments
- bundle::PreprocessBundle - bundle produced by preprocess_corpus.
- level::Symbol - level identifier such as :byte, :word, :sentence, ...
Returns
The Corpus stored in the requested level.
Errors
Throws an ArgumentError if the level is not present in bundle (see get_level for details).
Example
word_corp = get_corpus(bund, :word)
# iterate over sentences
sent_offs = word_corp.sentence_offsets
for i in 1:length(sent_offs)-1
rng = sent_offs[i] : sent_offs[i+1]-1
println(view(word_corp.token_ids, rng))
end
KeemenaPreprocessing.get_level — Method
get_level(bundle, level) → LevelBundle

Fetch the LevelBundle associated with segmentation level level from a PreprocessBundle.
Arguments
- bundle::PreprocessBundle - bundle returned by preprocess_corpus.
- level::Symbol - identifier such as :byte, :word, :sentence, ...
Returns
The requested LevelBundle.
Errors
Throws an ArgumentError when the level is absent, listing all available levels to aid debugging.
Example
word_bundle = get_level(bund, :word)
println("vocabulary size: ", length(word_bundle.vocabulary.id_to_token_strings))
KeemenaPreprocessing.get_token_ids — Method
get_token_ids(bundle, level) -> Vector{Int}

Return the vector of token-ids for segmentation level level contained in a PreprocessBundle.
Identical to get_corpus(bundle, level).token_ids, but provided as a convenience helper when you only need the raw id sequence and not the full Corpus object.
Arguments
- bundle::PreprocessBundle - bundle produced by preprocess_corpus.
- level::Symbol - segmentation level identifier (e.g. :byte, :word).
Returns
A Vector{Int} whose length equals the number of tokens at that level.
Errors
Throws an ArgumentError if the requested level is absent (see get_level for details).
Example
word_ids = get_token_ids(bund, :word)
println("first ten ids: ", word_ids[1:10])
KeemenaPreprocessing.get_vocabulary — Method
get_vocabulary(bundle, level) -> Vocabulary

Return the Vocabulary associated with segmentation level level (e.g. :byte, :word, :sentence) from a given PreprocessBundle.

Effectively a shorthand for get_level(bundle, level).vocabulary.
Arguments
- bundle::PreprocessBundle - bundle produced by preprocess_corpus.
- level::Symbol - level identifier whose vocabulary you need.
Returns
The Vocabulary stored for level
Errors
Raises an ArgumentError if level is not present in bundle (see get_level for details)
Example
vocab = get_vocabulary(bund, :word)
println("Top-10 tokens: ", vocab.id_to_token_strings[1:10])
KeemenaPreprocessing.has_level — Method
has_level(bundle, level) -> Bool

Return true if the given PreprocessBundle contains a LevelBundle for the segmentation level level (e.g. :byte, :word, :sentence); otherwise return false.
Arguments
- bundle::PreprocessBundle - bundle to inspect.
- level::Symbol - level identifier to look for.
Example
julia> has_level(bund, :word)
true
KeemenaPreprocessing.preprocess_corpus — Method
preprocess_corpus(sources, cfg; save_to = nothing) -> PreprocessBundle

Variant of preprocess_corpus that accepts an already constructed PreprocessConfiguration and therefore bypasses all keyword aliasing and default-override logic.
Use this when you have prepared a configuration object up-front (e.g. loaded from disk, shared across jobs, or customised in a function) and want to run the pipeline with those exact settings.
Arguments
| name | type | description |
|---|---|---|
sources | AbstractString, Vector{<:AbstractString}, iterable | One or more file paths, URLs, directories (ignored), or in-memory text strings. |
cfg | PreprocessConfiguration | Fully-specified configuration controlling every cleaning/tokenisation option. |
save_to | String or nothing (default) | If non-nothing, the resulting bundle is additionally serialised (e.g. via JLD2) to the given file path; otherwise nothing is written to disk. The bundle is returned either way. |
Pipeline (unchanged)
- Load raw sources.
- Clean text based on
cfgflags. - Tokenise & segment; record requested offsets.
- Build vocabulary obeying
minimum_token_frequency,special_tokens, ... - Pack everything into a
PreprocessBundle. Optionally persist.
Returns
A PreprocessBundle populated with corpora, vocabularies, alignments, metadata, and (by default) empty extras.
Example
cfg = PreprocessConfiguration(strip_markdown = true,
tokenizer_name = :unicode)
bund = preprocess_corpus(["doc1.txt", "doc2.txt"], cfg;
                       save_to = "unicode_bundle.jld2")
note: If you do not have a configuration object yet, call the keyword-only version instead: preprocess_corpus(sources; kwargs...), which will create a default configuration and apply any overrides you provide.
KeemenaPreprocessing.preprocess_corpus — Method
preprocess_corpus(sources; save_to = nothing,
config = nothing,
                  kwargs...) -> PreprocessBundle
End-to-end convenience wrapper that loads raw texts, cleans them, tokenises, builds a vocabulary, records offsets, and packs the result into a PreprocessBundle.
The routine can be invoked in two mutually-exclusive ways:
- Explicit configuration — supply your own PreprocessConfiguration through the config= keyword.
- Ad-hoc keyword overrides — omit config and pass any subset of the configuration keywords directly (e.g. lowercase = false, tokenizer_name = :unicode). Internally a fresh PreprocessConfiguration(; kwargs...) is created from those overrides plus the documented defaults, so calling preprocess_corpus(sources) with no keywords at all runs the pipeline using the default settings.
note: Passing both config= and per-field keywords is an error because it would lead to ambiguous intent.
Arguments
| name | type | description |
|---|---|---|
sources | AbstractString, Vector{<:AbstractString}, or iterable | Either one or more file paths/URLs that will be read, directories (silently skipped), or in-memory strings treated as raw text. |
save_to | String or nothing (default) | If a path is given the resulting bundle is serialised (JLD2) to disk and returned; otherwise nothing is written. |
config | PreprocessConfiguration or nothing | Pre-constructed configuration object. When nothing (default), a new one is built from kwargs.... |
kwargs... | see PreprocessConfiguration | Per-field overrides that populate a fresh configuration when config is nothing. |
Pipeline stages
- Loading - files/URLs are fetched; directory entries are ignored.
- Cleaning - controlled by the configuration's cleaning toggles.
- Tokenisation & segmentation - produces token ids and offset tables.
- Vocabulary building - applies minimum_token_frequency and inserts special tokens.
- Packaging - returns a PreprocessBundle; if save_to was given, the same bundle is persisted to that path.
Returns
A fully-populated PreprocessBundle.
Examples
# 1. Quick start with defaults
bund = preprocess_corpus("corpus.txt")
# 2. Fine-grained control via keyword overrides
bund = preprocess_corpus(["doc1.txt", "doc2.txt"];
strip_html_tags = true,
tokenizer_name = :unicode,
minimum_token_frequency = 3)
# 3. Supply a hand-crafted configuration object
cfg = PreprocessConfiguration(strip_markdown = true,
record_sentence_offsets = false)
bund = preprocess_corpus("input/", config = cfg, save_to = "bundle.jld2")
KeemenaPreprocessing.preprocess_corpus_streaming — Method
preprocess_corpus_streaming(srcs;
cfg = PreprocessConfiguration(),
vocab = nothing,
                            chunk_tokens = DEFAULT_CHUNK_TOKENS) -> Channel{PreprocessBundle}
Low-memory, two-pass variant of preprocess_corpus that yields a stream of PreprocessBundles via a Channel. Each bundle covers ≈ chunk_tokens worth of tokens, letting you pipeline huge corpora through training code without ever loading the whole dataset into RAM.
Workflow
- Vocabulary pass (optional). If vocab === nothing, the function first computes global token-frequency counts in a constant-memory scan (_streaming_counts) and builds a vocabulary with build_vocabulary(freqs; cfg). If you already possess a fixed vocabulary (e.g. for fine-tuning), supply it through the vocab keyword to skip this pass.
- Chunking iterator. A background task produced by doc_chunk_iterator groups raw source documents into slices whose estimated size does not exceed chunk_tokens.
- Per-chunk pipeline. For every chunk the following steps mirror the standard pipeline: clean_documents, tokenize_and_segment, assemble_bundle, build_ensure_alignments!. The resulting bundle is put! onto the channel.
Arguments
| name | type | description |
|---|---|---|
srcs | iterable of AbstractString | File paths, URLs, or raw texts. |
cfg | PreprocessConfiguration | Cleaning/tokenisation settings (default: fresh object). |
vocab | Vocabulary or nothing | Pre-existing vocabulary; when nothing it is inferred in pass 1. |
chunk_tokens | Int | Soft cap on tokens per chunk (default = DEFAULT_CHUNK_TOKENS). |
Returns
A channel of type Channel{PreprocessBundle}. Consume it with foreach, for bundle in ch, or take!(ch).
ch = preprocess_corpus_streaming("large_corpus/*";
cfg = PreprocessConfiguration(strip_html_tags=true),
chunk_tokens = 250_000)
for bund in ch # streaming training loop
update_model!(bund) # user-defined function
end
note: The channel is unbuffered, so each bundle is produced only when the consumer is ready, minimising peak memory consumption.
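The vocabulary pass described above can be skipped entirely when a vocabulary already exists; a minimal sketch, assuming `base_bund` is a bundle from an earlier run whose :word vocabulary is compatible with the new sources:

```julia
# Reuse an existing vocabulary so the streaming pipeline skips pass 1
vocab = get_vocabulary(base_bund, :word)
ch = preprocess_corpus_streaming(["new_shard.txt"];
                                 cfg   = PreprocessConfiguration(),
                                 vocab = vocab)
for bund in ch
    update_model!(bund)   # user-defined consumer, as in the example above
end
```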
KeemenaPreprocessing.preprocess_corpus_streaming_chunks — Method
preprocess_corpus_streaming_chunks(srcs; kwargs...) -> Vector{PreprocessBundle}
Run the streaming pipeline once, eagerly consume the channel, and return a Vector whose i-th entry is the PreprocessBundle covering chunk i.
Identical keyword interface to preprocess_corpus_streaming; all arguments are forwarded unchanged.
Use when you want chunked artefacts (e.g. sharding a massive corpus across GPUs) but prefer a materialised vector instead of an explicit Channel.
bundles = preprocess_corpus_streaming_chunks("wiki_xml/*";
chunk_tokens = 250_000,
strip_html_tags = true)
@info "produced $(length(bundles)) bundles"
KeemenaPreprocessing.preprocess_corpus_streaming_full — Method
preprocess_corpus_streaming_full(srcs; kwargs...) -> PreprocessBundle
Run the streaming pipeline, merge every chunk on the fly, and return one single PreprocessBundle that spans the entire corpus.
All keyword arguments are forwarded to preprocess_corpus_streaming. Throws when chunks were built with incompatible vocabularies.
bund = preprocess_corpus_streaming_full(["en.txt", "de.txt"];
minimum_token_frequency = 5)
println("corpus length: ", length(get_token_ids(bund, :word)))
KeemenaPreprocessing.with_extras — Method
with_extras(original, new_extras) -> PreprocessBundle
Create a shallow copy of original where only the extras field is replaced by new_extras. All other components (levels, metadata, alignments) are shared by reference, so the operation is cheap and the returned bundle remains consistent with the source.
Useful when you have performed post-processing (e.g. dimensionality reduction, cluster assignments, per-document labels) and want to attach the results without mutating the original bundle in place.
Arguments
| name | type | description |
|---|---|---|
original | PreprocessBundle | Bundle produced by preprocess_corpus. |
new_extras | Any | Arbitrary payload to store under bundle.extras. |
Returns
A new PreprocessBundle{typeof(new_extras)} identical to original except that extras == new_extras.
Example
labels = collect(kmeans(doc_embeddings, 50).assignments)
labeled = with_extras(bund, labels)
@assert labeled.levels === bund.levels # same reference
@assert labeled.extras === labels # updated payload
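Because new_extras can be any value, a NamedTuple is a convenient way to attach several artefacts at once; a minimal sketch, assuming `bund` is an existing bundle and the labels are hypothetical:

```julia
# Bundle several post-processing artefacts under one NamedTuple payload
doc_labels = rand(1:50, 1_000)                     # hypothetical cluster labels
bund2 = with_extras(bund, (labels = doc_labels, k = 50))
bund2.extras.labels === doc_labels                 # fields remain accessible
```

The returned bundle's type parameter then reflects the payload, i.e. PreprocessBundle{typeof((labels = ..., k = ...))}.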