Public API

KeemenaPreprocessing.TOKENIZERS - Constant

TOKENIZERS

A constant tuple of Symbols listing the names of the built-in tokenizers that can be passed to the tokenizer_name keyword of PreprocessConfiguration.

Currently supported values are

  • :whitespace - split on Unicode whitespace;
  • :unicode - iterate user-perceived graphemes (eachgrapheme);
  • :byte - treat the text as raw bytes (byte-level models);
  • :char - split on individual UTF-8 code units.

You may also supply any callable that implements mytokens = f(string) in place of one of these symbols.
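For instance, a minimal custom tokenizer is just a function from a string to a vector of tokens; the splitting rule below is invented for illustration:

```julia
# Hypothetical custom tokenizer: any callable of the form f(string) -> tokens
# can stand in for one of the built-in tokenizer symbols.
comma_tokenizer(s::AbstractString) = split(s, ','; keepempty = false)

comma_tokenizer("a,b,,c")   # 3-element vector: "a", "b", "c"
```

Such a callable would then be supplied as tokenizer_name = comma_tokenizer.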

source
KeemenaPreprocessing.Corpus - Type
Corpus

Flat, memory-efficient container that stores an entire corpus of token-ids together with optional hierarchical offset tables that recover the original structure (documents → paragraphs → sentences → words → characters → bytes).

Every offset vector records the starting index (1-based, inclusive) of each unit inside token_ids. The final entry therefore equals length(token_ids)+1, making range retrieval convenient via view(token_ids, offsets[i] : offsets[i+1]-1).

Fields

| field | type | always present? | description |
|---|---|---|---|
| token_ids | Vector{Int} | yes | Concatenated token identifiers returned by the vocabulary. |
| document_offsets | Vector{Int} | yes | Start positions of each document (outermost level). |
| paragraph_offsets | Union{Vector{Int},Nothing} | cfg-dependent | Paragraph starts within each document when record_paragraph_offsets=true. |
| sentence_offsets | Union{Vector{Int},Nothing} | cfg-dependent | Sentence boundaries when record_sentence_offsets=true. |
| word_offsets | Union{Vector{Int},Nothing} | cfg-dependent | Word boundaries when record_word_offsets=true. |
| character_offsets | Union{Vector{Int},Nothing} | cfg-dependent | Unicode-character spans when record_character_offsets=true. |
| byte_offsets | Union{Vector{Int},Nothing} | cfg-dependent | Byte-level spans when record_byte_offsets=true. |

Example

# assume `corp` is a Corpus produced by preprocess_corpus
doc1_range = corp.document_offsets[1] : corp.document_offsets[2]-1
doc1_token_ids = view(corp.token_ids, doc1_range)

if corp.sentence_offsets !== nothing
    first_sentence = view(corp.token_ids,
                          corp.sentence_offsets[1] : corp.sentence_offsets[2]-1)
end

The presence or absence of each optional offsets vector is determined entirely by the corresponding record_*_offsets flags in PreprocessConfiguration.

source
KeemenaPreprocessing.CrossMap - Type
CrossMap

Alignment table that links two segmentation levels of the same corpus (e.g. bytes -> characters, characters -> words, words -> sentences).

For every unit in the destination level the alignment vector stores the 1-based index into the source offsets at which that unit begins. This allows constant-time projection of any span expressed in destination units back to the finer-grained source sequence.

Fields

  • source_level :: Symbol Name of the finer level (must match a key in bundle.levels, typically :byte, :char, :word, :sentence, or :paragraph).

  • destination_level :: Symbol Name of the coarser level whose boundaries are encoded.

  • alignment :: Vector{Int} Length = N_destination + 1. alignment[i] is the starting source-level offset of destination element i; the extra sentinel entry alignment[end] = N_source + 1 lets you slice with alignment[i] : alignment[i+1]-1 without bounds checks.

Example

# map words ⇒ sentences
m = CrossMap(:word, :sentence, sent2word_offsets)

first_sentence_word_ids = alignment_view(m, 1)  # helper returning a view

The constructor is trivial and performs no validation; pipelines are expected to guarantee consistency when emitting CrossMap objects.
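As a numeric illustration of the sentinel convention (values invented), span slicing stays uniform for every destination element:

```julia
# Hypothetical alignment vector for 3 words starting at source (character)
# offsets 1, 5 and 9; the sentinel 14 equals N_source + 1.
alignment = [1, 5, 9, 14]

# Span of destination element i in source units: alignment[i] : alignment[i+1]-1
second_word_chars = alignment[2] : alignment[3] - 1   # 5:8
```

Wrapped in the type above, this would be CrossMap(:char, :word, alignment).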

source
KeemenaPreprocessing.CrossMap - Method
CrossMap(src, dst, align)

Shorthand outer constructor that builds a CrossMap while materialising the alignment vector as Vector{Int}.

Arguments

  • src::Symbol - identifier of the source (finer-grained) level (e.g. :char, :word).

  • dst::Symbol - identifier of the destination (coarser) level (e.g. :word, :sentence).

  • align::AbstractVector{<:Integer} - offset array mapping every destination unit to its starting position in the source sequence. Any integer-typed vector is accepted; it is copied into a dense Vector{Int} to guarantee contiguous storage and type stability inside the resulting CrossMap.

Returns

A CrossMap(src, dst, Vector{Int}(align)).

Example

cm = CrossMap(:char, :word, UInt32[1, 5, 9, 14])
@assert cm.alignment isa Vector{Int}
source
KeemenaPreprocessing.LevelBundle - Type
LevelBundle

Self-contained pairing of a Corpus and its companion Vocabulary. A LevelBundle represents one segmentation level (e.g. words, characters, or bytes) produced by the preprocessing pipeline. By storing both objects side-by-side it guarantees that every token_id found in corpus.token_ids is valid according to vocabulary.

Fields

  • corpus :: Corpus All token-ids plus optional offset tables describing the structure of the text at this level.

  • vocabulary :: Vocabulary Bidirectional mapping between token strings and the integer ids used in corpus.token_ids.

Integrity checks

The inner constructor performs two runtime validations:

  1. Range check - the largest token-id must not exceed length(vocabulary.id_to_token_strings).
  2. Lower bound - all token-ids must be >= 1 (id 0 is never legal).

Violations raise an informative ArgumentError, catching mismatches early.
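Conceptually the two validations reduce to a bounds test over the id vector; a package-independent sketch (a hypothetical stand-in, not the actual constructor code):

```julia
# Hypothetical helper mirroring the LevelBundle integrity checks above.
function check_token_ids(token_ids::Vector{Int}, vocab_size::Int)
    isempty(token_ids) && return true
    maximum(token_ids) <= vocab_size ||
        throw(ArgumentError("token-id exceeds vocabulary size $vocab_size"))
    minimum(token_ids) >= 1 ||
        throw(ArgumentError("token-ids must be >= 1 (id 0 is never legal)"))
    return true
end

check_token_ids([1, 3, 2], 3)   # passes
# check_token_ids([0, 1], 3) would throw an ArgumentError
```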

Example

word_corpus  = Corpus(word_ids, doc_offs, nothing, sent_offs, word_offs,
                      nothing, nothing)
word_vocab   = build_vocabulary(words; minimum_token_frequency = 2)

word_bundle  = LevelBundle(word_corpus, word_vocab)

nb_tokens    = length(word_bundle.vocabulary.id_to_token_strings)
@info "bundle contains $nb_tokens unique tokens"
source
KeemenaPreprocessing.PipelineMetadata - Type
PipelineMetadata

Compact header bundled with every artefact produced by KeemenaPreprocessing. It records the exact pipeline settings and the version of the on-disk schema so that data can be re-processed, inspected, or migrated safely.

Fields

  • configuration::PreprocessConfiguration The full set of cleaning, tokenisation, vocabulary, and offset-recording options that generated the artefact. Storing this ensures strict reproducibility.

  • schema_version::VersionNumber The version of the bundle file format (not the Julia package). Increment the major component when breaking changes are introduced so that loaders can detect incompatibilities and perform migrations or raise errors.

Example

cfg  = PreprocessConfiguration(strip_html_tags = true)
meta = PipelineMetadata(cfg, v"1.0.0")

@info "tokeniser:" meta.configuration.tokenizer_name
@assert meta.schema_version >= v"1.0.0"
source
KeemenaPreprocessing.PipelineMetadata - Method
PipelineMetadata() -> PipelineMetadata

Convenience constructor that returns a metadata header with

  • the default PreprocessConfiguration() (all keyword-arguments left at their documented defaults); and
  • the current bundle schema version v"1.0.0".

Handy for rapid prototyping or unit tests when you do not need to customise the pipeline but still require a valid PipelineMetadata object.

Identical to:

PipelineMetadata(PreprocessConfiguration(), v"1.0.0")
source
KeemenaPreprocessing.PreprocessBundle - Type
PreprocessBundle{ExtraT}

Top-level artefact emitted by preprocess_corpus (or the streaming variant). A bundle contains everything required to feed a downstream model or to reload a corpus without re-running the expensive preprocessing pipeline.

Type parameter

  • ExtraT - arbitrary payload for user-defined information (e.g. feature matrices, clustering assignments, language tags). Use Nothing when no extras are needed.

Fields

| field | type | description |
|---|---|---|
| levels | Dict{Symbol,LevelBundle} | Mapping from segmentation level name (:byte, :char, :word, :sentence, :paragraph, …) to the corresponding LevelBundle. |
| metadata | PipelineMetadata | Reproducibility header (configuration + schema version). |
| alignments | Dict{Tuple{Symbol,Symbol},CrossMap} | Pair-wise offset projections between levels, keyed as (source, destination) (e.g. (:char, :word)). |
| extras | ExtraT | Optional user payload carried alongside the core data. |

Typical workflow

bund = preprocess_corpus(files; strip_html_tags=true)

# inspect vocabulary
word_vocab = bund.levels[:word].vocabulary
println("vocabulary size: ", length(word_vocab.id_to_token_strings))

# project a sentence span back to character offsets
cm = bund.alignments[(:char, :sentence)]
first_sentence_char_span = cm.alignment[1] : cm.alignment[2]-1

The PreprocessBundle struct itself is immutable; to add levels or extras, use the helper functions provided by the package (add_level!, with_extras, etc.) or create a fresh instance.

source
KeemenaPreprocessing.PreprocessBundle - Method
PreprocessBundle(levels; metadata = PipelineMetadata(),
                      alignments = Dict{Tuple{Symbol,Symbol},CrossMap}(),
                      extras = nothing) -> PreprocessBundle

Outer constructor that validates and assembles the individual artefacts generated by KeemenaPreprocessing into a single PreprocessBundle.

Required argument

  • levels::Dict{Symbol,<:LevelBundle} - at least one segmentation level (keyed by level name such as :word or :char).

Optional keyword arguments

| keyword | default | purpose |
|---|---|---|
| metadata | PipelineMetadata() | Configuration & schema header. |
| alignments | empty Dict | Maps (source, destination) -> CrossMap. |
| extras | nothing | User-supplied payload propagated unchanged. |

Runtime checks

  1. Non-empty levels.
  2. For each (lvl, lb) in levels run validate_offsets(lb.corpus, lvl) to ensure internal offset consistency.
  3. For every supplied alignment (src,dst) → cm:
    • both src and dst must exist in levels;
    • length(cm.alignment) == length(levels[src].corpus.token_ids);
    • cm.source_level == src;
    • cm.destination_level == dst.

Any violation throws an informative ArgumentError.

Returns

A fully-validated PreprocessBundle{typeof(extras)} containing: Dict(levels), metadata, Dict(alignments), and extras.

Example

word_bundle = LevelBundle(word_corpus, word_vocab)
char_bundle = LevelBundle(char_corpus, char_vocab)

bund = PreprocessBundle(Dict(:word=>word_bundle, :char=>char_bundle);
                        alignments = Dict((:char,:word)=>char2word_map))
source
KeemenaPreprocessing.PreprocessBundle - Method
PreprocessBundle(; metadata = PipelineMetadata(), extras = nothing) -> PreprocessBundle

Convenience constructor that produces an empty PreprocessBundle:

  • levels = Dict{Symbol,LevelBundle}()
  • alignments = Dict{Tuple{Symbol,Symbol},CrossMap}()
  • metadata = metadata (defaults to PipelineMetadata())
  • extras = extras (defaults to nothing)

Useful when you want to build a bundle incrementally (for example, loading individual levels from disk or generating them in separate jobs) while still attaching a common metadata header or arbitrary user payload.

bund = PreprocessBundle()                      # blank skeleton
bund = merge(bund, load_word_level("word.jld"))  # pseudo-code for adding data

The returned object's type parameter is inferred from extras so that any payload, including complex structs, can be stored without further boilerplate.

source
KeemenaPreprocessing.PreprocessConfiguration - Method
PreprocessConfiguration(; kwargs...) -> PreprocessConfiguration

Create a fully-specified preprocessing configuration.

All keyword arguments are optional; sensible defaults are provided so that cfg = PreprocessConfiguration() already yields a working pipeline. Options are grouped below by the stage they affect.

Cleaning stage toggles

| keyword | default | purpose |
|---|---|---|
| lowercase | true | Convert letters to lower-case. |
| strip_accents | true | Remove combining accent marks. |
| remove_control_characters | true | Drop Unicode Cc/Cf code-points. |
| remove_punctuation | true | Strip punctuation & symbol characters. |
| normalise_whitespace | true | Collapse consecutive whitespace. |
| remove_zero_width_chars | true | Remove zero-width joiners, etc. |
| preserve_newlines | true | Keep explicit line breaks. |
| collapse_spaces | true | Collapse runs of spaces/tabs. |
| trim_edges | true | Strip leading/trailing whitespace. |

URL, e-mail & numbers

| keyword | default | purpose |
|---|---|---|
| replace_urls | true | Replace URLs with url_sentinel. |
| replace_emails | true | Replace e-mails with mail_sentinel. |
| keep_url_scheme | false | Preserve http:// / https:// prefix. |
| url_sentinel | "<URL>" | Token inserted for each URL. |
| mail_sentinel | "<EMAIL>" | Token inserted for each e-mail. |
| replace_numbers | false | Replace numbers with number_sentinel. |
| number_sentinel | "<NUM>" | Token used when replacing numbers. |
| keep_number_decimal | false | Preserve decimal part. |
| keep_number_sign | false | Preserve ± sign. |
| keep_number_commas | false | Preserve thousands separators. |

Mark-up & HTML

| keyword | default | purpose |
|---|---|---|
| strip_markdown | false | Remove Markdown formatting. |
| preserve_md_code | true | Keep fenced/inline code while stripping. |
| strip_html_tags | false | Remove HTML/XML tags. |
| html_entity_decode | true | Decode &amp;, &quot;, etc. |

Emoji & Unicode

| keyword | default | purpose |
|---|---|---|
| emoji_handling | :keep | :keep, :remove, or :sentinel. |
| emoji_sentinel | "<EMOJI>" | Used when emoji_handling == :sentinel. |
| squeeze_repeat_chars | false | Limit repeated character runs. |
| max_char_run | 3 | Maximum run length when squeezing. |
| map_confusables | false | Map visually-confusable chars. |
| unicode_normalisation_form | :none | :NFC, :NFD, :NFKC, :NFKD, or :none. |
| map_unicode_punctuation | false | Replace Unicode punctuation with ASCII. |

Tokenisation

| keyword | default | purpose |
|---|---|---|
| tokenizer_name | :whitespace | One of TOKENIZERS or a callable. |
| preserve_empty_tokens | false | Keep zero-length tokens. |

Vocabulary construction

| keyword | default | purpose |
|---|---|---|
| minimum_token_frequency | 1 | Discard rarer tokens / map to <UNK>. |
| special_tokens | Dict(:unk=>"<UNK>", :pad=>"<PAD>") | Role => literal mapping. |

Offset recording

| keyword | default | purpose |
|---|---|---|
| record_byte_offsets | false | Record byte-level spans. |
| record_character_offsets | false | Record Unicode-char offsets. |
| record_word_offsets | true | Record word offsets. |
| record_sentence_offsets | true | Record sentence offsets. |
| record_paragraph_offsets | false | Record paragraph offsets (forces preserve_newlines = true). |
| record_document_offsets | true | Record document offsets. |

Returns

A fully-initialised PreprocessConfiguration instance. Invalid combinations raise AssertionError (e.g. unsupported tokenizer) and certain settings emit warnings when they imply other flags (e.g. paragraph offsets -> preserve_newlines).

See also: TOKENIZERS and byte_cfg for a pre-canned byte-level configuration.
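A configuration touching several of the groups above might look like this (keyword names as documented; the particular combination is illustrative):

```julia
# Illustrative configuration mixing cleaning, tokenisation, vocabulary,
# and offset-recording options; all other keywords keep their defaults.
cfg = PreprocessConfiguration(lowercase                = false,
                              replace_numbers          = true,
                              strip_html_tags          = true,
                              tokenizer_name           = :unicode,
                              minimum_token_frequency  = 2,
                              record_character_offsets = true)
```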

source
KeemenaPreprocessing.Vocabulary - Type
Vocabulary

Immutable lookup table produced by build_vocabulary that maps between integer token-ids and the string literals that appear in a corpus.

Fields

  • id_to_token_strings::Vector{String} Position i holds the canonical surface form of token-id i (vocab.id_to_token_strings[id] == "word").

  • token_to_id_map::Dict{String,Int} Fast reverse mapping from token string to its integer id (vocab.token_to_id_map["word"] == id). Look-ups fall back to the <UNK> id when the string is absent.

  • token_frequencies::Vector{Int} Corpus counts aligned with id_to_token_strings (token_frequencies[id] gives the raw frequency of that token).

  • special_tokens::Dict{Symbol,Int} Set of reserved ids for sentinel symbols such as :unk, :pad, :bos, :eos, … Keys are roles (Symbol); values are the corresponding integer ids.

Usage example

vocab = build_vocabulary(tokens; minimum_token_frequency = 3)

@info "UNK id:    " vocab.special_tokens[:unk]
@info "«hello» id:" vocab.token_to_id_map["hello"]
@info "id → token:" vocab.id_to_token_strings[42]
source
KeemenaPreprocessing._Alignment.alignment_byte_to_word - Function
alignment_byte_to_word(byte_c, word_c) -> CrossMap

Construct a byte -> word CrossMap that projects each byte index in byte_c onto the word index in word_c that contains it.

Preconditions

  • byte_c must have a non-nothing byte_offsets vector (checked via the private helper _require_offsets).
  • word_c must have a non-nothing word_offsets vector.
  • Both corpora must span the same token range byte_offsets[end] == word_offsets[end]; otherwise an ArgumentError is thrown.

Arguments

| name | type | description |
|---|---|---|
| byte_c | Corpus | Corpus tokenised at the byte level. |
| word_c | Corpus | Corpus tokenised at the word level. |

Algorithm

  1. Retrieve the sentinel-terminated offset vectors bo = byte_c.byte_offsets and wo = word_c.word_offsets.
  2. Allocate b2w :: Vector{Int}(undef, n_bytes) where n_bytes = length(bo) - 1.
  3. For each word index w_idx fill the slice wo[w_idx] : wo[w_idx+1]-1 with w_idx, thereby assigning every byte position to the word that begins at wo[w_idx].
  4. Return CrossMap(:byte, :word, b2w).

The output vector has length n_bytes (no sentinel) because every byte token receives one word identifier.
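Steps 2-3 can be sketched in isolation (a hypothetical, package-independent helper; the real routine additionally performs the _require_offsets checks):

```julia
# Sketch of the fill loop: bo and wo are the sentinel-terminated byte and
# word offset vectors described above.
function byte_to_word_fill(bo::Vector{Int}, wo::Vector{Int})
    bo[end] == wo[end] || throw(ArgumentError("corpora span different ranges"))
    b2w = Vector{Int}(undef, length(bo) - 1)          # one entry per byte
    for w_idx in 1:length(wo) - 1
        b2w[wo[w_idx] : wo[w_idx + 1] - 1] .= w_idx   # bytes of word w_idx
    end
    return b2w
end

byte_to_word_fill([1, 2, 3, 4, 5, 6], [1, 3, 6])   # [1, 1, 2, 2, 2]
```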

Returns

A CrossMap whose fields are:

source_level      == :byte
destination_level == :word
alignment         :: Vector{Int}  # length = n_bytes

Errors

  • ArgumentError if either corpus lacks the necessary offsets.
  • ArgumentError when the overall spans differ.

Example

b2w = alignment_byte_to_word(byte_corpus, word_corpus)
word_index_of_42nd_byte = b2w.alignment[42]
source
KeemenaPreprocessing._Assemble.assemble_bundle - Function
assemble_bundle(tokens, offsets, vocab, cfg) -> PreprocessBundle

Convert the token-level artefacts produced by tokenize_and_segment into a minimal yet fully valid PreprocessBundle. The function

  1. Projects each token to its integer id using vocab; unknown strings are mapped to the :unk special (throws if the vocabulary lacks one).

  2. Packs the id sequence together with the requested offset tables into a Corpus.

  3. Wraps that corpus and its vocabulary in a LevelBundle whose key is inferred from cfg.tokenizer_name:

    | tokenizer_name value | level symbol stored |
    |---|---|
    | :byte | :byte |
    | :char | :character |
    | :unicode, :whitespace | :word |
    | a Function | Symbol(typeof(fn)) |
    | any other Symbol | the same symbol |
  4. Builds a default PipelineMetadata header (PipelineMetadata(cfg, v"1.0.0")).

  5. Returns a PreprocessBundle containing exactly one level, empty alignments, and extras = nothing.

Arguments

| name | type | description |
|---|---|---|
| tokens | Vector{<:Union{String,UInt8}} | Flattened token stream. |
| offsets | Dict{Symbol,Vector{Int}} | Start indices for each recorded level (as returned by tokenize_and_segment). |
| vocab | Vocabulary | Token <-> id mapping (must contain :unk). |
| cfg | PreprocessConfiguration | Determines the level key and special-token requirements. |

Returns

PreprocessBundle with

bundle.levels       == Dict(level_key => LevelBundle(corpus, vocab))
bundle.metadata     == PipelineMetadata(cfg, v"1.0.0")
bundle.alignments   == Dict{Tuple{Symbol,Symbol},CrossMap}()   # empty
bundle.extras       == nothing

Errors

  • Throws ArgumentError if vocab lacks the :unk special.
  • Propagates any error raised by the inner constructors of Corpus or LevelBundle (e.g. offset inconsistencies).

Example

tokens, offs = tokenize_and_segment(docs, cfg)
vocab         = build_vocabulary(tokens; cfg = cfg)
bund          = assemble_bundle(tokens, offs, vocab, cfg)

@info keys(bund.levels)  # (:word,) for whitespace tokenizer
source
KeemenaPreprocessing._BundleIO.load_preprocess_bundle - Function
load_preprocess_bundle(path; format = :jld2) -> PreprocessBundle

Load a previously-saved PreprocessBundle from disk.

The function currently understands the JLD2 wire-format written by save_preprocess_bundle. It performs a lightweight header check to ensure the on-disk bundle version is not newer than the library version linked at run-time, helping you avoid silent incompatibilities after package upgrades.

Arguments

| name | type | description |
|---|---|---|
| path | AbstractString | File name (relative or absolute) pointing to the bundle on disk. |
| format | Symbol (keyword) | Serialization format. Only :jld2 is accepted; any other value raises an error. |

Returns

PreprocessBundle - the exact object originally passed to save_preprocess_bundle, including all levels, alignments, metadata, and extras.

Errors

  • ArgumentError - if path does not exist.
  • ArgumentError - if format ≠ :jld2.
  • ErrorException - when the bundle's persisted version is newer than the library's internal _BUNDLE_VERSION, signalling that your local code may be too old to read the file safely.

Example

bund = load_preprocess_bundle("artifacts/train_bundle.jld2")

@info "levels available: $(keys(bund.levels))"
source
KeemenaPreprocessing._BundleIO.save_preprocess_bundle - Function
save_preprocess_bundle(bundle, path; format = :jld2, compress = false) -> String

Persist a PreprocessBundle to disk and return the absolute file path written.

Currently the only supported format is :jld2; an error is raised for any other value.

Arguments

| name | type | description |
|---|---|---|
| bundle | PreprocessBundle | Object produced by preprocess_corpus. |
| path | AbstractString | Destination file name (relative or absolute). Parent directories are created automatically. |
| format | Symbol (keyword) | Serialization format. Must be :jld2. |
| compress | Bool (keyword) | When false (default), the JLD2 file is written without zlib compression for fastest write speed; when true, compression is applied. |

File structure

The JLD2 file stores three top-level keys

| key | value |
|---|---|
| "__bundle_version__" | String denoting the package's internal bundle spec. |
| "__schema_version__" | string(bundle.metadata.schema_version) |
| "bundle" | The full PreprocessBundle instance. |

These headers enable future schema migrations or compatibility checks.

Returns

String - absolute path of the file just written.

Example

p = save_preprocess_bundle(bund, "artifacts/train_bundle.jld2"; compress = false)
@info "bundle saved to $p"
source
KeemenaPreprocessing._Cleaning.clean_documents - Function
clean_documents(docs, cfg) → Vector{String}

Apply the text-cleaning stage of the Keemena pipeline to every document in docs according to the options held in cfg (a PreprocessConfiguration). The returned vector has the same length and order as docs.

Arguments

| name | type | description |
|---|---|---|
| docs | Vector{String} | Raw, unprocessed documents. |
| cfg | PreprocessConfiguration | Cleaning directives (lower-casing, URL replacement, emoji handling, …). |

Processing steps

The function runs a fixed sequence of transformations, each guarded by the corresponding flag in cfg:

  1. Unicode normalisation normalize_unicode (unicode_normalisation_form).
  2. HTML stripping strip_html + entity decoding (strip_html_tags, html_entity_decode).
  3. Markdown stripping strip_markdown (strip_markdown, preserve_md_code).
  4. Repeated-character squeezing squeeze_char_runs (squeeze_repeat_chars, max_char_run).
  5. Unicode confusable mapping normalize_confusables (map_confusables).
  6. Emoji handling _rewrite_emojis (emoji_handling, emoji_sentinel).
  7. Number replacement replace_numbers (replace_numbers, plus the keep_* sub-flags and number_sentinel).
  8. Unicode-to-ASCII punctuation mapping map_unicode_punctuation (map_unicode_punctuation).
  9. URL / e-mail replacement replace_urls_emails (replace_urls, replace_emails, url_sentinel, mail_sentinel, keep_url_scheme).
  10. Lower-casing lowercase (lowercase).
  11. Accent stripping _strip_accents (strip_accents).
  12. Control-character removal regex replace with _CTRL_RE (remove_control_characters).
  13. Whitespace normalisation normalize_whitespace (normalise_whitespace, remove_zero_width_chars, collapse_spaces, trim_edges, preserve_newlines). Falls back to strip when only trim_edges is requested.
  14. Punctuation removal regex replace with _PUNCT_RE (remove_punctuation).

Every transformation returns a new string; the original input remains unchanged.

Returns

Vector{String} — cleaned documents ready for tokenisation.

Example

cfg  = PreprocessConfiguration(strip_html_tags = true,
                               replace_urls    = true)
clean = clean_documents(["Visit https://example.com!"], cfg)
@info clean[1]   # -> "Visit <URL>"
source
KeemenaPreprocessing._Vocabulary.build_vocabulary - Function
build_vocabulary(tokens::Vector{String}; cfg::PreprocessConfiguration) -> Vocabulary
build_vocabulary(freqs::Dict{String,Int}; cfg::PreprocessConfiguration) -> Vocabulary
build_vocabulary(stream::Channel{Vector{String}};
                 cfg::PreprocessConfiguration,
                 chunk_size::Int = 500_000) -> Vocabulary

Construct a Vocabulary from token data that may be held entirely in memory, pre-counted, or streamed in batches.

Method overview

  • Vector method - accepts a flat vector of token strings.
  • Dict method - accepts a dictionary that maps each token string to its corpus frequency.
  • Streaming method - accepts a channel that yields token-vector batches so you can build a vocabulary without ever loading the whole corpus at once.

All three methods share the same counting, filtering, and ID-assignment logic; they differ only in how token data are supplied.

Shared argument

  • cfg - a PreprocessConfiguration that provides
    • minimum_token_frequency
    • initial special_tokens
    • dynamic sentence markers when record_sentence_offsets is true.

Additional arguments

  • tokens - vector of token strings.
  • freqs - dictionary from token string to integer frequency.
  • stream - channel that produces vectors of token strings.
  • chunk_size - number of tokens to buffer before flushing counts (streaming method only).

Processing steps

  1. Seed specials - copy the special tokens from cfg and insert <BOS> / <EOS> if sentence offsets are recorded.
  2. Count tokens - accumulate frequencies from the provided data source.
  3. Filter - discard tokens occurring fewer times than cfg.minimum_token_frequency.
  4. Assign IDs - assign IDs to specials first (alphabetical order for reproducibility), then to remaining tokens sorted by descending frequency and finally lexicographic order.
  5. Return - a deterministic Vocabulary containing token_to_id, id_to_token, and frequencies.
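The ordering rule in step 4 for non-special tokens can be reproduced with a single sort key (frequencies invented for illustration):

```julia
# Tokens sorted by descending frequency, ties broken lexicographically,
# mirroring the documented ID-assignment order for non-special tokens.
freqs   = Dict("fox" => 3, "the" => 9, "red" => 3)
ordered = sort(collect(keys(freqs)); by = t -> (-freqs[t], t))
# ordered == ["the", "fox", "red"]
```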

Examples

# From a token vector
tokens = ["the", "red", "fox", ...]
vocab  = build_vocabulary(tokens; cfg = config)

# From pre-computed counts
counts = Dict("the" => 523_810, "fox" => 1_234)
vocab  = build_vocabulary(counts; cfg = config)

# Streaming large corpora
ch = Channel{Vector{String}}(8) do c
    for path in corpus_paths
        put!(c, tokenize(read(path, String)))
    end
end
vocab = build_vocabulary(ch; cfg = config, chunk_size = 100_000)
source
KeemenaPreprocessing.add_level! - Method
add_level!(bundle, level, lb) -> PreprocessBundle

Mutating helper that inserts a new LevelBundle lb into bundle.levels under key level. The routine:

  1. Guards against duplicates - throws an error if level already exists.
  2. Validates the offsets inside lb.corpus for consistency with the supplied level via validate_offsets.
  3. Stores the bundle and returns the same bundle instance so the call can be chained.

Be aware that add_level! modifies its first argument in place; if you need to keep the original bundle unchanged, take a copy before calling.

Arguments

| name | type | description |
|---|---|---|
| bundle | PreprocessBundle | Target bundle to extend. |
| level | Symbol | Identifier for the new segmentation level (e.g. :char, :word). |
| lb | LevelBundle | Data + vocabulary for that level. |

Returns

The same bundle, now containing level => lb.

Errors

  • ArgumentError if a level with the same name already exists.
  • Propagates any error raised by validate_offsets when lb.corpus is inconsistent.

Example

char_bundle = LevelBundle(char_corp, char_vocab)
add_level!(bund, :character, char_bundle)

@assert has_level(bund, :character)
source
KeemenaPreprocessing.byte_cfg - Method
byte_cfg(; kwargs...) -> PreprocessConfiguration

Shorthand constructor that returns a PreprocessConfiguration pre-configured for byte-level tokenisation.

The wrapper fixes the following fields

  • tokenizer_name = :byte
  • record_byte_offsets = true
  • record_character_offsets = false
  • record_word_offsets = false

while forwarding every other keyword argument to PreprocessConfiguration. Use it when building byte-level language-model corpora while retaining full flexibility to tweak cleaning, vocabulary, or segmentation options:

cfg = byte_cfg(strip_html_tags = true,
               minimum_token_frequency = 5)
source
KeemenaPreprocessing.get_corpus - Method
get_corpus(bundle, level) -> Corpus

Retrieve the Corpus object for segmentation level level from a PreprocessBundle.

This is equivalent to get_level(bundle, level).corpus and is provided as a convenience helper when you only need the sequence of token-ids and offset tables rather than the whole LevelBundle.

Arguments

  • bundle::PreprocessBundle - bundle produced by preprocess_corpus.
  • level::Symbol - level identifier such as :byte, :word, :sentence, ...

Returns

The Corpus stored in the requested level.

Errors

Throws an ArgumentError if the level is not present in bundle (see get_level for details).

Example

word_corp = get_corpus(bund, :word)

# iterate over sentences
sent_offs = word_corp.sentence_offsets
for i in 1:length(sent_offs)-1
    rng = sent_offs[i] : sent_offs[i+1]-1
    println(view(word_corp.token_ids, rng))
end
source
KeemenaPreprocessing.get_level - Method
get_level(bundle, level) → LevelBundle

Fetch the LevelBundle associated with segmentation level level from a PreprocessBundle.

Arguments

  • bundle::PreprocessBundle — bundle returned by preprocess_corpus.
  • level::Symbol — identifier such as :byte, :word, :sentence, ...

Returns

The requested LevelBundle.

Errors

Throws an ArgumentError when the level is absent, listing all available levels to aid debugging.

Example

word_bundle = get_level(bund, :word)
println("vocabulary size: ", length(word_bundle.vocabulary.id_to_token_strings))
source
KeemenaPreprocessing.get_token_ids - Method
get_token_ids(bundle, level) -> Vector{Int}

Return the vector of token-ids for segmentation level level contained in a PreprocessBundle.

Identical to get_corpus(bundle, level).token_ids, but provided as a convenience helper when you only need the raw id sequence and not the full Corpus object.

Arguments

  • bundle::PreprocessBundle - bundle produced by preprocess_corpus.
  • level::Symbol - segmentation level identifier (e.g. :byte, :word).

Returns

A Vector{Int} whose length equals the number of tokens at that level.

Errors

Throws an ArgumentError if the requested level is absent (see get_level for details).

Example

word_ids = get_token_ids(bund, :word)
println("first ten ids: ", word_ids[1:10])
source
KeemenaPreprocessing.get_vocabulary - Method
get_vocabulary(bundle, level) -> Vocabulary

Return the Vocabulary associated with segmentation level level (e.g. :byte, :word, :sentence) from a given PreprocessBundle.

Effectively a shorthand for get_level(bundle, level).vocabulary.

Arguments

  • bundle::PreprocessBundle - Bundle produced by preprocess_corpus.
  • level::Symbol - Level identifier whose vocabulary you need.

Returns

The Vocabulary stored for level.

Errors

Raises an ArgumentError if level is not present in bundle (see get_level for details).

Example

vocab = get_vocabulary(bund, :word)
println("Top-10 tokens: ", vocab.id_to_token_strings[1:10])
source
KeemenaPreprocessing.has_level - Method
has_level(bundle, level) -> Bool

Return true if the given PreprocessBundle contains a LevelBundle for the segmentation level level (e.g. :byte, :word, :sentence); otherwise return false.

Arguments

  • bundle::PreprocessBundle — bundle to inspect.
  • level::Symbol — level identifier to look for.

Example

julia> has_level(bund, :word)
true
source
KeemenaPreprocessing.preprocess_corpus - Method
preprocess_corpus(sources, cfg; save_to = nothing) -> PreprocessBundle

Variant of preprocess_corpus that accepts an already constructed PreprocessConfiguration and therefore bypasses all keyword aliasing and default-override logic.

Use this when you have prepared a configuration object up-front (e.g. loaded from disk, shared across jobs, or customised in a function) and want to run the pipeline with those exact settings.

Arguments

| name | type | description |
|---|---|---|
| sources | AbstractString, Vector{<:AbstractString}, iterable | One or more file paths, URLs, directories (ignored), or in-memory text strings. |
| cfg | PreprocessConfiguration | Fully-specified configuration controlling every cleaning/tokenisation option. |
| save_to | String or nothing (default) | If non-nothing, the resulting bundle is additionally serialised (e.g. via JLD2) to the given file path; otherwise nothing is written to disk. |

Pipeline (unchanged)

  1. Load raw sources.
  2. Clean text based on cfg flags.
  3. Tokenise & segment; record requested offsets.
  4. Build vocabulary obeying minimum_token_frequency, special_tokens, ...
  5. Pack everything into a PreprocessBundle. Optionally persist.

Returns

A PreprocessBundle populated with corpora, vocabularies, alignments, metadata, and (by default) empty extras.

Example

cfg  = PreprocessConfiguration(strip_markdown = true,
                               tokenizer_name  = :unicode)

bund = preprocess_corpus(["doc1.txt", "doc2.txt"], cfg;
                         save_to = "unicode_bundle.jld2")

note: If you do not have a configuration object yet, call the keyword-only version instead: preprocess_corpus(sources; kwargs...) which will create a default configuration and apply any overrides you provide.

source
KeemenaPreprocessing.preprocess_corpusMethod
preprocess_corpus(sources; save_to = nothing,
                              config = nothing,
                              kwargs...) -> PreprocessBundle

End-to-end convenience wrapper that loads raw texts, cleans them, tokenises, builds a vocabulary, records offsets, and packs the result into a PreprocessBundle.

The routine can be invoked in two mutually exclusive ways:

  1. Explicit configuration - supply your own PreprocessConfiguration through the config= keyword.

  2. Ad-hoc keyword overrides - omit config and pass any subset of the configuration keywords directly (e.g. lowercase = false, tokenizer_name = :unicode). Internally a fresh PreprocessConfiguration(; kwargs...) is created from those overrides plus the documented defaults, so calling preprocess_corpus(sources) with no keywords at all runs the pipeline using the default settings.

note: Passing both config= and per-field keywords is an error because it would lead to ambiguous intent.

Arguments

  • sources :: AbstractString, Vector{<:AbstractString}, or iterable - one or more file paths/URLs that will be read, directories (silently skipped), or in-memory strings treated as raw text.
  • save_to :: String or nothing (default) - if a path is given, the resulting bundle is serialised (JLD2) to disk; otherwise nothing is written. The bundle is returned either way.
  • config :: PreprocessConfiguration or nothing - pre-constructed configuration object. When nothing (default), a new one is built from kwargs....
  • kwargs... :: see PreprocessConfiguration - per-field overrides that populate a fresh configuration when config is nothing.

Pipeline stages

  1. Loading - files/URLs are fetched; directory entries are ignored.
  2. Cleaning - controlled by the configuration's cleaning toggles.
  3. Tokenisation & segmentation - produces token ids and offset tables.
  4. Vocabulary building - applies minimum_token_frequency and inserts special tokens.
  5. Packaging - returns a PreprocessBundle; if save_to was given, the same bundle is persisted to that path.

Returns

A fully-populated PreprocessBundle.

Examples

# 1. Quick start with defaults
bund = preprocess_corpus("corpus.txt")

# 2. Fine-grained control via keyword overrides
bund = preprocess_corpus(["doc1.txt", "doc2.txt"];
                         strip_html_tags = true,
                         tokenizer_name  = :unicode,
                         minimum_token_frequency = 3)

# 3. Supply a hand-crafted configuration object
cfg  = PreprocessConfiguration(strip_markdown = true,
                               record_sentence_offsets = false)
bund = preprocess_corpus("input/", config = cfg, save_to = "bundle.jld2")
source
KeemenaPreprocessing.preprocess_corpus_streamingMethod
preprocess_corpus_streaming(srcs;
                            cfg           = PreprocessConfiguration(),
                            vocab         = nothing,
                            chunk_tokens  = DEFAULT_CHUNK_TOKENS) -> Channel{PreprocessBundle}

Low-memory, two-pass variant of preprocess_corpus that yields a stream of PreprocessBundles via a Channel. Each bundle covers roughly chunk_tokens tokens, letting you pipeline huge corpora through training code without ever loading the whole dataset into RAM.

Workflow

  1. Vocabulary pass (optional) If vocab === nothing, the function first computes global token-frequency counts in a constant-memory scan (_streaming_counts) and builds a vocabulary with build_vocabulary(freqs; cfg). If you already possess a fixed vocabulary (e.g. for fine-tuning), supply it through the vocab keyword to skip this pass.

  2. Chunking iterator A background task produced by doc_chunk_iterator groups raw source documents into slices whose estimated size does not exceed chunk_tokens.

  3. Per-chunk pipeline For every chunk the following steps mirror the standard pipeline:

    • clean_documents
    • tokenize_and_segment
    • assemble_bundle
    • build_ensure_alignments!

    The resulting bundle is put! onto the channel.
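
Step 1 can be skipped entirely by fixing the vocabulary up front, e.g. when streaming new data through a model trained from an earlier bundle. A sketch, assuming `old_bund` is such a bundle and the shard file names are hypothetical:

```julia
vocab = get_vocabulary(old_bund, :word)   # reuse the training-time vocabulary

ch = preprocess_corpus_streaming(["shard1.txt", "shard2.txt"];
                                 vocab = vocab)   # frequency pass is skipped
```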

Arguments

  • srcs :: iterable of AbstractString - file paths, URLs, or raw texts.
  • cfg :: PreprocessConfiguration - cleaning/tokenisation settings (default: a fresh object).
  • vocab :: Vocabulary or nothing - pre-existing vocabulary; when nothing it is inferred in pass 1.
  • chunk_tokens :: Int - soft cap on tokens per chunk (default = DEFAULT_CHUNK_TOKENS).

Returns

A channel of type Channel{PreprocessBundle}. Consume it with foreach, for bundle in ch, or take!(ch).

Example

ch = preprocess_corpus_streaming("large_corpus/*";
                                 cfg = PreprocessConfiguration(strip_html_tags=true),
                                 chunk_tokens = 250_000)

for bund in ch                      # streaming training loop
    update_model!(bund)             # user-defined function
end

note: The channel is unbuffered (capacity 0), so each bundle is produced only when the consumer is ready, minimising peak memory consumption.

source
KeemenaPreprocessing.preprocess_corpus_streaming_chunksMethod
preprocess_corpus_streaming_chunks(srcs; kwargs...) -> Vector{PreprocessBundle}

Run the streaming pipeline once, eagerly consume the channel, and return a Vector whose i-th entry is the PreprocessBundle covering chunk i.

Identical keyword interface to preprocess_corpus_streaming; all arguments are forwarded unchanged.

Use when you want chunked artefacts (e.g. sharding a massive corpus across GPUs) but prefer a materialised vector instead of an explicit Channel.

Example

bundles = preprocess_corpus_streaming_chunks("wiki_xml/*";
                                   chunk_tokens = 250_000,
                                   strip_html_tags = true)
@info "produced $(length(bundles)) bundles"
source
KeemenaPreprocessing.preprocess_corpus_streaming_fullMethod

preprocess_corpus_streaming_full(srcs; kwargs...) -> PreprocessBundle

Run the streaming pipeline, merge every chunk on the fly, and return one single PreprocessBundle that spans the entire corpus.

All keyword arguments are forwarded to preprocess_corpus_streaming. Throws an error when chunks were built with incompatible vocabularies.

Example

bund = preprocess_corpus_streaming_full(["en.txt", "de.txt"];
                              minimum_token_frequency = 5)
println("corpus length: ", length(get_token_ids(bund, :word)))
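
Because every keyword is forwarded, one way to guarantee that all chunks share a compatible vocabulary is to fix it up front via the vocab keyword documented for preprocess_corpus_streaming. A sketch, assuming `old_bund` is a bundle from an earlier run:

```julia
vocab = get_vocabulary(old_bund, :word)   # single vocabulary for every chunk

bund = preprocess_corpus_streaming_full(["en.txt", "de.txt"];
                                        vocab = vocab)
```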
source
KeemenaPreprocessing.with_extrasMethod
with_extras(original, new_extras) -> PreprocessBundle

Create a shallow copy of original where only the extras field is replaced by new_extras. All other components (levels, metadata, alignments) are reused by reference, so the operation is cheap and the returned bundle stays consistent with the source.

Useful when you have performed post-processing (e.g. dimensionality reduction, cluster assignments, per-document labels) and want to attach the results without mutating the original bundle in place.

Arguments

  • original :: PreprocessBundle - bundle produced by preprocess_corpus.
  • new_extras :: Any - arbitrary payload to store under bundle.extras.

Returns

A new PreprocessBundle{typeof(new_extras)} identical to original except that extras == new_extras.

Example

labels = collect(kmeans(doc_embeddings, 50).assignments)
labeled = with_extras(bund, labels)

@assert labeled.levels === bund.levels         # same reference
@assert labeled.extras === labels              # updated payload
source