Public API
KeemenaPreprocessing.TOKENIZERS — Constant
TOKENIZERS
A constant tuple of Symbols listing the names of built-in tokenizers that can be passed to the tokenizer_name keyword of PreprocessConfiguration.
Currently supported values are
- :whitespace - split on Unicode whitespace;
- :unicode - iterate user-perceived graphemes (eachgrapheme);
- :byte - treat the text as raw bytes (byte-level models);
- :char - split on individual UTF-8 code units.
You may also supply any callable that implements mytokens = f(string) in place of one of these symbols.
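For instance, a minimal custom tokenizer is just a function from a string to a vector of tokens (the comma_tokenizer name below is illustrative, not part of the package):

```julia
# Hypothetical custom tokenizer: any callable f(string) -> tokens qualifies.
comma_tokenizer(s::AbstractString) = String.(split(s, ','))

comma_tokenizer("a,b,c")    # ["a", "b", "c"]
```

Passing such a callable as tokenizer_name would replace the built-in tokenisation step.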
KeemenaPreprocessing.Corpus — Type
Corpus
Flat, memory-efficient container that stores an entire corpus of token-ids together with optional hierarchical offset tables that recover the original structure (documents → paragraphs → sentences → words → characters → bytes).
Every offset vector records the starting index (1-based, inclusive) of each unit inside token_ids. The final entry therefore equals length(token_ids)+1, making range retrieval convenient via view(token_ids, offsets[i] : offsets[i+1]-1).
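A small standalone illustration of this sentinel convention (toy ids, not produced by the package):

```julia
# Two documents concatenated; the final offset entry = length(token_ids) + 1.
token_ids        = [11, 12, 13, 21, 22]
document_offsets = [1, 4, 6]

# Recover document 2 without copying:
doc2 = view(token_ids, document_offsets[2] : document_offsets[3] - 1)
collect(doc2)    # [21, 22]
```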
Fields
| field | type | always present? | description |
|---|---|---|---|
token_ids | Vector{Int} | ✓ | Concatenated token identifiers returned by the vocabulary. |
document_offsets | Vector{Int} | ✓ | Start positions of each document (outermost level). |
paragraph_offsets | Union{Vector{Int},Nothing} | cfg-dependent | Paragraph starts within each document when record_paragraph_offsets=true. |
sentence_offsets | Union{Vector{Int},Nothing} | cfg-dependent | Sentence boundaries when record_sentence_offsets=true. |
word_offsets | Union{Vector{Int},Nothing} | cfg-dependent | Word boundaries when record_word_offsets=true. |
character_offsets | Union{Vector{Int},Nothing} | cfg-dependent | Unicode-character spans when record_character_offsets=true. |
byte_offsets | Union{Vector{Int},Nothing} | cfg-dependent | Byte-level spans when record_byte_offsets=true. |
Example
# assume `corp` is a Corpus produced by preprocess_corpus
doc1_range = corp.document_offsets[1] : corp.document_offsets[2]-1
doc1_token_ids = view(corp.token_ids, doc1_range)
if corp.sentence_offsets !== nothing
first_sentence = view(corp.token_ids,
corp.sentence_offsets[1] : corp.sentence_offsets[2]-1)
end

The presence or absence of each optional offsets vector is determined entirely by the corresponding record_*_offsets flags in PreprocessConfiguration.
KeemenaPreprocessing.CrossMap — Type
CrossMap
Alignment table that links two segmentation levels of the same corpus (e.g. bytes -> characters, characters -> words, words -> sentences).
For every unit in the destination level the alignment vector stores the 1-based index into the source offsets at which that unit begins. This allows constant-time projection of any span expressed in destination units back to the finer-grained source sequence.
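As a standalone sketch of that constant-time projection (toy numbers, assuming the sentinel-terminated layout described under Fields):

```julia
# 3 sentences spanning 11 words; alignment[i] is the first word of sentence i.
alignment = [1, 4, 9, 12]    # sentinel entry: 11 words + 1

# Words belonging to sentence 2:
sentence   = 2
word_range = alignment[sentence] : alignment[sentence+1] - 1    # 4:8
```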
Fields
- source_level :: Symbol - name of the finer level (must match a key in bundle.levels; typically :byte, :char, :word, :sentence, or :paragraph).
- destination_level :: Symbol - name of the coarser level whose boundaries are encoded.
- alignment :: Vector{Int} - length = N_destination + 1. alignment[i] is the starting source-level offset of destination element i; the extra sentinel entry alignment[end] = N_source + 1 lets you slice with alignment[i] : alignment[i+1]-1 without bounds checks.
Example
# map words ⇒ sentences
m = CrossMap(:word, :sentence, sent2word_offsets)
first_sentence_word_ids = alignment_view(m, 1)  # helper returning a view

The constructor is trivial and performs no validation; pipelines are expected to guarantee consistency when emitting CrossMap objects.
KeemenaPreprocessing.CrossMap — Method
CrossMap(src, dst, align)

Shorthand outer constructor that builds a CrossMap while materialising the alignment vector as Vector{Int}.
Arguments
- src::Symbol - identifier of the source (finer-grained) level (e.g. :char, :word).
- dst::Symbol - identifier of the destination (coarser) level (e.g. :word, :sentence).
- align::AbstractVector{<:Integer} - offset array mapping every destination unit to its starting position in the source sequence. Any integer-typed vector is accepted; it is copied into a dense Vector{Int} to guarantee contiguous storage and type stability inside the resulting CrossMap.
Returns
A CrossMap(src, dst, Vector{Int}(align)).
Example
cm = CrossMap(:char, :word, UInt32[1, 5, 9, 14])
@assert cm.alignment isa Vector{Int}
KeemenaPreprocessing.LevelBundle — Type
LevelBundle
Self-contained pairing of a Corpus and its companion Vocabulary. A LevelBundle represents one segmentation level (e.g. words, characters, or bytes) produced by the preprocessing pipeline. By storing both objects side-by-side it guarantees that every token_id found in corpus.token_ids is valid according to vocabulary.
Fields
- corpus :: Corpus - all token-ids plus optional offset tables describing the structure of the text at this level.
- vocabulary :: Vocabulary - bidirectional mapping between token strings and the integer ids used in corpus.token_ids.
Integrity checks
The inner constructor performs two runtime validations:
- Range check - the largest token-id must not exceed length(vocabulary.id_to_token_strings).
- Lower bound - all token-ids must be >= 1 (id 0 is never legal).
Violations raise an informative ArgumentError, catching mismatches early.
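The two checks can be sketched in isolation as follows (a standalone approximation; the real validation lives in the inner constructor):

```julia
# Standalone sketch of the LevelBundle integrity checks described above.
function check_token_ids(token_ids::Vector{Int}, vocabulary_size::Int)
    maximum(token_ids) <= vocabulary_size ||
        throw(ArgumentError("token-id exceeds vocabulary size $vocabulary_size"))
    minimum(token_ids) >= 1 ||
        throw(ArgumentError("token-ids must be >= 1 (id 0 is never legal)"))
    return true
end

check_token_ids([1, 3, 2], 3)    # true
# check_token_ids([0, 1], 3)     # would throw ArgumentError
```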
Example
word_corpus = Corpus(word_ids, doc_offs, nothing, sent_offs, word_offs,
nothing, nothing)
word_vocab = build_vocabulary(words; minimum_token_frequency = 2)
word_bundle = LevelBundle(word_corpus, word_vocab)
nb_tokens = length(word_bundle.vocabulary.id_to_token_strings)
@info "bundle contains $nb_tokens unique tokens"
KeemenaPreprocessing.PipelineMetadata — Type
PipelineMetadata
Compact header bundled with every artefact produced by KeemenaPreprocessing. It records the exact pipeline settings and the version of the on-disk schema so that data can be re-processed, inspected, or migrated safely.
Fields
- configuration::PreprocessConfiguration - the full set of cleaning, tokenisation, vocabulary, and offset-recording options that generated the artefact. Storing this ensures strict reproducibility.
- schema_version::VersionNumber - the version of the bundle file format (not the Julia package). Increment the major component when breaking changes are introduced so that loaders can detect incompatibilities and perform migrations or raise errors.
Example
cfg = PreprocessConfiguration(strip_html_tags = true)
meta = PipelineMetadata(cfg, v"1.0.0")
@info "tokeniser:" meta.configuration.tokenizer_name
@assert meta.schema_version >= v"1.0.0"
KeemenaPreprocessing.PipelineMetadata — Method
PipelineMetadata() -> PipelineMetadata

Convenience constructor that returns a metadata header with
- the default PreprocessConfiguration() (all keyword arguments left at their documented defaults); and
- the current bundle schema version v"1.0.0".
Handy for rapid prototyping or unit tests when you do not need to customise the pipeline but still require a valid PipelineMetadata object.
Identical to:
PipelineMetadata(PreprocessConfiguration(), v"1.0.0")
KeemenaPreprocessing.PreprocessBundle — Type
PreprocessBundle{ExtraT}
Top-level artefact emitted by preprocess_corpus (or the streaming variant). A bundle contains everything required to feed a downstream model or to reload a corpus without re-running the expensive preprocessing pipeline.
Type parameter
ExtraT - arbitrary payload for user-defined information (e.g. feature matrices, clustering assignments, language tags). Use Nothing when no extras are needed.
Fields
| field | type | description |
|---|---|---|
levels | Dict{Symbol,LevelBundle} | Mapping from segmentation level name (:byte, :char, :word, :sentence, :paragraph, …) to the corresponding LevelBundle. |
metadata | PipelineMetadata | Reproducibility header (configuration + schema version). |
alignments | Dict{Tuple{Symbol,Symbol},CrossMap} | Pair-wise offset projections between levels, keyed as (source, destination) (e.g. (:char, :word)). |
extras | ExtraT | Optional user payload carried alongside the core data. |
Typical workflow
bund = preprocess_corpus(files; strip_html_tags=true)
# inspect vocabulary
word_vocab = bund.levels[:word].vocabulary
println("vocabulary size: ", length(word_vocab.id_to_token_strings))
# project a sentence span back to character offsets
cm = bund.alignments[(:char, :sentence)]
first_sentence_char_span = cm.alignment[1] : cm.alignment[2]-1

The bundle struct itself is immutable; helper functions such as add_level! (which extends the levels dictionary in place) and with_extras are provided by the package for adding further levels or extras.
KeemenaPreprocessing.PreprocessBundle — Method
PreprocessBundle(levels; metadata = PipelineMetadata(),
alignments = Dict{Tuple{Symbol,Symbol},CrossMap}(),
                 extras = nothing) -> PreprocessBundle

Outer constructor that validates and assembles the individual artefacts generated by KeemenaPreprocessing into a single PreprocessBundle.
Required argument
levels::Dict{Symbol,<:LevelBundle} - at least one segmentation level (keyed by level name such as :word or :char).
Optional keyword arguments
| keyword | default | purpose |
|---|---|---|
metadata | PipelineMetadata() | Configuration & schema header. |
alignments | empty Dict | Maps (source,destination) -> CrossMap. |
extras | nothing | User-supplied payload propagated unchanged. |
Runtime checks
- Non-empty levels.
- For each (lvl, lb) in levels, run validate_offsets(lb.corpus, lvl) to ensure internal offset consistency.
- For every supplied alignment (src, dst) => cm:
  - both src and dst must exist in levels;
  - length(cm.alignment) == length(levels[src].corpus.token_ids);
  - cm.source_level == src;
  - cm.destination_level == dst.
Any violation throws an informative ArgumentError.
Returns
A fully-validated PreprocessBundle{typeof(extras)} containing: Dict(levels), metadata, Dict(alignments), and extras.
Example
word_bundle = LevelBundle(word_corpus, word_vocab)
char_bundle = LevelBundle(char_corpus, char_vocab)
bund = PreprocessBundle(Dict(:word=>word_bundle, :char=>char_bundle);
                        alignments = Dict((:char,:word)=>char2word_map))
KeemenaPreprocessing.PreprocessBundle — Method
PreprocessBundle(; metadata = PipelineMetadata(), extras = nothing) -> PreprocessBundle

Convenience constructor that produces an empty PreprocessBundle:
- levels = Dict{Symbol,LevelBundle}()
- alignments = Dict{Tuple{Symbol,Symbol},CrossMap}()
- metadata = metadata (defaults to PipelineMetadata())
- extras = extras (defaults to nothing)

Useful when you want to build a bundle incrementally, for example loading individual levels from disk or generating them in separate jobs, while still attaching a common metadata header or arbitrary user payload.
bund = PreprocessBundle() # blank skeleton
bund = merge(bund, load_word_level("word.jld")) # pseudo-code for adding data

The returned object's type parameter is inferred from extras so that any payload, including complex structs, can be stored without further boilerplate.
KeemenaPreprocessing.PreprocessConfiguration — Method
PreprocessConfiguration(; kwargs...) -> PreprocessConfiguration

Create a fully-specified preprocessing configuration.
All keyword arguments are optional; sensible defaults are provided so that cfg = PreprocessConfiguration() already yields a working pipeline. Options are grouped below by the stage they affect.
Cleaning stage toggles
| keyword | default | purpose |
|---|---|---|
lowercase | true | Convert letters to lower-case. |
strip_accents | true | Remove combining accent marks. |
remove_control_characters | true | Drop Unicode Cc/Cf code-points. |
remove_punctuation | true | Strip punctuation & symbol characters. |
normalise_whitespace | true | Collapse consecutive whitespace. |
remove_zero_width_chars | true | Remove zero-width joiners, etc. |
preserve_newlines | true | Keep explicit line breaks. |
collapse_spaces | true | Collapse runs of spaces/tabs. |
trim_edges | true | Strip leading/trailing whitespace. |
URL, e-mail & numbers
| keyword | default | purpose |
|---|---|---|
replace_urls | true | Replace URLs with url_sentinel. |
replace_emails | true | Replace e-mails with mail_sentinel. |
keep_url_scheme | false | Preserve http:// / https:// prefix. |
url_sentinel | "<URL>" | Token inserted for each URL. |
mail_sentinel | "<EMAIL>" | Token inserted for each e-mail. |
replace_numbers | false | Replace numbers with number_sentinel. |
number_sentinel | "<NUM>" | Token used when replacing numbers. |
keep_number_decimal | false | Preserve decimal part. |
keep_number_sign | false | Preserve ± sign. |
keep_number_commas | false | Preserve thousands separators. |
Mark-up & HTML
| keyword | default | purpose |
|---|---|---|
strip_markdown | false | Remove Markdown formatting. |
preserve_md_code | true | Keep fenced/inline code while stripping. |
strip_html_tags | false | Remove HTML/XML tags. |
html_entity_decode | true | Decode &, ", etc. |
Emoji & Unicode
| keyword | default | purpose |
|---|---|---|
emoji_handling | :keep | :keep, :remove, or :sentinel. |
emoji_sentinel | "<EMOJI>" | Used when emoji_handling == :sentinel. |
squeeze_repeat_chars | false | Limit repeated character runs. |
max_char_run | 3 | Maximum run length when squeezing. |
map_confusables | false | Map visually-confusable chars. |
unicode_normalisation_form | :none | :NFC, :NFD, :NFKC, :NFKD, or :none. |
map_unicode_punctuation | false | Replace Unicode punctuation with ASCII. |
Tokenisation
| keyword | default | purpose |
|---|---|---|
tokenizer_name | :whitespace | One of TOKENIZERS or a callable. |
preserve_empty_tokens | false | Keep zero-length tokens. |
Vocabulary construction
| keyword | default | purpose |
|---|---|---|
minimum_token_frequency | 1 | Discard rarer tokens / map to <UNK>. |
special_tokens | Dict(:unk=>"<UNK>", :pad=>"<PAD>") | Role ⇒ literal mapping. |
Offset recording
| keyword | default | purpose |
|---|---|---|
record_byte_offsets | false | Record byte-level spans. |
record_character_offsets | false | Record Unicode-char offsets. |
record_word_offsets | true | Record word offsets. |
record_sentence_offsets | true | Record sentence offsets. |
record_paragraph_offsets | false | Record paragraph offsets (forces preserve_newlines = true). |
record_document_offsets | true | Record document offsets. |
Returns
A fully-initialised PreprocessConfiguration instance. Invalid combinations raise AssertionError (e.g. unsupported tokenizer) and certain settings emit warnings when they imply other flags (e.g. paragraph offsets -> preserve_newlines).
See also: TOKENIZERS and byte_cfg for a pre-canned byte-level configuration.
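Example (hedged; uses only keywords documented in the tables above):

```julia
cfg = PreprocessConfiguration()                    # all defaults

cfg = PreprocessConfiguration(replace_numbers          = true,
                              tokenizer_name           = :unicode,
                              record_character_offsets = true,
                              minimum_token_frequency  = 2)
```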
KeemenaPreprocessing.Vocabulary — Type
Vocabulary
Immutable lookup table produced by build_vocabulary that maps between integer token-ids and the string literals that appear in a corpus.
Fields
- id_to_token_strings::Vector{String} - position i holds the canonical surface form of token-id i (vocab.id_to_token_strings[id] → "word").
- token_to_id_map::Dict{String,Int} - fast reverse mapping from token string to its integer id (vocab.token_to_id_map["word"] → id). Look-ups fall back to the <UNK> id when the string is absent.
- token_frequencies::Vector{Int} - corpus counts aligned with id_to_token_strings (token_frequencies[id] gives the raw frequency of that token).
- special_tokens::Dict{Symbol,Int} - reserved ids for sentinel symbols such as :unk, :pad, :bos, :eos, … Keys are roles (Symbol); values are the corresponding integer ids.
Usage example
vocab = build_vocabulary(tokens; minimum_token_frequency = 3)
@info "UNK id: " vocab.special_tokens[:unk]
@info "«hello» id:" vocab.token_to_id_map["hello"]
@info "id → token:" vocab.id_to_token_strings[42]
KeemenaPreprocessing._Alignment.alignment_byte_to_word — Function
alignment_byte_to_word(byte_c, word_c) -> CrossMap

Construct a byte -> word CrossMap that projects each byte index in byte_c onto the word index in word_c that contains it.
Preconditions
- byte_c must have a non-nothing byte_offsets vector (checked via the private helper _require_offsets).
- word_c must have a non-nothing word_offsets vector.
- Both corpora must span the same token range: byte_offsets[end] == word_offsets[end]; otherwise an ArgumentError is thrown.
Arguments
| name | type | description |
|---|---|---|
byte_c | Corpus | Corpus tokenised at the byte level. |
word_c | Corpus | Corpus tokenised at the word level. |
Algorithm
- Retrieve the sentinel-terminated offset vectors bo = byte_c.byte_offsets and wo = word_c.word_offsets.
- Allocate b2w = Vector{Int}(undef, n_bytes) where n_bytes = length(bo) - 1.
- For each word index w_idx, fill the slice wo[w_idx] : wo[w_idx+1]-1 with w_idx, thereby assigning every byte position to the word that begins at wo[w_idx].
- Return CrossMap(:byte, :word, b2w).
The output vector has length n_bytes (no sentinel) because every byte token receives one word identifier.
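The fill loop above can be sketched standalone (toy offsets; not the package's actual implementation):

```julia
# Word offsets with sentinel: 3 words spanning 8 bytes.
wo      = [1, 4, 6, 9]
n_bytes = wo[end] - 1
b2w     = Vector{Int}(undef, n_bytes)

for w_idx in 1:length(wo)-1
    b2w[wo[w_idx] : wo[w_idx+1]-1] .= w_idx   # every byte -> its word index
end

b2w    # [1, 1, 1, 2, 2, 3, 3, 3]
```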
Returns
A CrossMap whose fields are:
source_level == :byte
destination_level == :word
alignment :: Vector{Int}   # length = n_bytes

Errors
- ArgumentError if either corpus lacks the necessary offsets.
- ArgumentError when the overall spans differ.
Example
b2w = alignment_byte_to_word(byte_corpus, word_corpus)
word_index_of_42nd_byte = b2w.alignment[42]
KeemenaPreprocessing._Assemble.assemble_bundle — Function
assemble_bundle(tokens, offsets, vocab, cfg) -> PreprocessBundle

Convert the token-level artefacts produced by tokenize_and_segment into a minimal yet fully valid PreprocessBundle. The function:
- Projects each token to its integer id using vocab; unknown strings are mapped to the :unk special (throws if the vocabulary lacks one).
- Packs the id sequence together with the requested offset tables into a Corpus.
- Wraps that corpus and its vocabulary in a LevelBundle whose key is inferred from cfg.tokenizer_name:

| tokenizer_name value | level symbol stored |
|---|---|
| :byte | :byte |
| :char | :character |
| :unicode, :whitespace | :word |
| Function | Symbol(typeof(fn)) |
| any other Symbol | same symbol |

- Builds a default PipelineMetadata header (PipelineMetadata(cfg, v"1.0.0")).
- Returns a PreprocessBundle containing exactly one level, empty alignments, and extras = nothing.
Arguments
| name | type | description |
|---|---|---|
tokens | Vector{<:Union{String,UInt8}} | Flattened token stream. |
offsets | Dict{Symbol,Vector{Int}} | Start indices for each recorded level (as returned by tokenize_and_segment). |
vocab | Vocabulary | Token <-> id mapping (must contain :unk). |
cfg | PreprocessConfiguration | Determines the level key and special-token requirements. |
Returns
PreprocessBundle with
bundle.levels == Dict(level_key => LevelBundle(corpus, vocab))
bundle.metadata == PipelineMetadata(cfg, v"1.0.0")
bundle.alignments == Dict{Tuple{Symbol,Symbol},CrossMap}() # empty
bundle.extras == nothing

Errors
- Throws ArgumentError if vocab lacks the :unk special.
- Propagates any error raised by the inner constructors of Corpus or LevelBundle (e.g. offset inconsistencies).
Example
tokens, offs = tokenize_and_segment(docs, cfg)
vocab = build_vocabulary(tokens; cfg = cfg)
bund = assemble_bundle(tokens, offs, vocab, cfg)
@info keys(bund.levels)   # (:word,) for whitespace tokenizer
KeemenaPreprocessing._BundleIO.load_preprocess_bundle — Function
load_preprocess_bundle(path; format = :jld2) -> PreprocessBundle

Load a previously-saved PreprocessBundle from disk.
The function currently understands the JLD2 wire-format written by save_preprocess_bundle. It performs a lightweight header check to ensure the on-disk bundle version is not newer than the library version linked at run-time, helping you avoid silent incompatibilities after package upgrades.
Arguments
| name | type | description |
|---|---|---|
path | AbstractString | File name (relative or absolute) pointing to the bundle on disk. |
format | Symbol (keyword) | Serialization format. Only :jld2 is accepted—any other value raises an error. |
Returns
PreprocessBundle - the exact object originally passed to save_preprocess_bundle, including all levels, alignments, metadata, and extras.
Errors
- ArgumentError - if path does not exist.
- ArgumentError - if format ≠ :jld2.
- ErrorException - when the bundle's persisted version is newer than the library's internal _BUNDLE_VERSION, signalling that your local code may be too old to read the file safely.
Example
bund = load_preprocess_bundle("artifacts/train_bundle.jld2")
@info "levels available: $(keys(bund.levels))"
KeemenaPreprocessing._BundleIO.save_preprocess_bundle — Function
save_preprocess_bundle(bundle, path; format = :jld2, compress = false) -> String

Persist a PreprocessBundle to disk and return the absolute file path written.
Currently the only supported format is :jld2; an error is raised for any other value.
Arguments
| name | type | description |
|---|---|---|
bundle | PreprocessBundle | Object produced by preprocess_corpus. |
path | AbstractString | Destination file name (relative or absolute). Parent directories are created automatically. |
format | Symbol (keyword) | Serialization format. Must be :jld2. |
compress | Bool (keyword) | When false (default) the JLD2 file is written uncompressed for fastest write speed; set to true to enable zlib compression. |
File structure
The JLD2 file stores three top-level keys
| key | value |
|---|---|
"__bundle_version__" | String denoting the package's internal bundle spec. |
"__schema_version__" | string(bundle.metadata.schema_version) |
"bundle" | The full PreprocessBundle instance. |
These headers enable future schema migrations or compatibility checks.
Returns
String - absolute path of the file just written.
Example
p = save_preprocess_bundle(bund, "artifacts/train_bundle.jld2"; compress = false)
@info "bundle saved to $p"
KeemenaPreprocessing._Cleaning.clean_documents — Function
clean_documents(docs, cfg) → Vector{String}

Apply the text-cleaning stage of the Keemena pipeline to every document in docs according to the options held in cfg (a PreprocessConfiguration). The returned vector has the same length and order as docs.
Arguments
| name | type | description |
|---|---|---|
docs | Vector{String} | Raw, unprocessed documents. |
cfg | PreprocessConfiguration | Cleaning directives (lower-casing, URL replacement, emoji handling, …). |
Processing steps
The function runs a fixed sequence of transformations, each guarded by the corresponding flag in cfg:
- Unicode normalisation - normalize_unicode (unicode_normalisation_form).
- HTML stripping - strip_html plus entity decoding (strip_html_tags, html_entity_decode).
- Markdown stripping - strip_markdown (strip_markdown, preserve_md_code).
- Repeated-character squeezing - squeeze_char_runs (squeeze_repeat_chars, max_char_run).
- Unicode confusable mapping - normalize_confusables (map_confusables).
- Emoji handling - _rewrite_emojis (emoji_handling, emoji_sentinel).
- Number replacement - replace_numbers (replace_numbers, plus the keep_* sub-flags and number_sentinel).
- Unicode-to-ASCII punctuation mapping - map_unicode_punctuation (map_unicode_punctuation).
- URL / e-mail replacement - replace_urls_emails (replace_urls, replace_emails, url_sentinel, mail_sentinel, keep_url_scheme).
- Lower-casing - lowercase (lowercase).
- Accent stripping - _strip_accents (strip_accents).
- Control-character removal - regex replace with _CTRL_RE (remove_control_characters).
- Whitespace normalisation - normalize_whitespace (normalise_whitespace, remove_zero_width_chars, collapse_spaces, trim_edges, preserve_newlines). Falls back to strip when only trim_edges is requested.
- Punctuation removal - regex replace with _PUNCT_RE (remove_punctuation).
Every transformation returns a new string; the original input remains unchanged.
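Two of these transformations can be approximated with plain regexes (a rough standalone sketch; not the package's internal helpers):

```julia
# Whitespace normalisation: collapse runs of spaces/tabs, trim the edges.
collapse_ws(s) = strip(replace(s, r"[ \t]+" => " "))

# Repeated-character squeezing: cap any character run at max_run repeats.
squeeze_runs(s; max_run = 3) =
    replace(s, Regex("(.)\\1{$(max_run),}") => m -> m[1]^max_run)

collapse_ws("a   b\t\tc ")    # "a b c"
squeeze_runs("yesssss!")      # "yesss!"
```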
Returns
Vector{String} — cleaned documents ready for tokenisation.
Example
cfg = PreprocessConfiguration(strip_html_tags = true,
replace_urls = true)
clean = clean_documents(["Visit https://example.com!"], cfg)
@info clean[1]   # -> "Visit <URL>"
KeemenaPreprocessing._Vocabulary.build_vocabulary — Function
build_vocabulary(tokens::Vector{String}; cfg::PreprocessConfiguration) -> Vocabulary
build_vocabulary(freqs::Dict{String,Int}; cfg::PreprocessConfiguration) -> Vocabulary
build_vocabulary(stream::Channel{Vector{String}};
                 cfg::PreprocessConfiguration,
                 chunk_size::Int = 500_000) -> Vocabulary

Construct a Vocabulary from token data that may be held entirely in memory, pre-counted, or streamed in batches.
Method overview
- Vector method - accepts a flat vector of token strings.
- Dict method - accepts a dictionary that maps each token string to its corpus frequency.
- Streaming method - accepts a channel that yields token-vector batches so you can build a vocabulary without ever loading the whole corpus at once.
All three methods share the same counting, filtering, and ID-assignment logic; they differ only in how token data are supplied.
Shared argument
cfg - a PreprocessConfiguration that provides:
- minimum_token_frequency;
- the initial special_tokens;
- dynamic sentence markers when record_sentence_offsets is true.
Additional arguments
- tokens - vector of token strings.
- freqs - dictionary from token string to integer frequency.
- stream - channel that produces vectors of token strings.
- chunk_size - number of tokens to buffer before flushing counts (streaming method only).
Processing steps
- Seed specials - copy the special tokens from cfg and insert <BOS>/<EOS> if sentence offsets are recorded.
- Count tokens - accumulate frequencies from the provided data source.
- Filter - discard tokens occurring fewer times than cfg.minimum_token_frequency.
- Assign IDs - assign IDs to specials first (alphabetical order for reproducibility), then to remaining tokens sorted by descending frequency and finally lexicographic order.
- Return - a deterministic Vocabulary containing the token-to-id map, id-to-token table, and token frequencies.
Examples
# From a token vector
tokens = ["the", "red", "fox", ...]
vocab = build_vocabulary(tokens; cfg = config)
# From pre-computed counts
counts = Dict("the" => 523_810, "fox" => 1_234)
vocab = build_vocabulary(counts; cfg = config)
# Streaming large corpora
ch = Channel{Vector{String}}(8) do c
for path in corpus_paths
put!(c, tokenize(read(path, String)))
end
end
vocab = build_vocabulary(ch; cfg = config, chunk_size = 100_000)
KeemenaPreprocessing.add_level! — Method
add_level!(bundle, level, lb) -> PreprocessBundle

Mutating helper that inserts a new LevelBundle lb into bundle.levels under key level. The routine:
- Guards against duplicates - throws an error if level already exists.
- Validates the offsets inside lb.corpus for consistency with the supplied level via validate_offsets.
- Stores the bundle and returns the same bundle instance so the call can be chained.
Be aware that add_level! modifies its first argument in place; if you require an unmodified bundle, keep a copy before calling.
Arguments
| name | type | description |
|---|---|---|
bundle | PreprocessBundle | Target bundle to extend. |
level | Symbol | Identifier for the new segmentation level (e.g. :char, :word). |
lb | LevelBundle | Data + vocabulary for that level. |
Returns
The same bundle, now containing level => lb.
Errors
ArgumentErrorif a level with the same name already exists.- Propagates any error raised by
validate_offsetswhenlb.corpusis inconsistent.
Example
char_bundle = LevelBundle(char_corp, char_vocab)
add_level!(bund, :character, char_bundle)
@assert has_level(bund, :character)
KeemenaPreprocessing.byte_cfg — Method
byte_cfg(; kwargs...) -> PreprocessConfiguration

Shorthand constructor that returns a PreprocessConfiguration pre-configured for byte-level tokenisation.
The wrapper fixes the following fields
- tokenizer_name = :byte
- record_byte_offsets = true
- record_character_offsets = false
- record_word_offsets = false
while forwarding every other keyword argument to PreprocessConfiguration. Use it when building byte-level language-model corpora but still needing the full flexibility to tweak cleaning, vocabulary, or segmentation options:
cfg = byte_cfg(strip_html_tags = true,
               minimum_token_frequency = 5)
KeemenaPreprocessing.get_corpus — Method
get_corpus(bundle, level) -> Corpus

Retrieve the Corpus object for segmentation level level from a PreprocessBundle.
This is equivalent to get_level(bundle, level).corpus and is provided as a convenience helper when you only need the sequence of token-ids and offset tables rather than the whole LevelBundle.
Arguments
- bundle::PreprocessBundle - bundle produced by preprocess_corpus.
- level::Symbol - level identifier such as :byte, :word, :sentence, ...
Returns
The Corpus stored in the requested level.
Errors
Throws an ArgumentError if the level is not present in bundle (see get_level for details).
Example
word_corp = get_corpus(bund, :word)
# iterate over sentences
sent_offs = word_corp.sentence_offsets
for i in 1:length(sent_offs)-1
rng = sent_offs[i] : sent_offs[i+1]-1
println(view(word_corp.token_ids, rng))
end
KeemenaPreprocessing.get_level — Method
get_level(bundle, level) → LevelBundle

Fetch the LevelBundle associated with segmentation level level from a PreprocessBundle.
Arguments
- bundle::PreprocessBundle - bundle returned by preprocess_corpus.
- level::Symbol - identifier such as :byte, :word, :sentence, ...
Returns
The requested LevelBundle.
Errors
Throws an ArgumentError when the level is absent, listing all available levels to aid debugging.
Example
word_bundle = get_level(bund, :word)
println("vocabulary size: ", length(word_bundle.vocabulary.id_to_token_strings))
KeemenaPreprocessing.get_token_ids — Method
get_token_ids(bundle, level) -> Vector{Int}

Return the vector of token-ids for segmentation level level contained in a PreprocessBundle.
Identical to get_corpus(bundle, level).token_ids, but provided as a convenience helper when you only need the raw id sequence and not the full Corpus object.
Arguments
- bundle::PreprocessBundle - bundle produced by preprocess_corpus.
- level::Symbol - segmentation level identifier (e.g. :byte, :word).
Returns
A Vector{Int} whose length equals the number of tokens at that level.
Errors
Throws an ArgumentError if the requested level is absent (see get_level for details).
Example
word_ids = get_token_ids(bund, :word)
println("first ten ids: ", word_ids[1:10])
KeemenaPreprocessing.get_vocabulary — Method
get_vocabulary(bundle, level) -> Vocabulary

Return the Vocabulary associated with segmentation level level (e.g. :byte, :word, :sentence) from a given PreprocessBundle.

Effectively a shorthand for get_level(bundle, level).vocabulary.
Arguments
- bundle::PreprocessBundle - bundle produced by preprocess_corpus.
- level::Symbol - level identifier whose vocabulary you need.
Returns
The Vocabulary stored for level
Errors
Raises an ArgumentError if level is not present in bundle (see get_level for details)
Example
vocab = get_vocabulary(bund, :word)
println("Top-10 tokens: ", vocab.id_to_token_strings[1:10])
KeemenaPreprocessing.has_level — Method
has_level(bundle, level) -> Bool

Return true if the given PreprocessBundle contains a LevelBundle for the segmentation level level (e.g. :byte, :word, :sentence); otherwise return false.
Arguments
- bundle::PreprocessBundle - bundle to inspect.
- level::Symbol - level identifier to look for.
Example
julia> has_level(bund, :word)
true
KeemenaPreprocessing.preprocess_corpus — Method
preprocess_corpus(sources, cfg; save_to = nothing) -> PreprocessBundle

Variant of preprocess_corpus that accepts an already constructed PreprocessConfiguration and therefore bypasses all keyword aliasing and default-override logic.
Use this when you have prepared a configuration object up-front (e.g. loaded from disk, shared across jobs, or customised in a function) and want to run the pipeline with those exact settings.
Arguments
| name | type | description |
|---|---|---|
sources | AbstractString, Vector{<:AbstractString}, iterable | One or more file paths, URLs, directories (ignored), or in-memory text strings. |
cfg | PreprocessConfiguration | Fully-specified configuration controlling every cleaning/tokenisation option. |
save_to | String or nothing (default) | If non-nothing, the resulting bundle is additionally serialised (e.g. via JLD2) to the given file path; otherwise nothing is written to disk. The bundle is returned either way. |
Pipeline (unchanged)
- Load raw sources.
- Clean text based on
cfgflags. - Tokenise & segment; record requested offsets.
- Build vocabulary obeying
minimum_token_frequency,special_tokens, ... - Pack everything into a
PreprocessBundle. Optionally persist.
Returns
A PreprocessBundle populated with corpora, vocabularies, alignments, metadata, and (by default) empty extras.
Example
cfg = PreprocessConfiguration(strip_markdown = true,
tokenizer_name = :unicode)
bund = preprocess_corpus(["doc1.txt", "doc2.txt"], cfg;
                       save_to = "unicode_bundle.jld2")
note: If you do not have a configuration object yet, call the keyword-only version instead: preprocess_corpus(sources; kwargs...), which will create a default configuration and apply any overrides you provide.
KeemenaPreprocessing.preprocess_corpus — Method
preprocess_corpus(sources; save_to = nothing,
config = nothing,
                  kwargs...) -> PreprocessBundle
End-to-end convenience wrapper that loads raw texts, cleans them, tokenises, builds a vocabulary, records offsets, and packs the result into a PreprocessBundle.
The routine can be invoked in two mutually-exclusive ways:
- Explicit configuration — supply your own PreprocessConfiguration through the config= keyword.
- Ad-hoc keyword overrides — omit config and pass any subset of the configuration keywords directly (e.g. lowercase = false, tokenizer_name = :unicode). Internally a fresh PreprocessConfiguration(; kwargs...) is created from those overrides plus the documented defaults, so calling preprocess_corpus(sources) with no keywords at all runs the pipeline using the default settings.
note: Passing both config= and per-field keywords is an error because it would lead to ambiguous intent.
Arguments
| name | type | description |
|---|---|---|
sources | AbstractString, Vector{<:AbstractString}, or iterable | Either one or more file paths/URLs that will be read, directories (silently skipped), or in-memory strings treated as raw text. |
save_to | String or nothing (default) | If a path is given the resulting bundle is serialised (JLD2) to disk and returned; otherwise nothing is written. |
config | PreprocessConfiguration or nothing | Pre-constructed configuration object. When nothing (default), a new one is built from kwargs.... |
kwargs... | see PreprocessConfiguration | Per-field overrides that populate a fresh configuration when config is nothing. |
Pipeline stages
- Loading - files/URLs are fetched; directory entries are ignored.
- Cleaning - controlled by the configuration's cleaning toggles.
- Tokenisation & segmentation - produces token ids and offset tables.
- Vocabulary building - applies minimum_token_frequency and inserts special tokens.
- Packaging - returns a PreprocessBundle; if save_to was given, the same bundle is persisted to that path.
Returns
A fully-populated PreprocessBundle.
Examples
# 1. Quick start with defaults
bund = preprocess_corpus("corpus.txt")
# 2. Fine-grained control via keyword overrides
bund = preprocess_corpus(["doc1.txt", "doc2.txt"];
strip_html_tags = true,
tokenizer_name = :unicode,
minimum_token_frequency = 3)
# 3. Supply a hand-crafted configuration object
cfg = PreprocessConfiguration(strip_markdown = true,
record_sentence_offsets = false)
bund = preprocess_corpus("input/", config = cfg, save_to = "bundle.jld2")
KeemenaPreprocessing.preprocess_corpus_streaming — Method
preprocess_corpus_streaming(srcs;
cfg = PreprocessConfiguration(),
vocab = nothing,
                            chunk_tokens = DEFAULT_CHUNK_TOKENS) -> Channel{PreprocessBundle}
Low-memory, two-pass variant of preprocess_corpus that yields a stream of PreprocessBundles via a Channel. Each bundle covers ≈ chunk_tokens worth of tokens, letting you pipeline huge corpora through training code without ever loading the whole dataset into RAM.
Workflow
- Vocabulary pass (optional). If vocab === nothing, the function first computes global token-frequency counts in a constant-memory scan (_streaming_counts) and builds a vocabulary with build_vocabulary(freqs; cfg). If you already possess a fixed vocabulary (e.g. for fine-tuning), supply it through the vocab keyword to skip this pass.
- Chunking iterator. A background task produced by doc_chunk_iterator groups raw source documents into slices whose estimated size does not exceed chunk_tokens.
- Per-chunk pipeline. For every chunk the following steps mirror the standard pipeline: clean_documents, tokenize_and_segment, assemble_bundle, build_ensure_alignments!. The resulting bundle is put! onto the channel.
Arguments
| name | type | description |
|---|---|---|
srcs | iterable of AbstractString | File paths, URLs, or raw texts. |
cfg | PreprocessConfiguration | Cleaning/tokenisation settings (default: fresh object). |
vocab | Vocabulary or nothing | Pre-existing vocabulary; when nothing it is inferred in pass 1. |
chunk_tokens | Int | Soft cap on tokens per chunk (default = DEFAULT_CHUNK_TOKENS). |
Returns
A channel of type Channel{PreprocessBundle}. Consume it with foreach, for bundle in ch, or take!(ch).
ch = preprocess_corpus_streaming("large_corpus/*";
cfg = PreprocessConfiguration(strip_html_tags=true),
chunk_tokens = 250_000)
for bund in ch # streaming training loop
update_model!(bund) # user-defined function
end
note: The channel is unbuffered, so each bundle is produced only when the consumer is ready, minimising peak memory consumption.
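The vocabulary pass described above can be skipped entirely when a vocabulary already exists; a minimal sketch, assuming `base_bund` is a bundle from an earlier run whose :word vocabulary is compatible with the new sources:

```julia
# Reuse an existing vocabulary so the streaming pipeline skips pass 1
vocab = get_vocabulary(base_bund, :word)
ch = preprocess_corpus_streaming(["new_shard.txt"];
                                 cfg   = PreprocessConfiguration(),
                                 vocab = vocab)
for bund in ch
    update_model!(bund)   # user-defined consumer, as in the example above
end
```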
KeemenaPreprocessing.preprocess_corpus_streaming_chunks — Method
preprocess_corpus_streaming_chunks(srcs; kwargs...) -> Vector{PreprocessBundle}
Run the streaming pipeline once, eagerly consume the channel, and return a Vector whose i-th entry is the PreprocessBundle covering chunk i.
Identical keyword interface to preprocess_corpus_streaming; all arguments are forwarded unchanged.
Use when you want chunked artefacts (e.g. sharding a massive corpus across GPUs) but prefer a materialised vector instead of an explicit Channel.
bundles = preprocess_corpus_streaming_chunks("wiki_xml/*";
chunk_tokens = 250_000,
strip_html_tags = true)
@info "produced $(length(bundles)) bundles"
KeemenaPreprocessing.preprocess_corpus_streaming_full — Method
preprocess_corpus_streaming_full(srcs; kwargs...) -> PreprocessBundle
Run the streaming pipeline, merge every chunk on the fly, and return one single PreprocessBundle that spans the entire corpus.
All keyword arguments are forwarded to preprocess_corpus_streaming. Throws when chunks were built with incompatible vocabularies.
bund = preprocess_corpus_streaming_full(["en.txt", "de.txt"];
minimum_token_frequency = 5)
println("corpus length: ", length(get_token_ids(bund, :word)))
KeemenaPreprocessing.with_extras — Method
with_extras(original, new_extras) -> PreprocessBundle
Create a shallow copy of original where only the extras field is replaced by new_extras. All other components (levels, metadata, alignments) are shared by reference, so the operation is cheap and the returned bundle remains consistent with the source.
Useful when you have performed post-processing (e.g. dimensionality reduction, cluster assignments, per-document labels) and want to attach the results without mutating the original bundle in place.
Arguments
| name | type | description |
|---|---|---|
original | PreprocessBundle | Bundle produced by preprocess_corpus. |
new_extras | Any | Arbitrary payload to store under bundle.extras. |
Returns
A new PreprocessBundle{typeof(new_extras)} identical to original except that extras == new_extras.
Example
labels = collect(kmeans(doc_embeddings, 50).assignments)
labeled = with_extras(bund, labels)
@assert labeled.levels === bund.levels # same reference
@assert labeled.extras === labels # updated payload
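Because new_extras can be any value, a NamedTuple is a convenient way to attach several artefacts at once; a minimal sketch, assuming `bund` is an existing bundle and the labels are hypothetical:

```julia
# Bundle several post-processing artefacts under one NamedTuple payload
doc_labels = rand(1:50, 1_000)                     # hypothetical cluster labels
bund2 = with_extras(bund, (labels = doc_labels, k = 50))
bund2.extras.labels === doc_labels                 # fields remain accessible
```

The returned bundle's type parameter then reflects the payload, i.e. PreprocessBundle{typeof((labels = ..., k = ...))}.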