Configuration
PreprocessConfiguration is a single struct that controls every stage of the preprocessing pipeline:
| Stage | What it governs |
|---|---|
| Cleaning | Unicode normalisation, punctuation stripping, URL / e-mail / number replacement, Markdown & HTML removal, emoji handling, repeated-character squeezing, confusable mapping … |
| Tokenisation | Choice of built-in or custom tokenizer; whether to keep zero-length tokens. |
| Vocabulary | Minimum token frequency cutoff; special-token mapping. |
| Segmentation | Which offset levels (byte, char, word, sentence, paragraph, document) should be recorded. |
A brand-new configuration with all defaults is just:
using KeemenaPreprocessing
cfg = PreprocessConfiguration()  # ready to go

Keyword reference
Below is an exhaustive reference of every keyword accepted by PreprocessConfiguration(; kwargs...). Keywords are grouped by stage; omit any keyword to keep its default.
Cleaning toggles
| keyword | default | description |
|---|---|---|
| lowercase | true | Convert letters to lower-case. |
| strip_accents | true | Remove combining accent marks. |
| remove_control_characters | true | Drop Unicode Cc / Cf code-points. |
| remove_punctuation | true | Strip punctuation & symbol characters. |
| normalise_whitespace | true | Collapse consecutive whitespace to a single space. |
| remove_zero_width_chars | true | Remove zero-width joiners, etc. |
| preserve_newlines | true | Keep explicit \n; needed for paragraph offsets. |
| collapse_spaces | true | Collapse runs of spaces / tabs. |
| trim_edges | true | Strip leading / trailing whitespace. |
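For example, a minimal sketch (using only the keywords listed above) that keeps case and punctuation for a cased downstream model, while leaving every other cleaning default untouched:

```julia
using KeemenaPreprocessing

# Keep original casing and punctuation; all other cleaning
# toggles (accent stripping, whitespace normalisation, ...) stay on
cfg = PreprocessConfiguration(lowercase          = false,
                              remove_punctuation = false)
```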
URL, e-mail & number replacement
| keyword | default | purpose |
|---|---|---|
| replace_urls | true | Replace URLs with url_sentinel. |
| replace_emails | true | Replace e-mails with mail_sentinel. |
| keep_url_scheme | false | Preserve the http:// / https:// prefix. |
| url_sentinel | "<URL>" | Literal token replacing each URL. |
| mail_sentinel | "<EMAIL>" | Literal token replacing each e-mail. |
| replace_numbers | false | Replace numbers with number_sentinel. |
| number_sentinel | "<NUM>" | Token used when replacing numbers. |
| keep_number_decimal | false | Preserve the decimal part. |
| keep_number_sign | false | Preserve a leading + / - sign. |
| keep_number_commas | false | Preserve thousands separators. |
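For instance, a sketch that additionally masks numbers while preserving their sign and decimal part (keyword semantics as listed above):

```julia
using KeemenaPreprocessing

# URLs and e-mails are masked by default; also mask numbers,
# preserving any leading sign and the decimal part
cfg = PreprocessConfiguration(replace_numbers     = true,
                              keep_number_sign    = true,
                              keep_number_decimal = true,
                              number_sentinel     = "<NUM>")  # default, shown for clarity
```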
Mark-up & HTML
| keyword | default | description |
|---|---|---|
| strip_markdown | false | Remove Markdown formatting. |
| preserve_md_code | true | Keep fenced / inline code while stripping. |
| strip_html_tags | false | Remove HTML / XML tags. |
| html_entity_decode | true | Decode &amp;, &quot;, and similar HTML entities. |
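A sketch for mixed Markdown/HTML input; preserve_md_code is shown explicitly even though it defaults to true:

```julia
using KeemenaPreprocessing

# Strip Markdown and HTML mark-up, but keep fenced / inline code
# spans intact and decode HTML entities (the default)
cfg = PreprocessConfiguration(strip_markdown   = true,
                              preserve_md_code = true,
                              strip_html_tags  = true)
```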
Emoji & Unicode normalisation
| keyword | default | description |
|---|---|---|
| emoji_handling | :keep | :keep, :remove, or :sentinel. |
| emoji_sentinel | "<EMOJI>" | Used when emoji_handling == :sentinel. |
| squeeze_repeat_chars | false | Limit repeated characters (sooooo → sooo). |
| max_char_run | 3 | Max run length when squeezing. |
| map_confusables | false | Map visually confusable Unicode chars to ASCII. |
| unicode_normalisation_form | :none | :NFC, :NFD, :NFKC, :NFKD, or :none. |
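For example, a sketch that masks emoji, caps character runs, and applies NFC normalisation:

```julia
using KeemenaPreprocessing

# Replace each emoji with the "<EMOJI>" sentinel, cap repeated
# characters at the default run length of 3, and normalise to NFC
cfg = PreprocessConfiguration(emoji_handling             = :sentinel,
                              squeeze_repeat_chars       = true,
                              unicode_normalisation_form = :NFC)
```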
Tokenisation
| keyword | default | description |
|---|---|---|
| tokenizer_name | :whitespace | One of TOKENIZERS or a custom f(::String) callable. |
| preserve_empty_tokens | false | Keep zero-length tokens if the tokenizer returns them. |
Vocabulary construction
| keyword | default | purpose |
|---|---|---|
| minimum_token_frequency | 1 | Tokens below this frequency map to <UNK>. |
| special_tokens | Dict(:unk=>"<UNK>", :pad=>"<PAD>") | Role ⇒ literal token mapping. |
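A sketch raising the frequency cutoff and swapping in BERT-style special-token literals (the role keys stay :unk and :pad, as documented above):

```julia
using KeemenaPreprocessing

# Tokens seen fewer than 5 times collapse to the :unk literal
cfg = PreprocessConfiguration(
    minimum_token_frequency = 5,
    special_tokens          = Dict(:unk => "[UNK]", :pad => "[PAD]"))
```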
Offset recording
| keyword | default | description |
|---|---|---|
| record_byte_offsets | false | Record byte-level spans. |
| record_character_offsets | false | Record Unicode-character offsets. |
| record_word_offsets | true | Record word offsets. |
| record_sentence_offsets | true | Record sentence offsets. |
| record_paragraph_offsets | false | Record paragraph offsets (forces preserve_newlines = true). |
| record_document_offsets | true | Record document offsets. |
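For example, a sketch that additionally records byte- and character-level spans, e.g. for aligning tokens back to annotations on the raw text:

```julia
using KeemenaPreprocessing

# Byte and character spans on top of the default
# word / sentence / document levels
cfg = PreprocessConfiguration(record_byte_offsets      = true,
                              record_character_offsets = true)
```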
Built-in tokenizers
const TOKENIZERS = (:whitespace, :unicode, :byte, :char)

| name | behaviour | typical use |
|---|---|---|
| :whitespace | split(text) on Unicode whitespace | Most word-level corpora. |
| :unicode | Iterate grapheme clusters (eachgrapheme) | Languages with complex scripts, emoji, accents. |
| :byte | Raw UTF-8 bytes (UInt8) | Byte-level LLM pre-training. |
| :char | Individual Unicode characters (code points) | Character-level models / diagnostics. |
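Selecting a built-in tokenizer is just a matter of passing its symbol, e.g.:

```julia
using KeemenaPreprocessing

word_cfg     = PreprocessConfiguration(tokenizer_name = :whitespace)  # the default
grapheme_cfg = PreprocessConfiguration(tokenizer_name = :unicode)     # grapheme clusters
bytes_cfg    = PreprocessConfiguration(tokenizer_name = :byte)        # raw UTF-8 bytes
```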
You may pass any callable that returns a Vector{<:AbstractString}:
mytok(text) = split(lowercase(text), r"[ \-]+")
cfg = PreprocessConfiguration(tokenizer_name = mytok)

Example: using WordTokenizers.jl via an adapter
WordTokenizers.jl exposes many tokenizers, and some return SubString{String} (or otherwise vary token element types). To keep KeemenaPreprocessing outputs stable for downstream pipelines, wrap the tokenizer and normalize to Vector{String}.
using KeemenaPreprocessing
import WordTokenizers
function wordtokenizers_nltk_tokenizer(text::AbstractString)::Vector{String}
# Normalize element type for stable downstream typing
return String.(WordTokenizers.nltk_word_tokenize(text))
end
configuration = PreprocessConfiguration(
tokenizer_name = wordtokenizers_nltk_tokenizer,
)
documents = ["Hello, world! This is a test."]
bundle = preprocess_corpus(documents; config = configuration)

Note: avoid WordTokenizers.set_tokenizer(...) in pipeline code, since it changes global behavior. Prefer calling the tokenizer function you want explicitly (as above).
Helper: byte_cfg
cfg = byte_cfg(strip_html_tags = true,
               minimum_token_frequency = 5)

byte_cfg is a thin wrapper that pre-sets tokenizer_name = :byte, record_byte_offsets = true, and disables char / word offsets. All other keywords are forwarded unchanged.
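As an illustrative sketch only (the real helper may differ in detail), the call above behaves roughly like writing the presets out by hand:

```julia
using KeemenaPreprocessing

# Hypothetical hand-written equivalent of the byte_cfg call above:
# byte tokenizer + byte offsets, char/word offsets disabled,
# remaining keywords forwarded unchanged
cfg = PreprocessConfiguration(tokenizer_name           = :byte,
                              record_byte_offsets      = true,
                              record_character_offsets = false,
                              record_word_offsets      = false,
                              strip_html_tags          = true,
                              minimum_token_frequency  = 5)
```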
Examples
Language-agnostic, emoji-masked corpus
cfg = PreprocessConfiguration(
strip_html_tags = true,
emoji_handling = :sentinel,
minimum_token_frequency = 3)
bund = preprocess_corpus("multilang_news/*"; config = cfg)### Paragraph-level offsets for document classification
cfg = PreprocessConfiguration(
record_paragraph_offsets = true, # auto-enables preserve_newlines
tokenizer_name = :unicode)
bund = preprocess_corpus("reports/*.txt"; config = cfg)### Extreme byte-level pre-training
cfg = byte_cfg(
squeeze_repeat_chars = true,
max_char_run = 5,
minimum_token_frequency = 10)
bund = preprocess_corpus("c4_dump/*"; config = cfg, save_to = "byte_bundle.jld2")Notes & assertions
- minimum_token_frequency must be ≥ 1.
- tokenizer_name must be one of TOKENIZERS or a callable.
- Enabling record_paragraph_offsets = true automatically sets preserve_newlines = true (with a warning).
- emoji_handling must be :keep, :remove, or :sentinel.
- unicode_normalisation_form must be :none, :NFC, :NFD, :NFKC, or :NFKD.
Invalid combinations raise an AssertionError, so mistakes fail fast at configuration construction rather than deep inside the pipeline.
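For example, an out-of-range cutoff fails at construction time (a small sketch of the fail-fast behaviour):

```julia
using KeemenaPreprocessing

# minimum_token_frequency must be >= 1, so this throws immediately
try
    PreprocessConfiguration(minimum_token_frequency = 0)
catch err
    err isa AssertionError && println("rejected at construction: ", err)
end
```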
Return value
PreprocessConfiguration(…) always yields a fully populated, immutable struct, ready to be stored in bundle metadata or reused across jobs.