# Configuration

`PreprocessConfiguration` is a single struct that controls every stage of the preprocessing pipeline:

| Stage | What it governs |
|---|---|
| Cleaning | Unicode normalisation, punctuation stripping, URL / e-mail / number replacement, Markdown & HTML removal, emoji handling, repeated-character squeezing, confusable mapping, … |
| Tokenisation | Choice of built-in or custom tokenizer; whether to keep zero-length tokens. |
| Vocabulary | Minimum token frequency cutoff; special-token mapping. |
| Segmentation | Which offset levels (byte, char, word, sentence, paragraph, document) should be recorded. |

A brand-new configuration with all defaults is just:

```julia
using KeemenaPreprocessing

cfg = PreprocessConfiguration()     # ready to go
```

## Keyword reference

Below is an exhaustive reference of every keyword accepted by `PreprocessConfiguration(; kwargs...)`. Arguments are grouped by stage; omit any keyword to keep its default.

### Cleaning toggles

| keyword | default | description |
|---|---|---|
| `lowercase` | `true` | Convert letters to lower-case. |
| `strip_accents` | `true` | Remove combining accent marks. |
| `remove_control_characters` | `true` | Drop Unicode Cc / Cf code-points. |
| `remove_punctuation` | `true` | Strip punctuation & symbol characters. |
| `normalise_whitespace` | `true` | Collapse consecutive whitespace to a single space. |
| `remove_zero_width_chars` | `true` | Remove zero-width joiners, etc. |
| `preserve_newlines` | `true` | Keep explicit `\n`; needed for paragraph offsets. |
| `collapse_spaces` | `true` | Collapse runs of spaces / tabs. |
| `trim_edges` | `true` | Strip leading / trailing whitespace. |
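As a quick sketch (assuming only the keywords above), a configuration that keeps case and punctuation while still normalising whitespace would look like this:

```julia
using KeemenaPreprocessing

# Keep case and punctuation, e.g. for sentence-boundary work,
# while still collapsing runs of whitespace
cfg = PreprocessConfiguration(
    lowercase            = false,
    remove_punctuation   = false,
    normalise_whitespace = true)
```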

### URL, e-mail & number replacement

| keyword | default | purpose |
|---|---|---|
| `replace_urls` | `true` | Replace URLs with `url_sentinel`. |
| `replace_emails` | `true` | Replace e-mails with `mail_sentinel`. |
| `keep_url_scheme` | `false` | Preserve `http://` / `https://` prefix. |
| `url_sentinel` | `"<URL>"` | Literal token replacing each URL. |
| `mail_sentinel` | `"<EMAIL>"` | Literal token replacing each e-mail. |
| `replace_numbers` | `false` | Replace numbers with `number_sentinel`. |
| `number_sentinel` | `"<NUM>"` | Token used when replacing numbers. |
| `keep_number_decimal` | `false` | Preserve decimal part. |
| `keep_number_sign` | `false` | Preserve `+` / `-` sign. |
| `keep_number_commas` | `false` | Preserve thousands separators. |
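For example, masking URLs and numbers with custom sentinels might look like this (a sketch using the keywords above; the sentinel strings are arbitrary choices, not defaults):

```julia
using KeemenaPreprocessing

cfg = PreprocessConfiguration(
    replace_urls     = true,
    url_sentinel     = "<LINK>",    # custom literal instead of the default "<URL>"
    replace_numbers  = true,
    number_sentinel  = "<NUM>",
    keep_number_sign = true)        # preserve + / - when replacing numbers
```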

### Mark-up & HTML

| keyword | default | description |
|---|---|---|
| `strip_markdown` | `false` | Remove Markdown formatting. |
| `preserve_md_code` | `true` | Keep fenced / inline code while stripping. |
| `strip_html_tags` | `false` | Remove HTML / XML tags. |
| `html_entity_decode` | `true` | Decode `&amp;`, `&quot;`, … |
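A typical web-scrape cleanup might combine these toggles (a sketch assuming the keywords above):

```julia
using KeemenaPreprocessing

# Drop HTML tags and Markdown syntax, decode entities, but keep code spans intact
cfg = PreprocessConfiguration(
    strip_html_tags    = true,
    html_entity_decode = true,
    strip_markdown     = true,
    preserve_md_code   = true)
```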

### Emoji & Unicode normalisation

| keyword | default | description |
|---|---|---|
| `emoji_handling` | `:keep` | `:keep`, `:remove`, or `:sentinel`. |
| `emoji_sentinel` | `"<EMOJI>"` | Used when `emoji_handling == :sentinel`. |
| `squeeze_repeat_chars` | `false` | Limit repeated characters (sooooo → sooo). |
| `max_char_run` | `3` | Max run length when squeezing. |
| `map_confusables` | `false` | Map visually confusable Unicode chars to ASCII. |
| `unicode_normalisation_form` | `:none` | `:NFC`, `:NFD`, `:NFKC`, `:NFKD`, or `:none`. |
| `map_unicode_punctuation` | `false` | Replace fancy punctuation with ASCII analogues. |
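For noisy social-media text, one plausible combination of these keywords is (a sketch, not a recommendation):

```julia
using KeemenaPreprocessing

cfg = PreprocessConfiguration(
    emoji_handling             = :sentinel,   # each emoji becomes "<EMOJI>"
    squeeze_repeat_chars       = true,        # cap repeated characters
    max_char_run               = 2,           # "sooooo" is squeezed to "soo"
    map_confusables            = true,
    unicode_normalisation_form = :NFC)
```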

### Tokenisation

| keyword | default | description |
|---|---|---|
| `tokenizer_name` | `:whitespace` | One of `TOKENIZERS` or a custom `f(::String)` callable. |
| `preserve_empty_tokens` | `false` | Keep zero-length tokens if the tokenizer returns them. |

### Vocabulary construction

| keyword | default | purpose |
|---|---|---|
| `minimum_token_frequency` | `1` | Tokens below this frequency map to `<UNK>`. |
| `special_tokens` | `Dict(:unk => "<UNK>", :pad => "<PAD>")` | Role ⇒ literal token mapping. |
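For instance, a tighter vocabulary with explicit special tokens could be configured as follows (a sketch using the keywords above):

```julia
using KeemenaPreprocessing

# Tokens seen fewer than 5 times collapse to the :unk literal
cfg = PreprocessConfiguration(
    minimum_token_frequency = 5,
    special_tokens          = Dict(:unk => "<UNK>",
                                   :pad => "<PAD>"))
```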

### Offset recording

| keyword | default | description |
|---|---|---|
| `record_byte_offsets` | `false` | Record byte-level spans. |
| `record_character_offsets` | `false` | Record Unicode-character offsets. |
| `record_word_offsets` | `true` | Record word offsets. |
| `record_sentence_offsets` | `true` | Record sentence offsets. |
| `record_paragraph_offsets` | `false` | Record paragraph offsets (forces `preserve_newlines = true`). |
| `record_document_offsets` | `true` | Record document offsets. |
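To record every offset level, e.g. for span-annotation work, the full set can be switched on like this (a sketch; note that enabling paragraph offsets also forces `preserve_newlines = true`):

```julia
using KeemenaPreprocessing

cfg = PreprocessConfiguration(
    record_byte_offsets      = true,
    record_character_offsets = true,
    record_word_offsets      = true,
    record_sentence_offsets  = true,
    record_paragraph_offsets = true,
    record_document_offsets  = true)
```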

## Built-in tokenizers

```julia
const TOKENIZERS = (:whitespace, :unicode, :byte, :char)
```

| name | behaviour | typical use |
|---|---|---|
| `:whitespace` | `split(text)` on Unicode whitespace | Most word-level corpora. |
| `:unicode` | Iterate grapheme clusters (`eachgrapheme`) | Languages with complex scripts, emoji, accents. |
| `:byte` | Raw UTF-8 bytes (`UInt8`) | Byte-level LLM pre-training. |
| `:char` | Individual UTF-8 code-units | Character-level models / diagnostics. |
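The four strategies roughly correspond to these plain-Julia operations (illustrative only; `graphemes` comes from the `Unicode` standard library, and the package's own implementations may differ in detail):

```julia
using Unicode  # provides graphemes

text = "héllo 👋"

split(text)               # whitespace: ["héllo", "👋"]
collect(graphemes(text))  # unicode: one entry per grapheme cluster
codeunits(text)           # byte: raw UInt8 UTF-8 code units
collect(text)             # char: one Char per code point
```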

You may pass any callable that returns a `Vector{<:AbstractString}`:

```julia
mytok(text) = split(lowercase(text), r"[ \-]+")

cfg = PreprocessConfiguration(tokenizer_name = mytok)
```

### Example: using WordTokenizers.jl via an adapter

WordTokenizers.jl exposes many tokenizers, and some return `SubString{String}` (or otherwise vary token element types). To keep KeemenaPreprocessing outputs stable for downstream pipelines, wrap the tokenizer and normalize to `Vector{String}`.

```julia
using KeemenaPreprocessing
import WordTokenizers

function wordtokenizers_nltk_tokenizer(text::AbstractString)::Vector{String}
    # Normalize element type for stable downstream typing
    return String.(WordTokenizers.nltk_word_tokenize(text))
end

configuration = PreprocessConfiguration(
    tokenizer_name = wordtokenizers_nltk_tokenizer,
)

documents = ["Hello, world! This is a test."]
bundle = preprocess_corpus(documents; config = configuration)
```

Note: avoid `WordTokenizers.set_tokenizer(...)` in pipeline code, since it changes global behavior. Prefer calling the tokenizer function you want explicitly (as above).


## Helper: `byte_cfg`

```julia
cfg = byte_cfg(strip_html_tags = true,
               minimum_token_frequency = 5)
```

`byte_cfg` is a thin wrapper that pre-sets `tokenizer_name = :byte` and `record_byte_offsets = true`, and disables char / word offsets. All other keywords are forwarded unchanged.
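Given the stated behaviour, the `byte_cfg` call above should be roughly equivalent to spelling everything out (a sketch; the exact set of pre-set keywords is the wrapper's concern):

```julia
using KeemenaPreprocessing

cfg = PreprocessConfiguration(
    tokenizer_name           = :byte,
    record_byte_offsets      = true,
    record_character_offsets = false,
    record_word_offsets      = false,
    strip_html_tags          = true,   # forwarded unchanged
    minimum_token_frequency  = 5)      # forwarded unchanged
```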


## Examples

### Language-agnostic, emoji-masked corpus

```julia
cfg = PreprocessConfiguration(
          strip_html_tags         = true,
          emoji_handling          = :sentinel,
          minimum_token_frequency = 3)

bund = preprocess_corpus("multilang_news/*"; config = cfg)
```

### Paragraph-level offsets for document classification

```julia
cfg = PreprocessConfiguration(
          record_paragraph_offsets = true,   # auto-enables preserve_newlines
          tokenizer_name           = :unicode)

bund = preprocess_corpus("reports/*.txt"; config = cfg)
```

### Extreme byte-level pre-training

```julia
cfg = byte_cfg(
          squeeze_repeat_chars    = true,
          max_char_run            = 5,
          minimum_token_frequency = 10)

bund = preprocess_corpus("c4_dump/*"; config = cfg, save_to = "byte_bundle.jld2")
```

## Notes & assertions

- `minimum_token_frequency` must be ≥ 1.
- `tokenizer_name` must be one of `TOKENIZERS` or a callable.
- Enabling `record_paragraph_offsets = true` automatically sets `preserve_newlines = true` (with a warning).
- `emoji_handling` must be `:keep`, `:remove`, or `:sentinel`.
- `unicode_normalisation_form` must be `:none`, `:NFC`, `:NFD`, `:NFKC`, or `:NFKD`.

Invalid combinations raise an `AssertionError`, so mistakes fail fast at configuration construction rather than deep inside the pipeline.
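For example, an out-of-range frequency should fail immediately at construction time (a sketch assuming the assertions listed above):

```julia
using KeemenaPreprocessing

# Throws AssertionError: minimum_token_frequency must be >= 1
cfg = PreprocessConfiguration(minimum_token_frequency = 0)
```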


## Return value

`PreprocessConfiguration(…)` always yields a fully populated, immutable struct, ready to be stored in bundle metadata or reused across jobs.