Configuration

PreprocessConfiguration is a single struct that controls every stage of the preprocessing pipeline:

| Stage | What it governs |
|---|---|
| Cleaning | Unicode normalisation, punctuation stripping, URL / e-mail / number replacement, Markdown & HTML removal, emoji handling, repeated-character squeezing, confusable mapping … |
| Tokenisation | Choice of built-in or custom tokenizer; whether to keep zero-length tokens. |
| Vocabulary | Minimum token frequency cutoff; special-token mapping. |
| Segmentation | Which offset levels (byte, char, word, sentence, paragraph, document) should be recorded. |

A brand-new configuration with all defaults is just:

using KeemenaPreprocessing
cfg = PreprocessConfiguration()     # ready to go

Keyword reference

The tables below list every keyword accepted by PreprocessConfiguration(; kwargs...). Keywords are grouped by stage; omit any keyword to keep its default.
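As a minimal sketch of overriding a handful of defaults while leaving the rest untouched (the particular choices are illustrative only, not a recommendation):

```julia
using KeemenaPreprocessing

# Only the keywords listed here change; every other setting keeps its default.
cfg = PreprocessConfiguration(
    lowercase          = false,   # keep original casing
    remove_punctuation = false,   # keep punctuation characters
    replace_numbers    = true,    # numbers become number_sentinel ("<NUM>" by default)
)
```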

Cleaning toggles

| keyword | default | description |
|---|---|---|
| lowercase | true | Convert letters to lower-case. |
| strip_accents | true | Remove combining accent marks. |
| remove_control_characters | true | Drop Unicode Cc / Cf code-points. |
| remove_punctuation | true | Strip punctuation & symbol characters. |
| normalise_whitespace | true | Collapse consecutive whitespace to a single space. |
| remove_zero_width_chars | true | Remove zero-width joiners, etc. |
| preserve_newlines | true | Keep explicit \n; needed for paragraph offsets. |
| collapse_spaces | true | Collapse runs of spaces / tabs. |
| trim_edges | true | Strip leading / trailing whitespace. |
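To give a feel for what a few of these toggles do to text, here is a standard-library-only sketch. The helper rough_clean is hypothetical and is not how KeemenaPreprocessing implements cleaning; it only approximates lowercase, strip_accents, collapse_spaces, and trim_edges.

```julia
using Unicode  # standard library

# Hypothetical helper, NOT part of KeemenaPreprocessing.
function rough_clean(text::AbstractString)
    cleaned = lowercase(text)                                # like lowercase = true
    cleaned = Unicode.normalize(cleaned; stripmark = true)   # like strip_accents = true
    cleaned = replace(cleaned, r"[ \t]+" => " ")             # like collapse_spaces = true
    return String(strip(cleaned))                            # like trim_edges = true
end

rough_clean("  Crème   Brûlée\tRecipe ")   # "creme brulee recipe"
```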

URL, e-mail & number replacement

| keyword | default | purpose |
|---|---|---|
| replace_urls | true | Replace URLs with url_sentinel. |
| replace_emails | true | Replace e-mails with mail_sentinel. |
| keep_url_scheme | false | Preserve http:// / https:// prefix. |
| url_sentinel | "<URL>" | Literal token replacing each URL. |
| mail_sentinel | "<EMAIL>" | Literal token replacing each e-mail. |
| replace_numbers | false | Replace numbers with number_sentinel. |
| number_sentinel | "<NUM>" | Token used when replacing numbers. |
| keep_number_decimal | false | Preserve decimal part. |
| keep_number_sign | false | Preserve + / - sign. |
| keep_number_commas | false | Preserve thousands separators. |
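The idea behind sentinel replacement can be sketched with plain regular expressions. The patterns and the helper naive_sentinels below are hypothetical and deliberately naive; the package's own matching rules are not shown on this page and are likely more thorough.

```julia
# Hypothetical, simplified patterns for illustration only.
const NAIVE_URL   = r"https?://\S+"
const NAIVE_EMAIL = r"[\w.+-]+@[\w-]+(\.[\w-]+)+"

# Replace every URL and e-mail address with a literal sentinel token.
function naive_sentinels(text::AbstractString;
                         url_sentinel::String = "<URL>",
                         mail_sentinel::String = "<EMAIL>")
    replaced = replace(text, NAIVE_URL => url_sentinel)
    return replace(replaced, NAIVE_EMAIL => mail_sentinel)
end

naive_sentinels("Mail bob@example.org or see https://example.org/docs today")
# "Mail <EMAIL> or see <URL> today"
```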

Mark-up & HTML

| keyword | default | description |
|---|---|---|
| strip_markdown | false | Remove Markdown formatting. |
| preserve_md_code | true | Keep fenced / inline code while stripping. |
| strip_html_tags | false | Remove HTML / XML tags. |
| html_entity_decode | true | Decode &amp;, &quot;, … |

Emoji & Unicode normalisation

| keyword | default | description |
|---|---|---|
| emoji_handling | :keep | :keep, :remove, or :sentinel. |
| emoji_sentinel | "<EMOJI>" | Used when emoji_handling == :sentinel. |
| squeeze_repeat_chars | false | Limit repeated characters (sooooo → sooo). |
| max_char_run | 3 | Max run length when squeezing. |
| map_confusables | false | Map visually confusable Unicode chars to ASCII. |
| unicode_normalisation_form | :none | :NFC, :NFD, :NFKC, :NFKD, or :none. |
| map_unicode_punctuation | false | Replace fancy punctuation with ASCII analogues. |
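Two of these settings are easy to picture with the standard library alone. The sketch below is an approximation, not the package's code: squeeze_runs is a hypothetical helper that mimics squeeze_repeat_chars with max_char_run, and Unicode.normalize shows what the normalisation forms do.

```julia
using Unicode  # standard library

# Hypothetical helper: cap every run of one repeated character at max_char_run.
squeeze_runs(text::AbstractString; max_char_run::Integer = 3) =
    replace(text, r"(.)\1+" => m -> first(m)^min(length(m), max_char_run))

squeeze_runs("sooooo")                          # "sooo"
squeeze_runs("yessss!!!!!"; max_char_run = 2)   # "yess!!"

# Normalisation forms: NFC composes, NFD decomposes, NFKC also folds compatibility characters.
length(Unicode.normalize("e\u0301", :NFC))   # 1 (single precomposed é)
length(Unicode.normalize("\u00e9", :NFD))    # 2 (e + combining acute)
Unicode.normalize("\ufb01", :NFKC)           # "fi" (ligature expanded)
```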

Tokenisation

| keyword | default | description |
|---|---|---|
| tokenizer_name | :whitespace | One of TOKENIZERS or a custom f(::String) callable. |
| preserve_empty_tokens | false | Keep zero-length tokens if the tokenizer returns them. |
| subword | nothing | Activate first-party subword integration with SubwordOptions(...) using mode = :tokenizer_native or :bundle_reindexed (aliases :native / :corpus accepted). |
| cleaning_profile | :classic | :classic keeps existing cleaning behavior. :subword_cooperative (aliases: :subword_safe, :cooperative) applies tokenizer-friendly cleaning defaults when subword !== nothing. |

First-party subwords

Think of subword = SubwordOptions(...) as the "integrated subword switch". You still configure and run KeemenaPreprocessing as usual, but tokenization is handled through KeemenaSubwords under the hood.

If you need one-package ergonomics for corpus preprocessing plus subwords, start here.

If you already have a tokenizer object loaded and want explicit control, pass it through source = tokenizer.

If you are using the older callable bridge and do not want to migrate yet, keep using tokenizer_name = keemena_callable(tokenizer); that path is still valid.

Example (most common path: tokenizer-native ids):

cfg = PreprocessConfiguration(
    cleaning_profile = :subword_cooperative,
    subword = SubwordOptions(
        source = :core_bpe_en,
        mode = :tokenizer_native,
        level_name = :subword,
    ),
)

Mode choice in practice:

  • mode = :tokenizer_native: use this when downstream model code expects tokenizer-native ids. This is usually the right choice for pretrained-tokenizer workflows.
  • mode = :bundle_reindexed: use this when KeemenaPreprocessing should own the final vocabulary and ids. KeemenaSubwords still segments text into pieces, but the bundle remaps ids with your minimum_token_frequency and special_tokens settings.

Cleaning profile in practice:

  • cleaning_profile = :classic: keep current general corpus-cleaning behavior.
  • cleaning_profile = :subword_cooperative: when subword !== nothing, use conservative cleaning so tokenizer-side normalization can do the heavy lifting. Specifically this profile disables lowercase, strip_accents, remove_punctuation, replace_urls, replace_emails, and replace_numbers.
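For comparison, the six toggles that :subword_cooperative is described as disabling can also be spelled out by hand. This is only a sketch: it assumes the profile does nothing beyond those six switches, which this page does not confirm, so prefer the profile keyword in real configurations.

```julia
using KeemenaPreprocessing

# Manual approximation of cleaning_profile = :subword_cooperative (assumption, see above).
cfg_manual = PreprocessConfiguration(
    lowercase          = false,
    strip_accents      = false,
    remove_punctuation = false,
    replace_urls       = false,
    replace_emails     = false,
    replace_numbers    = false,
    subword = SubwordOptions(
        source = :core_bpe_en,
        mode = :tokenizer_native,
        level_name = :subword,
    ),
)
```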

Example (bundle-owned ids/vocab):

cfg = PreprocessConfiguration(
    minimum_token_frequency = 2,
    special_tokens = Dict(:unk => "<UNK>", :pad => "<PAD>"),
    subword = SubwordOptions(
        source = :core_bpe_en,
        mode = :bundle_reindexed,
        level_name = :subword,
    ),
)

Additional notes:

  • Compatibility aliases are accepted (:native, :corpus, :tokenizer, :bundle).
  • Cleaning profile aliases are accepted (:subword_safe, :cooperative).
  • If cleaning_profile = :subword_cooperative is set without subword, configuration falls back to :classic with a warning.
  • subword !== nothing works with preprocess_corpus, preprocess_corpus_streaming, preprocess_corpus_streaming_chunks, and preprocess_corpus_streaming_full.
  • Important structural-offset scope in first-party subword mode: subword bundles currently store document_offsets on bundle.levels[:subword]. Do not assume record_byte_offsets, record_character_offsets, record_word_offsets, record_sentence_offsets, or record_paragraph_offsets will materialize corresponding structural offsets on the subword corpus.
  • Alignment scope in first-party subword mode: byte/character/word alignment maps are not auto-generated from a subword-only bundle.
  • Use subword helpers (get_subword_offsets, get_subword_attention_mask, get_subword_token_type_ids, get_subword_special_tokens_mask, get_subword_metadata) rather than reading extras directly.
  • See Subwords via KeemenaPreprocessing for full end-to-end walkthroughs and decision guidance.

Vocabulary construction

| keyword | default | purpose |
|---|---|---|
| minimum_token_frequency | 1 | Tokens below this frequency map to <UNK>. |
| special_tokens | Dict(:unk=>"<UNK>", :pad=>"<PAD>") | Role ⇒ literal token mapping. |
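The effect of a frequency cutoff can be sketched in a few lines of plain Julia. The function apply_frequency_cutoff is hypothetical and only illustrates the idea; the package builds its vocabulary internally and may differ in detail.

```julia
# Hypothetical helper: tokens seen fewer than `minimum_token_frequency` times become `unk`.
function apply_frequency_cutoff(tokens::Vector{String};
                                minimum_token_frequency::Int = 1,
                                unk::String = "<UNK>")
    counts = Dict{String,Int}()
    for token in tokens
        counts[token] = get(counts, token, 0) + 1   # count occurrences
    end
    return [counts[t] >= minimum_token_frequency ? t : unk for t in tokens]
end

apply_frequency_cutoff(["a", "b", "a", "c"]; minimum_token_frequency = 2)
# ["a", "<UNK>", "a", "<UNK>"]
```

With the default cutoff of 1 every token survives, which matches the table's default.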

Offset recording

| keyword | default | description |
|---|---|---|
| record_byte_offsets | false | Record byte-level spans. |
| record_character_offsets | false | Record Unicode-character offsets. |
| record_word_offsets | true | Record word offsets. |
| record_sentence_offsets | true | Record sentence offsets. |
| record_paragraph_offsets | false | Record paragraph offsets (forces preserve_newlines = true). |
| record_document_offsets | true | Record document offsets. |

Note:

  • The table above describes the generic tokenizer pipeline.
  • In first-party subword mode (subword !== nothing), only document_offsets are stored on the :subword corpus. Other structural offset vectors are not currently populated there.

Built-in tokenizers

const TOKENIZERS = (:whitespace, :unicode, :byte, :char)

| name | behaviour | typical use |
|---|---|---|
| :whitespace | split(text) on Unicode whitespace | Most word-level corpora. |
| :unicode | Iterate grapheme clusters (eachgrapheme) | Languages with complex scripts, emoji, accents. |
| :byte | Raw UTF-8 bytes (UInt8) | Byte-level LLM pre-training. |
| :char | Individual Unicode characters (code points) | Character-level models / diagnostics. |
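The four granularities are easiest to tell apart on text that contains a combining mark. The sketch below uses only Base and the Unicode standard library to approximate each row; it does not call the package's tokenizers, whose exact output types may differ.

```julia
using Unicode: graphemes  # standard library

text = "cafe\u0301 ok"   # "café ok" written with a combining acute accent

whitespace_tokens = split(text)                # 2 tokens: the two words
grapheme_tokens   = collect(graphemes(text))   # 7 tokens: "é" stays one cluster
byte_tokens       = collect(codeunits(text))   # 9 UInt8 values (the accent takes 2 bytes)
char_tokens       = collect(text)              # 8 Chars (e and the accent are separate)
```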

You may pass any callable that returns a Vector{<:AbstractString}:

# Lower-case the text, then split on runs of spaces or hyphens.
mytok(text) = split(lowercase(text), r"[ \-]+")

cfg = PreprocessConfiguration(tokenizer_name = mytok)

Example: using WordTokenizers.jl via an adapter

WordTokenizers.jl exposes many tokenizers, and some return SubString{String} (or otherwise vary token element types). To keep KeemenaPreprocessing outputs stable for downstream pipelines, wrap the tokenizer and normalize to Vector{String}.

using KeemenaPreprocessing
import WordTokenizers

function wordtokenizers_nltk_tokenizer(text::AbstractString)::Vector{String}
    # Normalize element type for stable downstream typing
    return String.(WordTokenizers.nltk_word_tokenize(text))
end

configuration = PreprocessConfiguration(
    tokenizer_name = wordtokenizers_nltk_tokenizer,
)

documents = ["Hello, world! This is a test."]
bundle = preprocess_corpus(documents; config = configuration)

Note: avoid WordTokenizers.set_tokenizer(...) in pipeline code since it changes global behavior. Prefer calling the tokenizer function you want explicitly (as above).


Helper: byte_cfg

cfg = byte_cfg(strip_html_tags = true,
               minimum_token_frequency = 5)

byte_cfg is a thin wrapper that pre-sets tokenizer_name = :byte, record_byte_offsets = true, and disables char / word offsets. All other keywords are forwarded unchanged.
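Going by that description, the byte_cfg call above should be roughly equivalent to the explicit configuration below. This is a sketch inferred from the prose, not taken from the helper's source, so treat the exact set of pre-set keywords as an assumption.

```julia
using KeemenaPreprocessing

# Explicit spelling of what byte_cfg is described as pre-setting.
cfg = PreprocessConfiguration(
    tokenizer_name           = :byte,
    record_byte_offsets      = true,
    record_character_offsets = false,   # "disables char ... offsets"
    record_word_offsets      = false,   # "disables ... word offsets"
    strip_html_tags          = true,    # forwarded keyword
    minimum_token_frequency  = 5,       # forwarded keyword
)
```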


Examples

Language-agnostic, emoji-masked corpus

cfg = PreprocessConfiguration(
          strip_html_tags         = true,
          emoji_handling          = :sentinel,
          minimum_token_frequency = 3)

bund = preprocess_corpus("multilang_news/*"; config = cfg)

Paragraph-level offsets for document classification

cfg = PreprocessConfiguration(
          record_paragraph_offsets = true,   # auto-enables preserve_newlines
          tokenizer_name            = :unicode)

bund = preprocess_corpus("reports/*.txt"; config = cfg)

Extreme byte-level pre-training

cfg = byte_cfg(
          squeeze_repeat_chars    = true,
          max_char_run            = 5,
          minimum_token_frequency = 10)

bund = preprocess_corpus("c4_dump/*"; config = cfg, save_to = "byte_bundle.jld2")

Notes & assertions

  • minimum_token_frequency must be ≥ 1.
  • tokenizer_name must be one of TOKENIZERS or a callable.
  • Enabling record_paragraph_offsets = true automatically sets preserve_newlines = true (with a warning).
  • emoji_handling must be :keep, :remove, or :sentinel.
  • unicode_normalisation_form must be :none, :NFC, :NFD, :NFKC, or :NFKD.

Invalid combinations raise AssertionError, so mistakes fail fast during configuration construction rather than deep inside the pipeline.
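Because validation happens in the constructor, a caller can guard against bad user-supplied settings at the point of construction. A minimal sketch, where try_configuration is a hypothetical helper and not part of the package:

```julia
using KeemenaPreprocessing

# Hypothetical helper: returns the configuration, or `nothing` if validation fails.
function try_configuration(; kwargs...)
    try
        return PreprocessConfiguration(; kwargs...)
    catch err
        err isa AssertionError || rethrow()   # only swallow validation failures
        @warn "Rejected configuration" reason = sprint(showerror, err)
        return nothing
    end
end

try_configuration(minimum_token_frequency = 0)   # violates the ≥ 1 rule
try_configuration(emoji_handling = :sentinel)    # valid value
```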


Return value

PreprocessConfiguration(…) always yields a fully populated, immutable struct that is ready to be stored in bundle metadata or reused across jobs.
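Reuse across jobs can look like the sketch below, which builds one configuration and passes it to two separate preprocess_corpus calls. The in-memory documents are placeholders chosen for illustration.

```julia
using KeemenaPreprocessing

# One configuration, built once.
cfg = PreprocessConfiguration(strip_html_tags = true)

# The same cfg drives two independent jobs, so both corpora are cleaned identically.
train_bundle = preprocess_corpus(["<p>First training document.</p>"]; config = cfg)
valid_bundle = preprocess_corpus(["<p>First validation document.</p>"]; config = cfg)
```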