KeemenaPreprocessing

Documentation for KeemenaPreprocessing.

KeemenaPreprocessing.PreprocessBundle - Type

PreprocessBundle{IdT,OffsetT,ExtraT}

  • IdT : unsigned integer type for token ids (e.g. UInt32)
  • OffsetT : integer type for offsets (e.g. Int or UInt32)
  • ExtraT : payload supplied by downstream packages (Nothing by default)
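
Concretely, a bundle using the example parameter choices above could be written as the following type alias (a sketch; only the documented type parameters are used):

# UInt32 token ids, Int offsets, no downstream payload
const DefaultBundle = PreprocessBundle{UInt32, Int, Nothing}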
KeemenaPreprocessing.PreprocessConfiguration - Method
PreprocessConfiguration(; kwargs...) -> PreprocessConfiguration

Build a fully-typed, immutable configuration object that controls every step of preprocess_corpus. All keyword arguments are optional; the defaults shown below reproduce the behaviour of a 'typical' English-language pipeline.

If you mistype a keyword or supply an illegal value, the constructor throws an AssertionError or ArgumentError immediately, so your downstream workflow can never run with hidden mistakes.
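
For example (illustrative; the exact error type depends on the offending argument):

PreprocessConfiguration(tokenizer_name = :no_such_tokenizer)
# → throws before any text is processed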

Cleaning options
────────────────

  • lowercase = true → convert text to lowercase
  • strip_accents = true → remove Unicode accents/diacritics
  • remove_control_characters = true
  • remove_punctuation = true
  • normalise_whitespace = true → collapse runs of ' ', \t, \n into one space
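
A brief sketch combining these flags (all keywords as documented above):

# Keep casing and punctuation, but still strip accents and tidy whitespace
cfg = PreprocessConfiguration(lowercase            = false,
                              remove_punctuation   = false,
                              strip_accents        = true,
                              normalise_whitespace = true)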

Tokenisation
────────────

tokenizer_name = :whitespace | :unicode | callable

  • :whitespace - split(str) on ASCII whitespace.
  • :unicode - splits on Unicode word-break boundaries (UAX #29).
  • Function - any f(::AbstractString)::Vector{String} you supply (e.g. a SentencePiece processor).

preserve_empty_tokens = false → when true, keeps empty strings that may arise from consecutive delimiters; by default they are dropped. A combined sketch follows this list.
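
A brief sketch combining the two settings above:

# Unicode word-boundary tokenisation that keeps empty tokens
cfg = PreprocessConfiguration(tokenizer_name        = :unicode,
                              preserve_empty_tokens = true)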

Vocabulary building
───────────────────

  • minimum_token_frequency = 1 → discard tokens with lower frequency
  • special_tokens = Dict(:unk => "<UNK>", :pad => "<PAD>")

The dictionary is copied internally, so later mutation will not affect existing configurations.
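
A sketch of a stricter vocabulary, using only the documented keywords:

# Discard tokens seen fewer than 3 times; keep the default specials
cfg = PreprocessConfiguration(minimum_token_frequency = 3,
                              special_tokens = Dict(:unk => "<UNK>",
                                                    :pad => "<PAD>"))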

Segmentation levels to record (booleans)
────────────────────────────────────────

  • record_character_offsets = false
  • record_word_offsets = true
  • record_sentence_offsets = true
  • record_paragraph_offsets = true
  • record_document_offsets = true

These flags request which offset tables should appear in the resulting PreprocessBundle. After processing, inspect bundle.levels_present[:sentence] (and the other level keys) to see which ones were actually populated; a sketch follows.
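
For instance (reusing text_files and the preprocess_corpus call from the examples below; the :character key is assumed here to mirror record_character_offsets):

cfg    = PreprocessConfiguration(record_character_offsets = true)
bundle = preprocess_corpus(text_files; config = cfg)
bundle.levels_present[:character]   # true once the table is populated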

Examples
────────

# minimal default config

cfg = PreprocessConfiguration()

# custom Unicode tokenizer and higher frequency cut-off

cfg = PreprocessConfiguration(tokenizer_name          = :unicode,
                              minimum_token_frequency = 5,
                              lowercase               = false)
                              
# plug in your own callable tokenizer (passing a function)

# m.match holds the matched substring (string(m) would not return it)
unicode_tokenizer(s) = [String(m.match) for m in eachmatch(r"\b\p{L}[\p{L}\p{Mn}\p{Pc}\p{Nd}]*", s)]

cfg = PreprocessConfiguration(tokenizer_name = unicode_tokenizer,
                              remove_punctuation = false)
                              
# you can pass cfg straight to preprocess_corpus:

bundle = preprocess_corpus(text_files; config = cfg, save_to = "bundle.jld2")
KeemenaPreprocessing.with_extras! - Method
with_extras!(bundle, new_extras; setlevel = nothing) -> new_bundle

Return a new PreprocessBundle sharing the same corpus and vocabulary but carrying new_extras. If setlevel is provided, the corresponding levels_present flag is set to true.
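
A minimal sketch, assuming extras can be an arbitrary payload value (the embeddings matrix here is purely illustrative):

embeddings = rand(Float32, 128, 1_000)         # hypothetical downstream payload
bundle2    = with_extras!(bundle, embeddings)  # corpus and vocabulary are shared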
