KeemenaPreprocessing
Documentation for KeemenaPreprocessing.
KeemenaPreprocessing.PreprocessBundle
KeemenaPreprocessing.PreprocessConfiguration
KeemenaPreprocessing.with_extras!
KeemenaPreprocessing.PreprocessBundle — Type

PreprocessBundle{IdT,OffsetT,ExtraT}

IdT     : unsigned integer type for token ids (e.g. UInt32)
OffsetT : integer type for offsets (e.g. Int or UInt32)
ExtraT  : payload type supplied by downstream packages (Nothing by default)
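A minimal sketch of how the type parameters show up in practice; the parameter combination below (UInt32 ids, Int offsets, no extras) is an assumption about what preprocess_corpus returns by default and may differ in your setup:

# token ids as UInt32, offsets as Int, no downstream payload (assumed defaults)
bundle = preprocess_corpus(["corpus.txt"])
bundle isa PreprocessBundle{UInt32, Int, Nothing}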
KeemenaPreprocessing.PreprocessConfiguration — Method

PreprocessConfiguration(; kwargs...) -> PreprocessConfiguration

Build a fully typed, immutable configuration object that controls every step of preprocess_corpus. All keyword arguments are optional; the defaults shown below reproduce the behaviour of a typical English-language pipeline.

If you mistype a keyword or supply an illegal value, the constructor throws an AssertionError or ArgumentError immediately, so your downstream workflow can never run with hidden mistakes.
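For instance, both of the following fail at construction time rather than deep in the pipeline (which of the two error types is thrown in each case is an assumption):

PreprocessConfiguration(lowercse = true)               # mistyped keyword
PreprocessConfiguration(minimum_token_frequency = -1)  # illegal value (assumed)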
──────────────────────────────────────────────────────────────────────────────
Cleaning options
────────────────
lowercase                 = true → convert text to lowercase
strip_accents             = true → remove Unicode accents/diacritics
remove_control_characters = true
remove_punctuation        = true
normalise_whitespace      = true → collapse runs of ␠, \t, \n into one space
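A sketch combining these flags; every keyword below is one of the documented cleaning options:

# keep case and punctuation, but still collapse whitespace
cfg = PreprocessConfiguration(lowercase = false,
                              remove_punctuation = false,
                              normalise_whitespace = true)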
──────────────────────────────────────────────────────────────────────────────
Tokenisation
────────────
tokenizer_name = :whitespace | :unicode | callable
  - :whitespace : split(str) on ASCII whitespace.
  - :unicode    : split on Unicode word-break boundaries (UAX #29).
  - Function    : any f(::AbstractString)::Vector{String} you supply
                  (e.g. a SentencePiece processor).

preserve_empty_tokens = false → keep empty strings that may arise from
consecutive delimiters.
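A sketch of preserve_empty_tokens in action; the comma tokenizer is purely illustrative, not part of the package:

# split("a,,b", ',') yields ["a", "", "b"]; this flag decides whether the
# empty middle token survives into the corpus
comma_tok(s) = String.(split(s, ','))
cfg = PreprocessConfiguration(tokenizer_name = comma_tok,
                              preserve_empty_tokens = true)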
──────────────────────────────────────────────────────────────────────────────
Vocabulary building
───────────────────
minimum_token_frequency = 1 → discard tokens with lower frequency
special_tokens          = Dict(:unk => "<UNK>", :pad => "<PAD>")

The special_tokens dictionary is copied internally, so mutating it later will
not affect existing configurations.
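A sketch extending the default specials with a tighter frequency cut-off; the :bos entry is an illustrative assumption, not a built-in:

cfg = PreprocessConfiguration(minimum_token_frequency = 2,
                              special_tokens = Dict(:unk => "<UNK>",
                                                    :pad => "<PAD>",
                                                    :bos => "<BOS>"))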
──────────────────────────────────────────────────────────────────────────────
Segmentation levels to record (booleans)
────────────────────────────────────────
record_character_offsets = false
record_word_offsets      = true
record_sentence_offsets  = true
record_paragraph_offsets = true
record_document_offsets  = true

These flags request which offset tables should appear in the resulting
PreprocessBundle. After processing, inspect bundle.levels_present[:sentence]
etc. to see which ones were actually populated.
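A sketch requesting character offsets and then checking the result; the :character key mirrors the :sentence key shown above and is an assumption:

cfg    = PreprocessConfiguration(record_character_offsets = true)
bundle = preprocess_corpus(files; config = cfg)
bundle.levels_present[:character]   # true if the table was populated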
──────────────────────────────────────────────────────────────────────────────
Examples
────────
# minimal default config
cfg = PreprocessConfiguration()

# custom Unicode tokenizer and higher frequency cut-off
cfg = PreprocessConfiguration(tokenizer_name = :unicode,
                              minimum_token_frequency = 5,
                              lowercase = false)

# plug in your own callable tokenizer (passing a function);
# extracting m.match makes the result a Vector{String}, as required
unicode_tokenizer(s) = [String(m.match)
                        for m in eachmatch(r"\b\p{L}[\p{L}\p{Mn}\p{Pc}\p{Nd}]*", s)]
cfg = PreprocessConfiguration(tokenizer_name = unicode_tokenizer,
                              remove_punctuation = false)

# pass cfg straight to preprocess_corpus:
bundle = preprocess_corpus(text_files; config = cfg, save_to = "bundle.jld2")
KeemenaPreprocessing.with_extras! — Method

with_extras!(bundle, new_extras; setlevel = nothing) -> new_bundle

Return a new PreprocessBundle sharing the same corpus and vocabulary but
carrying new_extras. If setlevel is provided, the corresponding
levels_present flag is set to true.
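A usage sketch; the shape of the extras payload, the helper that produces it, and the :sentence level name are all assumptions for illustration:

sentence_stats = compute_my_stats(bundle)          # hypothetical helper
bundle2 = with_extras!(bundle, sentence_stats; setlevel = :sentence)
bundle2.levels_present[:sentence]                  # now true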