KeemenaPreprocessing.jl

KeemenaPreprocessing is a lightweight, fully streaming text-processing pipeline for Julia. It converts raw text into a compact, serialisable bundle containing

cleaned documents
flattened token sequences (byte, char, word or custom)
start-index offsets for sentences, paragraphs and documents
a deterministic vocabulary with user-defined special tokens
auxiliary metadata for downstream models

Memory usage stays predictable—even on huge corpora—because every stage can run incrementally in fixed-size chunks.

Key features

Stage	Purpose
Cleaning	Lower-cases, strips accents, removes control characters, collapses whitespace and can replace URLs, e-mails or numbers with sentinel tokens.
Tokenisation	Built-in byte, Unicode-word, whitespace and character tokenisers, generic callables, and first-party subword integration via `SubwordOptions(...)`.
Segmentation	Optional paragraph and sentence splitters driven by regex.
Vocabulary	Frequency filtering, minimum counts, user-defined special tokens.
Streaming mode	Process arbitrarily large corpora via channels so nothing ever has to fit entirely in RAM.
Bundles	Pack everything into a single JLD2 file with `save_preprocess_bundle`.

All stages are driven by one PreprocessConfiguration object, so the same code works for quick prototypes and full production pipelines.

Quick start

A single call runs the entire pipeline—load, clean, tokenise, build a vocabulary, assemble offsets, pack a bundle, and optionally save to disk:

using KeemenaPreprocessing

bundle = preprocess_corpus("corpus/*.txt";
                           tokenizer_name = :unicode,            # override defaults
                           record_sentence_offsets = true,
                           minimum_token_frequency = 3,
                           save_to = "my_bundle.jld2")           # optional persistence

Prefer a pre-built configuration object? Pass it through config =:

cfg    = PreprocessConfiguration(tokenizer_name = :byte,
                                 record_byte_offsets = true)

bundle = preprocess_corpus("data/raw.txt"; config = cfg)

Documentation map

Configuration : every option in PreprocessConfiguration
Subwords : first-party native-id subword flow and artifact accessors
Cleaning : rules, sentinel tokens, customisation
Tokenisation : built-in tokenisers and extensibility
Vocabulary : frequency thresholds, specials, determinism
Streaming : channels, chunk sizes, memory planning
Saving & loading : JLD2 helpers for long-running jobs
See the Guides for worked examples
Full API in the reference

Code of Conduct

This project follows the Julia Community Standards: https://julialang.org/community/standards/

We expect all participants in this repository (issues, pull requests, discussions) to maintain a welcoming and constructive environment.

Enforcement

Project maintainers may edit or remove comments, close issues, or reject contributions that violate this Code of Conduct.

Contributing

Contributions are very much welcome. Please try to make each PR focused one a single improvement or issue fix. For larger issues please open an issue so that a plan can be worked on with others to make sure that the direction is agreed upon so that a the PR is merged being expected. If you would like to propose or request features not present so that maintainers are aware of the need please also feel free to do so.