Class WhitespacePreTokenizer
A pre-tokenizer that splits text into word tokens and non-word, non-whitespace tokens. This implementation matches the behavior of the regular expression pattern "\w+|[^\w\s]+".
Implements
IPreTokenizer
Namespace: Unity.InferenceEngine.Tokenization.PreTokenizers
Assembly: Unity.InferenceEngine.Tokenization.dll
Syntax
public class WhitespacePreTokenizer : IPreTokenizer
Remarks
The pre-tokenizer scans the input in two modes:
- Word mode: Captures sequences of word characters (letters, digits, and underscores)
- Symbol mode: Captures sequences of non-word, non-whitespace characters (punctuation, special characters)
Whitespace characters are skipped and not included in the output tokens.
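The two modes described above can be sketched as a single left-to-right scan. This is an illustrative stand-alone approximation, not the Unity implementation: the class `TwoModeScannerSketch` and its helper `IsWordChar` are hypothetical names, plain `string` stands in for `SubString`, and `char.IsLetterOrDigit(c) || c == '_'` only approximates the regex class `\w`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the two-mode scan; not part of Unity.InferenceEngine.
public static class TwoModeScannerSketch
{
    // Assumption: letters, digits, and underscore approximate the regex class \w.
    private static bool IsWordChar(char c) => char.IsLetterOrDigit(c) || c == '_';

    public static List<string> Split(string input)
    {
        var tokens = new List<string>();
        int i = 0;
        while (i < input.Length)
        {
            // Whitespace never appears in a token; skip it.
            if (char.IsWhiteSpace(input[i])) { i++; continue; }

            // The first character of a run decides the mode: word or symbol.
            int start = i;
            bool wordMode = IsWordChar(input[i]);

            // Extend the run while characters stay in the same mode.
            while (i < input.Length
                   && !char.IsWhiteSpace(input[i])
                   && IsWordChar(input[i]) == wordMode)
            {
                i++;
            }
            tokens.Add(input.Substring(start, i - start));
        }
        return tokens;
    }
}
```

A run ends at whitespace, at the end of the input, or when the character class changes, so adjacent word and symbol runs become separate tokens even without whitespace between them.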
Examples
Input: "Hello, World! Test-123"
Output: ["Hello", ",", "World", "!", "Test", "-", "123"]
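The split for this input can be checked against the regular expression cited in the class description. The sketch below uses only `System.Text.RegularExpressions` from .NET and does not call `WhitespacePreTokenizer` itself; the class name `RegexSplitSketch` is hypothetical, and plain `string` stands in for `SubString`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper that applies the pattern quoted in the class description.
public static class RegexSplitSketch
{
    // \w+ matches a run of word characters; [^\w\s]+ matches a run of
    // characters that are neither word characters nor whitespace.
    private static readonly Regex Pattern = new Regex(@"\w+|[^\w\s]+");

    public static List<string> Split(string input)
    {
        var tokens = new List<string>();
        foreach (Match match in Pattern.Matches(input))
        {
            tokens.Add(match.Value);
        }
        return tokens;
    }
}
```

Calling `RegexSplitSketch.Split("Hello, World! Test-123")` yields the seven tokens listed in the example: whitespace is matched by neither alternative, so it is dropped.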
Methods
PreTokenize(SubString, Output<SubString>)
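Splits the input into pre-tokens (runs of word characters and runs of non-word, non-whitespace characters) and adds them to the output collection. Whitespace is discarded.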
Declaration
public void PreTokenize(SubString input, Output<SubString> output)
Parameters
| Type | Name | Description |
|---|---|---|
| SubString | input | The source text to split into pre-tokens. |
| Output<SubString> | output | Target collection of generated pre-tokenized strings. |