Class DigitsPreTokenizer
A pre-tokenizer that splits input text at digit boundaries. This class separates numeric digits from non-numeric characters during the pre-tokenization phase.
Implements
IPreTokenizer
Namespace: Unity.InferenceEngine.Tokenization.PreTokenizers
Assembly: Unity.InferenceEngine.Tokenization.dll
Syntax
public class DigitsPreTokenizer : IPreTokenizer
Remarks
The pre-tokenizer operates in one of two modes, selected by the individualDigits constructor parameter:
- Grouped mode: Consecutive digits are kept together as a single token (e.g., "abc123def" → ["abc", "123", "def"]).
- Individual mode: Each digit is separated into its own token (e.g., "abc123def" → ["abc", "1", "2", "3", "def"]).
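The two modes can be illustrated with a small standalone sketch. It does not call the Unity.InferenceEngine.Tokenization API and is not the library's implementation: the class name DigitSplitSketch and its Split method are hypothetical, and using char.IsDigit to decide what counts as a digit is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of
// Unity.InferenceEngine.Tokenization.
public static class DigitSplitSketch
{
    // Splits text at digit/non-digit boundaries.
    // individualDigits = false: runs of digits stay together (grouped mode).
    // individualDigits = true: every digit becomes its own part (individual mode).
    public static List<string> Split(string text, bool individualDigits = false)
    {
        var parts = new List<string>();
        int start = 0;

        for (int i = 1; i <= text.Length; i++)
        {
            bool atEnd = i == text.Length;

            // Assumption: char.IsDigit defines what a "digit" is here.
            bool prevIsDigit = char.IsDigit(text[i - 1]);

            // Cut at the end of the text, where the digit/non-digit class
            // changes, or after every digit when in individual mode.
            bool cut = atEnd
                || prevIsDigit != char.IsDigit(text[i])
                || (individualDigits && prevIsDigit);

            if (cut)
            {
                parts.Add(text.Substring(start, i - start));
                start = i;
            }
        }

        return parts;
    }
}
```

With this sketch, `DigitSplitSketch.Split("abc123def")` yields "abc", "123", "def", and `DigitSplitSketch.Split("abc123def", individualDigits: true)` yields "abc", "1", "2", "3", "def", matching the two examples above.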
Constructors
DigitsPreTokenizer(bool)
Initializes a new instance of the DigitsPreTokenizer class.
Declaration
public DigitsPreTokenizer(bool individualDigits = false)
Parameters
| Type | Name | Description |
|---|---|---|
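| bool | individualDigits | If true, each digit is separated into its own token (individual mode); if false, consecutive digits are kept together as a single token (grouped mode). Defaults to false. |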
Methods
PreTokenize(SubString, Output<SubString>)
Splits the input into smaller parts (pre-tokens).
Declaration
public void PreTokenize(SubString input, Output<SubString> output)
Parameters
| Type | Name | Description |
|---|---|---|
| SubString | input | The source to pre-cut. |
| Output<SubString> | output | Target collection of generated pre-tokenized strings. |