Class MetaspacePreTokenizer
A pre-tokenizer that replaces spaces with a special character (metaspace) and optionally splits the input text at these metaspace boundaries. This is commonly used in SentencePiece-based tokenizers.
Namespace: Unity.InferenceEngine.Tokenization.PreTokenizers
Assembly: Unity.InferenceEngine.Tokenization.dll
Syntax
public class MetaspacePreTokenizer : IPreTokenizer
Remarks
The metaspace character (default: U+2581 '▁') is used to preserve information about whitespace in the original text while treating it as a regular character during tokenization. This allows the tokenizer to distinguish between "hello world" and "helloworld".
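The effect described above can be sketched without the package itself. The following is a minimal, standalone approximation, not the library's implementation: `MetaspaceSketch`, `Replace`, and `Split` are hypothetical names, and both the "prepend one leading metaspace" behavior and the choice to keep each metaspace attached to the word that follows it are assumptions borrowed from how SentencePiece-style tokenizers usually behave.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: a standalone approximation of metaspace pre-tokenization.
// MetaspaceSketch, Replace, and Split are hypothetical names and are not part
// of Unity.InferenceEngine.Tokenization.
public static class MetaspaceSketch
{
    // U+2581 LOWER ONE EIGHTH BLOCK, the default replacement character.
    public const char Metaspace = '\u2581';

    // Replaces every space with the metaspace character. When prepend is true,
    // one leading metaspace is added unless the text already starts with one
    // (an assumption about what an "always prepend" scheme does).
    public static string Replace(string text, bool prepend = true)
    {
        string replaced = text.Replace(' ', Metaspace);
        if (prepend && (replaced.Length == 0 || replaced[0] != Metaspace))
        {
            replaced = Metaspace + replaced;
        }
        return replaced;
    }

    // Splits already-replaced text so that a new piece starts at each
    // metaspace; the marker stays attached to the word that follows it
    // (assumed behavior).
    public static List<string> Split(string replaced)
    {
        var pieces = new List<string>();
        int start = 0;
        for (int i = 1; i < replaced.Length; i++)
        {
            if (replaced[i] == Metaspace)
            {
                pieces.Add(replaced.Substring(start, i - start));
                start = i;
            }
        }
        if (start < replaced.Length)
        {
            pieces.Add(replaced.Substring(start));
        }
        return pieces;
    }
}
```

Under these assumptions, `MetaspaceSketch.Split(MetaspaceSketch.Replace("hello world"))` yields two pieces, each carrying its own word-boundary marker, while the same call on "helloworld" yields a single piece. That is how the two inputs remain distinguishable after pre-tokenization.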
Constructors
MetaspacePreTokenizer(char, PrependScheme, bool)
Initializes a new instance of the MetaspacePreTokenizer class.
Declaration
public MetaspacePreTokenizer(char replacement = '▁', PrependScheme prependScheme = PrependScheme.Always, bool split = true)
Parameters
| Type | Name | Description |
|---|---|---|
| char | replacement | The character that replaces spaces. Default is U+2581 ('▁'), the Unicode character LOWER ONE EIGHTH BLOCK, which SentencePiece commonly uses as its whitespace marker. |
| PrependScheme | prependScheme | The scheme for prepending the replacement character to the input. Default is Always. |
| bool | split | If true, the input text is split at the metaspace boundaries. Default is true. |
Methods
PreTokenize(SubString, Output<SubString>)
Splits the input into smaller parts ahead of tokenization.
Declaration
public void PreTokenize(SubString input, Output<SubString> output)
Parameters
| Type | Name | Description |
|---|---|---|
| SubString | input | The source text to split. |
| Output<SubString> | output | Target collection of generated pre-tokenized strings. |