Namespace Unity.InferenceEngine.Tokenization.PreTokenizers

Classes

BertPreTokenizer

Splits the input on spaces and punctuation, removing the spaces and keeping each punctuation character as a separate chunk.
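The splitting rule described above can be illustrated with a minimal Python sketch. It is not the Unity implementation: the function names are hypothetical, and the punctuation test (Unicode "P" categories plus the ASCII symbol ranges) is an assumption borrowed from the reference BERT tokenizer.

```python
import unicodedata
from typing import List


def is_punctuation(ch: str) -> bool:
    # Assumption: ASCII symbols such as "$" or "^" count as punctuation,
    # as in the reference BERT tokenizer, in addition to Unicode "P*" categories.
    cp = ord(ch)
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(ch).startswith("P")


def bert_pre_tokenize(text: str) -> List[str]:
    chunks: List[str] = []
    current: List[str] = []
    for ch in text:
        if ch.isspace():
            # Whitespace ends the current chunk and is dropped.
            if current:
                chunks.append("".join(current))
                current = []
        elif is_punctuation(ch):
            # Punctuation ends the current chunk and becomes its own chunk.
            if current:
                chunks.append("".join(current))
                current = []
            chunks.append(ch)
        else:
            current.append(ch)
    if current:
        chunks.append("".join(current))
    return chunks


print(bert_pre_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```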

ByteLevelPreTokenizer

Pre-tokenizes an input using ByteLevel rules.

CharSplitPreTokenizer

A pre-tokenizer that splits text based on a specified character delimiter.

DefaultPreTokenizer

Default placeholder implementation of a pre-tokenizer. Does not pre-cut the input.

DigitsPreTokenizer

A pre-tokenizer that splits input text at digit boundaries. This class separates numeric digits from non-numeric characters during the pre-tokenization phase.
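As an illustration of splitting at digit boundaries, here is a small Python sketch. The function name is hypothetical, and the description above does not say whether consecutive digits stay grouped or are emitted one by one; this sketch assumes they stay grouped.

```python
import re
from typing import List


def digits_pre_tokenize(text: str) -> List[str]:
    # Alternate between runs of digits and runs of non-digits.
    # Assumption: consecutive digits are kept together as one chunk.
    return re.findall(r"\d+|\D+", text)


print(digits_pre_tokenize("abc123def"))  # ['abc', '123', 'def']
```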

MetaspacePreTokenizer

A pre-tokenizer that replaces spaces with a special character (metaspace) and optionally splits the input text at these metaspace boundaries. This is commonly used in SentencePiece-based tokenizers.
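The metaspace replacement and optional split can be sketched in Python as follows. This is a behavioral illustration only: the function name and the `split` parameter are hypothetical, the marker is assumed to be U+2581 (the character SentencePiece uses), and options such as prepending a marker to the first word are left out.

```python
import re
from typing import List

METASPACE = "\u2581"  # "▁", assumed marker character


def metaspace_pre_tokenize(text: str, split: bool = True) -> List[str]:
    replaced = text.replace(" ", METASPACE)
    if not split:
        return [replaced]
    # Each chunk starts at a marker and runs up to the next one,
    # so the marker stays attached to the word that follows it.
    pattern = f"{METASPACE}?[^{METASPACE}]+|{METASPACE}"
    return re.findall(pattern, replaced)


print(metaspace_pre_tokenize("Hello world again"))
# ['Hello', '▁world', '▁again']
```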

PunctuationPreTokenizer

A pre-tokenizer that splits text on punctuation characters.

RegexSplitPreTokenizer

Splits the input based on a regular expression.

RuneSplitPreTokenizer

Splits the input into its individual runes (Unicode scalar values).

SequencePreTokenizer

Applies a sequence of pre-tokenizers.
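One plausible reading of "applies a sequence" is that each pre-tokenizer runs on every chunk produced by the previous one. The Python sketch below illustrates that reading; all names are hypothetical and the chaining order is an assumption, not confirmed behavior of the Unity class.

```python
import re
from typing import Callable, List

PreTokenizer = Callable[[str], List[str]]


def split_on_whitespace(text: str) -> List[str]:
    return text.split()


def split_digits(text: str) -> List[str]:
    return re.findall(r"\d+|\D+", text)


def sequence(pre_tokenizers: List[PreTokenizer]) -> PreTokenizer:
    def run(text: str) -> List[str]:
        chunks = [text]
        # Feed every chunk from the previous stage into the next stage.
        for pre_tokenize in pre_tokenizers:
            chunks = [piece for chunk in chunks for piece in pre_tokenize(chunk)]
        return chunks
    return run


pipeline = sequence([split_on_whitespace, split_digits])
print(pipeline("room 42b"))  # ['room', '42', 'b']
```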

StringSplitPreTokenizer

Splits the input based on a string pattern.

WhitespacePreTokenizer

A pre-tokenizer that splits text into word tokens and non-word, non-whitespace tokens. This implementation matches the behavior of the regular expression pattern "\w+|[^\w\s]+".
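The quoted pattern can be tried directly with Python's `re` module to see which chunks it yields. The function name is hypothetical, and `\w` may differ slightly between regex engines for some Unicode characters.

```python
import re
from typing import List

# Runs of word characters, or runs of characters that are neither word nor whitespace.
PATTERN = re.compile(r"\w+|[^\w\s]+")


def whitespace_pre_tokenize(text: str) -> List[str]:
    return PATTERN.findall(text)


print(whitespace_pre_tokenize("Hello, world!! It's"))
# ['Hello', ',', 'world', '!!', 'It', "'", 's']
```

Note that adjacent punctuation characters such as "!!" stay together in one chunk, and whitespace never appears in the output.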

WhitespaceSplitPreTokenizer

A pre-tokenizer that splits input text on whitespace characters.
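For comparison with the word/non-word splitting of WhitespacePreTokenizer, a plain whitespace split leaves punctuation attached to the neighboring word. A minimal Python sketch, with a hypothetical function name and assuming runs of whitespace are collapsed and produce no empty chunks:

```python
from typing import List


def whitespace_split_pre_tokenize(text: str) -> List[str]:
    # str.split() with no argument splits on any whitespace run
    # and drops leading/trailing whitespace.
    return text.split()


print(whitespace_split_pre_tokenize("Hello, world!!  ok"))
# ['Hello,', 'world!!', 'ok']
```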

Interfaces

IPreTokenizer

Pre-cuts the input string into smaller parts. Those parts will be passed to the IMapper for tokenization.