Namespace Unity.InferenceEngine.Tokenization.PreTokenizers
Classes
BertPreTokenizer
Splits on spaces and punctuation, removing the spaces and keeping each punctuation character as a separate chunk.
ByteLevelPreTokenizer
Pre-tokenizes the input using ByteLevel rules.
CharSplitPreTokenizer
A pre-tokenizer that splits text based on a specified character delimiter.
DefaultPreTokenizer
Default placeholder implementation of a pre-tokenizer. Does not pre-cut the input.
DigitsPreTokenizer
A pre-tokenizer that splits input text at digit boundaries. This class separates numeric digits from non-numeric characters during the pre-tokenization phase.
MetaspacePreTokenizer
A pre-tokenizer that replaces spaces with a special character (metaspace) and optionally splits the input text at these metaspace boundaries. This is commonly used in SentencePiece-based tokenizers.
PunctuationPreTokenizer
A pre-tokenizer that splits text on punctuation characters.
RegexSplitPreTokenizer
Splits the input based on a regular expression.
RuneSplitPreTokenizer
Splits the input into its individual runes.
SequencePreTokenizer
Applies a sequence of pre-tokenizers.
StringSplitPreTokenizer
Splits the input based on a string pattern.
WhitespacePreTokenizer
A pre-tokenizer that splits text into word tokens and non-word, non-whitespace tokens. This implementation matches the behavior of the regular expression pattern "\w+|[^\w\s]+".
WhitespaceSplitPreTokenizer
A pre-tokenizer that splits input text on whitespace characters.
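Example
The splitting rules summarized above can be compared side by side with a short behavioural sketch. It is written in Python purely for illustration and does not use the Unity API; every function name is hypothetical. Only the WhitespacePreTokenizer pattern is taken from the documentation. The other patterns are assumptions derived from the one-line summaries: the BERT-style pattern treats "_" as a word character, which may differ from the real class, the digit splitter keeps runs of digits together, and the metaspace sketch uses U+2581 as the replacement character and adds no leading marker.

```python
import re

def whitespace_pre_tokenize(text):
    # Pattern quoted in the WhitespacePreTokenizer summary:
    # runs of word characters, or runs of non-word, non-space characters.
    return re.findall(r"\w+|[^\w\s]+", text)

def bert_pre_tokenize(text):
    # Assumption: like the pattern above, but every punctuation
    # character becomes its own chunk.
    return re.findall(r"\w+|[^\w\s]", text)

def whitespace_split_pre_tokenize(text):
    # Splits on whitespace only; punctuation stays attached to words.
    return text.split()

def digits_pre_tokenize(text):
    # Assumption: runs of digits are separated from runs of non-digits.
    return re.findall(r"\d+|\D+", text)

def metaspace_pre_tokenize(text, replacement="\u2581", split=True):
    # Replace each space with the metaspace character, then optionally
    # cut in front of every metaspace so chunks keep their marker.
    replaced = text.replace(" ", replacement)
    if not split:
        return [replaced]
    chunks, current = [], ""
    for ch in replaced:
        if ch == replacement and current:
            chunks.append(current)
            current = ""
        current += ch
    if current:
        chunks.append(current)
    return chunks

print(whitespace_pre_tokenize("Hello, world!!"))         # ['Hello', ',', 'world', '!!']
print(bert_pre_tokenize("Hello, world!!"))               # ['Hello', ',', 'world', '!', '!']
print(whitespace_split_pre_tokenize("Hello,  world!!"))  # ['Hello,', 'world!!']
print(digits_pre_tokenize("abc123def"))                  # ['abc', '123', 'def']
print(metaspace_pre_tokenize("Hello world"))             # ['Hello', '▁world']
```

The first three calls show the practical difference between the whitespace-oriented classes: the regular-expression variant groups consecutive punctuation, the BERT-style variant isolates each punctuation character, and the plain whitespace split leaves punctuation attached.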
Interfaces
IPreTokenizer
Pre-cuts the input string into smaller parts. Those parts will be passed to the IMapper for tokenization.
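Example
The pre-cutting contract, and the way a sequence of pre-tokenizers might chain, can be sketched as follows. This is a conceptual Python illustration under stated assumptions, not the Unity API: a pre-tokenizer is modelled as a plain function from a string to a list of strings, and the names sequence, split_on_whitespace and split_digits are hypothetical. It assumes that each pre-tokenizer in a sequence is applied, in order, to every part produced by the previous one.

```python
import re
from typing import Callable, List

# Assumption: a pre-tokenizer is modelled as "string in, list of parts out".
PreTokenizer = Callable[[str], List[str]]

def split_on_whitespace(text: str) -> List[str]:
    return text.split()

def split_digits(text: str) -> List[str]:
    # Separate runs of digits from runs of non-digits.
    return re.findall(r"\d+|\D+", text)

def sequence(*pre_tokenizers: PreTokenizer) -> PreTokenizer:
    """Chain pre-tokenizers: each one re-cuts every part from the previous step."""
    def run(text: str) -> List[str]:
        parts = [text]
        for pre_tokenize in pre_tokenizers:
            parts = [piece for part in parts for piece in pre_tokenize(part)]
        return parts
    return run

pre_tokenize = sequence(split_on_whitespace, split_digits)
print(pre_tokenize("room 42b ok"))  # ['room', '42', 'b', 'ok']
```

The resulting parts are what the next stage of the tokenizer would receive. An empty sequence returns the input as a single part, which corresponds to the behaviour described for DefaultPreTokenizer.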