Namespace Unity.InferenceEngine.Tokenization
Classes
Encoding
Contains the result of a tokenization pipeline run by a Tokenizer instance.
OutputUtility
Utility methods for Output<T>.
Tokenizer
This type is the entry point of the tokenization/detokenization pipeline. The pipeline is composed of six steps, and turns an input string into an IEncoding chain:
- Pre-tokenization: Splits the result of the normalization step into small pieces (example: split by whitespace).
- Encoding: The central step of the tokenization. Turns each piece from the pre-tokenization step into a sequence of int ids. See IMapper for more details.
- Truncation: Splits the sequence of ids from the encoding step into smaller subsequences. The most frequent truncation rule is "max length". See ITruncator for more details.
- Postprocessing: Transforms each subsequence generated by the truncation step. The most common transformation is adding [CLS] and [SEP] tokens before and after the sequence. See IPostProcessor for more details.
- Padding: Pads each subsequence from the postprocessing step to match the expected sequence size.
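The flow of data through these steps can be sketched in a few lines of Python. This is a conceptual illustration only, not the Unity API: the vocabulary, the function names, and the choice of normalization (Unicode NFKC plus lowercasing) are all assumptions made for the example, and the normalization step is included because the pre-tokenization step refers to it.

```python
import unicodedata

# Hypothetical toy vocabulary; the ids are arbitrary and only for illustration.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "hello": 4, "world": 5, "again": 6}


def normalize(text):
    # Normalization (assumed here): Unicode NFKC followed by lowercasing.
    return unicodedata.normalize("NFKC", text).lower()


def pre_tokenize(text):
    # Pre-tokenization: split the normalized string on whitespace.
    return text.split()


def encode(pieces):
    # Encoding: map each piece to an int id, falling back to [UNK].
    return [VOCAB.get(piece, VOCAB["[UNK]"]) for piece in pieces]


def truncate(ids, max_length):
    # Truncation: split the ids into subsequences of at most max_length.
    return [ids[i:i + max_length] for i in range(0, len(ids), max_length)]


def post_process(ids):
    # Postprocessing: wrap the subsequence in [CLS] ... [SEP].
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]


def pad(ids, size):
    # Padding: append [PAD] ids until the subsequence reaches the target size.
    return ids + [VOCAB["[PAD]"]] * (size - len(ids))


def tokenize(text, max_length=2):
    ids = encode(pre_tokenize(normalize(text)))
    chunks = [post_process(chunk) for chunk in truncate(ids, max_length)]
    # Each chunk holds at most max_length ids plus the two special tokens.
    return [pad(chunk, max_length + 2) for chunk in chunks]


print(tokenize("Hello world AGAIN"))
```

With `max_length=2`, the three input words produce two subsequences; the second one is shorter and therefore receives one padding id.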
Structs
Output<T>
Target interface for tokenization components.
SubString
Represents a portion of a string value.
Token
Represents the data of a token in a sequence.
TokenConfiguration
Represents a token that can be added to a Tokenizer instance, with optional properties that control its behavior.
Interfaces
IEncoding
Describes the result of a tokenization pipeline execution.
ITokenizer
The high level API of a tokenization/detokenization pipeline.
Enums
Direction
Specifies whether a process is performed to the Left, to the Right, or both.
SequenceIdentifier
Identifies a sequence. It is used in the TemplatePostProcessor.
SplitDelimiterBehavior
Options for how to deal with the delimiter when splitting the input string. See RegexSplitPreTokenizer.
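To make the idea of delimiter handling concrete, here is a small Python sketch of three plausible strategies. The behavior names used below (`removed`, `isolated`, `merged_with_previous`) are hypothetical labels chosen for this example; they are not taken from the enum's actual members, and the function is not part of the Unity API.

```python
import re


def split_with_delimiter(text, pattern, behavior):
    """Split text on a regex pattern, handling each delimiter per behavior."""
    pieces = []
    last = 0
    for match in re.finditer(pattern, text):
        before = text[last:match.start()]
        delimiter = match.group(0)
        if behavior == "removed":
            # Drop the delimiter entirely.
            pieces.append(before)
        elif behavior == "isolated":
            # Keep the delimiter as its own piece.
            pieces.extend([before, delimiter])
        elif behavior == "merged_with_previous":
            # Attach the delimiter to the piece that precedes it.
            pieces.append(before + delimiter)
        else:
            raise ValueError(f"unknown behavior: {behavior}")
        last = match.end()
    pieces.append(text[last:])
    # Discard empty pieces, e.g. when the text starts with a delimiter.
    return [piece for piece in pieces if piece]


for behavior in ("removed", "isolated", "merged_with_previous"):
    print(behavior, split_with_delimiter("a-b-c", "-", behavior))
```

The same input and pattern yield different pieces depending on the chosen behavior, which is the kind of choice this enum exposes.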