Namespace Unity.InferenceEngine.Tokenization

Classes

Encoding

Contains the result of a tokenization pipeline run by a Tokenizer instance.

OutputUtility

Utility methods for Output<T>.

Tokenizer

This type is the entry point of the tokenization/detokenization pipeline. The pipeline is composed of six steps, and turns an input string into an IEncoding chain:

Normalization Transforms the input string before it is passed to pre-tokenization.
Pre-tokenization Splits the result of the normalization step into small pieces (example: split by whitespace).
Encoding The central step of tokenization. Turns each piece from the pre-tokenization step into a sequence of int ids. See IMapper for more details.
Truncation Splits the sequence of ids from the encoding step into smaller subsequences. The most frequent truncation rule is "max length". See ITruncator for more details.
Postprocessing Transforms each subsequence generated by the truncation step. The most common transformation is adding a [CLS] token before the sequence and a [SEP] token after it. See IPostProcessor for more details.
Padding Pads each subsequence from the postprocessing step to match the expected sequence size.
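
The following is a minimal, self-contained sketch of how the steps described above can chain together. It does not use the Unity.InferenceEngine.Tokenization API: ToyPipeline, its methods, the vocabulary, and the special token ids are all hypothetical and chosen only for illustration. It also assumes lowercasing as the normalization rule, "max length" as the truncation rule, and right padding.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy illustration only. None of these names belong to
// Unity.InferenceEngine.Tokenization; the vocabulary and ids are made up.
public static class ToyPipeline
{
    const int UnkId = 0, ClsId = 1, SepId = 2, PadId = 3;

    static readonly Dictionary<string, int> Vocab = new Dictionary<string, int>
    {
        ["hello"] = 4, ["tokenizer"] = 5, ["world"] = 6,
    };

    // Normalization (assumed rule): trim and lowercase the input.
    static string Normalize(string input) => input.Trim().ToLowerInvariant();

    // Pre-tokenization: split the normalized string on whitespace.
    static string[] PreTokenize(string text) =>
        text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

    // Encoding: map each piece to an int id; unknown pieces map to UnkId.
    static List<int> Encode(IEnumerable<string> pieces) =>
        pieces.Select(p => Vocab.TryGetValue(p, out var id) ? id : UnkId).ToList();

    // Truncation: cut the ids into subsequences of at most maxLength ids.
    static List<List<int>> Truncate(List<int> ids, int maxLength)
    {
        var chunks = new List<List<int>>();
        for (var i = 0; i < ids.Count; i += maxLength)
            chunks.Add(ids.GetRange(i, Math.Min(maxLength, ids.Count - i)));
        return chunks;
    }

    // Postprocessing: wrap a subsequence as [CLS] ... [SEP].
    static List<int> PostProcess(List<int> ids)
    {
        var result = new List<int> { ClsId };
        result.AddRange(ids);
        result.Add(SepId);
        return result;
    }

    // Padding: append PadId until the subsequence reaches the expected size.
    static List<int> PadTo(List<int> ids, int size)
    {
        var result = new List<int>(ids);
        while (result.Count < size) result.Add(PadId);
        return result;
    }

    // Runs all six steps in order and returns one id list per subsequence.
    public static List<List<int>> Run(string input, int maxLength)
    {
        var pieces = PreTokenize(Normalize(input));
        var ids = Encode(pieces);
        return Truncate(ids, maxLength)
            // Two extra slots account for the [CLS] and [SEP] ids.
            .Select(chunk => PadTo(PostProcess(chunk), maxLength + 2))
            .ToList();
    }
}
```

With this toy vocabulary, ToyPipeline.Run("  Hello Tokenizer World  ", 2) yields two subsequences: [1, 4, 5, 2] and [1, 6, 2, 3]. The second one is shorter after truncation, so it receives one padding id.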