Class Tokenizer
This type is the entry point of the tokenization/detokenization pipeline. The pipeline is composed of six steps, and turns an input string into an IEncoding chain:
- Normalization Normalizes the input string before it is split into pieces. See INormalizer for more details.
- Pre-tokenization Splits the result of the normalization step into small pieces (example: split by whitespace).
- Encoding Central step of the tokenization. Turns each piece from the pre-tokenization step into a sequence of int ids. See IMapper for more details.
- Truncation Splits the sequence of ids from the encoding step into smaller subsequences. The most frequent truncation rule is "max length". See ITruncator for more details.
- Postprocessing Transforms each subsequence generated by the truncation step. The most common transformation is adding [CLS] and [SEP] tokens before and after the sequence. See IPostProcessor for more details.
- Padding Pads each subsequence from the postprocessing step to match the expected sequence size.
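The steps above can be illustrated with a small, self-contained sketch. It does not use the package's types: `ToyPipeline`, its vocabulary, and the lowercase/whitespace rules are all hypothetical stand-ins chosen for illustration, and the real behavior depends on the INormalizer, IPreTokenizer, IMapper, ITruncator, IPostProcessor, and IPadding instances passed to the Tokenizer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy stand-in for the six steps; none of these names belong to the package.
public static class ToyPipeline
{
    // Hypothetical vocabulary: ids 0-3 are reserved for special tokens.
    static readonly Dictionary<string, int> Vocab = new Dictionary<string, int>
    {
        ["[PAD]"] = 0, ["[UNK]"] = 1, ["[CLS]"] = 2, ["[SEP]"] = 3,
        ["hello"] = 4, ["world"] = 5, ["again"] = 6,
    };

    public static List<int[]> Run(string input, int maxLength)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        if (maxLength < 3) throw new ArgumentOutOfRangeException(nameof(maxLength));

        // 1. Normalization: trim and lowercase (assumed rule).
        string normalized = input.Trim().ToLowerInvariant();

        // 2. Pre-tokenization: split on whitespace.
        string[] pieces = normalized.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        // 3. Encoding: map each piece to an id, falling back to [UNK].
        List<int> ids = pieces
            .Select(p => Vocab.TryGetValue(p, out int id) ? id : Vocab["[UNK]"])
            .ToList();

        // 4. Truncation: cut into chunks, leaving room for [CLS] and [SEP].
        int chunkSize = maxLength - 2;
        var chunks = new List<List<int>>();
        for (int i = 0; i < ids.Count; i += chunkSize)
            chunks.Add(ids.GetRange(i, Math.Min(chunkSize, ids.Count - i)));
        if (chunks.Count == 0) chunks.Add(new List<int>());

        var result = new List<int[]>();
        foreach (var chunk in chunks)
        {
            // 5. Post-processing: wrap the chunk in [CLS] ... [SEP].
            var seq = new List<int> { Vocab["[CLS]"] };
            seq.AddRange(chunk);
            seq.Add(Vocab["[SEP]"]);

            // 6. Padding: fill up to maxLength with [PAD].
            while (seq.Count < maxLength) seq.Add(Vocab["[PAD]"]);
            result.Add(seq.ToArray());
        }
        return result;
    }
}
```

With this toy vocabulary, `ToyPipeline.Run("Hello world AGAIN foo", 5)` produces two sequences, `[2, 4, 5, 6, 3]` and `[2, 1, 3, 0, 0]`: the unknown word `foo` maps to `[UNK]`, and the second chunk is padded to length 5.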
Namespace: Unity.InferenceEngine.Tokenization
Assembly: Unity.InferenceEngine.Tokenization.dll
Syntax
public class Tokenizer : ITokenizer
Constructors
Tokenizer(IMapper, INormalizer, IPreTokenizer, IPostProcessor, ITruncator, IPadding, IDecoder, IEnumerable<TokenConfiguration>)
Initializes a new instance of the Tokenizer type.
Declaration
public Tokenizer(IMapper mapper, INormalizer normalizer = null, IPreTokenizer preTokenizer = null, IPostProcessor postProcessor = null, ITruncator truncator = null, IPadding paddingProcessor = null, IDecoder decoder = null, IEnumerable<TokenConfiguration> addedVocabulary = null)
Parameters
| Type | Name | Description |
|---|---|---|
| IMapper | mapper | The IMapper implementation used to turn strings into token ids. |
| INormalizer | normalizer | Normalizes portions of the input. |
| IPreTokenizer | preTokenizer | The pre-tokenization rules. |
| IPostProcessor | postProcessor | The post-processing of the token sequence. See IPostProcessor. |
| ITruncator | truncator | The truncation rules. See ITruncator. |
| IPadding | paddingProcessor | The padding rules. |
| IDecoder | decoder | Modifiers applied to the decoded token sequence. |
| IEnumerable<TokenConfiguration> | addedVocabulary | Special token configurations. |
Exceptions
| Type | Condition |
|---|---|
| ArgumentNullException | |
Methods
Decode(IReadOnlyList<int>, bool)
Turns a sequence of token ids into a string.
Declaration
public string Decode(IReadOnlyList<int> input, bool skipSpecialTokens = false)
Parameters
| Type | Name | Description |
|---|---|---|
| IReadOnlyList<int> | input | The sequence of token ids. |
| bool | skipSpecialTokens | If true, special tokens are omitted from the decoded string. |
Returns
| Type | Description |
|---|---|
| string | The decoded string. |
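To make the effect of skipSpecialTokens concrete, here is a minimal sketch that mirrors the shape of the Decode signature without using the package. `ToyDecoder`, its token table, the set of special ids, and the choice to join tokens with spaces are assumptions made for illustration only; the actual output depends on the vocabulary and the IDecoder configured on the Tokenizer.

```csharp
using System;
using System.Collections.Generic;

// Toy decoder; not the package's implementation.
public static class ToyDecoder
{
    // Hypothetical id-to-token table; a real vocabulary comes from the model.
    static readonly string[] Tokens = { "[PAD]", "[UNK]", "[CLS]", "[SEP]", "hello", "world" };

    // Ids treated as special tokens in this sketch.
    static readonly HashSet<int> SpecialIds = new HashSet<int> { 0, 1, 2, 3 };

    public static string Decode(IReadOnlyList<int> input, bool skipSpecialTokens = false)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));

        var words = new List<string>();
        foreach (int id in input)
        {
            // Drop special tokens only when the caller asks for it.
            if (skipSpecialTokens && SpecialIds.Contains(id)) continue;
            words.Add(Tokens[id]); // throws if the id is outside the toy table
        }
        return string.Join(" ", words);
    }
}
```

For the ids `{ 2, 4, 5, 3, 0 }`, this sketch returns `[CLS] hello world [SEP] [PAD]` by default and `hello world` when skipSpecialTokens is true.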
Encode(string, string, bool)
Turns inputA and, optionally, inputB into an IEncoding instance.
Declaration
public IEncoding Encode(string inputA, string inputB = null, bool addSpecialTokens = true)
Parameters
| Type | Name | Description |
|---|---|---|
| string | inputA | The main input to tokenize. Cannot be null. |
| string | inputB | An optional, secondary input to tokenize. |
| bool | addSpecialTokens | Tells whether special tokens must be added to the final IEncoding. |
Returns
| Type | Description |
|---|---|
| IEncoding | The tokenized value as an IEncoding instance. |
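The interaction between inputB and addSpecialTokens can be sketched as follows. This is a toy that returns a plain `int[]` rather than an IEncoding, and it assumes a BERT-style layout (`[CLS] A [SEP] B [SEP]`); `ToyPairEncoder` and its vocabulary are hypothetical, and the layout actually produced by the Tokenizer is determined by its IPostProcessor.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy pair encoder; not the package's implementation.
public static class ToyPairEncoder
{
    // Hypothetical vocabulary shared by both inputs.
    static readonly Dictionary<string, int> Vocab = new Dictionary<string, int>
    {
        ["[PAD]"] = 0, ["[UNK]"] = 1, ["[CLS]"] = 2, ["[SEP]"] = 3,
        ["hello"] = 4, ["world"] = 5, ["again"] = 6,
    };

    // Lowercase, split on whitespace, and map each word to an id.
    static IEnumerable<int> Ids(string text) =>
        text.ToLowerInvariant()
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => Vocab.TryGetValue(w, out int id) ? id : Vocab["[UNK]"]);

    public static int[] Encode(string inputA, string inputB = null, bool addSpecialTokens = true)
    {
        if (inputA == null) throw new ArgumentNullException(nameof(inputA));

        var ids = new List<int>();
        if (addSpecialTokens) ids.Add(Vocab["[CLS]"]);
        ids.AddRange(Ids(inputA));
        if (addSpecialTokens) ids.Add(Vocab["[SEP]"]);

        // The secondary input is appended only when it is provided.
        if (inputB != null)
        {
            ids.AddRange(Ids(inputB));
            if (addSpecialTokens) ids.Add(Vocab["[SEP]"]);
        }
        return ids.ToArray();
    }
}
```

In this sketch, `Encode("hello world", "again")` gives `[2, 4, 5, 3, 6, 3]`, while passing `addSpecialTokens: false` gives `[4, 5, 6]`.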