Class Tokenizer

This type is the entry point of the tokenization/detokenization pipeline. The pipeline is composed of six steps, and turns an input string into an IEncoding chain:

Normalization Normalizes portions of the input string before it is split. See INormalizer for more details.
Pre-tokenization Splits the result of the normalization step into small pieces (for example, by splitting on whitespace).
Encoding The central step of tokenization. Turns each piece from the pre-tokenization step into a sequence of int ids. See IMapper for more details.
Truncation Splits the sequence of ids from the encoding step into smaller subsequences. The most frequent truncation rule is "max length". See ITruncator for more details.
Postprocessing Transforms each subsequence generated by the truncation step. The most common transformation is adding [CLS] and [SEP] tokens before and after the sequence. See IPostProcessor for more details.
Padding Pads each subsequence from the postprocessing step to match the expected sequence size.
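
The steps above can be sketched with a small, self-contained model. This is not the Unity implementation: the ToyPipeline class, its vocabulary, the token ids, and the choice of lower-casing as the normalization rule are all assumptions made for illustration. For brevity, it keeps only the first truncated subsequence instead of building an IEncoding chain.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative stand-in for the six steps. None of these names belong to
// Unity.InferenceEngine.Tokenization.
public static class ToyPipeline
{
    // Hypothetical vocabulary: 0 = [PAD], 1 = [UNK], 2 = [CLS], 3 = [SEP].
    static readonly Dictionary<string, int> Vocab = new Dictionary<string, int>
    {
        ["[PAD]"] = 0, ["[UNK]"] = 1, ["[CLS]"] = 2, ["[SEP]"] = 3,
        ["hello"] = 4, ["world"] = 5
    };

    public static int[] Encode(string input, int maxLength, int paddedLength)
    {
        // 1. Normalization (assumed here: lower-casing).
        string normalized = input.ToLowerInvariant();

        // 2. Pre-tokenization: split on whitespace.
        string[] pieces = normalized.Split(
            new[] { ' ', '\t', '\n' }, StringSplitOptions.RemoveEmptyEntries);

        // 3. Encoding: map each piece to an id, falling back to [UNK].
        List<int> ids = pieces
            .Select(p => Vocab.TryGetValue(p, out int id) ? id : Vocab["[UNK]"])
            .ToList();

        // 4. Truncation: keep at most maxLength ids (the "max length" rule).
        ids = ids.Take(maxLength).ToList();

        // 5. Postprocessing: wrap the sequence in [CLS] ... [SEP].
        ids.Insert(0, Vocab["[CLS]"]);
        ids.Add(Vocab["[SEP]"]);

        // 6. Padding: append [PAD] until the expected size is reached.
        while (ids.Count < paddedLength)
            ids.Add(Vocab["[PAD]"]);

        return ids.ToArray();
    }
}
```

Under these assumptions, ToyPipeline.Encode("Hello brave world", 2, 6) yields the ids 2, 4, 1, 3, 0, 0: the unknown word maps to [UNK], the third word is truncated, and two [PAD] ids fill the sequence.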

Inheritance

object

Tokenizer

Implements

ITokenizer

Inherited Members

object.Equals(object)

object.Equals(object, object)

object.GetHashCode()

object.GetType()

object.MemberwiseClone()

object.ReferenceEquals(object, object)

object.ToString()

Namespace: Unity.InferenceEngine.Tokenization

Assembly: Unity.InferenceEngine.Tokenization.dll

Syntax

public class Tokenizer : ITokenizer

Constructors

Tokenizer(IMapper, INormalizer, IPreTokenizer, IPostProcessor, ITruncator, IPadding, IDecoder, IEnumerable<TokenConfiguration>)

Initializes a new instance of the Tokenizer type.

Declaration

public Tokenizer(IMapper mapper, INormalizer normalizer = null, IPreTokenizer preTokenizer = null, IPostProcessor postProcessor = null, ITruncator truncator = null, IPadding paddingProcessor = null, IDecoder decoder = null, IEnumerable<TokenConfiguration> addedVocabulary = null)

Parameters

Type	Name	Description
IMapper	mapper	The IMapper implementation used to turn strings into tokens.
INormalizer	normalizer	Normalizes portions of the input.
IPreTokenizer	preTokenizer	The pre-tokenization rules.
IPostProcessor	postProcessor	The post-processing of the token sequence. See IPostProcessor.
ITruncator	truncator	The truncation rules. See ITruncator.
IPadding	paddingProcessor	The padding rules.
IDecoder	decoder	Modifiers applied to the decoded token sequence.
IEnumerable<TokenConfiguration>	addedVocabulary	Special token configurations.

Exceptions

Type	Condition
ArgumentNullException	Thrown when mapper is null.

Methods

Decode(IReadOnlyList<int>, bool)

Turns a sequence of token ids into a string.

Declaration

public string Decode(IReadOnlyList<int> input, bool skipSpecialTokens = false)

Parameters

Type	Name	Description
IReadOnlyList<int>	input	The sequence of token ids.
bool	skipSpecialTokens	If true, special tokens are omitted from the decoded string.

Returns

Type	Description
string	The decoded string.
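
The effect of skipSpecialTokens can be illustrated with a toy decoder. This is a sketch under stated assumptions, not the library's decoding logic: the ToyDecoder class, the token table, the set of ids treated as special, and the space-joined output are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: a lookup-table decoder with a hypothetical vocabulary.
public static class ToyDecoder
{
    // Index in this array = token id (assumed layout).
    static readonly string[] Tokens =
        { "[PAD]", "[UNK]", "[CLS]", "[SEP]", "hello", "world" };

    // Ids treated as special tokens in this sketch.
    static readonly HashSet<int> SpecialIds = new HashSet<int> { 0, 1, 2, 3 };

    public static string Decode(IReadOnlyList<int> input, bool skipSpecialTokens = false)
    {
        // Optionally drop special ids before mapping ids back to text.
        IEnumerable<int> ids = skipSpecialTokens
            ? input.Where(id => !SpecialIds.Contains(id))
            : input;

        // Join tokens with a single space (a simplification; real decoders
        // may merge sub-word pieces differently).
        return string.Join(" ", ids.Select(id => Tokens[id]));
    }
}
```

With this table, decoding the ids 2, 4, 5, 3, 0 gives "[CLS] hello world [SEP] [PAD]" by default and "hello world" when skipSpecialTokens is true.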

Encode(string, string, bool)

Turns inputA and, optionally, inputB into an IEncoding instance.

Declaration

public IEncoding Encode(string inputA, string inputB = null, bool addSpecialTokens = true)

Parameters

Type	Name	Description
string	inputA	The main input to tokenize. Cannot be null.
string	inputB	An optional, secondary input to tokenize.
bool	addSpecialTokens	Whether special tokens are added to the final IEncoding.

Returns

Type	Description
IEncoding	The tokenized value as an IEncoding instance.
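
How inputB and addSpecialTokens might interact can be sketched as follows. The actual layout depends on the configured IPostProcessor; this example assumes a BERT-style template ([CLS] A [SEP] B [SEP]). The ToyPairTemplate class and the ids used for [CLS] and [SEP] are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: combines already-encoded ids for one or two inputs.
public static class ToyPairTemplate
{
    const int Cls = 2; // hypothetical id of [CLS]
    const int Sep = 3; // hypothetical id of [SEP]

    public static List<int> Build(
        IReadOnlyList<int> idsA,
        IReadOnlyList<int> idsB = null,
        bool addSpecialTokens = true)
    {
        // The main input is mandatory, mirroring "inputA cannot be null".
        if (idsA == null) throw new ArgumentNullException(nameof(idsA));

        var result = new List<int>();
        if (addSpecialTokens) result.Add(Cls);
        result.AddRange(idsA);
        if (addSpecialTokens) result.Add(Sep);

        // The secondary input is appended only when it is provided.
        if (idsB != null)
        {
            result.AddRange(idsB);
            if (addSpecialTokens) result.Add(Sep);
        }
        return result;
    }
}
```

Given idsA = { 4 } and idsB = { 5 }, this sketch produces 2, 4, 3, 5, 3 with special tokens and 4, 5 without them. With idsA alone it produces 2, 4, 3.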

Implements

ITokenizer