docs.unity3d.com

    Class Tokenizer

    This type is the entry point of the tokenization/detokenization pipeline. The pipeline is composed of six steps, and turns an input string into an IEncoding chain:

    1. Normalization: Normalizes portions of the input. See INormalizer for more details.
    2. Pre-tokenization: Splits the result of the normalization step into small pieces (example: split by whitespace).
    3. Encoding: The central step of the tokenization. It turns each piece from the pre-tokenization step into a sequence of int ids. See IMapper for more details.
    4. Truncation: Splits the sequence of ids from the encoding step into smaller subsequences. The most frequent truncation rule is "max length". See ITruncator for more details.
    5. Postprocessing: Transforms each subsequence generated by the truncation step. The most common transformation is adding [CLS] and [SEP] tokens before and after the sequence. See IPostProcessor for more details.
    6. Padding: Pads each subsequence from the postprocessing step to match the expected sequence size.
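
    The steps above can be sketched in plain C#. This is a toy illustration only, not the Tokenizer implementation: it does not use the Unity types, the vocabulary and the special-token ids are hypothetical, and it places a lowercase-only normalization step ahead of pre-tokenization, which is the step the pre-tokenization description refers to.

```csharp
using System;
using System.Collections.Generic;

public static class ToyPipeline
{
    // Hypothetical special-token ids and vocabulary, chosen for illustration only.
    const int Pad = 0, Unk = 1, Cls = 2, Sep = 3;

    static readonly Dictionary<string, int> Vocab = new Dictionary<string, int>
    {
        ["hello"] = 4, ["world"] = 5, ["again"] = 6
    };

    public static List<int[]> Run(string input, int maxLength, int paddedLength)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        if (maxLength <= 0) throw new ArgumentOutOfRangeException(nameof(maxLength));

        // Normalization: this sketch only lowercases the input.
        string normalized = input.ToLowerInvariant();

        // Pre-tokenization: split the normalized text on whitespace.
        string[] pieces = normalized.Split(
            new[] { ' ', '\t', '\n' }, StringSplitOptions.RemoveEmptyEntries);

        // Encoding: map each piece to an int id, falling back to Unk.
        var ids = new List<int>();
        foreach (string piece in pieces)
            ids.Add(Vocab.TryGetValue(piece, out int id) ? id : Unk);

        var result = new List<int[]>();
        for (int start = 0; start < ids.Count; start += maxLength)
        {
            // Truncation: cut the id sequence into subsequences of at most maxLength.
            List<int> chunk = ids.GetRange(start, Math.Min(maxLength, ids.Count - start));

            // Postprocessing: add [CLS] before and [SEP] after each subsequence.
            chunk.Insert(0, Cls);
            chunk.Add(Sep);

            // Padding: extend each subsequence up to paddedLength.
            while (chunk.Count < paddedLength)
                chunk.Add(Pad);

            result.Add(chunk.ToArray());
        }
        return result;
    }
}
```

    With this toy vocabulary, ToyPipeline.Run("Hello WORLD again foo", maxLength: 2, paddedLength: 5) yields two subsequences, [2, 4, 5, 3, 0] and [2, 6, 1, 3, 0]. In the real pipeline each stage is supplied through the matching constructor parameter rather than hard-coded.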
    Inheritance
    object
    Tokenizer
    Implements
    ITokenizer
    Inherited Members
    object.Equals(object)
    object.Equals(object, object)
    object.GetHashCode()
    object.GetType()
    object.MemberwiseClone()
    object.ReferenceEquals(object, object)
    object.ToString()
    Namespace: Unity.InferenceEngine.Tokenization
    Assembly: Unity.InferenceEngine.Tokenization.dll
    Syntax
    public class Tokenizer : ITokenizer

    Constructors

    Tokenizer(IMapper, INormalizer, IPreTokenizer, IPostProcessor, ITruncator, IPadding, IDecoder, IEnumerable<TokenConfiguration>)

    Initializes a new instance of the Tokenizer type.

    Declaration
    public Tokenizer(IMapper mapper, INormalizer normalizer = null, IPreTokenizer preTokenizer = null, IPostProcessor postProcessor = null, ITruncator truncator = null, IPadding paddingProcessor = null, IDecoder decoder = null, IEnumerable<TokenConfiguration> addedVocabulary = null)
    Parameters
    Type | Name | Description
    IMapper | mapper | The IMapper encoding to use to turn the strings into tokens.
    INormalizer | normalizer | Normalizes portions of the input.
    IPreTokenizer | preTokenizer | The pre-tokenization rules.
    IPostProcessor | postProcessor | The post-processing of the token sequence. See IPostProcessor.
    ITruncator | truncator | The truncation rules. See ITruncator.
    IPadding | paddingProcessor | The padding rules.
    IDecoder | decoder | Modifiers applied to the decoded token sequence.
    IEnumerable<TokenConfiguration> | addedVocabulary | Special token configurations.

    Exceptions
    Type | Condition
    ArgumentNullException | mapper cannot be null.
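
    As a minimal sketch of the constructor contract, the helper below relies only on what the declaration states: mapper is the single required argument, every other parameter defaults to null, and a null mapper raises ArgumentNullException. The class and method names are hypothetical, and the code needs the Unity.InferenceEngine.Tokenization assembly to compile.

```csharp
using System;
using Unity.InferenceEngine.Tokenization;

public static class TokenizerGuardExample
{
    // Returns true when the constructor rejects a null mapper, as documented.
    public static bool RejectsNullMapper()
    {
        try
        {
            // Only mapper is passed; the optional components keep their null defaults.
            _ = new Tokenizer(mapper: null);
            return false;
        }
        catch (ArgumentNullException)
        {
            return true;
        }
    }
}
```

    Because the remaining parameters are optional, named arguments such as truncator: or decoder: can be used to supply only the components a given model needs.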

    Methods

    Decode(IReadOnlyList<int>, bool)

    Turns a sequence of token ids into a string.

    Declaration
    public string Decode(IReadOnlyList<int> input, bool skipSpecialTokens = false)
    Parameters
    Type | Name | Description
    IReadOnlyList<int> | input | The sequence of token ids.
    bool | skipSpecialTokens | If true, the special tokens are not decoded.

    Returns
    Type | Description
    string | The decoded string.
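
    A short usage sketch for Decode, assuming an already configured Tokenizer instance. The wrapper class and method names are hypothetical; only the Decode call and its parameters come from the declaration above.

```csharp
using System.Collections.Generic;
using Unity.InferenceEngine.Tokenization;

public static class DecodeExample
{
    // `tokenizer` is assumed to be fully configured elsewhere;
    // `ids` would typically be token ids produced by a model.
    public static string ToText(Tokenizer tokenizer, IReadOnlyList<int> ids)
    {
        // skipSpecialTokens: true asks Decode not to decode the special tokens.
        return tokenizer.Decode(ids, skipSpecialTokens: true);
    }
}
```

    An int[] or a List<int> can be passed directly, since both implement IReadOnlyList<int>.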

    Encode(string, string, bool)

    Turns inputA, and optionally inputB, into an IEncoding instance.

    Declaration
    public IEncoding Encode(string inputA, string inputB = null, bool addSpecialTokens = true)
    Parameters
    Type | Name | Description
    string | inputA | The main input to tokenize. Cannot be null.
    string | inputB | An optional, secondary input to tokenize.
    bool | addSpecialTokens | Tells whether special tokens must be added to the final IEncoding.

    Returns
    Type | Description
    IEncoding | The tokenized value as an IEncoding instance.
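
    A short usage sketch for Encode, again assuming a configured Tokenizer instance. The wrapper names and the question/context roles are hypothetical. The members of IEncoding are not described on this page, so the results are only held in local variables here.

```csharp
using System;
using Unity.InferenceEngine.Tokenization;

public static class EncodeExample
{
    public static void EncodeInputs(Tokenizer tokenizer, string question, string context)
    {
        // inputA cannot be null, so check it before calling Encode.
        if (question == null) throw new ArgumentNullException(nameof(question));

        // Single input; addSpecialTokens keeps its default value of true.
        var single = tokenizer.Encode(question);

        // Pair of inputs, this time without special tokens in the result.
        var pair = tokenizer.Encode(question, context, addSpecialTokens: false);
    }
}
```

    Both calls return an IEncoding; see the IEncoding reference for how to read the ids out of it.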

    Implements

    ITokenizer
    Copyright © 2025 Unity Technologies — Trademarks and terms of use