
    Tokenization

    Use the built-in tokenizer to convert text into numerical tokens that can be used as input for models that process text.

    Optional

    The tokenizer is optional for Sentis. You can provide inputs from other sources if you prefer.

    The tokenizer is designed for compatibility with the Hugging Face tokenizers Python library. To configure it, use the tokenizer.json file available in most Hugging Face model repositories.
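
    The tokenizer.json file describes every stage of the pipeline in one place. The following abbreviated sketch shows its top-level structure; the field names follow the Hugging Face tokenizers format, and the values are illustrative:

    {
      "normalizer":     { "type": "BertNormalizer", "lowercase": true },
      "pre_tokenizer":  { "type": "BertPreTokenizer" },
      "model":          { "type": "WordPiece", "unk_token": "[UNK]", "continuing_subword_prefix": "##", "vocab": { "[PAD]": 0 } },
      "post_processor": { "type": "TemplateProcessing" },
      "decoder":        { "type": "WordPiece", "prefix": "##" },
      "added_tokens":   [ { "id": 101, "content": "[CLS]", "special": true } ]
    }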

    Tokenization workflow

    A tokenizer processes text through several steps. Not all steps are required for every model. A simplified worked example follows this list:

    Normalization

    Transforms the input string, such as replacing characters or applying Unicode normalization. This step outputs a new string. See normalizers.

    Pre-tokenization

    Splits the normalized string into smaller parts for token conversion. See pre-tokenizers.

    Models (token-to-ID conversion)

    Maps each substring to a unique integer ID. See models.

    Truncation

    Enforces maximum input length by splitting or trimming token sequences. See truncation.

    Padding

    Adds tokens to ensure sequences have a fixed length when required by the model. See padding.

    Post Processors

    Adds special tokens, such as separators or markers, to prepare the sequence for the model. See post processors.

    Decoders

    Converts token IDs back into text after inference. Decoding is separate from the encoding steps and is only used when interpreting model outputs. See decoders.
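
    Here is a minimal sketch of the encoding stages, using plain C# string operations to mimic what the pipeline components do. This is an illustrative simplification, not the package API; the vocabulary and special-token IDs are toy values that follow the usual BERT convention.

    using System;
    using System.Collections.Generic;
    using UnityEngine;

    class PipelineSketch : MonoBehaviour
    {
        void Start()
        {
            // Normalization: transform the input string (here, lowercasing only).
            string normalized = "Hello, World!".ToLowerInvariant();

            // Pre-tokenization: split the normalized string into smaller parts.
            string[] parts = normalized.Split(new[] { ' ', ',', '!' }, StringSplitOptions.RemoveEmptyEntries);

            // Token-to-ID conversion: map each part to an integer ID.
            var vocab = new Dictionary<string, int> { { "hello", 7592 }, { "world", 2088 } };
            var ids = new List<int>();
            foreach (var part in parts)
                ids.Add(vocab.TryGetValue(part, out var id) ? id : 100); // 100 = [UNK]

            // Truncation: enforce a maximum length, leaving room for two special tokens.
            const int maxLength = 8;
            if (ids.Count > maxLength - 2)
                ids.RemoveRange(maxLength - 2, ids.Count - (maxLength - 2));

            // Post-processing: decorate the sequence with [CLS] (101) and [SEP] (102).
            ids.Insert(0, 101);
            ids.Add(102);

            // Padding: extend the sequence to a fixed length with [PAD] (0).
            while (ids.Count < maxLength)
                ids.Add(0);

            Debug.Log(string.Join(", ", ids)); // 101, 7592, 2088, 102, 0, 0, 0, 0
        }
    }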

    Creating a tokenizer

    At minimum, tokenization requires token-to-ID conversion. Most text-based models also require additional steps such as normalization, pre-tokenization, or padding.

    The following sample implementation is included with the package and available in the Unity Package Manager.

    Encode input

    After initialization, the tokenizer converts text inputs into sequences of IDs that you can pass to Sentis.
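
    For example, using the CreateTokenizer and Encode helpers from the sample code below (the input name passed to the worker is hypothetical and depends on your model):

    var tokenizer = CreateTokenizer();
    using Tensor<int> inputIds = Encode(tokenizer, "The quick brown fox.");

    // Pass the tensor to your Sentis worker, for example:
    // worker.SetInput("input_ids", inputIds);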

    Decode output

    For text-based models, use the same tokenizer to decode the generated IDs back into readable text.
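
    For example, using the Decode helper from the sample code below, where outputIds is assumed to be a Tensor<int> read back from your model's output:

    string text = Decode(tokenizer, outputIds);
    Debug.Log(text);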

    Sample code

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using Unity.InferenceEngine;
    using Unity.InferenceEngine.Tokenization;
    using Unity.InferenceEngine.Tokenization.Decoders;
    using Unity.InferenceEngine.Tokenization.Mappers;
    using Unity.InferenceEngine.Tokenization.Normalizers;
    using Unity.InferenceEngine.Tokenization.Padding;
    using Unity.InferenceEngine.Tokenization.PostProcessors;
    using Unity.InferenceEngine.Tokenization.PostProcessors.Templating;
    using Unity.InferenceEngine.Tokenization.PreTokenizers;
    using Unity.InferenceEngine.Tokenization.Truncators;
    using UnityEngine;
    
    class TokenizerSample : MonoBehaviour
    {
        static Tensor<int> Encode(ITokenizer tokenizer, string input)
        {
            // Generate the token sequence from the input string.
            var encoding = tokenizer.Encode(input);

            // Then use the encoding to build your tensors.

            // Get the IDs.
            // Other sequences are also available, such as:
            // - the attention mask
            // - the type IDs
            // - the special tokens mask
            int[] ids = encoding.GetIds().ToArray();
    
            // Create a 3D tensor shape
            TensorShape shape = new TensorShape(1, 1, ids.Length);
    
            // Create a new tensor from the array
            return new Tensor<int>(shape, ids);
        }
    
        static string Decode(ITokenizer tokenizer, Tensor<int> tensor)
        {
            var ids = tensor.DownloadToArray();
            return tokenizer.Decode(ids);
        }
    
        static Dictionary<string, int> BuildVocabulary()
        {
            // This stub method returns the string-to-ID mapping for the tokenizer.
            // In a real application, it is usually built from a large configuration
            // JSON file (see the sketch after this sample).
            return new Dictionary<string, int>();
        }
    
        static TokenConfiguration[] GetAddedTokens()
        {
            // This stub method returns a collection of token configurations.
            // A token configuration is the Hugging Face equivalent of an added token.
            return Array.Empty<TokenConfiguration>();
        }
    
        /// <summary>
        /// This sample initializes a tokenizer based on all-MiniLM-L6-v2.
        /// </summary>
        public ITokenizer CreateTokenizer()
        {
            var vocabulary = BuildVocabulary();
            var addedTokens = GetAddedTokens();
    
            // Central step of the tokenizer: token-to-ID conversion.
            var mapper = new WordPieceMapper(vocabulary, "[UNK]", "##", 100);
    
    
            // Preliminary steps of the tokenization:
            // - normalization (transforms the input string)
            // - pre-tokenization (splits the input string)
    
            var normalizer = new BertNormalizer(
                cleanText: true,
                handleCjkChars: true,
                stripAccents: null,
                lowerCase: true);
    
            var preTokenizer = new BertPreTokenizer();
    
    
            // Final steps of tokenization:
            // - truncation (splits the token sequences)
            // - post-processing (decorates the token sequences)
            // - padding (adds tokens to match a sequence size).
    
            var truncator = new LongestFirstTruncator(new RightDirectionRangeGenerator(), 128, 0);
    
            var clsId = addedTokens.Where(tc => tc.Value == "[CLS]").Select(tc => tc.Id).FirstOrDefault();
            var sepId = addedTokens.Where(tc => tc.Value == "[SEP]").Select(tc => tc.Id).FirstOrDefault();
            var padId = addedTokens.Where(tc => tc.Value == "[PAD]").Select(tc => tc.Id).FirstOrDefault();
    
            var postProcessor = new TemplatePostProcessor(
              new(Template.Parse("[CLS]:0 $A:0 [SEP]:0")),
              new(Template.Parse("[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1")),
              new (string, int)[] { ("[CLS]", clsId), ("[SEP]", sepId) });
    
            var padding = new RightPadding(
              new FixedPaddingSizeProvider(128),
              new Token(padId, "[PAD]"));
    
    
            // Decoding.
    
            var decoder = new WordPieceDecoder("##", true);
    
    
            // Creates the tokenizer from all the components
            // initialized above.
    
            return new Tokenizer(
                mapper,
                normalizer: normalizer,
                preTokenizer: preTokenizer,
                truncator: truncator,
                postProcessor: postProcessor,
                paddingProcessor: padding,
                decoder: decoder,
                addedVocabulary: addedTokens);
        }
    }
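
    As noted above, BuildVocabulary is a stub. The following is a minimal sketch of building the mapping from a tokenizer.json file, assuming a WordPiece model and the Newtonsoft Json.NET package (com.unity.nuget.newtonsoft-json):

    using System.Collections.Generic;
    using System.IO;
    using Newtonsoft.Json.Linq;

    static class VocabularyLoader
    {
        // Reads the "model.vocab" object of a Hugging Face tokenizer.json file,
        // which maps each token string to its integer ID.
        public static Dictionary<string, int> BuildVocabulary(string tokenizerJsonPath)
        {
            var root = JObject.Parse(File.ReadAllText(tokenizerJsonPath));
            var vocab = (JObject)root["model"]["vocab"];

            var result = new Dictionary<string, int>();
            foreach (var pair in vocab)
                result[pair.Key] = pair.Value.Value<int>();
            return result;
        }
    }

    The added_tokens array in the same file provides the data needed by GetAddedTokens.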
    