
    Class MetaspacePreTokenizer

    A pre-tokenizer that replaces spaces with a special character (metaspace) and optionally splits the input text at these metaspace boundaries. This is commonly used in SentencePiece-based tokenizers.

    Inheritance
    object
    MetaspacePreTokenizer
    Implements
    IPreTokenizer
    Inherited Members
    object.Equals(object)
    object.Equals(object, object)
    object.GetHashCode()
    object.GetType()
    object.MemberwiseClone()
    object.ReferenceEquals(object, object)
    object.ToString()
    Namespace: Unity.InferenceEngine.Tokenization.PreTokenizers
    Assembly: Unity.InferenceEngine.Tokenization.dll
    Syntax
    public class MetaspacePreTokenizer : IPreTokenizer
    Remarks

    The metaspace character (default: U+2581 '▁') is used to preserve information about whitespace in the original text while treating it as a regular character during tokenization. This allows the tokenizer to distinguish between "hello world" and "helloworld".
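
    The replacement step described above can be sketched in plain C#. This is a minimal illustration that does not use the Unity API: the MetaspaceDemo class and its ReplaceSpaces helper are hypothetical names introduced here, and only the space-to-metaspace substitution is shown.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Unity.InferenceEngine.
public static class MetaspaceDemo
{
    // U+2581 LOWER ONE EIGHTH BLOCK, the default metaspace character.
    public const char Metaspace = '\u2581';

    // Replaces every space with the metaspace character so that
    // whitespace survives as an ordinary character.
    public static string ReplaceSpaces(string text, char replacement = Metaspace)
    {
        return text.Replace(' ', replacement);
    }

    public static void Main()
    {
        // The two inputs stay distinguishable after replacement.
        Console.WriteLine(ReplaceSpaces("hello world")); // hello▁world
        Console.WriteLine(ReplaceSpaces("helloworld"));  // helloworld
    }
}
```

    Because the space is kept as a visible character rather than discarded, the original spacing can be recovered later by mapping the metaspace character back to a space.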

    Constructors

    MetaspacePreTokenizer(char, PrependScheme, bool)

    Initializes a new instance of the MetaspacePreTokenizer class.

    Declaration
    public MetaspacePreTokenizer(char replacement = '▁', PrependScheme prependScheme = PrependScheme.Always, bool split = true)
    Parameters
    Type Name Description
    char replacement

    The character to use as a replacement for spaces. Default is U+2581 ('▁'), the lower one eighth block Unicode character commonly used in SentencePiece.

    PrependScheme prependScheme

    The scheme that controls whether the replacement character is prepended to the input. Default is PrependScheme.Always.

    bool split

    If true, splits the input text at metaspace character boundaries. If false, returns the entire processed text as a single token. Default is true.
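
    The effect of the split parameter can be approximated with a small standalone sketch. It assumes the convention used by SentencePiece-style pre-tokenizers, where each piece begins at a metaspace character; the actual MetaspacePreTokenizer may differ in detail. MetaspaceSplitSketch and its PreTokenize method are hypothetical, work on plain strings rather than SubString, and model prepending as a simple bool (true standing in for PrependScheme.Always; other schemes are not modeled).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch for illustration only; not the Unity implementation.
public static class MetaspaceSplitSketch
{
    public static List<string> PreTokenize(
        string text, char replacement = '\u2581', bool prepend = true, bool split = true)
    {
        // Step 1: replace spaces with the metaspace character.
        string processed = text.Replace(' ', replacement);

        // Step 2: optionally prepend the metaspace character
        // (assumed to be skipped when the text already starts with it).
        if (prepend && (processed.Length == 0 || processed[0] != replacement))
        {
            processed = replacement + processed;
        }

        var pieces = new List<string>();

        // Step 3a: without splitting, the whole processed text is one piece.
        if (!split)
        {
            pieces.Add(processed);
            return pieces;
        }

        // Step 3b: start a new piece at every metaspace character.
        int start = 0;
        for (int i = 1; i < processed.Length; i++)
        {
            if (processed[i] == replacement)
            {
                pieces.Add(processed.Substring(start, i - start));
                start = i;
            }
        }
        if (start < processed.Length)
        {
            pieces.Add(processed.Substring(start));
        }
        return pieces;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" | ", PreTokenize("hello world")));
        // ▁hello | ▁world
        Console.WriteLine(string.Join(" | ", PreTokenize("hello world", split: false)));
        // ▁hello▁world
    }
}
```

    Keeping the metaspace character at the start of each piece, rather than dropping it, is what lets a later stage tell a word-initial piece apart from a word continuation.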

    Methods

    PreTokenize(SubString, Output<SubString>)

    Splits the input into smaller parts (pre-tokens) ahead of tokenization.

    Declaration
    public void PreTokenize(SubString input, Output<SubString> output)
    Parameters
    Type Name Description
    SubString input

    The source to pre-cut.

    Output<SubString> output

    Target collection of generated pre-tokenized strings.

    Implements

    IPreTokenizer