    Class WhitespacePreTokenizer

    A pre-tokenizer that splits text into word tokens and non-word, non-whitespace tokens. This implementation matches the behavior of the regular expression pattern "\w+|[^\w\s]+".

    Inheritance
    object
    WhitespacePreTokenizer
    Implements
    IPreTokenizer
    Inherited Members
    object.Equals(object)
    object.Equals(object, object)
    object.GetHashCode()
    object.GetType()
    object.MemberwiseClone()
    object.ReferenceEquals(object, object)
    object.ToString()
    Namespace: Unity.InferenceEngine.Tokenization.PreTokenizers
    Assembly: Unity.InferenceEngine.Tokenization.dll
    Syntax
    public class WhitespacePreTokenizer : IPreTokenizer
    Remarks

    The pre-tokenizer scans the input in two modes:

    • Word mode: Captures sequences of word characters (letters, digits, and underscores)
    • Symbol mode: Captures sequences of non-word, non-whitespace characters (punctuation, special characters)

    Whitespace characters are skipped and not included in the output tokens.
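The two modes described above can be sketched as a single left-to-right scan. This is only an illustration over plain `string` values, not the library's implementation: the class `WhitespaceSplitSketch` and its `Split` method are hypothetical names, and `IsWordChar` only approximates the regex class `\w` (letters, digits, underscore).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Unity.InferenceEngine.Tokenization.
public static class WhitespaceSplitSketch
{
    // Approximation of \w: letters, digits, and underscore.
    static bool IsWordChar(char c) => char.IsLetterOrDigit(c) || c == '_';

    public static List<string> Split(string text)
    {
        var tokens = new List<string>();
        int i = 0;
        while (i < text.Length)
        {
            // Whitespace is skipped and never emitted.
            if (char.IsWhiteSpace(text[i])) { i++; continue; }

            // The first character of a run decides the mode:
            // word mode (true) or symbol mode (false).
            bool wordMode = IsWordChar(text[i]);
            int start = i;

            // Extend the run while characters stay in the same mode.
            while (i < text.Length
                   && !char.IsWhiteSpace(text[i])
                   && IsWordChar(text[i]) == wordMode)
            {
                i++;
            }
            tokens.Add(text.Substring(start, i - start));
        }
        return tokens;
    }
}
```

A run ends whenever the scan reaches whitespace or a character of the other class, so adjacent punctuation such as `!?` stays together as one symbol token.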

    Examples

    Input: "Hello, World! Test-123"
    Output: ["Hello", ",", "World", "!", "Test", "-", "123"]
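Since the class is documented as matching the pattern "\w+|[^\w\s]+", the example can be cross-checked with the standard .NET `Regex` class. The sketch below assumes plain `string` input rather than the library's `SubString` type, and `RegexSplitSketch` is a hypothetical name used only for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Unity.InferenceEngine.Tokenization.
public static class RegexSplitSketch
{
    // Same pattern as quoted in the class description.
    static readonly Regex Pattern = new Regex(@"\w+|[^\w\s]+");

    public static List<string> Split(string text)
    {
        var tokens = new List<string>();
        // Each match is either a run of word characters or a run of
        // non-word, non-whitespace characters; whitespace never matches.
        foreach (Match m in Pattern.Matches(text))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }
}
```

Calling `RegexSplitSketch.Split("Hello, World! Test-123")` yields the seven tokens listed in the example. Note that .NET's `\w` is Unicode-aware by default, so behavior on non-ASCII input may differ from regex engines that treat `\w` as ASCII-only.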

    Methods

    PreTokenize(SubString, Output<SubString>)

    Pre-cuts the input into smaller parts: runs of word characters and runs of non-word, non-whitespace characters. Whitespace is skipped.

    Declaration
    public void PreTokenize(SubString input, Output<SubString> output)
    Parameters
    Type                 Name      Description
    SubString            input     The source to pre-cut.
    Output<SubString>    output    Target collection of generated pre-tokenized strings.

    Implements

    IPreTokenizer
    Copyright © 2026 Unity Technologies — Trademarks and terms of use