Class WhitespacePreTokenizer

A pre-tokenizer that splits text into word tokens and non-word, non-whitespace tokens. This implementation matches the behavior of the regular expression pattern "\w+|[^\w\s]+".

Inheritance

object

WhitespacePreTokenizer

Implements

IPreTokenizer

Inherited Members

object.Equals(object)

object.Equals(object, object)

object.GetHashCode()

object.GetType()

object.MemberwiseClone()

object.ReferenceEquals(object, object)

object.ToString()

Namespace: Unity.InferenceEngine.Tokenization.PreTokenizers

Assembly: Unity.InferenceEngine.Tokenization.dll

Syntax

public class WhitespacePreTokenizer : IPreTokenizer

Remarks

The pre-tokenizer operates in two modes:

Word mode: Captures each maximal run of word characters (letters, digits, and underscores) as one token.
Symbol mode: Captures each maximal run of non-word, non-whitespace characters (punctuation and other special characters) as one token.

Whitespace characters are skipped and not included in the output tokens.
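The two modes can be sketched as a hand-written scanner. This is only an illustration of the behavior described above, not the class's actual implementation: the names TwoModeScannerSketch, IsWordChar, and Split are hypothetical, it works on plain strings rather than SubString, and char.IsLetterOrDigit only approximates the regex class \w (it ignores combining marks, other connector punctuation, and surrogate pairs).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: not part of Unity.InferenceEngine.Tokenization.
public static class TwoModeScannerSketch
{
    // Approximation of \w: letters, digits, and the underscore.
    static bool IsWordChar(char c) => char.IsLetterOrDigit(c) || c == '_';

    public static List<string> Split(string text)
    {
        var tokens = new List<string>();
        int i = 0;
        while (i < text.Length)
        {
            // Whitespace separates tokens and is never emitted.
            if (char.IsWhiteSpace(text[i])) { i++; continue; }

            int start = i;
            // The first character of a run selects the mode for that run.
            bool wordMode = IsWordChar(text[i]);

            // Extend the run while characters stay in the same class.
            while (i < text.Length
                   && !char.IsWhiteSpace(text[i])
                   && IsWordChar(text[i]) == wordMode)
            {
                i++;
            }
            tokens.Add(text.Substring(start, i - start));
        }
        return tokens;
    }
}
```

A switch between word and symbol characters ends the current token even without whitespace in between, so "Test-123" splits into "Test", "-", and "123".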

Examples

Input: "Hello, World! Test-123"

Output: ["Hello", ",", "World", "!", "Test", "-", "123"]
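Because the class is documented as matching the pattern "\w+|[^\w\s]+", the example can be reproduced with the standard .NET Regex class. The sketch below is a reference for comparison only: it does not call WhitespacePreTokenizer, and RegexPreTokenizeSketch and Split are hypothetical names introduced here.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: applies the documented pattern with System.Text.RegularExpressions.
public static class RegexPreTokenizeSketch
{
    // \w+       : one or more word characters (word mode)
    // [^\w\s]+  : one or more characters that are neither word nor whitespace (symbol mode)
    static readonly Regex Pattern = new Regex(@"\w+|[^\w\s]+");

    public static List<string> Split(string text)
    {
        var tokens = new List<string>();
        // Matches are returned left to right; whitespace matches neither branch and is dropped.
        foreach (Match m in Pattern.Matches(text))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }
}
```

For the input "Hello, World! Test-123", Split returns the seven tokens listed in the example above.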

Methods

PreTokenize(SubString, Output<SubString>)

Splits the input into word and symbol pre-tokens, skipping whitespace, and adds them to the output.

Declaration

public void PreTokenize(SubString input, Output<SubString> output)

Parameters

Type	Name	Description
SubString	input	The source to pre-cut.
Output<SubString>	output	Target collection of generated pre-tokenized strings.

Implements

IPreTokenizer