Class StripAccentsNormalizer

A text normalizer that removes Unicode combining mark characters from input strings. Combining marks include diacritical marks, accents, and other modifying characters that typically combine with base characters.

Inheritance

object

StripAccentsNormalizer

Implements

INormalizer

Inherited Members

object.Equals(object)

object.Equals(object, object)

object.GetHashCode()

object.GetType()

object.MemberwiseClone()

object.ReferenceEquals(object, object)

object.ToString()

Namespace: Unity.InferenceEngine.Tokenization.Normalizers

Assembly: Unity.InferenceEngine.Tokenization.dll

Syntax

public class StripAccentsNormalizer : INormalizer

Remarks

This normalizer is useful in tokenization pipelines where diacritical marks and accents need to be removed for text processing, such as standardizing text for comparison, simplifying text for machine learning models, or converting accented characters to their base forms.
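As an illustration of what removing combining marks means, the sketch below filters out code points in the Unicode mark categories (Mn, Mc, Me) using only standard .NET APIs. It is not the implementation of StripAccentsNormalizer; the class `StripAccentsSketch` and the helper `StripCombiningMarks` are hypothetical names introduced for this example. This page does not state whether the normalizer decomposes precomposed characters such as "é" itself, so the sketch decomposes explicitly with NFD first so that the accent becomes a separate combining mark.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical illustration only; not part of Unity.InferenceEngine.Tokenization.
public static class StripAccentsSketch
{
    // Returns a copy of the input without code points whose Unicode general
    // category is a mark (Mn, Mc, or Me).
    public static string StripCombiningMarks(string input)
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            // Treat a surrogate pair as a single code point.
            int length = char.IsSurrogatePair(input, i) ? 2 : 1;
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(input, i);

            bool isMark = category == UnicodeCategory.NonSpacingMark
                       || category == UnicodeCategory.SpacingCombiningMark
                       || category == UnicodeCategory.EnclosingMark;

            if (!isMark)
            {
                sb.Append(input, i, length);
            }

            i += length - 1;
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // NFD splits "é" (U+00E9) into "e" followed by U+0301 (combining acute accent).
        string decomposed = "caf\u00E9".Normalize(NormalizationForm.FormD);
        Console.WriteLine(StripCombiningMarks(decomposed)); // cafe
    }
}
```

Without the decomposition step, a precomposed "é" is a single letter code point rather than a base character plus a mark, so a filter that only drops combining marks would leave it unchanged.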

Methods

Normalize(SubString)

Removes combining mark characters from the input string. In a tokenization pipeline, this normalization step runs before pre-tokenization.

Declaration

public SubString Normalize(SubString input)

Parameters

Type	Name	Description
SubString	input	The string to transform.

Returns

Type	Description
SubString	The input string with combining mark characters removed.

Implements

INormalizer