Class StripAccentsNormalizer
A text normalizer that removes Unicode combining mark characters from input strings. Combining marks include diacritical marks, accents, and other modifying characters that typically combine with base characters.
Implements
INormalizer
Namespace: Unity.InferenceEngine.Tokenization.Normalizers
Assembly: Unity.InferenceEngine.Tokenization.dll
Syntax
public class StripAccentsNormalizer : INormalizer
Remarks
Use this normalizer in tokenization pipelines that need diacritical marks and accents removed, for example to standardize text for comparison, to simplify input for machine learning models, or to reduce accented characters to their base forms.
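To illustrate what removing combining marks means, here is a minimal sketch that uses only standard .NET APIs (`string.Normalize`, `CharUnicodeInfo.GetUnicodeCategory`). It is not the implementation of `StripAccentsNormalizer`; the class `StripAccentsSketch`, the method `Strip`, and the `decomposeFirst` parameter are hypothetical names introduced for this example. The sketch assumes that a precomposed character such as `é` (U+00E9) contains no separate combining mark until it is decomposed, so it optionally applies canonical decomposition (NFD) first. Whether the library normalizer decomposes its input is not stated here.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical helper for illustration only; not part of Unity.InferenceEngine.
public static class StripAccentsSketch
{
    // Removes combining marks (Unicode categories Mn, Mc, Me) from text.
    // decomposeFirst is an assumption of this sketch: when true, the text is
    // converted to NFD so that precomposed letters split into base + mark.
    public static string Strip(string text, bool decomposeFirst = true)
    {
        string source = decomposeFirst
            ? text.Normalize(NormalizationForm.FormD)
            : text;

        var builder = new StringBuilder(source.Length);
        foreach (char c in source)
        {
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);
            bool isCombiningMark =
                category == UnicodeCategory.NonSpacingMark ||
                category == UnicodeCategory.SpacingCombiningMark ||
                category == UnicodeCategory.EnclosingMark;

            // Keep everything except combining marks. Surrogate halves report
            // the Surrogate category, so characters outside the Basic
            // Multilingual Plane are always kept by this simple loop.
            if (!isCombiningMark)
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }
}
```

With this sketch, `StripAccentsSketch.Strip("e\u0301", decomposeFirst: false)` drops the combining acute accent and returns `"e"`, while `StripAccentsSketch.Strip("\u00E9", decomposeFirst: false)` returns the input unchanged because the precomposed character is a single letter rather than a base character followed by a mark.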
Methods
Normalize(SubString)
Applies transformations to the input string before pre-tokenization.
Declaration
public SubString Normalize(SubString input)
Parameters
| Type | Name | Description |
|---|---|---|
| SubString | input | The string to transform. |
Returns
| Type | Description |
|---|---|
| SubString | The resulting string. |