Class UnigramMapper
Implements a unigram-based tokenization mapper that converts text into tokens using a vocabulary-based approach. This mapper supports byte-level fallback for handling out-of-vocabulary characters.
Implements
IMapper
Namespace: Unity.InferenceEngine.Tokenization.Mappers
Assembly: Unity.InferenceEngine.Tokenization.dll
Syntax
public class UnigramMapper : IMapper
Constructors
UnigramMapper(IReadOnlyList<UnigramVocabEntry>, int, bool)
Initializes a new instance of the UnigramMapper class with the specified vocabulary, unknown-token ID, and byte-fallback setting.
Declaration
public UnigramMapper(IReadOnlyList<UnigramVocabEntry> vocab, int unkId = -1, bool byteFallback = false)
Parameters
| Type | Name | Description |
|---|---|---|
| IReadOnlyList<UnigramVocabEntry> | vocab | The vocabulary containing token entries with their scores. |
| int | unkId | The ID of the unknown token in the vocabulary. Defaults to -1. |
| bool | byteFallback | Whether to enable byte-level fallback for out-of-vocabulary characters. Defaults to false. |
Exceptions
| Type | Condition |
|---|---|
| ArgumentNullException | Thrown when vocab is null. |
| ArgumentOutOfRangeException | Thrown when unkId is out of range. |
Methods
IdToToken(int)
Gets the token value for the specified ID.
Declaration
public string IdToToken(int id)
Parameters
| Type | Name | Description |
|---|---|---|
| int | id | The ID of the requested token. |
Returns
| Type | Description |
|---|---|
| string | The token value. |
Exceptions
| Type | Condition |
|---|---|
| ArgumentOutOfRangeException | Thrown when id is out of range. |
TokenToId(string, out int)
Gets the ID of the specified token.
Declaration
public bool TokenToId(string token, out int id)
Parameters
| Type | Name | Description |
|---|---|---|
| string | token | The token whose ID is requested. |
| int | id | The ID of the specified token. |
Returns
| Type | Description |
|---|---|
| bool | Whether the token exists. |
Exceptions
| Type | Condition |
|---|---|
| ArgumentNullException | Thrown when token is null. |
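The two lookups can be combined into a round trip. The following is a minimal sketch that uses only the IdToToken and TokenToId signatures documented on this page. The class name UnigramMapperLookupExample and the method RoundTrip are hypothetical, and the mapper is assumed to have been constructed elsewhere.

```csharp
using System;
using Unity.InferenceEngine.Tokenization.Mappers;

// Hypothetical helper class, not part of the package.
public static class UnigramMapperLookupExample
{
    // Looks up the ID of `token`, then maps that ID back to its token value.
    // Returns null when TokenToId reports that the token is not in the vocabulary.
    public static string RoundTrip(UnigramMapper mapper, string token)
    {
        if (mapper == null)
            throw new ArgumentNullException(nameof(mapper));

        // TokenToId follows the Try-pattern: the bool result says whether `id` is usable.
        if (!mapper.TokenToId(token, out int id))
            return null;

        return mapper.IdToToken(id);
    }
}
```

Checking the return value of TokenToId before using id avoids passing an invalid ID to IdToToken.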
Tokenize(IReadOnlyList<SubString>, Output<Token>)
Tokenizes a list of string values.
Declaration
public void Tokenize(IReadOnlyList<SubString> input, Output<Token> output)
Parameters
| Type | Name | Description |
|---|---|---|
| IReadOnlyList<SubString> | input | The list of string values to tokenize. |
| Output<Token> | output | The recipient of the converted tokens. |
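To illustrate the general idea behind unigram tokenization with byte-level fallback, here is a self-contained sketch in plain C#. It is not the UnigramMapper implementation and uses none of the package types. The names UnigramSketch, Segment, and unkPenalty, the "<unk>" placeholder, and the SentencePiece-style `<0xNN>` byte-token format are all assumptions made for this example. The sketch picks the segmentation with the highest total score using a Viterbi search, and handles characters that no vocabulary piece covers either as a single unknown token or as one token per UTF-8 byte.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Illustrative only: a generic unigram segmenter, not the package's implementation.
public static class UnigramSketch
{
    // Returns the piece sequence with the highest total score.
    // `scores` maps each vocabulary piece to its score (for example, a log-probability).
    public static List<string> Segment(
        string text,
        IReadOnlyDictionary<string, double> scores,
        string unkToken = "<unk>",      // assumed placeholder for unknown characters
        bool byteFallback = false,
        double unkPenalty = -100.0)     // assumed score for an uncovered character
    {
        int n = text.Length;
        var best = new double[n + 1];   // best[i]: best score of any segmentation of text[0..i)
        var prev = new int[n + 1];      // prev[i]: start index of the last piece on that path
        for (int i = 1; i <= n; i++) { best[i] = double.NegativeInfinity; prev[i] = -1; }

        for (int end = 1; end <= n; end++)
        {
            for (int start = 0; start < end; start++)
            {
                if (double.IsNegativeInfinity(best[start])) continue;

                string piece = text.Substring(start, end - start);
                // A surrogate pair counts as one character for the unknown-character case.
                int charLen = char.IsSurrogatePair(text, start) ? 2 : 1;

                double score;
                if (scores.TryGetValue(piece, out double known)) score = known;
                else if (end - start == charLen) score = unkPenalty;
                else continue;

                if (best[start] + score > best[end])
                {
                    best[end] = best[start] + score;
                    prev[end] = start;
                }
            }
        }

        // Walk the back-pointers from the end of the text, then reverse the result.
        var pieces = new List<string>();
        for (int end = n; end > 0; end = prev[end])
        {
            string piece = text.Substring(prev[end], end - prev[end]);
            if (scores.ContainsKey(piece))
            {
                pieces.Add(piece);
            }
            else if (byteFallback)
            {
                // Emit one token per UTF-8 byte, in reverse because the list is reversed below.
                byte[] bytes = Encoding.UTF8.GetBytes(piece);
                for (int b = bytes.Length - 1; b >= 0; b--)
                    pieces.Add($"<0x{bytes[b]:X2}>");
            }
            else
            {
                pieces.Add(unkToken);
            }
        }
        pieces.Reverse();
        return pieces;
    }
}
```

With a vocabulary containing "hello" (score -3.0), "he" (-2.0), and "llo" (-2.0), calling Segment("hellox", scores) yields "hello" followed by "<unk>", while passing byteFallback: true yields "hello" followed by "<0x78>", the UTF-8 byte of "x".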