Struct BpeMapperOptions
Configuration settings for the Byte Pair Encoding (BPE) mapper used in tokenization.
Namespace: Unity.InferenceEngine.Tokenization.Mappers
Assembly: Unity.InferenceEngine.Tokenization.dll
Syntax
public struct BpeMapperOptions
Fields
ByteFallback
Gets or sets a value indicating whether to fall back to byte-level encoding when encountering characters that cannot be tokenized normally.
Declaration
public bool? ByteFallback
Field Value
| Type | Description |
|---|---|
| bool? | true to fall back to byte-level encoding for characters missing from the vocabulary, or null to use the mapper's default. |
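The behavior this option controls can be sketched outside the Unity API. The following Python fragment is a language-agnostic illustration (the function name and token format are assumptions, not part of Unity.InferenceEngine): characters absent from the vocabulary are re-encoded as one SentencePiece-style `<0xNN>` token per UTF-8 byte, so no input ever maps to the unknown token.

```python
def encode_with_byte_fallback(text, vocab):
    """Map each character to a vocabulary token, falling back to
    one <0xNN> byte token per UTF-8 byte for unknown characters.
    Illustrative sketch only -- not the Unity implementation."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            # Emit one "<0xNN>" token per UTF-8 byte of the character.
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens
```

For example, with a vocabulary containing only `"a"`, the euro sign (three UTF-8 bytes) falls back to three byte tokens.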
DropOut
Gets or sets the dropout rate applied during BPE merge operations. When set, each merge is randomly skipped with this probability (BPE-dropout), producing varied segmentations of the same word that act as a regularizer during training.
Declaration
public float? DropOut
Field Value
| Type | Description |
|---|---|
| float? | A float value between 0.0 and 1.0 representing the dropout probability, or null to disable dropout. |
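BPE-dropout can be illustrated with a small Python sketch, independent of the Unity API. All names below are assumptions, and the greedy left-to-right merge loop is a simplification of real rank-ordered BPE; the point is only how a dropout probability randomizes which merges are applied.

```python
import random

def bpe_with_dropout(word, merges, dropout=0.0, rng=None):
    """Apply BPE merges to a word, skipping each candidate merge with
    probability `dropout`. Simplified sketch: real BPE applies merges
    in learned rank order, not greedily left to right."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    symbols = list(word)
    changed = True
    while changed:
        changed = False
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            if pair in merges and rng.random() >= dropout:
                # Merge the adjacent pair into one symbol.
                symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
                changed = True
                break  # restart the scan after each merge
    return symbols
```

With `dropout=0.0` segmentation is deterministic; with `dropout=1.0` every merge is skipped and the word stays split into characters. Intermediate values yield a mix of coarse and fine segmentations.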
FuseUnknown
Gets or sets a value indicating whether to fuse consecutive unknown tokens into a single unknown token.
Declaration
public bool? FuseUnknown
Field Value
| Type | Description |
|---|---|
| bool? | true to fuse consecutive unknown tokens into a single unknown token, or null to use the mapper's default. |
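The fusing step itself is simple enough to sketch in a few lines of Python (an illustration of the concept, not the Unity implementation; the function name and `"[UNK]"` default are assumptions): runs of consecutive unknown tokens collapse into one.

```python
def fuse_unknown(tokens, unk="[UNK]"):
    """Collapse runs of consecutive unknown tokens into a single one."""
    fused = []
    for tok in tokens:
        if tok == unk and fused and fused[-1] == unk:
            continue  # an unknown token is already at the tail; skip
        fused.append(tok)
    return fused
```

This keeps an untokenizable span from inflating the sequence with one unknown token per character.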
IgnoreMerges
Gets or sets a value indicating whether to output words directly when they already appear in the vocabulary, bypassing the merge step. Not yet implemented.
Declaration
public bool? IgnoreMerges
Field Value
| Type | Description |
|---|---|
| bool? | true to emit in-vocabulary words directly without applying merges, or null to use the mapper's default. |
SubWordPrefix
Gets or sets the prefix string added to subword tokens to distinguish them from complete words.
Declaration
public string SubWordPrefix
Field Value
| Type | Description |
|---|---|
| string | A string prefix (commonly "##" or "@@") added to subword tokens, or null if no prefix is applied. |
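A minimal Python sketch of the convention (assuming WordPiece-style marking, where continuation pieces are prefixed and the word-initial piece is left bare; not the Unity implementation):

```python
def apply_subword_prefix(pieces, prefix="##"):
    """Prefix every piece after the first so that, e.g., the "ization"
    in "tokenization" is distinguishable from a standalone word."""
    if not pieces:
        return []
    return [pieces[0]] + [prefix + piece for piece in pieces[1:]]
```

The prefix lets a detokenizer tell where words begin: pieces carrying the prefix are glued to the preceding piece, bare pieces start a new word.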
UnknownToken
Gets or sets the token string used to represent unknown or out-of-vocabulary words.
Declaration
public string UnknownToken
Field Value
| Type | Description |
|---|---|
| string | A string representing the unknown token (commonly "<unk>" or "[UNK]"), or null if no unknown token is set. |
WordSuffix
Gets or sets the suffix string added to word tokens to mark word boundaries.
Declaration
public string WordSuffix
Field Value
| Type | Description |
|---|---|
| string | A string suffix (commonly "</w>" or another boundary marker) added to word tokens, or null if no suffix is applied. |