Optimization Guidelines
Loop Vectorization
Loop vectorization is one of the ways that Burst improves performance. Let's say you have code like this:
[MethodImpl(MethodImplOptions.NoInlining)]
private static unsafe void Bar([NoAlias] int* a, [NoAlias] int* b, int count)
{
    for (var i = 0; i < count; i++)
    {
        a[i] += b[i];
    }
}

public static unsafe void Foo(int count)
{
    var a = stackalloc int[count];
    var b = stackalloc int[count];
    Bar(a, b, count);
}
The compiler is able to convert that scalar loop in Bar into a vectorized loop. Instead of looping over a single value at a time, the compiler generates code that loops over multiple values at the same time, producing faster code essentially for free. Here is the x64 assembly generated for AVX2 for the loop in Bar above:
.LBB1_4:
vmovdqu ymm0, ymmword ptr [rdx + 4*rax]
vmovdqu ymm1, ymmword ptr [rdx + 4*rax + 32]
vmovdqu ymm2, ymmword ptr [rdx + 4*rax + 64]
vmovdqu ymm3, ymmword ptr [rdx + 4*rax + 96]
vpaddd ymm0, ymm0, ymmword ptr [rcx + 4*rax]
vpaddd ymm1, ymm1, ymmword ptr [rcx + 4*rax + 32]
vpaddd ymm2, ymm2, ymmword ptr [rcx + 4*rax + 64]
vpaddd ymm3, ymm3, ymmword ptr [rcx + 4*rax + 96]
vmovdqu ymmword ptr [rcx + 4*rax], ymm0
vmovdqu ymmword ptr [rcx + 4*rax + 32], ymm1
vmovdqu ymmword ptr [rcx + 4*rax + 64], ymm2
vmovdqu ymmword ptr [rcx + 4*rax + 96], ymm3
add rax, 32
cmp r8, rax
jne .LBB1_4
As can be seen above, the loop has been unrolled and vectorized: each iteration executes 4 vpaddd instructions, each performing 8 integer additions, for a total of 32 integer additions per loop iteration.
This is great! However, loop vectorization is notoriously brittle. As an example, let's introduce a seemingly innocuous branch like this:
[MethodImpl(MethodImplOptions.NoInlining)]
private static unsafe void Bar([NoAlias] int* a, [NoAlias] int* b, int count)
{
    for (var i = 0; i < count; i++)
    {
        if (a[i] > b[i])
        {
            break;
        }
        a[i] += b[i];
    }
}
Now the assembly changes to this:
.LBB1_3:
mov r9d, dword ptr [rcx + 4*r10]
mov eax, dword ptr [rdx + 4*r10]
cmp r9d, eax
jg .LBB1_4
add eax, r9d
mov dword ptr [rcx + 4*r10], eax
inc r10
cmp r8, r10
jne .LBB1_3
This loop is completely scalar and only has 1 integer addition per loop iteration. This is not good! In this simple case, an experienced developer would probably spot that adding the branch will break auto-vectorization. But in more complex real-life code it can be difficult to spot.
To help with this problem, Burst includes experimental intrinsics (Loop.ExpectVectorized() and Loop.ExpectNotVectorized()) that express loop vectorization assumptions and have them validated at compile-time. For example, we can change the original Bar implementation to:
[MethodImpl(MethodImplOptions.NoInlining)]
private static unsafe void Bar([NoAlias] int* a, [NoAlias] int* b, int count)
{
    for (var i = 0; i < count; i++)
    {
        Unity.Burst.CompilerServices.Loop.ExpectVectorized();
        a[i] += b[i];
    }
}
Burst will now validate, at compile-time, that the loop has indeed been vectorized. If the loop is not vectorized, Burst will emit a compiler error. For example, if we do this:
[MethodImpl(MethodImplOptions.NoInlining)]
private static unsafe void Bar([NoAlias] int* a, [NoAlias] int* b, int count)
{
    for (var i = 0; i < count; i++)
    {
        Unity.Burst.CompilerServices.Loop.ExpectVectorized();
        if (a[i] > b[i])
        {
            break;
        }
        a[i] += b[i];
    }
}
then Burst will emit the following error at compile-time:
LoopIntrinsics.cs(6,9): Burst error BC1321: The loop is not vectorized where it was expected that it is vectorized.
As these intrinsics are experimental, they need to be enabled with the UNITY_BURST_EXPERIMENTAL_LOOP_INTRINSICS preprocessor define.
Note that these loop intrinsics should not be used inside if statements. Burst does not currently prevent this from happening, but in a future release this will be a compile-time error.
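The counterpart intrinsic, Loop.ExpectNotVectorized(), can be used in the same way. The sketch below takes the branching version of Bar shown earlier and states the assumption that its loop stays scalar. The enclosing class name LoopChecks is hypothetical, and the example assumes the experimental define described above is set.

```csharp
using System.Runtime.CompilerServices;
using Unity.Burst;
using Unity.Burst.CompilerServices;

// Hypothetical container class, for illustration only.
public static class LoopChecks
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static unsafe void Bar([NoAlias] int* a, [NoAlias] int* b, int count)
    {
        for (var i = 0; i < count; i++)
        {
            // States the assumption that this loop is NOT vectorized,
            // because of the early exit below.
            Loop.ExpectNotVectorized();

            if (a[i] > b[i])
            {
                break;
            }

            a[i] += b[i];
        }
    }
}
```

Recording the expectation this way documents why the loop is scalar, and a later change that alters the vectorization outcome should be flagged at compile-time rather than going unnoticed.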
Compiler Options
When compiling a job, you can change the behavior of the compiler:
- Using a different accuracy for the math functions (sin, cos...)
- Allowing the compiler to re-arrange the floating point calculations by relaxing the order of the math computations.
- Forcing a synchronous compilation of the Job (only for the Editor/JIT case)
- Using internal compiler options (not yet detailed)
These flags can be set through the [BurstCompile] attribute, for example [BurstCompile(FloatPrecision.Medium, FloatMode.Fast)].
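As a minimal sketch of how these options look on a job, the example below applies a precision, a float mode, and synchronous compilation to a hypothetical job named SineJob. The job itself and its field are illustrative assumptions; only the attribute arguments relate to the options listed above.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

// Hypothetical job, for illustration only.
// CompileSynchronously only affects the Editor/JIT case.
[BurstCompile(FloatPrecision.Medium, FloatMode.Fast, CompileSynchronously = true)]
public struct SineJob : IJob
{
    public NativeArray<float> Values;

    public void Execute()
    {
        for (var i = 0; i < Values.Length; i++)
        {
            // math.sin is one of the functions affected by FloatPrecision.
            Values[i] = math.sin(Values[i]);
        }
    }
}
```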
FloatPrecision
The accuracy is defined by the following enumeration:
public enum FloatPrecision
{
    /// <summary>
    /// Use the default target floating point precision - <see cref="FloatPrecision.Medium"/>.
    /// </summary>
    Standard = 0,
    /// <summary>
    /// Compute with an accuracy of 1 ULP - highly accurate, but increased runtime as a result, should not be required for most purposes.
    /// </summary>
    High = 1,
    /// <summary>
    /// Compute with an accuracy of 3.5 ULP - considered acceptable accuracy for most tasks.
    /// </summary>
    Medium = 2,
    /// <summary>
    /// Compute with an accuracy lower than or equal to <see cref="FloatPrecision.Medium"/>, with some range restrictions (defined per function).
    /// </summary>
    Low = 3,
}
Currently, the implementation provides the following accuracy:
- FloatPrecision.Standard is equivalent to FloatPrecision.Medium, providing an accuracy of 3.5 ULP. This is the default value.
- FloatPrecision.High provides an accuracy of 1.0 ULP.
- FloatPrecision.Medium provides an accuracy of 3.5 ULP.
- FloatPrecision.Low has an accuracy defined per function, and functions may specify a restricted range of valid inputs.
Using the FloatPrecision.Standard accuracy should be sufficient for most games.
A ULP (unit in the last place or unit of least precision) is the spacing between adjacent floating-point numbers, i.e., the value the least significant digit represents if it is 1.
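To make the unit concrete, the following plain C# sketch measures the ULP of a single-precision value by stepping to the next representable float. The class and method names are hypothetical, and the helper assumes a finite, positive input. Near 1.0 the spacing is 2^-23 (about 1.19e-7), so an error bound of 3.5 ULP there corresponds to roughly 4.2e-7.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class UlpExample
{
    // Returns the gap between x and the next representable float above it.
    // Assumes x is finite and positive.
    public static float UlpOf(float x)
    {
        // Reinterpret the float as its 32-bit pattern.
        int bits = BitConverter.ToInt32(BitConverter.GetBytes(x), 0);
        // Adding 1 to the bit pattern yields the next float up.
        float next = BitConverter.ToSingle(BitConverter.GetBytes(bits + 1), 0);
        return next - x;
    }

    public static void Main()
    {
        float ulp = UlpOf(1.0f);        // 2^-23 for single precision
        Console.WriteLine(ulp);
        Console.WriteLine(3.5f * ulp);  // size of a 3.5 ULP error near 1.0
    }
}
```

Because the spacing grows with the magnitude of the value, the same ULP bound allows a larger absolute error for larger results.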
Note: The FloatPrecision enum was named Accuracy in early versions of the Burst API.
FloatPrecision.Low
The following table describes the precision and range restrictions for using the FloatPrecision.Low mode. Any function not described in the table will inherit the ULP requirement from FloatPrecision.Medium.
Function | Precision | Range |
---|---|---|
Unity.Mathematics.math.sin(x) | 350.0 ULP | |
Unity.Mathematics.math.cos(x) | 350.0 ULP | |
Unity.Mathematics.math.exp(x) | 350.0 ULP | |
Unity.Mathematics.math.exp2(x) | 350.0 ULP | |
Unity.Mathematics.math.exp10(x) | 350.0 ULP | |
Unity.Mathematics.math.log(x) | 350.0 ULP | |
Unity.Mathematics.math.log2(x) | 350.0 ULP | |
Unity.Mathematics.math.log10(x) | 350.0 ULP | |
Unity.Mathematics.math.pow(x, y) | 350.0 ULP | A negative x raised to a fractional power y is not supported. |
Compiler floating point math mode
The compiler floating point math mode is defined by the following enumeration:
/// <summary>
/// Represents the floating point optimization mode for compilation.
/// </summary>
public enum FloatMode
{
    /// <summary>
    /// Use the default target floating point mode - <see cref="FloatMode.Strict"/>.
    /// </summary>
    Default = 0,
    /// <summary>
    /// No floating point optimizations are performed.
    /// </summary>
    Strict = 1,
    /// <summary>
    /// Reserved for future.
    /// </summary>
    Deterministic = 2,
    /// <summary>
    /// Allows algebraically equivalent optimizations (which can alter the results of calculations), it implies :
    /// <para/> optimizations can assume results and arguments contain no NaNs or +/- Infinity and treat sign of zero as insignificant.
    /// <para/> optimizations can use reciprocals - 1/x * y , instead of y/x.
    /// <para/> optimizations can use fused instructions, e.g. madd.
    /// </summary>
    Fast = 3,
}
- FloatMode.Default defaults to FloatMode.Strict.
- FloatMode.Strict: The compiler does not re-arrange calculations and carefully respects special floating point values (denormals, NaN...). This is the default value.
- FloatMode.Fast: The compiler can re-arrange instructions and/or use dedicated or less precise hardware SIMD instructions.
- FloatMode.Deterministic: Reserved for future use, when Burst will provide support for a deterministic mode.
Typically, some hardware can fuse a multiply and an add (e.g. mad: a * b + c) into a single instruction. These optimizations can be allowed by using the Fast mode. The reordering of these instructions can lead to lower accuracy.
The FloatMode.Fast compiler floating point math mode can be used for many scenarios where the exact order of the calculation and the uniform handling of NaN values are not strictly required.
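Why the order of a calculation matters can be shown without Burst at all. The plain C# sketch below (class name hypothetical) evaluates the same three-term sum with two groupings. Under IEEE 754 arithmetic the first grouping yields 1 and the second yields 0, because 1 is far smaller than the spacing between floats near 1e20 and is lost when added to -1e20 first. This is the kind of difference a mode that permits re-arrangement may introduce.

```csharp
using System;

// Hypothetical demo class, for illustration only.
public static class ReorderingExample
{
    public static void Main()
    {
        float a = 1e20f;
        float b = -1e20f;
        float c = 1f;

        // a and b cancel exactly, then c is added.
        float leftToRight = (a + b) + c;

        // c is absorbed by b (it is below b's precision), then a cancels b.
        float regrouped = a + (b + c);

        Console.WriteLine(leftToRight);
        Console.WriteLine(regrouped);
    }
}
```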
Assume Intrinsics
Being able to tell the compiler that an integer lies within a certain range can open up optimization opportunities. The AssumeRange attribute allows users to tell the compiler that a given scalar-integer lies within a certain constrained range:
[return: AssumeRange(0u, 13u)]
static uint WithConstrainedRange([AssumeRange(0, 26)] int x)
{
    return (uint)x / 2u;
}
The above code makes two promises to the compiler:
- That the variable x is in the closed-interval range [0..26], or more plainly that x >= 0 && x <= 26.
- That the return value from WithConstrainedRange is in the closed-interval range [0..13], or more plainly that the returned value is >= 0 and <= 13.
These assumptions are fed into the optimizer and allow for better codegen as a result. There are some restrictions:
- You can only place these on scalar-integer (signed or unsigned) types.
- The type of the range arguments must match the type being attributed.
We've also added in some deductions for the .Length property of NativeArray and NativeSlice to tell the optimizer that these always return non-negative integers.
static bool IsLengthNegative(NativeArray<float> na)
{
    // The compiler will always replace this with the constant false!
    return na.Length < 0;
}
Let's assume you have your own container:
struct MyContainer
{
    public int Length;
    // Some other data...
}
And you wanted to tell Burst that Length was always a non-negative integer. You would do that like so:
struct MyContainer
{
    private int _length;

    [return: AssumeRange(0, int.MaxValue)]
    private int LengthGetter()
    {
        return _length;
    }

    public int Length
    {
        get => LengthGetter();
        set => _length = value;
    }
    // Some other data...
}
Unity.Mathematics
The Unity.Mathematics package provides vector types (float4, float3...) that are directly mapped to hardware SIMD registers.
Many functions from the math type are also mapped directly to hardware SIMD instructions.
Note that currently, for optimal usage of this library, it is recommended to use SIMD 4 wide types (float4, int4, bool4...).
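A minimal sketch of the recommendation above: the hypothetical job below stores its data as float4 and uses math.mad, so each element operation works on four lanes at once. The job name and fields are assumptions made for illustration.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

// Hypothetical job, for illustration only.
[BurstCompile]
public struct ScaleAndOffsetJob : IJob
{
    public NativeArray<float4> Values;
    public float4 Scale;
    public float4 Offset;

    public void Execute()
    {
        for (var i = 0; i < Values.Length; i++)
        {
            // Computes Values[i] * Scale + Offset on all four lanes.
            Values[i] = math.mad(Values[i], Scale, Offset);
        }
    }
}
```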
Generic Jobs
As described in AOT vs JIT, there are currently two modes in which Burst compiles a Job:
- When in the Editor, it will compile the Job when it is scheduled (sometimes called JIT mode).
- When building a Standalone Player, it will compile the Job as part of the build player (AOT mode).
If the Job is a concrete type (not using generics), the Job will be compiled correctly in both modes.
With a generic Job, the behavior can be less predictable.
While Burst supports generics, it has limited support for generic Jobs and function pointers. You may notice that a job scheduled at Editor time runs at full speed with Burst, but does not when used in a Standalone Player. This is usually a problem related to generic Jobs.
A generic Job can be defined like this:
// Direct Generic Job
[BurstCompile]
struct MyGenericJob<TData> : IJob where TData : struct
{
    public void Execute() { ... }
}
or can be nested:
// Nested Generic Job
public class MyGenericSystem<TData> where TData : struct
{
    [BurstCompile]
    struct MyGenericJob : IJob
    {
        public void Execute() { ... }
    }

    public void Run()
    {
        var myJob = new MyGenericJob(); // implicitly MyGenericSystem<TData>.MyGenericJob
        myJob.Schedule();
    }
}
When the previous Jobs are being used like:
// Direct Generic Job
var myJob = new MyGenericJob<int>();
myJob.Schedule();
// Nested Generic Job
var myJobSystem = new MyGenericSystem<float>();
myJobSystem.Run();
In both cases in a standalone-player build, the Burst compiler will be able to detect that it has to compile MyGenericJob<int> and MyGenericSystem<float>.MyGenericJob because the generic jobs (or the type surrounding it for the nested job) are used with fully resolved generic arguments (int and float).
But if these jobs are used indirectly through a generic parameter, the Burst compiler won't be able to detect the Jobs it has to compile at standalone-player build time:
public static void GenericJobSchedule<TData>() where TData : struct
{
    // Generic argument: Generic Parameter TData
    // This Job won't be detected by the Burst Compiler at standalone-player build time.
    var job = new MyGenericJob<TData>();
    job.Schedule();
}

// The implicit MyGenericJob<int> will run at Editor time in full Burst speed
// but won't be detected at standalone-player build time.
GenericJobSchedule<int>();
The same restriction applies when declaring the Job in the context of a generic parameter coming from a type:
// Generic Parameter TData
public class SuperJobSystem<TData>
{
    // Generic argument: Generic Parameter TData
    // This Job won't be detected by the Burst Compiler at standalone-player build time.
    public MyGenericJob<TData> MyJob;
}
In summary, if you are using generic jobs, they need to be used directly with fully-resolved generic arguments (e.g. int, MyOtherStruct), but can't be used with a generic parameter indirection (e.g. MyGenericJob<TContext>).
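One way to stay within this rule, sketched below under the assumption that the set of concrete types is known up front, is to schedule through non-generic entry points that each name a fully-resolved instantiation. The ConcreteSchedulers class and its methods are hypothetical, and the Data field is added to the job only to give the example something to hold.

```csharp
using Unity.Burst;
using Unity.Jobs;

[BurstCompile]
struct MyGenericJob<TData> : IJob where TData : struct
{
    public TData Data;

    public void Execute()
    {
        // Work on Data would go here.
    }
}

// Hypothetical helper: one non-generic entry point per concrete type,
// so every instantiation appears with fully-resolved generic arguments.
public static class ConcreteSchedulers
{
    public static JobHandle ScheduleInt()
    {
        var job = new MyGenericJob<int>();
        return job.Schedule();
    }

    public static JobHandle ScheduleFloat()
    {
        var job = new MyGenericJob<float>();
        return job.Schedule();
    }
}
```

The trade-off is some repetition per supported type, in exchange for instantiations that are visible at standalone-player build time.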
Regarding function pointers, they are more restricted as you can't use a generic delegate through a function pointer with Burst:
public delegate void MyGenericDelegate<T>(ref T data) where T : struct;
var myGenericDelegate = new MyGenericDelegate<int>(MyIntDelegateImpl);
// Will fail to compile this function pointer.
var myGenericFunctionPointer = BurstCompiler.CompileFunctionPointer<MyGenericDelegate<int>>(myGenericDelegate);
This restriction comes from the .NET runtime, which cannot interop with such delegates.
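For contrast, a non-generic delegate can be compiled to a function pointer. The sketch below assumes a hypothetical class MyIntFunctions with a delegate specialized to int; the body of the implementation is an arbitrary placeholder.

```csharp
using System.Runtime.InteropServices;
using Unity.Burst;

// Hypothetical class, for illustration only.
[BurstCompile]
public static class MyIntFunctions
{
    // Non-generic delegate type: the element type is fixed to int.
    [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
    public delegate void MyIntDelegate(ref int data);

    [BurstCompile]
    public static void MyIntDelegateImpl(ref int data)
    {
        data += 1; // placeholder work
    }

    public static FunctionPointer<MyIntDelegate> Compile()
    {
        // Compiles the static method above into a Burst function pointer.
        return BurstCompiler.CompileFunctionPointer<MyIntDelegate>(MyIntDelegateImpl);
    }
}
```

The returned FunctionPointer<MyIntDelegate> exposes an Invoke delegate that can then be called with a ref int argument.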