Workload descriptor format
==========================

ctx.engine.duration_us.dependency.wait,...
<uint>.<str>.<uint>[-<uint>]|*.<int <= 0>[/<int <= 0>][...].<0|1>,...
B.<uint>
M.<uint>.<str>[|<str>]...
P|S|X.<uint>.<int>
d|p|s|t|q|a|T.<int>,...
b.<uint>.<str>[|<str>].<str>
f

For duration a range can be given from which a random value will be picked
before every submit. Since this and seqno management require CPU access to
objects, care needs to be taken to ensure the submit queue is deep enough that
these operations do not affect the execution speed, unless that is desired.

Additional workload steps are also supported:

 'd' - Adds a delay (in microseconds).
 'p' - Adds a delay relative to the start of the previous loop so that each
       loop starts execution with a given period.
 's' - Synchronises the pipeline to a batch relative to the step.
 't' - Throttle every n batches.
 'q' - Throttle to n max queue depth.
 'f' - Create a sync fence.
 'a' - Advance the previously created sync fence.
 'B' - Turn on context load balancing.
 'b' - Set up engine bonds.
 'M' - Set up engine map.
 'P' - Context priority.
 'S' - Context SSEU configuration.
 'T' - Terminate an infinite batch.
 'X' - Context preemption control.

Engine ids: DEFAULT, RCS, BCS, VCS, VCS1, VCS2, VECS

Example (leading spaces must not be present in the actual file):
----------------------------------------------------------------

  1.VCS1.3000.0.1
  1.RCS.500-1000.-1.0
  1.RCS.3700.0.0
  1.RCS.1000.-2.0
  1.VCS2.2300.-2.0
  1.RCS.4700.-1.0
  1.VCS2.600.-1.1
  p.16000

The above workload, described in human language, works like this:

 1. A batch is sent to the VCS1 engine which will be executing for 3ms on the
    GPU, and userspace will wait until it is finished before proceeding.
 2-4. Now three batches are sent to RCS with durations of 0.5-1ms (random
      duration range), 3.7ms and 1ms respectively. The first batch has a data
      dependency on the preceding VCS1 batch, and the last of the group depends
      on the first from the group.
 5. Now a 2.3ms batch is sent to VCS2, with a data dependency on the 3.7ms
    RCS batch.
 6. This is followed by a 4.7ms RCS batch with a data dependency on the 2.3ms
    VCS2 batch.
 7. Then a 0.6ms VCS2 batch is sent depending on the previous RCS one. In the
    same step the tool is told to wait until the batch completes before
    proceeding.
 8. Finally the tool is told to wait long enough to ensure the next iteration
    starts 16ms after the previous one has started.
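
The delay and throttling steps ('d', 't' and 'q') follow the same dotted syntax
as the other commands. A hypothetical snippet, shown purely to illustrate the
syntax (the values are arbitrary):

  1.RCS.1000.0.0
  d.500
  1.RCS.1000.0.0
  q.3

Going by the step descriptions above, this would insert a 0.5ms delay between
the two 1ms RCS batches and limit the workload to a maximum queue depth of
three batches.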

When workload descriptors are provided on the command line, commas must be used
instead of new lines.
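
For example, a short descriptor could be passed inline roughly like this
(assuming gem_wsim's -w option is used to supply the descriptor; the exact
invocation may differ):

  gem_wsim -w "1.VCS1.3000.0.1,1.RCS.500-1000.-1.0,p.16000"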

Multiple dependencies can be given separated by forward slashes.

Example:

  1.VCS1.3000.0.1
  1.RCS.3700.0.0
  1.VCS2.2300.-1/-2.0

In this case the last step has a data dependency on both the first and second
steps.

Batch durations can also be specified as infinite by using '*' in the duration
field. Such batches must be ended by the terminate command ('T'), otherwise
they will cause a GPU hang to be reported.
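
A hypothetical sketch of terminating an infinite batch, assuming 'T' takes a
relative step offset to the infinite batch like the other synchronisation
steps:

  1.RCS.*.0.0
  d.1000
  T.-2
  s.-3

Here an infinite RCS batch is started, the tool waits for 1ms, terminates that
batch and then synchronises to it before the next iteration.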

Sync (fd) fences
----------------

Sync fences are also supported as dependencies.

To use them, put an "f<N>" token in the step dependency list. N is in this case
the same relative step offset to the dependee batch, but instead of the data
dependency an output fence will be emitted at the dependee step and passed in
as a dependency in the current step.

Example:

  1.VCS1.3000.0.0
  1.RCS.500-1000.-1/f-1.0

In this case the second step will have both a data dependency and a sync fence
dependency on the previous step.

Example:

  1.RCS.500-1000.0.0
  1.VCS1.3000.f-1.0
  1.VCS2.3000.f-2.0

The VCS1 and VCS2 batches will have a sync fence dependency on the RCS batch.

Example:

  1.RCS.500-1000.0.0
  f
  2.VCS1.3000.f-1.0
  2.VCS2.3000.f-2.0
  1.RCS.500-1000.0.1
  a.-4
  s.-4
  s.-4

The VCS1 and VCS2 batches have an input sync fence dependency on the standalone
fence created at the second step. They are submitted ahead of time while still
not runnable. When the second RCS batch completes, the standalone fence is
signaled, which allows the two VCS batches to be executed. Finally we wait
until both VCS batches have completed before starting the (optional) next
iteration.

Submit fences
-------------

Submit fences are a type of input fence which is signalled when the originating
batch buffer is submitted to the GPU. (In contrast to normal sync fences, which
are signalled on completion.)

Submit fences use the same syntax as sync fences, with a lower-case 's'
selecting them. Eg:

  1.RCS.500-1000.0.0
  1.VCS1.3000.s-1.0
  1.VCS2.3000.s-2.0

Here the VCS1 and VCS2 batches will only be submitted for execution once the
RCS batch enters the GPU.

Context priority
----------------

  P.1.-1
  1.RCS.1000.0.0
  P.2.1
  2.BCS.1000.-2.0

Context 1 is marked as low priority (-1) and then a batch buffer is submitted
against it. Context 2 is marked as high priority (1) and then a batch buffer is
submitted against it which depends on the batch from context 1.

The context priority command is executed at workload runtime and is valid until
overridden by another (optional) priority change on the same context. Actual
driver ioctls are executed only if the priority level has changed for the
context.
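
As a hypothetical illustration of the override behaviour (step values are
arbitrary), the same context can change priority mid-workload:

  P.1.-1
  1.RCS.1000.0.0
  P.1.1
  1.RCS.1000.0.0

Both 'P' steps change the priority level of context 1, so both would result in
a driver ioctl; repeating 'P.1.1' a second time would not.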

Context preemption control
--------------------------

  X.1.0
  1.RCS.1000.0.0
  X.1.500
  1.RCS.1000.0.0

Context 1 is marked as having non-preemptable batches and a batch is submitted
against it. The same context is then marked to have batches which can be
preempted every 500us and another batch is submitted.

As with context priority, context preemption commands are valid until
optionally overridden by another preemption control change on the same context.

Engine maps
-----------

Engine maps are a per-context feature which changes the way engine selection is
done in the driver.

Example:

  M.1.VCS1|VCS2

This sets up context 1 with an engine map containing the VCS1 and VCS2 engines.
Submissions to this context can now only reference these two engines.

Engine maps can also be defined based on an engine class like VCS.

Example:

  M.1.VCS

This sets up the engine map to all available VCS class engines.

Context load balancing
----------------------

Context load balancing (aka Virtual Engine) is an i915 feature where the driver
will pick the best (most idle) engine to submit to, given the previously
configured engine map.

Example:

  B.1

This enables load balancing for context number one.
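
A slightly fuller, illustrative sequence combines a class-based engine map with
balancing, following the same pattern as the engine bonds example below, where
submissions to a balanced context reference the DEFAULT engine:

  M.1.VCS
  B.1
  1.DEFAULT.1000.0.0

Context 1 is mapped to all VCS class engines, balancing is enabled, and a 1ms
batch is submitted which the driver is free to place on whichever VCS engine is
the most idle.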

Engine bonds
------------

Engine bonds are extensions on load balanced contexts. They allow expressing
rules of engine selection between two co-operating contexts tied with submit
fences. In other words, the rule expression is telling the driver: "If you pick
this engine for context one, then you have to pick that engine for context
two".

Syntax is:

  b.<context>.<engine_list>.<master_engine>

Engine list is a list of one or more sibling engines separated by a pipe
character (eg. "VCS1|VCS2").

There can be multiple bonds tied to the same context.

Example:

  M.1.RCS|VECS
  B.1
  M.2.VCS1|VCS2
  B.2
  b.2.VCS1.RCS
  b.2.VCS2.VECS

This tells the driver that if it picked RCS for context one, it has to pick
VCS1 for context two. And if it picked VECS for context one, it has to pick
VCS2 for context two.

If we extend the above example with more workload directives:

  1.DEFAULT.1000.0.0
  2.DEFAULT.1000.s-1.0

We get a fully functional example where two batch buffers are submitted in a
load balanced fashion, telling the driver they should run simultaneously and
that the valid engine pairs are either RCS + VCS1 (for the two contexts
respectively), or VECS + VCS2.

This can also be extended using sync fences to improve the chances of the first
submission not reaching the hardware after the second one. The second block
would then look like:

  f
  1.DEFAULT.1000.f-1.0
  2.DEFAULT.1000.s-1.0
  a.-3

Context SSEU configuration
--------------------------

  S.1.1
  1.RCS.1000.0.0
  S.2.-1
  2.RCS.1000.0.0

Context 1 is configured to run with one enabled slice (slice mask 1) and a
batch is submitted against it. Context 2 is configured to run with all slices
(this is the default so the command could also be omitted) and a batch is
submitted against it.

This shows the dynamic SSEU reconfiguration cost between two contexts competing
for the render engine.

A slice mask of -1 has the special meaning of "all slices". Otherwise any
integer can be specified as the slice mask, but beware that any value other
than 1 and -1 can make the workload non-portable between different GPUs.