VC4
===

Mesa's ``vc4`` graphics driver supports multiple implementations of
Broadcom's VideoCore IV GPU. It is notably used in the Raspberry Pi 0
through Raspberry Pi 3 hardware, and the driver has been included as an
option since the 2016-02-09 Raspbian release, where it can be enabled
using ``raspi-config``. On most other distributions such as Debian or
Fedora, you need no configuration to enable the driver.

This Mesa driver talks directly to the `vc4
<https://www.kernel.org/doc/html/latest/gpu/vc4.html>`__ kernel DRM
driver for scheduling graphics commands, and that module also provides
KMS display support. The driver makes no use of the closed source VPU
firmware on the VideoCore IV block, instead talking directly to the
GPU block from Linux.

GLES2 support
-------------

The vc4 driver is a nearly conformant GLES2 driver, and the hardware
has achieved GLES2 conformance with other driver stacks.

OpenGL support
--------------

Along with GLES 2.0, the Mesa driver also exposes OpenGL 2.1, which is
mostly correct but with a few caveats.

* 4-byte index buffers.

  GLES 2.0, and vc4, don't have ``GL_UNSIGNED_INT`` index buffers. To support
  them in vc4, we create a shadow copy of your index buffer with the
  indices truncated to 2 bytes. This is incorrect (and will fail an
  assertion in debug builds of Mesa) if any of the indices were >65535. To
  fix that, we would need to detect this case and rewrite the index
  buffer and vertex buffers to do a series of draws each with small
  indices and new vertex attrib bindings.

  To avoid this problem, ensure that all index buffers are written using
  ``GL_UNSIGNED_SHORT``, even at the cost of doing multiple draw calls
  with updated vertex attrib bindings.

* Occlusion queries

  The VC4 hardware has no support for occlusion queries. GL 2.0
  requires that you support the occlusion queries extension, but you can
  report 0 from ``glGetQueryiv(GL_SAMPLES_PASSED,
  GL_QUERY_COUNTER_BITS)``. This is absurd, but it's how OpenGL handles
  "we want the functions to be present everywhere, but we want it to be
  optional for hardware to support it." Sadly, gallium doesn't yet allow
  the driver to report 0 query bits.

* Primitive mode

  VC4 doesn't support reducing triangles/quads/polygons to lines and
  points like desktop GL. If the front and back modes matched, we could
  rewrite the index buffer to the new primitive type, but we don't. If
  the front and back modes don't match, we would need to run the vertex
  shader in software, classify the prims, write new index buffers, and
  emit (possibly many) new draw calls to rasterize the new prims in the
  same order.

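The application-side workaround suggested under "4-byte index buffers"
(several draws, each with ``GL_UNSIGNED_SHORT`` indices and its own vertex
bindings) can be sketched roughly as follows. This is only an illustration
for a plain triangle list; ``split_indexed_triangles`` is a hypothetical
helper, not part of Mesa or of any GL binding, and Mesa itself does not
perform this rewrite.

.. code-block:: python

   def split_indexed_triangles(indices, max_index=0xFFFF):
       """Split a triangle list into batches whose indices fit in 16 bits.

       Returns a list of (vertex_order, local_indices) pairs.
       vertex_order[i] is the original vertex number that local index i
       refers to, so the caller can build a compacted vertex buffer and
       issue one draw call per batch.
       """
       if max_index < 2:
           raise ValueError("a batch must hold at least one triangle")
       if len(indices) % 3 != 0:
           raise ValueError("triangle list length must be a multiple of 3")

       batches = []
       remap = {}   # original index -> local index in the current batch
       order = []   # local index -> original index
       local = []   # rewritten indices for the current batch

       for t in range(0, len(indices), 3):
           tri = indices[t:t + 3]
           new_vertices = len({v for v in tri if v not in remap})
           # Flush the batch if this triangle would overflow the index range.
           if len(order) + new_vertices > max_index + 1:
               batches.append((order, local))
               remap, order, local = {}, [], []
           for v in tri:
               if v not in remap:
                   remap[v] = len(order)
                   order.append(v)
               local.append(remap[v])

       if local:
           batches.append((order, local))
       return batches

Each returned batch corresponds to one draw call: upload the vertices named
by ``vertex_order`` and draw with ``local_indices`` as 16-bit indices.
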
Bug Reporting
-------------

VC4 rendering bugs should go to Mesa's GitLab `issues
<https://gitlab.freedesktop.org/mesa/mesa/-/issues>`__ page.

By far the easiest way to communicate bug reports for rendering
problems is to take an apitrace. This passes exactly the drawing you
saw to the developer, without the developer needing to download and
build the application and replicate whatever steps you took to produce
the problem. Traces attached to bug reports should ideally be small.

For GPU hangs, if you can get a short apitrace that produces the
problem, that's still the best. If the problem takes a long time to
reproduce or you can't capture it in a trace, describing how to
reproduce it and including a GPU hang dump would be the most
useful. Install `vc4-gpu-tools
<https://github.com/anholt/vc4-gpu-tools/>`__ and use
``vc4_dump_hang_state my-app.hang``. Sometimes the hang file will
provide useful information.

Tiled Rendering
---------------

VC4 is a tiled renderer, chopping the screen into 64x64 (non-MSAA) or
32x32 (MSAA) tiles and rendering the scene per tile. Rasterization
looks like::

    (CPU) Allocate space to store a list of draw commands per tile
    (CPU) Set up a command list per tile that does:
        Either load the current tile's color buffer from memory, or clear it.
        Either load the current tile's depth buffer from memory, or clear it.
        Branch into the draw list for the tile
        Store the depth buffer if anybody might read it.
        Store the color buffer if anybody might read it.
    (GPU) Initialize the per-tile draw call lists to empty.
    (GPU) Run all draw calls collecting vertex data
    (GPU) For each tile covered by a draw call's primitive.
        Emit state packets to the list to update it to the current draw call's state.
        Emit a primitive description into the tile's draw call list.

Tiled rendering avoids the need for large render target caches, at the
expense of increasing the cost of vertex processing. Unlike some tiled
renderers, VC4 has no non-tiled rendering mode.

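To get a feel for how many per-tile command lists a frame involves, the
tile sizes quoted above can be turned into a small calculation. This is a
minimal sketch that assumes partially covered edge tiles count as whole
tiles; ``tile_grid`` is a hypothetical helper name, not a Mesa function.

.. code-block:: python

   def tile_grid(width, height, msaa=False):
       """Return (columns, rows, total) of tiles for a render target.

       Uses the 64x64 (non-MSAA) and 32x32 (MSAA) tile sizes from the text
       and rounds up, so a partially covered edge tile counts as one tile.
       """
       tile = 32 if msaa else 64
       cols = -(-width // tile)   # ceiling division
       rows = -(-height // tile)
       return cols, rows, cols * rows


   if __name__ == "__main__":
       print(tile_grid(1920, 1080))
       print(tile_grid(1920, 1080, msaa=True))

For a 1920x1080 target this gives a 30x17 grid without MSAA and a 60x34
grid with MSAA, which is why anything that lets the driver skip loading or
storing tiles matters for memory bandwidth.
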
Performance Tricks
------------------

* Reducing memory bandwidth by clearing.

  Even if your drawing is going to cover the entire render target, it's
  more efficient for VC4 if you emit a ``glClear()`` of the color and
  depth buffers. This means we can skip the load of the previous state
  from memory, in favor of a cheap GPU-side ``memset()`` of the tile
  buffer before we start running the draw calls.

* Reducing memory bandwidth with scissoring.

  If all draw calls for the frame use a ``glScissor()`` that covers only
  part of the screen, then we can skip setting up the tiles outside that
  area, which means a little less memory used setting up the empty bins,
  and a lot less memory bandwidth used loading/storing the unchanged tiles.

* Reducing memory bandwidth with ``glInvalidateFramebuffer()``.

  If we don't know who might use the contents of the framebuffer's depth
  or color in the future, then we have to store it for later. If you use
  ``glInvalidateFramebuffer()`` before accessing the results of your
  rendering, then we can skip the store of the depth or color
  buffer. Note that this optimization is currently unimplemented.

* Avoid non-constant GLSL array indexing

  In VC4 the only non-constant-index array access supported in hardware
  is uniforms. For everything else (inputs, outputs, temporaries), we
  have to lower them to an IF ladder like::

      if (index == 0)
         return array[0]
      else if (index == 1)
         return array[1]
      ...

  This is very expensive, as we probably have to execute every branch of
  every IF statement because the GPU is a SIMD machine. So, it is
  recommended (if you can) to avoid non-constant indexing of anything
  other than uniform arrays.

  Note that if you do variable indexing within a bounded loop that Mesa
  can unroll, that can actually count as constant indexing.

* Increasing GPU memory: increase the CMA pool size

  The memory for the VC4 driver is allocated from the standard Linux CMA
  pool. The size of this pool defaults to 64 MB. To increase this, pass
  an additional parameter on the kernel command line. Edit the boot
  partition's ``cmdline.txt`` to add::

      cma=256M@256M

  ``cmdline.txt`` is a single line with whitespace-separated parameters.

  The first value is the size of the pool and the second value is
  the start address of the pool. The pool size can be increased further,
  but it must fit into the memory, so size + start address must be below
  1024M (Pi 2, 3, 3+) or 512M (Pi B, B+, Zero, Zero W). Note that this
  also reduces the memory available to Linux.

* Decrease firmware memory

  The firmware allocates a fixed chunk of memory before booting
  Linux. If firmware functions are not required, this amount can be
  reduced.

  In ``config.txt``, set ``gpu_mem`` to 16 if you do not need video
  decoding, or to 64 if you do.

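The "size + start address" rule for the ``cma=`` parameter above is easy to
check before editing ``cmdline.txt``. The following is a minimal sketch
that only understands the ``cma=SIZE@START`` form shown in this document;
``parse_size`` and ``cma_fits`` are hypothetical helper names. It reads
"below" as "must not exceed" the RAM size, which is an assumption, and it
ignores the memory reserved by the firmware.

.. code-block:: python

   import re

   _UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


   def parse_size(text):
       """Parse a size such as '256M' into bytes."""
       match = re.fullmatch(r"(\d+)([KMG]?)", text.strip().upper())
       if not match:
           raise ValueError(f"unrecognized size: {text!r}")
       return int(match.group(1)) * _UNITS[match.group(2)]


   def cma_fits(param, ram_bytes):
       """Check that a 'cma=SIZE@START' pool ends within ram_bytes.

       Assumption: the pool may end exactly at the end of RAM; memory
       reserved by the firmware (gpu_mem) is not taken into account.
       """
       match = re.fullmatch(r"cma=([^@]+)@(.+)", param.strip())
       if not match:
           raise ValueError(f"expected cma=SIZE@START, got {param!r}")
       size = parse_size(match.group(1))
       start = parse_size(match.group(2))
       return size + start <= ram_bytes


   if __name__ == "__main__":
       one_gib = 1024 << 20
       print(cma_fits("cma=256M@256M", one_gib))
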
Performance debugging
---------------------

* Step 1: Known issues

  The first tool to look at is running your application with the
  environment variable ``VC4_DEBUG=perf`` set. This will report debug
  information for many known causes of performance problems on the
  console. Not all of them will cause visible performance improvements
  when fixed, but it's a good first step to see what might be going wrong.

* Step 2: CPU vs GPU

  The primary question is figuring out whether the CPU is busy in your
  application, the CPU is busy in the GL driver, the GPU is waiting for
  the CPU, or the CPU is waiting for the GPU. Ideally, you get to the
  point where the CPU is waiting for the GPU infrequently but for a
  significant amount of time (however long it takes the GPU to draw a
  frame).

  Start with ``top`` while your application is running. Is the CPU usage
  around 90%+? If so, then our performance analysis will be with
  sysprof. If it's not very high, is the GPU staying busy? We don't have
  a clean tool for this yet, but ``cat /debug/dri/0/v3d_regs`` could be
  useful (this assumes debugfs is mounted at ``/debug``; on many systems
  it is mounted at ``/sys/kernel/debug`` instead). If ``CT0CA`` !=
  ``CT0EA`` or ``CT1CA`` != ``CT1EA``, that means that the GPU is
  currently busy processing some rendering job.

* sysprof for CPU usage

  If the CPU is totally busy and the GPU isn't terribly busy, there is
  an excellent tool for debugging: sysprof. Install it, run it as root
  (so you can get system-wide profiling), hit play and later stop. The
  top-left area shows the flat profile sorted by total time of that
  symbol plus its descendants. The top few are generally uninteresting
  (``main()`` and its descendants consuming a lot), but eventually you
  can get down to something interesting. Click it, and to the right you
  get the callchains to descendants -- where all that time actually
  went. On the other hand, the lower left shows callers --
  double-clicking those selects that as the symbol to view, instead.

  Note that you need debug symbols for the callgraphs in sysprof to
  work, which is where most of its value is. Most distributions offer
  debug symbol packages from their builds which can be installed
  separately, and sysprof will find them. I've found that on ARM, the
  debug packages are not enough, and if someone could determine what is
  necessary for callgraphs in debugging, that would be really helpful.

* perf for CPU waits on GPU

  If the CPU is not very busy and the GPU is not very busy, then we're
  probably ping-ponging between the two. Most cases of this would be
  noticed by ``VC4_DEBUG=perf``, but not all. To see all cases where
  this happens, use the perf tool from the Linux kernel (note: unrelated
  to ``VC4_DEBUG=perf``)::

      sudo perf record -f -g -e vc4:vc4_wait_for_seqno_begin -c 1 openarena

  If you want to see the whole system's stalls for a period of time
  (very useful!), use the ``-a`` flag instead of a particular command
  name. Just ``^C`` when you're done capturing data.

  At exit, you'll have ``perf.data`` in the current directory. You can print
  out the results with::

      perf report | less

* Debugging for GPU fully busy

  As of Linux kernel 4.17 and Mesa 18.1, we now expose the hardware's
  performance counters in OpenGL. Install apitrace, and trace your
  application with::

      apitrace trace <application>          # for GLX applications
      apitrace trace -a egl <application>   # for EGL applications

  Once you've captured a trace, you can see what counters are available
  and replay it while looking at some of those counters::

      apitrace replay <application>.trace --list-metrics

      apitrace replay <application>.trace --pdraw=GL_AMD_performance_monitor:QPU-total-clk-cycles-vertex-coord-shading

  Multiple counters can be captured at once with commas separating them.

  Once you've found which draw calls are surprisingly expensive in one of
  the counters, you can work out which ones they were at the GL level by
  opening the trace up in qapitrace and using ``^-G`` to jump to that call
  number and ``^-L`` to look up the GL state at that call.

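The register comparison from "Step 2: CPU vs GPU" can be scripted so that
it can be sampled repeatedly while an application runs. The sketch below
is hedged heavily: the exact text layout of ``v3d_regs`` is an assumption
(it only expects each relevant line to contain the register name and to
end with the value in hexadecimal), and ``gpu_busy`` is a hypothetical
helper name.

.. code-block:: python

   import re


   def gpu_busy(regs_text):
       """Return True if CT0CA != CT0EA or CT1CA != CT1EA.

       Assumption: each relevant line contains the register name (possibly
       with a prefix such as 'V3D_') and its value is the last '0x...'
       number on that line.
       """
       values = {}
       for line in regs_text.splitlines():
           name = re.search(r"CT[01][CE]A", line)
           numbers = re.findall(r"0x[0-9a-fA-F]+", line)
           if name and numbers:
               values[name.group(0)] = int(numbers[-1], 16)

       missing = {"CT0CA", "CT0EA", "CT1CA", "CT1EA"} - values.keys()
       if missing:
           raise KeyError(f"registers not found: {sorted(missing)}")

       return (values["CT0CA"] != values["CT0EA"]
               or values["CT1CA"] != values["CT1EA"])


   if __name__ == "__main__":
       # Adjust the path to wherever debugfs is mounted on your system.
       with open("/sys/kernel/debug/dri/0/v3d_regs") as regs:
           print("busy" if gpu_busy(regs.read()) else "idle")

Reading debugfs normally requires root. A single sample says little, so
sampling in a loop gives a rough idea of how often the GPU is busy.
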
shader-db
---------

shader-db is often used as a proxy for real-world app performance when
working on the compiler in Mesa. On vc4, there is a lot of
state-dependent code in the shaders (like blending or vertex attribute
format handling), so the typical `shader-db
<https://gitlab.freedesktop.org/mesa/shader-db>`__ will miss important
areas for optimization. Instead, anholt wrote a `new one
<https://cgit.freedesktop.org/~anholt/shader-db-2/>`__ based on
apitraces. Once you have a collection of traces, starting from
`traces-db <https://gitlab.freedesktop.org/gfx-ci/tracie/traces-db/>`__,
you can test a compiler change in this shader-db with::

    ./run.py > before
    (cd ../mesa && make install)
    ./run.py > after
    ./report.py before after

Hardware Documentation
----------------------

For driver developers, Broadcom publicly released a `specification
<https://docs.broadcom.com/doc/12358545>`__ PDF for the 21553, which
is closely related to the vc4 GPU present in the Raspberry Pi. They
also released a `snapshot <https://docs.broadcom.com/docs/12358546>`__
of a corresponding Android graphics driver. That graphics driver was
ported to Raspbian for a demo, but was not expected to have ongoing
development.

Developers with NDA access with Broadcom or Raspberry Pi can
potentially get access to "simpenrose", the C software simulator of
the GPU. The Mesa driver includes a backend (``vc4_simulator.c``) to
use simpenrose from an x86 system with the i915 graphics driver, with
all of the vc4 rendering commands emulated on simpenrose and the
results memcpyed to the real GPU.