METAPROGRAMMED GEMM
===================

The two main goals of this library are:

- providing a new matrix multiplication kernel.
- providing optimized code paths for as many user scenarios as possible,
  without enforcing additional constraints on the input data (padding, sizes,
  strides, layout).

To enable this code, add -DGEMMLOWP_USE_META_FASTPATH to your build setup.

The new kernel
--------------

The multiplication kernel (the innermost loop of the matrix multiplication,
responsible for the row/column products) was rewritten. The new code produces
a 3x3 result patch and processes the row/column arrays in packs of 8 elements
(the kernel 'shape' is 3x3x8, compared to the previous 12x4x2). By using
specialized 8-bit multiplication, accumulating into vector aggregators, and
then reducing with parallel horizontal additions, we devised code with a
higher arithmetic density (arithmetic operations per assembly instruction).
The arithmetic performance of the new kernel exceeds 18 GOps/s on a vanilla
Nexus 5 phone, which is practically the peak for this device.
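
The kernel shape can be illustrated with a plain scalar sketch (illustrative
C++, not the actual NEON assembly; the function name is made up):

```cpp
#include <cstdint>

// Scalar illustration of the 3x3x8 kernel shape: accumulate a 3x3 result
// patch from 3 row pointers and 3 column pointers, consuming the depth
// dimension in packs of 8 elements. The real kernel does this with NEON
// 8-bit multiplies, vector accumulators and parallel horizontal adds.
void Kernel3x3x8(const std::uint8_t* const rows[3],
                 const std::uint8_t* const cols[3],
                 int depth,  // assumed to be a multiple of 8 here
                 std::int32_t result[3][3]) {
  for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 3; ++j) result[i][j] = 0;
  for (int d = 0; d < depth; d += 8) {
    for (int i = 0; i < 3; ++i) {
      for (int j = 0; j < 3; ++j) {
        std::int32_t acc = 0;
        for (int k = 0; k < 8; ++k) {  // one 8-element pack
          acc += static_cast<std::int32_t>(rows[i][d + k]) *
                 static_cast<std::int32_t>(cols[j][d + k]);
        }
        result[i][j] += acc;  // horizontal reduction of the pack
      }
    }
  }
}
```

The real kernel performs the inner 8-element products and the reduction with
vector instructions, which is where the arithmetic-density gain comes from.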

In order to feed the kernel with input data and to minimize the number of
instructions other than the arithmetic operations, a different packing scheme
is used. Three rows (or columns) are interleaved every 8 elements, so that
they can be read from contiguous memory in one operation inside the kernel.
Additionally, memory preload hints are inserted into the kernel to hide
memory latency behind the arithmetic operations.

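
The interleaving can be sketched as follows (an illustrative C++ sketch, not
the generated packing code; the function name is made up):

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch of the packing scheme: three rows are interleaved in
// 8-element chunks, so the kernel can load one contiguous 24-byte block per
// depth step instead of gathering from three separate row pointers.
// Assumes depth is a multiple of 8 for brevity.
std::vector<std::uint8_t> PackThreeRows(const std::uint8_t* row0,
                                        const std::uint8_t* row1,
                                        const std::uint8_t* row2,
                                        int depth) {
  std::vector<std::uint8_t> packed;
  packed.reserve(3 * depth);
  const std::uint8_t* rows[3] = {row0, row1, row2};
  for (int d = 0; d < depth; d += 8) {
    for (int r = 0; r < 3; ++r) {
      for (int k = 0; k < 8; ++k) packed.push_back(rows[r][d + k]);
    }
  }
  return packed;
}
```
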
Generated code
--------------

The basic kernel used in this approach has shape 3x3x8. Obviously, this
kernel is easily applied to multiplications of matrices of sizes M x K and
K x N where M and N are multiples of 3 and K is a multiple of 8.

We rejected two obvious solutions: padding the matrices to appropriate sizes,
and falling back to the reference implementation for the leftovers. Nor did
we consider enforcing extra constraints on the caller.

In order to support all matrix sizes, kernels processing all combinations of
1, 2 or 3 rows and 1, 2 or 3 columns are required. Similarly, to support all
possible depths, the leftover values (up to 7 elements) need to be handled.

Instead of writing those kernels by hand, we decided to generate them with
Python scripts. Nine versions of the multiplication kernel were prepared.
Additionally, packing and unpacking code for the different row/column counts
and depth leftovers was generated. Finally, separate code was generated for
aligned and unaligned memory reads/writes.

Using those multiplication and packing/unpacking primitives, 144 gemm
function versions were prepared. On top of them sits one high-level gemm
function that dispatches to the appropriate preoptimized version at runtime.

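
The variant count follows from the combinations described above, and the
dispatch can be sketched as an index computation on the leftover sizes (an
illustrative sketch; the actual function names and table layout in
single_thread_gemm.h differ):

```cpp
#include <cstdint>

// Illustrative signature for one preoptimized gemm variant (hypothetical).
using GemmVariant = void (*)(const std::uint8_t* lhs, const std::uint8_t* rhs,
                             std::int32_t* result, int M, int N, int K);

// The 144 variants can be indexed by the row leftover (M % 3), the column
// leftover (N % 3), the depth leftover (K % 8) and an aligned/unaligned
// flag: 3 * 3 * 8 * 2 = 144 combinations. A table of GemmVariant pointers
// indexed this way replaces all per-iteration branching.
int VariantIndex(int M, int N, int K, bool aligned) {
  const int m_left = M % 3;
  const int n_left = N % 3;
  const int k_left = K % 8;
  return ((m_left * 3 + n_left) * 8 + k_left) * 2 + (aligned ? 1 : 0);
}
```
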
This approach moves all unnecessary branching and conditional execution
outside of the inner loops. It also removes all the short loops otherwise
required for leftover handling. Finally, aligned memory reads/writes are used
wherever the provided input data allows.

Results
-------

The library shows up to 35% faster gemm execution in some cases (e.g. the
ImageNet benchmark).

Files
-----

single_thread_gemm.h
-- generated ARM/NEON 8bit x 8bit gemm implementation. Contains all the
optimized, unrolled and curried pack/unpack and multiply procedures, and
a single gemm function that switches between the optimized versions based
on the runtime parameters.

multi_thread_gemm.h
-- a simple parallelization scheme for the gemm function.

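
A simple scheme of this kind can partition the result rows across worker
threads, each computing an independent horizontal slice with the
single-threaded gemm. A minimal sketch under that assumption (this is not the
library's actual API; names are made up):

```cpp
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Minimal sketch of row-wise parallelization: split the M dimension into
// contiguous chunks and run a single-threaded gemm callback on each chunk.
// The result slices are disjoint, so no synchronization is needed beyond
// joining the workers.
void ParallelOverRows(int M, int num_threads,
                      const std::function<void(int row_start, int row_count)>&
                          single_thread_gemm_slice) {
  std::vector<std::thread> workers;
  const int chunk = (M + num_threads - 1) / num_threads;
  for (int start = 0; start < M; start += chunk) {
    const int count = std::min(chunk, M - start);
    workers.emplace_back(single_thread_gemm_slice, start, count);
  }
  for (std::thread& t : workers) t.join();
}
```
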
generators/gemm_NxMxK_neon.py
-- script that generates the single_thread_gemm.h header library.
Usage: python gemm_NxMxK_neon.py > single_thread_gemm.h