You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
177 lines
6.7 KiB
177 lines
6.7 KiB
# Kernels in gemmlowp
|
|
|
|
## Kernels provide an inner-loop implementation, and a format
|
|
|
|
Here we assume familiarity with the concepts of kernels and of packing as
|
|
explained in [design.md](design.md).
|
|
|
|
gemmlowp is designed to be easily extensible to different architectures and
|
|
other low-level details, while achieving high performance. Thus a line had to be
|
|
drawn between the generic GEMM code and the specific parts that need to be
|
|
manually designed for each architecture, etc. The design choice made in gemmlowp
|
|
is to have easily swappable GEMM kernels.
|
|
|
|
In itself, a GEMM kernel is just an implementation of the inner-most loop in a
|
|
GEMM (That inner-most loop has to be over the 'depth' dimension so as to be able
|
|
to accumulate into a small enough number of accumulators to fit in registers).
|
|
|
|
Thus, by itself, a GEMM kernel should be just a function computing a block of
|
|
GEMM.
|
|
|
|
However, GEMM kernels may need to differ not just in how they implement this
|
|
computation, but also in the format of data that they operate on. Indeed, in
|
|
order to maximize the ratio of arithmetic instructions to memory access
|
|
instructions, GEMM kernels want to handle blocks as wide as possible given the
|
|
number of registers of the CPU architecture.
|
|
|
|
Thus, in order to allow efficient specialization to diverse architectures,
|
|
gemmlowp allows each GEMM kernel to dictate the format of data that it expects,
|
|
in addition to providing its inner-loop implementation.
|
|
|
|
The former is given by a 'Format' typedef, and the latter by a 'Run' method.
|
|
|
|
A good example is to look at internal/kernel_neon.h, and specifically at the
|
|
NEONKernel12x4Depth2 kernel, which specifies its format as
|
|
|
|
```
|
|
typedef KernelFormat<KernelSideFormat<CellFormat<4, 2>, 3>,
|
|
KernelSideFormat<CellFormat<4, 2>, 1> > Format;
|
|
```
|
|
|
|
The meaning of these terms is explained in the lengthy comment at the top of
|
|
internal/kernel.h. Here, they mean that this kernel handles at each iteration
|
|
(along the depth dimension):
|
|
|
|
- 3 'cells' of size 4x2 each of the lhs, so a total lhs block of size 12x2
|
|
|
|
- 1 'cell' of size 2x4 of the rhs.
|
|
|
|
In other words, this kernel handles 12 rows of the lhs and 4 columns of the
|
|
rhs, and handles two levels of depth at once. The 'cells' and `CellFormat`
|
|
detail the layout of these 12x2 and 2x4 blocks.
|
|
|
|
This kernel then loads these 12x2 and 2x4 blocks and computes the corresponding
|
|
12x4 GEMM; for ease of reference let us paste the critical comment and code
|
|
here:
|
|
|
|
```
|
|
"loop_NEONKernel12x4Depth2_%=:\n"
|
|
|
|
// Overview of register layout:
|
|
//
|
|
// A 2x4 cell of Rhs is stored in 16bit in d0--d1 (q0).
|
|
// A 12x2 block of 3 4x2 cells Lhs is stored in 16bit in d2--d7
|
|
// (q1--q3).
|
|
// A 12x4 block of accumulators is stored in 32bit in q4--q15.
|
|
//
|
|
// +-----+-----+-----+-----+
|
|
// |d0[0]|d0[1]|d0[2]|d0[3]|
|
|
// Rhs +-----+-----+-----+-----+
|
|
// |d1[0]|d1[1]|d1[2]|d1[3]|
|
|
// +-----+-----+-----+-----+
|
|
//
|
|
// | | | | |
|
|
//
|
|
// Lhs | | | | |
|
|
//
|
|
// +--+--+ - - - - +-----+-----+-----+-----+
|
|
// |d2|d3| | q4 | q5 | q6 | q7 |
|
|
// |d2|d3| | q4 | q5 | q6 | q7 |
|
|
// |d2|d3| | q4 | q5 | q6 | q7 |
|
|
// |d2|d3| | q4 | q5 | q6 | q7 |
|
|
// +--+--+ - - - - +-----+-----+-----+-----+
|
|
// |d4|d5| | q8 | q9 | q10 | q11 |
|
|
// |d4|d5| | q8 | q9 | q10 | q11 |
|
|
// |d4|d5| | q8 | q9 | q10 | q11 |
|
|
// |d4|d5| | q8 | q9 | q10 | q11 |
|
|
// +--+--+ - - - - +-----+-----+-----+-----+
|
|
// |d6|d7| | q12 | q13 | q14 | q15 |
|
|
// |d6|d7| | q12 | q13 | q14 | q15 |
|
|
// |d6|d7| | q12 | q13 | q14 | q15 |
|
|
// |d6|d7| | q12 | q13 | q14 | q15 |
|
|
// +--+--+ - - - - +-----+-----+-----+-----+
|
|
//
|
|
// Accumulator
|
|
|
|
// Load 1 Rhs cell of size 2x4
|
|
"vld1.8 {d0}, [%[rhs_ptr]:64]!\n"
|
|
|
|
// Load 3 Lhs cells of size 4x2 each
|
|
"vld1.8 {d2}, [%[lhs_ptr]:64]!\n"
|
|
"vld1.8 {d4}, [%[lhs_ptr]:64]!\n"
|
|
"vld1.8 {d6}, [%[lhs_ptr]:64]!\n"
|
|
|
|
// Expand Lhs/Rhs cells to 16 bit.
|
|
"vmovl.u8 q0, d0\n"
|
|
"vmovl.u8 q1, d2\n"
|
|
"vmovl.u8 q2, d4\n"
|
|
"vmovl.u8 q3, d6\n"
|
|
|
|
// Multiply-accumulate, level of depth 0
|
|
"vmlal.u16 q4, d2, d0[0]\n"
|
|
"vmlal.u16 q5, d2, d0[1]\n"
|
|
"vmlal.u16 q6, d2, d0[2]\n"
|
|
"vmlal.u16 q7, d2, d0[3]\n"
|
|
"vmlal.u16 q8, d4, d0[0]\n"
|
|
"vmlal.u16 q9, d4, d0[1]\n"
|
|
"vmlal.u16 q10, d4, d0[2]\n"
|
|
"vmlal.u16 q11, d4, d0[3]\n"
|
|
"vmlal.u16 q12, d6, d0[0]\n"
|
|
"vmlal.u16 q13, d6, d0[1]\n"
|
|
"vmlal.u16 q14, d6, d0[2]\n"
|
|
"vmlal.u16 q15, d6, d0[3]\n"
|
|
|
|
// Multiply-accumulate, level of depth 1
|
|
"vmlal.u16 q4, d3, d1[0]\n"
|
|
"vmlal.u16 q5, d3, d1[1]\n"
|
|
"vmlal.u16 q6, d3, d1[2]\n"
|
|
"vmlal.u16 q7, d3, d1[3]\n"
|
|
"vmlal.u16 q8, d5, d1[0]\n"
|
|
"vmlal.u16 q9, d5, d1[1]\n"
|
|
"vmlal.u16 q10, d5, d1[2]\n"
|
|
"vmlal.u16 q11, d5, d1[3]\n"
|
|
"vmlal.u16 q12, d7, d1[0]\n"
|
|
"vmlal.u16 q13, d7, d1[1]\n"
|
|
"vmlal.u16 q14, d7, d1[2]\n"
|
|
"vmlal.u16 q15, d7, d1[3]\n"
|
|
|
|
// Loop. Decrement loop index (depth) by 2, since we just handled 2
|
|
// levels of depth (Kernel::kDepth=2).
|
|
"subs %[run_depth], #2\n"
|
|
"bne loop_NEONKernel12x4Depth2_%=\n"
|
|
```
|
|
|
|
## Packing code adapts to the format chosen by the kernel
|
|
|
|
As explained in [design.md](design.md), gemmlowp starts by packing blocks of the
|
|
lhs and rhs matrices for optimally efficient traversal by the kernel. This
|
|
depends on fine details of the kernel format, in ways that can only be
|
|
efficiently handled by knowing these kernel format details at compile-time.
|
|
|
|
This is the reason why all the code in [internal/pack.h](../internal/pack.h) is
|
|
templated in the corresponding kernel format.
|
|
|
|
The code in internal/pack.h isn't tightly optimized by itself, but it is
|
|
structured in such a way that the critical code is in a template,
|
|
`PackingRegisterBlock`, that can easily be specialized to override the slow
|
|
generic code with fast specific packing code for specific formats, on specific
|
|
platforms.
|
|
|
|
See [internal/pack_neon.h](../internal/pack_neon.h) which provides NEON
|
|
specializations of the packing code for the particular kernel formats that are
|
|
used by the NEON kernels in [internal/kernel_neon.h](../internal/kernel_neon.h).
|
|
|
|
## Wrapping up: how to optimize gemmlowp for a CPU architecture
|
|
|
|
In conclusion, the key feature of gemmlowp when it comes to efficiently
|
|
supporting a specific CPU architecture, is that it allows to freely replace the
|
|
inner loop of the GEMM by providing one's own GEMM kernel, which is also free to
|
|
dictate its required data layout; each data layout then also needs optimized
|
|
packing code. The steps are thus:
|
|
|
|
1. Freely design a GEMM kernel with a freely chosen data layout.
|
|
2. Implement the GEMM kernel, similar to
|
|
[internal/kernel_neon.h](../internal/kernel_neon.h).
|
|
3. Implement the optimized packing code, similar to
|
|
[internal/pack_neon.h](../internal/pack_neon.h).
|