# Gemmlowp's public entry points

gemmlowp's public interface is defined in
[public/gemmlowp.h](../public/gemmlowp.h).

## GemmWithOutputPipeline

The primary public entry point is `GemmWithOutputPipeline`.

A usage example is given in
[doc/quantization_example.cc](quantization_example.cc).

A high-level overview of how this specifies a low-precision matrix
multiplication is given in [low-precision.md](low-precision.md). The
rationale for a specific quantization paradigm is given in
[quantization.md](quantization.md). That specific quantization paradigm is
implemented at two different stages of the computation, as pre-processing on
the operands and as post-processing on the result:

* Pre-processing on the LHS, RHS operands, in the form of adding constant
  `lhs_offset`, `rhs_offset` to them, is explained in
  [low-precision.md](low-precision.md).

* Post-processing on the result, in the form of a flexible "output pipeline",
  is explained in [output.md](output.md).

More details on this below as we discuss specific function parameters.

The prototype is:

```
template <typename InputScalar, typename OutputScalar, typename BitDepthParams,
          MapOrder LhsOrder, MapOrder RhsOrder, MapOrder ResultOrder,
          typename OutputPipelineType, typename GemmContextType>
void GemmWithOutputPipeline(GemmContextType* context,
                            const MatrixMap<const InputScalar, LhsOrder>& lhs,
                            const MatrixMap<const InputScalar, RhsOrder>& rhs,
                            MatrixMap<OutputScalar, ResultOrder>* result,
                            int lhs_offset, int rhs_offset,
                            const OutputPipelineType& output_pipeline);
```

A typical call looks like (from the [usage example](quantization_example.cc)):

```
gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                 gemmlowp::DefaultL8R8BitDepthParams>(
    &gemm_context, uint8_lhs_matrix, uint8_rhs_matrix,
    &uint8_result_matrix, lhs_offset, rhs_offset, output_pipeline);
```

### Template parameters

Typically only the first three template parameters need to be specified; the
rest are deduced automatically from the function parameters:

* `InputScalar`: The scalar type of the LHS and RHS operands. At the moment,
  this must be `std::uint8_t`.
* `OutputScalar`: The scalar type of the result. At the moment,
  this must be `std::uint8_t`.
* `BitDepthParams`: Defines the bit format of the input and output matrices
  and the required accuracy of the computation. At the moment, the only
  non-deprecated valid value is `gemmlowp::DefaultL8R8BitDepthParams`. See
  [less-than-8-bit.md](less-than-8-bit.md) for other values and the general
  idea of this, and how it may become more useful in the future.

The other template parameters, which typically do not need to be specified, are:

* `LhsOrder`, `RhsOrder`, `ResultOrder`: the storage orders (row-major or
  column-major) of the LHS, RHS, result matrices. See
  [public/map.h](../public/map.h). See the performance note below: we
  recommend using respectively RowMajor, ColMajor, ColMajor for optimal
  performance.
* `OutputPipelineType`: the actual `std::tuple` type of the output pipeline.
  See the explanation of the `output_pipeline` parameter below, and
  [output.md](output.md).
* `GemmContextType`: the type of the `context` parameter. At the moment, this
  must be `gemmlowp::GemmContext`.

### Function parameters

The function parameters taken by `GemmWithOutputPipeline` are:

* `context`: The `gemmlowp::GemmContext` object holding state and resources to
  be used for this gemmlowp call.
* `lhs`, `rhs`: The LHS and RHS operand matrices. Note that these are
  `MatrixMap` objects, mapping external buffers as matrices, not owning data.
  See [public/map.h](../public/map.h).
* `result`: pointer to the destination `MatrixMap` object, which must be
  already constructed, wrapping the external destination buffer with the
  wanted destination matrix shape and storage layout. No memory allocation
  will be performed by gemmlowp for the destination buffer. See
  [public/map.h](../public/map.h).
* `lhs_offset`, `rhs_offset`: constants added to each matrix entry in the
  LHS, RHS matrices respectively, as explained in
  [low-precision.md](low-precision.md). This covers only the part of the
  quantization paradigm explained in [quantization.md](quantization.md) that
  needs to be implemented as operations on the operands; everything else is
  implemented as operations on the result, see `output_pipeline`.
* `output_pipeline`: a `std::tuple` of output stages (see
  [public/output_stages.h](../public/output_stages.h)), specifying the output
  pipeline (see [output.md](output.md)). This is the part of the quantization
  paradigm explained in [quantization.md](quantization.md) that needs to be
  implemented as operations on the result matrix.

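To make the role of the two offsets concrete, here is a minimal reference sketch of the operand-side arithmetic described above. It is not gemmlowp code: `ReferenceOffsetGemm` is a hypothetical helper, and the tightly packed layouts (row-major LHS, column-major RHS and result) are assumptions chosen for brevity. It adds `lhs_offset` and `rhs_offset` to each operand entry, accumulates in 32 bits, and returns the raw accumulators, which is the point where an output pipeline would take over.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference routine for illustration only (not a gemmlowp API).
// lhs:    rows x depth, row-major, tightly packed (assumed layout).
// rhs:    depth x cols, column-major, tightly packed (assumed layout).
// return: rows x cols int32 accumulators, column-major.
std::vector<std::int32_t> ReferenceOffsetGemm(
    const std::vector<std::uint8_t>& lhs, const std::vector<std::uint8_t>& rhs,
    int rows, int depth, int cols, int lhs_offset, int rhs_offset) {
  assert(lhs.size() == static_cast<std::size_t>(rows) * depth);
  assert(rhs.size() == static_cast<std::size_t>(depth) * cols);
  std::vector<std::int32_t> acc(static_cast<std::size_t>(rows) * cols, 0);
  for (int c = 0; c < cols; ++c) {
    for (int r = 0; r < rows; ++r) {
      std::int32_t sum = 0;
      for (int d = 0; d < depth; ++d) {
        // The offsets are added to the operand entries before multiplying.
        const std::int32_t l =
            lhs[static_cast<std::size_t>(r) * depth + d] + lhs_offset;
        const std::int32_t x =
            rhs[static_cast<std::size_t>(c) * depth + d] + rhs_offset;
        sum += l * x;
      }
      acc[static_cast<std::size_t>(c) * rows + r] = sum;
    }
  }
  return acc;
}
```

For example, with a 2x2 LHS `{1, 2, 3, 4}`, a 2x2 RHS `{5, 6, 7, 8}`, `lhs_offset = -1` and `rhs_offset = -5`, this sketch yields the column-major accumulators `{1, 3, 3, 13}`.
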
### Performance note on storage orders

gemmlowp supports arbitrary combinations of storage orders for the LHS, RHS and
result matrices. However, not all combinations are equally optimized.

Because gemmlowp is primarily aimed at neural network inference workloads,
optimization focus is on this particular combination of storage orders:

* `LhsOrder=RowMajor`
* `RhsOrder=ColMajor`
* `ResultOrder=ColMajor`

The rationale is that the LHS is typically the constant weights of a neural
network layer (e.g. the weights of a Convolutional layer implemented as a matrix
multiplication), while the RHS and result are neural network activations,
respectively the input and output activations of the layer.

Because the RHS and result are activations, we want them to share the same
storage order, so that one layer's output activations can be readily used as
the next layer's input activations. Thus, we focus on `RhsOrder=ResultOrder`.

We also know from general considerations on matrix multiplication that it is
slightly more efficient to have the direction of accumulation (the "depth"
dimension) be the direction of contiguous storage in memory. That means that it
is always going to be slightly easier and more efficient to have
`LhsOrder=RowMajor` and `RhsOrder=ColMajor`.

Putting this together, we arrive at gemmlowp's focus on the above-described
combination of storage orders.

Using other storage orders will typically mean taking less efficient paths in
the packing and unpacking stages, see [packing.md](packing.md). The compute
kernel stage ([kernel.md](kernel.md)) is unaffected.

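The contiguity argument above can be checked with a small sketch. The `Order` enum and `MatrixView` struct below are hypothetical stand-ins written for this note, not gemmlowp's `MapOrder` and `MatrixMap` (see [public/map.h](../public/map.h) for those), and they assume tightly packed buffers. The sketch shows a non-owning view over an external buffer and computes the stride, in elements, between consecutive entries along the depth dimension.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for illustration; not gemmlowp's own types.
enum class Order { RowMajor, ColMajor };

// Non-owning view of a tightly packed rows x cols matrix.
template <typename Scalar, Order kOrder>
struct MatrixView {
  Scalar* data;
  int rows;
  int cols;

  // Distance in elements between vertically adjacent entries.
  int RowStride() const { return kOrder == Order::RowMajor ? cols : 1; }
  // Distance in elements between horizontally adjacent entries.
  int ColStride() const { return kOrder == Order::ColMajor ? rows : 1; }

  Scalar& operator()(int r, int c) const {
    assert(r >= 0 && r < rows && c >= 0 && c < cols);
    return data[r * RowStride() + c * ColStride()];
  }
};

// The LHS is rows x depth, so depth runs along its columns.
template <typename Scalar, Order kOrder>
int LhsDepthStride(const MatrixView<Scalar, kOrder>& lhs) {
  return lhs.ColStride();
}

// The RHS is depth x cols, so depth runs along its rows.
template <typename Scalar, Order kOrder>
int RhsDepthStride(const MatrixView<Scalar, kOrder>& rhs) {
  return rhs.RowStride();
}
```

With a row-major LHS and a column-major RHS, both depth strides are 1, so the accumulation loop reads each operand contiguously. With a column-major LHS, the depth stride becomes the number of rows.
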
## GemmWithOutputPipelinePC

This is a variant where `lhs_offset` and `rhs_offset` may be vectors instead of
scalars. They are then broadcast against the LHS and RHS respectively.

This is useful for some flavors of neural network inference with "per-channel
quantization", hence the PC suffix. This has been useful in some settings where
a neural network trained in float arithmetic was subsequently quantized. On the
other hand, retraining neural networks for quantized inference tends to remove
the need for per-channel quantization. For that reason, the long-term usefulness
of this entry point is in question.

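As a rough illustration of what broadcasting vector offsets means, the sketch below is a plain reference loop, not gemmlowp code. `ReferencePerChannelGemm` is a hypothetical helper, and it assumes one `lhs_offsets` entry per LHS row and one `rhs_offsets` entry per RHS column, with a tightly packed row-major LHS and column-major RHS and result. Check the actual shapes expected by `GemmWithOutputPipelinePC` in [public/gemmlowp.h](../public/gemmlowp.h) before relying on this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference routine for illustration only (not a gemmlowp API).
// lhs: rows x depth, row-major; rhs: depth x cols, column-major (assumed).
// lhs_offsets[r] is added to every entry of LHS row r (assumption).
// rhs_offsets[c] is added to every entry of RHS column c (assumption).
std::vector<std::int32_t> ReferencePerChannelGemm(
    const std::vector<std::uint8_t>& lhs, const std::vector<std::uint8_t>& rhs,
    int rows, int depth, int cols, const std::vector<int>& lhs_offsets,
    const std::vector<int>& rhs_offsets) {
  assert(lhs.size() == static_cast<std::size_t>(rows) * depth);
  assert(rhs.size() == static_cast<std::size_t>(depth) * cols);
  assert(lhs_offsets.size() == static_cast<std::size_t>(rows));
  assert(rhs_offsets.size() == static_cast<std::size_t>(cols));
  std::vector<std::int32_t> acc(static_cast<std::size_t>(rows) * cols, 0);
  for (int c = 0; c < cols; ++c) {
    for (int r = 0; r < rows; ++r) {
      std::int32_t sum = 0;
      for (int d = 0; d < depth; ++d) {
        // Each row / column uses its own offset instead of a shared scalar.
        const std::int32_t l =
            lhs[static_cast<std::size_t>(r) * depth + d] + lhs_offsets[r];
        const std::int32_t x =
            rhs[static_cast<std::size_t>(c) * depth + d] + rhs_offsets[c];
        sum += l * x;
      }
      acc[static_cast<std::size_t>(c) * rows + r] = sum;
    }
  }
  return acc;
}
```

Passing vectors whose entries are all equal reproduces the scalar-offset behavior of `GemmWithOutputPipeline` in this sketch.
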
## Gemm

This is gemmlowp's original, now legacy and deprecated, entry point. See the
section of [low-precision.md](low-precision.md) on the legacy quantization
paradigm. Avoid in new code.

## The eight_bit_int_gemm directory

As explained in the top-level [README.md](../README.md#public-interfaces), this
is entirely deprecated.