# 'linalg' Dialect

[TOC]

## Rationale

<img width="90" align="left" alt="MLIR Codegen Flow" src="https://user-images.githubusercontent.com/10148468/73613629-c5586580-45c5-11ea-94b7-074aeea94c7b.png">

Linalg is designed to solve the High-level Hierarchical Optimization (HHO box)
problem in MLIR and to interoperate nicely within a *Mixture Of Expert
Compilers* environment (i.e. the *CGSel* box).

The [Rationale Document](../Rationale/RationaleLinalgDialect.md) goes into
significantly more detail about the design and architectural decisions.

## Set of Key Transformations<a name="key_transformations"></a>

The following key transformations have been central to driving the design of
Linalg. They are all implemented in terms of the properties of the
`linalg.generic` OpInterface and avoid the pitfall of relying on hardcoded
one-off op knowledge.

The textual form description of these transformations is left for future work.
Still, it is useful to at least list the key transformations that are performed
on the Linalg IR and that have influenced its design:

1. Progressive Buffer Allocation.
1. Parametric Tiling.
1. Promotion to Temporary Buffer in Fast Memory.
1. Tiled Producer-Consumer Fusion with Parametric Tile-And-Fuse.
1. Map to Parallel and Reduction Loops and Hardware.
1. Vectorization: Rewrite in Vector Form.
1. Lower to Loops (Affine, Generic, and Parallel).
1. Lower to Library Calls or Special Instructions, Intrinsics or ISA.
1. Partially Lower to Iterations Over a Finer-Grained Linalg Op.

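As a taste of how such transformations work on the iteration space description
rather than on op-specific knowledge, here is a minimal Python sketch (purely
illustrative, not part of Linalg; the function name is ours) of the tile
decomposition at the heart of parametric tiling:

```python
# Illustrative sketch: parametric tiling decomposes an iteration space of
# size n into tiles of a runtime-parametric size ts, including a partial
# last tile. Real Linalg tiling rewrites ops and views, not Python lists.
def tile_1d(n, ts):
    """Return (start, size) pairs covering range(n) with tiles of size ts."""
    return [(start, min(ts, n - start)) for start in range(0, n, ts)]
```

For instance, `tile_1d(10, 4)` yields `[(0, 4), (4, 4), (8, 2)]`: the last
tile is partial because the tile size does not divide the trip count.
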
## High-Level Description of Linalg Ops<a name="linalg_ops"></a>

Linalg takes at least some inspiration from all previously [listed prior
art](#prior_art). The design enables the definition of ***CustomOps*** with
generic properties that enable [key transformations](#key_transformations),
including lowering to scalar load/store and other operations or to external
library calls and intrinsics.

These ops can have ***either tensor or buffer operands***, subject to
[conventions and limitations](#tensors_and_buffers).

### Payload-Carrying Ops<a name="payload_ops"></a>

Linalg defines two payload-carrying operations that implement the [structured
ops](https://docs.google.com/presentation/d/1P-j1GrH6Q5gLBjao0afQ-GfvcAeF-QU4GXXeSy0eJ9I/edit#slide=id.p)
abstraction on tensors and buffers. This is architected as two generic
operations, `linalg.generic` (resp. `linalg.indexed_generic`), that can express
custom operations with *index-free semantics* (resp. *indexing semantics*).
The properties of these generic ops are the result of applying the guiding
principles described in the
[Rationale Document](../Rationale/RationaleLinalgDialect.md). They are listed
next, with a brief example and discussion for each.

#### Property 1: Input and Output Operands Define The Iteration Space<a name="prop1"></a>

A `linalg.generic` op fully *derives* the specification of its iteration space
from its operands.

The property enforces that a localized IR element (the op) *has* all the
information needed to synthesize the control-flow required to iterate over its
operands, according to their type. This notion of IR localization bears some
resemblance to
[URUK](http://icps.u-strasbg.fr/~bastoul/research/papers/GVBCPST06-IJPP.pdf).

Consider the following fully specified `linalg.generic` example. Here, the
first operand is a `memref` of `f32` scalar elements that has an ordinary
identity layout, and the second one is a `memref` of 4-element vectors with a
2-strided, 1-offset layout.

```mlir
// File name: example1.mlir
#accesses = [
  affine_map<(m) -> (m)>,
  affine_map<(m) -> (m)>
]
#attrs = {
  args_in = 1,
  args_out = 1,
  indexing_maps = #accesses,
  iterator_types = ["parallel"]
}
// memory layouts
#identity = affine_map<(d0) -> (d0)>

func @example(%A: memref<?xf32, #identity>,
              %B: memref<?xvector<4xf32>, offset: 1, strides: [2]>) {
  linalg.generic #attrs %A, %B {
  ^bb0(%a: f32, %b: vector<4xf32>):
    %c = "some_compute"(%a, %b): (f32, vector<4xf32>) -> (vector<4xf32>)
    linalg.yield %c: vector<4xf32>
  } : memref<?xf32, #identity>, memref<?xvector<4xf32>, offset: 1, strides: [2]>
  return
}
```

The property "*Input and Output Operands Define The Iteration Space*" is
materialized by a lowering into a form that will resemble:

```mlir
// Run: mlir-opt example1.mlir -allow-unregistered-dialect -convert-linalg-to-loops
// This converted representation is in the `scf` dialect.
// Its syntax can be found here: https://mlir.llvm.org/docs/Dialects/SCFDialect/
#map0 = affine_map<(d0) -> (d0 * 2 + 1)>

func @example(%arg0: memref<?xf32>, %arg1: memref<?xvector<4xf32>, #map0>) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %0 = dim %arg0, %c0 : memref<?xf32>
  scf.for %arg2 = %c0 to %0 step %c1 {
    %1 = load %arg0[%arg2] : memref<?xf32>
    %2 = load %arg1[%arg2] : memref<?xvector<4xf32>, #map0>
    %3 = "some_compute"(%1, %2) : (f32, vector<4xf32>) -> vector<4xf32>
    store %3, %arg1[%arg2] : memref<?xvector<4xf32>, #map0>
  }
  return
}
```

The property participates in simplifying analyses and transformations. For
instance, it guarantees that no out-of-bounds access can occur by construction
(assuming dynamic operand dimensions agree with each other, which is the
purpose of the `assert` runtime check).

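To make this concrete, the following Python sketch (purely illustrative; the
function names are hypothetical stand-ins) mimics the lowered form of
example1: the loop bound is derived from the shape of the first operand, and
the layout map `#map0 = affine_map<(d0) -> (d0 * 2 + 1)>` turns each loop
index into a position in the second operand's underlying storage:

```python
# Sketch: the iteration space is derived entirely from the operands. No loop
# bound is stored on the op; it comes from the shape of A, and the strided
# layout (offset: 1, strides: [2]) is applied on every access to B's storage.
def map0(d0):
    return d0 * 2 + 1  # offset 1, stride 2

def lowered_example(A, B_storage, some_compute):
    n = len(A)              # %0 = dim %arg0, %c0
    for i in range(n):      # scf.for %arg2 = %c0 to %0 step %c1
        a = A[i]
        b = B_storage[map0(i)]
        B_storage[map0(i)] = some_compute(a, b)
```

By construction, no access is out of bounds as long as the storage of the
second operand covers `map0(len(A) - 1)`.
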

Before lowering to loop form, loop induction variables and iterators are *not
yet materialized*. This is a necessary property if we want an abstraction that
works on both tensor values and buffers because ***values don't escape
loops/nesting***.

The main implications are that:

1. The semantics of the ops are *restricted to operate on structured data
   types*, on which we can define an iterator.
2. This does not model arbitrary code with side-effects.

We do not think these are serious limitations in practice because MLIR is all
about mixing different levels of abstraction in the same IR. As long as Linalg
can progressively lower to the next level of abstraction, it can also simply
be bypassed for things that do not fit.


At the same time, conditioning op semantics on structured data types is a very
promising path towards extensibility to non-dense tensors, as experience with
LIFT abstractions for
[sparse](https://www.lift-project.org/publications/2016/harries16sparse.pdf)
and [position-dependent
arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf),
as well as [TACO](http://tensor-compiler.org/), has shown.

#### Property 2: Reversible Mappings Between Control and Data Structures<a name="prop2"></a>

A `linalg.generic` op *defines* the mapping between the iteration space (i.e.
the loops) and the data.

Consider the following fully specified `linalg.generic` example. Here, the
first `memref` is 2-strided on both of its dimensions, and the second `memref`
uses an identity layout.

```mlir
// File name: example2.mlir
#indexing_maps = [
  affine_map<(i, j) -> (j, i)>,
  affine_map<(i, j) -> (j)>
]
#attrs = {
  args_in = 1,
  args_out = 1,
  indexing_maps = #indexing_maps,
  iterator_types = ["parallel", "parallel"]
}

func @example(%A: memref<8x?xf32, offset: 0, strides: [2, 2]>,
              %B: memref<?xvector<4xf32>>) {
  linalg.generic #attrs %A, %B {
  ^bb0(%a: f32, %b: vector<4xf32>):
    %c = "some_compute"(%a, %b): (f32, vector<4xf32>) -> (vector<4xf32>)
    linalg.yield %c: vector<4xf32>
  }: memref<8x?xf32, offset: 0, strides: [2, 2]>, memref<?xvector<4xf32>>
  return
}
```

The property "*Reversible Mappings Between Control and Data Structures*" is
materialized by a lowering into a form that will resemble:

```mlir
// Run: mlir-opt example2.mlir -allow-unregistered-dialect -convert-linalg-to-loops
#map0 = affine_map<(d0, d1) -> (d0 * 2 + d1 * 2)>

func @example(%arg0: memref<8x?xf32, #map0>, %arg1: memref<?xvector<4xf32>>) {
  %c8 = constant 8 : index
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %0 = dim %arg0, %c1 : memref<8x?xf32, #map0>
  scf.for %arg2 = %c0 to %0 step %c1 {
    scf.for %arg3 = %c0 to %c8 step %c1 {
      %1 = load %arg0[%arg3, %arg2] : memref<8x?xf32, #map0>
      %2 = load %arg1[%arg3] : memref<?xvector<4xf32>>
      %3 = "some_compute"(%1, %2) : (f32, vector<4xf32>) -> vector<4xf32>
      store %3, %arg1[%arg3] : memref<?xvector<4xf32>>
    }
  }
  return
}
```

This mapping needs to be reversible because we want to be able to go back and
forth between the two and answer questions such as:

- Given a subset of the iteration space, what subset of data does it read and
  write?
- Given a subset of data read or written, what subset of the iteration space
  is responsible for this read or write?

Answering these two questions is one of the main analyses that Linalg uses to
implement transformations such as tiling, tiled producer-consumer fusion, and
promotion to temporary buffers in fast memory.

In the current implementation, `linalg.generic` uses a list of
[AffineMaps](https://mlir.llvm.org/docs/LangRef/#affinemap-attribute) (see the
`#indexing_maps` attribute in the previous examples). This is a pragmatic
short-term solution, but in the longer term this property could even be
evaluated dynamically, similarly to inspector-executor algorithms.

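The two questions above amount to computing images and preimages of the
indexing maps. A small Python sketch (purely illustrative) over the maps of
example2:

```python
# For example2: operand %A is indexed by (i, j) -> (j, i) and operand %B by
# (i, j) -> (j). Footprint maps an iteration subset to the data it touches;
# preimage maps a data subset back to the responsible iterations.
map_A = lambda i, j: (j, i)
map_B = lambda i, j: (j,)

def footprint(iteration_subset, indexing_map):
    return {indexing_map(*p) for p in iteration_subset}

def preimage(data_subset, indexing_map, iteration_space):
    return {p for p in iteration_space if indexing_map(*p) in data_subset}
```

For an iteration tile `{(0, 0), (0, 1)}`, `footprint(..., map_A)` gives the
data subset `{(0, 0), (1, 0)}`; conversely, the preimage of a data subset
recovers the loop iterations that read or write it.
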
#### Property 3: The Type Of Iterators is Defined Explicitly<a name="prop3"></a>

A `linalg.generic` op fully *declares* the type of its iterators. This
information is used in transformations.

These properties are derived from established practice in the field and mirror
the properties from Ken Kennedy's
[Optimizing Compilers for Modern Architectures](https://www.elsevier.com/books/optimizing-compilers-for-modern-architectures/allen/978-0-08-051324-9).
The key idea of legality of loop transformations expressed by Kennedy is that
***the lexicographic order of all dependence vectors must be preserved***.

This can be better captured directly at the loop level thanks to specific
iterator types, among which: *parallel*, *reduction*, *partition*,
*permutable/monotonic*, *sequential*, *dependence distance*, ...

These types are traditionally the result of complex dependence analyses and
have been referred to as "*bands*" in the polyhedral community (e.g. *parallel
bands*, *permutable bands*, etc., in
[ISL](https://en.wikipedia.org/wiki/Integer_set_library) schedule tree
parlance).

Specifying the information declaratively in a `linalg.generic` allows
conveying properties that may be hard (or even impossible) to derive from
lower-level information. These properties can be brought all the way to the
moment when they are useful for transformations, used, and then discarded.

Additionally, these properties may also be viewed as a contract that the
frontend/user guarantees and that the compiler may take advantage of. The
common example is the use of data-dependent reduction semantics for specifying
histogram computations. If the frontend has additional knowledge that proper
atomic operations are available, it may be better to specify parallel
semantics and use the special atomic in the computation region.

At this time, Linalg only has an explicit use for *parallel* and *reduction*
loops, but previous experience shows that the abstraction generalizes.

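The histogram example can be sketched in Python (purely illustrative) to show
why the loop over the input is a *reduction* iterator with respect to the
output:

```python
# Data-dependent reduction: several iterations of the loop over `data` may
# update the same bin, so the updates must be combined (here with +=). The
# loop cannot be declared parallel unless the bin update is made atomic.
def histogram(data, num_bins):
    bins = [0] * num_bins
    for x in data:   # reduction semantics with respect to `bins`
        bins[x] += 1
    return bins
```

Declaring this iterator *parallel* is only legal under the additional contract
that the `+= 1` update is atomic, which is exactly the kind of knowledge only
the frontend can guarantee.
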
#### Property 4: The Compute Payload is Specified With a Region<a name="prop4"></a>

A `linalg.generic` op has a compute payload that is fully generic thanks to
the use of
[Regions](https://github.com/llvm/llvm-project/blob/58265ad42a90ae8905be6a447cb42e53529a54a0/mlir/docs/LangRef.md#regions).

The region takes as arguments the scalar elemental types of the tensor or
buffer operands of the `linalg.generic`. For flexibility and the ability to
match library calls, additional special values may be passed. For instance, a
`linalg.fill` operation takes a buffer and an additional scalar value.

At this time there are no additional restrictions on the region semantics.
This is meant to allow the exploration of various design tradeoffs at the
intersection of regions and iterator types. In particular, the frontend is
responsible for ensuring that the semantics of the iterator types correspond
to the operations inside the region: the region can capture buffers
arbitrarily and write into them. If this conflicts with some parallel iterator
requirement, this is undefined behavior.

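Operationally, a `linalg.generic` op can be thought of as a pair (structure,
payload): the attributes describe the structure and the region is a scalar
function applied at every point of the iteration space. A Python sketch
(purely illustrative) for a rank-2 fully parallel op:

```python
# The structure iterates; the payload (the region) computes one scalar
# result per iteration point. An element-wise add corresponds to the
# payload lambda a, b, c: a + b.
def generic_2d(A, B, C, payload):
    for i in range(len(A)):
        for j in range(len(A[0])):
            C[i][j] = payload(A[i][j], B[i][j], C[i][j])
```
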
Previous examples already elaborate compute payloads with an unregistered
function `"some_compute"`. The following code snippet shows what the result
will be when using a concrete operation `addf`:

```mlir
// File name: example3.mlir
#indexing_maps = [
  affine_map<(i, j) -> (i, j)>,
  affine_map<(i, j) -> (i, j)>,
  affine_map<(i, j) -> (i, j)>
]
#attrs = {
  args_in = 2,
  args_out = 1,
  indexing_maps = #indexing_maps,
  iterator_types = ["parallel", "parallel"]
}
func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  linalg.generic #attrs %A, %B, %C {
  ^bb0(%a: f32, %b: f32, %c: f32):
    %d = addf %a, %b : f32
    linalg.yield %d : f32
  }: memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>
  return
}
```

This function performs an element-wise addition of the two matrices `%A` and
`%B` and stores the result in `%C`.


The property "*The Compute Payload is Specified With a Region*" is
materialized by a lowering into a form that will resemble:

```mlir
// Run: mlir-opt example3.mlir -convert-linalg-to-loops

func @example(%arg0: memref<?x?xf32>, %arg1: memref<?x?xf32>, %arg2: memref<?x?xf32>) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %0 = dim %arg0, %c0 : memref<?x?xf32>
  %1 = dim %arg0, %c1 : memref<?x?xf32>
  scf.for %arg3 = %c0 to %0 step %c1 {
    scf.for %arg4 = %c0 to %1 step %c1 {
      %2 = load %arg0[%arg3, %arg4] : memref<?x?xf32>
      %3 = load %arg1[%arg3, %arg4] : memref<?x?xf32>
      %4 = addf %2, %3 : f32
      store %4, %arg2[%arg3, %arg4] : memref<?x?xf32>
    }
  }
  return
}
```

In the process of lowering to loops and lower-level constructs, similar
requirements are encountered, as discussed in the
[inlined call op proposal](https://llvm.discourse.group/t/introduce-std-inlined-call-op-proposal/282/2).
We expect to be able to reuse the common lower-level infrastructure provided
it evolves to support both region arguments and captures.

#### Property 5: May Map To an External Library Call<a name="prop5"></a>

A `linalg.generic` op may map to an external library call by specifying a
`SymbolAttr`. At this level of abstraction, the important glue is the ability
to perform transformations that preserve the structure necessary to ***call
the external library after different transformations have been applied***.

This involves considerations related to preservation of op semantics and
integration at the ABI level. Regardless of whether one wants to use external
library calls or a custom ISA, the problem for codegen is similar:
preservation of a fixed granularity.

Consider the following example that adds an additional attribute
`library_call="pointwise_add"` that specifies the name of an external library
call we intend to use:

```mlir
// File name: example4.mlir
#indexing_maps = [
  affine_map<(i, j) -> (i, j)>,
  affine_map<(i, j) -> (i, j)>,
  affine_map<(i, j) -> (i, j)>
]
#attrs = {
  args_in = 2,
  args_out = 1,
  indexing_maps = #indexing_maps,
  iterator_types = ["parallel", "parallel"],
  library_call = "pointwise_add"
}
func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  linalg.generic #attrs %A, %B, %C {
  ^bb0(%a: f32, %b: f32, %c: f32):
    %d = addf %a, %b : f32
    linalg.yield %d : f32
  }: memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>
  return
}
```

The property "*May Map To an External Library Call*" is materialized by a
lowering into a form that will resemble:

```mlir
// Run: mlir-opt example4.mlir -convert-linalg-to-std
// Note that we lower the Linalg dialect directly to the Standard dialect.
// See this doc: https://mlir.llvm.org/docs/Dialects/Standard/

#map0 = affine_map<(d0, d1)[s0, s1, s2] -> (d0 * s1 + s0 + d1 * s2)>

func @example(%arg0: memref<?x?xf32>, %arg1: memref<?x?xf32>, %arg2: memref<?x?xf32>) {
  %0 = memref_cast %arg0 : memref<?x?xf32> to memref<?x?xf32, #map0>
  %1 = memref_cast %arg1 : memref<?x?xf32> to memref<?x?xf32, #map0>
  %2 = memref_cast %arg2 : memref<?x?xf32> to memref<?x?xf32, #map0>
  call @pointwise_add(%0, %1, %2) : (memref<?x?xf32, #map0>, memref<?x?xf32, #map0>, memref<?x?xf32, #map0>) -> ()
  return
}
func @pointwise_add(memref<?x?xf32, #map0>, memref<?x?xf32, #map0>, memref<?x?xf32, #map0>) attributes {llvm.emit_c_interface}
```

After lowering to LLVM, this resembles:

```mlir
// Run: mlir-opt example4.mlir -convert-linalg-to-std | mlir-opt -convert-std-to-llvm
// Some of the generated code is omitted here.
func @example(%arg0: !llvm<"float*">, ...) {
  ...
  llvm.call @pointwise_add(...) : (!llvm<"float*">, ...) -> ()
  return
}

llvm.func @pointwise_add(%arg0: !llvm<"float*">, ...) attributes {llvm.emit_c_interface} {
  ...
  llvm.call @_mlir_ciface_pointwise_add(%9, %19, %29) : (!llvm<"{ float*, float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, float*, i64, [2 x i64], [2 x i64] }*">) -> ()
  llvm.return
}
llvm.func @_mlir_ciface_pointwise_add(!llvm<"{ float*, float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, float*, i64, [2 x i64], [2 x i64] }*">) attributes {llvm.emit_c_interface}
```
##### Convention For External Library Interoperability

The `linalg` dialect adopts a convention that is similar to `BLAS` when
offloading operations to fast library implementations: pass a non-owning
pointer to input and output data with additional metadata. This convention is
also found in libraries such as `MKL`, `OpenBLAS`, `BLIS`, `cuBLAS`, `cuDNN`,
etc., and more generally at interface points across language boundaries (e.g.
C++ / Python).

Generally, `linalg` passes non-owning pointers to View data structures to
pre-compiled library calls linked externally.

There is an
[ongoing discussion](https://llvm.discourse.group/t/lowering-optional-attributes-in-linalg-structuredops-to-standard-dialect/333/3)
on the topic of extending interoperability in the presence of key attributes.

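The "additional metadata" is the strided memref descriptor visible in the LLVM
lowering above (`{ float*, float*, i64, [2 x i64], [2 x i64] }` for rank 2). A
Python model of that descriptor and its address computation (purely
illustrative; the field names are ours, not part of any MLIR API):

```python
from dataclasses import dataclass

@dataclass
class StridedMemRef:
    """Non-owning view: allocated ptr, aligned ptr, offset, sizes, strides."""
    allocated: object   # pointer to the allocation (for deallocation)
    aligned: object     # pointer to the aligned data
    offset: int
    sizes: list         # one entry per dimension
    strides: list       # one entry per dimension

    def linear_index(self, *indices):
        # Address of an element: offset + sum(index[d] * stride[d]).
        return self.offset + sum(i * s for i, s in zip(indices, self.strides))
```

A row-major 8x4 view has `sizes = [8, 4]` and `strides = [4, 1]`, so element
`(2, 3)` lives at linear position `2 * 4 + 3 * 1 = 11`.
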
#### Property 6: Perfectly Nested Writes To The Whole Output Operands<a name="prop6"></a>

Perfectly nested loops form a particularly important class of structure that
enables key loop transformations such as tiling and mapping to library calls.
Unfortunately, this type of structure is easily broken by transformations such
as partial loop fusion; tiling and mapping to library calls then become more
challenging, or even infeasible. Linalg ops adopt perfect-nestedness as a
first-class property: the structure cannot be broken and is transported in the
IR by construction.

A `linalg.generic` op represents a perfectly nested loop nest that writes the
entire memory region. This is a structural constraint across regions and
loops that has proven to be key in simplifying transformations.

One particular point to mention is that converting imperfectly nested code
into perfectly nested code can often be done with enough loop distribution
and embedding of conditionals down to the innermost loop level.

Previous experience with Tensor Comprehensions gave us the intuition that
forcing innermost control-flow nesting is a lot like writing data-parallel
code with arrays of boolean values and predication. This type of trick has
also been used before in polyhedral compilers to convert non-affine control
into affine compute dependencies.

While it may be possible to automate such rewrites from generic IR,
`linalg.generic` just forces the semantics for now.

The key implication is that this conversion to deep predication needs to be
undone once we are done with Linalg transformations. After iterators and
induction variables are materialized (i.e. after lowering out of
`linalg.generic` has occurred), the overall performance will be greatly
influenced by the quality of canonicalizations, foldings, and *Loop-Invariant
Code Motion* (LICM).

In the grander scheme, the reliance on late LICM was deemed a necessary risk.

#### Putting it Together<a name="summary"></a>

As it stands, the six properties above define the semantics of a
`linalg.generic` op. It is an open question whether all of these semantics
are strictly necessary in practice and whether some should or could be
derived automatically while still maintaining the
[core guiding principles](#guiding_principles).

For the time being, we have settled on this combination of properties based on
empirical evidence from building and working on multiple high-level compilers.
As we lay those down and engage more with the community, we expect multiple
rounds of discussion and design changes to the original architecture.

### Tensors and Buffers: Conventions and Limitations <a name="tensors_and_buffers"></a>

Tensors are immutable SSA values; buffers are mutable regions of memory
subject to side-effects and aliasing. As a consequence, output buffers are
passed as operands whereas output tensors are new SSA values corresponding to
op results. Inputs can be arbitrary tensors or buffers and are always passed
as operands.

The following convention is currently in flight and is in the process of
replacing other existing conventions. It currently applies to "named"
structured ops, which are auto-generated by the linalg-ods tool.

The convention adopted is as follows:


1. A first block of `ins` op operands holds read-only inputs of ShapedType.
2. An optional second block of `outs` op operands holds read-write output
   buffers of MemRefType.
3. An optional third block of `init` operands holds initialization tensors of
   RankedTensorType. Such tensors can appear when the op performs a reduction
   and returns a tensor.

Structured ops with fully parallel semantics have an empty `init`. They may
either write in-place into `outs` buffers or return new tensors.

Structured ops with reduction semantics and output tensor(s), however, have
additional restrictions:

1. They can only return a single tensor for now.
2. They cannot have any output buffer operand (i.e. `outs` is empty).
3. They have exactly one `init` tensor of the same type as the unique output
   tensor. Such an `init` tensor does not have an explicit associated indexing
   map. Instead, the map of the result tensor is used to signify that the
   `init` and the `result` are "tied".

Points 1. and 2. keep the complexity of the representation in check by
allowing only a single result tensor when reductions are present.


Point 3. is related to the fact that SSA values cannot represent in-place
updates. Instead, linalg adopts a convention similar to the one that exists in
e.g. `vector.outerproduct`: the value that is reduced into is passed as an
explicit argument and a new result of the same shape is produced.

It is expected that buffer allocation will fold this last input onto the
result in a single output buffer argument, which is why the same indexing map
is required: the last input operand is said to be "tied" to the result.

Alternative, more complex representations would allow for:

1. Multiple results and `init` tensors in arbitrary orders, which could be
   captured by an extra ArrayAttr of position pairs.
2. Relaxing the conditions on the indexing map equalities on each pair and
   e.g. allowing implicit broadcasts of the input.

These representations are deemed unnecessarily complex for now and are left
for future discussion.


As an illustration, the syntax for a `linalg.matmul` writing into a buffer is:

```mlir
linalg.matmul ins(%a, %b : memref<?x?xf32>, tensor<?x?xf32>)
              outs(%c : memref<?x?xf32>)
```

whereas the syntax for a `linalg.matmul` returning a new tensor is:

```mlir
%d = linalg.matmul ins(%a, %b : tensor<?x?xf32>, memref<?x?xf32>)
                   init(%c : tensor<?x?xf32>)
  -> tensor<?x?xf32>
```

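The `init`/result tie can be sketched in Python with value semantics (purely
illustrative; the function name is ours): the `init` operand is consumed, not
mutated, and a fresh result of the same shape is returned.

```python
# SSA values are immutable, so "accumulate into %c" is expressed as
# "take %c as init, return a new result of the same type, tied to it".
def matmul_with_init(A, B, init):
    M, K, N = len(A), len(B), len(B[0])
    C = [row[:] for row in init]  # copy: the init value itself is unchanged
    for m in range(M):
        for n in range(N):
            for k in range(K):
                C[m][n] += A[m][k] * B[k][n]
    return C
```

Buffer allocation may later fold `init` and the result into a single output
buffer, recovering the in-place update.
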
### Data Representation: Views<a name="views"></a>

The current implementation uses the
[Strided MemRef (a.k.a View)](https://groups.google.com/a/tensorflow.org/forum/#!topic/mlir/MaL8m2nXuio)
abstraction. The name *View* is used interchangeably in `linalg` to signify
*Strided MemRef*. In the future we expect to use other structured data types
and support ragged, mixed-sparse and other types. We expect to draw on the
experience from existing LIFT abstractions for
[sparse](https://www.lift-project.org/publications/2016/harries16sparse.pdf)
and [position-dependent
arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf).

### Metadata Ops<a name="metadata_ops"></a>

A set of ops that manipulate metadata but do not move memory. These ops take
`view` operands + extra attributes and return new `view`s. The returned
`view`s generally alias the operand `view`. At the moment the existing ops
are:

* `std.view`,
* `std.subview`,
* `std.transpose`,
* `linalg.range`,
* `linalg.slice`,
* `linalg.reshape`.

Future ops are added on a per-need basis but should include:

* `linalg.tile`,
* `linalg.intersection`,
* `linalg.convex_union`,
* `linalg.difference` (would need to work on a list of views).

These additional operations correspond to abstractions that have been known to
work in the field of large-scale distributed stencil computations.

In a longer-term future, the abstractions from the
[Legion data-centric programming model](https://legion.stanford.edu/overview/)
seem generally appealing.

### Named Payload-Carrying Ops<a name="named_ops"></a>

Additionally, `linalg` provides a small subset of commonly named operations:

* `linalg.copy`,
* `linalg.fill`,
* `linalg.dot`,
* `linalg.matmul`,
* `linalg.conv`.

These named operations adhere to the `linalg.generic` op interface. Work is in
progress to define declarative mechanisms to automatically generate named ops
from a description in terms of only the generic op interface.

This is the main reason there are only a small number of ops today: we expect
them to be auto-generated from Tablegen soon.

### Named Payload Ops Specification

Linalg provides a declarative specification and a generation tool
(`mlir-linalg-ods-gen`) to automatically produce named ops from a notation
that is inspired by Einstein notation.

The syntax and semantics used in `mlir-linalg-ods-gen` are very much in flight
and borrow from Tensor Comprehensions (TC) but differ in a few dimensions, to
better adapt to Linalg:

1. The input and output tensor parameters are specified as `id :
   type(symbolic-affine-expression-list)` (e.g. `A : f32(M, N + M)`) and each
   new symbol is discovered eagerly. TC, on the other hand, does not allow
   general symbolic affine expressions.
1. The output shapes are specified explicitly; in TC they are always derived
   from the input shapes.
1. The operations used to specify computations use EDSC intrinsics so that
   they can easily be parsed and emitted into a simple region builder without
   resorting to more general MLIR parsing.
1. Reduction dimensions are specified with angle bracket notation on the
   operation they apply to (e.g. `std_add<k>` specifies that `k` is a
   reduction dimension). In TC, a reduction is specified with the `op=`
   operator and the reduction dimensions are inferred.
1. The parallel and reduction dimensions are ordered by the textual program
   order. For instance, in the comprehension `O(i, j) = std_add<k, l>(...)`,
   `i` (resp. `j`) is a parallel iterator encoded by the affine dimension of
   position `0` (resp. `1`); `k` (resp. `l`) is a reduction iterator encoded
   by an affine dimension of position `2` (resp. `3`).

These decisions and this syntax are subject to evolution and change. In
particular, op-specific attributes, dynamic ranks, some form of templating,
shape calculation function specification, etc. may be added in the future.

At this time, the following restrictions are imposed on the syntax and
semantics:

1. Each def may only contain a single comprehension, but each comprehension
   may perform multiple updates.
2. Each tensor may only be used with a single indexing expression.

The following specification may be used to define a named `batchmatmul` op:

```
def batchmatmul(A: f32(Batch, M, K), B: f32(K, N)) -> (C: f32(Batch, M, N)) {
  C(b, m, n) = std_addf<k>(std_mulf(A(b, m, k), B(k, n)));
}
```

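The comprehension above reads as the following reference loop nest (a Python
sketch, purely illustrative): `b`, `m`, `n` are parallel iterators (affine
dimensions 0, 1, 2) and `k` is a reduction iterator (dimension 3), in textual
order.

```python
def batchmatmul(A, B):
    # C(b, m, n) = std_addf<k>(std_mulf(A(b, m, k), B(k, n)))
    Batch, M, K, N = len(A), len(A[0]), len(A[0][0]), len(B[0])
    C = [[[0.0] * N for _ in range(M)] for _ in range(Batch)]
    for b in range(Batch):          # parallel
        for m in range(M):          # parallel
            for n in range(N):      # parallel
                for k in range(K):  # reduction <k>
                    C[b][m][n] += A[b][m][k] * B[k][n]
    return C
```
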
When `mlir-linalg-ods-gen -gen-ods-decl=1` is called, the following ODS is
produced:

```
def batchmatmulOp : LinalgNamedStructured_Op<"batchmatmul", [
  NInputs<2>,
  NOutputs<1>,
  NamedStructuredOpTrait]> { ... }
```

When `mlir-linalg-ods-gen -gen-impl=1` is called, the following C++ is
produced:

```
llvm::Optional<SmallVector<StringRef, 8>> batchmatmul::referenceIterators() {
  return SmallVector<StringRef, 8>{
    getParallelIteratorTypeName(),
    getParallelIteratorTypeName(),
    getParallelIteratorTypeName(),
    getReductionIteratorTypeName() };
}
llvm::Optional<SmallVector<AffineMap, 8>> batchmatmul::referenceIndexingMaps() {
  MLIRContext *context = getContext();
  AffineExpr d0, d1, d2, d3;
  bindDims(context, d0, d1, d2, d3);
  return SmallVector<AffineMap, 8>{
    AffineMap::get(4, 0, {d0, d1, d3}),
    AffineMap::get(4, 0, {d3, d2}),
    AffineMap::get(4, 0, {d0, d1, d2}) };
}
void batchmatmul::regionBuilder(ArrayRef<BlockArgument> args) {
  using namespace edsc;
  using namespace intrinsics;
  Value _0(args[0]), _1(args[1]), _2(args[2]);
  Value _4 = std_mulf(_0, _1);
  Value _5 = std_addf(_2, _4);
  (linalg_yield(ValueRange{ _5 }));
}
```

## Open Issues and Design Alternatives<a name="open_issues"></a>

Multiple open issues and design alternatives are in flight and it is time to
lay them out for the community to discuss and pick apart:

1. Should `linalg.generic` support nesting?
1. Should `linalg.generic` regions take views or only scalars?
1. Should we try to solve automatic differentiation at this level of
   abstraction?
1. Are all six properties really necessary?
1. Is this relying too much on declarative specification and would we be
   better off relying more on analyses?
1. Is this general enough for the community's needs? If not, how should this
   be extended, if at all?

...

These key questions (and many more) should really be thought of in the general
context of MLIR, in which different levels of IR interoperate seamlessly. In
practice, it is not necessary (or beneficial) to try to solve all problems in
the same IR.

## Operations

[include "Dialects/LinalgOps.md"]