# ProtoZero design document

ProtoZero is a zero-copy zero-alloc zero-syscall protobuf serialization library
purposefully built for Perfetto's tracing use cases.

## Motivations

ProtoZero has been designed and optimized for proto serialization, which is used
by all Perfetto tracing paths.
Deserialization was introduced only at a later stage of the project and is
mainly used by offline tools
(e.g., [TraceProcessor](/docs/analysis/trace-processor.md)).
The _zero-copy zero-alloc zero-syscall_ statement applies only to the
serialization code.

Perfetto makes extensive use of protobuf in tracing fast-paths. Every trace
event in Perfetto is a proto
(see [TracePacket](/docs/reference/trace-packet-proto.autogen) reference). This
allows events to be strongly typed and makes it easier for the team to maintain
backwards compatibility using a language that is understood across the board.

Tracing fast-paths need to have very little overhead, because instrumentation
points are sprinkled all over the codebase of projects like Android
and Chrome and are performance-critical.

Overhead here is not just defined as the CPU time (or instructions retired) it
takes to execute the instrumentation point. A big source of overhead in a
tracing system is the working set of the instrumentation points,
specifically the extra I-cache and D-cache misses which slow down the
non-tracing code _after_ the tracing instrumentation point.

The major design departures of ProtoZero from canonical C++ protobuf libraries
like [libprotobuf](https://github.com/google/protobuf) are:

* Treating serialization and deserialization as different use-cases served by
  different code.

* Optimizing for binary size and working-set-size on the serialization paths.

* Ignoring most of the error checking and long-tail features of protobuf
  (repeated vs optional, full type checks).

* ProtoZero is not designed as a general-purpose protobuf de/serialization
  library and is heavily customized to keep the trace-writing code minimal and
  allow the compiler to see through the architectural layers.

* Code generated by ProtoZero needs to be hermetic. When building the
  amalgamated [Tracing SDK](/docs/instrumentation/tracing-sdk.md), all
  Perfetto tracing sources must not depend on any library other than the
  C++ standard library and the C library.

## Usage

At the build-system level, ProtoZero is extremely similar to the conventional
libprotobuf library.
The ProtoZero `.proto -> .pbzero.{cc,h}` compiler is built on top of the
libprotobuf parser and compiler infrastructure. ProtoZero is implemented as a
`protoc` compiler plugin.

ProtoZero has a build-time-only dependency on libprotobuf (the plugin depends
on libprotobuf's parser and compiler). The `.pbzero.{cc,h}` code generated by
it, however, has no runtime dependency (not even header-only dependencies) on
libprotobuf.

In order to generate ProtoZero stubs from proto you need to:

1. Build the ProtoZero compiler plugin, which lives in
   [src/protozero/protoc_plugin/](/src/protozero/protoc_plugin/).
    ```bash
    tools/ninja -C out/default protozero_plugin protoc
    ```

2. Invoke the libprotobuf `protoc` compiler passing the `protozero_plugin`:
    ```bash
    out/default/protoc \
        --plugin=protoc-gen-plugin=out/default/protozero_plugin \
        --plugin_out=wrapper_namespace=pbzero:/tmp/ \
        test_msg.proto
    ```
    This generates `/tmp/test_msg.pbzero.{cc,h}`.

NOTE: The .cc file is always empty. ProtoZero-generated code is header only.
The .cc file is emitted only because some build systems' rules assume that
protobuf codegens generate both a .cc and a .h file.

## Proto serialization

The quickest way to understand ProtoZero design principles is to start from a
small example and compare the generated code between libprotobuf and ProtoZero.

```protobuf
syntax = "proto2";

message TestMsg {
  optional string str_val = 1;
  optional int32 int_val = 2;
  repeated TestMsg nested = 3;
}
```

#### libprotobuf approach

The libprotobuf approach is to generate a C++ class that has one member for each
proto field, with dedicated serialization and de-serialization methods.

```bash
out/default/protoc --cpp_out=. test_msg.proto
```

This generates `test_msg.pb.{cc,h}`. With many degrees of simplification, it
looks as follows:

```c++
// This class is generated by the standard protoc compiler in the .pb.h source.
class TestMsg : public protobuf::MessageLite {
 private:
  int32_t int_val_;
  ArenaStringPtr str_val_;
  RepeatedPtrField<TestMsg> nested_;  // Effectively a vector<TestMsg>

 public:
  const std::string& str_val() const;
  void set_str_val(const std::string& value);

  bool has_int_val() const;
  int32_t int_val() const;
  void set_int_val(int32_t value);

  ::TestMsg* add_nested();
  ::TestMsg* mutable_nested(int index);
  const TestMsg& nested(int index);

  std::string SerializeAsString();
  bool ParseFromString(const std::string&);
};
```

The main characteristics of these stubs are:

* Code generated from .proto messages can be used in the codebase as general
  purpose objects, without ever using the `SerializeAs*()` or `ParseFrom*()`
  methods (although anecdotal evidence suggests that most projects use these
  proto-generated classes only at the de/serialization endpoints).

* The end-to-end journey of serializing a proto involves two steps:
  1. Setting the individual int / string / vector fields of the generated class.
  2. Doing a serialization pass over these fields.

In turn this has side-effects on the code generated. STL copy/assignment
operators for strings and vectors are non-trivial because, for instance, they
need to deal with dynamic memory resizing.
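To make the two-step journey concrete, here is a minimal, self-contained sketch
of the store-then-serialize pattern for the `TestMsg` schema. It is not
libprotobuf code: `TwoStepMsg`, `WriteVarInt()` and `Serialize()` are
hypothetical names chosen for illustration, and only the three fields of
`TestMsg` are handled.

```c++
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: appends `value` as a protobuf base-128 varint.
inline void WriteVarInt(uint64_t value, std::string* out) {
  assert(out != nullptr);
  while (value >= 0x80) {
    out->push_back(static_cast<char>((value & 0x7f) | 0x80));
    value >>= 7;
  }
  out->push_back(static_cast<char>(value));
}

// Hypothetical stand-in for a libprotobuf-style class (not generated code).
struct TwoStepMsg {
  // Step 1 targets: setting a field only stores a copy of the value.
  std::string str_val;
  bool has_int_val = false;
  int32_t int_val = 0;
  std::vector<TwoStepMsg> nested;

  // Step 2: a separate pass walks the stored fields and emits the bytes.
  std::string Serialize() const {
    std::string out;
    if (!str_val.empty()) {
      out.push_back(0x0a);  // Field 1, wire type 2 (length-delimited).
      WriteVarInt(str_val.size(), &out);
      out += str_val;
    }
    if (has_int_val) {
      out.push_back(0x10);  // Field 2, wire type 0 (varint).
      // Negative int32 values are sign-extended to 64 bits on the wire.
      WriteVarInt(static_cast<uint64_t>(static_cast<int64_t>(int_val)), &out);
    }
    for (const TwoStepMsg& child : nested) {
      // The child is serialized in full first, so its size is known upfront.
      const std::string payload = child.Serialize();
      out.push_back(0x1a);  // Field 3, wire type 2 (length-delimited).
      WriteVarInt(payload.size(), &out);
      out += payload;
    }
    return out;
  }
};
```

In this sketch every nested payload goes through a temporary `std::string`
before its length can be written, which is the kind of copy and allocation
that the stored-fields approach implies.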

#### ProtoZero approach

```c++
// This class is generated by the ProtoZero plugin in the .pbzero.h source.
class TestMsg : public protozero::Message {
 public:
  void set_str_val(const std::string& value) {
    AppendBytes(/*field_id=*/1, value.data(), value.size());
  }
  void set_str_val(const char* data, size_t size) {
    AppendBytes(/*field_id=*/1, data, size);
  }
  void set_int_val(int32_t value) {
    AppendVarInt(/*field_id=*/2, value);
  }
  TestMsg* add_nested() {
    return BeginNestedMessage<TestMsg>(/*field_id=*/3);
  }
};
```

The ProtoZero-generated stubs are append-only. As the `set_*`, `add_*` methods
are invoked, the passed arguments are directly serialized into the target
buffer. This introduces some limitations:

* Readback is not possible: these classes cannot be used as C++ struct
  replacements.

* No error-checking is performed: nothing prevents a non-repeated field from
  being emitted twice in the serialized proto if the caller accidentally calls a
  `set_*()` method twice. Basic type checks are still performed at compile-time
  though.

* Nested fields must be filled in a stack fashion and cannot be written
  interleaved. Once a nested message is started, its fields must be set before
  going back to setting the fields of the parent message. This turns out to not
  be a problem for most tracing use-cases.

The append-only design also has a number of advantages:

* The classes generated by ProtoZero don't add any extra state on top of the
  base class they derive from (`protozero::Message`). They define only inline
  setter methods that call base-class serialization methods. Compilers can
  see through all the inline expansions of these methods.

* As a consequence of that, the binary cost of ProtoZero is independent of the
  number of protobuf messages defined and their fields, and depends only on the
  number of `set_*`/`add_*` calls. The binary cost of unused proto messages and
  fields has anecdotally been a big issue with libprotobuf.

* The serialization methods don't involve any copy or dynamic allocation. The
  inline expansion calls directly into the corresponding `AppendVarInt()` /
  `AppendString()` methods of `protozero::Message`.

* This allows trace events to be serialized directly into the
  [tracing shared memory buffers](/docs/concepts/buffers.md), even if they are
  not contiguous.
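The append-only idea can be sketched in a few lines. The classes below are a
simplified illustration, not the real `protozero::Message`: `MiniMessage` and
`MiniTestMsg` are hypothetical names, the buffer is a single contiguous region
supplied by the caller, and running out of space simply trips an assertion.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical, simplified stand-in for protozero::Message. It holds no
// per-field state, only a cursor into a caller-provided buffer.
class MiniMessage {
 public:
  MiniMessage(uint8_t* begin, uint8_t* end)
      : begin_(begin), cur_(begin), end_(end) {}

  void AppendVarInt(uint32_t field_id, uint64_t value) {
    WriteRawVarInt((field_id << 3) | 0);  // Wire type 0: varint.
    WriteRawVarInt(value);
  }

  void AppendBytes(uint32_t field_id, const void* data, size_t size) {
    WriteRawVarInt((field_id << 3) | 2);  // Wire type 2: length-delimited.
    WriteRawVarInt(size);
    assert(size <= static_cast<size_t>(end_ - cur_));
    std::memcpy(cur_, data, size);
    cur_ += size;
  }

  size_t written() const { return static_cast<size_t>(cur_ - begin_); }

 private:
  void WriteRawVarInt(uint64_t value) {
    do {
      assert(cur_ < end_);  // This sketch has no slow-path, it only asserts.
      const uint8_t low_bits = static_cast<uint8_t>(value & 0x7f);
      value >>= 7;
      *cur_++ = static_cast<uint8_t>(low_bits | (value ? 0x80 : 0));
    } while (value);
  }

  uint8_t* begin_;
  uint8_t* cur_;
  uint8_t* end_;
};

// Mimics a generated stub: no data members, only inline forwarding setters.
class MiniTestMsg : public MiniMessage {
 public:
  using MiniMessage::MiniMessage;
  void set_str_val(const char* data, size_t size) {
    AppendBytes(/*field_id=*/1, data, size);
  }
  void set_int_val(int32_t value) {
    AppendVarInt(/*field_id=*/2,
                 static_cast<uint64_t>(static_cast<int64_t>(value)));
  }
};
```

Because `MiniTestMsg` adds no members, it has the same size as its base class,
and each setter call writes its bytes immediately, in call order.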

### Scattered buffer writing

A key part of the ProtoZero design is supporting direct serialization on
non-globally-contiguous sequences of contiguous memory regions.

This happens by decoupling `protozero::Message`, the base class for all the
generated classes, from the `protozero::ScatteredStreamWriter`.
The problem it solves is the following: ProtoZero is based on direct
serialization into shared memory buffer chunks. These chunks are 4KB - 32KB in
most cases. At the same time, the chunk size does not limit how much data the
caller can write into an individual message: a trace event can be up to
256 MiB big.

![ProtoZero scattered buffers diagram](/docs/images/protozero-ssw.png)

#### Fast-path

At all times the underlying `ScatteredStreamWriter` knows the bounds
of the current buffer. All write operations are bounds-checked and hit a
slow-path when crossing the buffer boundary.

Most write operations can be completed within the current buffer boundaries.
In that case, the cost of a `set_*` operation is in essence a `memcpy()` with
the extra overhead of var-int encoding for protobuf preambles and
length-delimited fields.
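As a rough illustration of what the fast-path has to do, the sketch below
computes a field preamble and performs a bounds-checked copy. The wire type
values come from the protobuf encoding. The helper names (`MakeTag()`,
`VarIntSize()`, `TryWriteFast()`) are hypothetical and are not part of the
ProtoZero API.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Wire types from the protobuf encoding (only the two that TestMsg needs).
enum WireType : uint32_t { kVarInt = 0, kLengthDelimited = 2 };

// Hypothetical helper: the field preamble packs field id and wire type.
constexpr uint32_t MakeTag(uint32_t field_id, WireType type) {
  return (field_id << 3) | static_cast<uint32_t>(type);
}

// Hypothetical helper: number of bytes taken by the varint encoding of `value`.
constexpr size_t VarIntSize(uint64_t value) {
  return value < 0x80 ? 1 : 1 + VarIntSize(value >> 7);
}

// Hypothetical fast-path: one bounds check, then a plain memcpy().
// Returns false without writing if the caller must take the slow-path.
inline bool TryWriteFast(uint8_t** cur,
                         const uint8_t* end,
                         const void* src,
                         size_t size) {
  assert(cur != nullptr && *cur <= end);
  if (size > static_cast<size_t>(end - *cur))
    return false;
  std::memcpy(*cur, src, size);
  *cur += size;
  return true;
}
```

For the three `TestMsg` fields the preambles are `0x0a` (`str_val`), `0x10`
(`int_val`) and `0x1a` (`nested`), each of which fits in a single byte.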

#### Slow-path

When crossing the boundary, the slow-path asks the
`ScatteredStreamWriter::Delegate` for a new buffer. The implementation of
`GetNewBuffer()` is up to the client. In tracing use-cases, that call will
acquire a new thread-local chunk from the tracing shared memory buffer.

Other, heap-based implementations are possible. For instance, the ProtoZero
sources provide a helper class `HeapBuffered<TestMsg>`, mainly used in tests (see
[scattered_heap_buffer.h](/include/perfetto/protozero/scattered_heap_buffer.h)),
which allocates a new heap buffer when crossing the boundaries of the current
one.

Consider the following example:

```c++
TestMsg outer_msg;
for (int i = 0; i < 1000; i++) {
  TestMsg* nested = outer_msg.add_nested();
  nested->set_int_val(42);
}
```

At some point one of the `set_int_val()` calls will hit the slow-path and
acquire a new buffer. The overall idea is having a serialization mechanism
that is extremely lightweight most of the time and that requires some extra
function calls when crossing a buffer boundary, so that their cost gets
amortized across all trace events.

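The split between the two paths can be sketched with a toy writer and a
heap-backed delegate. This is an illustration under simplifying assumptions,
not the real `ScatteredStreamWriter`: `ChunkDelegate`, `HeapChunks` and
`MiniScatteredWriter` are hypothetical names, and the writer only copies raw
bytes.

```c++
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <vector>

// Hypothetical, simplified analogue of ScatteredStreamWriter::Delegate.
class ChunkDelegate {
 public:
  struct Range {
    uint8_t* begin;
    uint8_t* end;
  };
  virtual ~ChunkDelegate() = default;
  virtual Range GetNewBuffer() = 0;
};

// Heap-backed delegate: every call allocates one more fixed-size chunk.
class HeapChunks : public ChunkDelegate {
 public:
  explicit HeapChunks(size_t chunk_size) : chunk_size_(chunk_size) {
    assert(chunk_size_ > 0);
  }
  Range GetNewBuffer() override {
    chunks_.emplace_back(new uint8_t[chunk_size_]);
    uint8_t* begin = chunks_.back().get();
    return Range{begin, begin + chunk_size_};
  }
  size_t num_chunks() const { return chunks_.size(); }
  const uint8_t* chunk(size_t i) const { return chunks_[i].get(); }

 private:
  size_t chunk_size_;
  std::vector<std::unique_ptr<uint8_t[]>> chunks_;
};

// Hypothetical, simplified analogue of ScatteredStreamWriter.
class MiniScatteredWriter {
 public:
  explicit MiniScatteredWriter(ChunkDelegate* delegate) : delegate_(delegate) {
    NextChunk();
  }

  void WriteBytes(const uint8_t* src, size_t size) {
    if (size <= static_cast<size_t>(end_ - cur_)) {  // Fast-path.
      std::memcpy(cur_, src, size);
      cur_ += size;
      return;
    }
    slow_path_hits_++;
    while (size > 0) {  // Slow-path: spill across as many chunks as needed.
      if (cur_ == end_)
        NextChunk();
      const size_t n = std::min(size, static_cast<size_t>(end_ - cur_));
      std::memcpy(cur_, src, n);
      cur_ += n;
      src += n;
      size -= n;
    }
  }

  size_t slow_path_hits() const { return slow_path_hits_; }

 private:
  void NextChunk() {
    const ChunkDelegate::Range range = delegate_->GetNewBuffer();
    cur_ = range.begin;
    end_ = range.end;
  }

  ChunkDelegate* delegate_;
  uint8_t* cur_ = nullptr;
  uint8_t* end_ = nullptr;
  size_t slow_path_hits_ = 0;
};
```

Only writes that do not fit in the current chunk pay for the virtual call and
the loop; everything else is a bounds check followed by a `memcpy()`.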

In the context of the overall Perfetto tracing use case, the slow-path involves
grabbing a process-local mutex and finding the next free chunk in the shared
memory buffer. Hence writes are lock-free as long as they happen within the
thread-local chunk and require a critical section to acquire a new chunk once
every 4KB-32KB (depending on the trace configuration).

The assumption is that the likelihood that two threads will cross the chunk
boundary and call `GetNewBuffer()` at the same time is extremely low and hence
the critical section is un-contended most of the time.
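The locking pattern described above can be sketched as follows. This is a toy
model rather than Perfetto's shared memory arbiter: `ChunkPool` and
`ThreadWriter` are hypothetical names, the "shared memory buffer" is a plain
`std::vector`, and chunks are never returned to the pool.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the shared memory buffer: fixed-size chunks
// handed out under a process-local mutex.
class ChunkPool {
 public:
  ChunkPool(size_t chunk_size, size_t num_chunks)
      : chunk_size_(chunk_size), storage_(chunk_size * num_chunks) {
    assert(chunk_size_ > 0);
  }

  // Slow-path only. Returns the next free chunk, or nullptr when exhausted.
  uint8_t* AcquireChunk() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (next_offset_ + chunk_size_ > storage_.size())
      return nullptr;
    uint8_t* chunk = storage_.data() + next_offset_;
    next_offset_ += chunk_size_;
    chunks_acquired_++;
    return chunk;
  }

  size_t chunk_size() const { return chunk_size_; }

  // Number of chunks handed out; each one cost exactly one lock acquisition.
  size_t chunks_acquired() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return chunks_acquired_;
  }

 private:
  const size_t chunk_size_;
  std::vector<uint8_t> storage_;
  size_t next_offset_ = 0;
  size_t chunks_acquired_ = 0;
  mutable std::mutex mutex_;
};

// Hypothetical per-thread writer: one instance is owned by a single thread.
class ThreadWriter {
 public:
  explicit ThreadWriter(ChunkPool* pool) : pool_(pool) {}

  // Returns false if the pool has no chunks left.
  bool WriteByte(uint8_t byte) {
    if (cur_ == end_) {  // Chunk boundary: the only place that locks.
      uint8_t* chunk = pool_->AcquireChunk();
      if (chunk == nullptr)
        return false;
      cur_ = chunk;
      end_ = chunk + pool_->chunk_size();
    }
    *cur_++ = byte;  // No lock: the current chunk belongs to this writer only.
    return true;
  }

 private:
  ChunkPool* pool_;
  uint8_t* cur_ = nullptr;
  uint8_t* end_ = nullptr;
};
```

With 4KB chunks, writing 10000 bytes through one `ThreadWriter` takes the
mutex three times in this model; all other writes touch only thread-owned
memory.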

```mermaid
sequenceDiagram
  participant C as Call site
  participant M as Message
  participant SSR as ScatteredStreamWriter
  participant DEL as Buffer Delegate
  C->>M: set_int_val(...)
  activate C
  M->>SSR: AppendVarInt(...)
  deactivate C
  Note over C,SSR: A typical write on the fast-path

  C->>M: set_str_val(...)
  activate C
  M->>SSR: AppendString(...)
  SSR->>DEL: GetNewBuffer(...)
  deactivate C
  Note over C,DEL: A write on the slow-path when crossing 4KB - 32KB chunks.
```

### Deferred patching

Nested messages in the protobuf binary encoding are prefixed with their
varint-encoded size.

Consider the following:

```c++
TestMsg* nested = outer_msg.add_nested();
nested->set_int_val(42);
nested->set_str_val("foo");
```

The canonical encoding of this protobuf message, using libprotobuf, would be:

```bash
1a 07 0a 03 66 6f 6f 10 2a
^-+-^ ^-----+------^ ^-+-^
  |         |          |
  |         |          +--> Field ID: 2 [int_val], value = 42.
  |         |
  |         +------> Field ID: 1 [str_val], len = 3, value = "foo" (66 6f 6f).
  |
  +------> Field ID: 3 [nested], length: 7  # !!!
```

The second byte in this sequence (07) is problematic for direct encoding. At the
point where `outer_msg.add_nested()` is called, we can't possibly know upfront
what the overall size of the nested message will be (in this case, 5 + 2 = 7).

The way we get around this in ProtoZero is by reserving four bytes for the
_size_ of each nested message and back-filling them once the message is
finalized (or when we try to set a field in one of the parent messages).
We do this by encoding the size of the message using redundant varint encoding,
in this case: `87 80 80 00` instead of `07`.

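A redundant varint is an ordinary varint padded with continuation bytes so that
it always occupies the same number of bytes. The sketch below shows one way to
produce and read back the four-byte form; `WriteRedundantVarInt()` and
`ReadVarInt()` are hypothetical helper names, not ProtoZero functions.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr size_t kSizeFieldBytes = 4;  // The four reserved bytes.

// Writes `value` as a varint padded to exactly four bytes. The first three
// bytes always carry the continuation bit (0x80), the last one never does.
inline void WriteRedundantVarInt(uint32_t value, uint8_t* dst) {
  assert(value < (1u << 28));  // 4 bytes x 7 payload bits each.
  for (size_t i = 0; i < kSizeFieldBytes; ++i) {
    const bool last = (i == kSizeFieldBytes - 1);
    dst[i] = static_cast<uint8_t>((value & 0x7f) | (last ? 0 : 0x80));
    value >>= 7;
  }
}

// A plain varint decoding loop. It reads the padded form back unchanged,
// because the padding bytes only contribute zero bits.
inline uint64_t ReadVarInt(const uint8_t* src, size_t* bytes_read) {
  assert(src != nullptr && bytes_read != nullptr);
  uint64_t value = 0;
  size_t i = 0;
  for (uint32_t shift = 0; shift < 64; shift += 7) {
    const uint8_t byte = src[i++];
    value |= static_cast<uint64_t>(byte & 0x7f) << shift;
    if (!(byte & 0x80))
      break;
  }
  *bytes_read = i;
  return value;
}
```

Four bytes carry 28 payload bits, so this form can represent sizes below
2^28 bytes, i.e. 256 MiB.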

At the C++ level, the `protozero::Message` class holds a pointer to its `size`
field, which typically points to the beginning of the message, where the four
bytes are reserved, and back-fills it in the `Message::Finalize()` pass.

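The reserve-then-back-fill flow can be sketched on a single contiguous buffer.
`BackfillWriter` is a hypothetical class written for this illustration. Unlike
the real `Message`, it remembers an offset rather than a pointer, because its
`std::vector` storage may reallocate, and it only supports field ids below 16
so that the preamble fits in one byte.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <vector>

// Hypothetical single-buffer writer illustrating reserve-then-back-fill.
class BackfillWriter {
 public:
  // Emits the preamble of a nested field and reserves 4 bytes for its size.
  // Returns the offset of the reserved size field.
  size_t BeginNested(uint32_t field_id) {
    assert(field_id < 16);  // Keeps the preamble to a single byte.
    buf_.push_back(static_cast<uint8_t>((field_id << 3) | 2));
    const size_t size_field_offset = buf_.size();
    buf_.resize(buf_.size() + 4);  // Zeroed placeholder, patched later.
    return size_field_offset;
  }

  // Appends already-encoded bytes of the nested message's fields.
  void AppendRaw(std::initializer_list<uint8_t> bytes) {
    buf_.insert(buf_.end(), bytes);
  }

  // Back-fills the reserved bytes with the padded varint of the payload size.
  void EndNested(size_t size_field_offset) {
    assert(size_field_offset + 4 <= buf_.size());
    size_t payload = buf_.size() - (size_field_offset + 4);
    assert(payload < (1u << 28));
    for (size_t i = 0; i < 4; ++i) {
      buf_[size_field_offset + i] =
          static_cast<uint8_t>((payload & 0x7f) | (i < 3 ? 0x80 : 0));
      payload >>= 7;
    }
  }

  const std::vector<uint8_t>& bytes() const { return buf_; }

 private:
  std::vector<uint8_t> buf_;
};
```

Writing `int_val = 42` and then `str_val = "foo"` into a nested field 3 with
this sketch yields `1a 87 80 80 00 10 2a 0a 03 66 6f 6f`: the fields appear in
call order and the size occupies four bytes instead of one.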

This works fine for cases where the entire message lies in one contiguous buffer
but opens a further challenge: a message can be several MBs big. Looking at this
from the overall tracing perspective, the shared memory buffer chunk that holds
the beginning of a message can be long gone (i.e. committed in the central
service buffer) by the time we get to the end.

In order to support this use case, at the tracing code level (outside of
ProtoZero), when a message crosses the buffer boundary, its `size` field gets
redirected to a temporary patch buffer
(see [patch_list.h](/src/tracing/core/patch_list.h)). This patch buffer is then
sent out-of-band, piggybacking over the next commit IPC (see
[Tracing Protocol ABI](/docs/design-docs/api-and-abi.md#tracing-protocol-abi)).
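Conceptually, a patch says "in chunk X, at offset Y, the four size bytes should
become Z". The sketch below models that idea only. It does not reproduce the
contents of `patch_list.h` or the IPC format: `SizePatch`, `CommittedChunks`
and `ApplyPatches()` are hypothetical names, and committed chunks are modelled
as plain byte vectors keyed by an id.

```c++
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical patch record: where the 4 size bytes live and what they become.
struct SizePatch {
  uint32_t chunk_id;
  size_t offset;
  std::array<uint8_t, 4> size_bytes;
};

// Hypothetical receiver-side view: chunks already committed, keyed by id.
using CommittedChunks = std::map<uint32_t, std::vector<uint8_t>>;

// Applies patches that arrived out-of-band. Returns how many were applied.
// Patches for unknown chunks or out-of-range offsets are skipped.
inline size_t ApplyPatches(const std::vector<SizePatch>& patches,
                           CommittedChunks* chunks) {
  assert(chunks != nullptr);
  size_t applied = 0;
  for (const SizePatch& patch : patches) {
    auto it = chunks->find(patch.chunk_id);
    if (it == chunks->end())
      continue;
    std::vector<uint8_t>& data = it->second;
    if (data.size() < 4 || patch.offset > data.size() - 4)
      continue;
    std::copy(patch.size_bytes.begin(), patch.size_bytes.end(),
              data.begin() + static_cast<std::ptrdiff_t>(patch.offset));
    applied++;
  }
  return applied;
}
```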

### Performance characteristics

NOTE: For the full code of the benchmark see
      `/src/protozero/test/protozero_benchmark.cc`.

We consider two scenarios: writing a simple event and a nested event.

#### Simple event

Consists of filling a flat proto message with 4 integers (2 x 32-bit,
2 x 64-bit) and a 32-byte string, as follows:

```c++
template <typename T>
void FillMessage_Simple(T* msg) {
  msg->set_field_int32(...);
  msg->set_field_uint32(...);
  msg->set_field_int64(...);
  msg->set_field_uint64(...);
  msg->set_field_string(...);
}
```

#### Nested event

Consists of filling a similar message which is recursively nested 3 levels deep:

```c++
template <typename T>
void FillMessage_Nested(T* msg, int depth = 0) {
  FillMessage_Simple(msg);
  if (depth < 3) {
    auto* child = msg->add_field_nested();
    FillMessage_Nested(child, depth + 1);
  }
}
```

#### Comparison terms

We compare, for the same message type, the performance of ProtoZero,
libprotobuf and a speed-of-light serializer.

The speed-of-light serializer is a very simple C++ class that just appends
data into a linear buffer making all sorts of favourable assumptions. It does
not use any binary-stable encoding, does not perform bounds checking, assumes
all writes are 64-bit aligned, and doesn't deal with thread-safety.

```c++
struct SOLMsg {
  template <typename T>
  void Append(T x) {
    // The memcpy will be elided by the compiler, which will emit just a
    // 64-bit aligned mov instruction.
    memcpy(reinterpret_cast<void*>(ptr_), &x, sizeof(x));
    ptr_ += sizeof(x);
  }

  void set_field_int32(int32_t x) { Append(x); }
  void set_field_uint32(uint32_t x) { Append(x); }
  void set_field_int64(int64_t x) { Append(x); }
  void set_field_uint64(uint64_t x) { Append(x); }
  // Note: strcpy() returns its destination, so ptr_ is not advanced here.
  void set_field_string(const char* str) { ptr_ = strcpy(ptr_, str); }

  alignas(uint64_t) char storage_[sizeof(g_fake_input_simple) + 8];
  char* ptr_ = &storage_[0];
};
```

The speed-of-light serializer serves as a reference for _how fast a serializer
could be if argument marshalling and bounds checking were zero cost._

#### Benchmark results

##### Google Pixel 3 - aarch64

```bash
$ cat out/droid_arm64/args.gn
target_os = "android"
is_clang = true
is_debug = false
target_cpu = "arm64"

$ ninja -C out/droid_arm64/ perfetto_benchmarks && \
  adb push --sync out/droid_arm64/perfetto_benchmarks /data/local/tmp/perfetto_benchmarks && \
  adb shell '/data/local/tmp/perfetto_benchmarks --benchmark_filter=BM_Proto*'

------------------------------------------------------------------------
Benchmark                                Time           CPU Iterations
------------------------------------------------------------------------
BM_Protozero_Simple_Libprotobuf        402 ns        398 ns    1732807
BM_Protozero_Simple_Protozero          242 ns        239 ns    2929528
BM_Protozero_Simple_SpeedOfLight       118 ns        117 ns    6101381
BM_Protozero_Nested_Libprotobuf       1810 ns       1800 ns     390468
BM_Protozero_Nested_Protozero          780 ns        773 ns     901369
BM_Protozero_Nested_SpeedOfLight       138 ns        136 ns    5147958
```

##### HP Z920 workstation (Intel Xeon E5-2690 v4) running Linux

```bash
$ cat out/linux_clang_release/args.gn
is_clang = true
is_debug = false

$ ninja -C out/linux_clang_release/ perfetto_benchmarks && \
  out/linux_clang_release/perfetto_benchmarks --benchmark_filter=BM_Proto*

------------------------------------------------------------------------
Benchmark                                Time           CPU Iterations
------------------------------------------------------------------------
BM_Protozero_Simple_Libprotobuf        428 ns        428 ns    1624801
BM_Protozero_Simple_Protozero          261 ns        261 ns    2715544
BM_Protozero_Simple_SpeedOfLight       111 ns        111 ns    6297387
BM_Protozero_Nested_Libprotobuf       1625 ns       1625 ns     436411
BM_Protozero_Nested_Protozero          843 ns        843 ns     849302
BM_Protozero_Nested_SpeedOfLight       140 ns        140 ns    5012910
```
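To compare the three serializers it can help to turn the tables into ratios.
The snippet below is plain arithmetic over the "Time" column of the two tables
above; `BenchRow` and `Ratio()` are hypothetical names introduced for this
illustration and are not part of the benchmark.

```c++
#include <cassert>

// "Time" column, in nanoseconds, copied from the two tables above.
struct BenchRow {
  double libprotobuf_ns;
  double protozero_ns;
  double speed_of_light_ns;
};

const BenchRow kPixel3Simple = {402, 242, 118};
const BenchRow kPixel3Nested = {1810, 780, 138};
const BenchRow kXeonSimple = {428, 261, 111};
const BenchRow kXeonNested = {1625, 843, 140};

// Hypothetical helper: how many times slower `slow_ns` is than `fast_ns`.
inline double Ratio(double slow_ns, double fast_ns) {
  assert(fast_ns > 0);
  return slow_ns / fast_ns;
}
```

On these numbers, ProtoZero comes out roughly 1.6x to 2.3x faster than
libprotobuf, and about 2x to 6x slower than the speed-of-light serializer,
with the larger gaps in both directions showing up in the nested case.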