# Buffers and dataflow

This page describes the dataflow in Perfetto when recording traces. It describes
all the buffering stages, explains how to size the buffers and how to debug
data losses.

## Concepts

Tracing in Perfetto is an asynchronous multiple-writer single-reader pipeline.
In many senses, its architecture is very similar to modern GPUs' command
buffers.

The design principles of the tracing dataflow are:

* The tracing fastpath is based on direct writes into a shared memory buffer.
* Highly optimized for low-overhead writing. NOT optimized for low-latency
  reading.
* Trace data is eventually committed into the central trace buffer by the end
  of the trace or when explicit flush requests are issued via the IPC channel.
* Producers are untrusted and should not be able to see each other's trace data,
  as that would leak sensitive information.

In the general case, there are two types of buffers involved in a trace. When
pulling data from the Linux kernel's ftrace infrastructure, there is a third
stage of buffering (one buffer per CPU) involved:

![Buffers](/docs/images/buffers.png)

#### Tracing service's central buffers

These buffers (yellow, in the picture above) are defined by the user in the
`buffers` section of the [trace config](config.md). In the simplest case,
one tracing session = one buffer, regardless of the number of data sources and
producers.

This is the place where the tracing data is ultimately kept, while in memory,
whether it comes from the kernel ftrace infrastructure, from some other data
source in `traced_probes` or from another userspace process using the
[Perfetto SDK](/docs/instrumentation/tracing-sdk.md).
At the end of the trace (or during, if in [streaming mode]) these buffers are
written into the output trace file.

These buffers can contain a mixture of trace packets coming from different data
sources and even different producer processes. What-goes-where is defined in the
[buffers mapping section](config.md#dynamic-buffer-mapping) of the trace config.
Because of this, the tracing buffers are not shared across processes, to avoid
cross-talk and information leaking across producer processes.

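For illustration only, a config sketch along these lines (sizes and data source
names are arbitrary examples) defines two central buffers and routes each data
source into one of them via `target_buffer`:

```
buffers {
  size_kb: 32768            # Central buffer 0.
  fill_policy: RING_BUFFER
}
buffers {
  size_kb: 4096             # Central buffer 1.
  fill_policy: RING_BUFFER
}
data_sources {
  config {
    name: "linux.ftrace"
    target_buffer: 0
  }
}
data_sources {
  config {
    name: "linux.process_stats"
    target_buffer: 1
  }
}
```
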
#### Shared memory buffers

Each producer process has one memory buffer shared 1:1 with the tracing service
(blue, in the picture above), regardless of the number of data sources it hosts.
This buffer is a temporary staging buffer and has two purposes:

1. Zero-copy on the writer path. This buffer allows direct serialization of the
   tracing data from the writer fastpath in a memory region directly readable by
   the tracing service.

2. Decoupling writes from reads of the tracing service. The tracing service has
   the job of moving trace packets from the shared memory buffer (blue) into the
   central buffer (yellow) as fast as it can.
   The shared memory buffer hides the scheduling and response latencies of the
   tracing service, allowing the producer to keep writing without losing data
   when the tracing service is temporarily blocked.

#### Ftrace buffer

When the `linux.ftrace` data source is enabled, the kernel will have its own
per-CPU buffers. These are unavoidable because the kernel cannot write directly
into user-space buffers. The `traced_probes` process will periodically read
those buffers, convert the data into binary protos and follow the same dataflow
as userspace tracing. These buffers need to be just large enough to hold data
between two ftrace read cycles (`TraceConfig.FtraceConfig.drain_period_ms`).

## Life of a trace packet

Here is a summary to understand the dataflow of trace packets across buffers.
Consider the case of a producer process hosting two data sources writing packets
at different rates, both targeting the same central buffer.

1. When each data source starts writing, it will grab a free page of the shared
   memory buffer and directly serialize proto-encoded tracing data onto it.

2. When a page of the shared memory buffer is filled, the producer will send an
   async IPC to the service, asking it to copy the shared memory page just
   written. Then, the producer will grab the next free page in the shared memory
   buffer and keep writing.

3. When the service receives the IPC, it copies the shared memory page into
   the central buffer and marks the shared memory buffer page as free again. Data
   sources within the producer are able to reuse that page at this point.

4. When the tracing session ends, the service sends a `Flush` request to all
   data sources. In reaction to this, data sources will commit all outstanding
   shared memory pages, even if not completely full. The service copies these
   pages into the service's central buffer.

![Dataflow animation](/docs/images/dataflow.svg)

## Buffer sizing
#### Central buffer sizing

The math for sizing the central buffer is quite straightforward: in the default
case of tracing without `write_into_file` (when the trace file is written only
at the end of the trace), the buffer will hold as much data as has been
written by the various data sources.

The total length of the trace will be `(buffer size) / (aggregated write rate)`.
If all producers write at a combined rate of 2 MB/s, a 16 MB buffer will hold
~8 seconds of tracing data.

The write rate is highly dependent on the data sources configured and on the
activity of the system. 1-2 MB/s is a typical figure on Android traces with
scheduler tracing, but can go up easily by 1+ orders of magnitude if chattier
data sources are enabled (e.g., syscall or pagefault tracing).

When using [streaming mode] the buffer needs to be able to hold enough data
between two `file_write_period_ms` periods (default: 5s).
For instance, if `file_write_period_ms = 5000` and the write data rate is 2 MB/s,
the central buffer needs to be at least 5 * 2 = 10 MB to avoid data losses.

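As a rough sketch matching the example above (the values are illustrative, not a
recommendation), a streaming-mode config could look like:

```
buffers {
  size_kb: 10240            # >= file_write_period_ms * write rate (5 s * 2 MB/s).
  fill_policy: RING_BUFFER
}
write_into_file: true
file_write_period_ms: 5000
```
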
#### Shared memory buffer sizing

The sizing of the shared memory buffer depends on:

* The scheduling characteristics of the underlying system, i.e. for how long the
  tracing service can be blocked on the scheduler queues. This is a function of
  the kernel configuration and nice-ness level of the `traced` process.
* The max write rate of all data sources within a producer process.

Suppose that a producer writes at a max rate of 8 MB/s. If `traced` gets
blocked for 10 ms, the shared memory buffer needs to be at least
8 MB/s * 0.01 s = 80 KB to avoid losses.

Empirical measurements suggest that on most Android systems a shared memory
buffer size of 128-512 KB is good enough.

The default shared memory buffer size is 256 KB. When using the Perfetto Client
Library, this value can be tweaked by setting `TracingInitArgs.shmem_size_hint_kb`.

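For example, when using the Client Library, the hint can be raised at
initialization time. This is only a sketch; the 512 KB value and the choice of
backend are arbitrary examples:

```c++
#include "perfetto.h"  // Amalgamated Perfetto SDK header.

void InitPerfetto() {
  perfetto::TracingInitArgs args;
  args.backends = perfetto::kSystemBackend;  // Or kInProcessBackend.
  args.shmem_size_hint_kb = 512;             // Default is 256 KB.
  perfetto::Tracing::Initialize(args);
  // Data sources are registered afterwards, as usual.
}
```
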
WARNING: if a data source writes very large trace packets in a single batch,
either the shared memory buffer needs to be big enough to handle that or
`BufferExhaustedPolicy.kStall` must be employed.

For instance, consider a data source that emits a 2MB screenshot every 10s.
Its (simplified) code would look like:

```c++
for (;;) {
  ScreenshotDataSource::Trace([](ScreenshotDataSource::TraceContext ctx) {
    auto packet = ctx.NewTracePacket();
    // set_bitmap() is illustrative, not a real TracePacket field.
    packet->set_bitmap(Grab2MBScreenshot());
  });
  std::this_thread::sleep_for(std::chrono::seconds(10));
}
```

Its average write rate is 2MB / 10s = 200 KB/s. However, the data source will
create bursts of 2MB back-to-back without yielding; it is limited only by the
tracing serialization overhead. In practice, it will write the 2MB buffer at
O(GB/s). If the shared memory buffer is < 2 MB, the tracing service will be
unlikely to catch up at that rate and data losses will be experienced.

In a case like this, the options are:

* Increase the size of the shared memory buffer in the producer that hosts the
  data source.
* Split the write into chunks spaced by some delay.
* Adopt `BufferExhaustedPolicy::kStall` when defining the data source:

```c++
class ScreenshotDataSource : public perfetto::DataSource<ScreenshotDataSource> {
 public:
  constexpr static BufferExhaustedPolicy kBufferExhaustedPolicy =
      BufferExhaustedPolicy::kStall;
 ...
};
```

## Debugging data losses
#### Ftrace kernel buffer losses

When using the Linux kernel ftrace data source, losses can occur in the
kernel -> userspace path if the `traced_probes` process gets blocked for too
long.

At the trace proto level, losses in this path are recorded:

* In the [`FtraceCpuStats`][FtraceCpuStats] messages, emitted both at the
  beginning and end of the trace. If the `overrun` field is non-zero, data has
  been lost.
* In the [`FtraceEventBundle.lost_events`][FtraceEventBundle] field. This makes
  it possible to locate precisely the point where the data loss happened.

At the TraceProcessor SQL level, this data is available in the `stats` table:

```sql
> select * from stats where name like 'ftrace_cpu_overrun_end'
name                 idx                  severity             source value
-------------------- -------------------- -------------------- ------ ------
ftrace_cpu_overrun_e 0                    data_loss            trace  0
ftrace_cpu_overrun_e 1                    data_loss            trace  0
ftrace_cpu_overrun_e 2                    data_loss            trace  0
ftrace_cpu_overrun_e 3                    data_loss            trace  0
ftrace_cpu_overrun_e 4                    data_loss            trace  0
ftrace_cpu_overrun_e 5                    data_loss            trace  0
ftrace_cpu_overrun_e 6                    data_loss            trace  0
ftrace_cpu_overrun_e 7                    data_loss            trace  0
```

These losses can be mitigated either by increasing
[`TraceConfig.FtraceConfig.buffer_size_kb`][FtraceConfig]
or by decreasing
[`TraceConfig.FtraceConfig.drain_period_ms`][FtraceConfig].

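For illustration (the sizes and periods here are arbitrary examples, not
recommended values), these knobs live in the ftrace data source config:

```
data_sources {
  config {
    name: "linux.ftrace"
    ftrace_config {
      buffer_size_kb: 16384   # Per-CPU kernel buffer size.
      drain_period_ms: 250
      ftrace_events: "sched/sched_switch"
    }
  }
}
```
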
#### Shared memory losses

Tracing data can be lost in the shared memory buffer due to bursts while
`traced` is blocked.

At the trace proto level, losses in this path are recorded:

* In [`TraceStats.BufferStats.trace_writer_packet_loss`][BufferStats].
* In [`TracePacket.previous_packet_dropped`][TracePacket].
  Caveat: the very first packet emitted by every data source is also marked as
  `previous_packet_dropped=true`. This is because the service has no way to
  tell if that was truly the first packet or everything else before that was
  lost.

At the TraceProcessor SQL level, this data is available in the `stats` table:

```sql
> select * from stats where name = 'traced_buf_trace_writer_packet_loss'
name                 idx                  severity             source    value
-------------------- -------------------- -------------------- --------- -----
traced_buf_trace_wri 0                    data_loss            trace     0
```

#### Central buffer losses

Data losses in the central buffer can happen for two different reasons:

1. When using `fill_policy: RING_BUFFER`, older tracing data is overwritten by
   virtue of wrapping in the ring buffer.
   These losses are recorded, at the trace proto level, in
   [`TraceStats.BufferStats.chunks_overwritten`][BufferStats].

2. When using `fill_policy: DISCARD`, newer tracing data committed after the
   buffer is full is dropped.
   These losses are recorded, at the trace proto level, in
   [`TraceStats.BufferStats.chunks_discarded`][BufferStats].

At the TraceProcessor SQL level, this data is available in the `stats` table,
one entry per central buffer:

```sql
> select * from stats where name = 'traced_buf_chunks_overwritten' or name = 'traced_buf_chunks_discarded'
name                 idx                  severity             source  value
-------------------- -------------------- -------------------- ------- -----
traced_buf_chunks_di 0                    info                 trace   0
traced_buf_chunks_ov 0                    data_loss            trace   0
```

Summary: the best way to detect and debug data losses is to use Trace Processor
and issue the query:
`select * from stats where severity = 'data_loss' and value != 0`

## Atomicity and ordering guarantees

A "writer sequence" is the sequence of trace packets emitted by a given
TraceWriter from a data source. In almost all cases 1 data source ==
1+ TraceWriter(s). Some data sources that support writing from multiple threads
typically create one TraceWriter per thread.

* Trace packets written from a sequence are emitted in the trace file in the
  same order they have been written.

* There is no ordering guarantee between packets written by different sequences.
  Sequences are, by design, concurrent and more than one linearization is
  possible. The service does NOT respect global timestamp ordering across
  different sequences. If two packets from two sequences were emitted in
  global timestamp order, the service can still emit them in the trace file in
  the opposite order.

* Trace packets are atomic. If a trace packet is emitted in the trace file, it
  is guaranteed to contain all the fields that the data source wrote. If a
  trace packet is large and spans across several shared memory buffer pages, the
  service will save it in the trace file only if it can observe that all
  fragments have been committed without gaps.

* If a trace packet is lost (e.g. because of wrapping in the ring buffer
  or losses in the shared memory buffer), no further trace packet will be
  emitted for that sequence, until all the packets before the gap are dropped
  as well. In other words, if the tracing service ends up in a situation where
  it sees packets 1, 2, 5, 6 for a sequence, it will only emit 1, 2. If, however,
  new packets (e.g., 7, 8, 9) are written and they overwrite 1, 2, clearing the
  gap, the full sequence 5, 6, 7, 8, 9 will be emitted.
  This behavior, however, doesn't hold when using [streaming mode] because,
  in that case, the periodic read will consume the packets in the buffer and
  clear the gaps, allowing the sequence to restart.

## Incremental state in trace packets

In many cases trace packets are fully independent of each other and can be
processed and interpreted without further context.
In some cases, however, they can have _incremental state_ and behave similarly
to inter-frame video encoding techniques, where some frames require the keyframe
to be present to be meaningfully decoded.

Here are two concrete examples:

1. Ftrace scheduling slices and /proc/pid scans. ftrace scheduling events are
   keyed by thread id. In most cases users want to map those events back to the
   parent process (the thread-group). To solve this, when both the
   `linux.ftrace` and the `linux.process_stats` data sources are enabled in a
   Perfetto trace, the latter captures process<>thread associations from
   the /proc pseudo-filesystem, whenever a new thread-id is seen by ftrace.
   A typical trace in this case looks as follows:
   ```
   # From process_stats's /proc scanner.
   pid: 610; ppid: 1; cmdline: "/system/bin/surfaceflinger"

   # From ftrace
   timestamp: 95054961131912; sched_wakeup: pid: 610; target_cpu: 2;
   timestamp: 95054977528943; sched_switch: prev_pid: 610 prev_prio: 98
   ```
   The /proc entry is emitted only once per process to avoid bloating the size
   of the trace. In the absence of data losses this is enough to reconstruct
   all scheduling events for that pid. If, however, the process_stats packet
   gets dropped in the ring buffer, there will be no way left to work out the
   process details for all the other ftrace events that refer to that PID.

2. The [Track Event library](/docs/instrumentation/track-events) in the Perfetto
   SDK makes extensive use of string interning. Most strings and descriptors
   (e.g. details about processes / threads) are emitted only once and later
   referred to using a monotonic ID. If the descriptor packet is lost, it is
   not possible to fully make sense of those events.

Trace Processor has a built-in mechanism that detects loss of interning data and
skips ingesting packets that refer to missing interned strings or descriptors.

When using tracing in ring-buffer mode, these types of losses are very likely to
happen.

There are two mitigations for this:

1. Issuing periodic invalidations of the incremental state via
   [`TraceConfig.IncrementalStateConfig.clear_period_ms`][IncrStateConfig].
   This will cause the data sources that make use of incremental state to
   periodically drop the interning / process mapping tables and re-emit the
   descriptors / strings on the next occurrence. This mitigates quite well the
   problem in the context of ring-buffer traces, as long as the
   `clear_period_ms` is one order of magnitude lower than the estimated length
   of trace data in the central trace buffer (see the config sketch after this
   list).

2. Recording the incremental state into a dedicated buffer (via
   `DataSourceConfig.target_buffer`). This technique is quite commonly used in
   the ftrace + process_stats example mentioned before, recording the
   process_stats packets in a dedicated buffer that is less likely to wrap
   (ftrace events are much more frequent than descriptors for new processes).

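As a sketch combining both mitigations (the buffer sizes and periods are
arbitrary example values), a config could periodically clear the incremental
state and route `linux.process_stats` into a small dedicated buffer:

```
buffers {
  size_kb: 65536   # Buffer 0: high-rate ftrace data, likely to wrap.
  fill_policy: RING_BUFFER
}
buffers {
  size_kb: 2048    # Buffer 1: process descriptors, unlikely to wrap.
  fill_policy: RING_BUFFER
}
incremental_state_config {
  clear_period_ms: 10000
}
data_sources {
  config {
    name: "linux.ftrace"
    target_buffer: 0
  }
}
data_sources {
  config {
    name: "linux.process_stats"
    target_buffer: 1
  }
}
```
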
## Flushes and windowed trace importing

Another common problem experienced in traces that involve multiple data sources
is the non-synchronous nature of trace commits. As explained in the
[Life of a trace packet](#life-of-a-trace-packet) section above, trace data is
committed only when a full memory page of the shared memory buffer is filled (or
when the tracing session ends). In most cases, if data sources produce events
at a regular cadence, pages are filled quite quickly and events are committed
into the central buffers within seconds.

In some other cases, however, a data source can emit events only sporadically.
Imagine the case of a data source that emits events when the display is turned
on/off. Such an infrequent event might end up being staged in the shared memory
buffer for a very long time and can end up being committed in the trace buffer
hours after it happened.

Another scenario where this can happen is when using ftrace and when a
particular CPU is idle most of the time or gets hot-unplugged (ftrace uses
per-cpu buffers). In this case a CPU might record little or no data for several
minutes while the other CPUs pump thousands of new trace events per second.

This causes two side effects that end up breaking user expectations or causing
bugs:

* The UI can show an abnormally long timeline with a huge gap in the middle.
  The packet ordering of events doesn't matter for the UI because events are
  sorted by timestamp at import time. The trace in this case will contain very
  recent events plus a handful of stale events that happened hours before. The
  UI, for correctness, will try to display all events, showing a handful of
  early events, followed by a huge temporal gap when nothing happened,
  followed by the stream of recent events.

* When recording long traces, Trace Processor can show import errors of the form
  "XXX event out-of-order". This is because, in order to limit the memory usage
  at import time, Trace Processor sorts events using a sliding window. If trace
  packets are too out-of-order (trace file order vs timestamp order), the
  sorting will fail and some packets will be dropped.

#### Mitigations

The best mitigation for these sorts of problems is to specify a
[`flush_period_ms`][TraceConfig] in the trace config (10-30 seconds is usually
good enough for most cases), especially when recording long traces.

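For instance (a sketch; the period is just an example value), at the top level
of the trace config:

```
flush_period_ms: 30000   # Ask all data sources to flush every 30 s.
```
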
This will cause the tracing service to issue periodic flush requests to data
|
|
sources. A flush requests causes the data source to commit the shared memory
|
|
buffer pages into the central buffer, even if they are not completely full.
|
|
By default, a flush issued only at the end of the trace.
|
|
|
|
In case of long traces recorded without `flush_period_ms`, another option is to
pass the `--full-sort` option to `trace_processor_shell` when importing the
trace. Doing so will disable the windowed sorting at the cost of higher
memory usage (the trace file will be fully buffered in memory before parsing).

[streaming mode]: /docs/concepts/config#long-traces
[TraceConfig]: /docs/reference/trace-config-proto.autogen#TraceConfig
[FtraceConfig]: /docs/reference/trace-config-proto.autogen#FtraceConfig
[IncrStateConfig]: /docs/reference/trace-config-proto.autogen#FtraceConfig.IncrementalStateConfig
[FtraceCpuStats]: /docs/reference/trace-packet-proto.autogen#FtraceCpuStats
[FtraceEventBundle]: /docs/reference/trace-packet-proto.autogen#FtraceEventBundle
[TracePacket]: /docs/reference/trace-packet-proto.autogen#TracePacket
[BufferStats]: /docs/reference/trace-packet-proto.autogen#TraceStats.BufferStats