You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
601 lines
23 KiB
601 lines
23 KiB
AutoFDO and ARM Trace {#AutoFDO}
|
|
=====================
|
|
|
|
@brief Using CoreSight trace and perf with OpenCSD for AutoFDO.
|
|
|
|
## Introduction
|
|
|
|
Feedback directed optimization (FDO, also know as profile guided
|
|
optimization - PGO) uses a profile of a program's execution to guide the
|
|
optmizations performed by the compiler. Traditionally, this involves
|
|
building an instrumented version of the program, which records a profile of
|
|
execution as it runs. The instrumentation adds significant runtime
|
|
overhead, possibly changing the behaviour of the program and it may not be
|
|
possible to run the instrumented program in a production environment
|
|
(e.g. where performance criteria must be met).
|
|
|
|
AutoFDO uses facilities in the hardware to sample the behaviour of the
|
|
program in the production environment and generate the execution profile.
|
|
An improved profile can be obtained by including the branch history
|
|
(i.e. a record of the last branches taken) when generating an instruction
|
|
samples. On Arm systems, the ETM can be used to generate such records.
|
|
|
|
The process can be broken down into the following steps:
|
|
|
|
* Record execution trace of the program
|
|
* Convert the execution trace to instruction samples with branch histories
|
|
* Convert the instruction samples to source level profiles
|
|
* Use the source level profile with the compiler
|
|
|
|
This article describes how to enable ETM trace on Arm targets running Linux
|
|
and use the ETM trace to generate AutoFDO profiles and compile an optimized
|
|
program.
|
|
|
|
|
|
## Execution trace on Arm targets
|
|
|
|
Debug and trace of Arm targets is provided by CoreSight. This consists of
|
|
a set of components that allow access to debug logic, record (trace) the
|
|
execution of a processor and route this data through the system, collecting
|
|
it into a store.
|
|
|
|
To record the execution of a processor, we require the following
|
|
components:
|
|
|
|
* A trace source. The core contains a trace unit, called an ETM that emits
|
|
data describing the instructions executed by the core.
|
|
* Trace links. The trace data generated by the ETM must be moved through
|
|
the system to the component that collects the data (sink). Links
|
|
include:
|
|
* Funnels: merge multiple streams of data
|
|
* FIFOs: buffer data to smooth out bursts
|
|
* Replicators: send a stream of data to multiple components
|
|
* Sinks. These receive the trace data and store it or send it to an
|
|
external device:
|
|
* ETB: A small circular buffer (64-128 kilobytes) that stores the most
|
|
recent data
|
|
* ETR: A larger (several megabytes) buffer that uses system RAM to
|
|
store data
|
|
* TPIU: Sends data to an off-chip capture device (e.g. Arm DSTREAM)
|
|
|
|
Each Arm SoC design may have a different layout (topology) of components.
|
|
This topology is described to the OS drivers by the platform's devicetree
|
|
or (in future) ACPI firmware.
|
|
|
|
For application profiling, we need to store several megabytes of data
|
|
within the system, so will use ETR with the capture tool (perf)
|
|
periodically draining the buffer to a file.
|
|
|
|
Even though we have a large capture buffer, the ETM can still generate a
|
|
lot of data very quickly - typically an ETM will generate ~1 bit of data
|
|
per instruction (depending on the workload), which results in 256Mbytes per
|
|
second for a core running at 2GHz. This leads to problems storing and
|
|
decoding such large volumes of data. AutoFDO uses samples of program
|
|
execution, so we can avoid this problem by using the ETM's features to
|
|
only record small slices of execution - e.g. collect ~5000 cycles of data
|
|
every 50M cycles. This reduces the data rate to a manageable level - a few
|
|
megabytes per minute. This technique is known as 'strobing'.
|
|
|
|
|
|
## Enabling trace
|
|
|
|
### Driver support
|
|
|
|
To collect ETM trace, the CoreSight drivers must be included in the
|
|
kernel. Some of the driver support is not yet included in the mainline
|
|
kernel and many targets are using older kernels. To enable CoreSight trace
|
|
on these targets, Arm have provided backports of the latest CoreSight
|
|
drivers and ETM strobing patch at:
|
|
|
|
<https://gitlab.arm.com/linux-arm/linux-coresight-backports>
|
|
|
|
This repository can be cloned with:
|
|
|
|
```
|
|
git clone https://git.gitlab.arm.com/linux-arm/linux-coresight-backports.git
|
|
```
|
|
|
|
You can include these backports in your kernel by either merging the
|
|
appropriate branch using git or generating patches (using `git
|
|
format-patch`).
|
|
|
|
For 5.x based kernel onwards, the only patch which needs to be applied is the one enabling strobing - etm4x: `Enable strobing of ETM`.
|
|
|
|
For 4.9 based kernels, use the `coresight-4.9-etr-etm_strobe` branch:
|
|
|
|
```
|
|
git merge coresight-4.9-etr-etm_strobe
|
|
```
|
|
|
|
or
|
|
|
|
```
|
|
git format-patch --output-directory /output/dir v4.9..coresight-4.9-etr-etm_strobe
|
|
cd my_kernel
|
|
git am /output/dir/*.patch # or patch -p1 /output/dir/*.patch if not using git
|
|
```
|
|
|
|
For 4.14 based kernels, use the `coresight-4.14-etm_strobe` branch:
|
|
|
|
```
|
|
git merge coresight-4.14-etm_strobe
|
|
```
|
|
|
|
or
|
|
|
|
```
|
|
git format-patch --output-directory /output/dir v4.14..coresight-4.14-etm_strobe
|
|
cd my_kernel
|
|
git am /output/dir/*.patch # or patch -p1 /output/dir/*.patch if not using git
|
|
```
|
|
|
|
The CoreSight trace drivers must also be enabled in the kernel
|
|
configuration. This can be done using the configuration menu (`make
|
|
menuconfig`), selecting `Kernel hacking` / `arm64 Debugging` /`CoreSight Tracing Support` and
|
|
enabling all options, or by setting the following in the configuration
|
|
file:
|
|
|
|
```
|
|
CONFIG_CORESIGHT=y
|
|
CONFIG_CORESIGHT_LINK_AND_SINK_TMC=y
|
|
CONFIG_CORESIGHT_SINK_TPIU=y
|
|
CONFIG_CORESIGHT_SOURCE_ETM4X=y
|
|
CONFIG_CORESIGHT_DYNAMIC_REPLICATOR=y
|
|
CONFIG_CORESIGHT_STM=y
|
|
CONFIG_CORESIGHT_CATU=y
|
|
```
|
|
|
|
Compile the kernel for your target in the usual way, e.g.
|
|
|
|
```
|
|
make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
|
|
```
|
|
|
|
Each target may have a different layout of CoreSight components. To
|
|
collect trace into a sink, the kernel drivers need to know which other
|
|
devices need to be configured to route data from the source to the sink.
|
|
This is described in the devicetree (and in future, the ACPI tables). The
|
|
device tree will define which CoreSight devices are present in the system,
|
|
where they are located and how they are connected together. The devicetree
|
|
for some platforms includes a description of the platform's CoreSight
|
|
components, but in other cases you may have to ask the platform/SoC vendor
|
|
to supply it or create it yourself (see Appendix: Describing CoreSight in
|
|
Devicetree).
|
|
|
|
Once the target has been booted with the devicetree describing the
|
|
CoreSight devices, you should find the devices in sysfs:
|
|
|
|
```
|
|
# ls /sys/bus/coresight/devices/
|
|
etm0 etm2 etm4 etm6 funnel0 funnel2 funnel4 stm0 tmc_etr0
|
|
etm1 etm3 etm5 etm7 funnel1 funnel3 replicator0 tmc_etf0
|
|
```
|
|
|
|
The naming convention for etm devices can be different according to the kernel version you're using.
|
|
For more information about the naming scheme, please check out the [Linux Kernel Documentation](https://www.kernel.org/doc/html/latest/trace/coresight/coresight.html#device-naming-scheme)
|
|
|
|
If `/sys/bus/coresight/devices/` is empty, you may want to check out your Kernel configuration to make sure your .config file is including CoreSight dependencies, such as the clock.
|
|
|
|
### Perf tools
|
|
|
|
The perf tool is used to capture execution trace, configuring the trace
|
|
sources to generate trace, routing the data to the sink and collecting the
|
|
data from the sink.
|
|
|
|
Arm recommends to use the perf version corresponding to the kernel running
|
|
on the target. This can be built from the same kernel sources with
|
|
|
|
```
|
|
make -C tools/perf CORESIGHT=1 VF=1 ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
|
|
```
|
|
|
|
When specifying CORESIGHT=1, perf will be built using the installed OpenCSD library.
|
|
If you are cross compiling, then additional setup is required to ensure the build process links against the correct version of the library.
|
|
|
|
If the post-processing (`perf inject`) of the captured data is not being
|
|
done on the target, then the OpenCSD library is not required for this build
|
|
of perf.
|
|
|
|
Trace is captured by collecting the `cs_etm` event from perf. The sink
|
|
to collect data into is specified as a parameter of this event. Trace can
|
|
also be restricted to user space or kernel space with 'u' or 'k'
|
|
parameters. For example:
|
|
|
|
```
|
|
perf record -e cs_etm/@tmc_etr0/u --per-thread -- /bin/ls
|
|
```
|
|
|
|
Will record the userspace execution of '/bin/ls' using tmc_etr0 as sink.
|
|
|
|
## Capturing modes
|
|
|
|
You can trace a single-threaded program in two different ways:
|
|
|
|
1. By specifying `--per-thread`, and in this case the CoreSight subsystem will
|
|
record only a trace relative to the given program.
|
|
|
|
2. By NOT specifying `--per-thread`, and in this case CPU-wide tracing will
|
|
be enabled. In this scenario the trace will contain both the target program trace
|
|
and other workloads that were executing on the same CPU
|
|
|
|
|
|
|
|
## Processing trace and profiles
|
|
|
|
perf is also used to convert the execution trace an instruction profile.
|
|
This requires a different build of perf, using the version of perf from
|
|
Linux v4.17 or later, as the trace processing code isn't included in the
|
|
driver backports. Trace decode is provided by the OpenCSD library
|
|
(<https://github.com/Linaro/OpenCSD>), v0.9.1 or later. This is packaged
|
|
for debian testing (install the libopencsd0, libopencsd-dev packages) or
|
|
can be compiled from source and installed.
|
|
|
|
The autoFDO tool <https://github.com/google/autofdo> is used to convert the
|
|
instruction profiles to source profiles for the GCC and clang/llvm
|
|
compilers.
|
|
|
|
|
|
## Recording and profiling
|
|
|
|
Once trace collection using perf is working, we can now use it to profile
|
|
an application.
|
|
|
|
The application must be compiled to include sufficient debug information to
|
|
map instructions back to source lines. For GCC, use the `-g1` or `-gmlt`
|
|
options. For clang/llvm, also add the `-fdebug-info-for-profiling` option.
|
|
|
|
perf identifies the active program or library using the build identifier
|
|
stored in the elf file. This should be added at link time with the compiler
|
|
flag `-Wl,--build-id=sha1`.
|
|
|
|
The next step is to record the execution trace of the application using the
|
|
perf tool. The ETM strobing should be configured before running the perf
|
|
tool. There are two parameters:
|
|
|
|
* window size: A number of CPU cycles (W)
|
|
* period: Trace is enabled for W cycle every _period_ * W cycles.
|
|
|
|
For example, a typical configuration is to use a window size of 5000 cycles
|
|
and a period of 10000 - this will collect 5000 cycles of trace every 50M
|
|
cycles. With these proof-of-concept patches, the strobe parameters are
|
|
configured via sysfs - each ETM will have `strobe_window` and
|
|
`strobe_period` parameters in `/sys/bus/coresight/devices/<sink>` and
|
|
these values will have to be written to each (In a future version, this
|
|
will be integrated into the drivers and perf tool).
|
|
The `set_strobing.sh` script in this directory [`<opencsd>/decoder/tests/auto-fdo`] automates this process.
|
|
|
|
To collect trace from an application using ETM strobing, run:
|
|
|
|
```
|
|
sudo ./set_strobing.sh 5000 10000
|
|
perf record -e cs_etm/@tmc_etr0/u --per-thread -- <your app>"
|
|
```
|
|
|
|
The raw trace can be examined using the `perf report` command:
|
|
|
|
```
|
|
perf report -D -i perf.data --stdio
|
|
```
|
|
|
|
Perf needs to be built from your linux kernel version souce code repository against the OpenCSD library in order to be able to properly read ETM-gathered samples and post-process them.
|
|
If running `perf report` produces an error like:
|
|
|
|
```
|
|
0x1f8 [0x268]: failed to process type: 70 [Operation not permitted]
|
|
Error:
|
|
failed to process sample
|
|
```
|
|
or
|
|
|
|
```
|
|
"file uses a more recent and unsupported ABI (8 bytes extra). incompatible file format".
|
|
```
|
|
|
|
You are probably using a perf version which is not using this library: please make sure to install this project in your system by either compiling it from [Source Code]( <https://github.com/Linaro/OpenCSD>) from v0.9.1 or later and compile perf using this library.
|
|
Otherwise, this project is packaged for debian (install the libopencsd0, libopencsd-dev packages).
|
|
|
|
|
|
For example:
|
|
|
|
```
|
|
0x1d370 [0x30]: PERF_RECORD_AUXTRACE size: 0x2003c0 offset: 0 ref: 0x39ba881d145f8639 idx: 0 tid: 4551 cpu: -1
|
|
|
|
. ... CoreSight ETM Trace data: size 2098112 bytes
|
|
Idx:0; ID:12; I_ASYNC : Alignment Synchronisation.
|
|
Idx:12; ID:12; I_TRACE_INFO : Trace Info.; INFO=0x0
|
|
Idx:17; ID:12; I_ADDR_L_64IS0 : Address, Long, 64 bit, IS0.; Addr=0xFFFF000008A4991C;
|
|
Idx:48; ID:14; I_ASYNC : Alignment Synchronisation.
|
|
Idx:60; ID:14; I_TRACE_INFO : Trace Info.; INFO=0x0
|
|
Idx:65; ID:14; I_ADDR_L_64IS0 : Address, Long, 64 bit, IS0.; Addr=0xFFFF000008A4991C;
|
|
Idx:96; ID:14; I_ASYNC : Alignment Synchronisation.
|
|
Idx:108; ID:14; I_TRACE_INFO : Trace Info.; INFO=0x0
|
|
Idx:113; ID:14; I_ADDR_L_64IS0 : Address, Long, 64 bit, IS0.; Addr=0xFFFF000008A4991C;
|
|
Idx:122; ID:14; I_TRACE_ON : Trace On.
|
|
Idx:123; ID:14; I_ADDR_CTXT_L_64IS0 : Address & Context, Long, 64 bit, IS0.; Addr=0x0000000000407B00; Ctxt: AArch64,EL0, NS;
|
|
Idx:134; ID:14; I_ATOM_F3 : Atom format 3.; ENN
|
|
Idx:135; ID:14; I_ATOM_F5 : Atom format 5.; NENEN
|
|
Idx:136; ID:14; I_ATOM_F5 : Atom format 5.; ENENE
|
|
Idx:137; ID:14; I_ATOM_F5 : Atom format 5.; NENEN
|
|
Idx:138; ID:14; I_ATOM_F3 : Atom format 3.; ENN
|
|
Idx:139; ID:14; I_ATOM_F3 : Atom format 3.; NNE
|
|
Idx:140; ID:14; I_ATOM_F1 : Atom format 1.; E
|
|
.....
|
|
```
|
|
|
|
The execution trace is then converted to an instruction profile using
|
|
the perf build with trace decode support. This may be done on a different
|
|
machine than that which collected the trace (e.g. when cross compiling for
|
|
an embedded target). The `perf inject` command
|
|
decodes the execution trace and generates periodic instruction samples,
|
|
with branch histories:
|
|
|
|
!! Careful: if you are using a device different than the one used to collect the profiling data,
|
|
you'll need to run `perf buildid-cache` as described below.
|
|
```
|
|
perf inject -i perf.data -o inj.data --itrace=i100000il
|
|
```
|
|
|
|
The `--itrace` option configures the instruction sample behaviour:
|
|
|
|
* `i100000i` generates an instruction sample every 100000 instructions
|
|
(only instruction count periods are currently supported, future versions
|
|
may support time or cycle count periods)
|
|
* `l` includes the branch histories on each sample
|
|
* `b` generates a sample on each branch (not used here)
|
|
|
|
Perf requires the original program binaries to decode the execution trace.
|
|
If running the `inject` command on a different system than the trace was
|
|
captured on, then the binary and any shared libraries must be added to
|
|
perf's cache with:
|
|
|
|
```
|
|
perf buildid-cache -a /path/to/binary_or_library
|
|
```
|
|
|
|
`perf report` can also be used to show the instruction samples:
|
|
|
|
```
|
|
perf report -D -i inj.data --stdio
|
|
.......
|
|
0x1528 [0x630]: PERF_RECORD_SAMPLE(IP, 0x2): 4551/4551: 0x434b98 period: 3093 addr: 0
|
|
... branch stack: nr:64
|
|
..... 0: 0000000000434b58 -> 0000000000434b68 0 cycles P 0
|
|
..... 1: 0000000000436a88 -> 0000000000434b4c 0 cycles P 0
|
|
..... 2: 0000000000436a64 -> 0000000000436a78 0 cycles P 0
|
|
..... 3: 00000000004369d0 -> 0000000000436a60 0 cycles P 0
|
|
..... 4: 000000000043693c -> 00000000004369cc 0 cycles P 0
|
|
..... 5: 00000000004368a8 -> 0000000000436928 0 cycles P 0
|
|
..... 6: 000000000042d070 -> 00000000004368a8 0 cycles P 0
|
|
..... 7: 000000000042d108 -> 000000000042d070 0 cycles P 0
|
|
.......
|
|
..... 57: 0000000000448ee0 -> 0000000000448f24 0 cycles P 0
|
|
..... 58: 0000000000448ea4 -> 0000000000448ebc 0 cycles P 0
|
|
..... 59: 0000000000448e20 -> 0000000000448e94 0 cycles P 0
|
|
..... 60: 0000000000448da8 -> 0000000000448ddc 0 cycles P 0
|
|
..... 61: 00000000004486f4 -> 0000000000448da8 0 cycles P 0
|
|
..... 62: 00000000004480fc -> 00000000004486d4 0 cycles P 0
|
|
..... 63: 0000000000448658 -> 00000000004480ec 0 cycles P 0
|
|
... thread: program1:4551
|
|
...... dso: /home/root/program1
|
|
.......
|
|
```
|
|
|
|
The instruction samples produced by `perf inject` is then passed to the
|
|
autofdo tool to generate source level profiles for the compiler. For
|
|
clang/LLVM:
|
|
|
|
```
|
|
create_llvm_prof -binary=/path/to/binary -profile=inj.data -out=program.llvmprof
|
|
```
|
|
|
|
And for GCC:
|
|
|
|
```
|
|
create_gcov -binary=/path/to/binary -profile=inj.data -gcov_version=1 -gcov=program.gcov
|
|
```
|
|
|
|
The profiles can be viewed with:
|
|
|
|
```
|
|
llvm-profdata show -sample program.llvmprof
|
|
```
|
|
|
|
Or, for GCC:
|
|
|
|
```
|
|
dump_gcov -gcov_version=1 program.gcov
|
|
```
|
|
|
|
## Using profile in the compiler
|
|
|
|
The profile produced by the above steps can then be passed to the compiler
|
|
to optimize the next build of the program.
|
|
|
|
For GCC, use the `-fauto-profile` option:
|
|
|
|
```
|
|
gcc -O2 -fauto-profile=program.gcov -o program program.c
|
|
```
|
|
|
|
For Clang, use the `-fprofile-sample-use` option:
|
|
|
|
```
|
|
clang -O2 -fprofile-sample-use=program.llvmprof -o program program.c
|
|
```
|
|
|
|
|
|
## Summary
|
|
|
|
The basic commands to run an application and create a compiler profile are:
|
|
|
|
```
|
|
sudo ./set_strobing.sh 5000 10000
|
|
perf record -e cs_etm/@tmc_etr0/u --per-thread -- <your app>"
|
|
perf inject -i perf.data -o inj.data --itrace=i100000il
|
|
create_llvm_prof -binary=/path/to/binary -profile=inj.data -out=program.llvmprof
|
|
clang -O2 -fprofile-sample-use=program.llvmprof -o program program.c
|
|
```
|
|
|
|
Use `create_gcov` for gcc.
|
|
|
|
## High Level Summary for recoding on Arm board and decoding on different host
|
|
|
|
1. (on Arm board)
|
|
|
|
sudo ./set_strobing.sh 5000 10000
|
|
perf record -e cs_etm/@tmc_etr0/u --per-thread -- <your app>.
|
|
If you specify `-N, --no-buildid-cache`, perf will just take care of recording the target binary and nothing will be copied.<br> If you don't specify it, any recorded dynamic library will be copied to ~/.debug in the board.
|
|
|
|
2. (on Arm board) `perf archive` which saves all the found libraries in a tar (internally, it looks into perf.data file and performs a lookup using perf-buildid-list --with-hits)
|
|
3. (on host) `scp` to copy perf.data and the .tar file generated from `perf archive`.
|
|
4. (on host) Run `tar xvf perf_data.tar.bz2 -C ~/.debug` to populate the buildid-cache
|
|
5. (on host) Double check the setup is correct:
|
|
|
|
a. `perf buildid-list -i perf.data` gives you the list of dynamic libraries buildids whose trace has been recorded and saved in perf.data.
|
|
b. `perf buildid-cache --list` lists the dynamic libraries in the buildid cache that will be used by `perf inject`.
|
|
Make sure the output of (a) and (b) overlaps as in buildid value for those binaries you are interested into optimizing with afdo.
|
|
|
|
6. (on host) `perf inject -i perf.data -o inj.data --itrace=i100000il` will check for the dynamic libraries using the buildid inside the buildid-cache and post-process the trace.<br> buildids have to be the same, otherwise it won't be possible to post-process the trace.
|
|
|
|
7. (on host) `create_llvm_prof -binary=/path/to/binary -profile=inj.data -out=program.llvmprof` takes the output from perf-inject and tranforms it into a format that the compiler can read.
|
|
8. (on host) `clang -O2 -fprofile-sample-use=program.llvmprof -o program program.c` to make clang use the produced profile.<br>
|
|
If you are confident enough that your profile is accurate, you can add the `-fprofile-sample-accurate` flag, which will penalize all the callsites without corresponding profile, marking them as cold.
|
|
|
|
If you are using the same host for both building the binary to be traced and re-building it with afdo:
|
|
|
|
1. You won't need to copy back any dynamic libraries from the board (since you already have them), and can use `--no-buildid-cache` when recording
|
|
2. You have to make sure the relevant dynamic libraries to be optimized are present in the buildid-cache.
|
|
|
|
You can easily add a dynamic library manually into the build-id cache by running:
|
|
|
|
`perf buildid-cache --add <path/to/library/or/binary> -vvv`
|
|
|
|
You can easily check what is currently contained in you buildid-cache by running:
|
|
|
|
`perf buildid-cache --list`
|
|
|
|
You can check the buildid of a given binary/dynamic library:
|
|
|
|
`file <path/to/dynamic/library>`
|
|
|
|
## References
|
|
|
|
* AutoFDO tool: <https://github.com/google/autofdo>
|
|
* GCC's wiki on autofdo: <https://gcc.gnu.org/wiki/AutoFDO>, <https://gcc.gnu.org/wiki/AutoFDO/Tutorial>
|
|
* Google paper: <https://ai.google/research/pubs/pub45290>
|
|
* CoreSight kernel docs: Documentation/trace/coresight.txt
|
|
|
|
|
|
## Appendix: Describing CoreSight in Devicetree
|
|
|
|
|
|
Each component has an entry in the device tree that describes its:
|
|
|
|
* type: The `compatible` field defines which driver to use
|
|
* location: A `reg` defines the component's address and size on the bus
|
|
* clocks: The `clocks` and `clock-names` fields state which clock provides
|
|
the `apb_pclk` clock.
|
|
* connections to other components: `port` and `ports` field link the
|
|
component to ports of other components
|
|
|
|
To create the device tree, some information about the platform is required:
|
|
|
|
* The memory address of the CoreSight components. This is the address in
|
|
the CPU's address space where the CPU can access each CoreSight
|
|
component.
|
|
* The connections between the components.
|
|
|
|
This information can be found in the SoC's reference manual or you may need
|
|
to ask the platform/SoC vendor to supply it.
|
|
|
|
An ETMv4 source is declared with a section like this:
|
|
|
|
```
|
|
etm0: etm@22040000 {
|
|
compatible = "arm,coresight-etm4x", "arm,primecell";
|
|
reg = <0 0x22040000 0 0x1000>;
|
|
|
|
cpu = <&A72_0>;
|
|
clocks = <&soc_smc50mhz>;
|
|
clock-names = "apb_pclk";
|
|
port {
|
|
cluster0_etm0_out_port: endpoint {
|
|
remote-endpoint = <&cluster0_funnel_in_port0>;
|
|
};
|
|
};
|
|
};
|
|
```
|
|
|
|
This describes an ETMv4 attached to core A72_0, located at 0x22040000, with
|
|
its output linked to port 0 of a funnel. The funnel is described with:
|
|
|
|
```
|
|
funnel@220c0000 { /* cluster0 funnel */
|
|
compatible = "arm,coresight-funnel", "arm,primecell";
|
|
reg = <0 0x220c0000 0 0x1000>;
|
|
|
|
clocks = <&soc_smc50mhz>;
|
|
clock-names = "apb_pclk";
|
|
power-domains = <&scpi_devpd 0>;
|
|
ports {
|
|
#address-cells = <1>;
|
|
#size-cells = <0>;
|
|
|
|
port@0 {
|
|
reg = <0>;
|
|
cluster0_funnel_out_port: endpoint {
|
|
remote-endpoint = <&main_funnel_in_port0>;
|
|
};
|
|
};
|
|
|
|
port@1 {
|
|
reg = <0>;
|
|
cluster0_funnel_in_port0: endpoint {
|
|
slave-mode;
|
|
remote-endpoint = <&cluster0_etm0_out_port>;
|
|
};
|
|
};
|
|
|
|
port@2 {
|
|
reg = <1>;
|
|
cluster0_funnel_in_port1: endpoint {
|
|
slave-mode;
|
|
remote-endpoint = <&cluster0_etm1_out_port>;
|
|
};
|
|
};
|
|
};
|
|
};
|
|
```
|
|
|
|
This describes a funnel located at 0x220c0000, receiving data from 2 ETMs
|
|
and sending the merged data to another funnel. We continue describing
|
|
components with similar blocks until we reach the sink (an ETR):
|
|
|
|
```
|
|
etr@20070000 {
|
|
compatible = "arm,coresight-tmc", "arm,primecell";
|
|
reg = <0 0x20070000 0 0x1000>;
|
|
iommus = <&smmu_etr 0>;
|
|
|
|
clocks = <&soc_smc50mhz>;
|
|
clock-names = "apb_pclk";
|
|
power-domains = <&scpi_devpd 0>;
|
|
port {
|
|
etr_in_port: endpoint {
|
|
slave-mode;
|
|
remote-endpoint = <&replicator_out_port1>;
|
|
};
|
|
};
|
|
};
|
|
```
|
|
|
|
Full descriptions of the properties of each component can be found in the
|
|
Linux source at Documentation/devicetree/bindings/arm/coresight.txt.
|
|
The Arm Juno platform's devicetree (arch/arm64/boot/dts/arm) provides an example
|
|
description of CoreSight description.
|
|
|
|
Many systems include a TPIU for off-chip trace. While this isn't required
|
|
for self-hosted trace, it should still be included in the devicetree. This
|
|
allows the drivers to access it to ensure it is put into a disabled state,
|
|
otherwise it may limit the trace bandwidth causing data loss.
|