A minimalistic profiler sampling pseudo-stacks
Overview
This directory contains the "ruy profiler". It is a time profiler: it measures where code spends its time.
Unlike most profilers, it does not sample real call stacks. It samples "pseudo-stacks", which are simple data structures constructed from within the program being profiled. Using this profiler therefore requires manually instrumenting code so that it constructs this pseudo-stack information.
Another unusual characteristic of this profiler is that it uses only the C++11 standard library. It does not use any non-portable feature; in particular, it does not rely on signal handlers. The sampling is performed by a dedicated thread, the "profiler thread".
A discussion of pros/cons of this approach is appended below.
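To make the sampling mechanism concrete, here is a minimal sketch of a sampling thread written only with the C++11 standard library. It is not ruy's implementation: SamplingThread, g_current_label, and the single-label "stack" are hypothetical simplifications chosen to keep the example short.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <thread>

namespace sketch {

// Hypothetical one-level "pseudo-stack": the label the worker is currently in.
// A real pseudo-stack would hold several nested labels.
std::atomic<const char*> g_current_label{nullptr};

// Samples g_current_label from a dedicated thread; no signal handlers involved.
class SamplingThread {
 public:
  explicit SamplingThread(std::chrono::microseconds period)
      : period_(period), running_(true), thread_([this] { Loop(); }) {
    assert(period.count() > 0);
  }

  ~SamplingThread() { Stop(); }

  // Stops sampling and returns how many samples landed on each label.
  std::map<std::string, int> Stop() {
    running_.store(false);
    if (thread_.joinable()) thread_.join();
    return counts_;
  }

 private:
  void Loop() {
    do {
      const char* label = g_current_label.load(std::memory_order_acquire);
      ++counts_[label ? label : "(no label)"];
      std::this_thread::sleep_for(period_);
    } while (running_.load());
  }

  // Declaration order matters: thread_ must be initialized last, after the
  // state that Loop() reads.
  std::chrono::microseconds period_;
  std::atomic<bool> running_;
  std::map<std::string, int> counts_;
  std::thread thread_;
};

}  // namespace sketch
```

In this sketch the profiled code stores a label before doing work, and the caller reads the counts after Stop(). Because the thread samples at a fixed period, the time spent under a label is estimated from its share of the samples, not measured exactly.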
How to use this profiler
How to instrument code
An example of instrumented code is given in test_instrumented_library.cc.
Code is instrumented by constructing ScopeLabel objects. These are RAII
helpers, ensuring that the thread pseudo-stack contains the label during their
lifetime. In the most common use case, one would construct such an object at the
start of a function, so that its scope is the function scope and it measures
how much time is spent in this function.
#include "ruy/profiler/instrumentation.h"
...
void SomeFunction() {
  ruy::profiler::ScopeLabel function_label("SomeFunction");
  ... do something ...
}
A ScopeLabel may however have any scope, for instance:
if (some_case) {
  ruy::profiler::ScopeLabel extra_work_label("Some more work");
  ... do some more work ...
}
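The RAII behavior described above can be sketched as follows. This is only an illustration under simplifying assumptions, not the code in instrumentation.h: the names PseudoStack, ThreadStack, and kMaxDepth are hypothetical, and the synchronization that a sampling thread would need in order to read the stack safely is left out.

```cpp
#include <cassert>

namespace sketch {

constexpr int kMaxDepth = 64;  // hypothetical fixed capacity

// Per-thread stack of label pointers. Only pointers are stored, which is why
// the labels must outlive the profile.
struct PseudoStack {
  const char* labels[kMaxDepth] = {};
  int depth = 0;
};

inline PseudoStack& ThreadStack() {
  thread_local PseudoStack stack;
  return stack;
}

// Pushes a label on construction and pops it on destruction, so the label is
// on the pseudo-stack exactly for the lifetime of the object.
class ScopeLabel {
 public:
  explicit ScopeLabel(const char* label) {
    PseudoStack& stack = ThreadStack();
    assert(stack.depth < kMaxDepth);
    stack.labels[stack.depth++] = label;
  }
  ~ScopeLabel() { --ThreadStack().depth; }

  ScopeLabel(const ScopeLabel&) = delete;
  ScopeLabel& operator=(const ScopeLabel&) = delete;
};

}  // namespace sketch
```

Nested scopes then produce nested labels: an object constructed inside another object's scope sits one level deeper on the stack and is popped first.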
The string passed to the ScopeLabel constructor must be just a pointer to a
literal string (a char* pointer). The profiler will assume that these pointers
stay valid until the profile is finalized.
However, that literal string may be a printf format string, and labels may
have up to 4 parameters, of type int. For example:
void SomeFunction(int size) {
  ruy::profiler::ScopeLabel function_label("SomeFunction (size=%d)", size);
  ... do something ...
}
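One plausible benefit of restricting labels to a literal format string plus a few int parameters is that a label can be recorded cheaply and formatted only when the profile is reported. The sketch below illustrates that idea; Label, MakeLabel, and Formatted are hypothetical names, not the profiler's API.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

namespace sketch {

constexpr int kMaxArgs = 4;

// A label as it might be stored on a pseudo-stack: no string copy and no
// formatting, just a pointer and a few ints.
struct Label {
  const char* format = nullptr;
  int args[kMaxArgs] = {0, 0, 0, 0};
};

template <typename... Args>
Label MakeLabel(const char* format, Args... args) {
  static_assert(sizeof...(Args) <= kMaxArgs, "at most 4 int parameters");
  assert(format != nullptr);
  Label label;
  label.format = format;
  // The leading 0 keeps the array non-empty when no parameters are passed.
  const int values[] = {0, static_cast<int>(args)...};
  for (int i = 0; i < static_cast<int>(sizeof...(Args)); ++i) {
    label.args[i] = values[i + 1];
  }
  return label;
}

// Formats the label; meant to run at report time, off the hot path.
inline std::string Formatted(const Label& label) {
  char buffer[256];
  // printf-style functions ignore arguments beyond those the format consumes.
  std::snprintf(buffer, sizeof(buffer), label.format, label.args[0],
                label.args[1], label.args[2], label.args[3]);
  return std::string(buffer);
}

}  // namespace sketch
```

With this layout, constructing a label costs a pointer copy and a few integer copies, while the snprintf call is deferred until a human-readable string is actually needed.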
How to run the profiler
Profiling instrumentation is a no-op unless the preprocessor token RUY_PROFILER
is defined, so defining it is the first step when actually profiling. When
building with Bazel, the preferred way to enable that is to pass this flag on
the Bazel command line:
--define=ruy_profiler=true
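A common way to make instrumentation vanish when a token is undefined is to provide two definitions of the same class and select one with the preprocessor. The sketch below shows that pattern with a hypothetical token, SKETCH_PROFILER, and a hypothetical kProfilerEnabled flag; it is not a copy of instrumentation.h.

```cpp
#include <cassert>

namespace sketch {

#ifdef SKETCH_PROFILER  // hypothetical token, standing in for RUY_PROFILER

constexpr bool kProfilerEnabled = true;

// Enabled variant: keeps the format pointer so that it could be pushed on a
// pseudo-stack (the stack itself is omitted here).
class ScopeLabel {
 public:
  template <typename... Args>
  explicit ScopeLabel(const char* format, Args...) : format_(format) {
    assert(format != nullptr);
  }
  const char* format() const { return format_; }

 private:
  const char* format_;
};

#else

constexpr bool kProfilerEnabled = false;

// Disabled variant: same constructor shape, empty body, no data members, so
// an optimizing compiler can drop it entirely.
class ScopeLabel {
 public:
  template <typename... Args>
  explicit ScopeLabel(Args...) {}
};

#endif

// Call sites compile unchanged whichever variant is selected.
inline int SumUpTo(int n) {
  ScopeLabel label("SumUpTo (n=%d)", n);
  int sum = 0;
  for (int i = 1; i <= n; ++i) sum += i;
  return sum;
}

}  // namespace sketch
```

Compiling this sketch with -DSKETCH_PROFILER would select the enabled variant. How the Bazel --define flag is translated into the RUY_PROFILER token is determined by the build files and is not shown here.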
To actually profile a code scope, it is enough to construct a ScopeProfile
object, also a RAII helper. It will start the profiler on construction, and on
destruction it will terminate the profiler and report the profile treeview on
standard output by default. Example:
void SomeProfiledBenchmark() {
  ruy::profiler::ScopeProfile profile;
  CallSomeInstrumentedCode();
}
An example is provided by the :test target in the present directory. Run it
with --define=ruy_profiler=true as explained above:
bazel run -c opt \
  --define=ruy_profiler=true \
  //tensorflow/lite/experimental/ruy/profiler:test
The default behavior of dumping the treeview on standard output may be
overridden by passing a pointer to a TreeView object to the ScopeProfile
constructor. This causes the treeview to be stored in that TreeView object,
where it may be accessed and manipulated using the functions declared in
treeview.h. The aforementioned :test provides examples for doing so.
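For intuition about what a treeview holds, the sketch below aggregates sampled pseudo-stacks into a tree of labels with sample counts. It is a generic illustration with hypothetical names (Node, AddSample, Print) and does not reproduce the TreeView class or the functions declared in treeview.h.

```cpp
#include <cassert>
#include <cstdio>
#include <map>
#include <memory>
#include <string>
#include <vector>

namespace sketch {

// One node per distinct label at a given position in the pseudo-stack.
struct Node {
  int weight = 0;  // number of samples whose stack passes through this node
  std::map<std::string, std::unique_ptr<Node>> children;
};

// Adds one sampled pseudo-stack (outermost label first) to the tree.
inline void AddSample(Node* root, const std::vector<std::string>& stack) {
  assert(root != nullptr);
  Node* node = root;
  ++node->weight;
  for (const std::string& label : stack) {
    std::unique_ptr<Node>& child = node->children[label];
    if (!child) child.reset(new Node);
    node = child.get();
    ++node->weight;
  }
}

// Prints each label with its share of all samples, indented by depth.
inline void Print(const Node& node, int total, int indent = 0) {
  assert(total > 0);
  for (const auto& entry : node.children) {
    std::printf("%*s%.1f%% %s\n", indent, "",
                100.0 * entry.second->weight / total, entry.first.c_str());
    Print(*entry.second, total, indent + 2);
  }
}

}  // namespace sketch
```

Calling Print(root, root.weight) after adding samples would list every label with the percentage of samples taken while it was on the stack, which is the kind of breakdown a treeview reports.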
Advantages and disadvantages
Compared to a traditional profiler, e.g. Linux's "perf", the present kind of profiler has the following disadvantages:
- Requires manual instrumentation of code being profiled.
- Substantial overhead, modifying the performance characteristics of the code being measured.
- Questionable accuracy.
But also the following advantages:
- Profiling can be driven from within a benchmark program, allowing the entire profiling procedure to be a single command line.
- Not relying on symbol information removes exposure to toolchain details and means less hassle in some build environments, especially embedded/mobile (single command line to run and profile, no symbol files required).
- Fully portable (all of this is standard C++11).
- Fully testable (see :test). Profiling becomes just another feature of the code like any other.
- Customized instrumentation can result in easier-to-read treeviews (only relevant functions, and custom labels may be more readable than function names).
- Parametrized/formatted labels make possible analyses that call-stack-sampling profilers cannot offer. For example, a profile in which much time is spent in matrix multiplications can be broken down by the matrix multiplication shapes involved.
The philosophy underlying this profiler is that software performance depends on software engineers profiling often, and that a key factor limiting this in practice is how cumbersome it is to use more serious profilers such as Linux's "perf", especially in embedded/mobile development, where multiple command lines are needed to copy symbol files to devices, retrieve profile data from the device, and so on. In that context, it is useful to make profiling as easy as benchmarking, even on embedded targets, even if the price to pay is lower accuracy, higher overhead, and some intrusive instrumentation.
Another key factor in deciding which profiling approach suits a given context is whether one already has a priori knowledge of where much of the time is likely being spent. With such knowledge, it is feasible to instrument the known possibly-critical code as per the present approach. Without it, a real profiler such as Linux's "perf" immediately gives a profile of real stacks, using just the symbol information generated by the toolchain.