You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
329 lines
15 KiB
329 lines
15 KiB
Demonstrations of trace.
|
|
|
|
|
|
trace probes functions you specify and displays trace messages if a particular
|
|
condition is met. You can control the message format to display function
|
|
arguments and return values.
|
|
|
|
For example, suppose you want to trace all commands being exec'd across the
|
|
system:
|
|
|
|
# trace 'sys_execve "%s", arg1'
|
|
PID COMM FUNC -
|
|
4402 bash sys_execve /usr/bin/man
|
|
4411 man sys_execve /usr/local/bin/less
|
|
4411 man sys_execve /usr/bin/less
|
|
4410 man sys_execve /usr/local/bin/nroff
|
|
4410 man sys_execve /usr/bin/nroff
|
|
4409 man sys_execve /usr/local/bin/tbl
|
|
4409 man sys_execve /usr/bin/tbl
|
|
4408 man sys_execve /usr/local/bin/preconv
|
|
4408 man sys_execve /usr/bin/preconv
|
|
4415 nroff sys_execve /usr/bin/locale
|
|
4416 nroff sys_execve /usr/bin/groff
|
|
4418 groff sys_execve /usr/bin/grotty
|
|
4417 groff sys_execve /usr/bin/troff
|
|
^C
|
|
|
|
The ::sys_execve syntax specifies that you want an entry probe (which is the
|
|
default), in a kernel function (which is the default) called sys_execve. Next,
|
|
the format string to print is simply "%s", which prints a string. Finally, the
|
|
value to print is the first argument to the sys_execve function, which happens
|
|
to be the command that is exec'd. The above trace was generated by executing
|
|
"man ls" in a separate shell. As you see, man executes a number of additional
|
|
programs to finally display the man page.
|
|
|
|
Next, suppose you are looking for large reads across the system. Let's trace
|
|
the read system call and inspect the third argument, which is the number of
|
|
bytes to be read:
|
|
|
|
# trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
|
|
PID COMM FUNC -
|
|
4490 dd sys_read read 1048576 bytes
|
|
4490 dd sys_read read 1048576 bytes
|
|
4490 dd sys_read read 1048576 bytes
|
|
4490 dd sys_read read 1048576 bytes
|
|
^C
|
|
|
|
During the trace, I executed "dd if=/dev/zero of=/dev/null bs=1M count=4".
|
|
The individual reads are visible, with the custom format message printed for
|
|
each read. The parenthesized expression "(arg3 > 20000)" is a filter that is
|
|
evaluated for each invocation of the probe before printing anything.
|
|
|
|
You can also trace user functions. For example, let's simulate the bashreadline
|
|
script, which attaches to the readline function in bash and prints its return
|
|
value, effectively snooping all bash shell input across the system:
|
|
|
|
# trace 'r:bash:readline "%s", retval'
|
|
PID COMM FUNC -
|
|
2740 bash readline echo hi!
|
|
2740 bash readline man ls
|
|
^C
|
|
|
|
The special retval keyword stands for the function's return value, and can
|
|
be used only in a retprobe, specified by the 'r' prefix. The next component
|
|
of the probe is the library that contains the desired function. It's OK to
|
|
specify executables too, as long as they can be found in the PATH. Or, you
|
|
can specify the full path to the executable (e.g. "/usr/bin/bash").
|
|
|
|
Sometimes it can be useful to see where in code the events happen. There are
|
|
flags to print the kernel stack (-K), the user stack (-U) and optionally
|
|
include the virtual address in the stacks as well (-a):
|
|
|
|
# trace.py -U -a 'r::sys_futex "%d", retval'
|
|
PID TID COMM FUNC -
|
|
793922 793951 poller sys_futex 0
|
|
7f6c72b6497a __lll_unlock_wake+0x1a [libpthread-2.23.so]
|
|
627fef folly::FunctionScheduler::run()+0x46f [router]
|
|
7f6c7345f171 execute_native_thread_routine+0x21 [libstdc++.so.6.0.21]
|
|
7f6c72b5b7a9 start_thread+0xd9 [libpthread-2.23.so]
|
|
7f6c7223fa7d clone+0x6d [libc-2.23.so]
|
|
|
|
Multiple probes can be combined on the same command line. For example, let's
|
|
trace failed read and write calls on the libc level, and include a time column:
|
|
|
|
# trace 'r:c:read ((int)retval < 0) "read failed: %d", retval' \
|
|
'r:c:write ((int)retval < 0) "write failed: %d", retval' -T
|
|
TIME PID COMM FUNC -
|
|
05:31:57 3388 bash write write failed: -1
|
|
05:32:00 3388 bash write write failed: -1
|
|
^C
|
|
|
|
Note that the retval variable must be cast to int before comparing to zero.
|
|
The reason is that the default type for argN and retval is an unsigned 64-bit
|
|
integer, which can never be smaller than 0.
|
|
|
|
trace has also some basic support for kernel tracepoints. For example, let's
|
|
trace the block:block_rq_complete tracepoint and print out the number of sectors
|
|
transferred:
|
|
|
|
# trace 't:block:block_rq_complete "sectors=%d", args->nr_sector' -T
|
|
TIME PID COMM FUNC -
|
|
01:23:51 0 swapper/0 block_rq_complete sectors=8
|
|
01:23:55 10017 kworker/u64: block_rq_complete sectors=1
|
|
01:23:55 0 swapper/0 block_rq_complete sectors=8
|
|
^C
|
|
|
|
To discover the tracepoint structure format (which you can refer to as the "args"
|
|
pointer variable), use the tplist tool. For example:
|
|
|
|
# tplist -v block:block_rq_complete
|
|
block:block_rq_complete
|
|
dev_t dev;
|
|
sector_t sector;
|
|
unsigned int nr_sector;
|
|
int errors;
|
|
char rwbs[8];
|
|
|
|
This output tells you that you can use "args->dev", "args->sector", etc. in your
|
|
predicate and trace arguments.
|
|
|
|
|
|
More and more high-level libraries are instrumented with USDT probe support.
|
|
These probes can be traced by trace just like kernel tracepoints. For example,
|
|
trace new threads being created and their function name, include time column
|
|
and on which CPU it happened:
|
|
|
|
# trace 'u:pthread:pthread_create "%U", arg3' -T -C
|
|
TIME CPU PID TID COMM FUNC -
|
|
13:22:01 25 2627 2629 automount pthread_create expire_proc_indirect+0x0 [automount]
|
|
13:22:01 5 21360 21414 osqueryd pthread_create [unknown] [osqueryd]
|
|
13:22:03 25 2627 2629 automount pthread_create expire_proc_indirect+0x0 [automount]
|
|
13:22:04 15 21360 21414 osqueryd pthread_create [unknown] [osqueryd]
|
|
13:22:07 25 2627 2629 automount pthread_create expire_proc_indirect+0x0 [automount]
|
|
13:22:07 4 21360 21414 osqueryd pthread_create [unknown] [osqueryd]
|
|
^C
|
|
|
|
The "%U" format specifier tells trace to resolve arg3 as a user-space symbol,
|
|
if possible. Similarly, use "%K" for kernel symbols.
|
|
|
|
Ruby, Node, and OpenJDK are also instrumented with USDT. For example, let's
|
|
trace Ruby methods being called (this requires a version of Ruby built with
|
|
the --enable-dtrace configure flag):
|
|
|
|
# trace 'u:ruby:method__entry "%s.%s", arg1, arg2' -p $(pidof irb) -T
|
|
TIME PID COMM FUNC -
|
|
12:08:43 18420 irb method__entry IRB::Context.verbose?
|
|
12:08:43 18420 irb method__entry RubyLex.ungetc
|
|
12:08:43 18420 irb method__entry RuxyLex.debug?
|
|
^C
|
|
|
|
In the previous invocation, arg1 and arg2 are the class name and method name
|
|
for the Ruby method being invoked.
|
|
|
|
You can also trace exported functions from shared libraries, or an imported
|
|
function on the actual executable:
|
|
|
|
# sudo ./trace.py 'r:/usr/lib64/libtinfo.so:curses_version "Version=%s", retval'
|
|
# tput -V
|
|
|
|
PID TID COMM FUNC -
|
|
21720 21720 tput curses_version Version=ncurses 6.0.20160709
|
|
^C
|
|
|
|
|
|
Occasionally, it can be useful to filter specific strings. For example, you
|
|
might be interested in open() calls that open a specific file:
|
|
|
|
# trace 'p:c:open (STRCMP("test.txt", arg1)) "opening %s", arg1' -T
|
|
TIME PID COMM FUNC -
|
|
01:43:15 10938 cat open opening test.txt
|
|
01:43:20 10939 cat open opening test.txt
|
|
^C
|
|
|
|
|
|
In the preceding example, as well as in many others, readability may be
|
|
improved by providing the function's signature, which names the arguments and
|
|
lets you access structure sub-fields, which is hard with the "arg1", "arg2"
|
|
convention. For example:
|
|
|
|
# trace 'p:c:open(char *filename) "opening %s", filename'
|
|
PID TID COMM FUNC -
|
|
17507 17507 cat open opening FAQ.txt
|
|
^C
|
|
|
|
# trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
|
|
PID TID COMM FUNC -
|
|
777 785 automount SyS_nanosleep sleep for 500000000 ns
|
|
777 785 automount SyS_nanosleep sleep for 500000000 ns
|
|
777 785 automount SyS_nanosleep sleep for 500000000 ns
|
|
777 785 automount SyS_nanosleep sleep for 500000000 ns
|
|
^C
|
|
|
|
Remember to use the -I argument include the appropriate header file. We didn't
|
|
need to do that here because `struct timespec` is used internally by the tool,
|
|
so it always includes this header file.
|
|
|
|
|
|
As a final example, let's trace open syscalls for a specific process. By
|
|
default, tracing is system-wide, but the -p switch overrides this:
|
|
|
|
# trace -p 2740 'do_sys_open "%s", arg2' -T
|
|
TIME PID COMM FUNC -
|
|
05:36:16 15872 ls do_sys_open /etc/ld.so.cache
|
|
05:36:16 15872 ls do_sys_open /lib64/libselinux.so.1
|
|
05:36:16 15872 ls do_sys_open /lib64/libcap.so.2
|
|
05:36:16 15872 ls do_sys_open /lib64/libacl.so.1
|
|
05:36:16 15872 ls do_sys_open /lib64/libc.so.6
|
|
05:36:16 15872 ls do_sys_open /lib64/libpcre.so.1
|
|
05:36:16 15872 ls do_sys_open /lib64/libdl.so.2
|
|
05:36:16 15872 ls do_sys_open /lib64/libattr.so.1
|
|
05:36:16 15872 ls do_sys_open /lib64/libpthread.so.0
|
|
05:36:16 15872 ls do_sys_open /usr/lib/locale/locale-archive
|
|
05:36:16 15872 ls do_sys_open /home/vagrant
|
|
^C
|
|
|
|
In this example, we traced the "ls ~" command as it was opening its shared
|
|
libraries and then accessing the /home/vagrant directory listing.
|
|
|
|
|
|
Lastly, if a high-frequency event is traced you may overflow the perf ring
|
|
buffer. This shows as "Lost N samples":
|
|
|
|
# trace sys_open
|
|
5087 5087 pgrep sys_open
|
|
5087 5087 pgrep sys_open
|
|
5087 5087 pgrep sys_open
|
|
5087 5087 pgrep sys_open
|
|
5087 5087 pgrep sys_open
|
|
Lost 764896 samples
|
|
Lost 764896 samples
|
|
Lost 764896 samples
|
|
|
|
The perf ring buffer size can be changed with -b. The unit is size per-CPU buffer
|
|
size and is measured in pages. The value must be a power of two and defaults to
|
|
64 pages.
|
|
|
|
# trace.py 'sys_setsockopt(int fd, int level, int optname, char* optval, int optlen)(level==0 && optname == 1 && STRCMP("{0x6C, 0x00, 0x00, 0x00}", optval))' -U -M 1 --bin_cmp
|
|
PID TID COMM FUNC -
|
|
1855611 1863183 worker sys_setsockopt found
|
|
|
|
In this example we are catching setsockopt syscall to change IPv4 IP_TOS
|
|
value only for the cases where new TOS value is equal to 108. we are using
|
|
STRCMP helper in binary mode (--bin_cmp flag) to compare optval array
|
|
against int value of 108 (parametr of setsockopt call) in hex representation
|
|
(little endian format)
|
|
|
|
|
|
|
|
|
|
USAGE message:
|
|
|
|
usage: trace [-h] [-b BUFFER_PAGES] [-p PID] [-L TID] [-v] [-Z STRING_SIZE]
|
|
[-S] [-M MAX_EVENTS] [-t] [-T] [-K] [-U] [-a] [-I header]
|
|
probe [probe ...]
|
|
|
|
Attach to functions and print trace messages.
|
|
|
|
positional arguments:
|
|
probe probe specifier (see examples)
|
|
|
|
optional arguments:
|
|
-h, --help show this help message and exit
|
|
-b BUFFER_PAGES, --buffer-pages BUFFER_PAGES
|
|
number of pages to use for perf_events ring buffer
|
|
(default: 64)
|
|
-p PID, --pid PID id of the process to trace (optional)
|
|
-L TID, --tid TID id of the thread to trace (optional)
|
|
-v, --verbose print resulting BPF program code before executing
|
|
-Z STRING_SIZE, --string-size STRING_SIZE
|
|
maximum size to read from strings
|
|
-S, --include-self do not filter trace's own pid from the trace
|
|
-M MAX_EVENTS, --max-events MAX_EVENTS
|
|
number of events to print before quitting
|
|
-t, --timestamp print timestamp column (offset from trace start)
|
|
-T, --time print time column
|
|
-C, --print_cpu print CPU id
|
|
-B, --bin_cmp allow to use STRCMP with binary values
|
|
-K, --kernel-stack output kernel stack trace
|
|
-U, --user-stack output user stack trace
|
|
-a, --address print virtual address in stacks
|
|
-I header, --include header
|
|
additional header files to include in the BPF program
|
|
as either full path, or relative to current working directory,
|
|
or relative to default kernel header search path
|
|
|
|
EXAMPLES:
|
|
|
|
trace do_sys_open
|
|
Trace the open syscall and print a default trace message when entered
|
|
trace 'do_sys_open "%s", arg2'
|
|
Trace the open syscall and print the filename being opened
|
|
trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
|
|
Trace the read syscall and print a message for reads >20000 bytes
|
|
trace 'r::do_sys_open "%llx", retval'
|
|
Trace the return from the open syscall and print the return value
|
|
trace 'c:open (arg2 == 42) "%s %d", arg1, arg2'
|
|
Trace the open() call from libc only if the flags (arg2) argument is 42
|
|
trace 'c:malloc "size = %d", arg1'
|
|
Trace malloc calls and print the size being allocated
|
|
trace 'p:c:write (arg1 == 1) "writing %d bytes to STDOUT", arg3'
|
|
Trace the write() call from libc to monitor writes to STDOUT
|
|
trace 'r::__kmalloc (retval == 0) "kmalloc failed!"'
|
|
Trace returns from __kmalloc which returned a null pointer
|
|
trace 'r:c:malloc (retval) "allocated = %x", retval'
|
|
Trace returns from malloc and print non-NULL allocated buffers
|
|
trace 't:block:block_rq_complete "sectors=%d", args->nr_sector'
|
|
Trace the block_rq_complete kernel tracepoint and print # of tx sectors
|
|
trace 'u:pthread:pthread_create (arg4 != 0)'
|
|
Trace the USDT probe pthread_create when its 4th argument is non-zero
|
|
trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
|
|
Trace the nanosleep syscall and print the sleep duration in ns
|
|
trace -I 'linux/fs.h' \
|
|
'p::uprobe_register(struct inode *inode) "a_ops = %llx", inode->i_mapping->a_ops'
|
|
Trace the uprobe_register inode mapping ops, and the symbol can be found
|
|
in /proc/kallsyms
|
|
trace -I 'kernel/sched/sched.h' \
|
|
'p::__account_cfs_rq_runtime(struct cfs_rq *cfs_rq) "%d", cfs_rq->runtime_remaining'
|
|
Trace the cfs scheduling runqueue remaining runtime. The struct cfs_rq is defined
|
|
in kernel/sched/sched.h which is in kernel source tree and not in kernel-devel
|
|
package. So this command needs to run at the kernel source tree root directory
|
|
so that the added header file can be found by the compiler.
|
|
trace -I 'net/sock.h' \\
|
|
'udpv6_sendmsg(struct sock *sk) (sk->sk_dport == 13568)'
|
|
Trace udpv6 sendmsg calls only if socket's destination port is equal
|
|
to 53 (DNS; 13568 in big endian order)
|
|
trace -I 'linux/fs_struct.h' 'mntns_install "users = %d", $task->fs->users'
|
|
Trace the number of users accessing the file system of the current task
|
|
"
|