Demonstrations of trace.
trace probes functions you specify and displays trace messages if a particular
condition is met. You can control the message format to display function
arguments and return values.
For example, suppose you want to trace all commands being exec'd across the
system:
# trace 'sys_execve "%s", arg1'
PID COMM FUNC -
4402 bash sys_execve /usr/bin/man
4411 man sys_execve /usr/local/bin/less
4411 man sys_execve /usr/bin/less
4410 man sys_execve /usr/local/bin/nroff
4410 man sys_execve /usr/bin/nroff
4409 man sys_execve /usr/local/bin/tbl
4409 man sys_execve /usr/bin/tbl
4408 man sys_execve /usr/local/bin/preconv
4408 man sys_execve /usr/bin/preconv
4415 nroff sys_execve /usr/bin/locale
4416 nroff sys_execve /usr/bin/groff
4418 groff sys_execve /usr/bin/grotty
4417 groff sys_execve /usr/bin/troff
^C
The sys_execve probe specifier is shorthand for p::sys_execve: an entry probe
(which is the default), in a kernel function (which is the default) called
sys_execve. Next,
the format string to print is simply "%s", which prints a string. Finally, the
value to print is the first argument to the sys_execve function, which happens
to be the command that is exec'd. The above trace was generated by executing
"man ls" in a separate shell. As you see, man executes a number of additional
programs to finally display the man page.
Next, suppose you are looking for large reads across the system. Let's trace
the read system call and inspect the third argument, which is the number of
bytes to be read:
# trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
PID COMM FUNC -
4490 dd sys_read read 1048576 bytes
4490 dd sys_read read 1048576 bytes
4490 dd sys_read read 1048576 bytes
4490 dd sys_read read 1048576 bytes
^C
During the trace, I executed "dd if=/dev/zero of=/dev/null bs=1M count=4".
The individual reads are visible, with the custom format message printed for
each read. The parenthesized expression "(arg3 > 20000)" is a filter that is
evaluated for each invocation of the probe before printing anything.
You can also trace user functions. For example, let's simulate the bashreadline
script, which attaches to the readline function in bash and prints its return
value, effectively snooping all bash shell input across the system:
# trace 'r:bash:readline "%s", retval'
PID COMM FUNC -
2740 bash readline echo hi!
2740 bash readline man ls
^C
The special retval keyword stands for the function's return value, and can
be used only in a retprobe, specified by the 'r' prefix. The next component
of the probe is the library that contains the desired function. It's OK to
specify executables too, as long as they can be found in the PATH. Or, you
can specify the full path to the executable (e.g. "/usr/bin/bash").
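To check ahead of time whether a bare executable name can be found in the PATH, something like the following Python sketch may help. It is not taken from trace and does not claim to mirror how trace resolves names; the helper name resolve_binary is made up for illustration, and it covers executables only, not libraries such as "c".

```python
import os
import shutil

def resolve_binary(name):
    """Return the path where an executable is found, or None if it is not found.

    Hypothetical helper for illustration; not part of trace.
    """
    if os.path.isabs(name):
        # A full path is used as-is, provided the file exists.
        return name if os.path.isfile(name) else None
    # A bare name is looked up in the directories listed in PATH.
    return shutil.which(name)

print(resolve_binary("bash"))                 # path depends on the system
print(resolve_binary("no-such-binary-xyz"))   # None
```

If the lookup returns None, passing the full path in the probe specifier is the safer choice.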
Sometimes it can be useful to see where in code the events happen. There are
flags to print the kernel stack (-K), the user stack (-U) and optionally
include the virtual address in the stacks as well (-a):
# trace.py -U -a 'r::sys_futex "%d", retval'
PID TID COMM FUNC -
793922 793951 poller sys_futex 0
7f6c72b6497a __lll_unlock_wake+0x1a [libpthread-2.23.so]
627fef folly::FunctionScheduler::run()+0x46f [router]
7f6c7345f171 execute_native_thread_routine+0x21 [libstdc++.so.6.0.21]
7f6c72b5b7a9 start_thread+0xd9 [libpthread-2.23.so]
7f6c7223fa7d clone+0x6d [libc-2.23.so]
Multiple probes can be combined on the same command line. For example, let's
trace failed read and write calls on the libc level, and include a time column:
# trace 'r:c:read ((int)retval < 0) "read failed: %d", retval' \
'r:c:write ((int)retval < 0) "write failed: %d", retval' -T
TIME PID COMM FUNC -
05:31:57 3388 bash write write failed: -1
05:32:00 3388 bash write write failed: -1
^C
Note that the retval variable must be cast to int before comparing to zero.
The reason is that the default type for argN and retval is an unsigned 64-bit
integer, which can never be smaller than 0.
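The effect of the cast can be reproduced outside the tool. The following is a minimal Python sketch, not part of trace; the helper name as_signed_int is made up for illustration. It shows what a return value of -1 looks like as an unsigned 64-bit integer, and how reinterpreting its low 32 bits as a signed int recovers the negative value.

```python
import struct

def as_signed_int(raw_u64):
    """Reinterpret the low 32 bits of an unsigned 64-bit value as a C int."""
    low32 = raw_u64 & 0xFFFFFFFF
    return struct.unpack("<i", struct.pack("<I", low32))[0]

# -1 stored in an unsigned 64-bit variable, as an uncast retval would be
raw = (-1) & 0xFFFFFFFFFFFFFFFF

print(raw < 0)                 # False: an unsigned value is never negative
print(as_signed_int(raw) < 0)  # True: after the cast the filter can match
```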
trace also has some basic support for kernel tracepoints. For example, let's
trace the block:block_rq_complete tracepoint and print out the number of sectors
transferred:
# trace 't:block:block_rq_complete "sectors=%d", args->nr_sector' -T
TIME PID COMM FUNC -
01:23:51 0 swapper/0 block_rq_complete sectors=8
01:23:55 10017 kworker/u64: block_rq_complete sectors=1
01:23:55 0 swapper/0 block_rq_complete sectors=8
^C
To discover the tracepoint structure format (which you can refer to as the "args"
pointer variable), use the tplist tool. For example:
# tplist -v block:block_rq_complete
block:block_rq_complete
dev_t dev;
sector_t sector;
unsigned int nr_sector;
int errors;
char rwbs[8];
This output tells you that you can use "args->dev", "args->sector", etc. in your
predicate and trace arguments.
More and more high-level libraries are instrumented with USDT probe support.
These probes can be traced by trace just like kernel tracepoints. For example,
let's trace new threads as they are created and print their function name,
including a time column and the CPU on which each event happened:
# trace 'u:pthread:pthread_create "%U", arg3' -T -C
TIME CPU PID TID COMM FUNC -
13:22:01 25 2627 2629 automount pthread_create expire_proc_indirect+0x0 [automount]
13:22:01 5 21360 21414 osqueryd pthread_create [unknown] [osqueryd]
13:22:03 25 2627 2629 automount pthread_create expire_proc_indirect+0x0 [automount]
13:22:04 15 21360 21414 osqueryd pthread_create [unknown] [osqueryd]
13:22:07 25 2627 2629 automount pthread_create expire_proc_indirect+0x0 [automount]
13:22:07 4 21360 21414 osqueryd pthread_create [unknown] [osqueryd]
^C
The "%U" format specifier tells trace to resolve arg3 as a user-space symbol,
if possible. Similarly, use "%K" for kernel symbols.
Ruby, Node, and OpenJDK are also instrumented with USDT. For example, let's
trace Ruby methods being called (this requires a version of Ruby built with
the --enable-dtrace configure flag):
# trace 'u:ruby:method__entry "%s.%s", arg1, arg2' -p $(pidof irb) -T
TIME PID COMM FUNC -
12:08:43 18420 irb method__entry IRB::Context.verbose?
12:08:43 18420 irb method__entry RubyLex.ungetc
12:08:43 18420 irb method__entry RubyLex.debug?
^C
In the previous invocation, arg1 and arg2 are the class name and method name
for the Ruby method being invoked.
You can also trace exported functions from shared libraries, or an imported
function on the actual executable:
# sudo ./trace.py 'r:/usr/lib64/libtinfo.so:curses_version "Version=%s", retval'
# tput -V
PID TID COMM FUNC -
21720 21720 tput curses_version Version=ncurses 6.0.20160709
^C
Occasionally, it can be useful to filter specific strings. For example, you
might be interested in open() calls that open a specific file:
# trace 'p:c:open (STRCMP("test.txt", arg1)) "opening %s", arg1' -T
TIME PID COMM FUNC -
01:43:15 10938 cat open opening test.txt
01:43:20 10939 cat open opening test.txt
^C
In the preceding example, as well as in many others, readability may be
improved by providing the function's signature, which names the arguments and
lets you access structure sub-fields, which is hard with the "arg1", "arg2"
convention. For example:
# trace 'p:c:open(char *filename) "opening %s", filename'
PID TID COMM FUNC -
17507 17507 cat open opening FAQ.txt
^C
# trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
PID TID COMM FUNC -
777 785 automount SyS_nanosleep sleep for 500000000 ns
777 785 automount SyS_nanosleep sleep for 500000000 ns
777 785 automount SyS_nanosleep sleep for 500000000 ns
777 785 automount SyS_nanosleep sleep for 500000000 ns
^C
Remember to use the -I argument to include the appropriate header file. We didn't
need to do that here because `struct timespec` is used internally by the tool,
so it always includes this header file.
As another example, let's trace open syscalls for a specific process. By
default, tracing is system-wide, but the -p switch overrides this:
# trace -p 2740 'do_sys_open "%s", arg2' -T
TIME PID COMM FUNC -
05:36:16 15872 ls do_sys_open /etc/ld.so.cache
05:36:16 15872 ls do_sys_open /lib64/libselinux.so.1
05:36:16 15872 ls do_sys_open /lib64/libcap.so.2
05:36:16 15872 ls do_sys_open /lib64/libacl.so.1
05:36:16 15872 ls do_sys_open /lib64/libc.so.6
05:36:16 15872 ls do_sys_open /lib64/libpcre.so.1
05:36:16 15872 ls do_sys_open /lib64/libdl.so.2
05:36:16 15872 ls do_sys_open /lib64/libattr.so.1
05:36:16 15872 ls do_sys_open /lib64/libpthread.so.0
05:36:16 15872 ls do_sys_open /usr/lib/locale/locale-archive
05:36:16 15872 ls do_sys_open /home/vagrant
^C
In this example, we traced the "ls ~" command as it was opening its shared
libraries and then accessing the /home/vagrant directory listing.
Lastly, if a high-frequency event is traced you may overflow the perf ring
buffer. This shows as "Lost N samples":
# trace sys_open
5087 5087 pgrep sys_open
5087 5087 pgrep sys_open
5087 5087 pgrep sys_open
5087 5087 pgrep sys_open
5087 5087 pgrep sys_open
Lost 764896 samples
Lost 764896 samples
Lost 764896 samples
The perf ring buffer size can be changed with -b. The value is the per-CPU
buffer size, measured in pages. It must be a power of two and defaults to
64 pages.
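To pick a value for -b, it can help to check the power-of-two constraint and see what a page count means in bytes. The sketch below is illustrative only and not part of trace: the helper names are made up, and the 4096-byte page size is an assumption (check yours with "getconf PAGESIZE").

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages; verify with `getconf PAGESIZE`

def is_power_of_two(n):
    """True if n is a positive power of two (1, 2, 4, 8, ...)."""
    return n > 0 and (n & (n - 1)) == 0

def buffer_bytes_per_cpu(pages, page_size=PAGE_SIZE):
    """Size in bytes of one per-CPU ring buffer for a given page count."""
    if not is_power_of_two(pages):
        raise ValueError("page count must be a power of two")
    return pages * page_size

print(buffer_bytes_per_cpu(64))  # 262144, i.e. 256 KiB per CPU at the default
print(is_power_of_two(100))      # False: 100 would not be a valid page count
```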
# trace.py 'sys_setsockopt(int fd, int level, int optname, char* optval, int optlen)(level==0 && optname == 1 && STRCMP("{0x6C, 0x00, 0x00, 0x00}", optval))' -U -M 1 --bin_cmp
PID TID COMM FUNC -
1855611 1863183 worker sys_setsockopt found
In this example, we catch setsockopt syscalls that change the IPv4 IP_TOS
value, but only when the new TOS value is equal to 108. We use the
STRCMP helper in binary mode (--bin_cmp flag) to compare the optval array
against the int value 108 (the parameter of the setsockopt call) in hex
representation (little-endian format).
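The byte list passed to STRCMP can be derived rather than worked out by hand. The following Python sketch is not part of trace; it assumes a 4-byte int and a little-endian machine, as in the example above, and the variable names are made up for illustration.

```python
import struct

TOS_VALUE = 108  # the value being matched in the example above

# Pack the value as a 4-byte little-endian int, then render each byte in hex.
raw = struct.pack("<i", TOS_VALUE)
literal = "{" + ", ".join("0x%02X" % b for b in raw) + "}"

print(literal)  # {0x6C, 0x00, 0x00, 0x00}
```

The least significant byte (0x6C, which is 108) comes first because of the little-endian layout.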
USAGE message:
usage: trace [-h] [-b BUFFER_PAGES] [-p PID] [-L TID] [-v] [-Z STRING_SIZE]
             [-S] [-M MAX_EVENTS] [-t] [-T] [-K] [-U] [-a] [-I header]
             probe [probe ...]
Attach to functions and print trace messages.
positional arguments:
  probe                 probe specifier (see examples)
optional arguments:
  -h, --help            show this help message and exit
  -b BUFFER_PAGES, --buffer-pages BUFFER_PAGES
                        number of pages to use for perf_events ring buffer
                        (default: 64)
  -p PID, --pid PID     id of the process to trace (optional)
  -L TID, --tid TID     id of the thread to trace (optional)
  -v, --verbose         print resulting BPF program code before executing
  -Z STRING_SIZE, --string-size STRING_SIZE
                        maximum size to read from strings
  -S, --include-self    do not filter trace's own pid from the trace
  -M MAX_EVENTS, --max-events MAX_EVENTS
                        number of events to print before quitting
  -t, --timestamp       print timestamp column (offset from trace start)
  -T, --time            print time column
  -C, --print_cpu       print CPU id
  -B, --bin_cmp         allow to use STRCMP with binary values
  -K, --kernel-stack    output kernel stack trace
  -U, --user-stack      output user stack trace
  -a, --address         print virtual address in stacks
  -I header, --include header
                        additional header files to include in the BPF program
                        as either full path, or relative to current working directory,
                        or relative to default kernel header search path
EXAMPLES:
trace do_sys_open
Trace the open syscall and print a default trace message when entered
trace 'do_sys_open "%s", arg2'
Trace the open syscall and print the filename being opened
trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
Trace the read syscall and print a message for reads >20000 bytes
trace 'r::do_sys_open "%llx", retval'
Trace the return from the open syscall and print the return value
trace 'c:open (arg2 == 42) "%s %d", arg1, arg2'
Trace the open() call from libc only if the flags (arg2) argument is 42
trace 'c:malloc "size = %d", arg1'
Trace malloc calls and print the size being allocated
trace 'p:c:write (arg1 == 1) "writing %d bytes to STDOUT", arg3'
Trace the write() call from libc to monitor writes to STDOUT
trace 'r::__kmalloc (retval == 0) "kmalloc failed!"'
Trace returns from __kmalloc which returned a null pointer
trace 'r:c:malloc (retval) "allocated = %x", retval'
Trace returns from malloc and print non-NULL allocated buffers
trace 't:block:block_rq_complete "sectors=%d", args->nr_sector'
Trace the block_rq_complete kernel tracepoint and print # of tx sectors
trace 'u:pthread:pthread_create (arg4 != 0)'
Trace the USDT probe pthread_create when its 4th argument is non-zero
trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
Trace the nanosleep syscall and print the sleep duration in ns
trace -I 'linux/fs.h' \
'p::uprobe_register(struct inode *inode) "a_ops = %llx", inode->i_mapping->a_ops'
Trace the uprobe_register inode mapping ops, and the symbol can be found
in /proc/kallsyms
trace -I 'kernel/sched/sched.h' \
'p::__account_cfs_rq_runtime(struct cfs_rq *cfs_rq) "%d", cfs_rq->runtime_remaining'
Trace the cfs scheduling runqueue remaining runtime. The struct cfs_rq is defined
in kernel/sched/sched.h which is in kernel source tree and not in kernel-devel
package. So this command needs to run at the kernel source tree root directory
so that the added header file can be found by the compiler.
trace -I 'net/sock.h' \
'udpv6_sendmsg(struct sock *sk) (sk->sk_dport == 13568)'
Trace udpv6 sendmsg calls only if socket's destination port is equal
to 53 (DNS; 13568 in big endian order)
trace -I 'linux/fs_struct.h' 'mntns_install "users = %d", $task->fs->users'
Trace the number of users accessing the file system of the current task