Discussion:
Using the qemu tracepoints with SystemTap
William Cohen
2011-09-12 15:33:02 UTC
Permalink
Hi All,

The RHEL-6 version of qemu-kvm makes the tracepoints available to SystemTap. I have been working on useful examples for the SystemTap tracepoints in qemu. There doesn't seem to be a great number of examples showing the utility of the tracepoints in diagnosing problems. However, I came across the following blog entry that had several examples:

http://blog.vmsplice.net/2011/03/how-to-write-trace-analysis-scripts-for.html

I reimplemented the VirtqueueRequestTracker example from the blog in SystemTap (the attached virtqueueleaks.stp). I can run it against RHEL-6's qemu-kvm-0.12.1.2-2.160.el6_1.8.x86_64. When the script is stopped, it prints the pid and the address of each elem that leaked:

$ stap virtqueueleaks.stp
^C
    pid     elem
  19503  1c4af28
  19503  1c56f88
  19503  1c62fe8
  19503  1c6f048
  19503  1c7b0a8
  19503  1c87108
  19503  1c93168
...
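For readers without the attachment, the core logic of the script is roughly the following sketch (the qemu.kvm.virtqueue_pop/virtqueue_fill probe names follow the RHEL-6 tapset; the elem argument name is an assumption):

```systemtap
# Sketch: track virtqueue elements that were popped but never filled.
global elems

probe qemu.kvm.virtqueue_pop {
    # An element was handed from the guest to QEMU.
    elems[pid(), elem] = 1
}

probe qemu.kvm.virtqueue_fill {
    # The element was returned to the guest; it is no longer outstanding.
    delete elems[pid(), elem]
}

probe end {
    # Whatever is still in the array when the script stops looks like a leak.
    printf("%8s %16s\n", "pid", "elem")
    foreach ([p, e] in elems)
        printf("%8d %16x\n", p, e)
}
```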

I am not that familiar with the internals of qemu. The script seems to indicate that qemu is leaking, but is that really the case? If there are resource leaks, what output would help debug them? What enhancements could make this script provide more useful information?

Are there other examples of qemu probing people would like to see?

-Will
Stefan Hajnoczi
2011-09-13 10:03:46 UTC
Permalink
Post by William Cohen
http://blog.vmsplice.net/2011/03/how-to-write-trace-analysis-scripts-for.html
$ stap virtqueueleaks.stp
^C
    pid     elem
  19503  1c4af28
  19503  1c56f88
  19503  1c62fe8
  19503  1c6f048
  19503  1c7b0a8
  19503  1c87108
  19503  1c93168
...
I am not that familiar with the internals of qemu. The script seems to indicate that qemu is leaking, but is that really the case? If there are resource leaks, what output would help debug them? What enhancements could make this script provide more useful information?
Leak tracing always has corner cases :).

With virtio-blk this would indicate a leak because it uses a
request-response model where the guest initiates I/O and the host
responds. A guest that cleanly shuts down before you exit your
SystemTap script should not leak requests for virtio-blk.

With virtio-net the guest actually hands the host receive buffers, and
they are held until we can receive packets into them and return them
to the guest. We don't have a virtio_reset trace event, so we're not
accounting for clean shutdown (the guest driver resets the device to
clear all virtqueues).

I am submitting a patch to add virtio_reset() tracing. This will
allow the script to delete all elements belonging to this virtio
device.
Post by William Cohen
Are there other examples of qemu probing people would like to see?
The malloc/realloc/memalign/vmalloc/free/vfree trace events can be
used for a few things:
* Finding memory leaks.
* Finding malloc/vfree or vmalloc/free mismatches. The rules are:
malloc/realloc need free, memalign/vmalloc need vfree. They cannot be
mixed.
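A sketch of a mismatch detector along those lines (the probe and argument names, e.g. ptr, are assumptions about the RHEL-6 tapset, and the qemu_vmalloc/qemu_vfree events may not fire on all builds):

```systemtap
# Record which allocator family produced each pointer, then flag
# frees that come from the wrong family: malloc/realloc pair with
# free, memalign/vmalloc pair with vfree.
global family

probe qemu.kvm.qemu_malloc, qemu.kvm.qemu_realloc {
    family[pid(), ptr] = "malloc"
}
probe qemu.kvm.qemu_memalign, qemu.kvm.qemu_vmalloc {
    family[pid(), ptr] = "vmalloc"
}
probe qemu.kvm.qemu_free {
    if ([pid(), ptr] in family && family[pid(), ptr] != "malloc")
        printf("mismatch: free() of vmalloc'd %x in pid %d\n", ptr, pid())
    delete family[pid(), ptr]
}
probe qemu.kvm.qemu_vfree {
    if ([pid(), ptr] in family && family[pid(), ptr] != "vmalloc")
        printf("mismatch: vfree() of malloc'd %x in pid %d\n", ptr, pid())
    delete family[pid(), ptr]
}
```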

Stefan
William Cohen
2011-09-13 16:10:47 UTC
Permalink
Post by Stefan Hajnoczi
Post by William Cohen
http://blog.vmsplice.net/2011/03/how-to-write-trace-analysis-scripts-for.html
$ stap virtqueueleaks.stp
^C
pid elem
19503 1c4af28
19503 1c56f88
19503 1c62fe8
19503 1c6f048
19503 1c7b0a8
19503 1c87108
19503 1c93168
...
I am not that familiar with the internals of qemu. The script seems to indicate that qemu is leaking, but is that really the case? If there are resource leaks, what output would help debug them? What enhancements could make this script provide more useful information?
Hi Stefan,

Thanks for the comments.
Post by Stefan Hajnoczi
Leak tracing always has corner cases :).
With virtio-blk this would indicate a leak because it uses a
request-response model where the guest initiates I/O and the host
responds. A guest that cleanly shuts down before you exit your
SystemTap script should not leak requests for virtio-blk.
I stopped the SystemTap script while the guest VM was still running. So when the guest VM cleanly shuts down, there should be a series of virtqueue_fill operations that remove those elements?

Qemu uses a thread for each virtual processor, but a single thread to handle all IO. It seems like that might be a possible bottleneck. What would be the path of an I/O event from the guest to the host and back to the guest? Is there something a script could do to gauge the delay caused by the qemu I/O thread servicing multiple processors?
Post by Stefan Hajnoczi
With virtio-net the guest actually hands the host receive buffers and
they are held until we can receive packets into them and return them
to the host. We don't have a virtio_reset trace event, and due to
this we're not accounting for clean shutdown (the guest driver resets
the device to clear all virtqueues).
I am submitting a patch to add virtio_reset() tracing. This will
allow the script to delete all elements belonging to this virtio
device.
Post by William Cohen
Are there other examples of qemu probing people would like to see?
The malloc/realloc/memalign/vmalloc/free/vfree trace events can be
* Finding memory leaks.
malloc/realloc need free, memalign/vmalloc need vfree. They cannot be
mixed.
Stefan
As a quick and simple experiment to see how often the various probes are hit, I used the following script on RHEL-6 (the probe points are named a bit differently on Fedora):

global counts
# Count every firing of any qemu probe, keyed by probe name; the
# trailing '?' keeps stap from failing on probe points absent in this build.
probe qemu.*.*? { counts[pn()]++ }
# On Ctrl-C, print the totals sorted by probe name.
probe end { foreach (n+ in counts) printf("%s = %d\n", n, counts[n]) }

For starting up a fedora 14 guest virtual machine and shutting it down I got the following output:

$ stap ~/research/profiling/examples/qemu_count.s
^Cqemu.kvm.balloon_event = 1
qemu.kvm.bdrv_aio_multiwrite = 155
qemu.kvm.bdrv_aio_readv = 13284
qemu.kvm.bdrv_aio_writev = 998
qemu.kvm.cpu_get_apic_base = 20
qemu.kvm.cpu_in = 94082
qemu.kvm.cpu_out = 165789
qemu.kvm.cpu_set_apic_base = 445752
qemu.kvm.multiwrite_cb = 654
qemu.kvm.paio_submit = 7141
qemu.kvm.qemu_free = 677704
qemu.kvm.qemu_malloc = 683031
qemu.kvm.qemu_memalign = 285
qemu.kvm.qemu_realloc = 47550
qemu.kvm.virtio_blk_handle_write = 504
qemu.kvm.virtio_blk_req_complete = 7146
qemu.kvm.virtio_blk_rw_complete = 7146
qemu.kvm.virtio_notify = 6574
qemu.kvm.virtio_queue_notify = 6680
qemu.kvm.virtqueue_fill = 7146
qemu.kvm.virtqueue_flush = 7146
qemu.kvm.virtqueue_pop = 7147
qemu.kvm.vm_state_notify = 1


There are a lot of qemu.kvm.qemu_malloc events. This is likely more than systemtap can track if there are thousands of allocations live at the same time. There are no qemu_vmalloc events because of https://bugzilla.redhat.com/show_bug.cgi?id=714773.

Should the qemu.kvm.cpu_in and qemu.kvm.cpu_out counts match up? There are a lot more qemu.kvm.cpu_out events than qemu.kvm.cpu_in events.

Note that qemu.kvm.virtio_blk_req_complete, qemu.kvm.virtio_blk_rw_complete, qemu.kvm.virtqueue_fill, and qemu.kvm.virtqueue_flush all have the same count, 7146, and qemu.kvm.virtqueue_pop is close behind at 7147.

-Will
William Cohen
2011-09-13 19:38:07 UTC
Permalink
Post by William Cohen
Should the qemu.kvm.cpu_in and qemu.kvm.cpu_out counts match up? There are a lot more qemu.kvm.cpu_out events than qemu.kvm.cpu_in events.
I found that cpu_in and cpu_out refer to I/O port input and output instructions. I wrote a little script to tally up the input and output operations on each port, and ran it against qemu on an fc15 machine.
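The script itself is not attached; a minimal version might look like this (assuming the cpu_in/cpu_out probes expose a port argument):

```systemtap
# Tally I/O port accesses by direction and port number.
global inports, outports

probe qemu.kvm.cpu_in  { inports[port]++ }
probe qemu.kvm.cpu_out { outports[port]++ }

probe end {
    printf("cpu_in\n port    count\n")
    foreach (p+ in inports)
        printf("0x%04x %8d\n", p, inports[p])
    printf("\ncpu_out\n port    count\n")
    foreach (p+ in outports)
        printf("0x%04x %8d\n", p, outports[p])
}
```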

It generates output like the following:


cpu_in
 port    count
0x01f7     3000
0x03d5      120
0xc000     2000
0xc002     3000

cpu_out
 port    count
0x0080      480
0x01f1     2000
0x01f2     2000
0x01f3     2000
0x01f4     2000
0x01f5     2000
0x01f6     2000
0x01f7     1000
0x03d4      480
0x03d5      120
0x03f6     1000
0xc000     3000
0xc002     2000
0xc004     1000
0xc090        4

Looks like lots of touching of the IDE device ports (0x01f0-0x01ff) and some VGA controller ports (0x03d0-0x03df). This is roughly what would be expected while the guest virtual machines are doing an fsck and SELinux relabel. It also looks like there is some PCI configuration access (http://www.tech-pro.net/intro_pci.html).

-Will
Stefan Hajnoczi
2011-09-14 14:51:27 UTC
Permalink
Should the qemu.kvm.cpu_in and qemu.kvm.cpu_out counts match up?  There are a lot more qemu.kvm.cpu_out events than qemu.kvm.cpu_in events.
I found that cpu_in and cpu_out refer to I/O port input and output instructions.  I wrote a little script to tally up the input and output operations on each port, and ran it against qemu on an fc15 machine.
cpu_in
 port    count
0x01f7     3000
0x03d5      120
0xc000     2000
0xc002     3000
cpu_out
 port    count
0x0080      480
0x01f1     2000
0x01f2     2000
0x01f3     2000
0x01f4     2000
0x01f5     2000
0x01f6     2000
0x01f7     1000
0x03d4      480
0x03d5      120
0x03f6     1000
0xc000     3000
0xc002     2000
0xc004     1000
0xc090        4
Looks like lots of touching of the IDE device ports (0x01f0-0x01ff) and some VGA controller ports (0x03d0-0x03df). This is roughly what would be expected while the guest virtual machines are doing an fsck and SELinux relabel. It also looks like there is some PCI configuration access (http://www.tech-pro.net/intro_pci.html).
I think the cpu_in/cpu_out tracing script can be useful in identifying
performance problems, especially when obscure guest OSes experience
poor performance due to weird ways of interacting with hardware.

Where are you putting your SystemTap scripts? I suggest creating a
public git repo and adding a link from the QEMU wiki tracing page:
http://wiki.qemu.org/Features/Tracing

Perhaps we could even include them in a contrib/ or similar directory
in qemu.git so that distros can ship them. But before we worry about
that we need a useful set of things that can be observed.

Stefan
Frank Ch. Eigler
2011-09-14 15:20:29 UTC
Permalink
Post by Stefan Hajnoczi
[...]
Where are you putting your SystemTap scripts? I suggest creating a
http://wiki.qemu.org/Features/Tracing
Perhaps we could even include them in a contrib/ or similar directory
in qemu.git so that distros can ship them. But before we worry about
that we need a useful set of things that can be observed.
(We can also or instead ship them within systemtap's bundled sources.)

- FChE
William Cohen
2011-09-14 16:02:39 UTC
Permalink
Post by Stefan Hajnoczi
Post by William Cohen
Post by William Cohen
Should the qemu.kvm.cpu_in and qemu.kvm.cpu_out counts match up? There are a lot more qemu.kvm.cpu_out events than qemu.kvm.cpu_in events.
I found that cpu_in and cpu_out refer to I/O port input and output instructions. I wrote a little script to tally up the input and output operations on each port, and ran it against qemu on an fc15 machine.
cpu_in
port count
0x01f7 3000
0x03d5 120
0xc000 2000
0xc002 3000
cpu_out
port count
0x0080 480
0x01f1 2000
0x01f2 2000
0x01f3 2000
0x01f4 2000
0x01f5 2000
0x01f6 2000
0x01f7 1000
0x03d4 480
0x03d5 120
0x03f6 1000
0xc000 3000
0xc002 2000
0xc004 1000
0xc090 4
Looks like lots of touching of the IDE device ports (0x01f0-0x01ff) and some VGA controller ports (0x03d0-0x03df). This is roughly what would be expected while the guest virtual machines are doing an fsck and SELinux relabel. It also looks like there is some PCI configuration access (http://www.tech-pro.net/intro_pci.html).
I think the cpu_in/cpu_out tracing script can be useful in identifying
performance problems, especially when obscure guest OSes experience
poor performance due to weird ways of interacting with hardware.
Glad this example looks useful.
Post by Stefan Hajnoczi
Where are you putting your SystemTap scripts? I suggest creating a
http://wiki.qemu.org/Features/Tracing
Definitely need a more lasting place to put the QEMU examples. SystemTap has an examples directory (http://sourceware.org/systemtap/examples/) into which we try to put things like this; those examples are run as part of the testsuite. These qemu examples are not yet ready for inclusion there. I have set up https://github.com/wcohen/qemu_systemtap for the qemu examples.
Post by Stefan Hajnoczi
Perhaps we could even include them in a contrib/ or similar directory
in qemu.git so that distros can ship them. But before we worry about
that we need a useful set of things that can be observed.
We definitely want to develop scripts that give people a better understanding of what is going on. I am just now learning how kvm and qemu work, so the early scripts just count events to give people an idea of how frequent the events are. In some cases these could point to issues where assumptions about event frequency turn out to be incorrect. I would like to have scripts that examine sequences of events and flag long latencies, but I don't yet have enough understanding of how kvm and qemu work to implement those.

There are other userspace applications, like python and java, that have similar probes (http://sourceware.org/systemtap/wiki "Applications with built-in User-Space Markers"). Certainly making the systemtap scripts easily available would be good. Shipping the scripts with qemu is one way to do it; the other would be to include them in the systemtap examples. The advantage of shipping with qemu is that the scripts would match the version of qemu being run, avoiding problems with changes in the qemu probe points. Having the scripts in systemtap would allow regular testing of the scripts.

-Will
Stefan Hajnoczi
2011-09-15 10:17:05 UTC
Permalink
Post by William Cohen
Post by Stefan Hajnoczi
Should the qemu.kvm.cpu_in and qemu.kvm.cpu_out counts match up?  There are a lot more qemu.kvm.cpu_out events than qemu.kvm.cpu_in events.
I found that cpu_in and cpu_out refer to I/O port input and output instructions.  I wrote a little script to tally up the input and output operations on each port, and ran it against qemu on an fc15 machine.
cpu_in
 port    count
0x01f7     3000
0x03d5      120
0xc000     2000
0xc002     3000
cpu_out
 port    count
0x0080      480
0x01f1     2000
0x01f2     2000
0x01f3     2000
0x01f4     2000
0x01f5     2000
0x01f6     2000
0x01f7     1000
0x03d4      480
0x03d5      120
0x03f6     1000
0xc000     3000
0xc002     2000
0xc004     1000
0xc090        4
Looks like lots of touching of the IDE device ports (0x01f0-0x01ff) and some VGA controller ports (0x03d0-0x03df). This is roughly what would be expected while the guest virtual machines are doing an fsck and SELinux relabel. It also looks like there is some PCI configuration access (http://www.tech-pro.net/intro_pci.html).
I think the cpu_in/cpu_out tracing script can be useful in identifying
performance problems, especially when obscure guest OSes experience
poor performance due to weird ways of interacting with hardware.
Glad this example looks useful.
Post by Stefan Hajnoczi
Where are you putting your SystemTap scripts?  I suggest creating a
http://wiki.qemu.org/Features/Tracing
Definitely need a more lasting place to put the QEMU examples. SystemTap has an examples directory (http://sourceware.org/systemtap/examples/) into which we try to put things like this; those examples are run as part of the testsuite. These qemu examples are not yet ready for inclusion there. I have set up https://github.com/wcohen/qemu_systemtap for the qemu examples.
Added a link from http://wiki.qemu.org/Features/Tracing.
Post by William Cohen
Post by Stefan Hajnoczi
Perhaps we could even include them in a contrib/ or similar directory
in qemu.git so that distros can ship them.  But before we worry about
that we need a useful set of things that can be observed.
We definitely want to develop scripts that give people a better understanding of what is going on. I am just now learning how kvm and qemu work, so the early scripts just count events to give people an idea of how frequent the events are. In some cases these could point to issues where assumptions about event frequency turn out to be incorrect. I would like to have scripts that examine sequences of events and flag long latencies, but I don't yet have enough understanding of how kvm and qemu work to implement those.
The vmexit/entry things I mentioned in the mail above are events that
can be traced in the host kernel (kvm:kvm_exit and kvm:kvm_entry).
The QEMU-side equivalent is the ioctl(KVM_RUN) but we do not have any
trace events in kvm-all.c yet (!!). A trace event before the ioctl
and one after could be used to trace vcpu code execution from QEMU's
perspective.
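In the meantime, one could approximate this from outside QEMU by timing the KVM_RUN ioctl itself. A sketch, assuming KVM_RUN encodes to 0xae80 (_IO(KVMIO, 0x80)) on this platform and that the SystemTap syscall tapset exposes a request argument:

```systemtap
# Time how long each thread spends inside the KVM_RUN ioctl,
# which roughly corresponds to time spent executing guest code.
global start, guest_time

probe syscall.ioctl {
    if (request == 0xae80)
        start[tid()] = gettimeofday_ns()
}

probe syscall.ioctl.return {
    t = start[tid()]
    if (t) {
        guest_time <<< gettimeofday_ns() - t
        delete start[tid()]
    }
}

probe end {
    if (@count(guest_time))
        printf("KVM_RUN calls: %d  avg time in guest: %d ns\n",
               @count(guest_time), @avg(guest_time))
}
```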

Stefan

Stefan Hajnoczi
2011-09-14 15:32:50 UTC
Permalink
Post by William Cohen
Post by Stefan Hajnoczi
Post by William Cohen
http://blog.vmsplice.net/2011/03/how-to-write-trace-analysis-scripts-for.html
$ stap virtqueueleaks.stp
^C
    pid     elem
  19503  1c4af28
  19503  1c56f88
  19503  1c62fe8
  19503  1c6f048
  19503  1c7b0a8
  19503  1c87108
  19503  1c93168
...
I am not that familiar with the internals of qemu. The script seems to indicate that qemu is leaking, but is that really the case?  If there are resource leaks, what output would help debug them? What enhancements could make this script provide more useful information?
Hi Stefan,
Thanks for the comments.
Post by Stefan Hajnoczi
Leak tracing always has corner cases :).
With virtio-blk this would indicate a leak because it uses a
request-response model where the guest initiates I/O and the host
responds.  A guest that cleanly shuts down before you exit your
SystemTap script should not leak requests for virtio-blk.
I stopped the SystemTap script while the guest VM was still running. So when the guest VM cleanly shuts down, there should be a series of virtqueue_fill operations that remove those elements?
In the case of virtio-blk we only see pop and fill when the guest
tells the host to process an I/O request. That means once the request
is complete we've done the fill and there is no outstanding virtqueue
element anymore. When a guest shuts down cleanly it will have no
pending virtio-blk I/O requests and hence the pops and fills balance
out (there is no leak).

I tried to explain that with virtio-net the situation is different
since the guest gives us a bunch of virtqueue elements that point to
receive packet buffers. When the guest shuts down it will reset the
virtio device, which will clear all virtqueues and hence the popped
elements are no longer relevant. I CCed you on a patch that adds a
virtio_set_status() trace event so we see when the guest driver resets
the virtio device.

The upshot of this is that you can use these trace events to check
that QEMU is returning all virtqueue elements to the guest and not
holding on to (leaking) them. It's easy for virtio-blk.
Post by William Cohen
Qemu uses a thread for each virtual processor, but a single thread to handle all IO. It seems like that might be a possible bottleneck. What would be the path of an I/O event from the guest to the host and back to the guest? Is there something a script could do to gauge the delay caused by the qemu I/O thread servicing multiple processors?
I think it would be fantastic to look at the vm_exit and vm_entry
events. These are the events that get raised each time a vcpu exits
guest mode and returns to the kvm.ko host kernel module and every time
we re-enter guest mode. There are a couple of different ways to work
with this data:

1. Average, min, max, std deviation. Answers what the general
vm_exit->vm_entry latency looks like.
2. Heat map of vm_exit->vm_entry latencies. We can see more in a
heatmap, including how the latencies change over time and which are
most frequent.
3. Detailed vm_exit->vm_entry trace logs together with the events that
lead to the exit (e.g. I/O register access by guest). This
information is useful for finding out why certain exits happened but
can already be captured today, so I would focus on 1 & 2 above which
can make use of SystemTap's power.

This sort of information can help track down instances where the vcpu
is being held up, either because QEMU's global mutex is held by
another thread or for various other reasons.
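A sketch of approach 1, with a log2 histogram as a rough substitute for the heatmap in 2, using the host kernel's kvm tracepoints (tracepoint names and availability depend on the kernel build):

```systemtap
# Measure kvm_exit -> kvm_entry latency per vcpu thread, i.e. the
# time spent outside guest mode on each exit.
global exit_ts, lat

probe kernel.trace("kvm_exit") {
    exit_ts[tid()] = gettimeofday_ns()
}

probe kernel.trace("kvm_entry") {
    t = exit_ts[tid()]
    if (t) {
        lat <<< gettimeofday_ns() - t
        delete exit_ts[tid()]
    }
}

probe end {
    if (@count(lat)) {
        printf("exits: %d  min/avg/max: %d/%d/%d ns\n",
               @count(lat), @min(lat), @avg(lat), @max(lat))
        # Log2 histogram shows the latency distribution over the run.
        print(@hist_log(lat))
    }
}
```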

Stefan