Discussion:
Using the qemu tracepoints with SystemTap
William Cohen
2011-09-12 15:33:02 UTC
Permalink
Hi All,

The RHEL-6 version of qemu-kvm makes the tracepoints available to SystemTap. I have been working on useful examples for the SystemTap tracepoints in qemu. There doesn't seem to be a great number of examples showing the utility of the tracepoints in diagnosing problems. However, I came across the following blog entry that had several examples:

http://blog.vmsplice.net/2011/03/how-to-write-trace-analysis-scripts-for.html

I reimplemented the VirtqueueRequestTracker example from the blog in SystemTap (the attached virtqueueleaks.stp). I can run it against RHEL-6's qemu-kvm-0.12.1.2-2.160.el6_1.8.x86_64. When the script is stopped, it prints the pid and the address of each elem that leaked:

$ stap virtqueueleaks.stp
^C
    pid     elem
  19503  1c4af28
  19503  1c56f88
  19503  1c62fe8
  19503  1c6f048
  19503  1c7b0a8
  19503  1c87108
  19503  1c93168
...
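For readers without the attachment, the core logic of the script is roughly the following sketch (the qemu.kvm.virtqueue_pop/virtqueue_fill probe names follow the RHEL-6 tapset; the elem argument name is an assumption):

```systemtap
# Sketch: track virtqueue elements that were popped but never filled.
global elems

probe qemu.kvm.virtqueue_pop {
    # An element was handed from the guest to QEMU.
    elems[pid(), elem] = 1
}

probe qemu.kvm.virtqueue_fill {
    # The element was returned to the guest; it is no longer outstanding.
    delete elems[pid(), elem]
}

probe end {
    # Whatever is still in the array when the script stops looks like a leak.
    printf("%8s %16s\n", "pid", "elem")
    foreach ([p, e] in elems)
        printf("%8d %16x\n", p, e)
}
```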

I am not that familiar with the internals of qemu. The script seems to indicate that qemu is leaking, but is that really the case? If there are resource leaks, what output would help debug them? What enhancements could make this script provide more useful information?

Are there other examples of qemu probing people would like to see?

-Will
Stefan Hajnoczi
2011-09-13 10:03:46 UTC
Permalink
Post by William Cohen
http://blog.vmsplice.net/2011/03/how-to-write-trace-analysis-scripts-for.html
$ stap virtqueueleaks.stp
^C
    pid     elem
  19503  1c4af28
  19503  1c56f88
  19503  1c62fe8
  19503  1c6f048
  19503  1c7b0a8
  19503  1c87108
  19503  1c93168
...
I am not that familiar with the internals of qemu. The script seems to indicate that qemu is leaking, but is that really the case? If there are resource leaks, what output would help debug them? What enhancements could make this script provide more useful information?
Leak tracing always has corner cases :).

With virtio-blk this would indicate a leak because it uses a
request-response model where the guest initiates I/O and the host
responds. A guest that cleanly shuts down before you exit your
SystemTap script should not leak requests for virtio-blk.

With virtio-net the guest actually hands the host receive buffers, and
they are held until we can receive packets into them and return them
to the guest. We don't have a virtio_reset trace event, so we're not
accounting for clean shutdown (the guest driver resets the device to
clear all virtqueues).

I am submitting a patch to add virtio_reset() tracing. This will
allow the script to delete all elements belonging to this virtio
device.
Post by William Cohen
Are there other examples of qemu probing people would like to see?
The malloc/realloc/memalign/vmalloc/free/vfree trace events can be
used for a few things:
* Finding memory leaks.
* Finding malloc/vfree or vmalloc/free mismatches. The rules are:
malloc/realloc need free, memalign/vmalloc need vfree. They cannot be
mixed.
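A sketch of a mismatch detector along those lines (the probe and argument names, e.g. ptr, are assumptions about the RHEL-6 tapset, and the qemu_vmalloc/qemu_vfree events may not fire on all builds):

```systemtap
# Record which allocator family produced each pointer, then flag
# frees that come from the wrong family: malloc/realloc pair with
# free, memalign/vmalloc pair with vfree.
global family

probe qemu.kvm.qemu_malloc, qemu.kvm.qemu_realloc {
    family[pid(), ptr] = "malloc"
}
probe qemu.kvm.qemu_memalign, qemu.kvm.qemu_vmalloc {
    family[pid(), ptr] = "vmalloc"
}
probe qemu.kvm.qemu_free {
    if ([pid(), ptr] in family && family[pid(), ptr] != "malloc")
        printf("mismatch: free() of vmalloc'd %x in pid %d\n", ptr, pid())
    delete family[pid(), ptr]
}
probe qemu.kvm.qemu_vfree {
    if ([pid(), ptr] in family && family[pid(), ptr] != "vmalloc")
        printf("mismatch: vfree() of malloc'd %x in pid %d\n", ptr, pid())
    delete family[pid(), ptr]
}
```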

Stefan
William Cohen
2011-09-13 16:10:47 UTC
Permalink
Post by Stefan Hajnoczi
Post by William Cohen
http://blog.vmsplice.net/2011/03/how-to-write-trace-analysis-scripts-for.html
$ stap virtqueueleaks.stp
^C
pid elem
19503 1c4af28
19503 1c56f88
19503 1c62fe8
19503 1c6f048
19503 1c7b0a8
19503 1c87108
19503 1c93168
...
I am not that familiar with the internals of qemu. The script seems to indicate that qemu is leaking, but is that really the case? If there are resource leaks, what output would help debug them? What enhancements could make this script provide more useful information?
Hi Stefan,

Thanks for the comments.
Post by Stefan Hajnoczi
Leak tracing always has corner cases :).
With virtio-blk this would indicate a leak because it uses a
request-response model where the guest initiates I/O and the host
responds. A guest that cleanly shuts down before you exit your
SystemTap script should not leak requests for virtio-blk.
I stopped the SystemTap script while the guest VM was still running. So when the guest VM cleanly shuts down, there should be a series of virtqueue_fill operations that remove those elements?

Qemu uses a thread for each virtual processor, but a single thread to handle all IO. It seems like that might be a possible bottleneck. What would be the path of an I/O event from the guest to the host and back to the guest? Is there something a script could do to gauge the delay caused by the qemu I/O thread servicing multiple processors?
Post by Stefan Hajnoczi
With virtio-net the guest actually hands the host receive buffers and
they are held until we can receive packets into them and return them
to the host. We don't have a virtio_reset trace event, and due to
this we're not accounting for clean shutdown (the guest driver resets
the device to clear all virtqueues).
I am submitting a patch to add virtio_reset() tracing. This will
allow the script to delete all elements belonging to this virtio
device.
Post by William Cohen
Are there other examples of qemu probing people would like to see?
The malloc/realloc/memalign/vmalloc/free/vfree trace events can be
* Finding memory leaks.
malloc/realloc need free, memalign/vmalloc need vfree. They cannot be
mixed.
Stefan
As a quick and simple experiment to see how often the various probes are hit, I used the following script on RHEL-6 (the probe points are named a bit differently on Fedora):

global counts
# Count every firing of any qemu probe, keyed by probe name; the
# trailing '?' keeps stap from failing on probe points absent in this build.
probe qemu.*.*? { counts[pn()]++ }
# On Ctrl-C, print the totals sorted by probe name.
probe end { foreach (n+ in counts) printf("%s = %d\n", n, counts[n]) }

For starting up a fedora 14 guest virtual machine and shutting it down I got the following output:

$ stap ~/research/profiling/examples/qemu_count.s
^Cqemu.kvm.balloon_event = 1
qemu.kvm.bdrv_aio_multiwrite = 155
qemu.kvm.bdrv_aio_readv = 13284
qemu.kvm.bdrv_aio_writev = 998
qemu.kvm.cpu_get_apic_base = 20
qemu.kvm.cpu_in = 94082
qemu.kvm.cpu_out = 165789
qemu.kvm.cpu_set_apic_base = 445752
qemu.kvm.multiwrite_cb = 654
qemu.kvm.paio_submit = 7141
qemu.kvm.qemu_free = 677704
qemu.kvm.qemu_malloc = 683031
qemu.kvm.qemu_memalign = 285
qemu.kvm.qemu_realloc = 47550
qemu.kvm.virtio_blk_handle_write = 504
qemu.kvm.virtio_blk_req_complete = 7146
qemu.kvm.virtio_blk_rw_complete = 7146
qemu.kvm.virtio_notify = 6574
qemu.kvm.virtio_queue_notify = 6680
qemu.kvm.virtqueue_fill = 7146
qemu.kvm.virtqueue_flush = 7146
qemu.kvm.virtqueue_pop = 7147
qemu.kvm.vm_state_notify = 1


There are a lot of qemu.kvm.qemu_malloc events. This is likely more than systemtap can track if there are thousands of allocations live at the same time. There are no qemu_vmalloc events because of https://bugzilla.redhat.com/show_bug.cgi?id=714773.

Should the qemu.kvm.cpu_in and qemu.kvm.cpu_out counts match up? There are a lot more qemu.kvm.cpu_out events than qemu.kvm.cpu_in events.

Note that qemu.kvm.virtio_blk_req_complete, qemu.kvm.virtio_blk_rw_complete, qemu.kvm.virtqueue_fill, and qemu.kvm.virtqueue_flush all have the same count, 7146, and qemu.kvm.virtqueue_pop is close behind at 7147.

-Will
William Cohen
2011-09-13 19:38:07 UTC
Permalink
Post by William Cohen
Should the qemu.kvm.cpu_in and qemu.kvm.cpu_out counts match up? There are a lot more qemu.kvm.cpu_out events than qemu.kvm.cpu_in events.
I found that cpu_in and cpu_out refer to I/O port input and output instructions. I wrote a little script to tally up the input and output operations on each port, and ran it against qemu on an fc15 machine.
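The script itself is not attached; a minimal version might look like this (assuming the cpu_in/cpu_out probes expose a port argument):

```systemtap
# Tally I/O port accesses by direction and port number.
global inports, outports

probe qemu.kvm.cpu_in  { inports[port]++ }
probe qemu.kvm.cpu_out { outports[port]++ }

probe end {
    printf("cpu_in\n port    count\n")
    foreach (p+ in inports)
        printf("0x%04x %8d\n", p, inports[p])
    printf("\ncpu_out\n port    count\n")
    foreach (p+ in outports)
        printf("0x%04x %8d\n", p, outports[p])
}
```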

It generates output like the following:


cpu_in
 port    count
0x01f7     3000
0x03d5      120
0xc000     2000
0xc002     3000

cpu_out
 port    count
0x0080      480
0x01f1     2000
0x01f2     2000
0x01f3     2000
0x01f4     2000
0x01f5     2000
0x01f6     2000
0x01f7     1000
0x03d4      480
0x03d5      120
0x03f6     1000
0xc000     3000
0xc002     2000
0xc004     1000
0xc090        4

Looks like lots of touching of the IDE device ports (0x01f0-0x01ff) and some VGA controller ports (0x03d0-0x03df). This is roughly what would be expected while the guest virtual machines are doing an fsck and SELinux relabel. It also looks like there is some PCI configuration access (http://www.tech-pro.net/intro_pci.html).

-Will
Stefan Hajnoczi
2011-09-14 14:51:27 UTC
Permalink
Should the qemu.kvm.cpu_in and qemu.kvm.cpu_out counts match up?  There are a lot more qemu.kvm.cpu_out events than qemu.kvm.cpu_in events.
I found that cpu_in and cpu_out refer to I/O port input and output instructions.  I wrote a little script to tally up the input and output operations on each port, and ran it against qemu on an fc15 machine.
cpu_in
 port    count
0x01f7     3000
0x03d5      120
0xc000     2000
0xc002     3000
cpu_out
 port    count
0x0080      480
0x01f1     2000
0x01f2     2000
0x01f3     2000
0x01f4     2000
0x01f5     2000
0x01f6     2000
0x01f7     1000
0x03d4      480
0x03d5      120
0x03f6     1000
0xc000     3000
0xc002     2000
0xc004     1000
0xc090        4
Looks like lots of touching of the IDE device ports (0x01f0-0x01ff) and some VGA controller ports (0x03d0-0x03df). This is roughly what would be expected while the guest virtual machines are doing an fsck and SELinux relabel. It also looks like there is some PCI configuration access (http://www.tech-pro.net/intro_pci.html).
I think the cpu_in/cpu_out tracing script can be useful in identifying
performance problems, especially when obscure guest OSes experience
poor performance due to weird ways of interacting with hardware.

Where are you putting your SystemTap scripts? I suggest creating a
public git repo and adding a link from the QEMU wiki tracing page:
http://wiki.qemu.org/Features/Tracing

Perhaps we could even include them in a contrib/ or similar directory
in qemu.git so that distros can ship them. But before we worry about
that we need a useful set of things that can be observed.

Stefan
Frank Ch. Eigler
2011-09-14 15:20:29 UTC
Permalink
Post by Stefan Hajnoczi
[...]
Where are you putting your SystemTap scripts? I suggest creating a
http://wiki.qemu.org/Features/Tracing
Perhaps we could even include them in a contrib/ or similar directory
in qemu.git so that distros can ship them. But before we worry about
that we need a useful set of things that can be observed.
(We can also or instead ship them within systemtap's bundled sources.)

- FChE
William Cohen
2011-09-14 16:02:39 UTC
Permalink
Post by Stefan Hajnoczi
Post by William Cohen
Post by William Cohen
Should the qemu.kvm.cpu_in and qemu.kvm.cpu_out counts match up? There are a lot more qemu.kvm.cpu_out events than qemu.kvm.cpu_in events.
I found that cpu_in and cpu_out refer to I/O port input and output instructions. I wrote a little script to tally up the input and output operations on each port, and ran it against qemu on an fc15 machine.
cpu_in
port count
0x01f7 3000
0x03d5 120
0xc000 2000
0xc002 3000
cpu_out
port count
0x0080 480
0x01f1 2000
0x01f2 2000
0x01f3 2000
0x01f4 2000
0x01f5 2000
0x01f6 2000
0x01f7 1000
0x03d4 480
0x03d5 120
0x03f6 1000
0xc000 3000
0xc002 2000
0xc004 1000
0xc090 4
Looks like lots of touching of the IDE device ports (0x01f0-0x01ff) and some VGA controller ports (0x03d0-0x03df). This is roughly what would be expected while the guest virtual machines are doing an fsck and SELinux relabel. It also looks like there is some PCI configuration access (http://www.tech-pro.net/intro_pci.html).
I think the cpu_in/cpu_out tracing script can be useful in identifying
performance problems, especially when obscure guest OSes experience
poor performance due to weird ways of interacting with hardware.
Glad this example looks useful.
Post by Stefan Hajnoczi
Where are you putting your SystemTap scripts? I suggest creating a
http://wiki.qemu.org/Features/Tracing
Definitely need a more lasting place to put the QEMU examples. SystemTap has an examples directory (http://sourceware.org/systemtap/examples/) into which we try to put things like this; those examples are run as part of the testsuite. These qemu examples are not yet ready for inclusion there. I have set up https://github.com/wcohen/qemu_systemtap for the qemu examples.
Post by Stefan Hajnoczi
Perhaps we could even include them in a contrib/ or similar directory
in qemu.git so that distros can ship them. But before we worry about
that we need a useful set of things that can be observed.
We definitely want to develop scripts that give people a better understanding of what is going on. I am just now learning how kvm and qemu work, so the early scripts just count events to give people an idea of how frequent the events are. In some cases these could point to issues where assumptions about event frequency turn out to be incorrect. I would like to have scripts that examine sequences of events and flag long latencies, but I don't yet have enough understanding of how kvm and qemu work to implement those.

There are other userspace applications, like python and java, that have similar probes (http://sourceware.org/systemtap/wiki "Applications with built-in User-Space Markers"). Certainly making the systemtap scripts easily available would be good. Shipping the scripts with qemu is one way to do it; the other would be to include them in the systemtap examples. The advantage of shipping with qemu is that the scripts would match the version of qemu being run, avoiding problems with changes in the qemu probe points. Having the scripts in systemtap would allow regular testing of the scripts.

-Will
Stefan Hajnoczi
2011-09-15 10:17:05 UTC
Permalink
Post by William Cohen
Post by Stefan Hajnoczi
Should the qemu.kvm.cpu_in and qemu.kvm.cpu_out counts match up?  There are a lot more qemu.kvm.cpu_out events than qemu.kvm.cpu_in events.
I found that cpu_in and cpu_out refer to I/O port input and output instructions.  I wrote a little script to tally up the input and output operations on each port, and ran it against qemu on an fc15 machine.
cpu_in
 port    count
0x01f7     3000
0x03d5      120
0xc000     2000
0xc002     3000
cpu_out
 port    count
0x0080      480
0x01f1     2000
0x01f2     2000
0x01f3     2000
0x01f4     2000
0x01f5     2000
0x01f6     2000
0x01f7     1000
0x03d4      480
0x03d5      120
0x03f6     1000
0xc000     3000
0xc002     2000
0xc004     1000
0xc090        4
Looks like lots of touching of the IDE device ports (0x01f0-0x01ff) and some VGA controller ports (0x03d0-0x03df). This is roughly what would be expected while the guest virtual machines are doing an fsck and SELinux relabel. It also looks like there is some PCI configuration access (http://www.tech-pro.net/intro_pci.html).
I think the cpu_in/cpu_out tracing script can be useful in identifying
performance problems, especially when obscure guest OSes experience
poor performance due to weird ways of interacting with hardware.
Glad this example looks useful.
Post by Stefan Hajnoczi
Where are you putting your SystemTap scripts?  I suggest creating a
http://wiki.qemu.org/Features/Tracing
Definitely need a more lasting place to put the QEMU examples. SystemTap has an examples directory (http://sourceware.org/systemtap/examples/) into which we try to put things like this; those examples are run as part of the testsuite. These qemu examples are not yet ready for inclusion there. I have set up https://github.com/wcohen/qemu_systemtap for the qemu examples.
Added a link from http://wiki.qemu.org/Features/Tracing.
Post by William Cohen
Post by Stefan Hajnoczi
Perhaps we could even include them in a contrib/ or similar directory
in qemu.git so that distros can ship them.  But before we worry about
that we need a useful set of things that can be observed.
We definitely want to develop scripts that give people a better understanding of what is going on. I am just now learning how kvm and qemu work, so the early scripts just count events to give people an idea of how frequent the events are. In some cases these could point to issues where assumptions about event frequency turn out to be incorrect. I would like to have scripts that examine sequences of events and flag long latencies, but I don't yet have enough understanding of how kvm and qemu work to implement those.
The vmexit/entry things I mentioned in the mail above are events that
can be traced in the host kernel (kvm:kvm_exit and kvm:kvm_entry).
The QEMU-side equivalent is the ioctl(KVM_RUN) but we do not have any
trace events in kvm-all.c yet (!!). A trace event before the ioctl
and one after could be used to trace vcpu code execution from QEMU's
perspective.
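In the meantime, one could approximate this from outside QEMU by timing the KVM_RUN ioctl itself. A sketch, assuming KVM_RUN encodes to 0xae80 (_IO(KVMIO, 0x80)) on this platform and that the SystemTap syscall tapset exposes a request argument:

```systemtap
# Time how long each thread spends inside the KVM_RUN ioctl,
# which roughly corresponds to time spent executing guest code.
global start, guest_time

probe syscall.ioctl {
    if (request == 0xae80)
        start[tid()] = gettimeofday_ns()
}

probe syscall.ioctl.return {
    t = start[tid()]
    if (t) {
        guest_time <<< gettimeofday_ns() - t
        delete start[tid()]
    }
}

probe end {
    if (@count(guest_time))
        printf("KVM_RUN calls: %d  avg time in guest: %d ns\n",
               @count(guest_time), @avg(guest_time))
}
```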

Stefan

Stefan Hajnoczi
2011-09-14 15:32:50 UTC
Permalink
Post by William Cohen
Post by Stefan Hajnoczi
Post by William Cohen
http://blog.vmsplice.net/2011/03/how-to-write-trace-analysis-scripts-for.html
$ stap virtqueueleaks.stp
^C
    pid     elem
  19503  1c4af28
  19503  1c56f88
  19503  1c62fe8
  19503  1c6f048
  19503  1c7b0a8
  19503  1c87108
  19503  1c93168
...
I am not that familiar with the internals of qemu. The script seems to indicate that qemu is leaking, but is that really the case?  If there are resource leaks, what output would help debug them? What enhancements could make this script provide more useful information?
Hi Stefan,
Thanks for the comments.
Post by Stefan Hajnoczi
Leak tracing always has corner cases :).
With virtio-blk this would indicate a leak because it uses a
request-response model where the guest initiates I/O and the host
responds.  A guest that cleanly shuts down before you exit your
SystemTap script should not leak requests for virtio-blk.
I stopped the SystemTap script while the guest VM was still running. So when the guest VM cleanly shuts down, there should be a series of virtqueue_fill operations that remove those elements?
In the case of virtio-blk we only see pop and fill when the guest
tells the host to process an I/O request. That means once the request
is complete we've done the fill and there is no outstanding virtqueue
element anymore. When a guest shuts down cleanly it will have no
pending virtio-blk I/O requests and hence the pops and fills balance
out (there is no leak).

I tried to explain that with virtio-net the situation is different
since the guest gives us a bunch of virtqueue elements that point to
receive packet buffers. When the guest shuts down it will reset the
virtio device, which will clear all virtqueues and hence the popped
elements are no longer relevant. I CCed you on a patch that adds a
virtio_set_status() trace event so we see when the guest driver resets
the virtio device.

The upshot of this is that you can use these trace events to check
that QEMU is returning all virtqueue elements to the guest and not
holding on to (leaking) them. It's easy for virtio-blk.
Post by William Cohen
Qemu uses a thread for each virtual processor, but a single thread to handle all IO. It seems like that might be a possible bottleneck. What would be the path of an I/O event from the guest to the host and back to the guest? Is there something a script could do to gauge the delay caused by the qemu I/O thread servicing multiple processors?
I think it would be fantastic to look at the vm_exit and vm_entry
events. These are the events that get raised each time a vcpu exits
guest mode and returns to the kvm.ko host kernel module and every time
we re-enter guest mode. There are a couple of different ways to work
with this data:

1. Average, min, max, std deviation. Answers what the general
vm_exit->vm_entry latency looks like.
2. Heat map of vm_exit->vm_entry latencies. We can see more in a
heatmap, including how the latencies change over time and which are
most frequent.
3. Detailed vm_exit->vm_entry trace logs together with the events that
lead to the exit (e.g. I/O register access by guest). This
information is useful for finding out why certain exits happened but
can already be captured today, so I would focus on 1 & 2 above which
can make use of SystemTap's power.

This sort of information can help track down instances where the vcpu
is being held up, either because QEMU's global mutex is held by
another thread or for various other reasons.
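A sketch of approach 1, with a log2 histogram as a rough substitute for the heatmap in 2, using the host kernel's kvm tracepoints (tracepoint names and availability depend on the kernel build):

```systemtap
# Measure kvm_exit -> kvm_entry latency per vcpu thread, i.e. the
# time spent outside guest mode on each exit.
global exit_ts, lat

probe kernel.trace("kvm_exit") {
    exit_ts[tid()] = gettimeofday_ns()
}

probe kernel.trace("kvm_entry") {
    t = exit_ts[tid()]
    if (t) {
        lat <<< gettimeofday_ns() - t
        delete exit_ts[tid()]
    }
}

probe end {
    if (@count(lat)) {
        printf("exits: %d  min/avg/max: %d/%d/%d ns\n",
               @count(lat), @min(lat), @avg(lat), @max(lat))
        # Log2 histogram shows the latency distribution over the run.
        print(@hist_log(lat))
    }
}
```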

Stefan