Discussion:
[Qemu-devel] Lock contention in QEMU
Weiwei Jia
2016-12-14 05:58:11 UTC
Hi Stefan,

I find that the timeslice of a vCPU thread in QEMU/KVM is unstable when there
are many read requests from the guest OS (for example, reading 4KB at a time,
8GB in total, from one file). I also find that this phenomenon may be caused
by lock contention in the QEMU layer. I observe the problem under the
following workload.

Workload settings:
The VMM has 6 pCPUs: pCPU0, pCPU1, pCPU2, pCPU3, pCPU4 and pCPU5. Two KVM
virtual machines (VM1 and VM2) run on the VMM, each with 5 virtual CPUs
(vCPU0 through vCPU4). vCPU0 of VM1 and vCPU0 of VM2 are pinned to pCPU0 and
pCPU5 respectively and are dedicated to handling interrupts. vCPU1 of VM1 and
vCPU1 of VM2 are pinned to pCPU1; vCPU2 of VM1 and vCPU2 of VM2 are pinned to
pCPU2; vCPU3 of VM1 and vCPU3 of VM2 are pinned to pCPU3; vCPU4 of VM1 and
vCPU4 of VM2 are pinned to pCPU4. Except for vCPU0 of VM2 (pinned to pCPU5),
every vCPU in VM1 and VM2 runs one CPU-intensive thread (while(1){i++}) so
that the vCPU never goes idle. In VM1, I start one I/O thread on vCPU2, which
reads 4KB from one file at a time (8GB in total). The I/O scheduler in VM1 and
VM2 is NOOP; the I/O scheduler in the VMM is CFQ. I also pinned the I/O worker
threads launched by QEMU to pCPU5 (note: there is no CPU-intensive thread on
pCPU5, so I/O requests are handled by the QEMU I/O worker threads as soon as
possible). The process scheduling class in the VMs and the VMM is CFS.

Linux Kernel version for VMM is: 3.16.39
Linux Kernel version for VM1 and VM2 is: 4.7.4
QEMU emulator version is: 2.0.0

When I run the above workload, I find that the timeslice of the vCPU2 thread
jitters a lot. I suspect this is triggered by lock contention in the QEMU
layer, because the debug log I added in front of the VMM Linux kernel's
schedule->__schedule->context_switch path looks like the following. Whenever
the timeslice jitters heavily, this debug information appears.

7097537 Dec 13 11:22:33 mobius04 kernel: [39163.015789] Call Trace:
7097538 Dec 13 11:22:33 mobius04 kernel: [39163.015791]
[<ffffffff8176b2f0>] dump_stack+0x64/0x84
7097539 Dec 13 11:22:33 mobius04 kernel: [39163.015793]
[<ffffffff8176bf85>] __schedule+0x5b5/0x960
7097540 Dec 13 11:22:33 mobius04 kernel: [39163.015794]
[<ffffffff8176c409>] schedule+0x29/0x70
7097541 Dec 13 11:22:33 mobius04 kernel: [39163.015796]
[<ffffffff810ef4d8>] futex_wait_queue_me+0xd8/0x150
7097542 Dec 13 11:22:33 mobius04 kernel: [39163.015798]
[<ffffffff810ef6fb>] futex_wait+0x1ab/0x2b0
7097543 Dec 13 11:22:33 mobius04 kernel: [39163.015800]
[<ffffffff810eef00>] ? get_futex_key+0x2d0/0x2e0
7097544 Dec 13 11:22:33 mobius04 kernel: [39163.015804]
[<ffffffffc0290105>] ? __vmx_load_host_state+0x125/0x170 [kvm_intel]
7097545 Dec 13 11:22:33 mobius04 kernel: [39163.015805]
[<ffffffff810f1275>] do_futex+0xf5/0xd20
7097546 Dec 13 11:22:33 mobius04 kernel: [39163.015813]
[<ffffffffc0222690>] ? kvm_vcpu_ioctl+0x100/0x560 [kvm]
7097547 Dec 13 11:22:33 mobius04 kernel: [39163.015816]
[<ffffffff810b06f0>] ? __dequeue_entity+0x30/0x50
7097548 Dec 13 11:22:33 mobius04 kernel: [39163.015818]
[<ffffffff81013d06>] ? __switch_to+0x596/0x690
7097549 Dec 13 11:22:33 mobius04 kernel: [39163.015820]
[<ffffffff811f9f23>] ? do_vfs_ioctl+0x93/0x520
7097550 Dec 13 11:22:33 mobius04 kernel: [39163.015822]
[<ffffffff810f1f1d>] SyS_futex+0x7d/0x170
7097551 Dec 13 11:22:33 mobius04 kernel: [39163.015824]
[<ffffffff8116d1b2>] ? fire_user_return_notifiers+0x42/0x50
7097552 Dec 13 11:22:33 mobius04 kernel: [39163.015826]
[<ffffffff810154b5>] ? do_notify_resume+0xc5/0x100
7097553 Dec 13 11:22:33 mobius04 kernel: [39163.015828]
[<ffffffff81770a8d>] system_call_fastpath+0x1a/0x1f


If so, I think this may be a scalability problem in the QEMU I/O path. Does
QEMU have a feature to avoid this? Could you please give me some suggestions
on how to keep the timeslice of the vCPU2 thread stable even when it issues
many I/O read requests? Thank you.

Best,
Weiwei Jia
Stefan Hajnoczi
2016-12-14 19:31:07 UTC
Post by Weiwei Jia
I find that the timeslice of a vCPU thread in QEMU/KVM is unstable when there
are many read requests (for example, reading 4KB at a time, 8GB in total,
from one file) from the guest OS. I also find that this phenomenon may be
caused by lock contention in the QEMU layer. I observe the problem under the
following workload.
The VMM has 6 pCPUs: pCPU0, pCPU1, pCPU2, pCPU3, pCPU4 and pCPU5. Two KVM
virtual machines (VM1 and VM2) run on the VMM, each with 5 virtual CPUs
(vCPU0 through vCPU4). vCPU0 of VM1 and vCPU0 of VM2 are pinned to pCPU0 and
pCPU5 respectively and are dedicated to handling interrupts. vCPU1 of VM1 and
vCPU1 of VM2 are pinned to pCPU1; vCPU2 of VM1 and vCPU2 of VM2 are pinned to
pCPU2; vCPU3 of VM1 and vCPU3 of VM2 are pinned to pCPU3; vCPU4 of VM1 and
vCPU4 of VM2 are pinned to pCPU4. Except for vCPU0 of VM2 (pinned to pCPU5),
every vCPU in VM1 and VM2 runs one CPU-intensive thread (while(1){i++}) so
that the vCPU never goes idle. In VM1, I start one I/O thread on vCPU2, which
reads 4KB from one file at a time (8GB in total). The I/O scheduler in VM1 and
VM2 is NOOP; the I/O scheduler in the VMM is CFQ. I also pinned the I/O worker
threads launched by QEMU to pCPU5 (note: there is no CPU-intensive thread on
pCPU5, so I/O requests are handled by the QEMU I/O worker threads as soon as
possible). The process scheduling class in the VMs and the VMM is CFS.
Did you pin the QEMU main loop to pCPU5? This is the QEMU process' main
thread and it handles ioeventfd (virtqueue kick) and thread pool
completions.
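
For a libvirt-managed guest, one way to pin the main loop is the
<cputune><emulatorpin> element; a minimal sketch, with cpuset='5' chosen only
to match the pCPU5 setup described above:

  <cputune>
    <!-- pin the QEMU main loop (emulator) thread to pCPU5 -->
    <emulatorpin cpuset='5'/>
  </cputune>

Alternatively, running taskset on the main thread's TID achieves the same
pinning at runtime.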
Post by Weiwei Jia
Linux Kernel version for VMM is: 3.16.39
Linux Kernel version for VM1 and VM2 is: 4.7.4
QEMU emulator version is: 2.0.0
When I run the above workload, I find that the timeslice of the vCPU2 thread
jitters a lot. I suspect this is triggered by lock contention in the QEMU
layer, because the debug log I added in front of the VMM Linux kernel's
schedule->__schedule->context_switch path looks like the following. Whenever
the timeslice jitters heavily, this debug information appears.
7097538 Dec 13 11:22:33 mobius04 kernel: [39163.015791]
[<ffffffff8176b2f0>] dump_stack+0x64/0x84
7097539 Dec 13 11:22:33 mobius04 kernel: [39163.015793]
[<ffffffff8176bf85>] __schedule+0x5b5/0x960
7097540 Dec 13 11:22:33 mobius04 kernel: [39163.015794]
[<ffffffff8176c409>] schedule+0x29/0x70
7097541 Dec 13 11:22:33 mobius04 kernel: [39163.015796]
[<ffffffff810ef4d8>] futex_wait_queue_me+0xd8/0x150
7097542 Dec 13 11:22:33 mobius04 kernel: [39163.015798]
[<ffffffff810ef6fb>] futex_wait+0x1ab/0x2b0
7097543 Dec 13 11:22:33 mobius04 kernel: [39163.015800]
[<ffffffff810eef00>] ? get_futex_key+0x2d0/0x2e0
7097544 Dec 13 11:22:33 mobius04 kernel: [39163.015804]
[<ffffffffc0290105>] ? __vmx_load_host_state+0x125/0x170 [kvm_intel]
7097545 Dec 13 11:22:33 mobius04 kernel: [39163.015805]
[<ffffffff810f1275>] do_futex+0xf5/0xd20
7097546 Dec 13 11:22:33 mobius04 kernel: [39163.015813]
[<ffffffffc0222690>] ? kvm_vcpu_ioctl+0x100/0x560 [kvm]
7097547 Dec 13 11:22:33 mobius04 kernel: [39163.015816]
[<ffffffff810b06f0>] ? __dequeue_entity+0x30/0x50
7097548 Dec 13 11:22:33 mobius04 kernel: [39163.015818]
[<ffffffff81013d06>] ? __switch_to+0x596/0x690
7097549 Dec 13 11:22:33 mobius04 kernel: [39163.015820]
[<ffffffff811f9f23>] ? do_vfs_ioctl+0x93/0x520
7097550 Dec 13 11:22:33 mobius04 kernel: [39163.015822]
[<ffffffff810f1f1d>] SyS_futex+0x7d/0x170
7097551 Dec 13 11:22:33 mobius04 kernel: [39163.015824]
[<ffffffff8116d1b2>] ? fire_user_return_notifiers+0x42/0x50
7097552 Dec 13 11:22:33 mobius04 kernel: [39163.015826]
[<ffffffff810154b5>] ? do_notify_resume+0xc5/0x100
7097553 Dec 13 11:22:33 mobius04 kernel: [39163.015828]
[<ffffffff81770a8d>] system_call_fastpath+0x1a/0x1f
If so, I think this may be a scalability problem in the QEMU I/O path. Does
QEMU have a feature to avoid this? Could you please give me some suggestions
on how to keep the timeslice of the vCPU2 thread stable even when it issues
many I/O read requests?
Yes, there is a way to reduce jitter caused by the QEMU global mutex:

qemu -object iothread,id=iothread0 \
-drive if=none,id=drive0,file=test.img,format=raw,cache=none \
-device virtio-blk-pci,iothread=iothread0,drive=drive0

Now the ioeventfd and thread pool completions will be processed in
iothread0 instead of the QEMU main loop thread. This thread does not
take the QEMU global mutex so vcpu execution is not hindered.

This feature is called virtio-blk dataplane.

You can query IOThread thread IDs using the query-iothreads QMP command.
This will allow you to pin iothread0 to pCPU5.

Please let us know if this helps.

Stefan
Weiwei Jia
2016-12-15 01:06:10 UTC
Hi Stefan,

Thanks for your reply. Please see the inline replies.
Post by Stefan Hajnoczi
Post by Weiwei Jia
I find that the timeslice of a vCPU thread in QEMU/KVM is unstable when there
are many read requests (for example, reading 4KB at a time, 8GB in total,
from one file) from the guest OS. I also find that this phenomenon may be
caused by lock contention in the QEMU layer. I observe the problem under the
following workload.
The VMM has 6 pCPUs: pCPU0, pCPU1, pCPU2, pCPU3, pCPU4 and pCPU5. Two KVM
virtual machines (VM1 and VM2) run on the VMM, each with 5 virtual CPUs
(vCPU0 through vCPU4). vCPU0 of VM1 and vCPU0 of VM2 are pinned to pCPU0 and
pCPU5 respectively and are dedicated to handling interrupts. vCPU1 of VM1 and
vCPU1 of VM2 are pinned to pCPU1; vCPU2 of VM1 and vCPU2 of VM2 are pinned to
pCPU2; vCPU3 of VM1 and vCPU3 of VM2 are pinned to pCPU3; vCPU4 of VM1 and
vCPU4 of VM2 are pinned to pCPU4. Except for vCPU0 of VM2 (pinned to pCPU5),
every vCPU in VM1 and VM2 runs one CPU-intensive thread (while(1){i++}) so
that the vCPU never goes idle. In VM1, I start one I/O thread on vCPU2, which
reads 4KB from one file at a time (8GB in total). The I/O scheduler in VM1 and
VM2 is NOOP; the I/O scheduler in the VMM is CFQ. I also pinned the I/O worker
threads launched by QEMU to pCPU5 (note: there is no CPU-intensive thread on
pCPU5, so I/O requests are handled by the QEMU I/O worker threads as soon as
possible). The process scheduling class in the VMs and the VMM is CFS.
Did you pin the QEMU main loop to pCPU5? This is the QEMU process' main
thread and it handles ioeventfd (virtqueue kick) and thread pool
completions.
No, I did not pin the main loop to pCPU5. Do you mean that if I pin the QEMU
main loop to pCPU5 under the above workload, the timeslice of the vCPU2 thread
will be stable even when there are many I/O requests? I do not use virtio for
the VM; I use SCSI. My whole VM XML configuration file is as follows.

<domain type='kvm' id='2'>
  <name>kvm1</name>
  <uuid>8e9c4603-c4b5-fa41-b251-1dc4ffe1872c</uuid>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>
  </cputune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.0'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/bin/kvm-spice</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source file='/home/images/kvm1.img'/>
      <target dev='hda' bus='scsi'/>
      <alias name='scsi0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
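    <!-- note: this disk is attached via bus='scsi' to the emulated SCSI
         controller defined below (not a virtio device), so the
         ioeventfd/dataplane path discussed later does not apply here -->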
    <disk type='block' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
      <alias name='ide0-1-0'/>
      <address type='drive' controller='0' bus='1' target='0' unit='0'/>
    </disk>
    <controller type='usb' index='0'>
      <alias name='usb0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <controller type='scsi' index='0'>
      <alias name='scsi0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </controller>
    <controller type='ide' index='0'>
      <alias name='ide0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <interface type='network'>
      <mac address='52:54:00:01:ab:ca'/>
      <source network='default'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/13'/>
      <target port='0'/>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/13'>
      <source path='/dev/pts/13'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <graphics type='vnc' port='5900' autoport='yes' listen='127.0.0.1'>
      <listen type='address' address='127.0.0.1'/>
    </graphics>
    <video>
      <model type='cirrus' vram='9216' heads='1'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='none'/>
</domain>
Post by Stefan Hajnoczi
Post by Weiwei Jia
Linux Kernel version for VMM is: 3.16.39
Linux Kernel version for VM1 and VM2 is: 4.7.4
QEMU emulator version is: 2.0.0
When I run the above workload, I find that the timeslice of the vCPU2 thread
jitters a lot. I suspect this is triggered by lock contention in the QEMU
layer, because the debug log I added in front of the VMM Linux kernel's
schedule->__schedule->context_switch path looks like the following. Whenever
the timeslice jitters heavily, this debug information appears.
7097538 Dec 13 11:22:33 mobius04 kernel: [39163.015791]
[<ffffffff8176b2f0>] dump_stack+0x64/0x84
7097539 Dec 13 11:22:33 mobius04 kernel: [39163.015793]
[<ffffffff8176bf85>] __schedule+0x5b5/0x960
7097540 Dec 13 11:22:33 mobius04 kernel: [39163.015794]
[<ffffffff8176c409>] schedule+0x29/0x70
7097541 Dec 13 11:22:33 mobius04 kernel: [39163.015796]
[<ffffffff810ef4d8>] futex_wait_queue_me+0xd8/0x150
7097542 Dec 13 11:22:33 mobius04 kernel: [39163.015798]
[<ffffffff810ef6fb>] futex_wait+0x1ab/0x2b0
7097543 Dec 13 11:22:33 mobius04 kernel: [39163.015800]
[<ffffffff810eef00>] ? get_futex_key+0x2d0/0x2e0
7097544 Dec 13 11:22:33 mobius04 kernel: [39163.015804]
[<ffffffffc0290105>] ? __vmx_load_host_state+0x125/0x170 [kvm_intel]
7097545 Dec 13 11:22:33 mobius04 kernel: [39163.015805]
[<ffffffff810f1275>] do_futex+0xf5/0xd20
7097546 Dec 13 11:22:33 mobius04 kernel: [39163.015813]
[<ffffffffc0222690>] ? kvm_vcpu_ioctl+0x100/0x560 [kvm]
7097547 Dec 13 11:22:33 mobius04 kernel: [39163.015816]
[<ffffffff810b06f0>] ? __dequeue_entity+0x30/0x50
7097548 Dec 13 11:22:33 mobius04 kernel: [39163.015818]
[<ffffffff81013d06>] ? __switch_to+0x596/0x690
7097549 Dec 13 11:22:33 mobius04 kernel: [39163.015820]
[<ffffffff811f9f23>] ? do_vfs_ioctl+0x93/0x520
7097550 Dec 13 11:22:33 mobius04 kernel: [39163.015822]
[<ffffffff810f1f1d>] SyS_futex+0x7d/0x170
7097551 Dec 13 11:22:33 mobius04 kernel: [39163.015824]
[<ffffffff8116d1b2>] ? fire_user_return_notifiers+0x42/0x50
7097552 Dec 13 11:22:33 mobius04 kernel: [39163.015826]
[<ffffffff810154b5>] ? do_notify_resume+0xc5/0x100
7097553 Dec 13 11:22:33 mobius04 kernel: [39163.015828]
[<ffffffff81770a8d>] system_call_fastpath+0x1a/0x1f
If so, I think this may be a scalability problem in the QEMU I/O path. Does
QEMU have a feature to avoid this? Could you please give me some suggestions
on how to keep the timeslice of the vCPU2 thread stable even when it issues
many I/O read requests?
qemu -object iothread,id=iothread0 \
-drive if=none,id=drive0,file=test.img,format=raw,cache=none \
-device virtio-blk-pci,iothread=iothread0,drive=drive0
Now the ioeventfd and thread pool completions will be processed in
iothread0 instead of the QEMU main loop thread. This thread does not
take the QEMU global mutex so vcpu execution is not hindered.
This feature is called virtio-blk dataplane.
You can query IOThread thread IDs using the query-iothreads QMP command.
This will allow you to pin iothread0 to pCPU5.
Please let us know if this helps.
Does this feature only work for VirtIO? Does it work for SCSI or IDE?

Thank you,
Weiwei Jia
Weiwei Jia
2016-12-15 05:17:09 UTC
BTW, is there an example showing users how to express the following
virtio-blk dataplane command line in the libvirt XML configuration file?

qemu -object iothread,id=iothread0 \
-drive if=none,id=drive0,file=test.img,format=raw,cache=none \
-device virtio-blk-pci,iothread=iothread0,drive=drive0


Thank you,
Weiwei Jia
Stefan Hajnoczi
2016-12-15 08:06:34 UTC
Post by Weiwei Jia
BTW, is there an example showing users how to express the following
virtio-blk dataplane command line in the libvirt XML configuration file?
qemu -object iothread,id=iothread0 \
-drive if=none,id=drive0,file=test.img,format=raw,cache=none \
-device virtio-blk-pci,iothread=iothread0,drive=drive0
https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation

See also <cputune><iothreadpin> and <driver iothread=>.
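
For example, a minimal sketch combining those elements (the iothread number
and cpuset below are illustrative values chosen to match the pCPU5 setup
discussed earlier, and the disk must be a virtio-blk device for the iothread
attribute to take effect):

  <domain type='kvm'>
    ...
    <iothreads>1</iothreads>
    <cputune>
      <!-- pin IOThread 1 to pCPU5 -->
      <iothreadpin iothread='1' cpuset='5'/>
    </cputune>
    <devices>
      <disk type='file' device='disk'>
        <!-- run this disk's I/O in IOThread 1 -->
        <driver name='qemu' type='raw' cache='none' iothread='1'/>
        <source file='/home/images/kvm1.img'/>
        <target dev='vda' bus='virtio'/>
      </disk>
    </devices>
  </domain>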
Weiwei Jia
2016-12-15 16:04:23 UTC
Post by Stefan Hajnoczi
Post by Weiwei Jia
BTW, is there an example showing users how to express the following
virtio-blk dataplane command line in the libvirt XML configuration file?
qemu -object iothread,id=iothread0 \
-drive if=none,id=drive0,file=test.img,format=raw,cache=none \
-device virtio-blk-pci,iothread=iothread0,drive=drive0
https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation
See also <cputune><iothreadpin> and <driver iothread=>.
It seems that the libvirt XML configuration at the link above [1] differs
from the configuration you described in your blog [2]. Your blog only explains
how to use x-data-plane for virtio-blk, but the libvirt XML documentation does
not mention x-data-plane at all. Is it supported by default in the latest QEMU
version? If I want to test virtio-blk or virtio-scsi dataplane to make the
timeslice more stable, which version of QEMU/KVM should I use? Would you
please give me some suggestions? Thank you.

[1] https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation
[2] http://blog.vmsplice.net/2013/03/new-in-qemu-14-high-performance-virtio.html

Cheers,
Weiwei Jia
Stefan Hajnoczi
2016-12-16 09:48:50 UTC
Post by Weiwei Jia
Post by Stefan Hajnoczi
Post by Weiwei Jia
BTW, is there an example showing users how to express the following
virtio-blk dataplane command line in the libvirt XML configuration file?
qemu -object iothread,id=iothread0 \
-drive if=none,id=drive0,file=test.img,format=raw,cache=none \
-device virtio-blk-pci,iothread=iothread0,drive=drive0
https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation
See also <cputune><iothreadpin> and <driver iothread=>.
It seems that the libvirt XML configuration at the link above [1] differs
from the configuration you described in your blog [2]. Your blog only explains
how to use x-data-plane for virtio-blk, but the libvirt XML documentation does
not mention x-data-plane at all. Is it supported by default in the latest QEMU
version? If I want to test virtio-blk or virtio-scsi dataplane to make the
timeslice more stable, which version of QEMU/KVM should I use? Would you
please give me some suggestions? Thank you.
[1] https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation
[2] http://blog.vmsplice.net/2013/03/new-in-qemu-14-high-performance-virtio.html
I will update the blog post (from 2013) with the modern libvirt XML
syntax.

Please use the
https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation
documentation (unless you are using a really old QEMU and libvirt!).

Stefan
Weiwei Jia
2016-12-16 14:32:14 UTC
Post by Stefan Hajnoczi
Post by Weiwei Jia
Post by Stefan Hajnoczi
Post by Weiwei Jia
BTW, is there an example showing users how to express the following
virtio-blk dataplane command line in the libvirt XML configuration file?
qemu -object iothread,id=iothread0 \
-drive if=none,id=drive0,file=test.img,format=raw,cache=none \
-device virtio-blk-pci,iothread=iothread0,drive=drive0
https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation
See also <cputune><iothreadpin> and <driver iothread=>.
It seems that the libvirt XML configuration at the link above [1] differs
from the configuration you described in your blog [2]. Your blog only explains
how to use x-data-plane for virtio-blk, but the libvirt XML documentation does
not mention x-data-plane at all. Is it supported by default in the latest QEMU
version? If I want to test virtio-blk or virtio-scsi dataplane to make the
timeslice more stable, which version of QEMU/KVM should I use? Would you
please give me some suggestions? Thank you.
[1] https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation
[2] http://blog.vmsplice.net/2013/03/new-in-qemu-14-high-performance-virtio.html
I will update the blog post (from 2013) with the modern libvirt XML
syntax.
Please use the
https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation
documentation (unless you are using a really old QEMU and libvirt!).
I have tried the old libvirt XML approach from your blog with QEMU v2.2.0 and
libvirt v1.2.2, and everything seems to work well for me. I will try the
modern libvirt XML syntax with the latest QEMU and libvirt.
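
For reference, the old approach relies on libvirt's QEMU command-line
passthrough; it looked roughly like the following sketch, which assumes
libvirt's default alias virtio-disk0 for the first virtio-blk disk:

  <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
    ...
    <qemu:commandline>
      <!-- enable the (then experimental) x-data-plane device property -->
      <qemu:arg value='-set'/>
      <qemu:arg value='device.virtio-disk0.x-data-plane=on'/>
    </qemu:commandline>
  </domain>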

Thank you,
Weiwei Jia
Weiwei Jia
2016-12-16 21:42:54 UTC
Hi Stefan,

I still have one more concern, as follows.

Has x-data-plane been widely used (or accepted) in real systems? I ask because,
if it has not been widely adopted, it may have or cause problems we do not yet
know about. Do you know of any hidden problems that the QEMU x-data-plane
feature may cause in such systems?

Thanks,
Weiwei Jia
Post by Weiwei Jia
Post by Stefan Hajnoczi
Post by Weiwei Jia
Post by Stefan Hajnoczi
Post by Weiwei Jia
BTW, is there an example showing users how to express the following
virtio-blk dataplane command line in the libvirt XML configuration file?
qemu -object iothread,id=iothread0 \
-drive if=none,id=drive0,file=test.img,format=raw,cache=none \
-device virtio-blk-pci,iothread=iothread0,drive=drive0
https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation
See also <cputune><iothreadpin> and <driver iothread=>.
It seems that the libvirt XML configuration at the link above [1] differs
from the configuration you described in your blog [2]. Your blog only explains
how to use x-data-plane for virtio-blk, but the libvirt XML documentation does
not mention x-data-plane at all. Is it supported by default in the latest QEMU
version? If I want to test virtio-blk or virtio-scsi dataplane to make the
timeslice more stable, which version of QEMU/KVM should I use? Would you
please give me some suggestions? Thank you.
[1] https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation
[2] http://blog.vmsplice.net/2013/03/new-in-qemu-14-high-performance-virtio.html
I will update the blog post (from 2013) with the modern libvirt XML
syntax.
Please use the
https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation
documentation (unless you are using a really old QEMU and libvirt!).
I have tried the old libvirt XML approach from your blog with QEMU v2.2.0 and
libvirt v1.2.2, and everything seems to work well for me. I will try the
modern libvirt XML syntax with the latest QEMU and libvirt.
Thank you,
Weiwei Jia
Stefan Hajnoczi
2016-12-15 08:04:53 UTC
Post by Weiwei Jia
Hi Stefan,
Thanks for your reply. Please see the inline replies.
Post by Stefan Hajnoczi
Post by Weiwei Jia
I find that the timeslice of a vCPU thread in QEMU/KVM is unstable when there
are many read requests (for example, reading 4KB at a time, 8GB in total,
from one file) from the guest OS. I also find that this phenomenon may be
caused by lock contention in the QEMU layer. I observe the problem under the
following workload.
The VMM has 6 pCPUs: pCPU0, pCPU1, pCPU2, pCPU3, pCPU4 and pCPU5. Two KVM
virtual machines (VM1 and VM2) run on the VMM, each with 5 virtual CPUs
(vCPU0 through vCPU4). vCPU0 of VM1 and vCPU0 of VM2 are pinned to pCPU0 and
pCPU5 respectively and are dedicated to handling interrupts. vCPU1 of VM1 and
vCPU1 of VM2 are pinned to pCPU1; vCPU2 of VM1 and vCPU2 of VM2 are pinned to
pCPU2; vCPU3 of VM1 and vCPU3 of VM2 are pinned to pCPU3; vCPU4 of VM1 and
vCPU4 of VM2 are pinned to pCPU4. Except for vCPU0 of VM2 (pinned to pCPU5),
every vCPU in VM1 and VM2 runs one CPU-intensive thread (while(1){i++}) so
that the vCPU never goes idle. In VM1, I start one I/O thread on vCPU2, which
reads 4KB from one file at a time (8GB in total). The I/O scheduler in VM1 and
VM2 is NOOP; the I/O scheduler in the VMM is CFQ. I also pinned the I/O worker
threads launched by QEMU to pCPU5 (note: there is no CPU-intensive thread on
pCPU5, so I/O requests are handled by the QEMU I/O worker threads as soon as
possible). The process scheduling class in the VMs and the VMM is CFS.
Did you pin the QEMU main loop to pCPU5? This is the QEMU process' main
thread and it handles ioeventfd (virtqueue kick) and thread pool
completions.
No, I did not pin the main loop to pCPU5. Do you mean that if I pin the QEMU
main loop to pCPU5 under the above workload, the timeslice of the vCPU2 thread
will be stable even when there are many I/O requests? I do not use virtio for
the VM; I use SCSI. My whole VM XML configuration file is as follows.
Pinning the main loop will probably not solve the problem but it might
help a bit. I just noticed it while reading your email because you
pinned everything carefully except the main loop, which is an important
thread.
Post by Weiwei Jia
Post by Stefan Hajnoczi
Post by Weiwei Jia
Linux Kernel version for VMM is: 3.16.39
Linux Kernel version for VM1 and VM2 is: 4.7.4
QEMU emulator version is: 2.0.0
When I run the above workload, I find that the timeslice of the vCPU2 thread
jitters a lot. I suspect this is triggered by lock contention in the QEMU
layer, because the debug log I added in front of the VMM Linux kernel's
schedule->__schedule->context_switch path looks like the following. Whenever
the timeslice jitters heavily, this debug information appears.
7097538 Dec 13 11:22:33 mobius04 kernel: [39163.015791]
[<ffffffff8176b2f0>] dump_stack+0x64/0x84
7097539 Dec 13 11:22:33 mobius04 kernel: [39163.015793]
[<ffffffff8176bf85>] __schedule+0x5b5/0x960
7097540 Dec 13 11:22:33 mobius04 kernel: [39163.015794]
[<ffffffff8176c409>] schedule+0x29/0x70
7097541 Dec 13 11:22:33 mobius04 kernel: [39163.015796]
[<ffffffff810ef4d8>] futex_wait_queue_me+0xd8/0x150
7097542 Dec 13 11:22:33 mobius04 kernel: [39163.015798]
[<ffffffff810ef6fb>] futex_wait+0x1ab/0x2b0
7097543 Dec 13 11:22:33 mobius04 kernel: [39163.015800]
[<ffffffff810eef00>] ? get_futex_key+0x2d0/0x2e0
7097544 Dec 13 11:22:33 mobius04 kernel: [39163.015804]
[<ffffffffc0290105>] ? __vmx_load_host_state+0x125/0x170 [kvm_intel]
7097545 Dec 13 11:22:33 mobius04 kernel: [39163.015805]
[<ffffffff810f1275>] do_futex+0xf5/0xd20
7097546 Dec 13 11:22:33 mobius04 kernel: [39163.015813]
[<ffffffffc0222690>] ? kvm_vcpu_ioctl+0x100/0x560 [kvm]
7097547 Dec 13 11:22:33 mobius04 kernel: [39163.015816]
[<ffffffff810b06f0>] ? __dequeue_entity+0x30/0x50
7097548 Dec 13 11:22:33 mobius04 kernel: [39163.015818]
[<ffffffff81013d06>] ? __switch_to+0x596/0x690
7097549 Dec 13 11:22:33 mobius04 kernel: [39163.015820]
[<ffffffff811f9f23>] ? do_vfs_ioctl+0x93/0x520
7097550 Dec 13 11:22:33 mobius04 kernel: [39163.015822]
[<ffffffff810f1f1d>] SyS_futex+0x7d/0x170
7097551 Dec 13 11:22:33 mobius04 kernel: [39163.015824]
[<ffffffff8116d1b2>] ? fire_user_return_notifiers+0x42/0x50
7097552 Dec 13 11:22:33 mobius04 kernel: [39163.015826]
[<ffffffff810154b5>] ? do_notify_resume+0xc5/0x100
7097553 Dec 13 11:22:33 mobius04 kernel: [39163.015828]
[<ffffffff81770a8d>] system_call_fastpath+0x1a/0x1f
If so, I think this may be a scalability problem in the QEMU I/O path. Does
QEMU have a feature to avoid this? Could you please give me some suggestions
on how to keep the timeslice of the vCPU2 thread stable even when it issues
many I/O read requests?
qemu -object iothread,id=iothread0 \
-drive if=none,id=drive0,file=test.img,format=raw,cache=none \
-device virtio-blk-pci,iothread=iothread0,drive=drive0
Now the ioeventfd and thread pool completions will be processed in
iothread0 instead of the QEMU main loop thread. This thread does not
take the QEMU global mutex so vcpu execution is not hindered.
This feature is called virtio-blk dataplane.
You can query IOThread thread IDs using the query-iothreads QMP command.
This will allow you to pin iothread0 to pCPU5.
Please let us know if this helps.
Does this feature only work for VirtIO? Does it work for SCSI or IDE?
This only works for virtio-blk and virtio-scsi. The virtio-scsi
dataplane support is more recent and I don't remember if it is complete.
I've CCed Fam and Paolo who worked on virtio-scsi dataplane.

Now that you have mentioned that you aren't using virtio devices, there
is another source of lock contention that you will encounter. I/O
request submission takes place in the vcpu thread when ioeventfd is not
used. Only virtio uses ioeventfd so your current QEMU configuration is
unable to let the vcpu continue execution during I/O request submission.

If you care about performance then using virtio devices is probably the
best choice. Try comparing against virtio-scsi dataplane - you should
see a lot less jitter.
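
In libvirt terms, a virtio-scsi dataplane configuration looks roughly like
this sketch (assuming a QEMU/libvirt combination recent enough to support the
iothread attribute on the virtio-scsi controller; the iothread number is
illustrative):

  <iothreads>1</iothreads>
  <devices>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <!-- process this controller's virtqueues in IOThread 1 -->
      <driver iothread='1'/>
    </controller>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source file='/home/images/kvm1.img'/>
      <target dev='sda' bus='scsi'/>
    </disk>
  </devices>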

Stefan
Paolo Bonzini
2016-12-15 12:13:37 UTC
Post by Stefan Hajnoczi
Post by Weiwei Jia
Does this feature only work for VirtIO? Does it work for SCSI or IDE?
This only works for virtio-blk and virtio-scsi. The virtio-scsi
dataplane support is more recent and I don't remember if it is complete.
I've CCed Fam and Paolo who worked on virtio-scsi dataplane.
Now that you have mentioned that you aren't using virtio devices, there
is another source of lock contention that you will encounter. I/O
request submission takes place in the vcpu thread when ioeventfd is not
used. Only virtio uses ioeventfd so your current QEMU configuration is
unable to let the vcpu continue execution during I/O request submission.
If you care about performance then using virtio devices is probably the
best choice. Try comparing against virtio-scsi dataplane - you should
see a lot less jitter.
Yes, virtio-scsi dataplane is complete.

Paolo
Weiwei Jia
2016-12-15 16:05:11 UTC
I will try it later on. Thank you.

Best,
Weiwei Jia
Post by Paolo Bonzini
Post by Stefan Hajnoczi
Post by Weiwei Jia
Does this feature only work for VirtIO? Does it work for SCSI or IDE?
This only works for virtio-blk and virtio-scsi. The virtio-scsi
dataplane support is more recent and I don't remember if it is complete.
I've CCed Fam and Paolo who worked on virtio-scsi dataplane.
Now that you have mentioned that you aren't using virtio devices, there
is another source of lock contention that you will encounter. I/O
request submission takes place in the vcpu thread when ioeventfd is not
used. Only virtio uses ioeventfd so your current QEMU configuration is
unable to let the vcpu continue execution during I/O request submission.
If you care about performance then using virtio devices is probably the
best choice. Try comparing against virtio-scsi dataplane - you should
see a lot less jitter.
Yes, virtio-scsi dataplane is complete.
Paolo
Weiwei Jia
2016-12-15 15:52:08 UTC
Post by Stefan Hajnoczi
Post by Weiwei Jia
Hi Stefan,
Thanks for your reply. Please see the inline replies.
Post by Stefan Hajnoczi
Post by Weiwei Jia
I find that the timeslice of a vCPU thread in QEMU/KVM is unstable when there
are many read requests (for example, reading 4KB at a time, 8GB in total,
from one file) from the guest OS. I also find that this phenomenon may be
caused by lock contention in the QEMU layer. I observe the problem under the
following workload.
The VMM has 6 pCPUs: pCPU0, pCPU1, pCPU2, pCPU3, pCPU4 and pCPU5. Two KVM
virtual machines (VM1 and VM2) run on the VMM, each with 5 virtual CPUs
(vCPU0 through vCPU4). vCPU0 of VM1 and vCPU0 of VM2 are pinned to pCPU0 and
pCPU5 respectively and are dedicated to handling interrupts. vCPU1 of VM1 and
vCPU1 of VM2 are pinned to pCPU1; vCPU2 of VM1 and vCPU2 of VM2 are pinned to
pCPU2; vCPU3 of VM1 and vCPU3 of VM2 are pinned to pCPU3; vCPU4 of VM1 and
vCPU4 of VM2 are pinned to pCPU4. Except for vCPU0 of VM2 (pinned to pCPU5),
every vCPU in VM1 and VM2 runs one CPU-intensive thread (while(1){i++}) so
that the vCPU never goes idle. In VM1, I start one I/O thread on vCPU2, which
reads 4KB from one file at a time (8GB in total). The I/O scheduler in VM1 and
VM2 is NOOP; the I/O scheduler in the VMM is CFQ. I also pinned the I/O worker
threads launched by QEMU to pCPU5 (note: there is no CPU-intensive thread on
pCPU5, so I/O requests are handled by the QEMU I/O worker threads as soon as
possible). The process scheduling class in the VMs and the VMM is CFS.
Did you pin the QEMU main loop to pCPU5? This is the QEMU process' main
thread and it handles ioeventfd (virtqueue kick) and thread pool
completions.
No, I did not pin the main loop to pCPU5. Do you mean that if I pin the QEMU
main loop to pCPU5 under the above workload, the timeslice of the vCPU2 thread
will be stable even when there are many I/O requests? I do not use virtio for
the VM; I use SCSI. My whole VM XML configuration file is as follows.
Pinning the main loop will probably not solve the problem but it might
help a bit. I just noticed it while reading your email because you
pinned everything carefully except the main loop, which is an important
thread.
Yes, even after pinning the main loop to a dedicated pCPU, the timeslice
still jitters heavily.
Post by Stefan Hajnoczi
Post by Weiwei Jia
Post by Stefan Hajnoczi
Post by Weiwei Jia
Linux Kernel version for VMM is: 3.16.39
Linux Kernel version for VM1 and VM2 is: 4.7.4
QEMU emulator version is: 2.0.0
When I run the above workload, I find that the timeslice of the vCPU2 thread
jitters a lot. I suspect this is triggered by lock contention in the QEMU
layer, because the debug log I added in front of the VMM Linux kernel's
schedule->__schedule->context_switch path looks like the following. Whenever
the timeslice jitters heavily, this debug information appears.
7097538 Dec 13 11:22:33 mobius04 kernel: [39163.015791]
[<ffffffff8176b2f0>] dump_stack+0x64/0x84
7097539 Dec 13 11:22:33 mobius04 kernel: [39163.015793]
[<ffffffff8176bf85>] __schedule+0x5b5/0x960
7097540 Dec 13 11:22:33 mobius04 kernel: [39163.015794]
[<ffffffff8176c409>] schedule+0x29/0x70
7097541 Dec 13 11:22:33 mobius04 kernel: [39163.015796]
[<ffffffff810ef4d8>] futex_wait_queue_me+0xd8/0x150
7097542 Dec 13 11:22:33 mobius04 kernel: [39163.015798]
[<ffffffff810ef6fb>] futex_wait+0x1ab/0x2b0
7097543 Dec 13 11:22:33 mobius04 kernel: [39163.015800]
[<ffffffff810eef00>] ? get_futex_key+0x2d0/0x2e0
7097544 Dec 13 11:22:33 mobius04 kernel: [39163.015804]
[<ffffffffc0290105>] ? __vmx_load_host_state+0x125/0x170 [kvm_intel]
7097545 Dec 13 11:22:33 mobius04 kernel: [39163.015805]
[<ffffffff810f1275>] do_futex+0xf5/0xd20
7097546 Dec 13 11:22:33 mobius04 kernel: [39163.015813]
[<ffffffffc0222690>] ? kvm_vcpu_ioctl+0x100/0x560 [kvm]
7097547 Dec 13 11:22:33 mobius04 kernel: [39163.015816]
[<ffffffff810b06f0>] ? __dequeue_entity+0x30/0x50
7097548 Dec 13 11:22:33 mobius04 kernel: [39163.015818]
[<ffffffff81013d06>] ? __switch_to+0x596/0x690
7097549 Dec 13 11:22:33 mobius04 kernel: [39163.015820]
[<ffffffff811f9f23>] ? do_vfs_ioctl+0x93/0x520
7097550 Dec 13 11:22:33 mobius04 kernel: [39163.015822]
[<ffffffff810f1f1d>] SyS_futex+0x7d/0x170
7097551 Dec 13 11:22:33 mobius04 kernel: [39163.015824]
[<ffffffff8116d1b2>] ? fire_user_return_notifiers+0x42/0x50
7097552 Dec 13 11:22:33 mobius04 kernel: [39163.015826]
[<ffffffff810154b5>] ? do_notify_resume+0xc5/0x100
7097553 Dec 13 11:22:33 mobius04 kernel: [39163.015828]
[<ffffffff81770a8d>] system_call_fastpath+0x1a/0x1f
If so, I think this may be a scalability problem in the QEMU I/O path. Does
QEMU have a feature to avoid this? Could you please give me some suggestions
on how to keep the timeslice of the vCPU2 thread stable even when it issues
many I/O read requests?
qemu -object iothread,id=iothread0 \
-drive if=none,id=drive0,file=test.img,format=raw,cache=none \
-device virtio-blk-pci,iothread=iothread0,drive=drive0
Now the ioeventfd and thread pool completions will be processed in
iothread0 instead of the QEMU main loop thread. This thread does not
take the QEMU global mutex so vcpu execution is not hindered.
This feature is called virtio-blk dataplane.
You can query IOThread thread IDs using the query-iothreads QMP command.
This will allow you to pin iothread0 to pCPU5.
Please let us know if this helps.
Does this feature only work for VirtIO? Does it work for SCSI or IDE?
This only works for virtio-blk and virtio-scsi. The virtio-scsi
dataplane support is more recent and I don't remember if it is complete.
I've CCed Fam and Paolo who worked on virtio-scsi dataplane.
Now that you have mentioned that you aren't using virtio devices, there
is another source of lock contention that you will encounter. I/O
request submission takes place in the vcpu thread when ioeventfd is not
used. Only virtio uses ioeventfd so your current QEMU configuration is
unable to let the vcpu continue execution during I/O request submission.
If you care about performance then using virtio devices is probably the
best choice. Try comparing against virtio-scsi dataplane - you should
see a lot less jitter.
I will try the virtio-scsi or virtio-blk dataplane solution. Thank you.


Cheers,
Weiwei Jia