Discussion:
[PATCH] kvm: clear guest TSC on reset
Fernando Luis Vázquez Cao
2013-12-05 06:08:42 UTC
Permalink
I realized that the TSC reset should be done in QEMU
so I will be replying with a QEMU patch instead of a
KVM one.

- Fernando
I think there is a problem with the current patch, so please
ignore for the moment. I will be replying with an update ASAP.
Sorry for the noise.
- Fernando
VCPU TSC is not cleared by a warm reset (*), which leaves many Linux
guests vulnerable to the overflow in cyc2ns_offset fixed by upstream
commit 9993bc635d01a6ee7f6b833b4ee65ce7c06350b1 ("sched/x86: Fix=20
overflow
in cyc2ns_offset").
To put it in a nutshell, if a Linux guest without the patch above=20
applied
has been up more than 208 days and attempts a warm reset chances are=
=20
that
the newly booted kernel will panic or hang.
(*) Intel Xeon E5 processors show the same broken behavior due to
the errata "TSC is Not Affected by Warm Reset" (Intel=C2=AE Xeo=
n=C2=AE
Processor E5 Family Specification Update - August 2013): "The
TSC (Time Stamp Counter MSR 10H) should be cleared on
reset. Due to this erratum the TSC is not affected by warm
reset."
---
diff -urNp linux-3.13-rc2-orig/arch/x86/kvm/x86.c=20
linux-3.13-rc2/arch/x86/kvm/x86.c
--- linux-3.13-rc2-orig/arch/x86/kvm/x86.c 2013-11-30=20
05:57:14.000000000 +0900
+++ linux-3.13-rc2/arch/x86/kvm/x86.c 2013-12-03=20
14:51:53.747600839 +0900
@@ -6716,18 +6716,24 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu
return r;
}
-int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+static void kvm_tsc_reset(struct kvm_vcpu *vcpu)
{
- int r;
struct msr_data msr;
- r =3D vcpu_load(vcpu);
- if (r)
- return r;
msr.data =3D 0x0;
msr.index =3D MSR_IA32_TSC;
msr.host_initiated =3D true;
kvm_write_tsc(vcpu, &msr);
+}
+
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+ int r;
+
+ r =3D vcpu_load(vcpu);
+ if (r)
+ return r;
+ kvm_tsc_reset(vcpu);
vcpu_put(vcpu);
return r;
@@ -6770,6 +6776,10 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcp
kvm_pmu_reset(vcpu);
+ kvm_tsc_reset(vcpu);
+ if (guest_cpuid_has_tsc_adjust(vcpu))
+ vcpu->arch.ia32_tsc_adjust_msr =3D 0x0;
+
memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs));
vcpu->arch.regs_avail =3D ~0;
vcpu->arch.regs_dirty =3D ~0;
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Fernando Luis Vázquez Cao
2013-12-05 06:15:04 UTC
Permalink
VCPU TSC is not cleared by a warm reset (*), which leaves many Linux
guests vulnerable to the overflow in cyc2ns_offset fixed by upstream
commit 9993bc635d01a6ee7f6b833b4ee65ce7c06350b1 ("sched/x86: Fix overfl=
ow
in cyc2ns_offset").

To put it in a nutshell, if a Linux guest without the patch above appli=
ed
has been up more than 208 days and attempts a warm reset chances are th=
at
the newly booted kernel will panic or hang.

(*) Intel Xeon E5 processors show the same broken behavior due to
the errata "TSC is Not Affected by Warm Reset" (Intel=C2=AE Xeon=C2=
=AE
Processor E5 Family Specification Update - August 2013): "The
TSC (Time Stamp Counter MSR 10H) should be cleared on
reset. Due to this erratum the TSC is not affected by warm
reset."

Cc: ***@vger.kernel.org
Cc: Will Auld <***@intel.com>
Cc: Marcelo Tosatti <***@redhat.com>
Signed-off-by: Fernando Luis Vazquez Cao <***@oss.ntt.co.jp>
---

--- qemu-orig/target-i386/kvm.c 2013-11-28 07:02:45.000000000 +0900
+++ qemu/target-i386/kvm.c 2013-12-05 14:47:03.085738175 +0900
@@ -1125,6 +1125,8 @@ static int kvm_put_msrs(X86CPU *cpu, int
kvm_msr_entry_set(&msrs[n++], MSR_VM_HSAVE_PA, env->vm_hsave);
}
if (has_msr_tsc_adjust) {
+ if (level =3D=3D KVM_PUT_RESET_STATE)
+ env->tsc_adjust =3D 0;
kvm_msr_entry_set(&msrs[n++], MSR_TSC_ADJUST, env->tsc_adjust)=
;
}
if (has_msr_misc_enable) {
@@ -1139,22 +1141,22 @@ static int kvm_put_msrs(X86CPU *cpu, int
kvm_msr_entry_set(&msrs[n++], MSR_LSTAR, env->lstar);
}
#endif
- if (level =3D=3D KVM_PUT_FULL_STATE) {
+ /*
+ * The following MSRs have side effects on the guest or are too he=
avy
+ * for normal writeback. Limit them to reset or full state updates=
=2E
+ */
+ if (level >=3D KVM_PUT_RESET_STATE) {
+ if (level =3D=3D KVM_PUT_RESET_STATE)
+ env->tsc =3D 0;
/*
* KVM is yet unable to synchronize TSC values of multiple VCP=
Us on
* writeback. Until this is fixed, we only write the offset to=
SMP
* guests after migration, desynchronizing the VCPUs, but avoi=
ding
* huge jump-backs that would occur without any writeback at a=
ll.
*/
- if (smp_cpus =3D=3D 1 || env->tsc !=3D 0) {
+ if (smp_cpus =3D=3D 1 || env->tsc !=3D 0 || level =3D=3D KVM_P=
UT_RESET_STATE) {
kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSC, env->tsc);
}
- }
- /*
- * The following MSRs have side effects on the guest or are too he=
avy
- * for normal writeback. Limit them to reset or full state updates=
=2E
- */
- if (level >=3D KVM_PUT_RESET_STATE) {
kvm_msr_entry_set(&msrs[n++], MSR_KVM_SYSTEM_TIME,
env->system_time_msr);
kvm_msr_entry_set(&msrs[n++], MSR_KVM_WALL_CLOCK, env->wall_cl=
ock_msr);


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Paolo Bonzini
2013-12-05 09:28:18 UTC
Permalink
VCPU TSC is not cleared by a warm reset (*), which leaves many Linux
guests vulnerable to the overflow in cyc2ns_offset fixed by upstream
commit 9993bc635d01a6ee7f6b833b4ee65ce7c06350b1 ("sched/x86: Fix over=
flow
in cyc2ns_offset").
=20
To put it in a nutshell, if a Linux guest without the patch above app=
lied
has been up more than 208 days and attempts a warm reset chances are =
that
the newly booted kernel will panic or hang.
=20
(*) Intel Xeon E5 processors show the same broken behavior due to
the errata "TSC is Not Affected by Warm Reset" (Intel=C2=AE Xeon=C2=
=AE
Processor E5 Family Specification Update - August 2013): "The
TSC (Time Stamp Counter MSR 10H) should be cleared on
reset. Due to this erratum the TSC is not affected by warm
reset."
=20
I agree that the bug is in QEMU. One small nit in your patch is that
you should reset env->tsc_adjust and env->tsc in x86_cpu_reset. This
would already be pretty good.

However, a bigger problem is that env->tsc is a useless duplicate of
"cpu_get_ticks() + env->tsc_adjust". It would be nice to drop env->tsc
completely except for migration backwards compatibility. Thus you can:

- fill in env->tsc as mentioned above from target-i386/machine.c's
cpu_pre_save function. This guarantees backwards compatibility.

- add a function cpu_set_ticks(int64_t ticks) to cpus.c. The function
does nothing if use_icount is true, otherwise it needs to have (roughly=
)
the opposite logic compared to cpu_get_ticks. You then call this
function from x86_cpu_reset instead of setting env->tsc. You can
similarly call this function from kvm_get_msrs.

- add a function kvm_set_ticks(int64_t ticks) to kvm-all.c and
kvm-stub.c. For kvm-all.c it calls kvm_arch_set_ticks(CPUState *cpu,
int64_t ticks) in target-*/kvm.c. The kvm_arch_set_tsc() function has =
a
dummy implementation for all architectures except x86. For x86 it call=
s
KVM_SET_MSRS passing "ticks + env->tsc_offset".

- call kvm_set_ticks() from cpu_set_ticks() and cpu_enable_ticks()

Can you do this?

Thanks,

Paolo
---
=20
--- qemu-orig/target-i386/kvm.c 2013-11-28 07:02:45.000000000 +0900
+++ qemu/target-i386/kvm.c 2013-12-05 14:47:03.085738175 +0900
@@ -1125,6 +1125,8 @@ static int kvm_put_msrs(X86CPU *cpu, int
kvm_msr_entry_set(&msrs[n++], MSR_VM_HSAVE_PA, env->vm_hsave=
);
}
if (has_msr_tsc_adjust) {
+ if (level =3D=3D KVM_PUT_RESET_STATE)
+ env->tsc_adjust =3D 0;
kvm_msr_entry_set(&msrs[n++], MSR_TSC_ADJUST, env->tsc_adjus=
t);
}
if (has_msr_misc_enable) {
@@ -1139,22 +1141,22 @@ static int kvm_put_msrs(X86CPU *cpu, int
kvm_msr_entry_set(&msrs[n++], MSR_LSTAR, env->lstar);
}
#endif
- if (level =3D=3D KVM_PUT_FULL_STATE) {
+ /*
+ * The following MSRs have side effects on the guest or are too =
heavy
+ * for normal writeback. Limit them to reset or full state updat=
es.
+ */
+ if (level >=3D KVM_PUT_RESET_STATE) {
+ if (level =3D=3D KVM_PUT_RESET_STATE)
+ env->tsc =3D 0;
/*
* KVM is yet unable to synchronize TSC values of multiple V=
CPUs on
* writeback. Until this is fixed, we only write the offset =
to SMP
* guests after migration, desynchronizing the VCPUs, but av=
oiding
* huge jump-backs that would occur without any writeback at=
all.
*/
- if (smp_cpus =3D=3D 1 || env->tsc !=3D 0) {
+ if (smp_cpus =3D=3D 1 || env->tsc !=3D 0 || level =3D=3D KVM=
_PUT_RESET_STATE) {
kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSC, env->tsc);
}
- }
- /*
- * The following MSRs have side effects on the guest or are too =
heavy
- * for normal writeback. Limit them to reset or full state updat=
es.
- */
- if (level >=3D KVM_PUT_RESET_STATE) {
kvm_msr_entry_set(&msrs[n++], MSR_KVM_SYSTEM_TIME,
env->system_time_msr);
kvm_msr_entry_set(&msrs[n++], MSR_KVM_WALL_CLOCK, env->wall_=
clock_msr);
=20
=20
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
=20
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Fernando Luis Vazquez Cao
2013-12-05 13:15:06 UTC
Permalink
Post by Paolo Bonzini
VCPU TSC is not cleared by a warm reset (*), which leaves many Linux
guests vulnerable to the overflow in cyc2ns_offset fixed by upstream
commit 9993bc635d01a6ee7f6b833b4ee65ce7c06350b1 ("sched/x86: Fix overflow
in cyc2ns_offset").
To put it in a nutshell, if a Linux guest without the patch above applied
has been up more than 208 days and attempts a warm reset chances are that
the newly booted kernel will panic or hang.
(*) Intel Xeon E5 processors show the same broken behavior due to
the errata "TSC is Not Affected by Warm Reset" (Intel® Xeon®
Processor E5 Family Specification Update - August 2013): "The
TSC (Time Stamp Counter MSR 10H) should be cleared on
reset. Due to this erratum the TSC is not affected by warm
reset."
I agree that the bug is in QEMU. One small nit in your patch is that
you should reset env->tsc_adjust and env->tsc in x86_cpu_reset. This
would already be pretty good.
Yes, that is certainly cleaner (I should try not to take shortcuts...).
I am attaching
an updated patch (I apologize for not sending it inline - for reasons
better left
untold I am writing this on a problematic email client :) ).
Post by Paolo Bonzini
However, a bigger problem is that env->tsc is a useless duplicate of
"cpu_get_ticks() + env->tsc_adjust". It would be nice to drop env->tsc
- fill in env->tsc as mentioned above from target-i386/machine.c's
cpu_pre_save function. This guarantees backwards compatibility.
- add a function cpu_set_ticks(int64_t ticks) to cpus.c. The function
does nothing if use_icount is true, otherwise it needs to have (roughly)
the opposite logic compared to cpu_get_ticks. You then call this
function from x86_cpu_reset instead of setting env->tsc. You can
similarly call this function from kvm_get_msrs.
- add a function kvm_set_ticks(int64_t ticks) to kvm-all.c and
kvm-stub.c. For kvm-all.c it calls kvm_arch_set_ticks(CPUState *cpu,
int64_t ticks) in target-*/kvm.c. The kvm_arch_set_tsc() function has a
dummy implementation for all architectures except x86. For x86 it calls
KVM_SET_MSRS passing "ticks + env->tsc_offset".
- call kvm_set_ticks() from cpu_set_ticks() and cpu_enable_ticks()
Can you do this?
Can you pick my original fix first? I can do what you suggest in a follow-up
patch.

Thanks,
Fernando
Paolo Bonzini
2013-12-05 13:53:05 UTC
Permalink
/*
* KVM is yet unable to synchronize TSC values of multiple VCPUs on
* writeback. Until this is fixed, we only write the offset to SMP
* guests after migration, desynchronizing the VCPUs, but avoiding
* huge jump-backs that would occur without any writeback at all.
*/
- if (smp_cpus == 1 || env->tsc != 0) {
+ if (smp_cpus == 1 || env->tsc != 0 || level == KVM_PUT_RESET_STATE) {
kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSC, env->tsc);
}
This is still a bit ugly, and desynchronizes the VCPUs on reset.

The main point of my outlined solution is that you only have one value
that is tracked, not one per VCPU (which in the case of migration adds
unpredictable latencies---for example due to emptying the migration
buffers). We already save that value; all that's left is to use it
instead of env->tsc.
- add a function kvm_set_ticks(int64_t ticks) to kvm-all.c and
kvm-stub.c. For kvm-all.c it calls kvm_arch_set_ticks(CPUState *cpu,
int64_t ticks) in target-*/kvm.c. The kvm_arch_set_tsc() function has a
dummy implementation for all architectures except x86. For x86 it calls
KVM_SET_MSRS passing "ticks + env->tsc_offset".
Instead you can make kvm_{,arch_}update_ticks() and pass
"cpu_get_ticks() + env->tsc_offset" to KVM_SET_MSRS (looping across all
VCPUs). Assuming the TSC is synchronized to begin with on host CPUs,
and the latency is similar for all CPUs from the invocation of the ioctl
to the time TSC_OFFSET is written, the synchronization should be decent.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Fernando Luis Vazquez Cao
2013-12-05 15:42:44 UTC
Permalink
Post by Paolo Bonzini
/*
* KVM is yet unable to synchronize TSC values of multiple VCPUs on
* writeback. Until this is fixed, we only write the offset to SMP
* guests after migration, desynchronizing the VCPUs, but avoiding
* huge jump-backs that would occur without any writeback at all.
*/
- if (smp_cpus == 1 || env->tsc != 0) {
+ if (smp_cpus == 1 || env->tsc != 0 || level == KVM_PUT_RESET_STATE) {
kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSC, env->tsc);
}
This is still a bit ugly, and desynchronizes the VCPUs on reset.
I agree it is a bit ugly, but in my testing QEMU seemed to loop over all
the VCPUS fast enough for the kernel side kvm_write_tsc() to do a
reasonable job of matching the offsets (the Linux guest did not mark
the TSC unstable due to the TSCs being unsynchronized). Am I missing
something?
Post by Paolo Bonzini
The main point of my outlined solution is that you only have one value
that is tracked, not one per VCPU (which in the case of migration adds
unpredictable latencies---for example due to emptying the migration
buffers). We already save that value; all that's left is to use it
instead of env->tsc.
I understand the benefits of what you are proposing but, since it is
wider is scope and it would be more difficult to backport, I would
prefer to implement it as a follow-up patch, unless you think that
the current patch as a standalone fix does more harm than good.

- Fernando
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Paolo Bonzini
2013-12-05 16:02:02 UTC
Permalink
Post by Fernando Luis Vazquez Cao
Post by Paolo Bonzini
/*
* KVM is yet unable to synchronize TSC values of multiple VCPUs on
* writeback. Until this is fixed, we only write the offset to SMP
* guests after migration, desynchronizing the VCPUs, but avoiding
* huge jump-backs that would occur without any writeback at all.
*/
- if (smp_cpus == 1 || env->tsc != 0) {
+ if (smp_cpus == 1 || env->tsc != 0 || level == KVM_PUT_RESET_STATE) {
kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSC, env->tsc);
}
This is still a bit ugly, and desynchronizes the VCPUs on reset.
I agree it is a bit ugly, but in my testing QEMU seemed to loop over all
the VCPUS fast enough for the kernel side kvm_write_tsc() to do a
reasonable job of matching the offsets (the Linux guest did not mark
the TSC unstable due to the TSCs being unsynchronized). Am I missing
something?
No, probably not.
Post by Fernando Luis Vazquez Cao
I understand the benefits of what you are proposing but, since it is
wider is scope and it would be more difficult to backport, I would
prefer to implement it as a follow-up patch, unless you think that
the current patch as a standalone fix does more harm than good.
It does some harm in that it introduces a case where KVM_PUT_RESET_STATE
restores something, but KVM_PUT_FULL_STATE doesn't.

If it really usually works, there shouldn't be a need for this "if"
statement at all.

Marcelo, what do you think?

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti
2013-12-05 16:40:00 UTC
Permalink
Post by Paolo Bonzini
Post by Fernando Luis Vazquez Cao
Post by Paolo Bonzini
/*
* KVM is yet unable to synchronize TSC values of multiple VCPUs on
* writeback. Until this is fixed, we only write the offset to SMP
* guests after migration, desynchronizing the VCPUs, but avoiding
* huge jump-backs that would occur without any writeback at all.
*/
- if (smp_cpus == 1 || env->tsc != 0) {
+ if (smp_cpus == 1 || env->tsc != 0 || level == KVM_PUT_RESET_STATE) {
kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSC, env->tsc);
}
This is still a bit ugly, and desynchronizes the VCPUs on reset.
I agree it is a bit ugly, but in my testing QEMU seemed to loop over all
the VCPUS fast enough for the kernel side kvm_write_tsc() to do a
reasonable job of matching the offsets (the Linux guest did not mark
the TSC unstable due to the TSCs being unsynchronized). Am I missing
something?
No, probably not.
Post by Fernando Luis Vazquez Cao
I understand the benefits of what you are proposing but, since it is
wider is scope and it would be more difficult to backport, I would
prefer to implement it as a follow-up patch, unless you think that
the current patch as a standalone fix does more harm than good.
It does some harm in that it introduces a case where KVM_PUT_RESET_STATE
restores something, but KVM_PUT_FULL_STATE doesn't.
If it really usually works, there shouldn't be a need for this "if"
statement at all.
Marcelo, what do you think?
Paolo
Its OK to drop it, provided the following is tested on SMP guests:

1. initialization.
2. reboot.
3. migration.

With both stable and unstable TSC hosts (use wrmsr tool to write TSC on
a given host CPU, to make it an unstable TSC host). (A=rdmsr ; sleep 1s;
wrmsr A).

To make sure the code is not securing against a kvm_write_tsc
cornercase.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti
2013-12-05 17:06:55 UTC
Permalink
Post by Marcelo Tosatti
Post by Paolo Bonzini
Post by Fernando Luis Vazquez Cao
Post by Paolo Bonzini
/*
* KVM is yet unable to synchronize TSC values of multiple VCPUs on
* writeback. Until this is fixed, we only write the offset to SMP
* guests after migration, desynchronizing the VCPUs, but avoiding
* huge jump-backs that would occur without any writeback at all.
*/
- if (smp_cpus == 1 || env->tsc != 0) {
+ if (smp_cpus == 1 || env->tsc != 0 || level == KVM_PUT_RESET_STATE) {
kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSC, env->tsc);
}
This is still a bit ugly, and desynchronizes the VCPUs on reset.
I agree it is a bit ugly, but in my testing QEMU seemed to loop over all
the VCPUS fast enough for the kernel side kvm_write_tsc() to do a
reasonable job of matching the offsets (the Linux guest did not mark
the TSC unstable due to the TSCs being unsynchronized). Am I missing
something?
No, probably not.
Post by Fernando Luis Vazquez Cao
I understand the benefits of what you are proposing but, since it is
wider is scope and it would be more difficult to backport, I would
prefer to implement it as a follow-up patch, unless you think that
the current patch as a standalone fix does more harm than good.
It does some harm in that it introduces a case where KVM_PUT_RESET_STATE
restores something, but KVM_PUT_FULL_STATE doesn't.
If it really usually works, there shouldn't be a need for this "if"
statement at all.
Marcelo, what do you think?
Paolo
1. initialization.
2. reboot.
3. migration.
With both stable and unstable TSC hosts (use wrmsr tool to write TSC on
a given host CPU, to make it an unstable TSC host). (A=rdmsr ; sleep 1s;
wrmsr A).
To make sure the code is not securing against a kvm_write_tsc
cornercase.
The TSCs should start synchronized, and remain synchronized across
reboot and migration for stable TSC host case.

It is not necessary to test the unstable TSC host case.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti
2013-12-05 16:17:07 UTC
Permalink
Post by Fernando Luis Vazquez Cao
Post by Paolo Bonzini
/*
* KVM is yet unable to synchronize TSC values of multiple VCPUs on
* writeback. Until this is fixed, we only write the offset to SMP
* guests after migration, desynchronizing the VCPUs, but avoiding
* huge jump-backs that would occur without any writeback at all.
*/
- if (smp_cpus == 1 || env->tsc != 0) {
+ if (smp_cpus == 1 || env->tsc != 0 || level == KVM_PUT_RESET_STATE) {
kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSC, env->tsc);
}
This is still a bit ugly, and desynchronizes the VCPUs on reset.
I agree it is a bit ugly, but in my testing QEMU seemed to loop over all
the VCPUS fast enough for the kernel side kvm_write_tsc() to do a
reasonable job of matching the offsets (the Linux guest did not mark
the TSC unstable due to the TSCs being unsynchronized). Am I missing
something?
Right, modern kernels (see kvm_write_tsc) perform synchronization, so in
theory the "KVM is yet unable to synchronize ..." code is not necessary
anymore.

I vote for dropping the thing entirely.
Post by Fernando Luis Vazquez Cao
Post by Paolo Bonzini
The main point of my outlined solution is that you only have one value
that is tracked, not one per VCPU (which in the case of migration adds
unpredictable latencies---for example due to emptying the migration
buffers). We already save that value; all that's left is to use it
instead of env->tsc.
I understand the benefits of what you are proposing but, since it is
wider is scope and it would be more difficult to backport, I would
prefer to implement it as a follow-up patch, unless you think that
the current patch as a standalone fix does more harm than good.
- Fernando
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Paolo Bonzini
2013-12-05 16:38:33 UTC
Permalink
Post by Marcelo Tosatti
Post by Fernando Luis Vazquez Cao
I agree it is a bit ugly, but in my testing QEMU seemed to loop over all
the VCPUS fast enough for the kernel side kvm_write_tsc() to do a
reasonable job of matching the offsets (the Linux guest did not mark
the TSC unstable due to the TSCs being unsynchronized). Am I missing
something?
Right, modern kernels (see kvm_write_tsc) perform synchronization, so in
theory the "KVM is yet unable to synchronize ..." code is not necessary
anymore.
I vote for dropping the thing entirely.
If it can be dropped entirely, I certainly have no problem with starting
with a simple patch first.

Paolo

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Fernando Luis Vázquez Cao
2013-12-06 08:24:18 UTC
Permalink
Post by Paolo Bonzini
Post by Marcelo Tosatti
Post by Fernando Luis Vazquez Cao
I agree it is a bit ugly, but in my testing QEMU seemed to loop over all
the VCPUS fast enough for the kernel side kvm_write_tsc() to do a
reasonable job of matching the offsets (the Linux guest did not mark
the TSC unstable due to the TSCs being unsynchronized). Am I missing
something?
Right, modern kernels (see kvm_write_tsc) perform synchronization, so in
theory the "KVM is yet unable to synchronize ..." code is not necessary
anymore.
I vote for dropping the thing entirely.
When I was writing the original patch I was tempted to do that,
but I feared that it could break older kernels that do not have
TSC synchronization code. Should we care about such uses
(recent QEMU user space + old kernel)?

I also wanted to make sure that the initialization that we do
in kvm_arch_vcpu_postcreate on power up and the subsequent
TSC writeback work well together, but I didn't have time to
test it (reading the code, I would say that the TSC generation
counter may end up being increased a few times but the TSCs
would eventually converge).
Post by Paolo Bonzini
If it can be dropped entirely, I certainly have no problem with starting
with a simple patch first.
Could we start with the patch that I already sent? It's been
tested, it is conservative in the sense that it does the minimum
necessary to fix an existing bug, and should be easy to
backport. I will be replying to this email with an updated
version that has a more appropriate and less scary patch
description.

I will also be sending a patch that makes the TSC writeback
unconditional, but this one should probably be kept on hold
until it is properly tested.

As a follow-up effort we can work on Paolo's suggestions.

Is this an acceptable way forward?

Thanks,
Fernando
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Fernando Luis Vázquez Cao
2013-12-06 08:33:01 UTC
Permalink
VCPU TSC is not cleared by a warm reset (*), which leaves some types of=
Linux
guests (non-pvops guests and those with the kernel parameter no-kvmclo=
ck set)
vulnerable to the overflow in cyc2ns_offset fixed by upstream commit
9993bc635d01a6ee7f6b833b4ee65ce7c06350b1 ("sched/x86: Fix overflow in
cyc2ns_offset").

To put it in a nutshell, if such a Linux guest without the patch above =
applied
has been up more than 208 days and attempts a warm reset chances are th=
at
the newly booted kernel will panic or hang.

(*) Intel Xeon E5 processors show the same broken behavior due to
the errata "TSC is Not Affected by Warm Reset" (Intel=C2=AE Xeon=C2=
=AE
Processor E5 Family Specification Update - August 2013): "The
TSC (Time Stamp Counter MSR 10H) should be cleared on
reset. Due to this erratum the TSC is not affected by warm
reset."

Cc: Will Auld <***@intel.com>
Cc: Marcelo Tosatti <***@redhat.com>
Signed-off-by: Fernando Luis Vazquez Cao <***@oss.ntt.co.jp>
---

diff -urNp qemu-orig/target-i386/cpu.c qemu/target-i386/cpu.c
--- qemu-orig/target-i386/cpu.c 2013-11-28 07:02:45.000000000 +0900
+++ qemu/target-i386/cpu.c 2013-12-05 21:45:19.980156320 +0900
@@ -2446,6 +2446,9 @@ static void x86_cpu_reset(CPUState *s)
cpu_breakpoint_remove_all(env, BP_CPU);
cpu_watchpoint_remove_all(env, BP_CPU);
=20
+ env->tsc_adjust =3D 0;
+ env->tsc =3D 0;
+
#if !defined(CONFIG_USER_ONLY)
/* We hard-wire the BSP to the first CPU. */
if (s->cpu_index =3D=3D 0) {
diff -urNp qemu-orig/target-i386/kvm.c qemu/target-i386/kvm.c
--- qemu-orig/target-i386/kvm.c 2013-11-28 07:02:45.000000000 +0900
+++ qemu/target-i386/kvm.c 2013-12-05 21:45:28.900200552 +0900
@@ -1139,22 +1139,20 @@ static int kvm_put_msrs(X86CPU *cpu, int
kvm_msr_entry_set(&msrs[n++], MSR_LSTAR, env->lstar);
}
#endif
- if (level =3D=3D KVM_PUT_FULL_STATE) {
+ /*
+ * The following MSRs have side effects on the guest or are too he=
avy
+ * for normal writeback. Limit them to reset or full state updates=
=2E
+ */
+ if (level >=3D KVM_PUT_RESET_STATE) {
/*
* KVM is yet unable to synchronize TSC values of multiple VCP=
Us on
* writeback. Until this is fixed, we only write the offset to=
SMP
* guests after migration, desynchronizing the VCPUs, but avoi=
ding
* huge jump-backs that would occur without any writeback at a=
ll.
*/
- if (smp_cpus =3D=3D 1 || env->tsc !=3D 0) {
+ if (smp_cpus =3D=3D 1 || env->tsc !=3D 0 || level =3D=3D KVM_P=
UT_RESET_STATE) {
kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSC, env->tsc);
}
- }
- /*
- * The following MSRs have side effects on the guest or are too he=
avy
- * for normal writeback. Limit them to reset or full state updates=
=2E
- */
- if (level >=3D KVM_PUT_RESET_STATE) {
kvm_msr_entry_set(&msrs[n++], MSR_KVM_SYSTEM_TIME,
env->system_time_msr);
kvm_msr_entry_set(&msrs[n++], MSR_KVM_WALL_CLOCK, env->wall_cl=
ock_msr);


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Fernando Luis Vázquez Cao
2013-12-06 08:38:24 UTC
Permalink
Newer kernels are capable of synchronizing TSC values of multiple VCPUs
on writeback, but we were excluding the power up case, which is not needed
anymore.

Signed-off-by: Fernando Luis Vazquez Cao <***@oss.ntt.co.jp>
---

diff -urNp qemu-orig/target-i386/kvm.c qemu/target-i386/kvm.c
--- qemu-orig/target-i386/kvm.c 2013-12-06 16:12:03.201953736 +0900
+++ qemu/target-i386/kvm.c 2013-12-06 16:13:53.897955184 +0900
@@ -1144,15 +1144,7 @@ static int kvm_put_msrs(X86CPU *cpu, int
* for normal writeback. Limit them to reset or full state updates.
*/
if (level >= KVM_PUT_RESET_STATE) {
- /*
- * KVM is yet unable to synchronize TSC values of multiple VCPUs on
- * writeback. Until this is fixed, we only write the offset to SMP
- * guests after migration, desynchronizing the VCPUs, but avoiding
- * huge jump-backs that would occur without any writeback at all.
- */
- if (smp_cpus == 1 || env->tsc != 0 || level == KVM_PUT_RESET_STATE) {
- kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSC, env->tsc);
- }
+ kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSC, env->tsc);
kvm_msr_entry_set(&msrs[n++], MSR_KVM_SYSTEM_TIME,
env->system_time_msr);
kvm_msr_entry_set(&msrs[n++], MSR_KVM_WALL_CLOCK, env->wall_clock_msr);


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Paolo Bonzini
2013-12-06 08:36:07 UTC
Permalink
Post by Fernando Luis Vázquez Cao
=20
Could we start with the patch that I already sent? It's been
tested, it is conservative in the sense that it does the minimum
necessary to fix an existing bug, and should be easy to
backport. I will be replying to this email with an updated
version that has a more appropriate and less scary patch
description.
=20
I will also be sending a patch that makes the TSC writeback
unconditional, but this one should probably be kept on hold
until it is properly tested.
If you test it, I can drop the "if" myself from your patch.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Fernando Luis Vázquez Cao
2013-12-06 08:56:28 UTC
Permalink
Post by Paolo Bonzini
Post by Fernando Luis Vázquez Cao
Could we start with the patch that I already sent? It's been
tested, it is conservative in the sense that it does the minimum
necessary to fix an existing bug, and should be easy to
backport. I will be replying to this email with an updated
version that has a more appropriate and less scary patch
description.
I will also be sending a patch that makes the TSC writeback
unconditional, but this one should probably be kept on hold
until it is properly tested.
If you test it, I can drop the "if" myself from your patch.
Unfortunately I will not have time to test patch 2 properly
for a while, so I think it would make sense to merge patch
1 first.

- Fernando
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Paolo Bonzini
2013-12-06 09:08:00 UTC
Permalink
Post by Fernando Luis Vázquez Cao
Post by Paolo Bonzini
Post by Fernando Luis Vázquez Cao
I will also be sending a patch that makes the TSC writeback
unconditional, but this one should probably be kept on hold
until it is properly tested.
If you test it, I can drop the "if" myself from your patch.
=20
Unfortunately I will not have time to test patch 2 properly
for a while, so I think it would make sense to merge patch
1 first.
As I said, I'm not sure that patch 1 is an improvement... I'll try to
test it myself.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Fernando Luis Vazquez Cao
2013-12-06 09:20:04 UTC
Permalink
Post by Paolo Bonzini
Post by Fernando Luis Vázquez Cao
Post by Paolo Bonzini
Post by Fernando Luis Vázquez Cao
I will also be sending a patch that makes the TSC writeback
unconditional, but this one should probably be kept on hold
until it is properly tested.
If you test it, I can drop the "if" myself from your patch.
Unfortunately I will not have time to test patch 2 properly
for a while, so I think it would make sense to merge patch
1 first.
As I said, I'm not sure that patch 1 is an improvement...
Well it fixes and existing bug (without desynchronizing) and
I think it does not make the existing code uglier (it even gets
rid of one if statement! :) ). I guess the latter depends on
the eye of the beholder though.
Post by Paolo Bonzini
I'll try to=E3=80=80test it myself.
Thank you. Same here if time permits.

- Fernando
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti
2013-12-06 14:22:52 UTC
Permalink
On Fri, Dec 06, 2013 at 05:24:18PM +0900, Fernando Luis V=E1zquez Cao w=
Post by Fernando Luis Vázquez Cao
I agree it is a bit ugly, but in my testing QEMU seemed to loop o=
ver all
Post by Fernando Luis Vázquez Cao
the VCPUS fast enough for the kernel side kvm_write_tsc() to do a
reasonable job of matching the offsets (the Linux guest did not m=
ark
Post by Fernando Luis Vázquez Cao
the TSC unstable due to the TSCs being unsynchronized). Am I miss=
ing
Post by Fernando Luis Vázquez Cao
something?
Right, modern kernels (see kvm_write_tsc) perform synchronization, =
so in
Post by Fernando Luis Vázquez Cao
theory the "KVM is yet unable to synchronize ..." code is not neces=
sary
Post by Fernando Luis Vázquez Cao
anymore.
I vote for dropping the thing entirely.
=20
When I was writing the original patch I was tempted to do that,
but I feared that it could break older kernels that do not have
TSC synchronization code. Should we care about such uses
(recent QEMU user space + old kernel)?
Unfortunately there is no clean way to detect kernels that support TSC
write synchronization (not directly at least).

However the combination of recent qemu, pre-2010 kernel, and no kvmcloc=
k
Linux guest can be ignored in my POV.
Post by Fernando Luis Vázquez Cao
I also wanted to make sure that the initialization that we do
in kvm_arch_vcpu_postcreate on power up and the subsequent
TSC writeback work well together, but I didn't have time to
test it (reading the code, I would say that the TSC generation
counter may end up being increased a few times but the TSCs
would eventually converge).
A basic test should be fine, because the matching code is in use=20
today.
Post by Fernando Luis Vázquez Cao
If it can be dropped entirely, I certainly have no problem with star=
ting
Post by Fernando Luis Vázquez Cao
with a simple patch first.
=20
Could we start with the patch that I already sent? It's been
tested, it is conservative in the sense that it does the minimum
necessary to fix an existing bug, and should be easy to
backport. I will be replying to this email with an updated
version that has a more appropriate and less scary patch
description.
=20
I will also be sending a patch that makes the TSC writeback
unconditional, but this one should probably be kept on hold
until it is properly tested.
=20
As a follow-up effort we can work on Paolo's suggestions.
=20
Is this an acceptable way forward?
=20
Thanks,
Fernando
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Fernando Luis Vázquez Cao
2013-12-09 08:50:17 UTC
Permalink
On Fri, Dec 06, 2013 at 05:24:18PM +0900, Fernando Luis V=E1zquez Cao=
Post by Fernando Luis Vázquez Cao
I also wanted to make sure that the initialization that we do
in kvm_arch_vcpu_postcreate on power up and the subsequent
TSC writeback work well together, but I didn't have time to
test it (reading the code, I would say that the TSC generation
counter may end up being increased a few times but the TSCs
would eventually converge).
A basic test should be fine, because the matching code is in use
today.
I applied my two patches to QEMU and I did some testing
with SMP guests (4 VCPUs).

When the host's TSC is stable all the TSC offsets are matched
as expected:

[ 4544.779699] kvm: VCPU 0 new tsc generation 1, clock 0
[ 4544.799691] kvm: VCPU 1 matched tsc offset for 0
[ 4544.835687] kvm: VCPU 2 matched tsc offset for 0
[ 4544.882229] kvm: VCPU 3 matched tsc offset for 0
[ 4544.983740] kvm: VCPU 0 matched tsc offset for 0
[ 4545.015727] kvm: VCPU 1 matched tsc offset for 0
[ 4545.031762] kvm: VCPU 2 matched tsc offset for 0
[ 4545.043756] kvm: VCPU 3 matched tsc offset for 0
[ 4545.382113] kvm: VCPU 0 matched tsc offset for 0
[ 4545.382138] kvm: VCPU 1 matched tsc offset for 0
[ 4545.382155] kvm: VCPU 2 matched tsc offset for 0
[ 4545.382171] kvm: VCPU 3 matched tsc offset for 0

Regarding unstable TSC hosts, I created one by executing

TSCBSP=3D`rdmsr -d -p 0 16`; sleep 1s; wrmsr -p 0 16 $TSCBSP

as you suggested. After doing this KVM will adjust the TSC
offsets to make up for the deltas on the host side:

[ 5232.172074] kvm: VCPU 0 new tsc generation 1, clock 0
[ 5232.180759] kvm: SMP vm created on host with unstable TSC; guest=
=20
TSC will not be reliable
[ 5232.204069] kvm: VCPU 1 adjusted tsc offset by 105344
[ 5232.268070] kvm: VCPU 2 adjusted tsc offset by 210721
[ 5232.332066] kvm: VCPU 3 adjusted tsc offset by 210708
[ 5232.444127] kvm: VCPU 0 adjusted tsc offset by 368959
[ 5232.458448] kvm: VCPU 1 adjusted tsc offset by 47158
[ 5232.470400] kvm: VCPU 2 adjusted tsc offset by 39352
[ 5232.482359] kvm: VCPU 3 adjusted tsc offset by 39371
[ 5232.875343] kvm: VCPU 0 adjusted tsc offset by 1293878
[ 5232.875371] kvm: VCPU 1 adjusted tsc offset by 121
[ 5232.875392] kvm: VCPU 2 adjusted tsc offset by 69
[ 5232.875412] kvm: VCPU 3 adjusted tsc offset by 62

But despite KVM's efforts on the guest side the check in
check_tsc_sync_source() fails and the TSC is marked unstable:

tsc: Marking TSC unstable due to check_tsc_sync_source failed

By the way, without my patches applied to QEMU the end result
is the same:

[ 266.300068] kvm: VCPU 0 new tsc generation 1, clock 0
[ 266.308746] kvm: SMP vm created on host with unstable TSC; guest=
=20
TSC will not be reliable
[ 266.332067] kvm: VCPU 1 adjusted tsc offset by 105354
[ 266.396065] kvm: VCPU 2 adjusted tsc offset by 210714
[ 266.428273] kvm: VCPU 3 adjusted tsc offset by 106045

In other words, things seem to be working as expected.

- Fernando
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Fernando Luis Vázquez Cao
2013-12-12 02:52:22 UTC
Permalink
Post by Fernando Luis Vázquez Cao
On Fri, Dec 06, 2013 at 05:24:18PM +0900, Fernando Luis V=E1zquez Ca=
o=20
Post by Fernando Luis Vázquez Cao
Post by Fernando Luis Vázquez Cao
I also wanted to make sure that the initialization that we do
in kvm_arch_vcpu_postcreate on power up and the subsequent
TSC writeback work well together, but I didn't have time to
test it (reading the code, I would say that the TSC generation
counter may end up being increased a few times but the TSCs
would eventually converge).
A basic test should be fine, because the matching code is in use
today.
I applied my two patches to QEMU and I did some testing
with SMP guests (4 VCPUs).
By the way, do you want me to merge the two patches I sent into
one and resend?

Thanks,
=46ernando
Post by Fernando Luis Vázquez Cao
When the host's TSC is stable all the TSC offsets are matched
[ 4544.779699] kvm: VCPU 0 new tsc generation 1, clock 0
[ 4544.799691] kvm: VCPU 1 matched tsc offset for 0
[ 4544.835687] kvm: VCPU 2 matched tsc offset for 0
[ 4544.882229] kvm: VCPU 3 matched tsc offset for 0
[ 4544.983740] kvm: VCPU 0 matched tsc offset for 0
[ 4545.015727] kvm: VCPU 1 matched tsc offset for 0
[ 4545.031762] kvm: VCPU 2 matched tsc offset for 0
[ 4545.043756] kvm: VCPU 3 matched tsc offset for 0
[ 4545.382113] kvm: VCPU 0 matched tsc offset for 0
[ 4545.382138] kvm: VCPU 1 matched tsc offset for 0
[ 4545.382155] kvm: VCPU 2 matched tsc offset for 0
[ 4545.382171] kvm: VCPU 3 matched tsc offset for 0
Regarding unstable TSC hosts, I created one by executing
TSCBSP=3D`rdmsr -d -p 0 16`; sleep 1s; wrmsr -p 0 16 $TSCBSP
as you suggested. After doing this KVM will adjust the TSC
[ 5232.172074] kvm: VCPU 0 new tsc generation 1, clock 0
[ 5232.180759] kvm: SMP vm created on host with unstable TSC; gues=
t=20
Post by Fernando Luis Vázquez Cao
TSC will not be reliable
[ 5232.204069] kvm: VCPU 1 adjusted tsc offset by 105344
[ 5232.268070] kvm: VCPU 2 adjusted tsc offset by 210721
[ 5232.332066] kvm: VCPU 3 adjusted tsc offset by 210708
[ 5232.444127] kvm: VCPU 0 adjusted tsc offset by 368959
[ 5232.458448] kvm: VCPU 1 adjusted tsc offset by 47158
[ 5232.470400] kvm: VCPU 2 adjusted tsc offset by 39352
[ 5232.482359] kvm: VCPU 3 adjusted tsc offset by 39371
[ 5232.875343] kvm: VCPU 0 adjusted tsc offset by 1293878
[ 5232.875371] kvm: VCPU 1 adjusted tsc offset by 121
[ 5232.875392] kvm: VCPU 2 adjusted tsc offset by 69
[ 5232.875412] kvm: VCPU 3 adjusted tsc offset by 62
But despite KVM's efforts on the guest side the check in
tsc: Marking TSC unstable due to check_tsc_sync_source failed
By the way, without my patches applied to QEMU the end result
[ 266.300068] kvm: VCPU 0 new tsc generation 1, clock 0
[ 266.308746] kvm: SMP vm created on host with unstable TSC; gues=
t=20
Post by Fernando Luis Vázquez Cao
TSC will not be reliable
[ 266.332067] kvm: VCPU 1 adjusted tsc offset by 105354
[ 266.396065] kvm: VCPU 2 adjusted tsc offset by 210714
[ 266.428273] kvm: VCPU 3 adjusted tsc offset by 106045
In other words, things seem to be working as expected.
- Fernando
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Paolo Bonzini
2013-12-12 12:18:24 UTC
Permalink
Post by Fernando Luis Vázquez Cao
Post by Fernando Luis Vázquez Cao
On Fri, Dec 06, 2013 at 05:24:18PM +0900, Fernando Luis V=E1zquez C=
ao
Post by Fernando Luis Vázquez Cao
Post by Fernando Luis Vázquez Cao
Post by Fernando Luis Vázquez Cao
I also wanted to make sure that the initialization that we do
in kvm_arch_vcpu_postcreate on power up and the subsequent
TSC writeback work well together, but I didn't have time to
test it (reading the code, I would say that the TSC generation
counter may end up being increased a few times but the TSCs
would eventually converge).
A basic test should be fine, because the matching code is in use
today.
I applied my two patches to QEMU and I did some testing
with SMP guests (4 VCPUs).
=20
By the way, do you want me to merge the two patches I sent into
one and resend?
No, I now "swapped" them (reversed the order) and applied them to uq/ma=
ster.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Marcelo Tosatti
2013-12-05 16:12:34 UTC
Permalink
Post by Paolo Bonzini
VCPU TSC is not cleared by a warm reset (*), which leaves many Linu=
x
Post by Paolo Bonzini
guests vulnerable to the overflow in cyc2ns_offset fixed by upstrea=
m
Post by Paolo Bonzini
commit 9993bc635d01a6ee7f6b833b4ee65ce7c06350b1 ("sched/x86: Fix ov=
erflow
Post by Paolo Bonzini
in cyc2ns_offset").
=20
To put it in a nutshell, if a Linux guest without the patch above a=
pplied
Post by Paolo Bonzini
has been up more than 208 days and attempts a warm reset chances ar=
e that
Post by Paolo Bonzini
the newly booted kernel will panic or hang.
=20
(*) Intel Xeon E5 processors show the same broken behavior due to
the errata "TSC is Not Affected by Warm Reset" (Intel=AE Xeon=AE
Processor E5 Family Specification Update - August 2013): "The
TSC (Time Stamp Counter MSR 10H) should be cleared on
reset. Due to this erratum the TSC is not affected by warm
reset."
=20
=20
I agree that the bug is in QEMU. One small nit in your patch is that
you should reset env->tsc_adjust and env->tsc in x86_cpu_reset. This
would already be pretty good.
=20
However, a bigger problem is that env->tsc is a useless duplicate of
"cpu_get_ticks() + env->tsc_adjust". It would be nice to drop env->t=
sc
Post by Paolo Bonzini
completely except for migration backwards compatibility. Thus you ca=
=20
- fill in env->tsc as mentioned above from target-i386/machine.c's
cpu_pre_save function. This guarantees backwards compatibility.
=20
- add a function cpu_set_ticks(int64_t ticks) to cpus.c. The functio=
n
Post by Paolo Bonzini
does nothing if use_icount is true, otherwise it needs to have (rough=
ly)
Post by Paolo Bonzini
the opposite logic compared to cpu_get_ticks. You then call this
function from x86_cpu_reset instead of setting env->tsc. You can
similarly call this function from kvm_get_msrs.
=20
- add a function kvm_set_ticks(int64_t ticks) to kvm-all.c and
kvm-stub.c. For kvm-all.c it calls kvm_arch_set_ticks(CPUState *cpu,
int64_t ticks) in target-*/kvm.c. The kvm_arch_set_tsc() function ha=
s a
Post by Paolo Bonzini
dummy implementation for all architectures except x86. For x86 it ca=
lls
Post by Paolo Bonzini
KVM_SET_MSRS passing "ticks + env->tsc_offset".
=20
- call kvm_set_ticks() from cpu_set_ticks() and cpu_enable_ticks()
env->tsc is just a placeholder for the vcpu TSC.

A vcpus TSC from QEMU's point of view is a register initialized to zero=
,
which requires read/write from KVM, and migration.

Not sure what is the point of your idea.
Post by Paolo Bonzini
=20
Can you do this?
=20
Thanks,
=20
Paolo
=20
---
=20
--- qemu-orig/target-i386/kvm.c 2013-11-28 07:02:45.000000000 +0900
+++ qemu/target-i386/kvm.c 2013-12-05 14:47:03.085738175 +0900
@@ -1125,6 +1125,8 @@ static int kvm_put_msrs(X86CPU *cpu, int
kvm_msr_entry_set(&msrs[n++], MSR_VM_HSAVE_PA, env->vm_hsa=
ve);
Post by Paolo Bonzini
}
if (has_msr_tsc_adjust) {
+ if (level =3D=3D KVM_PUT_RESET_STATE)
+ env->tsc_adjust =3D 0;
kvm_msr_entry_set(&msrs[n++], MSR_TSC_ADJUST, env->tsc_adj=
ust);
Post by Paolo Bonzini
}
if (has_msr_misc_enable) {
@@ -1139,22 +1141,22 @@ static int kvm_put_msrs(X86CPU *cpu, int
kvm_msr_entry_set(&msrs[n++], MSR_LSTAR, env->lstar);
}
#endif
- if (level =3D=3D KVM_PUT_FULL_STATE) {
+ /*
+ * The following MSRs have side effects on the guest or are to=
o heavy
Post by Paolo Bonzini
+ * for normal writeback. Limit them to reset or full state upd=
ates.
Post by Paolo Bonzini
+ */
+ if (level >=3D KVM_PUT_RESET_STATE) {
+ if (level =3D=3D KVM_PUT_RESET_STATE)
+ env->tsc =3D 0;
/*
* KVM is yet unable to synchronize TSC values of multiple=
VCPUs on
Post by Paolo Bonzini
* writeback. Until this is fixed, we only write the offse=
t to SMP
Post by Paolo Bonzini
* guests after migration, desynchronizing the VCPUs, but =
avoiding
Post by Paolo Bonzini
* huge jump-backs that would occur without any writeback =
at all.
Post by Paolo Bonzini
*/
- if (smp_cpus =3D=3D 1 || env->tsc !=3D 0) {
+ if (smp_cpus =3D=3D 1 || env->tsc !=3D 0 || level =3D=3D K=
VM_PUT_RESET_STATE) {
Post by Paolo Bonzini
kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSC, env->tsc);
}
- }
- /*
- * The following MSRs have side effects on the guest or are to=
o heavy
Post by Paolo Bonzini
- * for normal writeback. Limit them to reset or full state upd=
ates.
Post by Paolo Bonzini
- */
- if (level >=3D KVM_PUT_RESET_STATE) {
kvm_msr_entry_set(&msrs[n++], MSR_KVM_SYSTEM_TIME,
env->system_time_msr);
kvm_msr_entry_set(&msrs[n++], MSR_KVM_WALL_CLOCK, env->wal=
l_clock_msr);
Post by Paolo Bonzini
=20
=20
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
=20
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Paolo Bonzini
2013-12-05 16:32:54 UTC
Permalink
Post by Marcelo Tosatti
Post by Paolo Bonzini
- call kvm_set_ticks() from cpu_set_ticks() and cpu_enable_ticks()
env->tsc is just a placeholder for the vcpu TSC.
A vcpus TSC from QEMU's point of view is a register initialized to zero,
which requires read/write from KVM, and migration.
QEMU already tracks the TSC in cpu_get_ticks(). So far this is used
only for TCG, but for example the code is there that preserves the TSC
when you stop/resume the VM and when you migrate the VM. Reset is not
yet there, which is a bug similar to the one Fernando is trying to solve
for KVM.

So, from QEMU's point of view the TSC should be a global value across
the whole system (timer_state.cpu_ticks_offset) + a per-VCPU TSC offset
(env->tsc_adjust). When talking to KVM, the per-VCPU TSC offset in turn
has two parts, both set with KVM_SET_MSRS: one is computed from
MSR_IA32_TSC, the other comes from MSR_IA32_TSC_ADJUST.

The point here would be to treat it as such.

With this change, env->tsc need not be migrated. The global value
timer_state.cpu_ticks_offset is migrated already. The host-side TSC
adjust can be computed from rdtsc()-timer_state.cpu_ticks_offset on the
destination machine and/or at reset time. The guest-side TSC adjust is
env->tsc_adjust as it is now.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Loading...