Discussion:
[Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes
Peter Xu
2018-05-04 03:08:01 UTC
v2:
- fix patchew code style warnings
- interval tree: postpone malloc when inserting; simplify node removal
a bit where appropriate [Jason]
- fix up comment and commit message for iommu lock patch [Kevin]
- protect context cache too using the iommu lock [Kevin, Jason]
- add a detailed comment in patch 8 to explain the modify-PTE problem
[Jason, Kevin]

Online repo:

https://github.com/xzpeter/qemu/tree/fix-vtd-dma

This series fixes several major problems that current code has:

- Issue 1: when receiving very big PSI UNMAP invalidations, the current
code may skip the notification, while actually we should always send
that notification.

- Issue 2: the IOTLB is not thread safe, while the block dataplane can
be accessing and updating it in parallel.

- Issue 3: for devices that registered UNMAP-only notifiers (e.g.,
vhost), we don't really need to do page walking for PSIs; we can
directly deliver the notification down.

- Issue 4: unsafe window for MAP-notified devices like vfio-pci (and,
in the future, vDPA as well). The problem is that currently, for
domain invalidations, we do the following to make sure the shadow
page tables are correctly synced:

1. unmap the whole address space
2. replay the whole address space, map existing pages

However, between steps 1 and 2 there is a tiny window (it can be as
long as 3ms) in which the shadow page table is either invalid or
incomplete (since we're rebuilding it). That's a fatal error, since
the device never knows this is happening and may still DMA to those
memory regions.

Patch 1 fixes issue 1. I put it first since it's picked from an old
post.

Patch 2 is a cleanup to remove useless IntelIOMMUNotifierNode struct.

Patch 3 fixes issue 2.

Patch 4 fixes issue 3.

Patches 5-9 fix issue 4. Here a very simple interval tree is
implemented based on GTree. It differs from a general interval tree
in that it does not allow the user to pass in private data (e.g.,
translated addresses). The benefit is that we can then merge adjacent
interval leaves, so hopefully we won't consume much memory even when
there are a lot of mappings (which happens for nested virt - when
mapping the whole L2 guest RAM range, it can be at least several GBs).
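
As a reference, here is a minimal standalone sketch of how the merging
behaves, written against the it_tree API as declared in patch 7
(include/qemu/interval-tree.h); it is only an illustration against this
series, not an existing QEMU API:

#include <stdio.h>
#include "qemu/interval-tree.h"

int main(void)
{
    ITTree *t = it_tree_new();
    ITRange *r;

    /* Two adjacent 4K ranges get merged into a single node internally */
    it_tree_insert(t, 0x0000, 0x0fff);
    it_tree_insert(t, 0x1000, 0x1fff);

    r = it_tree_find_value(t, 0x800);
    printf("merged range: 0x%llx-0x%llx\n", r->start, r->end); /* 0x0-0x1fff */

    it_tree_destroy(t);
    return 0;
}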

Patch 10 is another big cleanup that can only be done after patch 9.

Tests:

- device assignment to L1 and even L2 guests. With this series applied
(and the kernel IOMMU patches: https://lkml.org/lkml/2018/4/18/5),
we can even nest the vIOMMU now, e.g., we can specify a vIOMMU in the
L2 guest with assigned devices and things will work. We couldn't
before.

- vhost smoke test for regression.

Please review. Thanks,

Peter Xu (10):
intel-iommu: send PSI always even if across PDEs
intel-iommu: remove IntelIOMMUNotifierNode
intel-iommu: add iommu lock
intel-iommu: only do page walk for MAP notifiers
intel-iommu: introduce vtd_page_walk_info
intel-iommu: pass in address space when page walk
util: implement simple interval tree logic
intel-iommu: maintain per-device iova ranges
intel-iommu: don't unmap all for shadow page table
intel-iommu: remove notify_unmap for page walk

include/hw/i386/intel_iommu.h | 19 ++-
include/qemu/interval-tree.h | 130 +++++++++++++++
hw/i386/intel_iommu.c | 306 +++++++++++++++++++++++++---------
util/interval-tree.c | 208 +++++++++++++++++++++++
hw/i386/trace-events | 3 +-
util/Makefile.objs | 1 +
6 files changed, 579 insertions(+), 88 deletions(-)
create mode 100644 include/qemu/interval-tree.h
create mode 100644 util/interval-tree.c
--
2.17.0
Peter Xu
2018-05-04 03:08:02 UTC
During IOVA page table walking, there is a special case when the PSI
covers one whole PDE (Page Directory Entry, which contains 512 Page
Table Entries) or more. Previously we skipped that entry and did not
notify the IOMMU notifiers. This is not correct: we should send an UNMAP
notification to the registered UNMAP notifiers in this case.
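
As a quick illustration of the sizes involved (a standalone sketch, not
QEMU code; am is the address-mask order carried by the PSI, and the
512-PTE-per-PDE granularity matches the description above):

#include <stdio.h>
#include <inttypes.h>

#define PAGE_SIZE     4096ULL
#define PTES_PER_PDE  512ULL

int main(void)
{
    /* A PSI with address-mask order am invalidates 2^am 4K pages */
    for (unsigned am = 0; am <= 10; am++) {
        uint64_t size = (1ULL << am) * PAGE_SIZE;
        printf("am=%2u -> 0x%08" PRIx64 " bytes (%s one whole PDE)\n",
               am, size,
               size >= PTES_PER_PDE * PAGE_SIZE ? "covers at least"
                                                : "smaller than");
    }
    return 0;
}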

For UNMAP-only notifiers, this might leave IOTLB entries cached in the
devices even though they are already invalid. For MAP/UNMAP notifiers
like vfio-pci, this will cause stale page mappings.

This special case doesn't trigger often, but it is very easily
triggered by nested device assignment, since in that case we'll
possibly map the whole L2 guest RAM region into the device's IOVA
address space (several GBs at least), which is far bigger than normal
kernel driver usage of the device (normally tens of MBs).

Without this patch applied to L1 QEMU, nested device assignment to L2
guests will dump some errors like:

qemu-system-x86_64: VFIO_MAP_DMA: -17
qemu-system-x86_64: vfio_dma_map(0x557305420c30, 0xad000, 0x1000,
0x7f89a920d000) = -17 (File exists)

Acked-by: Jason Wang <***@redhat.com>
[peterx: rewrite the commit message]
Signed-off-by: Peter Xu <***@redhat.com>
---
hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++++------------
1 file changed, 30 insertions(+), 12 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index fb31de9416..b359efd6f9 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -722,6 +722,15 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,

typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);

+static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
+ vtd_page_walk_hook hook_fn, void *private)
+{
+ assert(hook_fn);
+ trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
+ entry->addr_mask, entry->perm);
+ return hook_fn(entry, private);
+}
+
/**
* vtd_page_walk_level - walk over specific level for IOVA range
*
@@ -781,28 +790,37 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
*/
entry_valid = read_cur | write_cur;

+ entry.target_as = &address_space_memory;
+ entry.iova = iova & subpage_mask;
+ entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
+ entry.addr_mask = ~subpage_mask;
+
if (vtd_is_last_slpte(slpte, level)) {
- entry.target_as = &address_space_memory;
- entry.iova = iova & subpage_mask;
/* NOTE: this is only meaningful if entry_valid == true */
entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
- entry.addr_mask = ~subpage_mask;
- entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
if (!entry_valid && !notify_unmap) {
trace_vtd_page_walk_skip_perm(iova, iova_next);
goto next;
}
- trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
- entry.addr_mask, entry.perm);
- if (hook_fn) {
- ret = hook_fn(&entry, private);
- if (ret < 0) {
- return ret;
- }
+ ret = vtd_page_walk_one(&entry, level, hook_fn, private);
+ if (ret < 0) {
+ return ret;
}
} else {
if (!entry_valid) {
- trace_vtd_page_walk_skip_perm(iova, iova_next);
+ if (notify_unmap) {
+ /*
+ * The whole entry is invalid; unmap it all.
+ * Translated address is meaningless, zero it.
+ */
+ entry.translated_addr = 0x0;
+ ret = vtd_page_walk_one(&entry, level, hook_fn, private);
+ if (ret < 0) {
+ return ret;
+ }
+ } else {
+ trace_vtd_page_walk_skip_perm(iova, iova_next);
+ }
goto next;
}
ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw), iova,
--
2.17.0
Peter Xu
2018-05-04 03:08:05 UTC
For UNMAP-only IOMMU notifiers, we don't really need to walk the page
tables. Speed up the procedure by skipping the page table walk. That
should boost performance for UNMAP-only notifiers like vhost.

Signed-off-by: Peter Xu <***@redhat.com>
---
include/hw/i386/intel_iommu.h | 2 ++
hw/i386/intel_iommu.c | 43 +++++++++++++++++++++++++++++++----
2 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index ee517704e7..9e0a6c1c6a 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -93,6 +93,8 @@ struct VTDAddressSpace {
IntelIOMMUState *iommu_state;
VTDContextCacheEntry context_cache_entry;
QLIST_ENTRY(VTDAddressSpace) next;
+ /* Superset of notifier flags that this address space has */
+ IOMMUNotifierFlag notifier_flags;
};

struct VTDBus {
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 112971638d..9a418abfb6 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -138,6 +138,12 @@ static inline void vtd_iommu_unlock(IntelIOMMUState *s)
qemu_mutex_unlock(&s->iommu_lock);
}

+/* Whether the address space needs to notify new mappings */
+static inline gboolean vtd_as_notify_mappings(VTDAddressSpace *as)
+{
+ return as->notifier_flags & IOMMU_NOTIFIER_MAP;
+}
+
/* GHashTable functions */
static gboolean vtd_uint64_equal(gconstpointer v1, gconstpointer v2)
{
@@ -1433,14 +1439,35 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
VTDAddressSpace *vtd_as;
VTDContextEntry ce;
int ret;
+ hwaddr size = (1 << am) * VTD_PAGE_SIZE;

QLIST_FOREACH(vtd_as, &(s->notifiers_list), next) {
ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
vtd_as->devfn, &ce);
if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
- vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
- vtd_page_invalidate_notify_hook,
- (void *)&vtd_as->iommu, true, s->aw_bits);
+ if (vtd_as_notify_mappings(vtd_as)) {
+ /*
+ * For MAP-inclusive notifiers, we need to walk the
+ * page table to sync the shadow page table.
+ */
+ vtd_page_walk(&ce, addr, addr + size,
+ vtd_page_invalidate_notify_hook,
+ (void *)&vtd_as->iommu, true, s->aw_bits);
+ } else {
+ /*
+ * For UNMAP-only notifiers, we don't need to walk the
+ * page tables. We just deliver the PSI down to
+ * invalidate caches.
+ */
+ IOMMUTLBEntry entry = {
+ .target_as = &address_space_memory,
+ .iova = addr,
+ .translated_addr = 0,
+ .addr_mask = size - 1,
+ .perm = IOMMU_NONE,
+ };
+ memory_region_notify_iommu(&vtd_as->iommu, entry);
+ }
}
}
}
@@ -2380,6 +2407,9 @@ static void vtd_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
exit(1);
}

+ /* Update per-address-space notifier flags */
+ vtd_as->notifier_flags = new;
+
if (old == IOMMU_NOTIFIER_NONE) {
/* Insert new ones */
QLIST_INSERT_HEAD(&s->notifiers_list, vtd_as, next);
@@ -2890,8 +2920,11 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
PCI_FUNC(vtd_as->devfn),
VTD_CONTEXT_ENTRY_DID(ce.hi),
ce.hi, ce.lo);
- vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
- s->aw_bits);
+ if (vtd_as_notify_mappings(vtd_as)) {
+ /* This is required only for MAP typed notifiers */
+ vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
+ s->aw_bits);
+ }
} else {
trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
PCI_FUNC(vtd_as->devfn));
--
2.17.0
Peter Xu
2018-05-04 03:08:03 UTC
That is not really necessary. Remove the node struct and put the list
entry directly into VTDAddressSpace. It simplifies the code a lot.

Signed-off-by: Peter Xu <***@redhat.com>
---
include/hw/i386/intel_iommu.h | 9 ++------
hw/i386/intel_iommu.c | 41 ++++++++++-------------------------
2 files changed, 14 insertions(+), 36 deletions(-)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 45ec8919b6..220697253f 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -67,7 +67,6 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
typedef struct VTDIrq VTDIrq;
typedef struct VTD_MSIMessage VTD_MSIMessage;
-typedef struct IntelIOMMUNotifierNode IntelIOMMUNotifierNode;

/* Context-Entry */
struct VTDContextEntry {
@@ -93,6 +92,7 @@ struct VTDAddressSpace {
MemoryRegion iommu_ir; /* Interrupt region: 0xfeeXXXXX */
IntelIOMMUState *iommu_state;
VTDContextCacheEntry context_cache_entry;
+ QLIST_ENTRY(VTDAddressSpace) next;
};

struct VTDBus {
@@ -253,11 +253,6 @@ struct VTD_MSIMessage {
/* When IR is enabled, all MSI/MSI-X data bits should be zero */
#define VTD_IR_MSI_DATA (0)

-struct IntelIOMMUNotifierNode {
- VTDAddressSpace *vtd_as;
- QLIST_ENTRY(IntelIOMMUNotifierNode) next;
-};
-
/* The iommu (DMAR) device state struct */
struct IntelIOMMUState {
X86IOMMUState x86_iommu;
@@ -295,7 +290,7 @@ struct IntelIOMMUState {
GHashTable *vtd_as_by_busptr; /* VTDBus objects indexed by PCIBus* reference */
VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
/* list of registered notifiers */
- QLIST_HEAD(, IntelIOMMUNotifierNode) notifiers_list;
+ QLIST_HEAD(, VTDAddressSpace) notifiers_list;

/* interrupt remapping */
bool intr_enabled; /* Whether guest enabled IR */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index b359efd6f9..5987b48d43 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1248,10 +1248,10 @@ static void vtd_interrupt_remap_table_setup(IntelIOMMUState *s)

static void vtd_iommu_replay_all(IntelIOMMUState *s)
{
- IntelIOMMUNotifierNode *node;
+ VTDAddressSpace *vtd_as;

- QLIST_FOREACH(node, &s->notifiers_list, next) {
- memory_region_iommu_replay_all(&node->vtd_as->iommu);
+ QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
+ memory_region_iommu_replay_all(&vtd_as->iommu);
}
}

@@ -1372,7 +1372,6 @@ static void vtd_iotlb_global_invalidate(IntelIOMMUState *s)

static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
{
- IntelIOMMUNotifierNode *node;
VTDContextEntry ce;
VTDAddressSpace *vtd_as;

@@ -1381,8 +1380,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_domain,
&domain_id);

- QLIST_FOREACH(node, &s->notifiers_list, next) {
- vtd_as = node->vtd_as;
+ QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
vtd_as->devfn, &ce) &&
domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
@@ -1402,12 +1400,11 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
uint16_t domain_id, hwaddr addr,
uint8_t am)
{
- IntelIOMMUNotifierNode *node;
+ VTDAddressSpace *vtd_as;
VTDContextEntry ce;
int ret;

- QLIST_FOREACH(node, &(s->notifiers_list), next) {
- VTDAddressSpace *vtd_as = node->vtd_as;
+ QLIST_FOREACH(vtd_as, &(s->notifiers_list), next) {
ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
vtd_as->devfn, &ce);
if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
@@ -2344,8 +2341,6 @@ static void vtd_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
{
VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
IntelIOMMUState *s = vtd_as->iommu_state;
- IntelIOMMUNotifierNode *node = NULL;
- IntelIOMMUNotifierNode *next_node = NULL;

if (!s->caching_mode && new & IOMMU_NOTIFIER_MAP) {
error_report("We need to set caching-mode=1 for intel-iommu to enable "
@@ -2354,21 +2349,11 @@ static void vtd_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
}

if (old == IOMMU_NOTIFIER_NONE) {
- node = g_malloc0(sizeof(*node));
- node->vtd_as = vtd_as;
- QLIST_INSERT_HEAD(&s->notifiers_list, node, next);
- return;
- }
-
- /* update notifier node with new flags */
- QLIST_FOREACH_SAFE(node, &s->notifiers_list, next, next_node) {
- if (node->vtd_as == vtd_as) {
- if (new == IOMMU_NOTIFIER_NONE) {
- QLIST_REMOVE(node, next);
- g_free(node);
- }
- return;
- }
+ /* Insert new ones */
+ QLIST_INSERT_HEAD(&s->notifiers_list, vtd_as, next);
+ } else if (new == IOMMU_NOTIFIER_NONE) {
+ /* Remove old ones */
+ QLIST_REMOVE(vtd_as, next);
}
}

@@ -2838,12 +2823,10 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)

static void vtd_address_space_unmap_all(IntelIOMMUState *s)
{
- IntelIOMMUNotifierNode *node;
VTDAddressSpace *vtd_as;
IOMMUNotifier *n;

- QLIST_FOREACH(node, &s->notifiers_list, next) {
- vtd_as = node->vtd_as;
+ QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
IOMMU_NOTIFIER_FOREACH(n, &vtd_as->iommu) {
vtd_address_space_unmap(vtd_as, n);
}
--
2.17.0
Peter Xu
2018-05-04 03:08:04 UTC
Add a per-IOMMU big lock to protect the IOMMU state. Currently the only
things to be protected are the IOTLB and the context cache, since they
can be accessed even without the BQL, e.g., in the IO dataplane.

Note that we don't need to protect the device page tables since they are
fully controlled by the guest kernel. However there is still the
possibility that a malicious driver will program the device to not obey
the rule. In that case QEMU can't really do anything useful; instead,
the guest itself will be responsible for all the uncertainties.

Reported-by: Fam Zheng <***@redhat.com>
Signed-off-by: Peter Xu <***@redhat.com>
---
include/hw/i386/intel_iommu.h | 6 +++++
hw/i386/intel_iommu.c | 43 +++++++++++++++++++++++++++++++----
2 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 220697253f..ee517704e7 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -300,6 +300,12 @@ struct IntelIOMMUState {
OnOffAuto intr_eim; /* Toggle for EIM cabability */
bool buggy_eim; /* Force buggy EIM unless eim=off */
uint8_t aw_bits; /* Host/IOVA address width (in bits) */
+
+ /*
+ * Protects IOMMU states in general. Currently it protects the
+ * per-IOMMU IOTLB cache, and context entry cache in VTDAddressSpace.
+ */
+ QemuMutex iommu_lock;
};

/* Find the VTD Address space associated with the given bus pointer,
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 5987b48d43..112971638d 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -128,6 +128,16 @@ static uint64_t vtd_set_clear_mask_quad(IntelIOMMUState *s, hwaddr addr,
return new_val;
}

+static inline void vtd_iommu_lock(IntelIOMMUState *s)
+{
+ qemu_mutex_lock(&s->iommu_lock);
+}
+
+static inline void vtd_iommu_unlock(IntelIOMMUState *s)
+{
+ qemu_mutex_unlock(&s->iommu_lock);
+}
+
/* GHashTable functions */
static gboolean vtd_uint64_equal(gconstpointer v1, gconstpointer v2)
{
@@ -172,7 +182,7 @@ static gboolean vtd_hash_remove_by_page(gpointer key, gpointer value,
}

/* Reset all the gen of VTDAddressSpace to zero and set the gen of
- * IntelIOMMUState to 1.
+ * IntelIOMMUState to 1. Must be with IOMMU lock held.
*/
static void vtd_reset_context_cache(IntelIOMMUState *s)
{
@@ -197,12 +207,19 @@ static void vtd_reset_context_cache(IntelIOMMUState *s)
s->context_cache_gen = 1;
}

-static void vtd_reset_iotlb(IntelIOMMUState *s)
+static void vtd_reset_iotlb_locked(IntelIOMMUState *s)
{
assert(s->iotlb);
g_hash_table_remove_all(s->iotlb);
}

+static void vtd_reset_iotlb(IntelIOMMUState *s)
+{
+ vtd_iommu_lock(s);
+ vtd_reset_iotlb_locked(s);
+ vtd_iommu_unlock(s);
+}
+
static uint64_t vtd_get_iotlb_key(uint64_t gfn, uint16_t source_id,
uint32_t level)
{
@@ -215,6 +232,7 @@ static uint64_t vtd_get_iotlb_gfn(hwaddr addr, uint32_t level)
return (addr & vtd_slpt_level_page_mask(level)) >> VTD_PAGE_SHIFT_4K;
}

+/* Must be with IOMMU lock held */
static VTDIOTLBEntry *vtd_lookup_iotlb(IntelIOMMUState *s, uint16_t source_id,
hwaddr addr)
{
@@ -235,6 +253,7 @@ out:
return entry;
}

+/* Must be with IOMMU lock held */
static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
uint16_t domain_id, hwaddr addr, uint64_t slpte,
uint8_t access_flags, uint32_t level)
@@ -246,7 +265,7 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
trace_vtd_iotlb_page_update(source_id, addr, slpte, domain_id);
if (g_hash_table_size(s->iotlb) >= VTD_IOTLB_MAX_SIZE) {
trace_vtd_iotlb_reset("iotlb exceeds size limit");
- vtd_reset_iotlb(s);
+ vtd_reset_iotlb_locked(s);
}

entry->gfn = gfn;
@@ -1106,7 +1125,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
IntelIOMMUState *s = vtd_as->iommu_state;
VTDContextEntry ce;
uint8_t bus_num = pci_bus_num(bus);
- VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
+ VTDContextCacheEntry *cc_entry;
uint64_t slpte, page_mask;
uint32_t level;
uint16_t source_id = vtd_make_source_id(bus_num, devfn);
@@ -1123,6 +1142,10 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
*/
assert(!vtd_is_interrupt_addr(addr));

+ vtd_iommu_lock(s);
+
+ cc_entry = &vtd_as->context_cache_entry;
+
/* Try to fetch slpte form IOTLB */
iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
if (iotlb_entry) {
@@ -1182,7 +1205,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
* IOMMU region can be swapped back.
*/
vtd_pt_enable_fast_path(s, source_id);
-
+ vtd_iommu_unlock(s);
return true;
}

@@ -1203,6 +1226,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
vtd_update_iotlb(s, source_id, VTD_CONTEXT_ENTRY_DID(ce.hi), addr, slpte,
access_flags, level);
out:
+ vtd_iommu_unlock(s);
entry->iova = addr & page_mask;
entry->translated_addr = vtd_get_slpte_addr(slpte, s->aw_bits) & page_mask;
entry->addr_mask = ~page_mask;
@@ -1210,6 +1234,7 @@ out:
return true;

error:
+ vtd_iommu_unlock(s);
entry->iova = 0;
entry->translated_addr = 0;
entry->addr_mask = 0;
@@ -1258,10 +1283,13 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
static void vtd_context_global_invalidate(IntelIOMMUState *s)
{
trace_vtd_inv_desc_cc_global();
+ /* Protects context cache */
+ vtd_iommu_lock(s);
s->context_cache_gen++;
if (s->context_cache_gen == VTD_CONTEXT_CACHE_GEN_MAX) {
vtd_reset_context_cache(s);
}
+ vtd_iommu_unlock(s);
vtd_switch_address_space_all(s);
/*
* From VT-d spec 6.5.2.1, a global context entry invalidation
@@ -1377,8 +1405,10 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)

trace_vtd_inv_desc_iotlb_domain(domain_id);

+ vtd_iommu_lock(s);
g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_domain,
&domain_id);
+ vtd_iommu_unlock(s);

QLIST_FOREACH(vtd_as, &s->notifiers_list, next) {
if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
@@ -1426,7 +1456,9 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
info.domain_id = domain_id;
info.addr = addr;
info.mask = ~((1 << am) - 1);
+ vtd_iommu_lock(s);
g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
+ vtd_iommu_unlock(s);
vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
}

@@ -3072,6 +3104,7 @@ static void vtd_realize(DeviceState *dev, Error **errp)
}

QLIST_INIT(&s->notifiers_list);
+ qemu_mutex_init(&s->iommu_lock);
memset(s->vtd_as_by_bus_num, 0, sizeof(s->vtd_as_by_bus_num));
memory_region_init_io(&s->csrmem, OBJECT(s), &vtd_mem_ops, s,
"intel_iommu", DMAR_REG_SIZE);
--
2.17.0
Peter Xu
2018-05-04 03:08:08 UTC
Introduce a very simple interval tree implementation based on GTree.
The current implementation is mostly tailored to maintaining and tracing
device-mapped IOVA ranges, but it might still be useful to other modules
in the future.

It is naive in that it does not even allow the user to pass in private
structs along with the ranges. However it's good in that, without such
information, the tree can merge adjacent ranges when necessary.

Signed-off-by: Peter Xu <***@redhat.com>
---
include/qemu/interval-tree.h | 130 ++++++++++++++++++++++
util/interval-tree.c | 208 +++++++++++++++++++++++++++++++++++
util/Makefile.objs | 1 +
3 files changed, 339 insertions(+)
create mode 100644 include/qemu/interval-tree.h
create mode 100644 util/interval-tree.c

diff --git a/include/qemu/interval-tree.h b/include/qemu/interval-tree.h
new file mode 100644
index 0000000000..14d364455b
--- /dev/null
+++ b/include/qemu/interval-tree.h
@@ -0,0 +1,130 @@
+/*
+ * A very simplified interval tree implementation based on GTree.
+ *
+ * Copyright 2018 Red Hat, Inc.
+ *
+ * Authors:
+ * Peter Xu <***@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ */
+#ifndef INTERVAL_TREE_H
+#define INTERVAL_TREE_H
+
+/*
+ * Currently the interval tree will only allow to keep ranges
+ * information, and no extra user data is allowed for each element. A
+ * benefit is that we can merge adjacent ranges internally within the
+ * tree. It can save a lot of memory when the ranges are split but
+ * mostly contiguous.
+ *
+ * Note that current implementation does not provide any thread
+ * protections. Callers of the interval tree should be responsible
+ * for the thread safety issue.
+ */
+
+#include <glib.h>
+
+#define IT_OK (0)
+#define IT_ERR_OVERLAP (-1)
+
+typedef unsigned long long ITValue;
+typedef struct ITTree ITTree;
+typedef gboolean (*it_tree_iterator)(ITValue start, ITValue end);
+
+struct ITRange {
+ ITValue start;
+ ITValue end;
+};
+typedef struct ITRange ITRange;
+
+/**
+ * it_tree_new:
+ *
+ * Create a new interval tree.
+ *
+ * Returns: the tree pointer when succeeded, or NULL if error.
+ */
+ITTree *it_tree_new(void);
+
+/**
+ * it_tree_insert:
+ *
+ * @tree: the interval tree to insert
+ * @start: the start of range, inclusive
+ * @end: the end of range, inclusive
+ *
+ * Insert an interval range into the tree. If it overlaps with any
+ * existing range, IT_ERR_OVERLAP will be returned.
+ *
+ * Return: 0 if succeeded, or <0 if error.
+ */
+int it_tree_insert(ITTree *tree, ITValue start, ITValue end);
+
+/**
+ * it_tree_remove:
+ *
+ * @tree: the interval tree to remove range from
+ * @start: the start of range, inclusive
+ * @end: the end of range, inclusive
+ *
+ * Remove a range from the tree. The range does not need to be
+ * exactly what was inserted; all the ranges that fall within the
+ * provided range will be removed from the tree.
+ *
+ * Return: 0 if succeeded, or <0 if error.
+ */
+int it_tree_remove(ITTree *tree, ITValue start, ITValue end);
+
+/**
+ * it_tree_find:
+ *
+ * @tree: the interval tree to search from
+ * @start: the start of range, inclusive
+ * @end: the end of range, inclusive
+ *
+ * Search for a range in the interval tree that overlaps with the
+ * range specified. Only the first found range will be returned.
+ *
+ * Return: ITRange if found, or NULL if not found. Note: the returned
+ * ITRange pointer is maintained internally. User should only read
+ * the content but never modify or free the content.
+ */
+ITRange *it_tree_find(ITTree *tree, ITValue start, ITValue end);
+
+/**
+ * it_tree_find_value:
+ *
+ * @tree: the interval tree to search from
+ * @value: the value to find
+ *
+ * Similar to it_tree_find(), but it tries to find range (value, value).
+ *
+ * Return: same as it_tree_find().
+ */
+ITRange *it_tree_find_value(ITTree *tree, ITValue value);
+
+/**
+ * it_tree_foreach:
+ *
+ * @tree: the interval tree to iterate on
+ * @iterator: the iterator for the ranges; return true to stop
+ *
+ * Iterate over all the ranges in the interval tree.
+ *
+ * Return: None.
+ */
+void it_tree_foreach(ITTree *tree, it_tree_iterator iterator);
+
+/**
+ * it_tree_destroy:
+ *
+ * @tree: the interval tree to destroy
+ *
+ * Destroy an existing interval tree.
+ *
+ * Return: None.
+ */
+void it_tree_destroy(ITTree *tree);
+
+#endif
diff --git a/util/interval-tree.c b/util/interval-tree.c
new file mode 100644
index 0000000000..3d2bb951aa
--- /dev/null
+++ b/util/interval-tree.c
@@ -0,0 +1,208 @@
+/*
+ * A very simplified interval tree implementation based on GTree.
+ *
+ * Copyright 2018 Red Hat, Inc.
+ *
+ * Authors:
+ * Peter Xu <***@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ */
+
+#include <glib.h>
+#include "qemu/interval-tree.h"
+
+/*
+ * Each element of the internal tree is an ITRange. It is shared
+ * between the key and value of the element, or we can see it a tree
+ * with keys only but no values.
+ */
+
+struct ITTree {
+ GTree *tree;
+};
+
+static int it_tree_compare(gconstpointer a, gconstpointer b, gpointer data)
+{
+ const ITRange *r1 = a, *r2 = b;
+
+ if (r1->start > r2->end) {
+ return 1;
+ }
+
+ if (r1->end < r2->start) {
+ return -1;
+ }
+
+ /* Overlapped */
+ return 0;
+}
+
+/* Find out intersection of range A and B, put into OUT */
+static inline void it_range_and(ITRange *out, ITRange *a, ITRange *b)
+{
+ out->start = MAX(a->start, b->start);
+ out->end = MIN(a->end, b->end);
+}
+
+static inline gboolean it_range_equal(ITRange *a, ITRange *b)
+{
+ return a->start == b->start && a->end == b->end;
+}
+
+/* Whether ITRange A is superset of B? */
+static inline gboolean it_range_cover(ITRange *a, ITRange *b)
+{
+ return a->start <= b->start && a->end >= b->end;
+}
+
+ITTree *it_tree_new(void)
+{
+ ITTree *ittree = g_new0(ITTree, 1);
+
+ /* We don't have values actually, no need to free */
+ ittree->tree = g_tree_new_full(it_tree_compare, NULL, g_free, NULL);
+
+ return ittree;
+}
+
+ITRange *it_tree_find(ITTree *tree, ITValue start, ITValue end)
+{
+ ITRange range;
+
+ g_assert(tree);
+
+ range.start = start;
+ range.end = end;
+
+ return g_tree_lookup(tree->tree, &range);
+}
+
+ITRange *it_tree_find_value(ITTree *tree, ITValue value)
+{
+ return it_tree_find(tree, value, value);
+}
+
+static inline void it_tree_insert_internal(GTree *gtree, ITRange *range)
+{
+ /* Key and value are sharing the same range data */
+ g_tree_insert(gtree, range, range);
+}
+
+int it_tree_insert(ITTree *tree, ITValue start, ITValue end)
+{
+ ITRange range, *new, *overlap;
+ GTree *gtree;
+
+ g_assert(tree);
+ g_assert(start <= end);
+
+ gtree = tree->tree;
+
+ range.start = start;
+ range.end = end;
+
+ /* We don't allow inserting a range that overlaps with existing ones */
+ if (g_tree_lookup(gtree, &range)) {
+ return IT_ERR_OVERLAP;
+ }
+
+ /* Merge left adjacent range */
+ overlap = it_tree_find_value(tree, start - 1);
+ if (overlap) {
+ range.start = overlap->start;
+ g_tree_remove(gtree, overlap);
+ }
+
+ /* Merge right adjacent range */
+ overlap = it_tree_find_value(tree, end + 1);
+ if (overlap) {
+ range.end = overlap->end;
+ g_tree_remove(gtree, overlap);
+ }
+
+ new = g_new0(ITRange, 1);
+ new->start = range.start;
+ new->end = range.end;
+ it_tree_insert_internal(gtree, new);
+
+ return IT_OK;
+}
+
+static gboolean it_tree_traverse(gpointer key, gpointer value,
+ gpointer data)
+{
+ it_tree_iterator iterator = data;
+ ITRange *range = key;
+
+ g_assert(key == value);
+
+ return iterator(range->start, range->end);
+}
+
+void it_tree_foreach(ITTree *tree, it_tree_iterator iterator)
+{
+ g_assert(tree && iterator);
+ g_tree_foreach(tree->tree, it_tree_traverse, iterator);
+}
+
+/* Remove subset `range', which is part of `overlap'. */
+static void it_tree_remove_subset(GTree *gtree, const ITRange *overlap,
+ const ITRange *range)
+{
+ ITRange *range1, *range2;
+
+ if (overlap->start < range->start) {
+ range1 = g_new0(ITRange, 1);
+ range1->start = overlap->start;
+ range1->end = range->start - 1;
+ } else {
+ range1 = NULL;
+ }
+ if (range->end < overlap->end) {
+ range2 = g_new0(ITRange, 1);
+ range2->start = range->end + 1;
+ range2->end = overlap->end;
+ } else {
+ range2 = NULL;
+ }
+
+ g_tree_remove(gtree, overlap);
+
+ if (range1) {
+ it_tree_insert_internal(gtree, range1);
+ }
+ if (range2) {
+ it_tree_insert_internal(gtree, range2);
+ }
+}
+
+int it_tree_remove(ITTree *tree, ITValue start, ITValue end)
+{
+ ITRange range = { .start = start, .end = end }, *overlap, and;
+ GTree *gtree;
+
+ g_assert(tree);
+
+ gtree = tree->tree;
+ while ((overlap = g_tree_lookup(gtree, &range))) {
+ if (it_range_cover(overlap, &range)) {
+ /* Split existing range into two if needed; done */
+ it_tree_remove_subset(gtree, overlap, &range);
+ break;
+ } else {
+ /* Remove intersection and continue */
+ it_range_and(&and, overlap, &range);
+ g_assert(and.start <= and.end);
+ it_tree_remove_subset(gtree, overlap, &and);
+ }
+ }
+
+ return IT_OK;
+}
+
+void it_tree_destroy(ITTree *tree)
+{
+ g_tree_destroy(tree->tree);
+ g_free(tree);
+}
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 728c3541db..4ac33910ed 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -47,4 +47,5 @@ util-obj-y += qht.o
util-obj-y += range.o
util-obj-y += stats64.o
util-obj-y += systemd.o
+util-obj-y += interval-tree.o
util-obj-$(CONFIG_LINUX) += vfio-helpers.o
--
2.17.0
Peter Xu
2018-05-04 03:08:07 UTC
Pass in the VTDAddressSpace instead of the aw bits when doing the page
walk. The VTDAddressSpace contains the aw bits information, and we'll
need more from it in the follow-up patches regarding the address
spaces.

Signed-off-by: Peter Xu <***@redhat.com>
---
hw/i386/intel_iommu.c | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index b2b2a0a441..83769f2b8c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -753,13 +753,13 @@ typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
* @hook_fn: hook func to be called when detected page
* @private: private data to be passed into hook func
* @notify_unmap: whether we should notify invalid entries
- * @aw: maximum address width
+ * @as: VT-d address space of the device
*/
typedef struct {
+ VTDAddressSpace *as;
vtd_page_walk_hook hook_fn;
void *private;
bool notify_unmap;
- uint8_t aw;
} vtd_page_walk_info;

static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
@@ -796,6 +796,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
uint64_t iova = start;
uint64_t iova_next;
int ret = 0;
+ uint8_t aw = info->as->iommu_state->aw_bits;

trace_vtd_page_walk_level(addr, level, start, end);

@@ -836,7 +837,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,

if (vtd_is_last_slpte(slpte, level)) {
/* NOTE: this is only meaningful if entry_valid == true */
- entry.translated_addr = vtd_get_slpte_addr(slpte, info->aw);
+ entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
if (!entry_valid && !info->notify_unmap) {
trace_vtd_page_walk_skip_perm(iova, iova_next);
goto next;
@@ -862,7 +863,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
}
goto next;
}
- ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, info->aw),
+ ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw),
iova, MIN(iova_next, end), level - 1,
read_cur, write_cur, info);
if (ret < 0) {
@@ -885,19 +886,20 @@ next:
* @end: IOVA range end address (start <= addr < end)
* @hook_fn: the hook that to be called for each detected area
* @private: private data for the hook function
- * @aw: maximum address width
+ * @as: the VT-d address space of the device
*/
static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
vtd_page_walk_hook hook_fn, void *private,
- bool notify_unmap, uint8_t aw)
+ bool notify_unmap, VTDAddressSpace *as)
{
dma_addr_t addr = vtd_ce_get_slpt_base(ce);
uint32_t level = vtd_ce_get_level(ce);
+ uint8_t aw = as->iommu_state->aw_bits;
vtd_page_walk_info info = {
.hook_fn = hook_fn,
.private = private,
.notify_unmap = notify_unmap,
- .aw = aw,
+ .as = as,
};

if (!vtd_iova_range_check(start, ce, aw)) {
@@ -1470,7 +1472,7 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
*/
vtd_page_walk(&ce, addr, addr + size,
vtd_page_invalidate_notify_hook,
- (void *)&vtd_as->iommu, true, s->aw_bits);
+ (void *)&vtd_as->iommu, true, vtd_as);
} else {
/*
* For UNMAP-only notifiers, we don't need to walk the
@@ -2941,7 +2943,7 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
if (vtd_as_notify_mappings(vtd_as)) {
/* This is required only for MAP typed notifiers */
vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
- s->aw_bits);
+ vtd_as);
}
} else {
trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
--
2.17.0
Peter Xu
2018-05-04 03:08:09 UTC
For each VTDAddressSpace, we now maintain which IOVA ranges we have
mapped and which we have not. With that information, we only send MAP
or UNMAP notifications when necessary: we don't send MAP notifies if we
know we have already mapped the range, and we don't send UNMAP notifies
if we know we never mapped the range at all.
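
The decision itself can be summarized by the following standalone
sketch, written against the it_tree API from the earlier interval tree
patch (the real logic lives in vtd_page_walk_one() in the diff below):

#include <stdio.h>
#include "qemu/interval-tree.h"

/* Decide whether a MAP/UNMAP notification for [iova, iova + mask] needs
 * to be sent, updating the tracking tree like vtd_page_walk_one() does. */
static int need_notify(ITTree *iova_tree, ITValue iova, ITValue mask, int map)
{
    ITRange *mapped = it_tree_find(iova_tree, iova, iova + mask);

    if (map) {
        if (mapped) {
            return 0;           /* already mapped: skip duplicate MAP */
        }
        it_tree_insert(iova_tree, iova, iova + mask);
    } else {
        if (!mapped) {
            return 0;           /* never mapped: skip useless UNMAP */
        }
        it_tree_remove(iova_tree, iova, iova + mask);
    }
    return 1;
}

int main(void)
{
    ITTree *t = it_tree_new();

    printf("MAP   0x0000: %d\n", need_notify(t, 0x0000, 0xfff, 1)); /* 1 */
    printf("MAP   0x0000: %d\n", need_notify(t, 0x0000, 0xfff, 1)); /* 0 */
    printf("UNMAP 0x1000: %d\n", need_notify(t, 0x1000, 0xfff, 0)); /* 0 */
    printf("UNMAP 0x0000: %d\n", need_notify(t, 0x0000, 0xfff, 0)); /* 1 */

    it_tree_destroy(t);
    return 0;
}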

Signed-off-by: Peter Xu <***@redhat.com>
---
include/hw/i386/intel_iommu.h | 2 +
hw/i386/intel_iommu.c | 73 +++++++++++++++++++++++++++++++++++
hw/i386/trace-events | 2 +
3 files changed, 77 insertions(+)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 9e0a6c1c6a..5e54f4fae2 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -27,6 +27,7 @@
#include "hw/i386/ioapic.h"
#include "hw/pci/msi.h"
#include "hw/sysbus.h"
+#include "qemu/interval-tree.h"

#define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
#define INTEL_IOMMU_DEVICE(obj) \
@@ -95,6 +96,7 @@ struct VTDAddressSpace {
QLIST_ENTRY(VTDAddressSpace) next;
/* Superset of notifier flags that this address space has */
IOMMUNotifierFlag notifier_flags;
+ ITTree *iova_tree; /* Traces mapped IOVA ranges */
};

struct VTDBus {
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 83769f2b8c..16b3514afb 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -765,12 +765,82 @@ typedef struct {
static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
vtd_page_walk_info *info)
{
+ VTDAddressSpace *as = info->as;
vtd_page_walk_hook hook_fn = info->hook_fn;
void *private = info->private;
+ ITRange *mapped = it_tree_find(as->iova_tree, entry->iova,
+ entry->iova + entry->addr_mask);

assert(hook_fn);
+
+ /* Update local IOVA mapped ranges */
+ if (entry->perm) {
+ if (mapped) {
+ /*
+ * Skip since we have already mapped this range.
+ *
+ * NOTE: here we didn't solve the modify-PTE problem. For
+ * example:
+ *
+ * (1) map iova1 to 4K page P1
+ * (2) send PSI on (iova1, iova1 + 4k)
+ * (3) modify iova1 to 4K page P2
+ * (4) send PSI on (iova1, iova1 + 4k)
+ *
+ * Logically QEMU as emulator should atomically modify the
+ * shadow page table PTE to follow what has been changed.
+ * However here actually we will never have a way to
+ * emulate this "atomic PTE modification" procedure.
+ * Because even we know that this PTE is changed we'll
+ * still need to unmap it first then remap it (please
+ * refer to VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA
+ * VFIO APIs as an example). And there is no such API to
+ * atomically update an IOMMU PTE on the host. Then we'll
+ * have a small window that we still have invalid page
+ * mapping (while ideally we shoudn't).
+ *
+ * With out current code, on step (4) we just ignored the
+ * PTE update and we'll skip the map update, which will
+ * leave stale mappings.
+ *
+ * Modifying PTEs won't happen on general OSs (e.g.,
+ * Linux) - it should be prohibited since the device may
+ * be using the page table during the modification then
+ * the behavior of modifying PTEs could be undefined.
+ * However it's still possible for other guests to do so.
+ *
+ * Let's mark this as TODO. After all it should never
+ * happen on general OSs like Linux/Windows/... Meanwhile
+ * even if the guest is running a private and problematic
+ * OS, the stale page (P1) will definitely still be a
+ * valid page for current L1 QEMU, so the worst case is
+ * that L1 guest will be at risk on its own on that single
+ * device only. It'll never harm the rest of L1 guest,
+ * the host memory or other guests on the host.
+ *
+ * Let's solve this once-and-for-all when we really
+ * needed, and when we are capable of (for now, we can't).
+ * Though I would suspect that won't happen soon.
+ */
+ trace_vtd_page_walk_one_skip_map(entry->iova, entry->addr_mask,
+ mapped->start, mapped->end);
+ return 0;
+ }
+ it_tree_insert(as->iova_tree, entry->iova,
+ entry->iova + entry->addr_mask);
+ } else {
+ if (!mapped) {
+ /* Skip since we didn't map this range at all */
+ trace_vtd_page_walk_one_skip_unmap(entry->iova, entry->addr_mask);
+ return 0;
+ }
+ it_tree_remove(as->iova_tree, entry->iova,
+ entry->iova + entry->addr_mask);
+ }
+
trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
entry->addr_mask, entry->perm);
+
return hook_fn(entry, private);
}

@@ -2804,6 +2874,7 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
vtd_dev_as->devfn = (uint8_t)devfn;
vtd_dev_as->iommu_state = s;
vtd_dev_as->context_cache_entry.context_cache_gen = 0;
+ vtd_dev_as->iova_tree = it_tree_new();

/*
* Memory region relationships looks like (Address range shows
@@ -2900,6 +2971,8 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
VTD_PCI_FUNC(as->devfn),
entry.iova, size);

+ it_tree_remove(as->iova_tree, entry.iova, entry.iova + entry.addr_mask);
+
memory_region_notify_one(n, &entry);
}

diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 22d44648af..677f83420d 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -40,6 +40,8 @@ vtd_replay_ce_valid(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t domain, uint6
vtd_replay_ce_invalid(uint8_t bus, uint8_t dev, uint8_t fn) "replay invalid context device %02"PRIx8":%02"PRIx8".%02"PRIx8
vtd_page_walk_level(uint64_t addr, uint32_t level, uint64_t start, uint64_t end) "walk (base=0x%"PRIx64", level=%"PRIu32") iova range 0x%"PRIx64" - 0x%"PRIx64
vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, int perm) "detected page level 0x%"PRIx32" iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64" perm %d"
+vtd_page_walk_one_skip_map(uint64_t iova, uint64_t mask, uint64_t start, uint64_t end) "iova 0x%"PRIx64" mask 0x%"PRIx64" start 0x%"PRIx64" end 0x%"PRIx64
+vtd_page_walk_one_skip_unmap(uint64_t iova, uint64_t mask) "iova 0x%"PRIx64" mask 0x%"PRIx64
vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
--
2.17.0
Peter Xu
2018-05-04 03:08:06 UTC
During the recursive walking of IOVA page tables, some stack variables
are constant and never change during the whole page walking procedure.
Isolate them into a struct so that we don't need to pass those constants
down the stack at every level.

Signed-off-by: Peter Xu <***@redhat.com>
---
hw/i386/intel_iommu.c | 56 ++++++++++++++++++++++++++++---------------
1 file changed, 37 insertions(+), 19 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 9a418abfb6..b2b2a0a441 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -747,9 +747,27 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,

typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);

+/**
+ * Constant information used during page walking
+ *
+ * @hook_fn: hook func to be called when detected page
+ * @private: private data to be passed into hook func
+ * @notify_unmap: whether we should notify invalid entries
+ * @aw: maximum address width
+ */
+typedef struct {
+ vtd_page_walk_hook hook_fn;
+ void *private;
+ bool notify_unmap;
+ uint8_t aw;
+} vtd_page_walk_info;
+
static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
- vtd_page_walk_hook hook_fn, void *private)
+ vtd_page_walk_info *info)
{
+ vtd_page_walk_hook hook_fn = info->hook_fn;
+ void *private = info->private;
+
assert(hook_fn);
trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
entry->addr_mask, entry->perm);
@@ -762,17 +780,13 @@ static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
* @addr: base GPA addr to start the walk
* @start: IOVA range start address
* @end: IOVA range end address (start <= addr < end)
- * @hook_fn: hook func to be called when detected page
- * @private: private data to be passed into hook func
* @read: whether parent level has read permission
* @write: whether parent level has write permission
- * @notify_unmap: whether we should notify invalid entries
- * @aw: maximum address width
+ * @info: constant information for the page walk
*/
static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
- uint64_t end, vtd_page_walk_hook hook_fn,
- void *private, uint32_t level, bool read,
- bool write, bool notify_unmap, uint8_t aw)
+ uint64_t end, uint32_t level, bool read,
+ bool write, vtd_page_walk_info *info)
{
bool read_cur, write_cur, entry_valid;
uint32_t offset;
@@ -822,24 +836,24 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,

if (vtd_is_last_slpte(slpte, level)) {
/* NOTE: this is only meaningful if entry_valid == true */
- entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
- if (!entry_valid && !notify_unmap) {
+ entry.translated_addr = vtd_get_slpte_addr(slpte, info->aw);
+ if (!entry_valid && !info->notify_unmap) {
trace_vtd_page_walk_skip_perm(iova, iova_next);
goto next;
}
- ret = vtd_page_walk_one(&entry, level, hook_fn, private);
+ ret = vtd_page_walk_one(&entry, level, info);
if (ret < 0) {
return ret;
}
} else {
if (!entry_valid) {
- if (notify_unmap) {
+ if (info->notify_unmap) {
/*
* The whole entry is invalid; unmap it all.
* Translated address is meaningless, zero it.
*/
entry.translated_addr = 0x0;
- ret = vtd_page_walk_one(&entry, level, hook_fn, private);
+ ret = vtd_page_walk_one(&entry, level, info);
if (ret < 0) {
return ret;
}
@@ -848,10 +862,9 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
}
goto next;
}
- ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw), iova,
- MIN(iova_next, end), hook_fn, private,
- level - 1, read_cur, write_cur,
- notify_unmap, aw);
+ ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, info->aw),
+ iova, MIN(iova_next, end), level - 1,
+ read_cur, write_cur, info);
if (ret < 0) {
return ret;
}
@@ -880,6 +893,12 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
{
dma_addr_t addr = vtd_ce_get_slpt_base(ce);
uint32_t level = vtd_ce_get_level(ce);
+ vtd_page_walk_info info = {
+ .hook_fn = hook_fn,
+ .private = private,
+ .notify_unmap = notify_unmap,
+ .aw = aw,
+ };

if (!vtd_iova_range_check(start, ce, aw)) {
return -VTD_FR_ADDR_BEYOND_MGAW;
@@ -890,8 +909,7 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
end = vtd_iova_limit(ce, aw);
}

- return vtd_page_walk_level(addr, start, end, hook_fn, private,
- level, true, true, notify_unmap, aw);
+ return vtd_page_walk_level(addr, start, end, level, true, true, &info);
}

/* Map a device to its corresponding domain (context-entry) */
--
2.17.0
Peter Xu
2018-05-04 03:08:10 UTC
IOMMU replay was used before in many cases, e.g., context cache
invalidations and domain flushes. We used this mechanism to sync the
shadow page table by (1) unmapping the whole address space, then
(2) walking the page table to remap what's in the table.

This is very dangerous.

The problem is that there is a very small window (in my measurement,
it can be about 3ms) between steps (1) and (2) during which the device
sees no (or an incomplete) page table. However the device never knows
that. This can cause DMA errors for devices that assume the page
table is always there.

So the point is that, for MAP-typed notifiers (vfio-pci, for example),
the mapped page entries must always be there. We can never unmap any
existing page entries like we did in (1) above.

The only solution is to remove step (1). We couldn't do that before
since we didn't know which device pages were mapped and which were not,
so we unmapped them all. Now with the new IOVA tree QEMU knows what has
been mapped and what has not, so we don't need step (1) any more.
Remove it.

Note that after removing that global unmap flushing, we now need to
notify unmaps during the page walk.

This should fix the DMA error problem that Jintack Lim reported with
nested device assignment. The problem doesn't always happen, e.g., I
cannot reproduce the error myself. However the collected logs show that
this is the likely cause of Jintack's problem.

Reported-by: Jintack Lim <***@cs.columbia.edu>
Signed-off-by: Peter Xu <***@redhat.com>
---
hw/i386/intel_iommu.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 16b3514afb..9439103cac 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3003,10 +3003,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)

/*
* The replay can be triggered by either a invalidation or a newly
- * created entry. No matter what, we release existing mappings
- * (it means flushing caches for UNMAP-only registers).
+ * created entry.
*/
- vtd_address_space_unmap(vtd_as, n);

if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
@@ -3015,8 +3013,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
ce.hi, ce.lo);
if (vtd_as_notify_mappings(vtd_as)) {
/* This is required only for MAP typed notifiers */
- vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
+ vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, true,
vtd_as);
+ } else {
+ vtd_address_space_unmap(vtd_as, n);
}
} else {
trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
--
2.17.0
Peter Xu
2018-05-04 03:08:11 UTC
After the previous changes, we will always pass that parameter as
true. Remove it. After removing that variable, we can greatly
simplify the page walking logic.

No functional change at all.

Signed-off-by: Peter Xu <***@redhat.com>
---
hw/i386/intel_iommu.c | 70 +++++++++++++++++--------------------------
hw/i386/trace-events | 1 -
2 files changed, 28 insertions(+), 43 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 9439103cac..c156235135 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -759,7 +759,6 @@ typedef struct {
VTDAddressSpace *as;
vtd_page_walk_hook hook_fn;
void *private;
- bool notify_unmap;
} vtd_page_walk_info;

static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
@@ -900,45 +899,34 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
*/
entry_valid = read_cur | write_cur;

- entry.target_as = &address_space_memory;
- entry.iova = iova & subpage_mask;
- entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
- entry.addr_mask = ~subpage_mask;
-
- if (vtd_is_last_slpte(slpte, level)) {
- /* NOTE: this is only meaningful if entry_valid == true */
- entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
- if (!entry_valid && !info->notify_unmap) {
- trace_vtd_page_walk_skip_perm(iova, iova_next);
- goto next;
- }
- ret = vtd_page_walk_one(&entry, level, info);
- if (ret < 0) {
- return ret;
- }
- } else {
- if (!entry_valid) {
- if (info->notify_unmap) {
- /*
- * The whole entry is invalid; unmap it all.
- * Translated address is meaningless, zero it.
- */
- entry.translated_addr = 0x0;
- ret = vtd_page_walk_one(&entry, level, info);
- if (ret < 0) {
- return ret;
- }
- } else {
- trace_vtd_page_walk_skip_perm(iova, iova_next);
- }
- goto next;
- }
+ if (!vtd_is_last_slpte(slpte, level) && entry_valid) {
+ /*
+ * This is a valid PDE (or even bigger than PDE). We need
+ * to walk one further level.
+ */
ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw),
iova, MIN(iova_next, end), level - 1,
read_cur, write_cur, info);
- if (ret < 0) {
- return ret;
- }
+ } else {
+ /*
+ * This means we are either:
+ *
+ * (1) the real page entry (either 4K page, or huge page)
+ * (2) the whole range is invalid
+ *
+ * In either case, we send an IOTLB notification down.
+ */
+ entry.target_as = &address_space_memory;
+ entry.iova = iova & subpage_mask;
+ entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
+ entry.addr_mask = ~subpage_mask;
+ /* NOTE: this is only meaningful if entry_valid == true */
+ entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
+ ret = vtd_page_walk_one(&entry, level, info);
+ }
+
+ if (ret < 0) {
+ return ret;
}

next:
@@ -960,7 +948,7 @@ next:
*/
static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
vtd_page_walk_hook hook_fn, void *private,
- bool notify_unmap, VTDAddressSpace *as)
+ VTDAddressSpace *as)
{
dma_addr_t addr = vtd_ce_get_slpt_base(ce);
uint32_t level = vtd_ce_get_level(ce);
@@ -968,7 +956,6 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
vtd_page_walk_info info = {
.hook_fn = hook_fn,
.private = private,
- .notify_unmap = notify_unmap,
.as = as,
};

@@ -1542,7 +1529,7 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
*/
vtd_page_walk(&ce, addr, addr + size,
vtd_page_invalidate_notify_hook,
- (void *)&vtd_as->iommu, true, vtd_as);
+ (void *)&vtd_as->iommu, vtd_as);
} else {
/*
* For UNMAP-only notifiers, we don't need to walk the
@@ -3013,8 +3000,7 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
ce.hi, ce.lo);
if (vtd_as_notify_mappings(vtd_as)) {
/* This is required only for MAP typed notifiers */
- vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, true,
- vtd_as);
+ vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, vtd_as);
} else {
vtd_address_space_unmap(vtd_as, n);
}
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 677f83420d..eb23ace5fb 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -43,7 +43,6 @@ vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, in
vtd_page_walk_one_skip_map(uint64_t iova, uint64_t mask, uint64_t start, uint64_t end) "iova 0x%"PRIx64" mask 0x%"PRIx64" start 0x%"PRIx64" end 0x%"PRIx64
vtd_page_walk_one_skip_unmap(uint64_t iova, uint64_t mask) "iova 0x%"PRIx64" mask 0x%"PRIx64
vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
-vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
vtd_as_unmap_whole(uint8_t bus, uint8_t slot, uint8_t fn, uint64_t iova, uint64_t size) "Device %02x:%02x.%x start 0x%"PRIx64" size 0x%"PRIx64
--
2.17.0
n***@patchew.org
2018-05-04 03:20:33 UTC
Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20180504030811.28111-1-***@redhat.com
Subject: [Qemu-devel] [PATCH v2 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
failed=1
echo
fi
n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
t [tag update] patchew/***@gimli.home -> patchew/***@gimli.home
* [new tag] patchew/20180504030811.28111-1-***@redhat.com -> patchew/20180504030811.28111-1-***@redhat.com
Switched to a new branch 'test'
55b6c6b1f5 intel-iommu: remove notify_unmap for page walk
72ffe9314a intel-iommu: don't unmap all for shadow page table
7bb663682b intel-iommu: maintain per-device iova ranges
e77921e63f util: implement simple interval tree logic
356489cd59 intel-iommu: pass in address space when page walk
27fcd3062f intel-iommu: introduce vtd_page_walk_info
953a6db666 intel-iommu: only do page walk for MAP notifiers
0ccfedc1e7 intel-iommu: add iommu lock
47a29fd165 intel-iommu: remove IntelIOMMUNotifierNode
aacd552d3a intel-iommu: send PSI always even if across PDEs

=== OUTPUT BEGIN ===
Checking PATCH 1/10: intel-iommu: send PSI always even if across PDEs...
Checking PATCH 2/10: intel-iommu: remove IntelIOMMUNotifierNode...
Checking PATCH 3/10: intel-iommu: add iommu lock...
Checking PATCH 4/10: intel-iommu: only do page walk for MAP notifiers...
Checking PATCH 5/10: intel-iommu: introduce vtd_page_walk_info...
Checking PATCH 6/10: intel-iommu: pass in address space when page walk...
Checking PATCH 7/10: util: implement simple interval tree logic...
ERROR: space prohibited between function name and open parenthesis '('
#56: FILE: include/qemu/interval-tree.h:33:
+typedef gboolean (*it_tree_iterator)(ITValue start, ITValue end);

total: 1 errors, 0 warnings, 343 lines checked

Your patch has style problems, please review. If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 8/10: intel-iommu: maintain per-device iova ranges...
Checking PATCH 9/10: intel-iommu: don't unmap all for shadow page table...
Checking PATCH 10/10: intel-iommu: remove notify_unmap for page walk...
=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Ple
Peter Xu
2018-05-04 03:40:42 UTC
On Thu, May 03, 2018 at 08:20:33PM -0700, no-***@patchew.org wrote:

[...]
Post by n***@patchew.org
=== OUTPUT BEGIN ===
Checking PATCH 1/10: intel-iommu: send PSI always even if across PDEs...
Checking PATCH 2/10: intel-iommu: remove IntelIOMMUNotifierNode...
Checking PATCH 3/10: intel-iommu: add iommu lock...
Checking PATCH 4/10: intel-iommu: only do page walk for MAP notifiers...
Checking PATCH 5/10: intel-iommu: introduce vtd_page_walk_info...
Checking PATCH 6/10: intel-iommu: pass in address space when page walk...
Checking PATCH 7/10: util: implement simple interval tree logic...
ERROR: space prohibited between function name and open parenthesis '('
+typedef gboolean (*it_tree_iterator)(ITValue start, ITValue end);
total: 1 errors, 0 warnings, 343 lines checked
Your patch has style problems, please review. If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
This is a false positive, and should be resolved after the below patch
is merged (now in Paolo's queue):

[PATCH v2] checkpatch.pl: add common glib defines to typelist
--
Peter Xu
Peter Xu
2018-05-16 06:30:09 UTC
Post by Peter Xu
- fix patchew code style warnings
- interval tree: postpone malloc when inserting; simplify node remove
a bit where proper [Jason]
- fix up comment and commit message for iommu lock patch [Kevin]
- protect context cache too using the iommu lock [Kevin, Jason]
- add vast comment in patch 8 to explain the modify-PTE problem
[Jason, Kevin]
We can hold off a bit on reviewing this series. Jintack reported an
scp DMAR issue that might happen even with an L1 guest with this
series: the scp can stall randomly after copying tens or hundreds of
MBs. I'm still investigating the problem. It should be related to the
deferred flushing of the VT-d kernel driver, since the problem goes
away if we use "intel_iommu=on,strict". However I'm still trying to
figure out what's going on behind the scenes even with that deferred
flushing feature.

Meanwhile, during the investigation I found another "possibly valid"
use case for the modify-PTE problem that Jason mentioned, when
deferred flushing is in use:

vcpu1                           vcpu2
map page A
explicitly send PSI for A
queue unmap page A [1]
                                map the same page A [2]
                                explicitly send PSI for A [3]
flush unmap page A [4]

Due to deferred flushing, the UNMAP PSI might be postponed (or it can
finally become a DSI) from step [1] to step [4]. If we allocate the
same page somewhere else, we might trigger this modify-PTE at [2],
since we haven't yet received the deferred PSI to unmap A from vcpu1.

Note that this will not happen with the latest upstream Linux, since
the IOVA allocation algorithm in the current Linux kernel makes sure
that the IOVA range won't be freed until [4], so we can't really
allocate the same page address at [2]. However this makes me tend to
agree with Jason and Kevin's worry about future potential issues if
this can be triggered easily by common guest kernel bugs. So now I'm
considering dropping my mergeable interval tree and just using a
simpler tree to cache everything, including translated addresses. The
metadata will possibly take 2% of managed memory with that.
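
(A rough sanity check of that estimate, assuming each cached 4K mapping
needs on the order of 80 bytes for the IOVA range, the translated
address and the GTree node overhead: 80 / 4096 is roughly 2% of the
memory being tracked.)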
--
Peter Xu
Jason Wang
2018-05-16 13:57:40 UTC
Post by Peter Xu
Post by Peter Xu
- fix patchew code style warnings
- interval tree: postpone malloc when inserting; simplify node remove
a bit where proper [Jason]
- fix up comment and commit message for iommu lock patch [Kevin]
- protect context cache too using the iommu lock [Kevin, Jason]
- add vast comment in patch 8 to explain the modify-PTE problem
[Jason, Kevin]
We can hold a bit on reviewing this series. Jintack reported a scp
DMAR issue that might happen even with L1 guest with this series, and
the scp can stall after copied tens or hundreds of MBs randomly. I'm
still investigating the problem. This problem should be related to
deferred flushing of VT-d kernel driver, since the problem will go
away if we use "intel_iommu=on,strict". However I'm still trying to
figure out what's the thing behind the scene even with that deferred
flushing feature.
I vaguely remember that recent upstream vfio supports delayed flush;
maybe it's related.
Post by Peter Xu
Meanwhile, during the investigation I found another "possibly valid"
use case about the modify-PTE problem that Jason has mentioned when
vcpu1 vcpu2
map page A
explicitly send PSI for A
queue unmap page A [1]
map the same page A [2]
explcitly send PSI for A [3]
flush unmap page A [4]
Due to deferred flushing, the UNMAP PSI might be postponed (or it can
be finally a DSI) from step [1] to step [4]. If we allocate the same
page somewhere else, we might trigger this modify-PTE at [2] since we
haven't yet received the deferred PSI to unmap A from vcpu1.
Note that this will not happen with latest upstream Linux, since the
IOVA allocation algorithm in current Linux kernel made sure that the
IOVA range won't be freed until [4], so we can't really allocate the
same page address at [2].
Yes, so the vfio + vIOMMU work will probably uncover more bugs in the
IOMMU driver (especially CM mode). I suspect CM mode has not had
sufficient testing (since it probably wasn't used in any production
environment before the vfio + vIOMMU work).
Post by Peter Xu
However this let me tend to agree with
Jason and Kevin's worry on future potential issues if that can be
triggered easily by common guest kernel bugs. So now I'm considering
to drop my mergable interval tree but just use a simpler tree to cache
everything including translated addresses. The metadata will possibly
take 2% of managed memory if with that.
Good to know. We can start from the approach we know is correct for
sure, then do optimizations on top.

Thanks
