Discussion:
[Qemu-devel] [RESEND PATCH 0/6] Introduce new iommu notifier framework
Liu, Yi L
2017-11-03 12:01:50 UTC
Hi,

Resending due to the build error reported by the auto-test. There is no
functional change compared with the previous posting.

This patchset is a follow-up of Peter Xu's patchset linked below.
In brief, Peter's patchset introduces a common IOMMU object which
does not depend on the platform (x86/ppc/...) or the bus (PCI/...). On
top of it, an IOMMUObject-based notifier framework is introduced, and
AddressSpaceOps is added to provide methods such as getting the
IOMMUObject behind an AddressSpace. This can be used to detect whether
a vIOMMU is exposed.

https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg05360.html

Let me try to explain why we need such a change.

I'm working on virt-SVM enabling for passthrough devices on the Intel
platform. This work extends the existing Intel IOMMU emulator in QEMU.
Among the extensions, there are two requirements related to the topic
discussed here.

* The Intel IOMMU emulator needs to propagate a guest pasid table pointer
to the host through VFIO, so that the host Intel IOMMU driver can set it
in its context table. With the guest pasid table pointer set, the host is
able to get the guest CR3 table after the guest calls intel_svm_bind_mm().
The HW IOMMU can then do nested translation, GVA->GPA followed by GPA->HPA,
which enables Shared Virtual Memory in the guest.

* The Intel IOMMU emulator needs to propagate the guest's iotlb (1st level
cache) flush to the host through VFIO.

Since both requirements need to talk to VFIO, notifiers are needed.
Meanwhile, the notifiers should be registered as long as a vIOMMU is
exposed to the guest.

QEMU has an existing notifier framework based on MemoryRegion, and we
are using it for MAP/UNMAP. However, we cannot use it here, for the
following reasons:

* IOMMU MemoryRegion notifiers depend on an IOMMU MemoryRegion. If the guest
iommu driver configures the device to bypass IOVA address translation, the
address space is the system RAM address space and the MemoryRegion is the
RAM MemoryRegion. Details can be found in Peter's patch allowing dynamic
switching of the IOMMU region.
https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg02690.html

* virt-SVM requires the guest to configure the device to bypass IOVA address
translation. With such a config, we can make sure the host has a GPA->HPA
mapping, and meanwhile the Intel IOMMU emulator can propagate the guest CR3
table (GVA->GPA) to the host. With nested translation, we are able to achieve
GVA->GPA and then GPA->HPA translation. However, in that case the IOMMU
MemoryRegion notifiers would not be registered, which means virt-SVM needs
another notifier framework.

Based on Peter's patch, I did some cleanup, completed the notifier
framework based on IOMMUObject, and also provided an example of the newly
introduced notifier framework. The notifier framework introduced here
is going to be used in my virt-SVM patchset.
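
To make the usage concrete, here is a rough sketch of the consumer side
(error handling omitted; vdev->pasidtbl_n is only a placeholder field for
illustration, while the real registration lives in the vfio/pci patches of
this series):

    AddressSpace *as = pci_device_iommu_address_space(pdev);
    IOMMUObject *iommu = address_space_iommu_get(as);

    if (iommu) {
        /* A vIOMMU is exposed to the guest, so listen for guest pasid
         * table bind events and forward them to the host through VFIO. */
        iommu_notifier_register(iommu, &vdev->pasidtbl_n,
                                vfio_iommu_bind_pasidtbl_notify,
                                IOMMU_EVENT_BIND_PASIDT);
    }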

For virt-SVM design details, you may refer to the SVM RFC patches:
https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg04925.html

Liu, Yi L (3):
vfio: rename GuestIOMMU to be GuestIOMMUMR
vfio/pci: add notify framework based on IOMMUObject
vfio/pci: register vfio_iommu_bind_pasidtbl_notify notifier

Peter Xu (3):
memory: rename existing iommu notifier to be iommu mr notifier
memory: introduce AddressSpaceOps and IOMMUObject
intel_iommu: provide AddressSpaceOps.iommu_get instance

hw/core/Makefile.objs | 1 +
hw/core/iommu.c | 58 ++++++++++++++++++++++++++++++++
hw/i386/amd_iommu.c | 6 ++--
hw/i386/intel_iommu.c | 41 +++++++++++++----------
hw/ppc/spapr_iommu.c | 8 ++---
hw/s390x/s390-pci-bus.c | 2 +-
hw/vfio/common.c | 25 +++++++-------
hw/vfio/pci.c | 53 ++++++++++++++++++++++++++++-
hw/virtio/vhost.c | 10 +++---
include/exec/memory.h | 77 ++++++++++++++++++++++++++++---------------
include/hw/core/iommu.h | 73 ++++++++++++++++++++++++++++++++++++++++
include/hw/i386/intel_iommu.h | 10 +++---
include/hw/vfio/vfio-common.h | 16 ++++++---
include/hw/virtio/vhost.h | 4 +--
memory.c | 47 +++++++++++++++-----------
15 files changed, 331 insertions(+), 100 deletions(-)
create mode 100644 hw/core/iommu.c
create mode 100644 include/hw/core/iommu.h
--
1.9.1
Liu, Yi L
2017-11-03 12:01:51 UTC
From: Peter Xu <***@redhat.com>

IOMMU notifiers have so far mostly been used for [dev-]IOTLB work. They are
not suitable for other kinds of notifiers (one example would be the future
virt-SVM support). Considering that the current notifiers target per-memory-
region events, rename the iommu notifier definitions:

* rename all the notifier types from the IOMMU_NOTIFIER_* prefix to
IOMMU_MR_EVENT_* to better reflect their usage (for memory regions)
* rename IOMMUNotifier to IOMMUMRNotifier
* rename iommu_notifier to iommu_mr_notifier
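
For example, after the conversion a typical caller (taken from the
hw/vfio/common.c hunk below) reads:

    iommu_mr_notifier_init(&giommu->n, vfio_iommu_map_notify,
                           IOMMU_MR_EVENT_ALL,
                           section->offset_within_region,
                           int128_get64(llend));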

Signed-off-by: Peter Xu <***@redhat.com>
Signed-off-by: Liu, Yi L <***@linux.intel.com>
---
hw/i386/amd_iommu.c | 6 ++---
hw/i386/intel_iommu.c | 34 +++++++++++++-------------
hw/ppc/spapr_iommu.c | 8 +++----
hw/s390x/s390-pci-bus.c | 2 +-
hw/vfio/common.c | 10 ++++----
hw/virtio/vhost.c | 10 ++++----
include/exec/memory.h | 55 ++++++++++++++++++++++---------------------
include/hw/i386/intel_iommu.h | 8 +++----
include/hw/vfio/vfio-common.h | 2 +-
include/hw/virtio/vhost.h | 4 ++--
memory.c | 37 +++++++++++++++--------------
11 files changed, 89 insertions(+), 87 deletions(-)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index ad8155c..8f756e8 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -1072,12 +1072,12 @@ static const MemoryRegionOps mmio_mem_ops = {
};

static void amdvi_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
- IOMMUNotifierFlag old,
- IOMMUNotifierFlag new)
+ IOMMUMREventFlag old,
+ IOMMUMREventFlag new)
{
AMDVIAddressSpace *as = container_of(iommu, AMDVIAddressSpace, iommu);

- if (new & IOMMU_NOTIFIER_MAP) {
+ if (new & IOMMU_MR_EVENT_MAP) {
error_report("device %02x.%02x.%x requires iommu notifier which is not "
"currently supported", as->bus_num, PCI_SLOT(as->devfn),
PCI_FUNC(as->devfn));
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 3a5bb0b..e81c706 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1234,7 +1234,7 @@ static void vtd_interrupt_remap_table_setup(IntelIOMMUState *s)

static void vtd_iommu_replay_all(IntelIOMMUState *s)
{
- IntelIOMMUNotifierNode *node;
+ IntelIOMMUMRNotifierNode *node;

QLIST_FOREACH(node, &s->notifiers_list, next) {
memory_region_iommu_replay_all(&node->vtd_as->iommu);
@@ -1308,7 +1308,7 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
/*
* So a device is moving out of (or moving into) a
* domain, a replay() suites here to notify all the
- * IOMMU_NOTIFIER_MAP registers about this change.
+ * IOMMU_MR_EVENT_MAP registers about this change.
* This won't bring bad even if we have no such
* notifier registered - the IOMMU notification
* framework will skip MAP notifications if that
@@ -1358,7 +1358,7 @@ static void vtd_iotlb_global_invalidate(IntelIOMMUState *s)

static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
{
- IntelIOMMUNotifierNode *node;
+ IntelIOMMUMRNotifierNode *node;
VTDContextEntry ce;
VTDAddressSpace *vtd_as;

@@ -1388,7 +1388,7 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
uint16_t domain_id, hwaddr addr,
uint8_t am)
{
- IntelIOMMUNotifierNode *node;
+ IntelIOMMUMRNotifierNode *node;
VTDContextEntry ce;
int ret;

@@ -2318,21 +2318,21 @@ static IOMMUTLBEntry vtd_iommu_translate(IOMMUMemoryRegion *iommu, hwaddr addr,
}

static void vtd_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
- IOMMUNotifierFlag old,
- IOMMUNotifierFlag new)
+ IOMMUMREventFlag old,
+ IOMMUMREventFlag new)
{
VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
IntelIOMMUState *s = vtd_as->iommu_state;
- IntelIOMMUNotifierNode *node = NULL;
- IntelIOMMUNotifierNode *next_node = NULL;
+ IntelIOMMUMRNotifierNode *node = NULL;
+ IntelIOMMUMRNotifierNode *next_node = NULL;

- if (!s->caching_mode && new & IOMMU_NOTIFIER_MAP) {
+ if (!s->caching_mode && new & IOMMU_MR_EVENT_MAP) {
error_report("We need to set cache_mode=1 for intel-iommu to enable "
"device assignment with IOMMU protection.");
exit(1);
}

- if (old == IOMMU_NOTIFIER_NONE) {
+ if (old == IOMMU_MR_EVENT_NONE) {
node = g_malloc0(sizeof(*node));
node->vtd_as = vtd_as;
QLIST_INSERT_HEAD(&s->notifiers_list, node, next);
@@ -2342,7 +2342,7 @@ static void vtd_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
/* update notifier node with new flags */
QLIST_FOREACH_SAFE(node, &s->notifiers_list, next, next_node) {
if (node->vtd_as == vtd_as) {
- if (new == IOMMU_NOTIFIER_NONE) {
+ if (new == IOMMU_MR_EVENT_NONE) {
QLIST_REMOVE(node, next);
g_free(node);
}
@@ -2759,7 +2759,7 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
}

/* Unmap the whole range in the notifier's scope. */
-static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
+static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUMRNotifier *n)
{
IOMMUTLBEntry entry;
hwaddr size;
@@ -2814,13 +2814,13 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)

static void vtd_address_space_unmap_all(IntelIOMMUState *s)
{
- IntelIOMMUNotifierNode *node;
+ IntelIOMMUMRNotifierNode *node;
VTDAddressSpace *vtd_as;
- IOMMUNotifier *n;
+ IOMMUMRNotifier *n;

QLIST_FOREACH(node, &s->notifiers_list, next) {
vtd_as = node->vtd_as;
- IOMMU_NOTIFIER_FOREACH(n, &vtd_as->iommu) {
+ IOMMU_MR_NOTIFIER_FOREACH(n, &vtd_as->iommu) {
vtd_address_space_unmap(vtd_as, n);
}
}
@@ -2828,11 +2828,11 @@ static void vtd_address_space_unmap_all(IntelIOMMUState *s)

static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
{
- memory_region_notify_one((IOMMUNotifier *)private, entry);
+ memory_region_notify_one((IOMMUMRNotifier *)private, entry);
return 0;
}

-static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
+static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUMRNotifier *n)
{
VTDAddressSpace *vtd_as = container_of(iommu_mr, VTDAddressSpace, iommu);
IntelIOMMUState *s = vtd_as->iommu_state;
diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 5ccd785..088f614 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -161,14 +161,14 @@ static uint64_t spapr_tce_get_min_page_size(IOMMUMemoryRegion *iommu)
}

static void spapr_tce_notify_flag_changed(IOMMUMemoryRegion *iommu,
- IOMMUNotifierFlag old,
- IOMMUNotifierFlag new)
+ IOMMUMREventFlag old,
+ IOMMUMREventFlag new)
{
struct sPAPRTCETable *tbl = container_of(iommu, sPAPRTCETable, iommu);

- if (old == IOMMU_NOTIFIER_NONE && new != IOMMU_NOTIFIER_NONE) {
+ if (old == IOMMU_MR_EVENT_NONE && new != IOMMU_MR_EVENT_NONE) {
spapr_tce_set_need_vfio(tbl, true);
- } else if (old != IOMMU_NOTIFIER_NONE && new == IOMMU_NOTIFIER_NONE) {
+ } else if (old != IOMMU_MR_EVENT_NONE && new == IOMMU_MR_EVENT_NONE) {
spapr_tce_set_need_vfio(tbl, false);
}
}
diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index e7a58e8..10b7020 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -398,7 +398,7 @@ static IOMMUTLBEntry s390_translate_iommu(IOMMUMemoryRegion *mr, hwaddr addr,
}

static void s390_pci_iommu_replay(IOMMUMemoryRegion *iommu,
- IOMMUNotifier *notifier)
+ IOMMUMRNotifier *notifier)
{
/* It's impossible to plug a pci device on s390x that already has iommu
* mappings which need to be replayed, that is due to the "one iommu per
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 7b2924c..1f7d516 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -346,7 +346,7 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
return true;
}

-static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+static void vfio_iommu_map_notify(IOMMUMRNotifier *n, IOMMUTLBEntry *iotlb)
{
VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
VFIOContainer *container = giommu->container;
@@ -496,10 +496,10 @@ static void vfio_listener_region_add(MemoryListener *listener,
llend = int128_add(int128_make64(section->offset_within_region),
section->size);
llend = int128_sub(llend, int128_one());
- iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
- IOMMU_NOTIFIER_ALL,
- section->offset_within_region,
- int128_get64(llend));
+ iommu_mr_notifier_init(&giommu->n, vfio_iommu_map_notify,
+ IOMMU_MR_EVENT_ALL,
+ section->offset_within_region,
+ int128_get64(llend));
QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);

memory_region_register_iommu_notifier(section->mr, &giommu->n);
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index ddc42f0..e2c1228 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -719,7 +719,7 @@ static void vhost_region_del(MemoryListener *listener,
}
}

-static void vhost_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+static void vhost_iommu_unmap_notify(IOMMUMRNotifier *n, IOMMUTLBEntry *iotlb)
{
struct vhost_iommu *iommu = container_of(n, struct vhost_iommu, n);
struct vhost_dev *hdev = iommu->hdev;
@@ -747,10 +747,10 @@ static void vhost_iommu_region_add(MemoryListener *listener,
end = int128_add(int128_make64(section->offset_within_region),
section->size);
end = int128_sub(end, int128_one());
- iommu_notifier_init(&iommu->n, vhost_iommu_unmap_notify,
- IOMMU_NOTIFIER_UNMAP,
- section->offset_within_region,
- int128_get64(end));
+ iommu_mr_notifier_init(&iommu->n, vhost_iommu_unmap_notify,
+ IOMMU_MR_EVENT_UNMAP,
+ section->offset_within_region,
+ int128_get64(end));
iommu->mr = section->mr;
iommu->iommu_offset = section->offset_within_address_space -
section->offset_within_region;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 5ed4042..03595e3 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -75,36 +75,36 @@ struct IOMMUTLBEntry {
};

/*
- * Bitmap for different IOMMUNotifier capabilities. Each notifier can
+ * Bitmap for different IOMMUMRNotifier capabilities. Each notifier can
* register with one or multiple IOMMU Notifier capability bit(s).
*/
typedef enum {
- IOMMU_NOTIFIER_NONE = 0,
+ IOMMU_MR_EVENT_NONE = 0,
/* Notify cache invalidations */
- IOMMU_NOTIFIER_UNMAP = 0x1,
+ IOMMU_MR_EVENT_UNMAP = 0x1,
/* Notify entry changes (newly created entries) */
- IOMMU_NOTIFIER_MAP = 0x2,
-} IOMMUNotifierFlag;
+ IOMMU_MR_EVENT_MAP = 0x2,
+} IOMMUMREventFlag;

-#define IOMMU_NOTIFIER_ALL (IOMMU_NOTIFIER_MAP | IOMMU_NOTIFIER_UNMAP)
+#define IOMMU_MR_EVENT_ALL (IOMMU_MR_EVENT_MAP | IOMMU_MR_EVENT_UNMAP)

-struct IOMMUNotifier;
-typedef void (*IOMMUNotify)(struct IOMMUNotifier *notifier,
+struct IOMMUMRNotifier;
+typedef void (*IOMMUMRNotify)(struct IOMMUMRNotifier *notifier,
IOMMUTLBEntry *data);

-struct IOMMUNotifier {
- IOMMUNotify notify;
- IOMMUNotifierFlag notifier_flags;
+struct IOMMUMRNotifier {
+ IOMMUMRNotify notify;
+ IOMMUMREventFlag notifier_flags;
/* Notify for address space range start <= addr <= end */
hwaddr start;
hwaddr end;
- QLIST_ENTRY(IOMMUNotifier) node;
+ QLIST_ENTRY(IOMMUMRNotifier) node;
};
-typedef struct IOMMUNotifier IOMMUNotifier;
+typedef struct IOMMUMRNotifier IOMMUMRNotifier;

-static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
- IOMMUNotifierFlag flags,
- hwaddr start, hwaddr end)
+static inline void iommu_mr_notifier_init(IOMMUMRNotifier *n, IOMMUMRNotify fn,
+ IOMMUMREventFlag flags,
+ hwaddr start, hwaddr end)
{
n->notify = fn;
n->notifier_flags = flags;
@@ -206,10 +206,10 @@ typedef struct IOMMUMemoryRegionClass {
uint64_t (*get_min_page_size)(IOMMUMemoryRegion *iommu);
/* Called when IOMMU Notifier flag changed */
void (*notify_flag_changed)(IOMMUMemoryRegion *iommu,
- IOMMUNotifierFlag old_flags,
- IOMMUNotifierFlag new_flags);
+ IOMMUMREventFlag old_flags,
+ IOMMUMREventFlag new_flags);
/* Set this up to provide customized IOMMU replay function */
- void (*replay)(IOMMUMemoryRegion *iommu, IOMMUNotifier *notifier);
+ void (*replay)(IOMMUMemoryRegion *iommu, IOMMUMRNotifier *notifier);
} IOMMUMemoryRegionClass;

typedef struct CoalescedMemoryRange CoalescedMemoryRange;
@@ -259,11 +259,11 @@ struct MemoryRegion {
struct IOMMUMemoryRegion {
MemoryRegion parent_obj;

- QLIST_HEAD(, IOMMUNotifier) iommu_notify;
- IOMMUNotifierFlag iommu_notify_flags;
+ QLIST_HEAD(, IOMMUMRNotifier) iommu_notify;
+ IOMMUMREventFlag iommu_notify_flags;
};

-#define IOMMU_NOTIFIER_FOREACH(n, mr) \
+#define IOMMU_MR_NOTIFIER_FOREACH(n, mr) \
QLIST_FOREACH((n), &(mr)->iommu_notify, node)

/**
@@ -879,7 +879,7 @@ void memory_region_notify_iommu(IOMMUMemoryRegion *iommu_mr,
* replaces all old entries for the same virtual I/O address range.
* Deleted entries have ***@perm == 0.
*/
-void memory_region_notify_one(IOMMUNotifier *notifier,
+void memory_region_notify_one(IOMMUMRNotifier *notifier,
IOMMUTLBEntry *entry);

/**
@@ -887,12 +887,12 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
* IOMMU translation entries.
*
* @mr: the memory region to observe
- * @n: the IOMMUNotifier to be added; the notify callback receives a
+ * @n: the IOMMUMRNotifier to be added; the notify callback receives a
* pointer to an #IOMMUTLBEntry as the opaque value; the pointer
* ceases to be valid on exit from the notifier.
*/
void memory_region_register_iommu_notifier(MemoryRegion *mr,
- IOMMUNotifier *n);
+ IOMMUMRNotifier *n);

/**
* memory_region_iommu_replay: replay existing IOMMU translations to
@@ -902,7 +902,8 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr,
* @iommu_mr: the memory region to observe
* @n: the notifier to which to replay iommu mappings
*/
-void memory_region_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n);
+void memory_region_iommu_replay(IOMMUMemoryRegion *iommu_mr,
+ IOMMUMRNotifier *n);

/**
* memory_region_iommu_replay_all: replay existing IOMMU translations
@@ -921,7 +922,7 @@ void memory_region_iommu_replay_all(IOMMUMemoryRegion *iommu_mr);
* @n: the notifier to be removed.
*/
void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
- IOMMUNotifier *n);
+ IOMMUMRNotifier *n);

/**
* memory_region_name: get a memory region's name
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index ac15e6b..c85f9ff 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -65,7 +65,7 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
typedef struct VTDIrq VTDIrq;
typedef struct VTD_MSIMessage VTD_MSIMessage;
-typedef struct IntelIOMMUNotifierNode IntelIOMMUNotifierNode;
+typedef struct IntelIOMMUMRNotifierNode IntelIOMMUMRNotifierNode;

/* Context-Entry */
struct VTDContextEntry {
@@ -251,9 +251,9 @@ struct VTD_MSIMessage {
/* When IR is enabled, all MSI/MSI-X data bits should be zero */
#define VTD_IR_MSI_DATA (0)

-struct IntelIOMMUNotifierNode {
+struct IntelIOMMUMRNotifierNode {
VTDAddressSpace *vtd_as;
- QLIST_ENTRY(IntelIOMMUNotifierNode) next;
+ QLIST_ENTRY(IntelIOMMUMRNotifierNode) next;
};

/* The iommu (DMAR) device state struct */
@@ -293,7 +293,7 @@ struct IntelIOMMUState {
GHashTable *vtd_as_by_busptr; /* VTDBus objects indexed by PCIBus* reference */
VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by bus number */
/* list of registered notifiers */
- QLIST_HEAD(, IntelIOMMUNotifierNode) notifiers_list;
+ QLIST_HEAD(, IntelIOMMUMRNotifierNode) notifiers_list;

/* interrupt remapping */
bool intr_enabled; /* Whether guest enabled IR */
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index f3a2ac9..865e3e7 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -97,7 +97,7 @@ typedef struct VFIOGuestIOMMU {
VFIOContainer *container;
IOMMUMemoryRegion *iommu;
hwaddr iommu_offset;
- IOMMUNotifier n;
+ IOMMUMRNotifier n;
QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
} VFIOGuestIOMMU;

diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 467dc77..ffe9d9f 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -42,7 +42,7 @@ struct vhost_iommu {
struct vhost_dev *hdev;
MemoryRegion *mr;
hwaddr iommu_offset;
- IOMMUNotifier n;
+ IOMMUMRNotifier n;
QLIST_ENTRY(vhost_iommu) iommu_next;
};

@@ -75,7 +75,7 @@ struct vhost_dev {
struct vhost_log *log;
QLIST_ENTRY(vhost_dev) entry;
QLIST_HEAD(, vhost_iommu) iommu_list;
- IOMMUNotifier n;
+ IOMMUMRNotifier n;
};

int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
diff --git a/memory.c b/memory.c
index e26e5a3..77fb3ef 100644
--- a/memory.c
+++ b/memory.c
@@ -1689,7 +1689,7 @@ void memory_region_init_iommu(void *_iommu_mr,
iommu_mr = IOMMU_MEMORY_REGION(mr);
mr->terminates = true; /* then re-forwards */
QLIST_INIT(&iommu_mr->iommu_notify);
- iommu_mr->iommu_notify_flags = IOMMU_NOTIFIER_NONE;
+ iommu_mr->iommu_notify_flags = IOMMU_MR_EVENT_NONE;
}

static void memory_region_finalize(Object *obj)
@@ -1786,12 +1786,12 @@ bool memory_region_is_logging(MemoryRegion *mr, uint8_t client)

static void memory_region_update_iommu_notify_flags(IOMMUMemoryRegion *iommu_mr)
{
- IOMMUNotifierFlag flags = IOMMU_NOTIFIER_NONE;
- IOMMUNotifier *iommu_notifier;
+ IOMMUMREventFlag flags = IOMMU_MR_EVENT_NONE;
+ IOMMUMRNotifier *iommu_mr_notifier;
IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);

- IOMMU_NOTIFIER_FOREACH(iommu_notifier, iommu_mr) {
- flags |= iommu_notifier->notifier_flags;
+ IOMMU_MR_NOTIFIER_FOREACH(iommu_mr_notifier, iommu_mr) {
+ flags |= iommu_mr_notifier->notifier_flags;
}

if (flags != iommu_mr->iommu_notify_flags && imrc->notify_flag_changed) {
@@ -1804,7 +1804,7 @@ static void memory_region_update_iommu_notify_flags(IOMMUMemoryRegion *iommu_mr)
}

void memory_region_register_iommu_notifier(MemoryRegion *mr,
- IOMMUNotifier *n)
+ IOMMUMRNotifier *n)
{
IOMMUMemoryRegion *iommu_mr;

@@ -1815,7 +1815,7 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr,

/* We need to register for at least one bitfield */
iommu_mr = IOMMU_MEMORY_REGION(mr);
- assert(n->notifier_flags != IOMMU_NOTIFIER_NONE);
+ assert(n->notifier_flags != IOMMU_MR_EVENT_NONE);
assert(n->start <= n->end);
QLIST_INSERT_HEAD(&iommu_mr->iommu_notify, n, node);
memory_region_update_iommu_notify_flags(iommu_mr);
@@ -1831,7 +1831,8 @@ uint64_t memory_region_iommu_get_min_page_size(IOMMUMemoryRegion *iommu_mr)
return TARGET_PAGE_SIZE;
}

-void memory_region_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
+void memory_region_iommu_replay(IOMMUMemoryRegion *iommu_mr,
+ IOMMUMRNotifier *n)
{
MemoryRegion *mr = MEMORY_REGION(iommu_mr);
IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
@@ -1862,15 +1863,15 @@ void memory_region_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)

void memory_region_iommu_replay_all(IOMMUMemoryRegion *iommu_mr)
{
- IOMMUNotifier *notifier;
+ IOMMUMRNotifier *notifier;

- IOMMU_NOTIFIER_FOREACH(notifier, iommu_mr) {
+ IOMMU_MR_NOTIFIER_FOREACH(notifier, iommu_mr) {
memory_region_iommu_replay(iommu_mr, notifier);
}
}

void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
- IOMMUNotifier *n)
+ IOMMUMRNotifier *n)
{
IOMMUMemoryRegion *iommu_mr;

@@ -1883,10 +1884,10 @@ void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
memory_region_update_iommu_notify_flags(iommu_mr);
}

-void memory_region_notify_one(IOMMUNotifier *notifier,
+void memory_region_notify_one(IOMMUMRNotifier *notifier,
IOMMUTLBEntry *entry)
{
- IOMMUNotifierFlag request_flags;
+ IOMMUMREventFlag request_flags;

/*
* Skip the notification if the notification does not overlap
@@ -1898,9 +1899,9 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
}

if (entry->perm & IOMMU_RW) {
- request_flags = IOMMU_NOTIFIER_MAP;
+ request_flags = IOMMU_MR_EVENT_MAP;
} else {
- request_flags = IOMMU_NOTIFIER_UNMAP;
+ request_flags = IOMMU_MR_EVENT_UNMAP;
}

if (notifier->notifier_flags & request_flags) {
@@ -1911,12 +1912,12 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
void memory_region_notify_iommu(IOMMUMemoryRegion *iommu_mr,
IOMMUTLBEntry entry)
{
- IOMMUNotifier *iommu_notifier;
+ IOMMUMRNotifier *iommu_mr_notifier;

assert(memory_region_is_iommu(MEMORY_REGION(iommu_mr)));

- IOMMU_NOTIFIER_FOREACH(iommu_notifier, iommu_mr) {
- memory_region_notify_one(iommu_notifier, &entry);
+ IOMMU_MR_NOTIFIER_FOREACH(iommu_mr_notifier, iommu_mr) {
+ memory_region_notify_one(iommu_mr_notifier, &entry);
}
}
--
1.9.1
Liu, Yi L
2017-11-03 12:01:53 UTC
From: Peter Xu <***@redhat.com>

Provide AddressSpaceOps.iommu_get() in Intel IOMMU emulator.

Signed-off-by: Peter Xu <***@redhat.com>
Signed-off-by: Liu, Yi L <***@linux.intel.com>
---
hw/i386/intel_iommu.c | 7 +++++++
include/hw/i386/intel_iommu.h | 2 ++
2 files changed, 9 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index e81c706..54343e5 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2687,6 +2687,12 @@ static const MemoryRegionOps vtd_mem_ir_ops = {
},
};

+static IOMMUObject *vtd_as_iommu_get(AddressSpace *as)
+{
+ VTDAddressSpace *vtd_dev_as = container_of(as, VTDAddressSpace, as);
+ return &vtd_dev_as->iommu_object;
+}
+
VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
{
uintptr_t key = (uintptr_t)bus;
@@ -2748,6 +2754,7 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
VTD_INTERRUPT_ADDR_FIRST,
&vtd_dev_as->iommu_ir, 64);
address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name);
+ vtd_dev_as->as.as_ops.iommu_get = vtd_as_iommu_get;
memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
&vtd_dev_as->sys_alias, 1);
memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index c85f9ff..a3c6d45 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -27,6 +27,7 @@
#include "hw/i386/ioapic.h"
#include "hw/pci/msi.h"
#include "hw/sysbus.h"
+#include "hw/core/iommu.h"

#define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
#define INTEL_IOMMU_DEVICE(obj) \
@@ -90,6 +91,7 @@ struct VTDAddressSpace {
MemoryRegion sys_alias;
MemoryRegion iommu_ir; /* Interrupt region: 0xfeeXXXXX */
IntelIOMMUState *iommu_state;
+ IOMMUObject iommu_object;
VTDContextCacheEntry context_cache_entry;
};
--
1.9.1
Liu, Yi L
2017-11-03 12:01:52 UTC
From: Peter Xu <***@redhat.com>

AddressSpaceOps is similar to MemoryRegionOps, it's just for address
spaces to store arch-specific hooks.

The first hook I would like to introduce is iommu_get(). Return an
IOMMUObject behind the AddressSpace.

For systems that have IOMMUs, we will create a special address
space per device which is different from system default address
space for it (please refer to pci_device_iommu_address_space()).
Normally when that happens, there will be one specific IOMMU (or
say, translation unit) stands right behind that new address space.

This iommu_get() fetches that guy behind the address space. Here,
the guy is defined as IOMMUObject, which includes a notifier_list
so far, may extend in future. Along with IOMMUObject, a new iommu
notifier mechanism is introduced. It would be used for virt-svm.
Also IOMMUObject can further have a IOMMUObjectOps which is similar
to MemoryRegionOps. The difference is IOMMUObjectOps is not relied
on MemoryRegion.
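
As an illustration of the intended flow, an emulator holding an
IOMMUObject would fire an event roughly like below (vtd_dev_as->iommu_object
comes from the intel_iommu patch in this series; pasidt_info is only a
placeholder for a payload structure defined in later patches):

    /* Sketch: the vIOMMU emulator traps the guest programming its pasid
     * table pointer and notifies all registered listeners. */
    IOMMUEventData event_data = {
        .event  = IOMMU_EVENT_BIND_PASIDT,
        .length = sizeof(pasidt_info),
        .data   = &pasidt_info,
    };

    iommu_notify(&vtd_dev_as->iommu_object, &event_data);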

Signed-off-by: Peter Xu <***@redhat.com>
Signed-off-by: Liu, Yi L <***@linux.intel.com>
---
hw/core/Makefile.objs | 1 +
hw/core/iommu.c | 58 +++++++++++++++++++++++++++++++++++++++
include/exec/memory.h | 22 +++++++++++++++
include/hw/core/iommu.h | 73 +++++++++++++++++++++++++++++++++++++++++++++++++
memory.c | 10 +++++--
5 files changed, 162 insertions(+), 2 deletions(-)
create mode 100644 hw/core/iommu.c
create mode 100644 include/hw/core/iommu.h

diff --git a/hw/core/Makefile.objs b/hw/core/Makefile.objs
index f8d7a4a..d688412 100644
--- a/hw/core/Makefile.objs
+++ b/hw/core/Makefile.objs
@@ -5,6 +5,7 @@ common-obj-y += fw-path-provider.o
# irq.o needed for qdev GPIO handling:
common-obj-y += irq.o
common-obj-y += hotplug.o
+common-obj-y += iommu.o
common-obj-y += nmi.o

common-obj-$(CONFIG_EMPTY_SLOT) += empty_slot.o
diff --git a/hw/core/iommu.c b/hw/core/iommu.c
new file mode 100644
index 0000000..7c4fcfe
--- /dev/null
+++ b/hw/core/iommu.c
@@ -0,0 +1,58 @@
+/*
+ * QEMU emulation of IOMMU logic
+ *
+ * Copyright (C) 2017 Red Hat Inc.
+ *
+ * Authors: Peter Xu <***@redhat.com>,
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "hw/core/iommu.h"
+
+void iommu_notifier_register(IOMMUObject *iommu,
+ IOMMUNotifier *n,
+ IOMMUNotifyFn fn,
+ IOMMUEvent event)
+{
+ n->event = event;
+ n->iommu_notify = fn;
+ QLIST_INSERT_HEAD(&iommu->iommu_notifiers, n, node);
+ return;
+}
+
+void iommu_notifier_unregister(IOMMUObject *iommu,
+ IOMMUNotifier *notifier)
+{
+ IOMMUNotifier *cur, *next;
+
+ QLIST_FOREACH_SAFE(cur, &iommu->iommu_notifiers, node, next) {
+ if (cur == notifier) {
+ QLIST_REMOVE(cur, node);
+ break;
+ }
+ }
+}
+
+void iommu_notify(IOMMUObject *iommu, IOMMUEventData *event_data)
+{
+ IOMMUNotifier *cur;
+
+ QLIST_FOREACH(cur, &iommu->iommu_notifiers, node) {
+ if ((cur->event == event_data->event) && cur->iommu_notify) {
+ cur->iommu_notify(cur, event_data);
+ }
+ }
+}
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 03595e3..8350973 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -26,6 +26,7 @@
#include "qom/object.h"
#include "qemu/rcu.h"
#include "hw/qdev-core.h"
+#include "hw/core/iommu.h"

#define RAM_ADDR_INVALID (~(ram_addr_t)0)

@@ -301,6 +302,19 @@ struct MemoryListener {
};

/**
+ * AddressSpaceOps: callbacks structure for address space specific operations
+ *
+ * @iommu_get: returns an IOMMU object that backs the address space.
+ * Normally this should be NULL for generic address
+ * spaces, and it's only used when there is one
+ * translation unit behind this address space.
+ */
+struct AddressSpaceOps {
+ IOMMUObject *(*iommu_get)(AddressSpace *as);
+};
+typedef struct AddressSpaceOps AddressSpaceOps;
+
+/**
* AddressSpace: describes a mapping of addresses to #MemoryRegion objects
*/
struct AddressSpace {
@@ -316,6 +330,7 @@ struct AddressSpace {
struct MemoryRegionIoeventfd *ioeventfds;
QTAILQ_HEAD(memory_listeners_as, MemoryListener) listeners;
QTAILQ_ENTRY(AddressSpace) address_spaces_link;
+ AddressSpaceOps as_ops;
};

FlatView *address_space_to_flatview(AddressSpace *as);
@@ -1988,6 +2003,13 @@ address_space_write_cached(MemoryRegionCache *cache, hwaddr addr,
address_space_write(cache->as, cache->xlat + addr, MEMTXATTRS_UNSPECIFIED, buf, len);
}

+/**
+ * address_space_iommu_get: Get the backend IOMMU for the address space
+ *
+ * @as: the address space to fetch IOMMU from
+ */
+IOMMUObject *address_space_iommu_get(AddressSpace *as);
+
#endif

#endif
diff --git a/include/hw/core/iommu.h b/include/hw/core/iommu.h
new file mode 100644
index 0000000..34387c0
--- /dev/null
+++ b/include/hw/core/iommu.h
@@ -0,0 +1,73 @@
+/*
+ * QEMU emulation of IOMMU logic
+ *
+ * Copyright (C) 2017 Red Hat Inc.
+ *
+ * Authors: Peter Xu <***@redhat.com>,
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef HW_CORE_IOMMU_H
+#define HW_CORE_IOMMU_H
+
+#include "qemu/queue.h"
+
+enum IOMMUEvent {
+ IOMMU_EVENT_BIND_PASIDT,
+};
+typedef enum IOMMUEvent IOMMUEvent;
+
+struct IOMMUEventData {
+ IOMMUEvent event;
+ uint64_t length;
+ void *data;
+};
+typedef struct IOMMUEventData IOMMUEventData;
+
+typedef struct IOMMUNotifier IOMMUNotifier;
+
+typedef void (*IOMMUNotifyFn)(IOMMUNotifier *notifier,
+ IOMMUEventData *event_data);
+
+struct IOMMUNotifier {
+ IOMMUNotifyFn iommu_notify;
+ /*
+ * What events we are listening to. Let's allow multiple event
+ * registrations from beginning.
+ */
+ IOMMUEvent event;
+ QLIST_ENTRY(IOMMUNotifier) node;
+};
+
+typedef struct IOMMUObject IOMMUObject;
+
+/*
+ * This stands for an IOMMU unit. Any translation device should have
+ * this struct inside its own structure to make sure it can leverage
+ * common IOMMU functionalities.
+ */
+struct IOMMUObject {
+ QLIST_HEAD(, IOMMUNotifier) iommu_notifiers;
+};
+
+void iommu_notifier_register(IOMMUObject *iommu,
+ IOMMUNotifier *n,
+ IOMMUNotifyFn fn,
+ IOMMUEvent event);
+void iommu_notifier_unregister(IOMMUObject *iommu,
+ IOMMUNotifier *notifier);
+void iommu_notify(IOMMUObject *iommu, IOMMUEventData *event_data);
+
+#endif
diff --git a/memory.c b/memory.c
index 77fb3ef..307f665 100644
--- a/memory.c
+++ b/memory.c
@@ -235,8 +235,6 @@ struct FlatView {
MemoryRegion *root;
};

-typedef struct AddressSpaceOps AddressSpaceOps;
-
#define FOR_EACH_FLAT_RANGE(var, view) \
for (var = (view)->ranges; var < (view)->ranges + (view)->nr; ++var)

@@ -2793,6 +2791,14 @@ static void do_address_space_destroy(AddressSpace *as)
memory_region_unref(as->root);
}

+IOMMUObject *address_space_iommu_get(AddressSpace *as)
+{
+ if (!as->as_ops.iommu_get) {
+ return NULL;
+ }
+ return as->as_ops.iommu_get(as);
+}
+
void address_space_destroy(AddressSpace *as)
{
MemoryRegion *root = as->root;
--
1.9.1
Liu, Yi L
2017-11-14 14:20:26 UTC
Hi Eric,
Hi Yi L,
Post by Liu, Yi L
AddressSpaceOps is similar to MemoryRegionOps, it's just for address
spaces to store arch-specific hooks.
The first hook I would like to introduce is iommu_get(). Return an
IOMMUObject behind the AddressSpace.
David had an objection in the past about this method, saying that
several IOMMUs could translate a single AS?
https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg01610.html
In
https://github.com/torvalds/linux/blob/master/Documentation/devicetree/bindings/pci/pci-iommu.txt,
it is said
"a given PCI device can only master through one IOMMU"
Post by Liu, Yi L
For systems that have IOMMUs, we will create a special address
space per device which is different from system default address
space for it (please refer to pci_device_iommu_address_space()).
Normally when that happens, there will be one specific IOMMU (or
say, translation unit) stands right behind that new address space.
standing
Post by Liu, Yi L
This iommu_get() fetches that guy behind the address space. Here,
the guy is defined as IOMMUObject, which includes a notifier_list
so far, may extend in future. Along with IOMMUObject, a new iommu
notifier mechanism is introduced. It would be used for virt-svm.
Also IOMMUObject can further have a IOMMUObjectOps which is similar
to MemoryRegionOps. The difference is IOMMUObjectOps is not relied
relying
Post by Liu, Yi L
on MemoryRegion.
I think I would split this patch into a first patch introducing the
iommu object and a second one adding the AS ops and address_space_iommu_get().
Good point.
Post by Liu, Yi L
---
hw/core/Makefile.objs | 1 +
hw/core/iommu.c | 58 +++++++++++++++++++++++++++++++++++++++
include/exec/memory.h | 22 +++++++++++++++
include/hw/core/iommu.h | 73 +++++++++++++++++++++++++++++++++++++++++++++++++
memory.c | 10 +++++--
5 files changed, 162 insertions(+), 2 deletions(-)
create mode 100644 hw/core/iommu.c
create mode 100644 include/hw/core/iommu.h
diff --git a/hw/core/Makefile.objs b/hw/core/Makefile.objs
index f8d7a4a..d688412 100644
--- a/hw/core/Makefile.objs
+++ b/hw/core/Makefile.objs
@@ -5,6 +5,7 @@ common-obj-y += fw-path-provider.o
common-obj-y += irq.o
common-obj-y += hotplug.o
+common-obj-y += iommu.o
common-obj-y += nmi.o
common-obj-$(CONFIG_EMPTY_SLOT) += empty_slot.o
diff --git a/hw/core/iommu.c b/hw/core/iommu.c
new file mode 100644
index 0000000..7c4fcfe
--- /dev/null
+++ b/hw/core/iommu.c
@@ -0,0 +1,58 @@
+/*
+ * QEMU emulation of IOMMU logic
This may be rephrased, as it does not really explain what the iommu object
exposes as an API.
Yes, it may need to be rephrased to make this clearer.
Post by Liu, Yi L
+ *
+ * Copyright (C) 2017 Red Hat Inc.
+ *
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "hw/core/iommu.h"
+
+void iommu_notifier_register(IOMMUObject *iommu,
+ IOMMUNotifier *n,
+ IOMMUNotifyFn fn,
+ IOMMUEvent event)
+{
+ n->event = event;
+ n->iommu_notify = fn;
+ QLIST_INSERT_HEAD(&iommu->iommu_notifiers, n, node);
+ return;
+}
+
+void iommu_notifier_unregister(IOMMUObject *iommu,
+ IOMMUNotifier *notifier)
+{
+ IOMMUNotifier *cur, *next;
+
+ QLIST_FOREACH_SAFE(cur, &iommu->iommu_notifiers, node, next) {
+ if (cur == notifier) {
+ QLIST_REMOVE(cur, node);
+ break;
+ }
+ }
+}
+
+void iommu_notify(IOMMUObject *iommu, IOMMUEventData *event_data)
+{
+ IOMMUNotifier *cur;
+
+ QLIST_FOREACH(cur, &iommu->iommu_notifiers, node) {
+ if ((cur->event == event_data->event) && cur->iommu_notify) {
can cur->iommu_notify be NULL if registered as above?
cur->event is always initialized together with an iommu_notify callback at
registration time, so if cur->event matches the requested event type, the
callback should be non-NULL.
Post by Liu, Yi L
+ cur->iommu_notify(cur, event_data);
+ }
+ }
+}
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 03595e3..8350973 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -26,6 +26,7 @@
#include "qom/object.h"
#include "qemu/rcu.h"
#include "hw/qdev-core.h"
+#include "hw/core/iommu.h"
#define RAM_ADDR_INVALID (~(ram_addr_t)0)
@@ -301,6 +302,19 @@ struct MemoryListener {
};
/**
+ * AddressSpaceOps: callbacks structure for address space specific operations
+ *
+ * Normally this should be NULL for generic address
+ * spaces, and it's only used when there is one
+ * translation unit behind this address space.
+ */
+struct AddressSpaceOps {
+ IOMMUObject *(*iommu_get)(AddressSpace *as);
+};
+typedef struct AddressSpaceOps AddressSpaceOps;
+
+/**
* AddressSpace: describes a mapping of addresses to #MemoryRegion objects
*/
struct AddressSpace {
@@ -316,6 +330,7 @@ struct AddressSpace {
struct MemoryRegionIoeventfd *ioeventfds;
QTAILQ_HEAD(memory_listeners_as, MemoryListener) listeners;
QTAILQ_ENTRY(AddressSpace) address_spaces_link;
+ AddressSpaceOps as_ops;
};
FlatView *address_space_to_flatview(AddressSpace *as);
@@ -1988,6 +2003,13 @@ address_space_write_cached(MemoryRegionCache *cache, hwaddr addr,
address_space_write(cache->as, cache->xlat + addr, MEMTXATTRS_UNSPECIFIED, buf, len);
}
+/**
+ * address_space_iommu_get: Get the backend IOMMU for the address space
+ *
+ */
+IOMMUObject *address_space_iommu_get(AddressSpace *as);
+
#endif
#endif
diff --git a/include/hw/core/iommu.h b/include/hw/core/iommu.h
new file mode 100644
index 0000000..34387c0
--- /dev/null
+++ b/include/hw/core/iommu.h
@@ -0,0 +1,73 @@
+/*
+ * QEMU emulation of IOMMU logic
+ *
+ * Copyright (C) 2017 Red Hat Inc.
+ *
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef HW_CORE_IOMMU_H
+#define HW_CORE_IOMMU_H
+
+#include "qemu/queue.h"
+
+enum IOMMUEvent {
+ IOMMU_EVENT_BIND_PASIDT,
+};
+typedef enum IOMMUEvent IOMMUEvent;
+
+struct IOMMUEventData {
+ IOMMUEvent event;
/* length and opaque data passed to notifiers */ ?
Yes, it is. But the "void *data" will be replaced with a well-defined
structure; there is no plan to keep it opaque. Once this patchset is
accepted, a later patchset will define it as:

struct IOMMUEventData {
    IOMMUEvent event;
    uint64_t length;
    union {
        struct pasid_table_config pasidt_info;
        struct tlb_invalidate_info tlb_info;
    };
};
typedef struct IOMMUEventData IOMMUEventData;

This is in a later patchset. Currently, we just want to show the idea of
introducing this new notifier framework.
Post by Liu, Yi L
+ uint64_t length;
+ void *data;
+};
+typedef struct IOMMUEventData IOMMUEventData;
+
+typedef struct IOMMUNotifier IOMMUNotifier;
+
+typedef void (*IOMMUNotifyFn)(IOMMUNotifier *notifier,
+ IOMMUEventData *event_data);
+
+struct IOMMUNotifier {
+ IOMMUNotifyFn iommu_notify;
+ /*
+ * What events we are listening to. Let's allow multiple event
+ * registrations from beginning.
+ */
+ IOMMUEvent event;
/* the event the notifier is sensitive to ? */
Yes, events (aka operations) like binding the pasid table, flushing the 1st level tlb, etc.
Post by Liu, Yi L
+ QLIST_ENTRY(IOMMUNotifier) node;
+};
+
+typedef struct IOMMUObject IOMMUObject;
+
+/*
+ * This stands for an IOMMU unit. Any translation device should have
+ * this struct inside its own structure to make sure it can leverage
+ * common IOMMU functionalities.
+ */
+struct IOMMUObject {
+ QLIST_HEAD(, IOMMUNotifier) iommu_notifiers;
where is the QLIST_INIT supposed to be done?
Yeah, I need to add it accordingly.

Thanks,
Yi L
Thanks
Eric
Post by Liu, Yi L
+};
+
+void iommu_notifier_register(IOMMUObject *iommu,
+ IOMMUNotifier *n,
+ IOMMUNotifyFn fn,
+ IOMMUEvent event);
+void iommu_notifier_unregister(IOMMUObject *iommu,
+ IOMMUNotifier *notifier);
+void iommu_notify(IOMMUObject *iommu, IOMMUEventData *event_data);
+
+#endif
diff --git a/memory.c b/memory.c
index 77fb3ef..307f665 100644
--- a/memory.c
+++ b/memory.c
@@ -235,8 +235,6 @@ struct FlatView {
MemoryRegion *root;
};
-typedef struct AddressSpaceOps AddressSpaceOps;
-
#define FOR_EACH_FLAT_RANGE(var, view) \
for (var = (view)->ranges; var < (view)->ranges + (view)->nr; ++var)
@@ -2793,6 +2791,14 @@ static void do_address_space_destroy(AddressSpace *as)
memory_region_unref(as->root);
}
+IOMMUObject *address_space_iommu_get(AddressSpace *as)
+{
+ if (!as->as_ops.iommu_get) {
+ return NULL;
+ }
+ return as->as_ops.iommu_get(as);
+}
+
void address_space_destroy(AddressSpace *as)
{
MemoryRegion *root = as->root;
Liu, Yi L
2017-11-14 13:59:04 UTC
Permalink
On Tue, Nov 14, 2017 at 09:53:07AM +0100, Auger Eric wrote:
Hi Eric,
Hi Yi L,
Post by Liu, Yi L
AddressSpaceOps is similar to MemoryRegionOps, it's just for address
spaces to store arch-specific hooks.
The first hook I would like to introduce is iommu_get(). Return an
IOMMUObject behind the AddressSpace.
For systems that have IOMMUs, we will create a special address
space per device which is different from system default address
space for it (please refer to pci_device_iommu_address_space()).
Normally when that happens, there will be one specific IOMMU (or
say, translation unit) stands right behind that new address space.
This iommu_get() fetches that guy behind the address space. Here,
the guy is defined as IOMMUObject, which includes a notifier_list
so far, may extend in future. Along with IOMMUObject, a new iommu
notifier mechanism is introduced. It would be used for virt-svm.
Also IOMMUObject can further have a IOMMUObjectOps which is similar
to MemoryRegionOps. The difference is IOMMUObjectOps is not relied
on MemoryRegion.
Hi, sorry I didn't reply to the earlier postings of this after our
discussion in China. I've been sick several times and very busy.
Hi David,
Fully understood. I'll try my best to address your questions. Also,
feel free to raise further questions; the more we discuss, the better
the result will be.
I still don't feel like there's an adequate explanation of exactly
what an IOMMUObject represents. Obviously it can represent more than
IOMMUObject is meant to represent the iommu itself, e.g. the iommu-specific
operations. One of the key purposes of IOMMUObject is to introduce a
notifier framework that lets the iommu emulator do iommu operations other
than MAP/UNMAP. As IOMMUs grow more and more features, MAP/UNMAP is no
longer the only operation the iommu emulator needs to handle, e.g. shared
virtual memory. As far as I know, AMD/ARM also have it; please correct me
if I'm wrong. As my cover letter mentioned, the MR-based notifier framework
doesn't work for the newly added IOMMU operations, like binding the guest
pasid table pointer to the host and propagating the guest's iotlb flushes
to the host.
a single translation window - since that's represented by the
IOMMUMR. But what exactly do all the MRs - or whatever else - that
are represented by the IOMMUObject have in common, from a functional
point of view.
Let me take virt-SVM as an example. As far as I know, for virt-SVM the
implementations of different vendors are similar. The key design is nested
translation (aka two-stage translation): the guest maintains the gVA->gPA
mapping and the hypervisor builds the gPA->hPA mapping, similar to the
EPT-based virt-MMU solution.
In QEMU, the gPA->hPA mapping is done through the MAP/UNMAP notifiers, and
that can keep working. But only the guest knows the gVA->gPA mapping, so
the hypervisor needs to trap the specific guest iommu operation and pass
the gVA->gPA mapping knowledge to the host through a (newly added)
notifier. In VT-d, this is called binding the guest pasid table to the host.
What I don't get is that the PASID table is per extended context entry. I
understand the latter is indexed by PCI device function, and today MRs
are created per PCIe device, if I am not wrong.
In my understanding, the MR is more related to the AddressSpace; it is not
exactly tagged with a PCIe device.
So why can't we have one
new MR notifier dedicated to PASID table passing? My understanding is
that the MR, having a 1-1 correspondence with a PCIe device and thus a
context, could be the right granularity. Then I understand the only flags
I didn't quite get your point regarding the "granularity" here. Could you
elaborate a bit more?
we currently have are NONE, MAP and UNMAP but couldn't we add a new one
for PASID TABLE passing? So this is not crystal clear to me why MR
notifiers are not adapted to PASID table passing.
This was also my initial plan. You can find some details in the link below.
https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg05295.html

In brief, the new notifier I want to add is not really MR related; it is
more like a behaviour of the translation unit. Also, as my cover letter
mentioned, the MR notifiers would not be registered for VT-d (virt-SVM)
in region_add(), so they would have to be registered outside of
region_add() (also pasted below). I think that more or less breaks the
consistency of the MR notifiers. In addition, I think the existing MR
notifiers are more related to IOVA address translation (on VT-d it is the
2nd level translation; for ARM is it stage 2?), and some existing code
relies heavily on this assumption, e.g. the memory replay introduced by
Peter, where the notifier node affects the replay. Adding a non-MAP/UNMAP
notifier would break that logic. So that's also a reason to add a separate
framework instead of just adding a new flag to the existing MR notifier
framework.

https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg04931.html

+ /* Check if vIOMMU exists */
+ QTAILQ_FOREACH(subregion, &as->root->subregions, subregions_link) {
+ if (memory_region_is_iommu(subregion)) {
+ IOMMUNotifier n1;
+
+ /*
+ FIXME: current iommu notifier is actually designed for
+ IOMMUTLB MAP/UNMAP. However, vIOMMU emulator may need
+ notifiers other than MAP/UNMAP, so it'll be better to
+ split the non-IOMMUTLB notifier from the current IOMMUTLB
+ notifier framewrok.
+ */
+ iommu_notifier_init(&n1, vfio_iommu_bind_pasid_tbl_notify,
+ IOMMU_NOTIFIER_SVM_PASIDT_BIND,
+ 0,
+ 0);
+ vfio_register_notifier(group->container,
+ subregion,
+ 0,
+ &n1);
+ }
+ }
+
Also, for the gVA iotlb flushing, only the guest knows about it, so the
hypervisor needs to propagate it to the host. Here, MAP/UNMAP is not
suitable since this gVA iotlb flush does not require modifying the host
iommu translation table.
I don't really get this argument. IOMMUNotifier is just a notifier that
is attached to an IOMMU MR and calls an IOMMUNotify function, right?
Yes, it is.
Then the role of the function is currently tied to the existing flags,
MAP and UNMAP. This is not linked to an action on the
physical IOMMU, right?
Doesn't the MAP/UNMAP notifier ultimately talk to the physical IOMMU? My
point here is that MAP/UNMAP ultimately talks to the physical IOMMU to
change the translation page table in memory. However, in the virt-SVM
case, the translation page table is the process vaddr page table (the I/O
page table is also used, but we don't need to discuss it since the
hypervisor owns it). The process vaddr page table is owned by the guest,
and changes to it are made by the guest. So for such a cache, we only need
to flush the cache on the iommu side; there is no need to modify the
translation page table.
By the time the gVA iotlb flush is issued, the gVA->gPA
mapping has already been modified. The host iommu only needs to reference
it when performing address translation, but before the host iommu performs
the translation, it needs to flush the old gVA cache. In VT-d, this is
called 1st level cache flushing.
The fact that MR notifiers may not be relevant could be linked to the nature
of the tagging of the structures you want to flush. My understanding is
that if you want to flush by source-id, MR granularity could be fine. Could
you please clarify why you need an iommu-wide operation in that case?
The flush is not limited to source-id granularity; it can also be
page-selective and others. As I mentioned, it does not require modifying
the translation page table, so it is more like a translation unit behavior.
Neither of the two notifiers (operations) has a direct relationship with
the MR; instead, they depend heavily on the virt-iommu itself.
As described above, this is not obvious to me. Could you please clarify
why source-id granularity (which I understand has a 1-1 correspondence with
the MR/AS) is not the correct granularity? Of course, please correct me if
my understanding of the MR mapping is not correct.
It's correct that for a PCIe device the iova address space (aka PCI address
space) has roughly a 1-1 relationship with an MR. But for virt-SVM the
address space is not limited to the iova address space; it also includes
process address spaces, and how could such an address space relate to an MR...

Thanks,
Yi L
Thanks
Eric
If a virt-iommu exists,
then the two notifiers are needed; if not, they are not.
Even understanding the SVM stuff better than I did, I don't really see
why an AddressSpace is an obvious unit to have an IOMMUObject
associated with it.
This benefits the notifier registration. As per my comments above, the
IOMMUObject represents the iommu. Associating an AddressSpace with an
IOMMUObject makes it easy to check whether it is necessary to register the
notifiers:
if there is no IOMMUObject, no virt-iommu is exposed to the guest, so
there is no need to register notifiers. For this, I also considered using
MemoryRegion.iommu_ops; e.g. for VT-d, it could be a loop checking all the
subregions and registering notifiers for each iommu MemoryRegion. Peter
mentioned it may not work for sPAPR, so he proposed associating an
AddressSpace with an IOMMUObject. I think it works and is easier, so I
didn't object to it.
https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg04931.html
+ /* Check if vIOMMU exists */
+ QTAILQ_FOREACH(subregion, &as->root->subregions, subregions_link) {
+ if (memory_region_is_iommu(subregion)) {
+ IOMMUNotifier n1;
+
+ /*
+ FIXME: current iommu notifier is actually designed for
+ IOMMUTLB MAP/UNMAP. However, vIOMMU emulator may need
+ notifiers other than MAP/UNMAP, so it'll be better to
+ split the non-IOMMUTLB notifier from the current IOMMUTLB
+ notifier framewrok.
+ */
+ iommu_notifier_init(&n1, vfio_iommu_bind_pasid_tbl_notify,
+ IOMMU_NOTIFIER_SVM_PASIDT_BIND,
+ 0,
+ 0);
+ vfio_register_notifier(group->container,
+ subregion,
+ 0,
+ &n1);
+ }
+ }
+
Thanks,
Yi L
Post by Liu, Yi L
---
hw/core/Makefile.objs | 1 +
hw/core/iommu.c | 58 +++++++++++++++++++++++++++++++++++++++
include/exec/memory.h | 22 +++++++++++++++
include/hw/core/iommu.h | 73 +++++++++++++++++++++++++++++++++++++++++++++++++
memory.c | 10 +++++--
5 files changed, 162 insertions(+), 2 deletions(-)
create mode 100644 hw/core/iommu.c
create mode 100644 include/hw/core/iommu.h
diff --git a/hw/core/Makefile.objs b/hw/core/Makefile.objs
index f8d7a4a..d688412 100644
--- a/hw/core/Makefile.objs
+++ b/hw/core/Makefile.objs
@@ -5,6 +5,7 @@ common-obj-y += fw-path-provider.o
common-obj-y += irq.o
common-obj-y += hotplug.o
+common-obj-y += iommu.o
common-obj-y += nmi.o
common-obj-$(CONFIG_EMPTY_SLOT) += empty_slot.o
diff --git a/hw/core/iommu.c b/hw/core/iommu.c
new file mode 100644
index 0000000..7c4fcfe
--- /dev/null
+++ b/hw/core/iommu.c
@@ -0,0 +1,58 @@
+/*
+ * QEMU emulation of IOMMU logic
+ *
+ * Copyright (C) 2017 Red Hat Inc.
+ *
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "hw/core/iommu.h"
+
+void iommu_notifier_register(IOMMUObject *iommu,
+ IOMMUNotifier *n,
+ IOMMUNotifyFn fn,
+ IOMMUEvent event)
+{
+ n->event = event;
+ n->iommu_notify = fn;
+ QLIST_INSERT_HEAD(&iommu->iommu_notifiers, n, node);
+ return;
+}
+
+void iommu_notifier_unregister(IOMMUObject *iommu,
+ IOMMUNotifier *notifier)
+{
+ IOMMUNotifier *cur, *next;
+
+ QLIST_FOREACH_SAFE(cur, &iommu->iommu_notifiers, node, next) {
+ if (cur == notifier) {
+ QLIST_REMOVE(cur, node);
+ break;
+ }
+ }
+}
+
+void iommu_notify(IOMMUObject *iommu, IOMMUEventData *event_data)
+{
+ IOMMUNotifier *cur;
+
+ QLIST_FOREACH(cur, &iommu->iommu_notifiers, node) {
+ if ((cur->event == event_data->event) && cur->iommu_notify) {
+ cur->iommu_notify(cur, event_data);
+ }
+ }
+}
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 03595e3..8350973 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -26,6 +26,7 @@
#include "qom/object.h"
#include "qemu/rcu.h"
#include "hw/qdev-core.h"
+#include "hw/core/iommu.h"
#define RAM_ADDR_INVALID (~(ram_addr_t)0)
@@ -301,6 +302,19 @@ struct MemoryListener {
};
/**
+ * AddressSpaceOps: callbacks structure for address space specific operations
+ *
+ * Normally this should be NULL for generic address
+ * spaces, and it's only used when there is one
+ * translation unit behind this address space.
+ */
+struct AddressSpaceOps {
+ IOMMUObject *(*iommu_get)(AddressSpace *as);
+};
+typedef struct AddressSpaceOps AddressSpaceOps;
+
+/**
* AddressSpace: describes a mapping of addresses to #MemoryRegion objects
*/
struct AddressSpace {
@@ -316,6 +330,7 @@ struct AddressSpace {
struct MemoryRegionIoeventfd *ioeventfds;
QTAILQ_HEAD(memory_listeners_as, MemoryListener) listeners;
QTAILQ_ENTRY(AddressSpace) address_spaces_link;
+ AddressSpaceOps as_ops;
};
FlatView *address_space_to_flatview(AddressSpace *as);
@@ -1988,6 +2003,13 @@ address_space_write_cached(MemoryRegionCache *cache, hwaddr addr,
address_space_write(cache->as, cache->xlat + addr, MEMTXATTRS_UNSPECIFIED, buf, len);
}
+/**
+ * address_space_iommu_get: Get the backend IOMMU for the address space
+ *
+ */
+IOMMUObject *address_space_iommu_get(AddressSpace *as);
+
#endif
#endif
diff --git a/include/hw/core/iommu.h b/include/hw/core/iommu.h
new file mode 100644
index 0000000..34387c0
--- /dev/null
+++ b/include/hw/core/iommu.h
@@ -0,0 +1,73 @@
+/*
+ * QEMU emulation of IOMMU logic
+ *
+ * Copyright (C) 2017 Red Hat Inc.
+ *
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef HW_CORE_IOMMU_H
+#define HW_CORE_IOMMU_H
+
+#include "qemu/queue.h"
+
+enum IOMMUEvent {
+ IOMMU_EVENT_BIND_PASIDT,
+};
+typedef enum IOMMUEvent IOMMUEvent;
+
+struct IOMMUEventData {
+ IOMMUEvent event;
+ uint64_t length;
+ void *data;
+};
+typedef struct IOMMUEventData IOMMUEventData;
+
+typedef struct IOMMUNotifier IOMMUNotifier;
+
+typedef void (*IOMMUNotifyFn)(IOMMUNotifier *notifier,
+ IOMMUEventData *event_data);
+
+struct IOMMUNotifier {
+ IOMMUNotifyFn iommu_notify;
+ /*
+ * What events we are listening to. Let's allow multiple event
+ * registrations from beginning.
+ */
+ IOMMUEvent event;
+ QLIST_ENTRY(IOMMUNotifier) node;
+};
+
+typedef struct IOMMUObject IOMMUObject;
+
+/*
+ * This stands for an IOMMU unit. Any translation device should have
+ * this struct inside its own structure to make sure it can leverage
+ * common IOMMU functionalities.
+ */
+struct IOMMUObject {
+ QLIST_HEAD(, IOMMUNotifier) iommu_notifiers;
+};
+
+void iommu_notifier_register(IOMMUObject *iommu,
+ IOMMUNotifier *n,
+ IOMMUNotifyFn fn,
+ IOMMUEvent event);
+void iommu_notifier_unregister(IOMMUObject *iommu,
+ IOMMUNotifier *notifier);
+void iommu_notify(IOMMUObject *iommu, IOMMUEventData *event_data);
+
+#endif
diff --git a/memory.c b/memory.c
index 77fb3ef..307f665 100644
--- a/memory.c
+++ b/memory.c
@@ -235,8 +235,6 @@ struct FlatView {
MemoryRegion *root;
};
-typedef struct AddressSpaceOps AddressSpaceOps;
-
#define FOR_EACH_FLAT_RANGE(var, view) \
for (var = (view)->ranges; var < (view)->ranges + (view)->nr; ++var)
@@ -2793,6 +2791,14 @@ static void do_address_space_destroy(AddressSpace *as)
memory_region_unref(as->root);
}
+IOMMUObject *address_space_iommu_get(AddressSpace *as)
+{
+ if (!as->as_ops.iommu_get) {
+ return NULL;
+ }
+ return as->as_ops.iommu_get(as);
+}
+
void address_space_destroy(AddressSpace *as)
{
MemoryRegion *root = as->root;
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
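To make the intended usage of the framework above concrete, here is a rough,
self-contained sketch of how a vIOMMU emulator and a consumer such as VFIO
could fit together. Only IOMMUObject, IOMMUEventData, IOMMU_EVENT_BIND_PASIDT,
iommu_notifier_register(), iommu_notify() and address_space_iommu_get() come
from this patch; every Foo*/consumer_* name is made up, and how as_ops is
actually wired up belongs to later patches of the series, so it is only
assumed here.

/*
 * Illustrative sketch only. "FooIOMMUState", "foo_*" and "consumer_*"
 * are hypothetical; the IOMMUObject API is the one introduced above.
 */
#include "qemu/osdep.h"
#include "exec/memory.h"
#include "hw/core/iommu.h"

typedef struct FooIOMMUState {
    IOMMUObject iommu;       /* embedded common IOMMU object */
    AddressSpace dma_as;     /* DMA address space exposed to the device */
} FooIOMMUState;

/* AddressSpaceOps.iommu_get hook: return the IOMMUObject behind the AS */
static IOMMUObject *foo_as_iommu_get(AddressSpace *as)
{
    FooIOMMUState *s = container_of(as, FooIOMMUState, dma_as);

    return &s->iommu;
}

static void foo_iommu_init(FooIOMMUState *s)
{
    QLIST_INIT(&s->iommu.iommu_notifiers);
    /* Wiring up as_ops is done elsewhere in the series for intel_iommu;
     * assumed here to look roughly like this. */
    s->dma_as.as_ops.iommu_get = foo_as_iommu_get;
}

/* Emulator side: the guest programmed its PASID table pointer */
static void foo_iommu_bind_pasid_table(FooIOMMUState *s, uint64_t pasidt_gpa)
{
    IOMMUEventData event_data = {
        .event  = IOMMU_EVENT_BIND_PASIDT,
        .length = sizeof(pasidt_gpa),
        .data   = &pasidt_gpa,   /* payload layout is up to the two ends */
    };

    iommu_notify(&s->iommu, &event_data);
}

/* Consumer side (e.g. VFIO): register only when a vIOMMU is present */
static void consumer_bind_pasidt_notify(IOMMUNotifier *n,
                                        IOMMUEventData *event_data)
{
    /* e.g. pass event_data->data down to the host IOMMU through VFIO */
}

static IOMMUNotifier consumer_notifier;

static void consumer_setup(AddressSpace *as)
{
    IOMMUObject *iommu = address_space_iommu_get(as);

    if (!iommu) {
        return;   /* no vIOMMU behind this address space */
    }
    iommu_notifier_register(iommu, &consumer_notifier,
                            consumer_bind_pasidt_notify,
                            IOMMU_EVENT_BIND_PASIDT);
}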
Auger Eric
2017-11-14 21:52:54 UTC
Permalink
Hi Yi L,
Post by Liu, Yi L
Hi Eric,
Hi Yi L,
Post by Liu, Yi L
AddressSpaceOps is similar to MemoryRegionOps; it is just a place for
address spaces to store arch-specific hooks.
The first hook I would like to introduce is iommu_get(), which returns
the IOMMUObject behind the AddressSpace.
For systems that have IOMMUs, we will create a special address space
per device which is different from the system default address space
for it (please refer to pci_device_iommu_address_space()).
Normally when that happens, there will be one specific IOMMU (or
say, translation unit) standing right behind that new address space.
This iommu_get() fetches that unit behind the address space. Here,
the unit is defined as IOMMUObject, which includes a notifier_list
so far and may be extended in future. Along with IOMMUObject, a new
iommu notifier mechanism is introduced. It would be used for virt-svm.
Also, IOMMUObject can further have an IOMMUObjectOps which is similar
to MemoryRegionOps. The difference is that IOMMUObjectOps does not
rely on MemoryRegion.
Hi, sorry I didn't reply to the earlier postings of this after our
discussion in China. I've been sick several times and very busy.
Hi David,
Fully understood. I'll try my best to address your question. Also,
feel free to input further questions, anyhow, the more we discuss the
better work we done.
I still don't feel like there's an adequate explanation of exactly
what an IOMMUObject represents. Obviously it can represent more than
The IOMMUObject is aimed at representing the iommu itself, e.g. the
iommu-specific operations. One of the key purposes of IOMMUObject is to
introduce a notifier framework so that an iommu emulator can perform
iommu operations other than MAP/UNMAP. As IOMMUs grow more and more
features, MAP/UNMAP is not the only operation an iommu emulator needs
to handle, e.g. shared virtual memory. As far as I know, AMD/ARM also
have it; please correct me if I am wrong. As my cover letter mentioned,
the MR-based notifier framework doesn't work for the newly added IOMMU
operations, such as binding the guest PASID table pointer to the host
and propagating the guest's iotlb flushes to the host.
a single translation window - since that's represented by the
IOMMUMR. But what exactly do all the MRs - or whatever else - that
are represented by the IOMMUObject have in common, from a functional
point of view.
Let me take virt-SVM as an example. As far as I know, the virt-SVM
implementations of different vendors are similar. The key design is to
have a nested translation (aka two-stage translation): the guest
maintains the gVA->gPA mapping and the hypervisor builds the gPA->hPA
mapping, similar to the EPT-based virt-MMU solution.
In QEMU, the gPA->hPA mapping is done through the MAP/UNMAP notifiers,
and that can keep going. But for the gVA->gPA mapping, only the guest
knows it, so the hypervisor needs to trap the relevant guest iommu
operation and pass the gVA->gPA mapping knowledge to the host through a
(newly added) notifier. In VT-d, this is called binding the guest PASID
table to the host.
What I don't get is that the PASID table is per extended context entry.
I understand the latter is indexed by PCI device/function. And today MRs
are created per PCIe device, if I am not wrong.
In my understanding, the MR is more related to the AddressSpace; it is
not exactly tagged with a PCIe device.
I meant, in the current intel_iommu code, vtd_find_add_as() creates 1
IOMMU MR and 1 AS per PCIe device, right?
Post by Liu, Yi L
So why can't we have 1
new MR notifier dedicated to PASID table passing? My understanding is
the MR, having a 1-1 correspondence with a PCIe device and thus a
context could be of right granularity. Then I understand the only flags
I didn't quite get your point regarding the "granularity" here. Could
you elaborate a little bit more?
The PASID table is per device (contained by the extended context, which
is dev/fn indexed). The "record_device" notifier is also attached to a
specific PCIe device. So we can't really say they have an iommu-wide
scope (PCIe device granularity would fit). However, I understand from
the explanation below that the TLB invalidate notifier is not especially
tied to a given source-id, as we are going to invalidate by PASID/page.
I think the main justification behind introducing this new framework is
that PT is set along with SVM and in this case the IOMMU MR notifiers
are not registered since the IOMMU MR is disabled.
So new notifiers do not fit nicely in the current framework.
Post by Liu, Yi L
we currently have are NONE, MAP and UNMAP but couldn't we add a new one
for PASID TABLE passing? So this is not crystal clear to me why MR
notifiers are not adapted to PASID table passing.
This was also my initial plan. You may find some details in the link below.
https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg05295.html
In brief, the new notifier I want to add is not really MR-related; it is
more like a behaviour of a translation unit. Also, as my cover letter
mentioned, the MR notifiers would not be registered for VT-d (virt-SVM)
in region_add(), so it would require registering MR notifiers outside of
region_add().
Yes, to me this is the main issue: IOMMU MRs are not registered in PT
mode. The commit message of "[RFC PATCH 06/20] VFIO: add new notifier for
binding PASID table" helped me understand the problem, although I did
not quite understand what you call "vtd_address_space".
Also paste below. I think it more or less breaks the
Post by Liu, Yi L
consistency of MR notifiers. Also, I think existing MR notifiers are more
related IOVA address translation(on VT-d it is 2nd level translation, for ARM
is it stage 2?),
Well, level 1 input address is IOVA as well.
On ARM, IOVA -> IPA*=GPA is stage1 and GPA -> HPA is stage2. So I would
rather say that for ARM, MR relates to stage1 which is the one emulated
by vSMMU along with VFIO.
*IPA = intermediate physical address.
and there is some existing code that relies heavily on this
Post by Liu, Yi L
assumption, e.g. the memory replay introduced by Peter: the notifier node
would affect the replay. Adding a non-MAP/UNMAP notifier would break
that logic. So that is also a reason to add a separate framework instead
of just adding a new flag to the existing MR notifier framework.
https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg04931.html
+ /* Check if vIOMMU exists */
+ QTAILQ_FOREACH(subregion, &as->root->subregions, subregions_link) {
+ if (memory_region_is_iommu(subregion)) {
+ IOMMUNotifier n1;
+
+ /*
+ FIXME: current iommu notifier is actually designed for
+ IOMMUTLB MAP/UNMAP. However, vIOMMU emulator may need
+ notifiers other than MAP/UNMAP, so it'll be better to
+ split the non-IOMMUTLB notifier from the current IOMMUTLB
+              notifier framework.
+ */
+ iommu_notifier_init(&n1, vfio_iommu_bind_pasid_tbl_notify,
+ IOMMU_NOTIFIER_SVM_PASIDT_BIND,
+ 0,
+ 0);
+ vfio_register_notifier(group->container,
+ subregion,
+ 0,
+ &n1);
+ }
+ }
+
Also, for the gVA iotlb flushing, only the guest knows about it, so the
hypervisor needs to propagate it to the host. Here, MAP/UNMAP is not
suitable, since this gVA iotlb flush does not require modifying the host
iommu translation table.
I don't really get this argument. IOMMUNotifier just is a notifier that
is attached to an IOMMU MR and calls an IOMMUNotify function, right?
yes, it is.
Then the role of the function is currently tied to the existing flags,
MAP and UNMAP. This is not linked to an action on the physical IOMMU,
right?
The MAP/UNMAP notifiers finally talk to the physical IOMMU, don't they?
My point here is that MAP/UNMAP would eventually talk to the physical
IOMMU to change the translation page table in memory.
yes it programs the pIOMMU with combined translation stages.
However, in the virt-SVM case, the translation page
Post by Liu, Yi L
table is the process vaddr page table (though the I/O page table is also
used, we don't need to discuss it since the hypervisor owns it). The
process vaddr page table is owned by the guest, and changes to it are
made by the guest. So for such a cache, we just need to flush the cache
on the iommu side; there is no need to modify the translation page table.
agreed, but the MR notifier is not necessarily supposed to touch the
table, right?
Post by Liu, Yi L
At the time the gVA iotlb flush is issued, the gVA->gPA mapping has
already been modified. The host iommu only needs to reference it when
performing address translation. But before the host iommu performs the
translation, it needs to flush the old gVA cache. In VT-d, this is called
1st-level cache flushing.
The fact that MR notifiers may not be relevant could be linked to the
nature of the tagging of the structures you want to flush. My
understanding is that if you want to flush by source-id, MR granularity
could be fine. Could you please clarify why you need an iommu-wide
operation in that case.
The flush is not limited to source-id granularity; it can also be
page-selective, among others. As I mentioned, it does not require
modifying the translation page table, so it is more a behaviour of the
translation unit.
we could also enumerate all MR notifiers - assuming they are registered!
- and call notifiers for each of them, right? Maybe this is not the
strongest argument.
Post by Liu, Yi L
Both of the two notifiers (operations) have no direct relationship with
the MR; instead they depend heavily on the virt-iommu itself.
As described above, this is not obvious to me. Could you please clarify
why source-id granularity (which I understand has a 1-1 correspondence
with the MR/AS) is not the correct granularity. Of course, please correct
me if my understanding of the MR mapping is not correct.
It's correct that for a PCIe device, the iova address space (aka the PCI
address space) has roughly a 1-1 relationship with the MR. But for
virt-SVM, the address space is not limited to the iova address space; it
also has process address spaces, and how can such an address space
relate to an MR...
So, this discussion, reading again the cover letter, and digging into
your original series helped me understand the need for those new
notifiers. What confused me originally was the "iommu" wide notifiers.
Maybe you should clarify your cover letter by explicitly saying that
- SVM works along with PT = 1
- if PT = 1 IOMMU MR are disabled so MR notifier are not registered
- new notifiers do not fit nicely in this framework as they need to be
registered even if PT = 1
- having those notifiers attached to the whole IOMMU object abstraction
is not an issue
Please correct me if I said anything wrong, and please accept my
apologies if I brought any further confusion.
Thanks
Eric
Post by Liu, Yi L
Thanks,
Yi L
Thanks
Eric
If a virt-iommu exists,
then the two notifiers are needed; if not, they are not.
Even understanding the SVM stuff better than I did, I don't really see
why an AddressSpace is an obvious unit to have an IOMMUObject
associated with it.
This will benefit the notifier registration. As in my comments above,
the IOMMUObject is meant to represent the iommu. Associating an
AddressSpace with an IOMMUObject makes it easy to check whether it is
necessary to register the notifiers: if there is no IOMMUObject, no
virt-iommu is exposed to the guest, so there is no need to register
notifiers. For this, I also considered using MemoryRegion.iommu_ops,
e.g. for VT-d it could be a loop that checks all the subregions and
registers notifiers for each iommu MemoryRegion. Peter mentioned it may
not work for sPAPR, so he proposed associating an AddressSpace with an
IOMMUObject. I think it works and is easier, so I didn't object to it.
https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg04931.html
+ /* Check if vIOMMU exists */
+ QTAILQ_FOREACH(subregion, &as->root->subregions, subregions_link) {
+ if (memory_region_is_iommu(subregion)) {
+ IOMMUNotifier n1;
+
+ /*
+ FIXME: current iommu notifier is actually designed for
+ IOMMUTLB MAP/UNMAP. However, vIOMMU emulator may need
+ notifiers other than MAP/UNMAP, so it'll be better to
+ split the non-IOMMUTLB notifier from the current IOMMUTLB
+             notifier framework.
+ */
+ iommu_notifier_init(&n1, vfio_iommu_bind_pasid_tbl_notify,
+ IOMMU_NOTIFIER_SVM_PASIDT_BIND,
+ 0,
+ 0);
+ vfio_register_notifier(group->container,
+ subregion,
+ 0,
+ &n1);
+ }
+ }
+
Thanks,
Yi L
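To make the other use case discussed above (propagating the guest's
first-level iotlb flush) equally concrete, here is a hypothetical sketch of a
second event travelling through the same notifier list. Note that this series
only defines IOMMU_EVENT_BIND_PASIDT; the IOMMU_EVENT_TLB_INVALIDATE value and
the payload struct below are assumptions made purely for illustration.

#include "qemu/osdep.h"
#include "hw/core/iommu.h"

/* Hypothetical event id: a real extension would add a new member to
 * enum IOMMUEvent next to IOMMU_EVENT_BIND_PASIDT. */
#define IOMMU_EVENT_TLB_INVALIDATE ((IOMMUEvent)(IOMMU_EVENT_BIND_PASIDT + 1))

/* Hypothetical payload: which first-level (gVA) translations to drop */
typedef struct IOMMUTlbInvalidate {
    uint32_t pasid;     /* process address space to invalidate */
    uint64_t addr;      /* start address, gVA */
    uint64_t size;      /* 0 could mean "flush the whole PASID" */
} IOMMUTlbInvalidate;

/* vIOMMU emulator side: the guest queued a 1st-level cache invalidation.
 * No shadow page table needs to be rewritten; the registered notifier
 * (e.g. in VFIO) only has to tell the host IOMMU to drop its cached
 * gVA translations. */
static void viommu_flush_first_level(IOMMUObject *iommu, uint32_t pasid,
                                     uint64_t addr, uint64_t size)
{
    IOMMUTlbInvalidate inv = {
        .pasid = pasid,
        .addr  = addr,
        .size  = size,
    };
    IOMMUEventData event_data = {
        .event  = IOMMU_EVENT_TLB_INVALIDATE,
        .length = sizeof(inv),
        .data   = &inv,
    };

    iommu_notify(iommu, &event_data);
}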
Liu, Yi L
2017-11-15 02:36:19 UTC
Permalink
Hi Eric,
Hi Yi L,
Post by Liu, Yi L
Hi Eric,
Hi Yi L,
Post by Liu, Yi L
AddressSpaceOps is similar to MemoryRegionOps, it's just for address
spaces to store arch-specific hooks.
The first hook I would like to introduce is iommu_get(). Return an
IOMMUObject behind the AddressSpace.
For systems that have IOMMUs, we will create a special address
space per device which is different from system default address
space for it (please refer to pci_device_iommu_address_space()).
Normally when that happens, there will be one specific IOMMU (or
say, translation unit) stands right behind that new address space.
This iommu_get() fetches that guy behind the address space. Here,
the guy is defined as IOMMUObject, which includes a notifier_list
so far, may extend in future. Along with IOMMUObject, a new iommu
notifier mechanism is introduced. It would be used for virt-svm.
Also IOMMUObject can further have a IOMMUObjectOps which is similar
to MemoryRegionOps. The difference is IOMMUObjectOps is not relied
on MemoryRegion.
Hi, sorry I didn't reply to the earlier postings of this after our
discussion in China. I've been sick several times and very busy.
Hi David,
Fully understood. I'll try my best to address your question. Also,
feel free to input further questions, anyhow, the more we discuss the
better work we done.
I still don't feel like there's an adequate explanation of exactly
what an IOMMUObject represents. Obviously it can represent more than
IOMMUObject is aimed to represent the iommu itself. e.g. the iommu
specific operations. One of the key purpose of IOMMUObject is to
introduce a notifier framework to let iommu emulator to be able to
do iommu operations other than MAP/UNMAP. As IOMMU grows more and
more feature, MAP/UNMAP is not the only operation iommu emulator needs
to deal. e.g. shared virtual memory. So far, as I know AMD/ARM also
has it. may correct me on it. As my cover letter mentioned, MR based
notifier framework doesn’t work for the newly added IOMMU operations.
Like bind guest pasid table pointer to host and propagate guest's
iotlb flush to host.
a single translation window - since that's represented by the
IOMMUMR. But what exactly do all the MRs - or whatever else - that
are represented by the IOMMUObject have in common, from a functional
point of view.
Let me take virt-SVM as an example. As far as I know, for virt-SVM,
the implementation of different vendors are similar. The key design
is to have a nested translation(aka. two stage translation). It is to
have guest maintain gVA->gPA mapping and hypervisor builds gPA->hPA
mapping. Similar to EPT based virt-MMU solution.
In Qemu, gPA->hPA mapping is done through MAP/UNMAP notifier, it can
keep going. But for gVA->gPA mapping, only guest knows it, so hypervisor
needs to trap specific guest iommu operation and pass the gVA->gPA
mapping knowledge to host through a notifier(newly added one). In VT-d,
it is called bind guest pasid table to host.
What I don't get is the PASID table is per extended context entry. I
understand this latter is indexed by PCI device function. And today MR
are created per PCIe device if I am not wrong.
In my understanding, MR is more related to AddressSpace not exactly tagged
with PCIe device.
I meant, in the current intel_iommu code, vtd_find_add_as() creates 1
IOMMU MR and 1 AS per PCIe device, right?
yes, it is. This is the PCIe device address space; it can be the guest's
physical address space if no vIOMMU is exposed, or the guest IOVA address
space if a vIOMMU is exposed. Both of these address spaces are handled by
the 2nd-level translation in VT-d, which is different from the 1st-level
translation used for process vaddr spaces.
Post by Liu, Yi L
So why can't we have 1
new MR notifier dedicated to PASID table passing? My understanding is
the MR, having a 1-1 correspondence with a PCIe device and thus a
context could be of right granularity. Then I understand the only flags
I didn't quite get your point regards to the "granlarity" here. May talk
a little bit more here?
The PASID table is per device (contained by extended context which is
dev/fn indexed). The "record_device" notifier also is attached to a
specific PCIe device. So we can't really say they have an iommu wide
scope (PCIe device granularity would fit). However I understand from
below explanation that TLB invalidate notifier is not especially tight
to a given source-id as we are going to invalidate by PASID/page.
correct, it has no tight relationship to the "granularity" discussed
here for introducing this new notifier framework.
I think the main justification behind introducing this new framework is
that PT is set along with SVM and in this case the IOMMU MR notifiers
are not registered since the IOMMU MR is disabled.
So new notifiers do not fit nicely in current framework.
exactly, that's what I mean. Although we have a trick to fix it, a trick
is still a trick. Better to have a new, dedicated framework.
Post by Liu, Yi L
we currently have are NONE, MAP and UNMAP but couldn't we add a new one
for PASID TABLE passing? So this is not crystal clear to me why MR
notifiers are not adapted to PASID table passing.
This is also my initial plan. You may get some detail in the link below.
https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg05295.html
In brief, the new notifier I want to add is not really MR related and
just more like a behaviour of a translation unit. Also, as my cover letter
mentioned that the MR notifiers would not be registered for VT-d(virt-svm)
in region_add(). Then it requires to register MR notifiers out of the
region_add().
Yes to me this is the main issue: IOMMU MR are not registered in PT
mode. Commit message of "[RFC PATCH 06/20] VFIO: add new notifier for
binding PASID table" helped me to introduce the problem, although I did
not quite understand what you call "vtd_address_space".
"vtd_address_space" is a wrapper of the DMA AddressSpace of a device behind
a virtual VT-d.
Also paste below. I think it more or less breaks the
Post by Liu, Yi L
consistency of MR notifiers. Also, I think existing MR notifiers are more
related IOVA address translation(on VT-d it is 2nd level translation, for ARM
is it stage 2?),
Well, level 1 input address is IOVA as well.
On ARM, IOVA -> IPA*=GPA is stage1 and GPA -> HPA is stage2. So I would
rather say that for ARM, MR relates to stage1 which is the one emulated
by vSMMU along with VFIO.
*IPA = intermediate physical address.
Thanks for the correction. BTW, if you have virt-SVM on ARM, will stage 1
cover the gVA->gPA/IPA translation?
and there is some existing codes highly rely on this
Post by Liu, Yi L
assumption. e.g. the memory_replay introduced by Peter, the notifier node
would affect the replay. If adding a non MAP/UNMAP notifier, it would break
the logic. So it's also a reason to add a separate framework instead of
just adding a new flag to the existing MR notifier framework.
https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg04931.html
+ /* Check if vIOMMU exists */
+ QTAILQ_FOREACH(subregion, &as->root->subregions, subregions_link) {
+ if (memory_region_is_iommu(subregion)) {
+ IOMMUNotifier n1;
+
+ /*
+ FIXME: current iommu notifier is actually designed for
+ IOMMUTLB MAP/UNMAP. However, vIOMMU emulator may need
+ notifiers other than MAP/UNMAP, so it'll be better to
+ split the non-IOMMUTLB notifier from the current IOMMUTLB
+             notifier framework.
+ */
+ iommu_notifier_init(&n1, vfio_iommu_bind_pasid_tbl_notify,
+ IOMMU_NOTIFIER_SVM_PASIDT_BIND,
+ 0,
+ 0);
+ vfio_register_notifier(group->container,
+ subregion,
+ 0,
+ &n1);
+ }
+ }
+
Also, for the gVA iotlb flushing, only guest knows it. So hypervisor
needs to propagate it to host. Here, MAP/UNMAP is not suitable since
this gVA iotlb flush here doesn’t require to modify host iommu
translation table.
I don't really get this argument. IOMMUNotifier just is a notifier that
is attached to an IOMMU MR and calls a an IOMMUNotify function, right?
yes, it is.
Then the role of the function currently is attached to the currently
existing flags, MAP, UNMAP. This is not linked to an action on the
physical IOMMU, right?
The MAP/UNMAP notifier finally talks to physical IOMMU. is it? My point
here is MAP/UNMAP finally would talk to physical IOMMU change the translation
page table in memory.
yes it programs the pIOMMU with combined translation stages.
However, for virt-svm case, the translation page
Post by Liu, Yi L
table is the process vaddr page table(though the I/O page table is also used
we don't need to talk it since hypervisor owns it). process vaddr page table
is owned by guest, changes to the translation page table is by guest. So for
such cache, just need to flush the cache in iommu side. no need to modify
translation page table.
agreed but the MR notifier is not necessarily supposed to touch table,
right?
Not sure about other platforms; for VT-d, the existing MAP/UNMAP
notifiers do not touch the page table within QEMU, but they do touch it
through the VFIO APIs. And I think that is necessary when they are
registered, e.g. when a virtual VT-d is exposed the guest can use guest
IOVA, which requires the MAP/UNMAP APIs to shadow the mappings the guest
iommu driver builds.
Post by Liu, Yi L
At the time gVA iotlb flush is issued, the gVA->gPA
mapping has already modified. Host iommu only needs to reference it when
performing address translation. But before host iommu perform translation,
it needs to flush the old gVA cache. In VT-d, it is called 1st level cache
flushing.
The fact MR notifiers may not be relevant could be linked to the nature
of the tagging of the structures you want to flush. My understanding is
if you want to flush by source-id, MR granularity could be fine. Please
could you clarify why do you need an iommu wide operation in that case.
The flush is not limited to source-id granularity, it would be page selected
and others. As I mentioned, it has no requirement to modify the translation
page table, so it is more like a translation unit behavior.
we could also enumerate all MR notifiers - assuming they are registered!
- and call notifiers for each of them, right? Maybe this is not the
strongest argument.
yes, it's not the strongest argument for the implementation. But for the
design, it is a strong argument to address the necessity across different
platforms.
Post by Liu, Yi L
Both of the two notifiers(operations) has no direct relationship with MR,
instead they highly depends on virt-iommu itself.
As described above this is not obvious to me. Please could you clarify
why source-id granularity (which I understand has a 1-1 granularity with
MR/AS is not the correct granularity). Of course, please correct me if
my understanding of MR mapping is not correct.
It's correct that for the PCIe device, the iova address space(aka PCI address
space) has kind of 1-1 relationship with MR. But, for virt-SVM, the address
space is not limted to iova address space, it has process address space, how
can such an address space relate to a MR...
So, this discussion, reading again the cover letter, and digging into
your original series helped me understand the need for those new
notifiers. What confused me originally was the "iommu" wide notifiers.
Maybe you should clarify your cover letter by explicitly saying that
- SVM works along with PT = 1
- if PT = 1 IOMMU MR are disabled so MR notifier are not registered
- new notifiers do not fit nicely in this framework as they need to be
registered even if PT = 1
- having those notifiers attached to the whole IOMMU object abstraction
is not an issue
Please correct me if I said anything wrong and please apologize if I
brought any further confusion.
Yes, you're right; I should have written it in a better manner. I will
update the cover letter accordingly in the next version. Also, feel free
to let me know whether it meets the requirements of the pv work you're
doing.
Thanks,
Yi L
Thanks
Eric
Post by Liu, Yi L
Thanks,
Yi L
Thanks
Eric
If virt-iommu exists,
then the two notfiers are needed, if not, then it's not.
Even understanding the SVM stuff better than I did, I don't really see
why an AddressSpace is an obvious unit to have an IOMMUObject
associated with it.
This will benefit the notifier registration. As my comments above, the
IOMMUObject is to represent iommu. Associate an AddressSpace with an
IOMMUObject makes it easy to check if it is necessary to register the
notifiers.
If no IOMMUObject, means no virt-iommu exposed to guest, then
no need to register notifiers. For this, I also considered to use the
MemoryRegion.iommu_ops. e.g. for VT-d, it can be a loop to check all the
subregions and register notfiers if it is an iommu MemoryRegion. Peter
mentioned it may not work for SPAR. So he proposed associating an
AddressSpace with an IOMMUObject. I think it wroks and easier, so I
didn’t object it.
https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg04931.html
+ /* Check if vIOMMU exists */
+ QTAILQ_FOREACH(subregion, &as->root->subregions, subregions_link) {
+ if (memory_region_is_iommu(subregion)) {
+ IOMMUNotifier n1;
+
+ /*
+ FIXME: current iommu notifier is actually designed for
+ IOMMUTLB MAP/UNMAP. However, vIOMMU emulator may need
+ notifiers other than MAP/UNMAP, so it'll be better to
+ split the non-IOMMUTLB notifier from the current IOMMUTLB
+             notifier framework.
+ */
+ iommu_notifier_init(&n1, vfio_iommu_bind_pasid_tbl_notify,
+ IOMMU_NOTIFIER_SVM_PASIDT_BIND,
+ 0,
+ 0);
+ vfio_register_notifier(group->container,
+ subregion,
+ 0,
+ &n1);
+ }
+ }
+
Thanks,
Yi L
Peter Xu
2017-11-15 07:16:32 UTC
Permalink
On Tue, Nov 14, 2017 at 10:52:54PM +0100, Auger Eric wrote:
[...]
Post by Auger Eric
I meant, in the current intel_iommu code, vtd_find_add_as() creates 1
IOMMU MR and 1 AS per PCIe device, right?
I think this is the trickiest point - in QEMU an IOMMU MR does not really
have a 1:1 relationship to devices. For Intel it's true; for Power it's
not. On Power guests, one device's DMA address space can be split into
different translation windows, and each window corresponds to one IOMMU
MR.

So IMHO the real 1:1 mapping is between the device and its DMA address
space, rather than MRs.

It's been a long time since I drafted the patches. I think at the least
this should be a more general notifier mechanism compared to the current
IOMMUNotifier thing, which is bound to IOTLB notifies only. AFAICT, if we
want to trap first-level translation changes, the current notifier is not
even close to that interface - just see the definition of IOMMUTLBEntry:
it is tailored only for MAP/UNMAP of translation addresses, nothing else.
And IMHO that's why it's tightly bound to MemoryRegions, and that's the
root problem. The dynamic IOMMU MR switching problem is related to this
issue as well.

I am not sure the current "get IOMMU object from address space" solution
would be best; maybe it's too big a scope. I think it depends on whether
in the future we'll have some requirement with such a bigger scope (say,
something we want to trap from the vIOMMU and deliver to the host IOMMU
which may not even be device-related? I don't know). Another alternative
I am thinking of is whether we can provide a per-device notifier; then it
can be bound to a PCIDevice rather than MemoryRegions, so it will be in
device scope.
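A very rough sketch of what that per-device alternative could look like, just
to show the shape of it; PCIBus and pci_setup_iommu() are real QEMU, but every
other name below is invented for this discussion and exists in no posted patch.

#include "qemu/osdep.h"
#include "hw/core/iommu.h"
#include "hw/pci/pci.h"

/* Hypothetical: let the vIOMMU hand out an IOMMUObject per (bus, devfn),
 * mirroring how pci_device_iommu_address_space() already resolves the
 * DMA address space per device. */
typedef IOMMUObject *(*PCIIOMMUObjectFunc)(PCIBus *bus, void *opaque,
                                           int devfn);

static struct {
    PCIIOMMUObjectFunc fn;
    void *opaque;
} iommu_object_hook;

/* Called by the vIOMMU at machine init time, like pci_setup_iommu() */
static void pci_setup_iommu_object(PCIIOMMUObjectFunc fn, void *opaque)
{
    iommu_object_hook.fn = fn;
    iommu_object_hook.opaque = opaque;
}

/* Called by a consumer such as VFIO when it realizes a device */
static IOMMUObject *pci_device_iommu_object(PCIBus *bus, int devfn)
{
    if (!iommu_object_hook.fn) {
        return NULL;   /* no vIOMMU: nothing to register notifiers on */
    }
    return iommu_object_hook.fn(bus, iommu_object_hook.opaque, devfn);
}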
Thanks,
--
Peter Xu
David Gibson
2017-12-18 11:35:31 UTC
Permalink
Post by Peter Xu
[...]
Post by Auger Eric
I meant, in the current intel_iommu code, vtd_find_add_as() creates 1
IOMMU MR and 1 AS per PCIe device, right?
I think this is the most tricky point - in QEMU IOMMU MR is not really
a 1:1 relationship to devices. For Intel, it's true; for Power, it's
not. On Power guests, one device's DMA address space can be splited
into different translation windows, while each window corresponds to
one IOMMU MR.
Right.
Post by Peter Xu
So IMHO the real 1:1 mapping is between the device and its DMA address
space, rather than MRs.
That's not true either. With both POWER and Intel, several devices
can share a DMA address space: on POWER if they are in the same PE, on
Intel if they are placed in the same IOMMU domain.
On x86 and on POWER bare metal we generally try to make the minimum
granularity for each PE/domain be a single function. However, that
may not be possible in the case of PCIe to PCI bridges, or
multifunction devices where the functions aren't properly isolated
from each other (e.g. function 0 debug registers which can affect
other functions are quite common).
For POWER guests we only have one PE/domain per virtual host bridge.
That's just a matter of implementation simplicity - if you want fine
grained isolation you can just create more virtual host bridges.
Post by Peter Xu
It's been a long time since when I drafted the patches. I think at
least that should be a more general notifier mechanism comparing to
current IOMMUNotifier thing, which was bound to IOTLB notifies only.
AFAICT if we want to trap first-level translation changes, current
notifier is not even close to that interface - just see the definition
of IOMMUTLBEntry, it is tailored only for MAP/UNMAP of translation
addresses, not anything else. And IMHO that's why it's tightly bound
to MemoryRegions, and that's the root problem. The dynamic IOMMU MR
switching problem is related to this issue as well.
So, having read and thought a bunch more, I think I know where you
need to start hooking this in. The thing is the current qemu PCI DMA
structure assumes that each device belongs to just a single PCI
address space - that's what pci_device_iommu_address_space() returns.

For virt-SVM that's just not true. IIUC, a virt-SVM capable device
could simultaneously write to multiple process address spaces, since
the process IDs actually go over the bus.

So trying to hook notifiers at the AddressSpace OR MemoryRegion level
just doesn't make sense - if we've picked a single address space for
the device, we've already made a wrong step.

Instead what you need I think is something like:
pci_device_virtsvm_context(). virt-SVM capable devices would need to
call that *before* calling pci_device_iommu_address_space(). Well,
rather, the virt-SVM capable DMA helpers would need to call that.

That would return a new VirtSVMContext (or something) object, which
would roughly correspond to a single PASID table. That's where the
methods and notifiers for managing that would need to go.
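Purely as an illustration of the shape being described, such an interface
might look like the following; every name here is sketched from the paragraph
above, not taken from any posted patch.

#include "qemu/osdep.h"
#include "qemu/queue.h"
#include "hw/pci/pci.h"

/* Roughly corresponds to one guest PASID table for one device. */
typedef struct VirtSVMContext VirtSVMContext;

typedef void (*VirtSVMNotifyFn)(VirtSVMContext *ctx, void *data);

typedef struct VirtSVMNotifier {
    VirtSVMNotifyFn notify;
    QLIST_ENTRY(VirtSVMNotifier) node;
} VirtSVMNotifier;

struct VirtSVMContext {
    PCIBus *bus;
    int devfn;
    uint64_t pasid_table_gpa;                 /* set when the guest binds it */
    QLIST_HEAD(, VirtSVMNotifier) notifiers;  /* e.g. VFIO listens here for
                                                 bind/invalidate events */
};

/* virt-SVM capable DMA helpers would call this *before*
 * pci_device_iommu_address_space(); NULL means the vIOMMU does not
 * expose SVM for this device. */
VirtSVMContext *pci_device_virtsvm_context(PCIBus *bus, int devfn);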
Post by Peter Xu
I am not sure current "get IOMMU object from address space" solution
would be best, maybe it's "too bigger a scope", I think it depends on
whether in the future we'll have some requirement in such a bigger
scope (say, something we want to trap from vIOMMU and deliver it to
host IOMMU which may not even be device-related? I don't know). Now
another alternative I am thinking is, whether we can provide a
per-device notifier, then it can be bound to PCIDevice rather than
MemoryRegions, then it will be in device scope.
I think that sounds like a version of what I've suggested above.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
Liu, Yi L
2017-12-20 06:47:30 UTC
Permalink
Post by David Gibson
Post by Peter Xu
[...]
Post by Auger Eric
I meant, in the current intel_iommu code, vtd_find_add_as() creates 1
IOMMU MR and 1 AS per PCIe device, right?
I think this is the most tricky point - in QEMU IOMMU MR is not really
a 1:1 relationship to devices. For Intel, it's true; for Power, it's
not. On Power guests, one device's DMA address space can be splited
into different translation windows, while each window corresponds to
one IOMMU MR.
Right.
Post by Peter Xu
So IMHO the real 1:1 mapping is between the device and its DMA address
space, rather than MRs.
That's not true either. With both POWER and Intel, several devices
can share a DMA address space: on POWER if they are in the same PE, on
Intel if they are place in the same IOMMU domain.
On x86 and on POWER bare metal we generally try to make the minimum
granularity for each PE/domain be a single function. However, that
may not be possible in the case of PCIe to PCI bridges, or
multifunction devices where the functions aren't properly isolated
from each other (e.g. function 0 debug registers which can affect
other functions are quite common).
For POWER guests we only have one PE/domain per virtual host bridge.
That's just a matter of implementation simplicity - if you want fine
grained isolation you can just create more virtual host bridges.
Post by Peter Xu
It's been a long time since when I drafted the patches. I think at
least that should be a more general notifier mechanism comparing to
current IOMMUNotifier thing, which was bound to IOTLB notifies only.
AFAICT if we want to trap first-level translation changes, current
notifier is not even close to that interface - just see the definition
of IOMMUTLBEntry, it is tailored only for MAP/UNMAP of translation
addresses, not anything else. And IMHO that's why it's tightly bound
to MemoryRegions, and that's the root problem. The dynamic IOMMU MR
switching problem is related to this issue as well.
So, having read and thought a bunch more, I think I know where you
need to start hooking this in. The thing is the current qemu PCI DMA
structure assumes that each device belongs to just a single PCI
address space - that's what pci_device_iommu_address_space() returns.
For virt-SVM that's just not true. IIUC, a virt-SVM capable device
could simultaneously write to multiple process address spaces, since
the process IDs actually go over the bus.
Correct.
Post by David Gibson
So trying to hook notifiers at the AddressSpace OR MemoryRegion level
just doesn't make sense - if we've picked a single addresss space for
the device, we've already made a wrong step.
That's also why we want to have notifiers based on an IOMMUObject (maybe
not a suitable name; let me use it as the patch names it).
Post by David Gibson
pci_device_virtsvm_context(). virt-SVM capable devices would need to
call that *before* calling pci_device_iommu_address_space (). Well
rather the virt-SVM capable DMA helpers would need to call that.
That would return a new VirtSVMContext (or something) object, which
would roughly correspond to a single PASID table. That's where the
methods and notifiers for managing that would need to go.
Correct, pci_device_iommu_address_space() returns an AS, and it is a PCI
address space. And if pci_device_virtsvm_context() is also called in
vfio_realize(), it may not return an AS, since there may be no 1st-level
translation page table bound.

So, as you said, return a new VirtSVMContext; this VirtSVMContext can
hook some new notifiers. I think the IOMMUObject introduced in this patch
can meet the requirement, but it may be renamed.

So this addresses the concern you raised before, which is hooking the
IOMMUObject via a PCI address space. Regarding VirtSVMContext, it may be
a replacement for IOMMUObject. As it is related to PASID, I'm considering
naming it IOMMUPasidContext or IOMMUPasidObject, so it would be an
abstraction of all the IOMMU PASID related operations.
Regards,
Yi L
Post by David Gibson
Post by Peter Xu
I am not sure current "get IOMMU object from address space" solution
would be best, maybe it's "too bigger a scope", I think it depends on
whether in the future we'll have some requirement in such a bigger
scope (say, something we want to trap from vIOMMU and deliver it to
host IOMMU which may not even be device-related? I don't know). Now
another alternative I am thinking is, whether we can provide a
per-device notifier, then it can be bound to PCIDevice rather than
MemoryRegions, then it will be in device scope.
I think that sounds like a version of what I've suggested above.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
David Gibson
2017-12-20 11:18:16 UTC
Permalink
Post by Liu, Yi L
Post by David Gibson
Post by Peter Xu
[...]
Post by Auger Eric
I meant, in the current intel_iommu code, vtd_find_add_as() creates 1
IOMMU MR and 1 AS per PCIe device, right?
I think this is the most tricky point - in QEMU IOMMU MR is not really
a 1:1 relationship to devices. For Intel, it's true; for Power, it's
not. On Power guests, one device's DMA address space can be splited
into different translation windows, while each window corresponds to
one IOMMU MR.
Right.
Post by Peter Xu
So IMHO the real 1:1 mapping is between the device and its DMA address
space, rather than MRs.
That's not true either. With both POWER and Intel, several devices
can share a DMA address space: on POWER if they are in the same PE, on
Intel if they are place in the same IOMMU domain.
On x86 and on POWER bare metal we generally try to make the minimum
granularity for each PE/domain be a single function. However, that
may not be possible in the case of PCIe to PCI bridges, or
multifunction devices where the functions aren't properly isolated
from each other (e.g. function 0 debug registers which can affect
other functions are quite common).
For POWER guests we only have one PE/domain per virtual host bridge.
That's just a matter of implementation simplicity - if you want fine
grained isolation you can just create more virtual host bridges.
Post by Peter Xu
It's been a long time since when I drafted the patches. I think at
least that should be a more general notifier mechanism comparing to
current IOMMUNotifier thing, which was bound to IOTLB notifies only.
AFAICT if we want to trap first-level translation changes, current
notifier is not even close to that interface - just see the definition
of IOMMUTLBEntry, it is tailored only for MAP/UNMAP of translation
addresses, not anything else. And IMHO that's why it's tightly bound
to MemoryRegions, and that's the root problem. The dynamic IOMMU MR
switching problem is related to this issue as well.
So, having read and thought a bunch more, I think I know where you
need to start hooking this in. The thing is the current qemu PCI DMA
structure assumes that each device belongs to just a single PCI
address space - that's what pci_device_iommu_address_space() returns.
For virt-SVM that's just not true. IIUC, a virt-SVM capable device
could simultaneously write to multiple process address spaces, since
the process IDs actually go over the bus.
Correct.
Post by David Gibson
So trying to hook notifiers at the AddressSpace OR MemoryRegion level
just doesn't make sense - if we've picked a single addresss space for
the device, we've already made a wrong step.
That's also why we want to have notifiers based on IOMMUObject(may be
not a suitable name, let me use it as the patch named).
Right, I think "IOMMUObject" is a misleading name.
Post by Liu, Yi L
Post by David Gibson
pci_device_virtsvm_context(). virt-SVM capable devices would need to
call that *before* calling pci_device_iommu_address_space (). Well
rather the virt-SVM capable DMA helpers would need to call that.
That would return a new VirtSVMContext (or something) object, which
would roughly correspond to a single PASID table. That's where the
methods and notifiers for managing that would need to go.
Correct, pci_device_iommu_address_space() returns an AS and it is
a PCI address space. And if pci_device_virtsvm_context() is also
called in vfio_realize(), it may not return an AS since there may
be no 1st level translation page table bound.
So as you said, return a new VirtSVMContext, this VirtSVMContext can
hook some new notifiers. I think the IOMMUObject introduced in this patch
can meet the requirement. But it may be re-named.
Ok.
Post by Liu, Yi L
So here it addressed the concern you raised before which is hook IOMMUObject
via a PCI address space. Regards to VirtSVMContext, it may be a replacement
of IOMMUObject. As it is related to PASID, I'm considering to name it as
IOMMUPasidContext or IOMMUPasidObject. So it would be an abstraction of all
the IOMMU PASID related operations.
I'm ok with calling it a "PASID context".

Thinking about this some more, here are some extra observations:

* I think each device needs both a PASID context and an ordinary
address space. The PASID context would be used for bus
transactions which include a process id, the address space for
those that don't.

* Theoretically, the PASID context could be modelled as an array/map
of AddressSpace objects for each process ID. However, creating all
those AddressSpace objects in advance might be too expensive. I
can see a couple of options to avoid this:

1) Have the PASID context class include a 'translate' method similar
to the one in IOMMUMemoryRegionClass, but taking a process ID as well
as an address. This would avoid creating extra AddressSpace objects,
but might require duplicating a bunch of the translation code that
already exists for AddressSpace.

2) "Lazily" create AddressSpace objects. The generic part of the
PASID aware DMA helper functions would use a cache of AddressSpace's
for particular process IDs, using the AddressSpace (and MemoryRegion
within) to translate accesses for a particular process ID. However,
these AddressSpace and MemoryRegion objects would only be created when
the device first accesses that address space. In the common case,
where a single device is just being used by a single process or a
small number, this should keep the number of AddressSpace objects
relatively small. Obviously the cache would need to be invalidated,
cleaning up the AddressSpace objects, when the PASID table is altered.

* I realize that the expected case here is with KVM, where the guest
controls the first level translation, but the host controls the
second level translation. However, we should also be able to model
the case where the guest controls both levels for the sake of full
system emulation. I think understanding this case will lead to a
better design even for the simpler case.
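
To make option (1) a bit more concrete, here is a very rough sketch of the
kind of interface I have in mind - all names below are hypothetical, none of
this is existing qemu API:

#include "qemu/osdep.h"
#include "exec/memory.h"
#include "hw/pci/pci.h"

/* Rough sketch only - hypothetical names, not existing qemu code. */
typedef struct PASIDContext PASIDContext;

typedef struct PASIDContextClass {
    /*
     * Translate (pasid, addr) for the device behind this context; similar
     * in spirit to IOMMUMemoryRegionClass::translate, but taking a process
     * ID as well as an address.
     */
    IOMMUTLBEntry (*translate)(PASIDContext *pc, uint32_t pasid,
                               hwaddr addr, bool is_write);
    /* PASID table binding, invalidation, etc. would also hang off here. */
} PASIDContextClass;

struct PASIDContext {
    PASIDContextClass *ops;
};

/*
 * Hypothetical per-device lookup, analogous to
 * pci_device_iommu_address_space(): returns the PASID context that the
 * device's PASID-tagged transactions go through.
 */
PASIDContext *pci_device_pasid_context(PCIDevice *dev);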

Do you have a plan for what the virt-SVM aware DMA functions will look
like?
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
Liu, Yi L
2017-12-21 08:40:19 UTC
Permalink
[...]
Post by David Gibson
Post by Liu, Yi L
Post by David Gibson
So, having read and thought a bunch more, I think I know where you
need to start hooking this in. The thing is the current qemu PCI DMA
structure assumes that each device belongs to just a single PCI
address space - that's what pci_device_iommu_address_space() returns.
For virt-SVM that's just not true. IIUC, a virt-SVM capable device
could simultaneously write to multiple process address spaces, since
the process IDs actually go over the bus.
Correct.
Post by David Gibson
So trying to hook notifiers at the AddressSpace OR MemoryRegion level
just doesn't make sense - if we've picked a single address space for
the device, we've already made a wrong step.
That's also why we want to have notifiers based on IOMMUObject(may be
not a suitable name, let me use it as the patch named).
Right, I think "IOMMUObject" is a misleading name.
Post by Liu, Yi L
Post by David Gibson
pci_device_virtsvm_context(). virt-SVM capable devices would need to
call that *before* calling pci_device_iommu_address_space (). Well
rather the virt-SVM capable DMA helpers would need to call that.
That would return a new VirtSVMContext (or something) object, which
would roughly correspond to a single PASID table. That's where the
methods and notifiers for managing that would need to go.
Correct, pci_device_iommu_address_space() returns an AS and it is
a PCI address space. And if pci_device_virtsvm_context() is also
called in vfio_realize(), it may not return an AS since there may
be no 1st level translation page table bound.
So as you said, return a new VirtSVMContext, this VirtSVMContext can
hook some new notifiers. I think the IOMMUObject introduced in this patch
can meet the requirement. But it may be re-named.
Ok.
Post by Liu, Yi L
So here it addressed the concern you raised before which is hook IOMMUObject
via a PCI address space. Regards to VirtSVMContext, it may be a replacement
of IOMMUObject. As it is related to PASID, I'm considering to name it as
IOMMUPasidContext or IOMMUPasidObject. So it would be an abstraction of all
the IOMMU PASID related operations.
I'm ok with calling it a "PASID context".
* I think each device needs both a PASID context and an ordinary
address space. The PASID context would be used for bus
transactions which include a process id, the address space for
those that don't.
Correct. Also virt-SVM still needs the PCI address space, and the PCI
address space == guest physical address space. For virt-SVM, pt mode is
required to ensure the nested translation.
Post by David Gibson
* Theoretically, the PASID context could be modelled as an array/map
of AddressSpace objects for each process ID. However, creating all
I also thought about creating AddressSpace objects for each process ID.
But I don't think we need to do it. My reasoning is as below:

In theory, it is correct to have an AddressSpace object for each process
virtual address space in Qemu, and this is what we are doing for the PCI
address space so far. However, this is only necessary when we want to mirror
the mapping to the host: each time a mapping changes within the PCI
address space, we need to mirror it to the host.

While for virt-SVM, we don't need to mirror the changes within the guest
process address space. The nested translation capability in HW brings us
a convenience. In nested translation, HW can access the guest PASID table
with a GPA (this is what Intel and AMD do; not sure about ARM, maybe or
maybe not). For VT-d and AMD-IOMMU, even if the guest PASID table changes,
it is not necessary to mirror it to the host. Based on the statements above,
there is a candidate function to be included in PASIDContext. It could be
bind_guest_pasid_table(), to be used to set the guest PASID table into the host
translation structure when the guest finds a device has SVM capability.
Post by David Gibson
those AddressSpace objects in advance might be too expensive. I
1) Have the PASID context class include a 'translate' method similar
to the one in IOMMUMemoryRegionClass, but taking a process ID as well
as an address. This would avoid creating extra AddressSpace objects,
but might require duplicating a bunch of the translation code that
already exists for AddressSpace.
The translation code would actually walk the guest CR3 table to get a
GVA->GPA map. But it is not really required so far, thanks to the HW nested
translation capability.
Post by David Gibson
2) "Lazily" create AddressSpace objects. The generic part of the
PASID aware DMA helper functions would use a cache of AddressSpace's
for particular process IDs, using the AddressSpace (and MemoryRegion
within) to translate accesses for a particular process ID. However,
these AddressSpace and MemoryRegion objects would only be created when
the device first accesses that address space. In the common case,
where a single device is just being used by a single process or a
small number, this should keep the number of AddressSpace objects
relatively small. Obviously the cache would need to be invalidated,
cleaning up the AddressSpace objects, when the PASID table is altered.
This "Lazily" mechanism is not required. But I agree with the last statement.
When PASID Table altered, the cache should be invalidated. Additionally, the
cache for the mappings(GVA) within the guest process address space should also
be invalidated when there is change. This also calls for a candidate function
in PASIDContext. It could be svm_invalidate_tlb(). And be used to invalidate
GVA cache and also PASID entry cache.
Post by David Gibson
* I realize that the expected case here is with KVM, where the guest
controls the first level translation, but the host controls the
second level translation. However, we should also be able to model
Yeah, this is the case. And to be accurate, both the 1st level and 2nd level
translation here happen on the HW IOMMU. The 1st level page table is
effectively "linked" from the guest, and the 2nd level page table is
created/controlled by Qemu (with MAP/UNMAP).
Post by David Gibson
the case where the guest controls both levels for the sake of full
system emulation. I think understanding this case will lead to a
better design even for the simpler case.
Not sure if I got your point accurately; let me try to reply based on
what I got. If the guest enables virt-IOVA (DMA isolation in guest scope) and
virt-SVM, I think that is a case in which the guest controls both 1st and 2nd
level translation.

For virt-IOVA, we need to mirror all the guest IOVA mappings to the host (aka
shadowing). The current MemoryRegion based notifiers (MAP/UNMAP) should be
enough. Peter has already upstreamed that.
Post by David Gibson
Do you have a plan for what the virt-SVM aware DMA functions will look
like?
I think there are two candidates so far. This should be enough for VT-d and
AMD-IOMMU; for ARM-SMMU, an extra function may be needed.
* bind_guest_pasid_table()
* svm_invalidate_tlb()
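
Just to make those two concrete, below is a very rough sketch of how they
might hang off the PASID context (hypothetical names and signatures, nothing
final, and not existing Qemu code):

#include <stdint.h>

typedef struct PASIDContext PASIDContext;

typedef struct PASIDContextOps {
    /*
     * Link the guest PASID table pointer (a GPA) into the host translation
     * structures, so that HW can do the nested translation.
     */
    int (*bind_guest_pasid_table)(PASIDContext *pc, uint64_t pasidt_gpa);

    /*
     * Propagate guest 1st level TLB / PASID cache invalidations down to
     * the host.
     */
    int (*svm_invalidate_tlb)(PASIDContext *pc, uint32_t pasid,
                              uint64_t addr, uint64_t size);
} PASIDContextOps;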
Post by David Gibson
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
Thanks,
Yi L
David Gibson
2018-01-03 00:28:17 UTC
Permalink
Post by Liu, Yi L
[...]
Post by David Gibson
Post by Liu, Yi L
Post by David Gibson
So, having read and thought a bunch more, I think I know where you
need to start hooking this in. The thing is the current qemu PCI DMA
structure assumes that each device belongs to just a single PCI
address space - that's what pci_device_iommu_address_space() returns.
For virt-SVM that's just not true. IIUC, a virt-SVM capable device
could simultaneously write to multiple process address spaces, since
the process IDs actually go over the bus.
Correct.
Post by David Gibson
So trying to hook notifiers at the AddressSpace OR MemoryRegion level
just doesn't make sense - if we've picked a single address space for
the device, we've already made a wrong step.
That's also why we want to have notifiers based on IOMMUObject(may be
not a suitable name, let me use it as the patch named).
Right, I think "IOMMUObject" is a misleading name.
Post by Liu, Yi L
Post by David Gibson
pci_device_virtsvm_context(). virt-SVM capable devices would need to
call that *before* calling pci_device_iommu_address_space (). Well
rather the virt-SVM capable DMA helpers would need to call that.
That would return a new VirtSVMContext (or something) object, which
would roughly correspond to a single PASID table. That's where the
methods and notifiers for managing that would need to go.
Correct, pci_device_iommu_address_space() returns an AS and it is
a PCI address space. And if pci_device_virtsvm_context() is also
called in vfio_realize(), it may not return an AS since there may
be no 1st level translation page table bound.
So as you said, return a new VirtSVMContext, this VirtSVMContext can
hook some new notifiers. I think the IOMMUObject introduced in this patch
can meet the requirement. But it may be re-named.
Ok.
Post by Liu, Yi L
So here it addressed the concern you raised before which is hook IOMMUObject
via a PCI address space. Regards to VirtSVMContext, it may be a replacement
of IOMMUObject. As it is related to PASID, I'm considering to name it as
IOMMUPasidContext or IOMMUPasidObject. So it would be an abstraction of all
the IOMMU PASID related operations.
I'm ok with calling it a "PASID context".
* I think each device needs both a PASID context and an ordinary
address space. The PASID context would be used for bus
transactions which include a process id, the address space for
those that don't.
Correct. Also virt-SVM still needs the PCI Address space. And the PCI
Address space == Guest physical Address space.
Not necessarily. That's true if you're only making the L1 translation
guest visible. But I think we need to at least think about the case
where both L1 and L2 translations are guest visible, in which case the
PCI address space is not the same as the guest physical address space.
Post by Liu, Yi L
For virt-SVM, requires
pt mode to ensure the nested translation.
What is pt mode?
Post by Liu, Yi L
Post by David Gibson
* Theoretically, the PASID context could be modelled as an array/map
of AddressSpace objects for each process ID. However, creating all
I also thought about creating AddressSpace objects for each process ID.
In theory, it is correct to have AddressSpace object for each process
virtual address space in Qemu, and this is what we are doing for PCI
address space so far. However, this is necessary when we want to mirror
the mapping to host. Each time there is mapping changed within the PCI
address space, we need to mirror it to host.
While for virt-SVM, we don't need to mirror the changes within the guest
process address space. The nested translation capability in HW brings us
a convenience. In nested translation, HW can access a guest PASID table
with a GPA(this is what Intel and AMD does, not sure about ARM, maybe or
maybe not). For VT-d and AMD-IOMMU, even any change in guest PASID table,
it is not necessary to mirror it to host. Based on the statements above,
there is a candidate function to be included in PASIDContext. It could be
bind_guest_pasid_table(). And be used to set the guest PASID Table to host
translation structure when guest finds a device has SVM
capability.
That's true for passthrough devices, but not for qemu emulated
devices. None of those support SVM yet, but there's no reason they
couldn't in future.

Even though that will never be the main production case, I think we'll
get a better model if we think about these edge cases carefully.
Post by Liu, Yi L
Post by David Gibson
those AddressSpace objects in advance might be too expensive. I
1) Have the PASID context class include a 'translate' method similar
to the one in IOMMUMemoryRegionClass, but taking a process ID as well
as an address. This would avoid creating extra AddressSpace objects,
but might require duplicating a bunch of the translation code that
already exists for AddressSpace.
Translation code actually walks the guest CR3 table to get a
GVA->GPA map.
Right, but that's clearly x86 specific.
Post by Liu, Yi L
But it is not really required so far due to HW capability.
Post by David Gibson
2) "Lazily" create AddressSpace objects. The generic part of the
PASID aware DMA helper functions would use a cache of AddressSpace's
for particular process IDs, using the AddressSpace (and MemoryRegion
within) to translate accesses for a particular process ID. However,
these AddressSpace and MemoryRegion objects would only be created when
the device first accesses that address space. In the common case,
where a single device is just being used by a single process or a
small number, this should keep the number of AddressSpace objects
relatively small. Obviously the cache would need to be invalidated,
cleaning up the AddressSpace objects, when the PASID table is altered.
This "Lazily" mechanism is not required. But I agree with the last statement.
When PASID Table altered, the cache should be invalidated. Additionally, the
cache for the mappings(GVA) within the guest process address space should also
be invalidated when there is change. This also calls for a candidate function
in PASIDContext. It could be svm_invalidate_tlb(). And be used to invalidate
GVA cache and also PASID entry cache.
Post by David Gibson
* I realize that the expected case here is with KVM, where the guest
controls the first level translation, but the host controls the
second level translation. However, we should also be able to model
Yeah, this is the case. And to be accurate, both the 1st level and 2nd level
translation here happens on HW IOMMU. The 1st level page table is actually
like "linked" from guest. And the 2nd level page table is created/conrolled
by Qemu(with MAP/UNMAP).
Post by David Gibson
the case where the guest controls both levels for the sake of full
system emulation. I think understanding this case will lead to a
better design even for the simpler case.
Not sure if I got your point accurately. Let me try to reply based on
what I got. If guest enables virt-IOVA (DMA isolation in guest scope) and
virt-SVM. I think this is a case which guest controls both 1st and 2nd
level translation.
That sounds right, from my limited knowledge of the x86 IOMMU.
Post by Liu, Yi L
For virt-IOVA, we need to mirror all the guest IOVA mapping to host(aka.
shadow). Current MemoryRegion based notifiers(MAP/UNMAP) should be enough.
Peter has already upstreamed it.
Right, but we also need to model the translations for emulated
devices.
Post by Liu, Yi L
Post by David Gibson
Do you have a plan for what the virt-SVM aware DMA functions will look
like?
I think there are two candidates so far. This should be enough for VT-d and
AMD-IOMMU. For ARM-SMMU, it may need extra function.
* bind_guest_pasid_table()
* svm_invalidate_tlb()
That's not really what I meant. What I was getting at is: if you were
writing an emulated device which supported SVM, what would the
functions to perform SVM-aware DMA look like?
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
Liu, Yi L
2018-01-04 09:40:54 UTC
Permalink
[...]
Post by David Gibson
Post by Liu, Yi L
Post by David Gibson
I'm ok with calling it a "PASID context".
* I think each device needs both a PASID context and an ordinary
address space. The PASID context would be used for bus
transactions which include a process id, the address space for
those that don't.
Correct. Also virt-SVM still needs the PCI Address space. And the PCI
Address space == Guest physical Address space.
Not necessarily. That's true if you're only making the L1 translation
guest visible. But I think we need to at least think about the case
where both L1 and L2 translations are guest visible, in which case the
PCI address space is not the same as the guest physical address space.
Post by Liu, Yi L
For virt-SVM, requires
pt mode to ensure the nested translation.
What is pt mode?
The pt mode here means the kernel parameter "iommu=pt", which makes the iommu
do 1:1 mapping for iova. For virt-SVM on VT-d, with the guest setting iommu=pt,
the 2nd level page table in the host would be a GPA->HPA mapping. If not, the
host 2nd level page table would be a GIOVA->HPA mapping, which is not expected
in nested translation.
Post by David Gibson
Post by Liu, Yi L
Post by David Gibson
* Theoretically, the PASID context could be modelled as an array/map
of AddressSpace objects for each process ID. However, creating all
I also thought about creating AddressSpace objects for each process ID.
In theory, it is correct to have AddressSpace object for each process
virtual address space in Qemu, and this is what we are doing for PCI
address space so far. However, this is necessary when we want to mirror
the mapping to host. Each time there is mapping changed within the PCI
address space, we need to mirror it to host.
While for virt-SVM, we don't need to mirror the changes within the guest
process address space. The nested translation capability in HW brings us
a convenience. In nested translation, HW can access a guest PASID table
with a GPA(this is what Intel and AMD does, not sure about ARM, maybe or
maybe not). For VT-d and AMD-IOMMU, even any change in guest PASID table,
it is not necessary to mirror it to host. Based on the statements above,
there is a candidate function to be included in PASIDContext. It could be
bind_guest_pasid_table(). And be used to set the guest PASID Table to host
translation structure when guest finds a device has SVM
capability.
That's true for passthrough devices, but not for qemu emulated
devices. None of those support SVM yet, but there's no reason they
couldn't in future.
Even though that will never be the main production case, I think we'll
get a better model if we think about these edge cases carefully.
Yeah, it's quite a good edge case. If an emulated device wants to do DMA
to a guest process virtual address space, Qemu needs to know it and do the
necessary address translation for it, just as with iova. If we want to support
an emulated SVM-capable device, I'm not sure the current AddressSpace in Qemu
fits well. Honestly, SVM-capable devices are mostly complicated
accelerators, so I think we need to balance the effort. But I agree it is
helpful to understand more about the edge cases.
Post by David Gibson
Post by Liu, Yi L
Post by David Gibson
those AddressSpace objects in advance might be too expensive. I
1) Have the PASID context class include a 'translate' method similar
to the one in IOMMUMemoryRegionClass, but taking a process ID as well
as an address. This would avoid creating extra AddressSpace objects,
but might require duplicating a bunch of the translation code that
already exists for AddressSpace.
Translation code actually walks the guest CR3 table to get a
GVA->GPA map.
Right, but that's clearly x86 specific.
Yes, the CR3 table is x86 specific. But the page table walking itself, I
think, can be made vendor-agnostic.
Post by David Gibson
Post by Liu, Yi L
But it is not really required so far due to HW capability.
Post by David Gibson
2) "Lazily" create AddressSpace objects. The generic part of the
PASID aware DMA helper functions would use a cache of AddressSpace's
for particular process IDs, using the AddressSpace (and MemoryRegion
within) to translate accesses for a particular process ID. However,
these AddressSpace and MemoryRegion objects would only be created when
the device first accesses that address space. In the common case,
where a single device is just being used by a single process or a
small number, this should keep the number of AddressSpace objects
relatively small. Obviously the cache would need to be invalidated,
cleaning up the AddressSpace objects, when the PASID table is altered.
This "Lazily" mechanism is not required. But I agree with the last statement.
When PASID Table altered, the cache should be invalidated. Additionally, the
cache for the mappings(GVA) within the guest process address space should also
be invalidated when there is change. This also calls for a candidate function
in PASIDContext. It could be svm_invalidate_tlb(). And be used to invalidate
GVA cache and also PASID entry cache.
Post by David Gibson
* I realize that the expected case here is with KVM, where the guest
controls the first level translation, but the host controls the
second level translation. However, we should also be able to model
Yeah, this is the case. And to be accurate, both the 1st level and 2nd level
translation here happens on HW IOMMU. The 1st level page table is actually
like "linked" from guest. And the 2nd level page table is created/conrolled
by Qemu(with MAP/UNMAP).
Post by David Gibson
the case where the guest controls both levels for the sake of full
system emulation. I think understanding this case will lead to a
better design even for the simpler case.
Not sure if I got your point accurately. Let me try to reply based on
what I got. If guest enables virt-IOVA (DMA isolation in guest scope) and
virt-SVM. I think this is a case which guest controls both 1st and 2nd
level translation.
That sounds right, from my limited knowledge of the x86 IOMMU.
Post by Liu, Yi L
For virt-IOVA, we need to mirror all the guest IOVA mapping to host(aka.
shadow). Current MemoryRegion based notifiers(MAP/UNMAP) should be enough.
Peter has already upstreamed it.
Right, but we also need to model the translations for emulated
devices.
For emulated devices, I think virt-IOVA is workable with the latest Qemu.
Post by David Gibson
Post by Liu, Yi L
Post by David Gibson
Do you have a plan for what the virt-SVM aware DMA functions will look
like?
I think there are two candidates so far. This should be enough for VT-d and
AMD-IOMMU. For ARM-SMMU, it may need extra function.
* bind_guest_pasid_table()
* svm_invalidate_tlb()
That's not really what I meant. What I was getting at is: if you were
writing an emulated device which supported SVM, what would the
functions to perform SVM-aware DMA look like?
Sorry for the misunderstanding, I'm not writing an emulated device with SVM
capability. For an SVM-capable device, in brief, it needs to be able to send
Memory Requests with a PASID. If an emulated device tries to access a process
virtual address space, Qemu needs to know it and do the translation for it.
That's what the SVM-aware DMA should look like. However, I haven't got a model
for it yet; it probably needs a pasid field added to AddressSpace.
Post by David Gibson
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
Regards,
Yi L
Liu, Yi L
2018-01-12 10:25:34 UTC
Permalink
[...]

Sorry for the delayed reply, spent some time on reconsidering your comments.
Post by David Gibson
I'm ok with calling it a "PASID context".
* I think each device needs both a PASID context and an ordinary
address space. The PASID context would be used for bus
transactions which include a process id, the address space for
those that don't.
* Theoretically, the PASID context could be modelled as an array/map
of AddressSpace objects for each process ID. However, creating all
those AddressSpace objects in advance might be too expensive. I
1) Have the PASID context class include a 'translate' method similar
to the one in IOMMUMemoryRegionClass, but taking a process ID as well
as an address. This would avoid creating extra AddressSpace objects,
but might require duplicating a bunch of the translation code that
already exists for AddressSpace.
2) "Lazily" create AddressSpace objects. The generic part of the
PASID aware DMA helper functions would use a cache of AddressSpace's
for particular process IDs, using the AddressSpace (and MemoryRegion
within) to translate accesses for a particular process ID. However,
these AddressSpace and MemoryRegion objects would only be created when
the device first accesses that address space. In the common case,
where a single device is just being used by a single process or a
small number, this should keep the number of AddressSpace objects
relatively small. Obviously the cache would need to be invalidated,
cleaning up the AddressSpace objects, when the PASID table is altered.
Sorry, a double check here. Does "AddressSpace objects" mean the existing
AddressSpace definition in Qemu?
Post by David Gibson
* I realize that the expected case here is with KVM, where the guest
controls the first level translation, but the host controls the
second level translation. However, we should also be able to model
the case where the guest controls both levels for the sake of full
system emulation. I think understanding this case will lead to a
better design even for the simpler case.
Do you have a plan for what the virt-SVM aware DMA functions will look
like?
The behaviour is device specific.
For an SVM-capable physical device, it would store the pasid value in a
register located in the device. E.g. a GPU context can be set to use SVM;
after the pasid is set, any DMA from that context targets a
process virtual address space.

So for a virt-SVM aware DMA device, the device model needs to figure out
the target address space. With the correct address space, it then consumes
the translate() callback provided by the iommu emulator, and then emulates
the DMA operation for the emulated device.

I'll try to get a new version with your suggestions.

Thanks,
Yi L
David Gibson
2018-01-16 06:04:25 UTC
Permalink
Post by Liu, Yi L
[...]
Sorry for the delayed reply, spent some time on reconsidering your comments.
Post by David Gibson
I'm ok with calling it a "PASID context".
* I think each device needs both a PASID context and an ordinary
address space. The PASID context would be used for bus
transactions which include a process id, the address space for
those that don't.
* Theoretically, the PASID context could be modelled as an array/map
of AddressSpace objects for each process ID. However, creating all
those AddressSpace objects in advance might be too expensive. I
1) Have the PASID context class include a 'translate' method similar
to the one in IOMMUMemoryRegionClass, but taking a process ID as well
as an address. This would avoid creating extra AddressSpace objects,
but might require duplicating a bunch of the translation code that
already exists for AddressSpace.
2) "Lazily" create AddressSpace objects. The generic part of the
PASID aware DMA helper functions would use a cache of AddressSpace's
for particular process IDs, using the AddressSpace (and MemoryRegion
within) to translate accesses for a particular process ID. However,
these AddressSpace and MemoryRegion objects would only be created when
the device first accesses that address space. In the common case,
where a single device is just being used by a single process or a
small number, this should keep the number of AddressSpace objects
relatively small. Obviously the cache would need to be invalidated,
cleaning up the AddressSpace objects, when the PASID table is altered.
Sorry, a double check here. Does "AddressSpace objects" mean the existing
AddressSpace definition in Qemu?
Yes.
Post by Liu, Yi L
Post by David Gibson
* I realize that the expected case here is with KVM, where the guest
controls the first level translation, but the host controls the
second level translation. However, we should also be able to model
the case where the guest controls both levels for the sake of full
system emulation. I think understanding this case will lead to a
better design even for the simpler case.
Do you have a plan for what the virt-SVM aware DMA functions will look
like?
The behaviour is device specific.
For an SVM-capable physical device, it would store the pasid value in a
register located in the device. E.g. a GPU context can be set to use SVM;
after the pasid is set, any DMA from that context targets a
process virtual address space.
That doesn't sound any more device specific than any DMA operation,
and we have helpers for that.
Post by Liu, Yi L
So for a virt-SVM aware DMA device, the device model needs to figure out
the target address space. With the correct address space, then consume
the translate() callback provided by iommu emulator. And then emulate the
DMA operation for the emulated device.
Nearly all of that sounds like something that belongs in a helper
function. Basically a variant of dma_memory_rw() (and related
functions) that takes a PASID as well as an address.
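
Something along these lines, say. This is a rough sketch only:
dma_memory_rw() is the existing helper, but the PASID context lookup and the
per-PASID address space resolution are entirely hypothetical:

#include "qemu/osdep.h"
#include "sysemu/dma.h"
#include "hw/pci/pci.h"

/* Hypothetical handles for the per-device PASID context discussed above. */
typedef struct PASIDContext PASIDContext;
PASIDContext *pci_device_pasid_context(PCIDevice *dev);      /* hypothetical */
AddressSpace *pasid_context_address_space(PASIDContext *pc,
                                          uint32_t pasid);   /* hypothetical */

/*
 * PASID-aware variant of dma_memory_rw(): resolve the process address
 * space for (device, pasid), then do ordinary DMA into it.
 */
static int pci_dma_rw_pasid(PCIDevice *dev, uint32_t pasid, dma_addr_t addr,
                            void *buf, dma_addr_t len, DMADirection dir)
{
    PASIDContext *pc = pci_device_pasid_context(dev);
    AddressSpace *as = pc ? pasid_context_address_space(pc, pasid) : NULL;

    if (!as) {
        return -1;   /* no translation bound for this PASID */
    }
    return dma_memory_rw(as, addr, buf, len, dir);
}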
Post by Liu, Yi L
I'll try to get a new version with your suggestions.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
Peter Xu
2017-11-14 03:31:00 UTC
Permalink
Post by Liu, Yi L
AddressSpaceOps is similar to MemoryRegionOps, it's just for address
spaces to store arch-specific hooks.
The first hook I would like to introduce is iommu_get(). Return an
IOMMUObject behind the AddressSpace.
For systems that have IOMMUs, we will create a special address
space per device which is different from system default address
space for it (please refer to pci_device_iommu_address_space()).
Normally when that happens, there will be one specific IOMMU (or
say, translation unit) standing right behind that new address space.
This iommu_get() fetches that guy behind the address space. Here,
the guy is defined as IOMMUObject, which includes a notifier_list
so far, may extend in future. Along with IOMMUObject, a new iommu
notifier mechanism is introduced. It would be used for virt-svm.
Also IOMMUObject can further have a IOMMUObjectOps which is similar
to MemoryRegionOps. The difference is that IOMMUObjectOps does not rely
on MemoryRegion.
Hi, sorry I didn't reply to the earlier postings of this after our
discussion in China. I've been sick several times and very busy.
I still don't feel like there's an adequate explanation of exactly
what an IOMMUObject represents. Obviously it can represent more than
a single translation window - since that's represented by the
IOMMUMR. But what exactly do all the MRs - or whatever else - that
are represented by the IOMMUObject have in common, from a functional
point of view.
Even understanding the SVM stuff better than I did, I don't really see
why an AddressSpace is an obvious unit to have an IOMMUObject
associated with it.
Here's what I thought about it: IOMMUObject was planned to be the
abstraction of the hardware translation unit, which is a higher level
of the translated address spaces. Say, for each PCI device, it can
have its own translated address space. However for multiple PCI
devices, they can be sharing the same translation unit that handles
the translation requests from different devices. That's the case for
Intel platforms. We introduced this IOMMUObject because sometimes we
want to do something with that translation unit rather than a specific
device, in which case we need a general IOMMU device handle.
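
Roughly, the shape I had in mind is something like below - a simplified
sketch based on the description above, with illustrative names only, not the
exact code in the patch:

#include "qemu/queue.h"
#include "exec/memory.h"

typedef struct IOMMUObjectNotifier IOMMUObjectNotifier;   /* illustrative */

typedef struct IOMMUObject {
    /* Only a notifier list for now; it may be extended in the future. */
    QLIST_HEAD(, IOMMUObjectNotifier) notifier_list;
} IOMMUObject;

typedef struct AddressSpaceOps {
    /* Return the translation unit (if any) standing behind this AS. */
    IOMMUObject *(*iommu_get)(AddressSpace *as);
} AddressSpaceOps;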
Ok, but what does "hardware translation unit" mean in practice? The
guest neither knows nor cares which bits of IOMMU translation happen
to be included in the same bundle of silicon. It only cares what the
behaviour is. What behavioural characteristics does a single
IOMMUObject have?
In VT-d (I believe the same applies to ARM SMMUs), IMHO the special thing is
that the translation windows (and device address spaces in QEMU) are
only talking about second level translations, but not first level,
while virt-svm needs to play with first level translations. Until
now, AFAIU we don't really have a good interface for first level
translations at all (aka. the process address space).
IIRC one issue left over during last time's discussion was that there
could be more complicated IOMMU models. E.g., one device's DMA request
can be translated nestedly by two or more IOMMUs, and the current
proposal cannot really handle that complicated hierarchy. I'm just
thinking whether we can start from a simple model (say, we don't allow
nested IOMMUs, and actually we don't even allow multiple IOMMUs so
far), then we can evolve from that point in the future.
Also, I thought there was something you mentioned that this approach
is not correct for Power systems, but I can't really remember the
details... Anyways, I think this is not the only approach to solve
the problem, and I believe any new better idea would be greatly
welcomed as well. :)
So, some of my initial comments were based on a misunderstanding of
what was proposed here - since discussing this with Yi at LinuxCon
Beijing, I have a better idea of what's going on.
On POWER - or rather the "pseries" platform, which is paravirtualized -
we can have multiple vIOMMU windows (usually 2) for a single virtual
PCI host bridge. Because of the paravirtualization, the mapping to
hardware is fuzzy, but for passthrough devices they will both be
implemented by the IOMMU built into the physical host bridge. That
isn't important to the guest, though - all operations happen at the
window level.
Now I know that for Power it may not have anything like a "translation
unit" but everything is defined as "translation windows" in the
guests. However the problem still exists for some other platforms.
Say, for Intel we have emulated VT-d; for ARM, we have vSMMU. AFAIU
these platforms do have their translation units, and even for ARM it
should need such an interface (or any better interfaces, which are
always welcomed) for virt-svm to work. Otherwise I don't know a way
to configure the first level translation tables.

Meanwhile, IMO this abstraction should not really affect pseries - it
should be only useful for those platforms who would like to use it.
For pseries, we can just ignore that new interface if we don't really
even have such a translation unit.
The other thing that bothers me here is the way it's attached to an
AddressSpace. IIUC how SVM works, the whole point is that the device
no longer writes into a specific PCI address space. Instead, it
writes directly into a process address space. So it seems to me more
that SVM should operate at the PCI level, and disassociate the device
from the normal PCI address space entirely, rather than hooking up
something via that address space.
IMO the PCI address space is still used. For virt-svm, host IOMMU
will be working in nested translation mode, so we should be having two
mappings working in parallel:

1. DPDK process (in guest) address space mapping (GVA -> GPA)
2. guest direct memory mappings (GPA -> HPA)

And here AFAIU the 2nd mapping is working exactly like general PCI
devices, the only difference is that the 2nd level mapping is always
static, just like when IOMMU passthrough is enabled for that device.

So, IMHO virt-SVM is not really in parallel with PCI subsystem. For
the SVM in guest, it may be different, since it should only be using
first level translations. However to implement virt-SVM, IMHO we not
only need existing PCI address space translation logic, we also need
an extra way to configure the first level mappings, as discussed.

Thanks,
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
--
Peter Xu
David Gibson
2017-12-18 05:41:32 UTC
Permalink
Sorry I've taken so long to reply, I've been super busy with other
things.
Post by Peter Xu
Post by Liu, Yi L
AddressSpaceOps is similar to MemoryRegionOps, it's just for address
spaces to store arch-specific hooks.
The first hook I would like to introduce is iommu_get(). Return an
IOMMUObject behind the AddressSpace.
For systems that have IOMMUs, we will create a special address
space per device which is different from system default address
space for it (please refer to pci_device_iommu_address_space()).
Normally when that happens, there will be one specific IOMMU (or
say, translation unit) stands right behind that new address space.
This iommu_get() fetches that guy behind the address space. Here,
the guy is defined as IOMMUObject, which includes a notifier_list
so far, may extend in future. Along with IOMMUObject, a new iommu
notifier mechanism is introduced. It would be used for virt-svm.
Also IOMMUObject can further have a IOMMUObjectOps which is similar
to MemoryRegionOps. The difference is IOMMUObjectOps is not relied
on MemoryRegion.
Hi, sorry I didn't reply to the earlier postings of this after our
discussion in China. I've been sick several times and very busy.
I still don't feel like there's an adequate explanation of exactly
what an IOMMUObject represents. Obviously it can represent more than
a single translation window - since that's represented by the
IOMMUMR. But what exactly do all the MRs - or whatever else - that
are represented by the IOMMUObject have in common, from a functional
point of view.
Even understanding the SVM stuff better than I did, I don't really see
why an AddressSpace is an obvious unit to have an IOMMUObject
associated with it.
Here's what I thought about it: IOMMUObject was planned to be the
abstraction of the hardware translation unit, which is a higher level
of the translated address spaces. Say, for each PCI device, it can
have its own translated address space. However for multiple PCI
devices, they can be sharing the same translation unit that handles
the translation requests from different devices. That's the case for
Intel platforms. We introduced this IOMMUObject because sometimes we
want to do something with that translation unit rather than a specific
device, in which we need a general IOMMU device handle.
Ok, but what does "hardware translation unit" mean in practice. The
guest neither knows nor cares, which bits of IOMMU translation happen
to be included in the same bundle of silicon. It only cares what the
behaviour is. What behavioural characteristics does a single
IOMMUObject have?
In VT-d (I believe the same to ARM SMMUs), IMHO the special thing is
that the translation windows (and device address spaces in QEMU) are
only talking about second level translations, but not first level,
while virt-svm needs to play with first level translations. Until
now, AFAIU we don't really have a good interface for first level
translations at all (aka. the process address space).
Ok, that explains why you need some kind of different model than the
existing one, but that wasn't really disputed. It still doesn't say
why separating the unclearly defined IOMMUObject from the IOMMU MR is
a good way of modelling this.

If the issue is 1st vs. 2nd level translation, it really seems you
should be explicitly modelling the different 1st level vs. 2nd level
address spaces, rather than just splitting functions between the MR
and AS level with no clear rationale behind what goes where.
Post by Peter Xu
IIRC one issue left over during last time's discussion was that there
could be more complicated IOMMU models. E.g., one device's DMA request
can be translated nestedly by two or multiple IOMMUs, and current
proposal cannot really handle that complicated hierarchy. I'm just
thinking whether we can start from a simple model (say, we don't allow
nested IOMMUs, and actually we don't even allow multiple IOMMUs so
far), then we can evolve from that point in the future.
Also, I thought there was something you mentioned that this approach
is not correct for Power systems, but I can't really remember the
details... Anyways, I think this is not the only approach to solve
the problem, and I believe any new better idea would be greatly
welcomed as well. :)
So, some of my initial comments were based on a misunderstanding of
what was proposed here - since discussing this with Yi at LinuxCon
Beijing, I have a better idea of what's going on.
On POWER - or rather the "pseries" platform, which is paravirtualized.
We can have multiple vIOMMU windows (usually 2) for a single virtual
PCI host bridge. Because of the paravirtualization, the mapping to
hardware is fuzzy, but for passthrough devices they will both be
implemented by the IOMMU built into the physical host bridge. That
isn't important to the guest, though - all operations happen at the
window level.
Now I know that for Power it may not have anything like a "translation
unit" but everything is defined as "translation windows" in the
guests. However the problem still exist for some other platforms.
Say, for Intel we have emulated VT-d; for ARM, we have vSMMU. AFAIU
these platforms do have their translation units, and even for ARM it
should need such an interface (or any better interfaces, which are
always welcomed) for virt-svm to work. Otherwise I don't know a way
to configure the first level translation tables.
Meanwhile, IMO this abstraction should not really affect pseries - it
should be only useful for those platforms who would like to use it.
For pseries, we can just ignore that new interface if we don't really
even have such a translation unit.
But it's _still_ not clear to me what a "translation unit" means.
What is common across a translation unit that is not common across
different translation units?
Post by Peter Xu
The other thing that bothers me here is the way it's attached to an
AddressSpace. IIUC how SVM works, the whole point is that the device
no longer writes into a specific PCI address space. Instead, it
writes directly into a process address space. So it seems to me more
that SVM should operate at the PCI level, and disassociate the device
from the normal PCI address space entirely, rather than hooking up
something via that address space.
IMO the PCI address space is still used. For virt-svm, host IOMMU
will be working in nested translation mode, so we should be having two
1. DPDK process (in guest) address space mapping (GVA -> GPA)
2. guest direct memory mappings (GPA -> HPA)
And here AFAIU the 2nd mapping is working exactly like general PCI
devices, the only difference is that the 2nd level mapping is always
static, just like when IOMMU passthrough is enabled for that device.
Ok. IIUC the 2nd level mapping isn't visible to the guest, is that
right?

But this doesn't really affect my point - from the guest's point of
view, the "usual" address space for a PCI device is equal to the GPA
(i.e. no guest visible translation), but for SVM devices they instead
use an SVM address space depending on the process id, which is not the
same as the GPA space.
Post by Peter Xu
So, IMHO virt-SVM is not really in parallel with PCI subsystem.
Well, it's not clear to me if it's in parallel to, or on top of. But
the crucial point is that an SVM device does not access the "main" PCI
address space (whether or not that has a traditional IOMMU). Instead
it has access to a bunch of different address spaces, one for each
process ID.
Post by Peter Xu
For
the SVM in guest, it may be different, since it should only be using
first level translations. However to implement virt-SVM, IMHO we not
only need existing PCI address space translation logic, we also need
an extra way to configure the first level mappings, as discussed.
Right. I'm not disputing that a way to configure those first level
mappings is necessary. The proposed model just doesn't seem a good
fit. It still only represents a single "PCI address space", then
attaches things about the process table mapping to that, when in fact
they affect the process table and therefore *different* address spaces
from the one the methods are attached to.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
Liu, Yi L
2017-11-16 08:57:09 UTC
Permalink
Hi David,
Post by Liu, Yi L
AddressSpaceOps is similar to MemoryRegionOps, it's just for address
spaces to store arch-specific hooks.
The first hook I would like to introduce is iommu_get(). Return an
IOMMUObject behind the AddressSpace.
For systems that have IOMMUs, we will create a special address
space per device which is different from system default address
space for it (please refer to pci_device_iommu_address_space()).
Normally when that happens, there will be one specific IOMMU (or
say, translation unit) stands right behind that new address space.
This iommu_get() fetches that guy behind the address space. Here,
the guy is defined as IOMMUObject, which includes a notifier_list
so far, may extend in future. Along with IOMMUObject, a new iommu
notifier mechanism is introduced. It would be used for virt-svm.
Also IOMMUObject can further have a IOMMUObjectOps which is similar
to MemoryRegionOps. The difference is IOMMUObjectOps is not relied
on MemoryRegion.
Hi, sorry I didn't reply to the earlier postings of this after our
discussion in China. I've been sick several times and very busy.
I still don't feel like there's an adequate explanation of exactly
what an IOMMUObject represents. Obviously it can represent more than
a single translation window - since that's represented by the
IOMMUMR. But what exactly do all the MRs - or whatever else - that
are represented by the IOMMUObject have in common, from a functional
point of view.
Even understanding the SVM stuff better than I did, I don't really see
why an AddressSpace is an obvious unit to have an IOMMUObject
associated with it.
Here's what I thought about it: IOMMUObject was planned to be the
abstraction of the hardware translation unit, which is a higher level
of the translated address spaces. Say, for each PCI device, it can
have its own translated address space. However for multiple PCI
devices, they can be sharing the same translation unit that handles
the translation requests from different devices. That's the case for
Intel platforms. We introduced this IOMMUObject because sometimes we
want to do something with that translation unit rather than a specific
device, in which we need a general IOMMU device handle.
Ok, but what does "hardware translation unit" mean in practice. The
guest neither knows nor cares, which bits of IOMMU translation happen
to be included in the same bundle of silicon. It only cares what the
behaviour is. What behavioural characteristics does a single
IOMMUObject have?
IIRC one issue left over during last time's discussion was that there
could be more complicated IOMMU models. E.g., one device's DMA request
can be translated nestedly by two or multiple IOMMUs, and current
proposal cannot really handle that complicated hierachy. I'm just
thinking whether we can start from a simple model (say, we don't allow
nested IOMMUs, and actually we don't even allow multiple IOMMUs so
far), then we can evolve from that point in the future.
Also, I thought there was something you mentioned that this approach
is not correct for Power systems, but I can't really remember the
details... Anyways, I think this is not the only approach to solve
the problem, and I believe any new better idea would be greatly
welcomed as well. :)
So, some of my initial comments were based on a misunderstanding of
what was proposed here - since discussing this with Yi at LinuxCon
Beijing, I have a better idea of what's going on.
On POWER - or rather the "pseries" platform, which is paravirtualized.
We can have multiple vIOMMU windows (usually 2) for a single virtual
On POWER, is the DMA isolation done by allocating different DMA windows
to different isolation domains? And a single isolation domain may include
multiple dma windows? So with or without an IOMMU, is there only a single
DMA address space shared by all the devices in the system? Is the isolation
mechanism as described above?
PCI host bridge. Because of the paravirtualization, the mapping to
hardware is fuzzy, but for passthrough devices they will both be
implemented by the IOMMU built into the physical host bridge. That
isn't important to the guest, though - all operations happen at the
window level.
On VT-d, with an IOMMU present, each isolation domain has its own address
space. That's why we talk more at the address space level, and the iommu makes
the difference. Those are the behavioural characteristics a single iommu
translation unit has, and thus what an IOMMUObject is going to have.
The other thing that bothers me here is the way it's attached to an
AddressSpace.
My consideration is that the iommu handles AddressSpaces. The dma address
space is also an address space managed by the iommu. That's why we believe it
is fine to associate the dma address space with an IOMMUObject.
IIUC how SVM works, the whole point is that the device
no longer writes into a specific PCI address space. Instead, it
writes directly into a process address space. So it seems to me more
that SVM should operate at the PCI level, and disassociate the device
from the normal PCI address space entirely, rather than hooking up
something via that address space.
As Peter replied, we still need the PCI address space, it would be used
to build up the 2nd level page table which would be used in nested
translation.

Thanks,
Yi L
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
David Gibson
2017-12-18 06:14:42 UTC
Permalink
Hi David,
Post by Liu, Yi L
AddressSpaceOps is similar to MemoryRegionOps, it's just for address
spaces to store arch-specific hooks.
The first hook I would like to introduce is iommu_get(). Return an
IOMMUObject behind the AddressSpace.
For systems that have IOMMUs, we will create a special address
space per device which is different from system default address
space for it (please refer to pci_device_iommu_address_space()).
Normally when that happens, there will be one specific IOMMU (or
say, translation unit) stands right behind that new address space.
This iommu_get() fetches that guy behind the address space. Here,
the guy is defined as IOMMUObject, which includes a notifier_list
so far, may extend in future. Along with IOMMUObject, a new iommu
notifier mechanism is introduced. It would be used for virt-svm.
Also IOMMUObject can further have a IOMMUObjectOps which is similar
to MemoryRegionOps. The difference is IOMMUObjectOps is not relied
on MemoryRegion.
Hi, sorry I didn't reply to the earlier postings of this after our
discussion in China. I've been sick several times and very busy.
I still don't feel like there's an adequate explanation of exactly
what an IOMMUObject represents. Obviously it can represent more than
a single translation window - since that's represented by the
IOMMUMR. But what exactly do all the MRs - or whatever else - that
are represented by the IOMMUObject have in common, from a functional
point of view.
Even understanding the SVM stuff better than I did, I don't really see
why an AddressSpace is an obvious unit to have an IOMMUObject
associated with it.
Here's what I thought about it: IOMMUObject was planned to be the
abstraction of the hardware translation unit, which is a higher level
of the translated address spaces. Say, for each PCI device, it can
have its own translated address space. However for multiple PCI
devices, they can be sharing the same translation unit that handles
the translation requests from different devices. That's the case for
Intel platforms. We introduced this IOMMUObject because sometimes we
want to do something with that translation unit rather than a specific
device, in which we need a general IOMMU device handle.
Ok, but what does "hardware translation unit" mean in practice. The
guest neither knows nor cares, which bits of IOMMU translation happen
to be included in the same bundle of silicon. It only cares what the
behaviour is. What behavioural characteristics does a single
IOMMUObject have?
IIRC one issue left over during last time's discussion was that there
could be more complicated IOMMU models. E.g., one device's DMA request
can be translated nestedly by two or multiple IOMMUs, and current
proposal cannot really handle that complicated hierarchy. I'm just
thinking whether we can start from a simple model (say, we don't allow
nested IOMMUs, and actually we don't even allow multiple IOMMUs so
far), then we can evolve from that point in the future.
Also, I thought there was something you mentioned that this approach
is not correct for Power systems, but I can't really remember the
details... Anyways, I think this is not the only approach to solve
the problem, and I believe any new better idea would be greatly
welcomed as well. :)
So, some of my initial comments were based on a misunderstanding of
what was proposed here - since discussing this with Yi at LinuxCon
Beijing, I have a better idea of what's going on.
On POWER - or rather the "pseries" platform, which is paravirtualized.
We can have multiple vIOMMU windows (usually 2) for a single virtual
On POWER, the DMA isolation is done by allocating different DMA window
to different isolation domains? And a single isolation domain may include
multiple dma windows? So with or without IOMMU, there is only a single
DMA address shared by all the devices in the system? The isolation
mechanism is as what described above?
No, the multiple windows are completely unrelated to how things are
isolated.

Just like on x86, each IOMMU domain has independent IOMMU mappings.
The only difference is that IBM calls the domains "partitionable
endpoints" (PEs) and they tend to be statically created at boot time,
rather than runtime generated.

The windows are about what addresses in PCI space are translated by
the IOMMU. If the device generates a PCI cycle, only certain
addresses will be mapped by the IOMMU to DMA - other addresses will
correspond to other devices' MMIOs, MSI vectors, maybe other things.

The set of addresses translated by the IOMMU need not be contiguous.
Or, there could be two IOMMUs on the bus, each accepting different
address ranges. These two situations are not distinguishable from the
guest's point of view.

So for a typical PAPR setup, the device can access system RAM either
via DMA in the range 0..1GiB (the "32-bit window") or in the range
2^59..2^59+<something> (the "64-bit window"). Typically the 32-bit
window has mappings dynamically created by drivers, and the 64-bit
window has all of system RAM mapped 1:1, but that's entirely up to the
OS, it can map each window however it wants.

32-bit devices (or "64 bit" devices which don't actually implement
enough of the address bits) will only be able to use the 32-bit window,
of course.

MMIOs of other devices, the "magic" MSI-X addresses belonging to the
host bridge and other things exist outside those ranges. Those are
just the ranges which are used to DMA to RAM.

Each PE (domain) can see a different version of what's in each
window.

In fact, if I understand the "IO hole" correctly, the situation on x86
isn't very different. It has a window below the IO hole and a second
window above the IO hole. The addresses within the IO hole go to
(32-bit) devices on the PCI bus, rather than being translated by the
IOMMU to RAM addresses. Because the gap is smaller between the two
windows, I think we get away without really modelling this detail in
qemu though.
PCI host bridge. Because of the paravirtualization, the mapping to
hardware is fuzzy, but for passthrough devices they will both be
implemented by the IOMMU built into the physical host bridge. That
isn't important to the guest, though - all operations happen at the
window level.
On VT-d, with an IOMMU present, each isolation domain has its own address
space. That's why we talked more at the address space level, and the iommu
makes the difference. Those are the behavioural characteristics a single
iommu translation unit has, and thus what an IOMMUObject is going to have.
Right, that's the same on POWER. But the IOMMU only translates *some*
addresses within the address space, not all of them. The rest will go
to other PCI devices or be unmapped, but won't go to RAM.

That's why the IOMMU should really be associated with an MR (or
several MRs), not an AddressSpace, it only translates some addresses.
The other thing that bothers me here is the way it's attached to an
AddressSpace.
My consideration is that the iommu handles AddressSpaces; the DMA address
space is also an address space managed by the iommu.
No, it's not. It's a region (or several) within the overall PCI
address space. Other things in the address space, such as other
devices' BARs, exist independent of the IOMMU.

It's not something that could really work with PCI-E, I think, but
with a more traditional PCI bus there's no reason you couldn't have
multiple IOMMUs listening on different regions of the PCI address
space.
That's why we believe it is fine to
associate the DMA address space with an IOMMUObject.
IIUC how SVM works, the whole point is that the device
no longer writes into a specific PCI address space. Instead, it
writes directly into a process address space. So it seems to me more
that SVM should operate at the PCI level, and disassociate the device
from the normal PCI address space entirely, rather than hooking up
something via that address space.
As Peter replied, we still need the PCI address space; it would be used
to build up the 2nd-level page table, which would be used in nested
translation.
Thanks,
Yi L
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
Liu, Yi L
2017-12-18 09:17:35 UTC
Permalink
Post by David Gibson
Hi David,
Post by Liu, Yi L
AddressSpaceOps is similar to MemoryRegionOps, it's just for address
spaces to store arch-specific hooks.
The first hook I would like to introduce is iommu_get(). Return an
IOMMUObject behind the AddressSpace.
For systems that have IOMMUs, we will create a special address
space per device which is different from system default address
space for it (please refer to pci_device_iommu_address_space()).
Normally when that happens, there will be one specific IOMMU (or
say, translation unit) stands right behind that new address space.
This iommu_get() fetches that guy behind the address space. Here,
the guy is defined as IOMMUObject, which includes a notifier_list
so far, may extend in future. Along with IOMMUObject, a new iommu
notifier mechanism is introduced. It would be used for virt-svm.
Also IOMMUObject can further have a IOMMUObjectOps which is similar
to MemoryRegionOps. The difference is IOMMUObjectOps is not relied
on MemoryRegion.
Hi, sorry I didn't reply to the earlier postings of this after our
discussion in China. I've been sick several times and very busy.
I still don't feel like there's an adequate explanation of exactly
what an IOMMUObject represents. Obviously it can represent more than
a single translation window - since that's represented by the
IOMMUMR. But what exactly do all the MRs - or whatever else - that
are represented by the IOMMUObject have in common, from a functional
point of view.
Even understanding the SVM stuff better than I did, I don't really see
why an AddressSpace is an obvious unit to have an IOMMUObject
associated with it.
Here's what I thought about it: IOMMUObject was planned to be the
abstraction of the hardware translation unit, which is a higher level
of the translated address spaces. Say, for each PCI device, it can
have its own translated address space. However for multiple PCI
devices, they can be sharing the same translation unit that handles
the translation requests from different devices. That's the case for
Intel platforms. We introduced this IOMMUObject because sometimes we
want to do something with that translation unit rather than a specific
device, in which we need a general IOMMU device handle.
Ok, but what does "hardware translation unit" mean in practice. The
guest neither knows nor cares, which bits of IOMMU translation happen
to be included in the same bundle of silicon. It only cares what the
behaviour is. What behavioural characteristics does a single
IOMMUObject have?
IIRC one issue left over during last time's discussion was that there
could be more complicated IOMMU models. E.g., one device's DMA request
can be translated nestedly by two or multiple IOMMUs, and current
proposal cannot really handle that complicated hierachy. I'm just
thinking whether we can start from a simple model (say, we don't allow
nested IOMMUs, and actually we don't even allow multiple IOMMUs so
far), then we can evolve from that point in the future.
Also, I thought there were something you mentioned that this approach
is not correct for Power systems, but I can't really remember the
details... Anyways, I think this is not the only approach to solve
the problem, and I believe any new better idea would be greatly
welcomed as well. :)
So, some of my initial comments were based on a misunderstanding of
what was proposed here - since discussing this with Yi at LinuxCon
Beijing, I have a better idea of what's going on.
On POWER - or rather the "pseries" platform, which is paravirtualized.
We can have multiple vIOMMU windows (usually 2) for a single virtual
On POWER, the DMA isolation is done by allocating different DMA window
to different isolation domains? And a single isolation domain may include
multiple dma windows? So with or withou IOMMU, there is only a single
DMA address shared by all the devices in the system? The isolation
mechanism is as what described above?
No, the multiple windows are completely unrelated to how things are
isolated.
I'm afraid I chose the wrong word by using "DMA window"...
Actually, when mentioning "DMA window", I meant address ranges in an IOVA
address space. Anyhow, let me re-shape my understanding of the POWER IOMMU
and make sure we are on the same page.
Post by David Gibson
Just like on x86, each IOMMU domain has independent IOMMU mappings.
The only difference is that IBM calls the domains "partitionable
endpoints" (PEs) and they tend to be statically created at boot time,
rather than runtime generated.
Does the POWER IOMMU also have an IOVA concept? A device can use an IOVA
to access memory, and the IOMMU translates the IOVA to an address within
the system physical address space?
Post by David Gibson
The windows are about what addresses in PCI space are translated by
the IOMMU. If the device generates a PCI cycle, only certain
addresses will be mapped by the IOMMU to DMA - other addresses will
correspond to other devices MMIOs, MSI vectors, maybe other things.
I guess the windows you mentioned here are the address ranges within the
system physical address space, since you also mentioned MMIOs etc.
Post by David Gibson
The set of addresses translated by the IOMMU need not be contiguous.
I suppose you mean the output addresses of the IOMMU need not be
contiguous?
Post by David Gibson
Or, there could be two IOMMUs on the bus, each accepting different
address ranges. These two situations are not distinguishable from the
guest's point of view.
So for a typical PAPR setup, the device can access system RAM either
via DMA in the range 0..1GiB (the "32-bit window") or in the range
2^59..2^59+<something> (the "64-bit window"). Typically the 32-bit
window has mappings dynamically created by drivers, and the 64-bit
window has all of system RAM mapped 1:1, but that's entirely up to the
OS, it can map each window however it wants.
32-bit devices (or "64 bit" devices which don't actually implement
enough the address bits) will only be able to use the 32-bit window,
of course.
MMIOs of other devices, the "magic" MSI-X addresses belonging to the
host bridge and other things exist outside those ranges. Those are
just the ranges which are used to DMA to RAM.
Each PE (domain) can see a different version of what's in each
window.
If I'm correct so far, a PE actually defines a mapping between an address
range of an address space (aka an IOVA address space) and an address range
of the system physical address space.

Then my question is: does each PE define a separate IOVA address space
which is flat from 0 to 2^AW - 1, where AW is the address width? As a
reference, a VT-d domain defines a flat address space for each domain.
Post by David Gibson
In fact, if I understand the "IO hole" correctly, the situation on x86
isn't very different. It has a window below the IO hole and a second
window above the IO hole. The addresses within the IO hole go to
(32-bit) devices on the PCI bus, rather than being translated by the
If you mean the "IO hole" within the system physical address space, I
think the answer is yes.
Post by David Gibson
IOMMU to RAM addresses. Because the gap is smaller between the two
windows, I think we get away without really modelling this detail in
qemu though.
PCI host bridge. Because of the paravirtualization, the mapping to
hardware is fuzzy, but for passthrough devices they will both be
implemented by the IOMMU built into the physical host bridge. That
isn't importat to the guest, though - all operations happen at the
window level.
On VT-d, with IOMMU presented, each isolation domain has its own address
space. That's why we talked more on address space level. And iommu makes
the difference. That's the behavioural characteristics a single iommu
translation unit has. And thus an IOMMUObject going to have.
Right, that's the same on POWER. But the IOMMU only translates *some*
addresses within the address space, not all of them. The rest will go
to other PCI devices or be unmapped, but won't go to RAM.
That's why the IOMMU should really be associated with an MR (or
several MRs), not an AddressSpace, it only translates some addresses.
If I'm correct so far, I do believe the major difference between VT-d and
the POWER IOMMU is that a VT-d isolation domain is a flat address space
while a PE on POWER is something different (need your input here as I'm
not sure about it). Maybe it's like there is a flat address space, and
each PE takes some address ranges and maps them to different system
physical address ranges.
Post by David Gibson
The other thing that bothers me here is the way it's attached to an
AddressSpace.
My consideration is iommu handles AddressSpaces. dma address space is also
an address space managed by iommu.
No, it's not. It's a region (or several) within the overall PCI
address space. Other things in the addressspace, such as other
device's BARs exist independent of the IOMMU.
It's not something that could really work with PCI-E, I think, but
with a more traditional PCI bus there's no reason you couldn't have
multiple IOMMUs listening on different regions of the PCI address
space.
I think the point here is that on POWER, the input addresses of the IOMMUs
are actually in the same address space? What the IOMMU does is map the
different ranges to different system physical address ranges. So it's as
you mentioned, multiple IOMMUs listen on different regions of a PCI
address space.

While for VT-d, that's not the case. The input addresses of the IOMMUs may
not be in the same address space. As I mentioned, each IOMMU domain on
VT-d is a separate address space. So for VT-d, the IOMMUs are actually
listening to different address spaces. That's the point why we want to
have an address space level abstraction of the IOMMU.
Post by David Gibson
That's why we believe it is fine to
associate dma address space with an IOMMUObject.
IIUC how SVM works, the whole point is that the device
no longer writes into a specific PCI address space. Instead, it
writes directly into a process address space. So it seems to me more
that SVM should operate at the PCI level, and disassociate the device
from the normal PCI address space entirely, rather than hooking up
something via that address space.
As Peter replied, we still need the PCI address space, it would be used
to build up the 2nd level page table which would be used in nested
translation.
Thanks,
Yi L
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
Regards,
Yi L
David Gibson
2017-12-18 11:22:18 UTC
Permalink
Post by Liu, Yi L
Post by David Gibson
Hi David,
Post by Liu, Yi L
AddressSpaceOps is similar to MemoryRegionOps, it's just for address
spaces to store arch-specific hooks.
The first hook I would like to introduce is iommu_get(). Return an
IOMMUObject behind the AddressSpace.
For systems that have IOMMUs, we will create a special address
space per device which is different from system default address
space for it (please refer to pci_device_iommu_address_space()).
Normally when that happens, there will be one specific IOMMU (or
say, translation unit) stands right behind that new address space.
This iommu_get() fetches that guy behind the address space. Here,
the guy is defined as IOMMUObject, which includes a notifier_list
so far, may extend in future. Along with IOMMUObject, a new iommu
notifier mechanism is introduced. It would be used for virt-svm.
Also IOMMUObject can further have a IOMMUObjectOps which is similar
to MemoryRegionOps. The difference is IOMMUObjectOps is not relied
on MemoryRegion.
Hi, sorry I didn't reply to the earlier postings of this after our
discussion in China. I've been sick several times and very busy.
I still don't feel like there's an adequate explanation of exactly
what an IOMMUObject represents. Obviously it can represent more than
a single translation window - since that's represented by the
IOMMUMR. But what exactly do all the MRs - or whatever else - that
are represented by the IOMMUObject have in common, from a functional
point of view.
Even understanding the SVM stuff better than I did, I don't really see
why an AddressSpace is an obvious unit to have an IOMMUObject
associated with it.
Here's what I thought about it: IOMMUObject was planned to be the
abstraction of the hardware translation unit, which is a higher level
of the translated address spaces. Say, for each PCI device, it can
have its own translated address space. However for multiple PCI
devices, they can be sharing the same translation unit that handles
the translation requests from different devices. That's the case for
Intel platforms. We introduced this IOMMUObject because sometimes we
want to do something with that translation unit rather than a specific
device, in which we need a general IOMMU device handle.
Ok, but what does "hardware translation unit" mean in practice. The
guest neither knows nor cares, which bits of IOMMU translation happen
to be included in the same bundle of silicon. It only cares what the
behaviour is. What behavioural characteristics does a single
IOMMUObject have?
IIRC one issue left over during last time's discussion was that there
could be more complicated IOMMU models. E.g., one device's DMA request
can be translated nestedly by two or multiple IOMMUs, and current
proposal cannot really handle that complicated hierachy. I'm just
thinking whether we can start from a simple model (say, we don't allow
nested IOMMUs, and actually we don't even allow multiple IOMMUs so
far), then we can evolve from that point in the future.
Also, I thought there were something you mentioned that this approach
is not correct for Power systems, but I can't really remember the
details... Anyways, I think this is not the only approach to solve
the problem, and I believe any new better idea would be greatly
welcomed as well. :)
So, some of my initial comments were based on a misunderstanding of
what was proposed here - since discussing this with Yi at LinuxCon
Beijing, I have a better idea of what's going on.
On POWER - or rather the "pseries" platform, which is paravirtualized.
We can have multiple vIOMMU windows (usually 2) for a single virtual
On POWER, the DMA isolation is done by allocating different DMA window
to different isolation domains? And a single isolation domain may include
multiple dma windows? So with or withou IOMMU, there is only a single
DMA address shared by all the devices in the system? The isolation
mechanism is as what described above?
No, the multiple windows are completely unrelated to how things are
isolated.
I'm afraid I chose a wrong word by using "DMA window"..
Actually, when mentioning "DMA window", I mean address ranges in an iova
address space.
Yes, so did I. By one window I mean one contiguous range of IOVA addresses.
Post by Liu, Yi L
Anyhow, let me re-shape my understanding of POWER IOMMU and
make sure we are in the same page.
Post by David Gibson
Just like on x86, each IOMMU domain has independent IOMMU mappings.
The only difference is that IBM calls the domains "partitionable
endpoints" (PEs) and they tend to be statically created at boot time,
rather than runtime generated.
Does POWER IOMMU also have iova concept? Device can use an iova to
access memory, and IOMMU translates the iova to an address within the
system physical address?
Yes. When I say the "PCI address space" I mean the IOVA space.
Post by Liu, Yi L
Post by David Gibson
The windows are about what addresses in PCI space are translated by
the IOMMU. If the device generates a PCI cycle, only certain
addresses will be mapped by the IOMMU to DMA - other addresses will
correspond to other devices MMIOs, MSI vectors, maybe other things.
I guess the windows you mentioned here is the address ranges within the
system physical address space as you also mentioned MMIOs and etc.
No. I mean ranges within the PCI space == IOVA space. It's simplest
to understand with traditional PCI. A cycle on the bus doesn't know
whether the destination is a device or memory, it just has an address
- a PCI memory address. Part of that address range is mapped to
system RAM, optionally with an IOMMU translating it. Other parts of
that address space are used for devices.

With PCI-E things get more complicated, but the conceptual model is
the same.
Post by Liu, Yi L
Post by David Gibson
The set of addresses translated by the IOMMU need not be contiguous.
I suppose you mean the output addresses of the IOMMU need not be
contiguous?
No. I mean the input addresses of the IOMMU.
Post by Liu, Yi L
Post by David Gibson
Or, there could be two IOMMUs on the bus, each accepting different
address ranges. These two situations are not distinguishable from the
guest's point of view.
So for a typical PAPR setup, the device can access system RAM either
via DMA in the range 0..1GiB (the "32-bit window") or in the range
2^59..2^59+<something> (the "64-bit window"). Typically the 32-bit
window has mappings dynamically created by drivers, and the 64-bit
window has all of system RAM mapped 1:1, but that's entirely up to the
OS, it can map each window however it wants.
32-bit devices (or "64 bit" devices which don't actually implement
enough the address bits) will only be able to use the 32-bit window,
of course.
MMIOs of other devices, the "magic" MSI-X addresses belonging to the
host bridge and other things exist outside those ranges. Those are
just the ranges which are used to DMA to RAM.
Each PE (domain) can see a different version of what's in each
window.
If I'm correct so far. PE actually defines a mapping between an address
range of an address space(aka. iova address space) and an address range
of the system physical address space.
No. A PE means several things, but basically it is an isolation
domain, like an Intel IOMMU domain. Each PE has an independent set of
IOMMU mappings which translate part of the PCI address space to system
memory space.
Post by Liu, Yi L
Then my question is: does each PE define a separate iova address sapce
which is flat from 0 - 2^AW -1, AW is address width? As a reference,
VT-d domain defines a flat address space for each domain.
Partly. Each PE has an address space which all devices in the PE see.
Only some of that address space is mapped to system memory though,
other parts are occupied by devices, others are unmapped.

Only the parts mapped by the IOMMU vary between PEs - the other parts
of the address space will be identical for all PEs on the host
bridge. However for POWER guests (not for hosts) there is exactly one
PE for each virtual host bridge.
Post by Liu, Yi L
Post by David Gibson
In fact, if I understand the "IO hole" correctly, the situation on x86
isn't very different. It has a window below the IO hole and a second
window above the IO hole. The addresses within the IO hole go to
(32-bit) devices on the PCI bus, rather than being translated by the
If you mean the "IO hole" within the system physcial address space, I think
it's yes.
Well, really I mean the IO hole in PCI address space. Because system
address space and PCI memory space were traditionally identity mapped
on x86 this is easy to confuse though.
Post by Liu, Yi L
Post by David Gibson
IOMMU to RAM addresses. Because the gap is smaller between the two
windows, I think we get away without really modelling this detail in
qemu though.
PCI host bridge. Because of the paravirtualization, the mapping to
hardware is fuzzy, but for passthrough devices they will both be
implemented by the IOMMU built into the physical host bridge. That
isn't importat to the guest, though - all operations happen at the
window level.
On VT-d, with IOMMU presented, each isolation domain has its own address
space. That's why we talked more on address space level. And iommu makes
the difference. That's the behavioural characteristics a single iommu
translation unit has. And thus an IOMMUObject going to have.
Right, that's the same on POWER. But the IOMMU only translates *some*
addresses within the address space, not all of them. The rest will go
to other PCI devices or be unmapped, but won't go to RAM.
That's why the IOMMU should really be associated with an MR (or
several MRs), not an AddressSpace, it only translates some addresses.
If I'm correct so far, I do believe the major difference between VT-d and
POWER IOMMU is that VT-d isolation domain is a flat address space while
PE of POWER is something different(need your input here as I'm not sure about
it). Maybe it's like there is a flat address space, each PE takes some address
ranges and maps the address ranges to different system physcial address ranges.
No, it's really not that different. In both cases (without virt-SVM)
there's a system memory address space, and a PCI address space for
each domain / PE. There are one or more "outbound" windows in system
memory space that map system memory cycles to PCI cycles (used by the
CPU to access MMIO) and one or more "inbound" (DMA) windows in PCI
memory space which map PCI cycles onto system memory cycles (used by
devices to access system memory).

On old-style PCs, both inbound and outbound windows were (mostly)
identity maps. On POWER they are not.
Post by Liu, Yi L
Post by David Gibson
The other thing that bothers me here is the way it's attached to an
AddressSpace.
My consideration is iommu handles AddressSpaces. dma address space is also
an address space managed by iommu.
No, it's not. It's a region (or several) within the overall PCI
address space. Other things in the addressspace, such as other
device's BARs exist independent of the IOMMU.
It's not something that could really work with PCI-E, I think, but
with a more traditional PCI bus there's no reason you couldn't have
multiple IOMMUs listening on different regions of the PCI address
space.
I think the point here is on POWER, the input addresses of IOMMUs are actaully
in the same address space?
I'm not sure what you mean, but I don't think so. Each PE has its own
IOMMU input address space.
Post by Liu, Yi L
What IOMMU does is mapping the different ranges to
different system physcial address ranges. So it's as you mentioned, multiple
IOMMUs listen on different regions of a PCI address space.
No. That could be the case in theory, but it's not the usual case.

Or rather it depends on what you mean by "an IOMMU". For PAPR guests,
both IOVA 0..1GiB and 2^59..(somewhere) are mapped to system memory,
but with separate page tables. You could consider that two IOMMUs (we
mostly treat it that way in qemu). However, all the mapping is
handled by the same host bridge with 2 sets of page tables per PE, so
you could also call it one IOMMU.

This is what I'm getting at when I say that "one IOMMU" is not a
clearly defined unit.
Post by Liu, Yi L
While for VT-d, it's not the case. The input addresses of IOMMUs may not
in the same address sapce. As I mentioned, each IOMMU domain on VT-d is a
separate address space. So for VT-d, IOMMUs are actually listening to different
address spaces. That's the point why we want to have address space level
abstraction of IOMMU.
Post by David Gibson
That's why we believe it is fine to
associate dma address space with an IOMMUObject.
IIUC how SVM works, the whole point is that the device
no longer writes into a specific PCI address space. Instead, it
writes directly into a process address space. So it seems to me more
that SVM should operate at the PCI level, and disassociate the device
from the normal PCI address space entirely, rather than hooking up
something via that address space.
As Peter replied, we still need the PCI address space, it would be used
to build up the 2nd level page table which would be used in nested
translation.
Thanks,
Yi L
Regards,
Yi L
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
Liu, Yi L
2017-12-20 06:32:42 UTC
Permalink
Post by David Gibson
Post by Liu, Yi L
Post by David Gibson
Hi David,
Post by Liu, Yi L
AddressSpaceOps is similar to MemoryRegionOps, it's just for address
spaces to store arch-specific hooks.
The first hook I would like to introduce is iommu_get(). Return an
IOMMUObject behind the AddressSpace.
For systems that have IOMMUs, we will create a special address
space per device which is different from system default address
space for it (please refer to pci_device_iommu_address_space()).
Normally when that happens, there will be one specific IOMMU (or
say, translation unit) stands right behind that new address space.
This iommu_get() fetches that guy behind the address space. Here,
the guy is defined as IOMMUObject, which includes a notifier_list
so far, may extend in future. Along with IOMMUObject, a new iommu
notifier mechanism is introduced. It would be used for virt-svm.
Also IOMMUObject can further have a IOMMUObjectOps which is similar
to MemoryRegionOps. The difference is IOMMUObjectOps is not relied
on MemoryRegion.
Hi, sorry I didn't reply to the earlier postings of this after our
discussion in China. I've been sick several times and very busy.
I still don't feel like there's an adequate explanation of exactly
what an IOMMUObject represents. Obviously it can represent more than
a single translation window - since that's represented by the
IOMMUMR. But what exactly do all the MRs - or whatever else - that
are represented by the IOMMUObject have in common, from a functional
point of view.
Even understanding the SVM stuff better than I did, I don't really see
why an AddressSpace is an obvious unit to have an IOMMUObject
associated with it.
Here's what I thought about it: IOMMUObject was planned to be the
abstraction of the hardware translation unit, which is a higher level
of the translated address spaces. Say, for each PCI device, it can
have its own translated address space. However for multiple PCI
devices, they can be sharing the same translation unit that handles
the translation requests from different devices. That's the case for
Intel platforms. We introduced this IOMMUObject because sometimes we
want to do something with that translation unit rather than a specific
device, in which we need a general IOMMU device handle.
Ok, but what does "hardware translation unit" mean in practice. The
guest neither knows nor cares, which bits of IOMMU translation happen
to be included in the same bundle of silicon. It only cares what the
behaviour is. What behavioural characteristics does a single
IOMMUObject have?
IIRC one issue left over during last time's discussion was that there
could be more complicated IOMMU models. E.g., one device's DMA request
can be translated nestedly by two or multiple IOMMUs, and current
proposal cannot really handle that complicated hierachy. I'm just
thinking whether we can start from a simple model (say, we don't allow
nested IOMMUs, and actually we don't even allow multiple IOMMUs so
far), then we can evolve from that point in the future.
Also, I thought there were something you mentioned that this approach
is not correct for Power systems, but I can't really remember the
details... Anyways, I think this is not the only approach to solve
the problem, and I believe any new better idea would be greatly
welcomed as well. :)
So, some of my initial comments were based on a misunderstanding of
what was proposed here - since discussing this with Yi at LinuxCon
Beijing, I have a better idea of what's going on.
On POWER - or rather the "pseries" platform, which is paravirtualized.
We can have multiple vIOMMU windows (usually 2) for a single virtual
On POWER, the DMA isolation is done by allocating different DMA window
to different isolation domains? And a single isolation domain may include
multiple dma windows? So with or withou IOMMU, there is only a single
DMA address shared by all the devices in the system? The isolation
mechanism is as what described above?
No, the multiple windows are completely unrelated to how things are
isolated.
I'm afraid I chose a wrong word by using "DMA window"..
Actually, when mentioning "DMA window", I mean address ranges in an iova
address space.
Yes, so did I. My one window I mean one contiguous range of IOVA addresses.
Post by Liu, Yi L
Anyhow, let me re-shape my understanding of POWER IOMMU and
make sure we are in the same page.
Post by David Gibson
Just like on x86, each IOMMU domain has independent IOMMU mappings.
The only difference is that IBM calls the domains "partitionable
endpoints" (PEs) and they tend to be statically created at boot time,
rather than runtime generated.
Does POWER IOMMU also have iova concept? Device can use an iova to
access memory, and IOMMU translates the iova to an address within the
system physical address?
Yes. When I say the "PCI address space" I mean the IOVA space.
Post by Liu, Yi L
Post by David Gibson
The windows are about what addresses in PCI space are translated by
the IOMMU. If the device generates a PCI cycle, only certain
addresses will be mapped by the IOMMU to DMA - other addresses will
correspond to other devices MMIOs, MSI vectors, maybe other things.
I guess the windows you mentioned here is the address ranges within the
system physical address space as you also mentioned MMIOs and etc.
No. I mean ranges within the PCI space == IOVA space. It's simplest
to understand with traditional PCI. A cycle on the bus doesn't know
whether the destination is a device or memory, it just has an address
- a PCI memory address. Part of that address range is mapped to
system RAM, optionally with an IOMMU translating it. Other parts of
that address space are used for devices.
With PCI-E things get more complicated, but the conceptual model is
the same.
Post by Liu, Yi L
Post by David Gibson
The set of addresses translated by the IOMMU need not be contiguous.
I suppose you mean the output addresses of the IOMMU need not be
contiguous?
No. I mean the input addresses of the IOMMU.
Post by Liu, Yi L
Post by David Gibson
Or, there could be two IOMMUs on the bus, each accepting different
address ranges. These two situations are not distinguishable from the
guest's point of view.
So for a typical PAPR setup, the device can access system RAM either
via DMA in the range 0..1GiB (the "32-bit window") or in the range
2^59..2^59+<something> (the "64-bit window"). Typically the 32-bit
window has mappings dynamically created by drivers, and the 64-bit
window has all of system RAM mapped 1:1, but that's entirely up to the
OS, it can map each window however it wants.
32-bit devices (or "64 bit" devices which don't actually implement
enough the address bits) will only be able to use the 32-bit window,
of course.
MMIOs of other devices, the "magic" MSI-X addresses belonging to the
host bridge and other things exist outside those ranges. Those are
just the ranges which are used to DMA to RAM.
Each PE (domain) can see a different version of what's in each
window.
If I'm correct so far. PE actually defines a mapping between an address
range of an address space(aka. iova address space) and an address range
of the system physical address space.
No. A PE means several things, but basically it is an isolation
domain, like an Intel IOMMU domain. Each PE has an independent set of
IOMMU mappings which translate part of the PCI address space to system
memory space.
Post by Liu, Yi L
Then my question is: does each PE define a separate iova address sapce
which is flat from 0 - 2^AW -1, AW is address width? As a reference,
VT-d domain defines a flat address space for each domain.
Partly. Each PE has an address space which all devices in the PE see.
Only some of that address space is mapped to system memory though,
other parts are occupied by devices, others are unmapped.
Only the parts mapped by the IOMMU vary between PEs - the other parts
of the address space will be identical for all PEs on the host
Thanks, this comment addresses my question well. This is different from
what we have on VT-d.
Post by David Gibson
bridge. However for POWER guests (not for hosts) there is exactly one
PE for each virtual host bridge.
Post by Liu, Yi L
Post by David Gibson
In fact, if I understand the "IO hole" correctly, the situation on x86
isn't very different. It has a window below the IO hole and a second
window above the IO hole. The addresses within the IO hole go to
(32-bit) devices on the PCI bus, rather than being translated by the
If you mean the "IO hole" within the system physcial address space, I think
it's yes.
Well, really I mean the IO hole in PCI address space. Because system
address space and PCI memory space were traditionally identity mapped
on x86 this is easy to confuse though.
Post by Liu, Yi L
Post by David Gibson
IOMMU to RAM addresses. Because the gap is smaller between the two
windows, I think we get away without really modelling this detail in
qemu though.
PCI host bridge. Because of the paravirtualization, the mapping to
hardware is fuzzy, but for passthrough devices they will both be
implemented by the IOMMU built into the physical host bridge. That
isn't importat to the guest, though - all operations happen at the
window level.
On VT-d, with IOMMU presented, each isolation domain has its own address
space. That's why we talked more on address space level. And iommu makes
the difference. That's the behavioural characteristics a single iommu
translation unit has. And thus an IOMMUObject going to have.
Right, that's the same on POWER. But the IOMMU only translates *some*
addresses within the address space, not all of them. The rest will go
to other PCI devices or be unmapped, but won't go to RAM.
That's why the IOMMU should really be associated with an MR (or
several MRs), not an AddressSpace, it only translates some addresses.
If I'm correct so far, I do believe the major difference between VT-d and
POWER IOMMU is that VT-d isolation domain is a flat address space while
PE of POWER is something different(need your input here as I'm not sure about
it). Maybe it's like there is a flat address space, each PE takes some address
ranges and maps the address ranges to different system physcial address ranges.
No, it's really not that different. In both cases (without virt-SVM)
there's a system memory address space, and a PCI address space for
each domain / PE. There are one or more "outbound" windows in system
memory space that map system memory cycles to PCI cycles (used by the
CPU to access MMIO) and one or more "inbound" (DMA) windows in PCI
memory space which map PCI cycles onto system memory cycles (used by
devices to access system memory).
On old-style PCs, both inbound and outbound windows were (mostly)
identity maps. On POWER they are not.
Post by Liu, Yi L
Post by David Gibson
The other thing that bothers me here is the way it's attached to an
AddressSpace.
My consideration is iommu handles AddressSpaces. dma address space is also
an address space managed by iommu.
No, it's not. It's a region (or several) within the overall PCI
address space. Other things in the addressspace, such as other
device's BARs exist independent of the IOMMU.
It's not something that could really work with PCI-E, I think, but
with a more traditional PCI bus there's no reason you couldn't have
multiple IOMMUs listening on different regions of the PCI address
space.
I think the point here is on POWER, the input addresses of IOMMUs are actaully
in the same address space?
I'm not sure what you mean, but I don't think so. Each PE has its own
IOMMU input address space.
Post by Liu, Yi L
What IOMMU does is mapping the different ranges to
different system physcial address ranges. So it's as you mentioned, multiple
IOMMUs listen on different regions of a PCI address space.
No. That could be the case in theory, but it's not the usual case.
Or rather it depends what you mean by "an IOMMU". For PAPR guests,
both IOVA 0..1GiB and 2^59..(somewhere) are mapped to system memory,
but with separate page tables. You could consider that two IOMMUs (we
mostly treat it that way in qemu). However, all the mapping is
handled by the same host bridge with 2 sets of page tables per PE, so
you could also call it one IOMMU.
This is what I'm getting at when I say that "one IOMMU" is not a
clearly defined unit.
Post by Liu, Yi L
While for VT-d, it's not the case. The input addresses of IOMMUs may not
in the same address sapce. As I mentioned, each IOMMU domain on VT-d is a
separate address space. So for VT-d, IOMMUs are actually listening to different
address spaces. That's the point why we want to have address space level
abstraction of IOMMU.
Post by David Gibson
That's why we believe it is fine to
associate dma address space with an IOMMUObject.
IIUC how SVM works, the whole point is that the device
no longer writes into a specific PCI address space. Instead, it
writes directly into a process address space. So it seems to me more
that SVM should operate at the PCI level, and disassociate the device
from the normal PCI address space entirely, rather than hooking up
something via that address space.
After thinking more, I agree that it is not suitable to hook up something
for the 1st level via the PCI address space. Once both 1st- and 2nd-level
translation are exposed to the guest, a device would write to multiple
address spaces; the PCI address space is only one of them. I think your
reply in another email is a good start, so let me reply with my thoughts
under that email.

Regards,
Yi L
Post by David Gibson
Post by Liu, Yi L
Post by David Gibson
As Peter replied, we still need the PCI address space, it would be used
to build up the 2nd level page table which would be used in nested
translation.
Thanks,
Yi L
Regards,
Yi L
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
David Gibson
2017-12-20 11:01:10 UTC
Permalink
[snip]
Post by Liu, Yi L
Post by David Gibson
Post by Liu, Yi L
Post by David Gibson
So for a typical PAPR setup, the device can access system RAM either
via DMA in the range 0..1GiB (the "32-bit window") or in the range
2^59..2^59+<something> (the "64-bit window"). Typically the 32-bit
window has mappings dynamically created by drivers, and the 64-bit
window has all of system RAM mapped 1:1, but that's entirely up to the
OS, it can map each window however it wants.
32-bit devices (or "64 bit" devices which don't actually implement
enough the address bits) will only be able to use the 32-bit window,
of course.
MMIOs of other devices, the "magic" MSI-X addresses belonging to the
host bridge and other things exist outside those ranges. Those are
just the ranges which are used to DMA to RAM.
Each PE (domain) can see a different version of what's in each
window.
If I'm correct so far. PE actually defines a mapping between an address
range of an address space(aka. iova address space) and an address range
of the system physical address space.
No. A PE means several things, but basically it is an isolation
domain, like an Intel IOMMU domain. Each PE has an independent set of
IOMMU mappings which translate part of the PCI address space to system
memory space.
Post by Liu, Yi L
Then my question is: does each PE define a separate iova address sapce
which is flat from 0 - 2^AW -1, AW is address width? As a reference,
VT-d domain defines a flat address space for each domain.
Partly. Each PE has an address space which all devices in the PE see.
Only some of that address space is mapped to system memory though,
other parts are occupied by devices, others are unmapped.
Only the parts mapped by the IOMMU vary between PEs - the other parts
of the address space will be identical for all PEs on the host
Thx, this comment addressed me well. This is different from what we have
on VT-d.
Really? That's hard to believe. I'm pretty sure the VT-d IOMMU must
have a range < 2^64, and anything on the bus outside that range I
expect would be common between all domains. In particular I'd expect
the BARs for other devices not to be remapped by the IOMMU (though
they may be inaccessible on PCI-E due to peer-to-peer transactions being
blocked). As well as things above the IOMMU's range, I'd expect the
region for 32-bit BARs to be common between all domains.

[snip]
Post by Liu, Yi L
Post by David Gibson
Post by Liu, Yi L
Post by David Gibson
Post by Liu, Yi L
IIUC how SVM works, the whole point is that the device
no longer writes into a specific PCI address space. Instead, it
writes directly into a process address space. So it seems to me more
that SVM should operate at the PCI level, and disassociate the device
from the normal PCI address space entirely, rather than hooking up
something via that address space.
After thinking more, I agree that it is not suitable to hook up something for
1st level via the PCI address space. In the time 1st and 2nd level translation
is exposed to guest, a device would write to multiple address spaces. PCI address
space is only one of them. I think your reply in another email is a good start,
let me reply my thoughts under that email.
Regards,
Yi L
Post by David Gibson
Post by Liu, Yi L
Post by David Gibson
Post by Liu, Yi L
As Peter replied, we still need the PCI address space, it would be used
to build up the 2nd level page table which would be used in nested
translation.
Thanks,
Yi L
Regards,
Yi L
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
Liu, Yi L
2017-12-22 06:47:28 UTC
Permalink
[snip]
Post by David Gibson
Post by Liu, Yi L
Post by David Gibson
Partly. Each PE has an address space which all devices in the PE see.
Only some of that address space is mapped to system memory though,
other parts are occupied by devices, others are unmapped.
Only the parts mapped by the IOMMU vary between PEs - the other parts
of the address space will be identical for all PEs on the host
Thx, this comment addressed me well. This is different from what we have
on VT-d.
Really? That's hard to believe. I'm pretty sure the VT-d IOMMU must
have a range < 2^64, and anything on the bus outside that range I
expect would be common between all domains. In particular I'd expect
the BARs for other devices not to be remapped by the IOMMU (though
they may be inaccessible on PCI-E due peer to peer transactions being
blocked). As well as things above the IOMMU's range, I'd expect the
region for 32-bit BARs to be common between all domains.
Sorry, I misunderstood you. In each IOVA space there is a reserved range:
the BARs' MMIO range. Such a reservation is to avoid unexpected
peer-to-peer transactions. So with regard to the IOVA space, all vendors
should be similar. So you are right~

Thanks,
Yi L
David Gibson
2017-12-18 06:30:55 UTC
Permalink
Hi Yi L,
Post by Liu, Yi L
AddressSpaceOps is similar to MemoryRegionOps, it's just for address
spaces to store arch-specific hooks.
The first hook I would like to introduce is iommu_get(). Return an
IOMMUObject behind the AddressSpace.
For systems that have IOMMUs, we will create a special address
space per device which is different from system default address
space for it (please refer to pci_device_iommu_address_space()).
Normally when that happens, there will be one specific IOMMU (or
say, translation unit) stands right behind that new address space.
This iommu_get() fetches that guy behind the address space. Here,
the guy is defined as IOMMUObject, which includes a notifier_list
so far, may extend in future. Along with IOMMUObject, a new iommu
notifier mechanism is introduced. It would be used for virt-svm.
Also IOMMUObject can further have a IOMMUObjectOps which is similar
to MemoryRegionOps. The difference is IOMMUObjectOps is not relied
on MemoryRegion.
Hi, sorry I didn't reply to the earlier postings of this after our
discussion in China. I've been sick several times and very busy.
Hi David,
Fully understood. I'll try my best to address your questions. Also,
feel free to raise further questions; the more we discuss, the better
the work we get done.
I still don't feel like there's an adequate explanation of exactly
what an IOMMUObject represents. Obviously it can represent more than
IOMMUObject is aimed at representing the iommu itself, e.g. the
iommu-specific operations. One of the key purposes of IOMMUObject is to
introduce a notifier framework that lets the iommu emulator do iommu
operations other than MAP/UNMAP. As IOMMUs grow more and more features,
MAP/UNMAP is not the only operation the iommu emulator needs to handle,
e.g. shared virtual memory. As far as I know, AMD/ARM also have it; you
may correct me on that. As my cover letter mentioned, the MR-based
notifier framework doesn't work for the newly added IOMMU operations,
like binding the guest pasid table pointer to the host and propagating
the guest's iotlb flushes to the host.
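
Not the code from this series, just a compilable toy model of that point, so
it is clear what the notifier_list on the IOMMUObject buys us; the event names
below are made up for illustration:

    /* Events beyond MAP/UNMAP that a vIOMMU emulator may need to signal. */
    enum {
        IOMMU_EVT_BIND_PASID_TABLE,    /* guest programmed a pasid table pointer */
        IOMMU_EVT_FLUSH_1ST_LEVEL_TLB  /* guest flushed a 1st-level (gVA) IOTLB  */
    };

    typedef struct Notifier {
        void (*notify)(struct Notifier *n, int event, void *data);
        struct Notifier *next;
    } Notifier;

    /* Stand-in for the IOMMUObject: it carries a notifier list that is not
     * tied to any MemoryRegion, so it keeps working even when the guest
     * bypasses IOVA translation and the IOMMU MR is switched away. */
    typedef struct IOMMUObjectModel {
        Notifier *notifier_list;
    } IOMMUObjectModel;

    static void iommu_object_notify(IOMMUObjectModel *io, int event, void *data)
    {
        for (Notifier *n = io->notifier_list; n; n = n->next) {
            n->notify(n, event, data);  /* e.g. VFIO forwards it to the host */
        }
    }

In the series itself the consumer is VFIO, which would forward such events to
the host through the container.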
a single translation window - since that's represented by the
IOMMUMR. But what exactly do all the MRs - or whatever else - that
are represented by the IOMMUObject have in common, from a functional
point of view.
Let me take virt-SVM as an example. As far as I know, for virt-SVM the
implementations of different vendors are similar. The key design is to
have nested translation (aka two-stage translation): the guest maintains
the gVA->gPA mapping and the hypervisor builds the gPA->hPA mapping,
similar to the EPT-based virt-MMU solution.
In Qemu, the gPA->hPA mapping is done through the MAP/UNMAP notifier, and
that can keep going. But for the gVA->gPA mapping, only the guest knows
it, so the hypervisor needs to trap the specific guest iommu operation and
pass the gVA->gPA mapping knowledge to the host through a (newly added)
notifier. In VT-d, this is called binding the guest pasid table to the
host.
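
As a conceptual sketch only (this is not VT-d code), the nested walk the
hardware ends up doing looks like:

    #include <stdint.h>

    /* Nested (two-stage) translation: stage 1 is owned by the guest
     * (gVA -> gPA, walked through the guest's PASID/CR3 tables), stage 2
     * is owned by the host (gPA -> hPA, kept in sync via MAP/UNMAP). */
    typedef uint64_t (*translate_fn)(uint64_t addr);

    static uint64_t nested_translate(uint64_t gva,
                                     translate_fn stage1_gva_to_gpa,
                                     translate_fn stage2_gpa_to_hpa)
    {
        uint64_t gpa = stage1_gva_to_gpa(gva);  /* guest-controlled tables   */
        return stage2_gpa_to_hpa(gpa);          /* host-controlled 2nd level */
    }

Binding the guest pasid table just tells the host IOMMU where to find the
stage-1 tables; the stage-2 tables are still the ones built from the
MAP/UNMAP notifications.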
What I don't get is that the PASID table is per extended context entry. I
understand the latter is indexed by PCI device/function. And today MRs are
created per PCIe device, if I am not wrong. So why can't we have one new
MR notifier dedicated to PASID table passing? My understanding is that the
MR, having a 1:1 correspondence with a PCIe device and thus a context,
could be of the right granularity.
Not really. The MR(s) and AS are created per group of devices which
will always see the same mappings. On Intel that's the IOMMU domain.
On PAPR that's a partitionable endpoint - except that we choose to
only have one PE per guest host bridge (but multiple host bridges are
standard for POWER).

There's a qemu hook to get the right AS for a device, which takes the
devfn as a parameter. Depending on the host bridge implementation,
though, it won't necessarily return a different AS for every
device.
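
Roughly, that hook looks like the sketch below. The includes and signatures
are quoted from memory of the current tree, so treat them as approximate,
and MyPHBState is a made-up host bridge type:

    #include "hw/pci/pci.h"     /* pci_setup_iommu(), PCIBus */
    #include "exec/memory.h"    /* AddressSpace              */

    static AddressSpace *my_phb_dma_as(PCIBus *bus, void *opaque, int devfn)
    {
        MyPHBState *s = opaque;  /* hypothetical bridge state holding dma_as */

        /* A PAPR-like bridge: every device behind the bridge shares one PE,
         * hence one DMA address space.  A VT-d-like bridge would instead
         * look up (or create) a per-devfn AddressSpace here. */
        return &s->dma_as;
    }

    /* Registered once at host-bridge realize time:
     *     pci_setup_iommu(bus, my_phb_dma_as, s);
     * pci_device_iommu_address_space(pdev) then returns whatever the hook
     * gives back for that device's devfn. */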
Then I understand the only flags we currently have are NONE, MAP and
UNMAP, but couldn't we add a new one for PASID TABLE passing? So it is
not crystal clear to me why MR notifiers are not suited to PASID table
passing.
Right, it's not clear to me either. Things get more complicated if both the 1st
level (per PASID) and 2nd level translations (per PCI RID) are visible
to the guest. Having level 1 owned by the guest and 2nd level owned
by the host is the typical mode of operation, but if we want to model
bare metal machines we do need to handle the case of both. Similarly,
implementing virt-SVM can't go and break our modelling of
"traditional" non-PASID aware IOMMUs. Those are not usually present
in x86 guests, although they can be, and they are *always* present for
PAPR guests.
Also, for the gVA iotlb flushing, only the guest knows it, so the
hypervisor needs to propagate it to the host. Here, MAP/UNMAP is not
suitable since this gVA iotlb flush doesn't require modifying the host
iommu translation table.
I don't really get this argument. IOMMUNotifier is just a notifier that
is attached to an IOMMU MR and calls an IOMMUNotify function, right?
Then the role of the function is currently tied to the existing flags,
MAP and UNMAP. This is not linked to an action on the physical IOMMU,
right?
Maybe, maybe not. In the case of emulated devices, it need not touch
the host MMU. However, for the case of VFIO devices, we need to
mirror mappings in the guest IOMMU to the host IOMMU.
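
For reference, the mirroring boils down to the VFIO type1 ioctl. A
stripped-down sketch of what happens on a guest MAP event (error handling
omitted, and this is not the literal hw/vfio/common.c code):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Mirror one guest IOMMU mapping into the host IOMMU through the VFIO
     * container fd: iova is the guest IOVA, vaddr the QEMU virtual address
     * backing the guest-physical target. */
    static int mirror_map_to_host(int container_fd, uint64_t iova,
                                  uint64_t size, void *vaddr)
    {
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (uintptr_t)vaddr,
            .iova  = iova,
            .size  = size,
        };
        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }

Emulated devices never reach this path; their DMA is translated entirely
inside QEMU.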
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
David Gibson
2017-12-18 11:38:26 UTC
Permalink
Hi Yi L,
Post by Liu, Yi L
AddressSpaceOps is similar to MemoryRegionOps, it's just for address
spaces to store arch-specific hooks.
The first hook I would like to introduce is iommu_get(). Return an
IOMMUObject behind the AddressSpace.
David had an objection in the past about this method, saying that
several IOMMUs could translate a single AS?
https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg01610.html
In
https://github.com/torvalds/linux/blob/master/Documentation/devicetree/bindings/pci/pci-iommu.txt,
it is said
"a given PCI device can only master through one IOMMU"
That's using a platform specific meaning of what "one IOMMU" means.
In general what's several IOMMUs and what's one IOMMU which responds
to several address regions is not distinguishable from the device's
point of view.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
Liu, Yi L
2017-11-03 12:01:54 UTC
Permalink
Rename GuestIOMMU to GuestIOMMUMR, as the existing GuestIOMMU is
for MemoryRegion-related notifiers.

Signed-off-by: Liu, Yi L <***@linux.intel.com>
---
hw/vfio/common.c | 15 ++++++++-------
include/hw/vfio/vfio-common.h | 8 ++++----
2 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 1f7d516..3d40bec 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -348,7 +348,7 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,

static void vfio_iommu_map_notify(IOMMUMRNotifier *n, IOMMUTLBEntry *iotlb)
{
- VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
+ VFIOGuestIOMMUMR *giommu = container_of(n, VFIOGuestIOMMUMR, n);
VFIOContainer *container = giommu->container;
hwaddr iova = iotlb->iova + giommu->iommu_offset;
bool read_only;
@@ -478,7 +478,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
memory_region_ref(section->mr);

if (memory_region_is_iommu(section->mr)) {
- VFIOGuestIOMMU *giommu;
+ VFIOGuestIOMMUMR *giommu;
IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);

trace_vfio_listener_region_add_iommu(iova, end);
@@ -500,7 +500,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
IOMMU_MR_EVENT_ALL,
section->offset_within_region,
int128_get64(llend));
- QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
+ QLIST_INSERT_HEAD(&container->giommu_mr_list, giommu, giommu_next);

memory_region_register_iommu_notifier(section->mr, &giommu->n);
memory_region_iommu_replay(giommu->iommu, &giommu->n);
@@ -567,9 +567,9 @@ static void vfio_listener_region_del(MemoryListener *listener,
}

if (memory_region_is_iommu(section->mr)) {
- VFIOGuestIOMMU *giommu;
+ VFIOGuestIOMMUMR *giommu;

- QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+ QLIST_FOREACH(giommu, &container->giommu_mr_list, giommu_next) {
if (MEMORY_REGION(giommu->iommu) == section->mr &&
giommu->n.start == section->offset_within_region) {
memory_region_unregister_iommu_notifier(section->mr,
@@ -1163,12 +1163,13 @@ static void vfio_disconnect_container(VFIOGroup *group)

if (QLIST_EMPTY(&container->group_list)) {
VFIOAddressSpace *space = container->space;
- VFIOGuestIOMMU *giommu, *tmp;
+ VFIOGuestIOMMUMR *giommu, *tmp;

vfio_listener_release(container);
QLIST_REMOVE(container, next);

- QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
+ QLIST_FOREACH_SAFE(giommu, &container->giommu_mr_list,
+ giommu_next, tmp) {
memory_region_unregister_iommu_notifier(
MEMORY_REGION(giommu->iommu), &giommu->n);
QLIST_REMOVE(giommu, giommu_next);
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 865e3e7..702a085 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -87,19 +87,19 @@ typedef struct VFIOContainer {
* contiguous IOVA window. We may need to generalize that in
* future
*/
- QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
+ QLIST_HEAD(, VFIOGuestIOMMUMR) giommu_mr_list;
QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
QLIST_HEAD(, VFIOGroup) group_list;
QLIST_ENTRY(VFIOContainer) next;
} VFIOContainer;

-typedef struct VFIOGuestIOMMU {
+typedef struct VFIOGuestIOMMUMR {
VFIOContainer *container;
IOMMUMemoryRegion *iommu;
hwaddr iommu_offset;
IOMMUMRNotifier n;
- QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
-} VFIOGuestIOMMU;
+ QLIST_ENTRY(VFIOGuestIOMMUMR) giommu_next;
+} VFIOGuestIOMMUMR;

typedef struct VFIOHostDMAWindow {
hwaddr min_iova;
--
1.9.1
Liu, Yi L
2017-11-03 12:01:55 UTC
Permalink
This patch introduces a notify framework for IOMMUObject.iommu_notifiers.
VFIOGuestIOMMUObject is introduced to link the VFIO Container and the new
IOMMUObject notifiers.

A VFIOGuestIOMMUObject instance is allocated when a device is assigned and,
at the same time, a vIOMMU is exposed to the guest.

If there is an IOMMUObject behind the device's AddressSpace (i.e. a vIOMMU is
exposed), the VFIOGuestIOMMUObject instance is allocated and inserted into
VFIOContainer.giommu_object_list.

Signed-off-by: Liu, Yi L <***@linux.intel.com>
---
hw/vfio/pci.c | 39 ++++++++++++++++++++++++++++++++++++++-
include/hw/vfio/vfio-common.h | 8 ++++++++
2 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index c977ee3..5b77c7e 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2642,6 +2642,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
VFIODevice *vbasedev_iter;
VFIOGroup *group;
+ AddressSpace *as;
+ IOMMUObject *iommu;
char *tmp, group_path[PATH_MAX], *group_name;
Error *err = NULL;
ssize_t len;
@@ -2694,7 +2696,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)

trace_vfio_realize(vdev->vbasedev.name, groupid);

- group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp);
+ as = pci_device_iommu_address_space(pdev);
+ group = vfio_get_group(groupid, as, errp);
if (!group) {
goto error;
}
@@ -2877,6 +2880,17 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
vfio_register_req_notifier(vdev);
vfio_setup_resetfn_quirk(vdev);

+ iommu = address_space_iommu_get(as);
+ if (iommu != NULL) {
+ VFIOGuestIOMMUObject *giommu;
+ giommu = g_malloc0(sizeof(*giommu));
+ giommu->iommu = iommu;
+ giommu->container = group->container;
+ QLIST_INSERT_HEAD(&group->container->giommu_object_list,
+ giommu,
+ giommu_next);
+ }
+
return;

out_teardown:
@@ -2907,6 +2921,28 @@ static void vfio_instance_finalize(Object *obj)
vfio_put_group(group);
}

+static void vfio_release_iommu_object(PCIDevice *pdev)
+{
+ VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+ AddressSpace *as;
+ IOMMUObject *iommu;
+
+ as = pci_device_iommu_address_space(pdev);
+ iommu = address_space_iommu_get(as);
+ if (iommu != NULL) {
+ VFIOGuestIOMMUObject *giommu, *tmp;
+ VFIOGroup *group;
+ group = vdev->vbasedev.group;
+
+ QLIST_FOREACH_SAFE(giommu,
+ &group->container->giommu_object_list,
+ giommu_next, tmp) {
+ QLIST_REMOVE(giommu, giommu_next);
+ g_free(giommu);
+ }
+ }
+ return;
+}
static void vfio_exitfn(PCIDevice *pdev)
{
VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
@@ -2915,6 +2951,7 @@ static void vfio_exitfn(PCIDevice *pdev)
vfio_unregister_err_notifier(vdev);
pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
vfio_disable_interrupts(vdev);
+ vfio_release_iommu_object(pdev);
if (vdev->intx.mmap_timer) {
timer_free(vdev->intx.mmap_timer);
}
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 702a085..e4963cc 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -88,6 +88,7 @@ typedef struct VFIOContainer {
* future
*/
QLIST_HEAD(, VFIOGuestIOMMUMR) giommu_mr_list;
+ QLIST_HEAD(, VFIOGuestIOMMUObject) giommu_object_list;
QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
QLIST_HEAD(, VFIOGroup) group_list;
QLIST_ENTRY(VFIOContainer) next;
@@ -101,6 +102,13 @@ typedef struct VFIOGuestIOMMUMR {
QLIST_ENTRY(VFIOGuestIOMMUMR) giommu_next;
} VFIOGuestIOMMUMR;

+typedef struct VFIOGuestIOMMUObject {
+ VFIOContainer *container;
+ IOMMUObject *iommu;
+ IOMMUNotifier n;
+ QLIST_ENTRY(VFIOGuestIOMMUObject) giommu_next;
+} VFIOGuestIOMMUObject;
+
typedef struct VFIOHostDMAWindow {
hwaddr min_iova;
hwaddr max_iova;
--
1.9.1
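
For readers skimming the series: below is a condensed, non-authoritative
sketch of the allocation/release flow the patch above introduces, folding
the vfio_realize() and vfio_exitfn() hunks together. The helper names
vfio_track_iommu_object() and vfio_untrack_iommu_objects() are invented
for illustration only; every other identifier comes from the patch itself.

/* Illustrative only: condenses the two hunks in the patch above. */
static void vfio_track_iommu_object(VFIOGroup *group, AddressSpace *as)
{
    IOMMUObject *iommu = address_space_iommu_get(as);
    VFIOGuestIOMMUObject *giommu;

    if (!iommu) {
        return;             /* no vIOMMU behind this device's AddressSpace */
    }
    giommu = g_malloc0(sizeof(*giommu));
    giommu->iommu = iommu;
    giommu->container = group->container;
    QLIST_INSERT_HEAD(&group->container->giommu_object_list, giommu,
                      giommu_next);
}

/* Symmetrically, on vfio_exitfn() the release path walks the same list
 * and frees each entry (this is what vfio_release_iommu_object() does). */
static void vfio_untrack_iommu_objects(VFIOContainer *container)
{
    VFIOGuestIOMMUObject *giommu, *tmp;

    QLIST_FOREACH_SAFE(giommu, &container->giommu_object_list,
                       giommu_next, tmp) {
        QLIST_REMOVE(giommu, giommu_next);
        g_free(giommu);
    }
}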
Liu, Yi L
2017-11-14 14:24:07 UTC
Permalink
Hi Eric,
Hi Yi L,
Post by Liu, Yi L
This patch introduce a notify framework for IOMMUObject.iommu_notifiers.
Introduce VFIOGuestIOMMUObject is to link VFIO Container and the new
IOMMUObject notififiers.
notifiers
Post by Liu, Yi L
VFIOGuestIOMMUObject instance is allocated when device is assigned and
meanwhile vIOMMU is exposed to guest.
If there is IOMMUObject behind the device AddressSpace(a.ka vIOMMU exposed).
The VFIOGuestIOMMUObject instance would be allocated and inserted to the
VFIOContainer.giommu_object_list.
---
hw/vfio/pci.c | 39 ++++++++++++++++++++++++++++++++++++++-
include/hw/vfio/vfio-common.h | 8 ++++++++
2 files changed, 46 insertions(+), 1 deletion(-)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index c977ee3..5b77c7e 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2642,6 +2642,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
VFIODevice *vbasedev_iter;
VFIOGroup *group;
+ AddressSpace *as;
+ IOMMUObject *iommu;
char *tmp, group_path[PATH_MAX], *group_name;
Error *err = NULL;
ssize_t len;
@@ -2694,7 +2696,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
trace_vfio_realize(vdev->vbasedev.name, groupid);
- group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), errp);
+ as = pci_device_iommu_address_space(pdev);
+ group = vfio_get_group(groupid, as, errp);
if (!group) {
goto error;
}
@@ -2877,6 +2880,17 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
vfio_register_req_notifier(vdev);
vfio_setup_resetfn_quirk(vdev);
+ iommu = address_space_iommu_get(as);
+ if (iommu != NULL) {
+ VFIOGuestIOMMUObject *giommu;
nit: blank line needed
accepted.
Post by Liu, Yi L
+ giommu = g_malloc0(sizeof(*giommu));
+ giommu->iommu = iommu;
+ giommu->container = group->container;
+ QLIST_INSERT_HEAD(&group->container->giommu_object_list,
There is no QLIST_INIT anywhere for container's giommu_object_list?
Yes, I need to add it. Thanks for catching that. (A sketch of the missing
init follows the quoted patch below.)

Thanks,
Yi L
Thanks
Eric
Post by Liu, Yi L
+ giommu,
+ giommu_next);
+ }
+
return;
@@ -2907,6 +2921,28 @@ static void vfio_instance_finalize(Object *obj)
vfio_put_group(group);
}
+static void vfio_release_iommu_object(PCIDevice *pdev)
+{
+ VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+ AddressSpace *as;
+ IOMMUObject *iommu;
+
+ as = pci_device_iommu_address_space(pdev);
+ iommu = address_space_iommu_get(as);
+ if (iommu != NULL) {
+ VFIOGuestIOMMUObject *giommu, *tmp;
+ VFIOGroup *group;
+ group = vdev->vbasedev.group;
+
+ QLIST_FOREACH_SAFE(giommu,
+ &group->container->giommu_object_list,
+ giommu_next, tmp) {
+ QLIST_REMOVE(giommu, giommu_next);
+ g_free(giommu);
+ }
+ }
+ return;
+}
static void vfio_exitfn(PCIDevice *pdev)
{
VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
@@ -2915,6 +2951,7 @@ static void vfio_exitfn(PCIDevice *pdev)
vfio_unregister_err_notifier(vdev);
pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
vfio_disable_interrupts(vdev);
+ vfio_release_iommu_object(pdev);
if (vdev->intx.mmap_timer) {
timer_free(vdev->intx.mmap_timer);
}
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 702a085..e4963cc 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -88,6 +88,7 @@ typedef struct VFIOContainer {
* future
*/
QLIST_HEAD(, VFIOGuestIOMMUMR) giommu_mr_list;
+ QLIST_HEAD(, VFIOGuestIOMMUObject) giommu_object_list;
QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
QLIST_HEAD(, VFIOGroup) group_list;
QLIST_ENTRY(VFIOContainer) next;
@@ -101,6 +102,13 @@ typedef struct VFIOGuestIOMMUMR {
QLIST_ENTRY(VFIOGuestIOMMUMR) giommu_next;
} VFIOGuestIOMMUMR;
+typedef struct VFIOGuestIOMMUObject {
+ VFIOContainer *container;
+ IOMMUObject *iommu;
+ IOMMUNotifier n;
+ QLIST_ENTRY(VFIOGuestIOMMUObject) giommu_next;
+} VFIOGuestIOMMUObject;
+
typedef struct VFIOHostDMAWindow {
hwaddr min_iova;
hwaddr max_iova;
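
A minimal sketch of the fix discussed above, assuming the initialization is
added next to the existing QLIST_INIT calls in vfio_connect_container()
(hw/vfio/common.c); the exact placement is up to the respin of this series:

    QLIST_INIT(&container->giommu_mr_list);      /* existing list init */
    QLIST_INIT(&container->giommu_object_list);  /* init Eric points out is missing */

    /* Note: the container is allocated with g_malloc0(), so the zeroed list
     * head already behaves as an empty QLIST; the explicit init mainly keeps
     * the new list consistent with the existing ones. */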
Liu, Yi L
2017-11-03 12:01:56 UTC
Permalink
This is an example to show the usage of the IOMMUObject-based notifier.

For passthrough devices, if there is a vIOMMU exposed to the guest, the
guest issues iommu operations on the devices, and those operations need
to be propagated to the host iommu driver.

In the future, the IOMMUObject notifiers may include:
* a notifier for guest pasid table binding
* a notifier for guest iommu tlb invalidation
Both notifiers would be included in the upcoming virt-SVM patchset.

In the virt-SVM patchset, this notifier would be fully implemented; a
hedged sketch of its expected shape follows the patch.

Signed-off-by: Liu, Yi L <***@linux.intel.com>
---
hw/vfio/pci.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5b77c7e..3ed521e 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2637,6 +2637,14 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
vdev->req_enabled = false;
}

+static void vfio_iommu_bind_pasidtbl_notify(IOMMUNotifier *n,
+ IOMMUEventData *event_data)
+{
+/* Sample code, would be detailed in coming virt-SVM patchset.
+ VFIOGuestIOMMUObject *giommu = container_of(n, VFIOGuestIOMMUObject, n);
+ VFIOContainer *container = giommu->container;
+*/
+}
static void vfio_realize(PCIDevice *pdev, Error **errp)
{
VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
@@ -2889,6 +2897,12 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
QLIST_INSERT_HEAD(&group->container->giommu_object_list,
giommu,
giommu_next);
+ /* Register vfio_iommu_bind_pasidtbl_notify with event flag
+ IOMMU_EVENT_BIND_PASIDT */
+ iommu_notifier_register(iommu,
+ &giommu->n,
+ vfio_iommu_bind_pasidtbl_notify,
+ IOMMU_EVENT_BIND_PASIDT);
}

return;
--
1.9.1
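
A minimal sketch of how the stub above might eventually be filled in by the
virt-SVM series. The handler body here is an assumption spelled out as
placeholder comments only; the actual event payload layout and the VFIO
request used to push the PASID table to the host are defined by the virt-SVM
patches, not by this series.

static void vfio_iommu_bind_pasidtbl_notify(IOMMUNotifier *n,
                                            IOMMUEventData *event_data)
{
    VFIOGuestIOMMUObject *giommu = container_of(n, VFIOGuestIOMMUObject, n);
    VFIOContainer *container = giommu->container;

    /*
     * Expected shape of the real handler (virt-SVM series):
     *  1. pick the guest PASID table pointer out of event_data
     *     (field layout comes from the IOMMUObject framework patches);
     *  2. propagate it to the host IOMMU driver through container->fd,
     *     using whatever VFIO request the kernel side of virt-SVM defines.
     * Both steps are placeholders here, exactly like the commented-out
     * sample in the patch above.
     */
    (void)container;
}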