Merge branch 'kvm-guestmemfd' into HEAD

Introduce several new KVM uAPIs to ultimately create a guest-first memory subsystem within KVM, a.k.a. guest_memfd. Guest-first memory allows KVM to provide features, enhancements, and optimizations that are kludgly or outright impossible to implement in a generic memory subsystem. The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which similar to the generic memfd_create(), creates an anonymous file and returns a file descriptor that refers to it. Again like "regular" memfd files, guest_memfd files live in RAM, have volatile storage, and are automatically released when the last reference is dropped. The key differences between memfd files (and every other memory subystem) is that guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to convert a guest memory area between the shared and guest-private states. A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to specify attributes for a given page of guest memory. In the long term, it will likely be extended to allow userspace to specify per-gfn RWX protections, including allowing memory to be writable in the guest without it also being writable in host userspace. The immediate and driving use case for guest_memfd are Confidential (CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM. For such use cases, being able to map memory into KVM guests without requiring said memory to be mapped into the host is a hard requirement. While SEV+ and TDX prevent untrusted software from reading guest private data by encrypting guest memory, pKVM provides confidentiality and integrity *without* relying on memory encryption. In addition, with SEV-SNP and especially TDX, accessing guest private memory can be fatal to the host, i.e. KVM must be prevent host userspace from accessing guest memory irrespective of hardware behavior. Long term, guest_memfd may be useful for use cases beyond CoCo VMs, for example hardening userspace against unintentional accesses to guest memory. As mentioned earlier, KVM's ABI uses userspace VMA protections to define the allow guest protection (with an exception granted to mapping guest memory executable), and similarly KVM currently requires the guest mapping size to be a strict subset of the host userspace mapping size. Decoupling the mappings sizes would allow userspace to precisely map only what is needed and with the required permissions, without impacting guest performance. A guest-first memory subsystem also provides clearer line of sight to things like a dedicated memory pool (for slice-of-hardware VMs) and elimination of "struct page" (for offload setups where userspace _never_ needs to DMA from or into guest memory). guest_memfd is the result of 3+ years of development and exploration; taking on memory management responsibilities in KVM was not the first, second, or even third choice for supporting CoCo VMs. But after many failed attempts to avoid KVM-specific backing memory, and looking at where things ended up, it is quite clear that of all approaches tried, guest_memfd is the simplest, most robust, and most extensible, and the right thing to do for KVM and the kernel at-large. The "development cycle" for this version is going to be very short; ideally, next week I will merge it as is in kvm/next, taking this through the KVM tree for 6.8 immediately after the end of the merge window. The series is still based on 6.6 (plus KVM changes for 6.7) so it will require a small fixup for changes to get_file_rcu() introduced in 6.7 by commit 0ede61d8589c ("file: convert to SLAB_TYPESAFE_BY_RCU"). The fixup will be done as part of the merge commit, and most of the text above will become the commit message for the merge. Pending post-merge work includes: - hugepage support - looking into using the restrictedmem framework for guest memory - introducing a testing mechanism to poison memory, possibly using the same memory attributes introduced here - SNP and TDX support There are two non-KVM patches buried in the middle of this series: fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure() mm: Add AS_UNMOVABLE to mark mapping as completely unmovable The first is small and mostly suggested-by Christian Brauner; the second a bit less so but it was written by an mm person (Vlastimil Babka).
author: Paolo Bonzini <pbonzini@redhat.com> 2023-11-13 05:58:30 -0500
committer: Paolo Bonzini <pbonzini@redhat.com> 2023-11-14 08:31:31 -0500
commit: 6c370dc65374db5afbc5c6c64c662f922a2555ad (patch)
tree: c11f225af8afb218635822e52a7281715575d78b /mm/compaction.c
parent: b85ea95d086471afb4ad062012a4d73cd328fa86 (diff)
parent: 5d74316466f4aabdd2ee1e33b45e4933c9bc3ea1 (diff)
1 files changed, 31 insertions, 12 deletions
diff --git a/mm/compaction.c b/mm/compaction.c
index 01ba298739dd..e2d0c604ae42 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -882,6 +882,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 	/* Time to isolate some pages for migration */
 	for (; low_pfn < end_pfn; low_pfn++) {
+		bool is_dirty, is_unevictable;
 
 		if (skip_on_failure && low_pfn >= next_skip_pfn) {
 			/*
@@ -1079,8 +1080,10 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		if (!folio_test_lru(folio))
 			goto isolate_fail_put;
 
+		is_unevictable = folio_test_unevictable(folio);
+
 		/* Compaction might skip unevictable pages but CMA takes them */
-		if (!(mode & ISOLATE_UNEVICTABLE) && folio_test_unevictable(folio))
+		if (!(mode & ISOLATE_UNEVICTABLE) && is_unevictable)
 			goto isolate_fail_put;
 
 		/*
@@ -1092,26 +1095,42 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_writeback(folio))
 			goto isolate_fail_put;
 
-		if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_dirty(folio)) {
-			bool migrate_dirty;
+		is_dirty = folio_test_dirty(folio);
+
+		if (((mode & ISOLATE_ASYNC_MIGRATE) && is_dirty) ||
+		    (mapping && is_unevictable)) {
+			bool migrate_dirty = true;
+			bool is_unmovable;
 
 			/*
 			 * Only folios without mappings or that have
-			 * a ->migrate_folio callback are possible to
-			 * migrate without blocking.  However, we may
-			 * be racing with truncation, which can free
-			 * the mapping.  Truncation holds the folio lock
-			 * until after the folio is removed from the page
-			 * cache so holding it ourselves is sufficient.
+			 * a ->migrate_folio callback are possible to migrate
+			 * without blocking.
+			 *
+			 * Folios from unmovable mappings are not migratable.
+			 *
+			 * However, we can be racing with truncation, which can
+			 * free the mapping that we need to check. Truncation
+			 * holds the folio lock until after the folio is removed
+			 * from the page so holding it ourselves is sufficient.
+			 *
+			 * To avoid locking the folio just to check unmovable,
+			 * assume every unmovable folio is also unevictable,
+			 * which is a cheaper test.  If our assumption goes
+			 * wrong, it's not a correctness bug, just potentially
+			 * wasted cycles.
 			 */
 			if (!folio_trylock(folio))
 				goto isolate_fail_put;
 
 			mapping = folio_mapping(folio);
-			migrate_dirty = !mapping ||
-					mapping->a_ops->migrate_folio;
+			if ((mode & ISOLATE_ASYNC_MIGRATE) && is_dirty) {
+				migrate_dirty = !mapping ||
+						mapping->a_ops->migrate_folio;
+			}
+			is_unmovable = mapping && mapping_unmovable(mapping);
 			folio_unlock(folio);
-			if (!migrate_dirty)
+			if (!migrate_dirty || is_unmovable)
 				goto isolate_fail_put;
 		}
author	Paolo Bonzini <pbonzini@redhat.com>	2023-11-13 05:58:30 -0500
committer	Paolo Bonzini <pbonzini@redhat.com>	2023-11-14 08:31:31 -0500
commit	6c370dc65374db5afbc5c6c64c662f922a2555ad (patch)
tree	c11f225af8afb218635822e52a7281715575d78b /mm/compaction.c
parent	b85ea95d086471afb4ad062012a4d73cd328fa86 (diff)
parent	5d74316466f4aabdd2ee1e33b45e4933c9bc3ea1 (diff)