[RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes

Wed Jul 26 21:20:16 AEST 2023

Hi Sean,

On 7/24/2023 10:30 PM, Sean Christopherson wrote:
> On Mon, Jul 24, 2023, Nikunj A. Dadhania wrote:
>> On 7/19/2023 5:14 AM, Sean Christopherson wrote:
>>> This is the next iteration of implementing fd-based (instead of vma-based)
>>> memory for KVM guests.  If you want the full background of why we are doing
>>> this, please go read the v10 cover letter[1].
>>>
>>> The biggest change from v10 is to implement the backing storage in KVM
>>> itself, and expose it via a KVM ioctl() instead of a "generic" sycall.
>>> See link[2] for details on why we pivoted to a KVM-specific approach.
>>>
>>> Key word is "biggest".  Relative to v10, there are many big changes.
>>> Highlights below (I can't remember everything that got changed at
>>> this point).
>>>
>>> Tagged RFC as there are a lot of empty changelogs, and a lot of missing
>>> documentation.  And ideally, we'll have even more tests before merging.
>>> There are also several gaps/opens (to be discussed in tomorrow's PUCK).
>>
>> As per our discussion on the PUCK call, here are the memory/NUMA accounting 
>> related observations that I had while working on SNP guest secure page migration:
>>
>> * gmem allocations are currently treated as file page allocations
>>   accounted to the kernel and not to the QEMU process.
> 
> We need to level set on terminology: these are all *stats*, not accounting.  That
> distinction matters because we have wiggle room on stats, e.g. we can probably get
> away with just about any definition of how guest_memfd memory impacts stats, so
> long as the information that is surfaced to userspace is useful and expected.
> 
> But we absolutely need to get accounting correct, specifically the allocations
> need to be correctly accounted in memcg.  And unless I'm missing something,
> nothing in here shows anything related to memcg.

I tried out memcg after creating a separate cgroup for the qemu process. Guest 
memory is accounted in memcg.

  $ egrep -w "file|file_thp|unevictable" memory.stat
  file 42978775040
  file_thp 42949672960
  unevictable 42953588736 

NUMA allocations are coming from right nodes as set by the numactl.

  $ egrep -w "file|file_thp|unevictable" memory.numa_stat
  file N0=0 N1=20480 N2=21489377280 N3=21489377280
  file_thp N0=0 N1=0 N2=21472739328 N3=21476933632
  unevictable N0=0 N1=0 N2=21474697216 N3=21478891520

> 
>>   Starting an SNP guest with 40G memory with memory interleave between
>>   Node2 and Node3
>>
>>   $ numactl -i 2,3 ./bootg_snp.sh
>>
>>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>  242179 root      20   0   40.4g  99580  51676 S  78.0   0.0   0:56.58 qemu-system-x86
>>
>>   -> Incorrect process resident memory and shared memory is reported
> 
> I don't know that I would call these "incorrect".  Shared memory definitely is
> correct, because by definition guest_memfd isn't shared.  RSS is less clear cut;
> gmem memory is resident in RAM, but if we show gmem in RSS then we'll end up with
> scenarios where RSS > VIRT, which will be quite confusing for unaware users (I'm
> assuming the 40g of VIRT here comes from QEMU mapping the shared half of gmem
> memslots).

I am not sure why will RSS exceed the VIRT, it should be at max 40G (assuming all the
memory is private)

As per my experiments with a hack below. MM_FILEPAGES does get accounted to RSS/SHR in top

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   4339 root      20   0   40.4g  40.1g  40.1g S  76.7  16.0   0:13.83 qemu-system-x86

diff --git a/mm/memory.c b/mm/memory.c
index f456f3b5049c..5b1f48a2e714 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -166,6 +166,7 @@ void mm_trace_rss_stat(struct mm_struct *mm, int member)
 {
        trace_rss_stat(mm, member);
 }
+EXPORT_SYMBOL(mm_trace_rss_stat);

 /*
  * Note: this doesn't free the actual pages themselves. That
diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
index a7e926af4255..e4f268bf9ce2 100644
--- a/virt/kvm/guest_mem.c
+++ b/virt/kvm/guest_mem.c
@@ -91,6 +91,10 @@ static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
                        clear_highpage(folio_page(folio, i));
        }

+       /* Account only once for the first time */
+       if (!folio_test_dirty(folio))
+               add_mm_counter(current->mm, MM_FILEPAGES, folio_nr_pages(folio));
+
        folio_mark_accessed(folio);
        folio_mark_dirty(folio);
        folio_mark_uptodate(folio);

We can update the rss_stat appropriately to get correct reporting in userspace.

>>   Accounting of the memory happens in the host page fault handler path,
>>   but for private guest pages we will never hit that.
>>
>> * NUMA allocation does use the process mempolicy for appropriate node 
>>   allocation (Node2 and Node3), but they again do not get attributed to 
>>   the QEMU process
>>
>>   Every 1.0s: sudo numastat  -m -p qemu-system-x86 | egrep -i "qemu|PID|Node|Filepage"   gomati: Mon Jul 24 11:51:34 2023
>>
>>   Per-node process memory usage (in MBs)
>>   PID                               Node 0          Node 1          Node 2          Node 3           Total
>>   242179 (qemu-system-x86)           21.14            1.61           39.44           39.38          101.57
>>
>>   Per-node system memory usage (in MBs):
>>                             Node 0          Node 1          Node 2          Node 3           Total
>>   FilePages                2475.63         2395.83        23999.46        23373.22        52244.14
>>
>>
>> * Most of the memory accounting relies on the VMAs and as private-fd of 
>>   gmem doesn't have a VMA(and that was the design goal), user-space fails 
>>   to attribute the memory appropriately to the process.
>>
>>   /proc/<qemu pid>/numa_maps
>>   7f528be00000 interleave:2-3 file=/memfd:memory-backend-memfd-shared\040(deleted) anon=1070 dirty=1070 mapped=1987 mapmax=256 active=1956 N2=582 N3=1405 kernelpagesize_kB=4
>>   7f5c90200000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted)
>>   7f5c90400000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=32 active=0 N2=32 kernelpagesize_kB=4
>>   7f5c90800000 interleave:2-3 file=/memfd:rom-backend-memfd-shared\040(deleted) dirty=892 active=0 N2=512 N3=380 kernelpagesize_kB=4
>>
>>   /proc/<qemu pid>/smaps
>>   7f528be00000-7f5c8be00000 rw-p 00000000 00:01 26629                      /memfd:memory-backend-memfd-shared (deleted)
>>   7f5c90200000-7f5c90220000 rw-s 00000000 00:01 44033                      /memfd:rom-backend-memfd-shared (deleted)
>>   7f5c90400000-7f5c90420000 rw-s 00000000 00:01 44032                      /memfd:rom-backend-memfd-shared (deleted)
>>   7f5c90800000-7f5c90b7c000 rw-s 00000000 00:01 1025                       /memfd:rom-backend-memfd-shared (deleted)
> 
> This is all expected, and IMO correct.  There are no userspace mappings, and so
> not accounting anything is working as intended.
Doesn't sound that correct, if 10 SNP guests are running each using 10GB, how would we know who is using 100GB of memory?

> 
>> * QEMU based NUMA bindings will not work. Memory backend uses mbind() 
>>   to set the policy for a particular virtual memory range but gmem 
>>   private-FD does not have a virtual memory range visible in the host.
> 
> Yes, adding a generic fbind() is the way to solve silve.

Regards,
Nikunj