[RFC PATCH 00/21] Avoid IPI while updating page table entries.

Aneesh Kumar K.V aneesh.kumar at linux.ibm.com
Thu Feb 27 17:55:59 AEDT 2020


Problem Summary:
Termination of a KVM guest configured with a large amount of RAM is slow because
of the large number of IPIs sent while clearing level 1 PTE (THP) entries.
This is shown in the stack trace below.


- qemu-system-ppc  [kernel.vmlinux]            [k] smp_call_function_many
   - smp_call_function_many
      - 36.09% smp_call_function_many
           serialize_against_pte_lookup 
           radix__pmdp_huge_get_and_clear
           zap_huge_pmd
           unmap_page_range
           unmap_vmas
           unmap_region
           __do_munmap
           __vm_munmap
           sys_munmap
           system_call
           __munmap
           qemu_ram_munmap
           qemu_anon_ram_free
           reclaim_ramblock
           call_rcu_thread
           qemu_thread_start 
           start_thread
           __clone

Why we need to do an IPI when clearing PMD entries:
This was added as part of commit: 13bd817bb884 ("powerpc/thp: Serialize pmd clear against a linux page table walk")

serialize_against_pte_lookup makes sure that all parallel lockless page table
walks complete before we convert a PMD pte entry to a regular pmd entry.
We end up doing that conversion in the scenarios below:

1) __split_huge_zero_page_pmd
2) do_huge_pmd_wp_page_fallback
3) MADV_DONTNEED running parallel to page faults.

local_irq_disable and lockless page table walk:

The lockless page table walk works on the assumption that we can dereference
the page table contents without holding a lock. For this to work, we need to
make sure we read the page table contents atomically and that page table pages
are not freed/released while we are walking them. We can achieve this by using
RCU-based freeing for page table pages or, if the architecture implements
broadcast tlbie, we can block the IPI as we walk the page table pages.

To support both of the above frameworks, the lockless page table walk is done
with irqs disabled instead of rcu_read_lock().

We have two interfaces for lockless page table walks: gup fast and __find_linux_pte.
This patch series makes the __find_linux_pte table walk safe against the conversion of
a PMD PTE to a regular PMD.

gup fast:

gup fast is already safe against THP split because the kernel now differentiates between
a pmd split and a compound page split. gup fast can run in parallel with a pmd split, and
we prevent a gup fast from running in parallel with a hugepage split by freezing the page
refcount and failing the speculative page ref increment.
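
The interaction between a frozen refcount and a speculative reference can be
modelled in userspace. The sketch below is only an illustration using C11
atomics: struct model_page and the model_* helpers are hypothetical names made
up for this example, not kernel interfaces, and the real implementation may
differ in detail.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page; only the reference count is modelled. */
struct model_page {
	atomic_int refcount;
};

/*
 * Speculative reference: take a reference only if the count is non-zero.
 * A count of zero means "frozen" (or free), so the caller must back off.
 */
static bool model_get_unless_zero(struct model_page *p)
{
	int old = atomic_load(&p->refcount);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Freeze: succeeds only if the count is exactly what the splitter expects,
 * i.e. nobody else holds an extra reference. On success the count is zero.
 */
static bool model_ref_freeze(struct model_page *p, int expected)
{
	return atomic_compare_exchange_strong(&p->refcount, &expected, 0);
}

/* Unfreeze: publish a non-zero count again once the split is done. */
static void model_ref_unfreeze(struct model_page *p, int count)
{
	assert(count > 0);
	atomic_store(&p->refcount, count);
}
```

While the count is frozen at zero, model_get_unless_zero() fails, which is the
point at which a gup-fast style walker would give up and fall back to the slow
path.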


Similar to how gup is safe against parallel pmd split, this patch series updates the
__find_linux_pte callers to be safe against a parallel pmd split. We do that by enforcing
the following rules.

1) Don't reload the pte value, because it can be updated in parallel.
2) Code should be able to work with a stale PTE value rather than the most recent one, i.e.,
the pte value that we are looking at may not be the latest value in the page table.
3) Before looking at the pte value, check for the _PAGE_PTE bit. We now do this as part of
the pte_present() check.
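
As a rough illustration of these three rules, a walker can be modelled in
userspace as below. This is a sketch under stated assumptions: the MODEL_*
bit positions, model_entry_t and the model_* functions are invented for this
example and are not the kernel definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit layout for this model only. */
#define MODEL_PAGE_PRESENT	(UINT64_C(1) << 63)
#define MODEL_PAGE_PTE		(UINT64_C(1) << 62)
#define MODEL_PFN_SHIFT		12
#define MODEL_PFN_MASK		((UINT64_C(1) << 40) - 1)

/* One page table slot that another CPU may rewrite at any time. */
typedef _Atomic uint64_t model_entry_t;

/* Rule 3: a value counts as a present leaf only if the PTE bit is also set. */
static bool model_pte_present(uint64_t val)
{
	const uint64_t need = MODEL_PAGE_PTE | MODEL_PAGE_PRESENT;

	return (val & need) == need;
}

/* Returns true and fills *pfn if the snapshot describes a present leaf. */
static bool model_lookup_pfn(model_entry_t *slot, uint64_t *pfn)
{
	uint64_t snap;

	assert(slot != NULL && pfn != NULL);

	/* Rule 1: read the slot exactly once and never reload it. */
	snap = atomic_load_explicit(slot, memory_order_relaxed);

	if (!model_pte_present(snap))
		return false;

	/* Rule 2: decode only the snapshot, which may already be stale. */
	*pfn = (snap >> MODEL_PFN_SHIFT) & MODEL_PFN_MASK;
	return true;
}
```

If the slot is converted to a regular pmd (PTE bit clear) after the snapshot
was taken, the caller still sees a self-consistent, if stale, value; if the
conversion happened before the read, the lookup simply fails.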

NOTE: I am still looking for details w.r.t. the corruption fixed by
commit 33258a1db165 ("powerpc/64s: Fix THP PMD collapse serialisation").
Understanding that is important to make sure this series does not introduce a regression there.


Aneesh Kumar K.V (21):
  powerpc/pkeys: Avoid using lockless page table walk
  powerpc/pkeys: Check vma before returning key fault error to the user
  powerpc/mm/hash64: use _PAGE_PTE when checking for pte_present
  powerpc/hash64: Restrict page table lookup using init_mm with
    __flush_hash_table_range
  powerpc/book3s64/hash: Use the pte_t address from the caller
  powerpc/book3s/hash64/devmap: Use H_PAGE_THP_HUGE when setting up
    level huge devmap pte entries
  powerpc/mce: Don't reload pte val in addr_to_pfn
  powerpc/perf/callchain: Use __get_user_pages_fast in
    read_user_stack_slow
  powerpc/kvm/book3s: switch from raw_spin_*lock to arch_spin_lock.
  powerpc/kvm/book3s: Add helper to walk partition scoped linux page
    table.
  powerpc/kvm/nested: Add helper to walk nested shadow linux page table.
  powerpc/kvm/book3s: Use kvm helpers to walk shadow or secondary table
  powerpc/kvm/book3s: Add helper for host page table walk
  powerpc/kvm/book3s: Use find_kvm_host_pte in page fault handler
  powerpc/kvm/book3s: Use find_kvm_host_pte in h_enter
  powerpc/kvm/book3s: use find_kvm_host_pte in pute_tce functions
  powerpc/kvm/book3s: Avoid using rmap to protect parallel page table
    update.
  powerpc/kvm/book3s: use find_kvm_host_pte in
    kvmppc_book3s_instantiate_page
  powerpc/kvm/book3s: Use find_kvm_host_pte in kvmppc_get_hpa
  powerpc/kvm/book3s: Use pte_present instead of opencoding
    _PAGE_PRESENT check
  powerpc/mm/book3s64: Avoid sending IPI on clearing PMD

 arch/powerpc/include/asm/book3s/64/hash-4k.h  |  6 ++
 arch/powerpc/include/asm/book3s/64/hash-64k.h |  8 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 19 +++--
 arch/powerpc/include/asm/book3s/64/radix.h    |  5 ++
 .../include/asm/book3s/64/tlbflush-hash.h     |  3 +-
 arch/powerpc/include/asm/kvm_book3s.h         |  2 +-
 arch/powerpc/include/asm/kvm_book3s_64.h      | 34 ++++++++-
 arch/powerpc/include/asm/mmu.h                |  9 ---
 arch/powerpc/kernel/mce_power.c               | 14 ++--
 arch/powerpc/kernel/pci_64.c                  |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c           | 12 ++-
 arch/powerpc/kvm/book3s_64_mmu_radix.c        | 40 +++++-----
 arch/powerpc/kvm/book3s_64_vio_hv.c           | 64 ++++++++--------
 arch/powerpc/kvm/book3s_hv_nested.c           | 37 ++++++---
 arch/powerpc/kvm/book3s_hv_rm_mmu.c           | 58 +++++---------
 arch/powerpc/mm/book3s64/hash_pgtable.c       | 11 ---
 arch/powerpc/mm/book3s64/hash_tlb.c           | 16 +---
 arch/powerpc/mm/book3s64/hash_utils.c         | 62 ++++-----------
 arch/powerpc/mm/book3s64/pgtable.c            |  8 --
 arch/powerpc/mm/book3s64/radix_pgtable.c      | 20 ++---
 arch/powerpc/mm/fault.c                       | 75 +++++++++++++------
 arch/powerpc/perf/callchain.c                 | 53 ++++++-------
 22 files changed, 274 insertions(+), 284 deletions(-)

-- 
2.24.1


