[PATCH v3 0/3] Off-load TLB invalidations to host for !GTSE

Qian Cai cai at lca.pw
Fri Jul 17 03:27:14 AEST 2020


On Fri, Jul 03, 2020 at 11:06:05AM +0530, Bharata B Rao wrote:
> Hypervisor may choose not to enable Guest Translation Shootdown Enable
> (GTSE) option for the guest. When GTSE isn't ON, the guest OS isn't
> permitted to use instructions like tblie and tlbsync directly, but is
> expected to make hypervisor calls to get the TLB flushed.
> 
> This series enables the TLB flush routines in the radix code to
> off-load TLB flushing to hypervisor via the newly proposed hcall
> H_RPT_INVALIDATE. 
> 
> To easily check the availability of GTSE, it is made an MMU feature.
> The OV5 handling and H_REGISTER_PROC_TBL hcall are changed to
> handle GTSE as an optionally available feature and to not assume GTSE
> when radix support is available.
> 
> The actual hcall implementation for KVM isn't included in this
> patchset and will be posted separately.
> 
> Changes in v3
> =============
> - Fixed a bug in the hcall wrapper code where we were missing setting
>   H_RPTI_TYPE_NESTED while retrying the failed flush request with
>   a full flush for the nested case.
> - s/psize_to_h_rpti/psize_to_rpti_pgsize
> 
> v2: https://lore.kernel.org/linuxppc-dev/20200626131000.5207-1-bharata@linux.ibm.com/T/#t
> 
> Bharata B Rao (2):
>   powerpc/mm: Enable radix GTSE only if supported.
>   powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if
>     enabled
> 
> Nicholas Piggin (1):
>   powerpc/mm/book3s64/radix: Off-load TLB invalidations to host when
>     !GTSE

Reverting the whole series fixed random memory corruptions during boot on
POWER9 PowerNV systems below.

IBM 8335-GTH (ibm,witherspoon)
POWER9, altivec supported
262144 MB memory, 2000 GB disk space

.config:
https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config

[    9.338996][  T925] BUG: Unable to handle kernel instruction fetch (NULL pointer?)
[    9.339026][  T925] Faulting instruction address: 0x00000000
[    9.339051][  T925] Oops: Kernel access of bad area, sig: 11 [#1]
[    9.339064][  T925] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 NUMA PowerNV
[    9.339098][  T925] Modules linked in: dm_mirror dm_region_hash dm_log dm_mod
[    9.339150][  T925] CPU: 92 PID: 925 Comm: (md-udevd) Not tainted 5.8.0-rc5-next-20200716 #3
[    9.339186][  T925] NIP:  0000000000000000 LR: c00000000021f2cc CTR: 0000000000000000
[    9.339210][  T925] REGS: c000201cb52d79b0 TRAP: 0400   Not tainted  (5.8.0-rc5-next-20200716)
[    9.339244][  T925] MSR:  9000000040009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24222292  XER: 00000000
[    9.339278][  T925] CFAR: c00000000021f2c8 IRQMASK: 0 
[    9.339278][  T925] GPR00: c00000000021f2cc c000201cb52d7c40 c000000005901000 c000201cb52d7ca8 
[    9.339278][  T925] GPR04: c00800000ea60038 0000000000000000 000000007fff0000 000000007fff0000 
[    9.339278][  T925] GPR08: 0000000000000000 0000000000000000 c000201cb50bd500 0000000000000003 
[    9.339278][  T925] GPR12: 0000000000000000 c000201fff694500 00007fffa4a8a940 00007fffa4a8a6c8 
[    9.339278][  T925] GPR16: 00007fffa4a8a8f8 00007fffa4a8a650 00007fffa4a8a488 0000000000000000 
[    9.339278][  T925] GPR20: 0000000000050001 00007fffa4a8a984 000000007fff0000 c00000000a4545cc 
[    9.339278][  T925] GPR24: c000000000affe28 0000000000000000 0000000000000000 0000000000000166 
[    9.339278][  T925] GPR28: c000201cb52d7ca8 c00800000ea60000 c000201cc3b72600 000000007fff0000 
[    9.339493][  T925] NIP [0000000000000000] 0x0
[    9.339516][  T925] LR [c00000000021f2cc] __seccomp_filter+0xec/0x530
bpf_dispatcher_nop_func at include/linux/bpf.h:567
(inlined by) bpf_prog_run_pin_on_cpu at include/linux/filter.h:597
(inlined by) seccomp_run_filters at kernel/seccomp.c:324
(inlined by) __seccomp_filter at kernel/seccomp.c:937
[    9.339538][  T925] Call Trace:
[    9.339548][  T925] [c000201cb52d7c40] [c00000000021f2cc] __seccomp_filter+0xec/0x530 (unreliable)
[    9.339566][  T925] [c000201cb52d7d50] [c000000000025af8] do_syscall_trace_enter+0xb8/0x470
do_seccomp at arch/powerpc/kernel/ptrace/ptrace.c:252
(inlined by) do_syscall_trace_enter at arch/powerpc/kernel/ptrace/ptrace.c:327
[    9.339600][  T925] [c000201cb52d7dc0] [c00000000002c8f8] system_call_exception+0x138/0x180
[    9.339625][  T925] [c000201cb52d7e20] [c00000000000c9e8] system_call_common+0xe8/0x214
[    9.339648][  T925] Instruction dump:
[    9.339667][  T925] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 
[    9.339706][  T925] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 
[    9.339748][  T925] ---[ end trace d89eb80f9a6bc141 ]---
[  OK  ] Started Journal Service.
[   10.452364][  T925] Kernel panic - not syncing: Fatal exception
[   11.876655][  T925] ---[ end Kernel panic - not syncing: Fatal exception ]---

There could also be lots of random userspace segfault like,

[   16.463545][  T771] rngd[771]: segfault (11) at 0 nip 0 lr 0 code 1 in rngd[106d60000+20000]
[   16.463620][  T771] rngd[771]: code: XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 
[   16.463656][  T771] rngd[771]: code: XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX

Occasionally, there are many soft-lockups,

[  396.920702][   C99] watchdog: BUG: soft lockup - CPU#99 stuck for 22s! [(spawn):2692]
[  396.920754][   C99] Modules linked in: kvm_hv kvm ip_tables x_tables sd_mod bnx2x tg3 ahci mdio libahci libphy firmware_class libata dm_mirror dm_region_hash dm_log dm_mod
[  396.920843][   C99] irq event stamp: 1731717220
[  396.920860][   C99] hardirqs last  enabled at (1731717219): [<c00000000004d6f4>] do_page_fault+0x324/0xd90
[  396.920889][   C99] hardirqs last disabled at (1731717220): [<c000000000015638>] arch_local_irq_restore+0x48/0xd0
[  396.920919][   C99] softirqs last  enabled at (41260): [<c0000000009abbe8>] __do_softirq+0x648/0x8c8
[  396.920948][   C99] softirqs last disabled at (41125): [<c0000000000d717c>] irq_exit+0x15c/0x1c0
[  396.920976][   C99] CPU: 99 PID: 2692 Comm: (spawn) Tainted: G             L    5.8.0-rc5-next-20200716 #3
[  396.921001][   C99] NIP:  c0000000000152b4 LR: c000000000015640 CTR: 0000000000000000
[  396.921037][   C99] REGS: c000201cbc3d7178 TRAP: 0900   Tainted: G             L     (5.8.0-rc5-next-20200716)
[  396.921074][   C99] MSR:  9000000000001033 <SF,HV,ME,IR,DR,RI,LE>  CR: 28022482  XER: 20040000
[  396.921122][   C99] CFAR: 0000000000000000 IRQMASK: 3 
[  396.921122][   C99] GPR00: c000000000015640 c000201cbc3d7340 c000000005901000 c000201cbc3d7178 
[  396.921122][   C99] GPR04: c0000000057d7280 0000000000000000 000000000002000a 0000000000000003 
[  396.921122][   C99] GPR08: 0000201cc61c0000 0000000000000000 0000000000000001 c00000000593f868 
[  396.921122][   C99] GPR12: 0000000000002000 c000201fff67e700 00007fffdcda3918 0000000139eeba60 
[  396.921122][   C99] GPR16: 0000000139f30130 00007fffdcda39c8 0000000139eea708 0000000000000000 
[  396.921122][   C99] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000008 
[  396.921122][   C99] GPR24: 0000000000000e60 0000000000000900 0000000000000500 0000000000000a00 
[  396.921122][   C99] GPR28: 0000000000000f00 0000000000000002 0000000000000003 c000201ca4212400 
[  396.921468][   C99] NIP [c0000000000152b4] replay_soft_interrupts+0x74/0x3b0
replay_soft_interrupts at arch/powerpc/kernel/irq.c:216
[  396.921504][   C99] LR [c000000000015640] arch_local_irq_restore+0x50/0xd0
arch_local_irq_restore at arch/powerpc/kernel/irq.c:375
[  396.921539][   C99] Call Trace:
[  396.921560][   C99] [c000201cbc3d7340] [c000000000015640] arch_local_irq_restore+0x50/0xd0 (unreliable)
[  396.921602][   C99] [c000201cbc3d7360] [c0000000009a0c68] lock_is_held_type+0xf8/0x180
[  396.921641][   C99] [c000201cbc3d73c0] [c0000000003e8cf0] mem_cgroup_from_task+0xa0/0x130
[  396.921666][   C99] [c000201cbc3d7400] [c000000000337950] handle_mm_fault+0x140/0x1d20
[  396.921703][   C99] [c000201cbc3d7500] [c00000000004d5ac] do_page_fault+0x1dc/0xd90
[  396.921763][   C99] [c000201cbc3d7600] [c00000000000c028] handle_page_fault+0x10/0x2c
[  396.921804][   C99] --- interrupt: 300 at futex_cleanup+0x3c0/0x740
[  396.921804][   C99]     LR = futex_cleanup+0x35c/0x740
[  396.921879][   C99] [c000201cbc3d79c0] [c0000000001df2e8] futex_exec_release+0x28/0x50
[  396.921929][   C99] [c000201cbc3d79f0] [c0000000000c5e54] exec_mm_release+0x24/0x50
[  396.921968][   C99] [c000201cbc3d7a30] [c000000000421e84] begin_new_exec+0x324/0xea0
[  396.922005][   C99] [c000201cbc3d7af0] [c0000000004d8f1c] load_elf_binary+0x7fc/0x1110
[  396.922042][   C99] [c000201cbc3d7bf0] [c000000000420824] exec_binprm+0x1c4/0x7d0
[  396.922079][   C99] [c000201cbc3d7cb0] [c000000000421540] do_execveat_common+0x710/0x960
[  396.922117][   C99] [c000201cbc3d7d90] [c000000000422a44] sys_execve+0x44/0x60
[  396.922156][   C99] [c000201cbc3d7dc0] [c00000000002c8b8] system_call_exception+0xf8/0x180
[  396.922205][   C99] [c000201cbc3d7e20] [c00000000000c9e8] system_call_common+0xe8/0x214
[  396.922253][   C99] Instruction dump:
[  396.922286][   C99] 3b000e60 3b400500 3b600a00 3b800f00 f8010010 f821fe11 38610028 e92d0c70 
[  396.922316][   C99] f9210198 39200000 8aed0989 48037df9 <60000000> 39200003 f9210160 56e90738


[  248.821138][  T676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[  248.821170][  T676] khugepaged      D28416   682      2 0x00000800 
[  248.821212][  T676] Call Trace: 
[  248.821241][  T676] [c000001ff310f4e0] [c000000000caeb80] lru_add_drain_work+0x0/0x48 (unreliable) 
[  248.821275][  T676] [c000001ff310f6c0] [c00000000001a2d0] __switch_to+0x260/0x380 
[  248.821308][  T676] [c000001ff310f720] [c0000000009a18b8] __schedule+0x398/0x9f0 
[  248.821352][  T676] [c000001ff310f7f0] [c0000000009a1fa8] schedule+0x98/0x160 
[  248.821387][  T676] [c000001ff310f820] [c0000000009a9814] schedule_timeout+0x304/0x520 
[  248.821432][  T676] [c000001ff310f960] [c0000000009a3c84] wait_for_completion+0xc4/0x1b0 
[  248.821460][  T676] [c000001ff310f9d0] [c0000000000fd0c8] __flush_work+0x3b8/0x770 
[  248.821491][  T676] [c000001ff310faf0] [c0000000002e0ac4] lru_add_drain_all+0x3e4/0x760 
[  248.821521][  T676] [c000001ff310fbf0] [c0000000003e0f18] khugepaged+0xd8/0x1770 
[  248.821560][  T676] [c000001ff310fdb0] [c0000000001095fc] kthread+0x1bc/0x1d0 
[  248.821611][  T676] [c000001ff310fe20] [c00000000000cbc4] ret_from_kernel_thread+0x5c/0x78 
[  248.821655][  T676] INFO: task kworker/56:1:719 blocked for more than 122 seconds. 
[  248.821689][  T676]       Tainted: G             L    5.8.0-rc5-next-20200716 #3 
[  248.821729][  T676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[  248.821779][  T676] kworker/56:1    D27584   719      2 0x00000800 
[  248.821839][  T676] Workqueue: rcu_gp wait_rcu_exp_gp 
[  248.821862][  T676] Call Trace: 
[  248.821888][  T676] [c000001ff2b57660] [c0000000057c9fb0] rcu_state+0x4fb0/0x5100 (unreliable) 
[  248.821934][  T676] [c000001ff2b57840] [c00000000001a2d0] __switch_to+0x260/0x380 
[  248.821977][  T676] [c000001ff2b578a0] [c0000000009a18b8] __schedule+0x398/0x9f0 
[  248.822021][  T676] [c000001ff2b57970] [c0000000009a1fa8] schedule+0x98/0x160 
[  248.822066][  T676] [c000001ff2b579a0] [c0000000009a970c] schedule_timeout+0x1fc/0x520 
[  248.822110][  T676] [c000001ff2b57ae0] [c0000000001a86d0] rcu_exp_wait_wake+0x1b0/0x950 
[  248.822153][  T676] [c000001ff2b57c30] [c0000000000fb754] process_one_work+0x304/0x900 
[  248.822197][  T676] [c000001ff2b57d20] [c0000000000fbdc8] worker_thread+0x78/0x520 
[  248.822242][  T676] [c000001ff2b57db0] [c0000000001095fc] kthread+0x1bc/0x1d0 
[  248.822279][  T676] [c000001ff2b57e20] [c00000000000cbc4] ret_from_kernel_thread+0x5c/0x78 
[  248.822385][  T676] INFO: task lvm:3123 blocked for more than 122 seconds. 
[  248.822413][  T676]       Tainted: G             L    5.8.0-rc5-next-20200716 #3 
[  248.822462][  T676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[  248.822503][  T676] lvm             D26608  3123      1 0x00040000 
[  248.822552][  T676] Call Trace: 
[  248.822576][  T676] [c000001feee27a00] [c00000000001a2d0] __switch_to+0x260/0x380 
[  248.822620][  T676] [c000001feee27a60] [c0000000009a18b8] __schedule+0x398/0x9f0 
[  248.822648][  T676] [c000001feee27b30] [c0000000009a1fa8] schedule+0x98/0x160 
[  248.822680][  T676] [c000001feee27b60] [c0000000009a9814] schedule_timeout+0x304/0x520 
[  248.822724][  T676] [c000001feee27ca0] [c0000000009a3c84] wait_for_completion+0xc4/0x1b0 
[  248.822768][  T676] [c000001feee27d10] [c0000000004b0e88] sys_io_destroy+0x238/0x2f0 
[  248.822808][  T676] [c000001feee27dc0] [c00000000002c8b8] system_call_exception+0xf8/0x180 
[  248.822840][  T676] [c000001feee27e20] [c00000000000c9e8] system_call_common+0xe8/0x214 
[  248.822873][  T676] INFO: task lvm:3126 blocked for more than 122 seconds. 
[  248.822901][  T676]       Tainted: G             L    5.8.0-rc5-next-20200716 #3 
[  248.822938][  T676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[  248.822987][  T676] lvm             D26608  3126      1 0x00040000 
[  248.823017][  T676] Call Trace: 
[  248.823031][  T676] [c000001fc0b27a00] [c00000000001a2d0] __switch_to+0x260/0x380 
[  248.823075][  T676] [c000001fc0b27a60] [c0000000009a18b8] __schedule+0x398/0x9f0 
[  248.823113][  T676] [c000001fc0b27b30] [c0000000009a1fa8] schedule+0x98/0x160 
[  248.823158][  T676] [c000001fc0b27b60] [c0000000009a9814] schedule_timeout+0x304/0x520 
[  248.823199][  T676] [c000001fc0b27ca0] [c0000000009a3c84] wait_for_completion+0xc4/0x1b0 
[  248.823250][  T676] [c000001fc0b27d10] [c0000000004b0e88] sys_io_destroy+0x238/0x2f0 
[  248.823294][  T676] [c000001fc0b27dc0] [c00000000002c8b8] system_call_exception+0xf8/0x180 
[  248.823332][  T676] [c000001fc0b27e20] [c00000000000c9e8] system_call_common+0xe8/0x214 
[  248.823374][  T676] INFO: task auditd:3163 blocked for more than 122 seconds. 
[  248.823424][  T676]       Tainted: G             L    5.8.0-rc5-next-20200716 #3 
[  248.823471][  T676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[  248.823512][  T676] auditd          D27088  3163      1 0x00042408 
[  248.823551][  T676] Call Trace: 
[  248.823583][  T676] [c000001fc080f760] [0000000200000001] 0x200000001 (unreliable) 
[  248.823648][  T676] [c000001fc080f940] [c00000000001a2d0] __switch_to+0x260/0x380 
[  248.823689][  T676] [c000001fc080f9a0] [c0000000009a18b8] __schedule+0x398/0x9f0 
[  248.823742][  T676] [c000001fc080fa70] [c0000000009a1fa8] schedule+0x98/0x160 
[  248.823784][  T676] [c000001fc080faa0] [c0000000001a9244] synchronize_rcu_expedited+0x394/0x600 
[  248.823837][  T676] [c000001fc080fba0] [c0000000004504c4] namespace_unlock+0xf4/0x230 
[  248.823881][  T676] [c000001fc080fc00] [c000000000456dec] put_mnt_ns+0x5c/0x80 
[  248.823926][  T676] [c000001fc080fc30] [c00000000010ba6c] free_nsproxy+0x2c/0x1e0 
[  248.823966][  T676] [c000001fc080fc60] [c0000000000d5130] do_exit+0x4e0/0xee0 
[  248.823997][  T676] [c000001fc080fd60] [c0000000000d5bec] do_group_exit+0x5c/0xd0 
[  248.824019][  T676] [c000001fc080fda0] [c0000000000d5c7c] sys_exit_group+0x1c/0x20 
[  248.824060][  T676] [c000001fc080fdc0] [c00000000002c8b8] system_call_exception+0xf8/0x180 
[  248.824103][  T676] [c000001fc080fe20] [c00000000000c9e8] system_call_common+0xe8/0x214 
[  248.824192][  T676]  
[  248.824192][  T676] Showing all locks held in the system: 
[  248.824419][  T676] 1 lock held by khungtaskd/676: 
[  248.824455][  T676]  #0: c0000000057c44c0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.29+0x8/0x30 
[  248.824531][  T676] 1 lock held by khugepaged/682: 
[  248.824565][  T676]  #0: c0000000057f42c8 (lock#4){+.+.}-{3:3}, at: lru_add_drain_all+0x68/0x760 
[  248.824679][  T676] 2 locks held by kworker/56:1/719: 
[  248.824742][  T676]  #0: c00000000bcc8938 ((wq_completion)rcu_gp){+.+.}-{0:0}, at: process_one_work+0x21c/0x900 
[  248.824857][  T676]  #1: c000001ff2b57c90 ((work_completion)(&rew.rew_work)){+.+.}-{0:0}, at: process_one_work+0x21c/0x900 
[  248.825026][  T676] 3 locks held by (spawn)/2692: 
[  248.825077][  T676] 1 lock held by auditd/3163: 
[  248.825135][  T676]  #0: c0000000057c9ee8 (rcu_state.exp_mutex){+.+.}-{3:3}, at: synchronize_rcu_expedited+0x254/0x600 
[  248.825296][  T676] ============================================= 
[  248.825296][  T676]  

> 
>  .../include/asm/book3s/64/tlbflush-radix.h    | 15 ++++
>  arch/powerpc/include/asm/hvcall.h             | 34 +++++++-
>  arch/powerpc/include/asm/mmu.h                |  4 +
>  arch/powerpc/include/asm/plpar_wrappers.h     | 52 ++++++++++++
>  arch/powerpc/kernel/dt_cpu_ftrs.c             |  1 +
>  arch/powerpc/kernel/prom_init.c               | 13 +--
>  arch/powerpc/mm/book3s64/radix_tlb.c          | 82 +++++++++++++++++--
>  arch/powerpc/mm/init_64.c                     |  5 +-
>  arch/powerpc/platforms/pseries/lpar.c         |  8 +-
>  9 files changed, 197 insertions(+), 17 deletions(-)
> 
> -- 
> 2.21.3
> 


More information about the Linuxppc-dev mailing list