[PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
Michael Ellerman
mpe at ellerman.id.au
Fri Mar 19 22:56:47 AEDT 2021
Daniel Axtens <dja at axtens.net> writes:
> Michael Ellerman <mpe at ellerman.id.au> writes:
>
>> When we enabled STRICT_KERNEL_RWX we received some reports of boot
>> failures when using the Hash MMU and running under phyp. The crashes
>> are intermittent, and often exhibit as a completely unresponsive
>> system, or possibly an oops.
>>
>> One example, which was caught in xmon:
>>
>> [ 14.068327][ T1] devtmpfs: mounted
>> [ 14.069302][ T1] Freeing unused kernel memory: 5568K
>> [ 14.142060][ T347] BUG: Unable to handle kernel instruction fetch
>> [ 14.142063][ T1] Run /sbin/init as init process
>> [ 14.142074][ T347] Faulting instruction address: 0xc000000000004400
>> cpu 0x2: Vector: 400 (Instruction Access) at [c00000000c7475e0]
>> pc: c000000000004400: exc_virt_0x4400_instruction_access+0x0/0x80
>> lr: c0000000001862d4: update_rq_clock+0x44/0x110
>> sp: c00000000c747880
>> msr: 8000000040001031
>> current = 0xc00000000c60d380
>> paca = 0xc00000001ec9de80 irqmask: 0x03 irq_happened: 0x01
>> pid = 347, comm = kworker/2:1
>> ...
>> enter ? for help
>> [c00000000c747880] c0000000001862d4 update_rq_clock+0x44/0x110 (unreliable)
>> [c00000000c7478f0] c000000000198794 update_blocked_averages+0xb4/0x6d0
>> [c00000000c7479f0] c000000000198e40 update_nohz_stats+0x90/0xd0
>> [c00000000c747a20] c0000000001a13b4 _nohz_idle_balance+0x164/0x390
>> [c00000000c747b10] c0000000001a1af8 newidle_balance+0x478/0x610
>> [c00000000c747be0] c0000000001a1d48 pick_next_task_fair+0x58/0x480
>> [c00000000c747c40] c000000000eaab5c __schedule+0x12c/0x950
>> [c00000000c747cd0] c000000000eab3e8 schedule+0x68/0x120
>> [c00000000c747d00] c00000000016b730 worker_thread+0x130/0x640
>> [c00000000c747da0] c000000000174d50 kthread+0x1a0/0x1b0
>> [c00000000c747e10] c00000000000e0f0 ret_from_kernel_thread+0x5c/0x6c
>>
>> This shows that CPU 2, which was idle, woke up and then appears to
>> randomly take an instruction fault on a completely valid area of
>> kernel text.
>>
>> The cause turns out to be the call to hash__mark_rodata_ro(), late in
>> boot. Due to the way we lay out text and rodata, that function actually
>> changes the permissions for all of text and rodata to read-only plus
>> execute.
>>
>> To do the permission change we use a hypervisor call, H_PROTECT. On
>> phyp that appears to be implemented by briefly removing the mapping of
>> the kernel text, before putting it back with the updated permissions.
>> If any other CPU is executing during that window, it will see spurious
>> faults on the kernel text and/or data, leading to crashes.
>
> Jordan asked why we saw this on phyp but not under KVM? We had a look at
> book3s_hv_rm_mmu.c but the code is a bit too obtuse for me to reason
> about!
>
> Nick suggests that the KVM hypervisor is invalidating the HPTE, but
> because we run guests in VPM mode, the hypervisor would catch the page
> fault and not reflect it down to the guest. It looks like Linux-as-a-HV
> will take HPTE_V_HVLOCK, and then because it's running in VPM mode, the
> hypervisor will catch the fault and not pass it to the guest.

Yep.

> But if phyp runs with VPM mode off, the guest will see the fault
> before the hypervisor. (we think this is what's going on anyway.)

Yeah. I assumed phyp always ran with VPM=1, but apparently it can run
with it off or on, depending on various configuration settings.

So I'm fairly sure what we're hitting here is VPM=0, where the faults go
straight to the guest.
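
To make the theory above concrete, here is a toy model of who ends up
seeing a fault on a transiently invalid HPTE. This is only a sketch of
our current understanding, not kernel or hypervisor code: every name in
it (toy_hpte, toy_route_fault, the enum values) is made up for
illustration, and the real behaviour depends on hypervisor internals we
can't see.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in the kernel or in phyp. */
enum toy_fault_route {
	TOY_NO_FAULT,			/* mapping present, access succeeds */
	TOY_HANDLED_BY_HYPERVISOR,	/* VPM=1: fault is not reflected to the guest */
	TOY_DELIVERED_TO_GUEST,		/* VPM=0: guest takes the 0x400/0x300 itself */
};

struct toy_hpte {
	bool valid;	/* false while the mapping is briefly removed */
};

/* Decide where an access through this HPTE ends up, given the VPM setting. */
static enum toy_fault_route toy_route_fault(const struct toy_hpte *hpte, bool vpm)
{
	if (hpte->valid)
		return TOY_NO_FAULT;

	return vpm ? TOY_HANDLED_BY_HYPERVISOR : TOY_DELIVERED_TO_GUEST;
}

/* Walk through the three cases discussed in this thread. */
static void toy_selfcheck(void)
{
	struct toy_hpte text = { .valid = false };	/* inside the update window */

	assert(toy_route_fault(&text, true) == TOY_HANDLED_BY_HYPERVISOR);
	assert(toy_route_fault(&text, false) == TOY_DELIVERED_TO_GUEST);

	text.valid = true;				/* mapping restored */
	assert(toy_route_fault(&text, false) == TOY_NO_FAULT);
}
```

In this model only the VPM=0 case lets another CPU in the guest observe
the window, which would match seeing the crashes on phyp but not under
KVM.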

cheers