[PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR

Fri Mar 19 22:56:47 AEDT 2021

Daniel Axtens <dja at axtens.net> writes:
> Michael Ellerman <mpe at ellerman.id.au> writes:
>
>> When we enabled STRICT_KERNEL_RWX we received some reports of boot
>> failures when using the Hash MMU and running under phyp. The crashes
>> are intermittent, and often exhibit as a completely unresponsive
>> system, or possibly an oops.
>>
>> One example, which was caught in xmon:
>>
>>   [   14.068327][    T1] devtmpfs: mounted
>>   [   14.069302][    T1] Freeing unused kernel memory: 5568K
>>   [   14.142060][  T347] BUG: Unable to handle kernel instruction fetch
>>   [   14.142063][    T1] Run /sbin/init as init process
>>   [   14.142074][  T347] Faulting instruction address: 0xc000000000004400
>>   cpu 0x2: Vector: 400 (Instruction Access) at [c00000000c7475e0]
>>       pc: c000000000004400: exc_virt_0x4400_instruction_access+0x0/0x80
>>       lr: c0000000001862d4: update_rq_clock+0x44/0x110
>>       sp: c00000000c747880
>>      msr: 8000000040001031
>>     current = 0xc00000000c60d380
>>     paca    = 0xc00000001ec9de80   irqmask: 0x03   irq_happened: 0x01
>>       pid   = 347, comm = kworker/2:1
>>   ...
>>   enter ? for help
>>   [c00000000c747880] c0000000001862d4 update_rq_clock+0x44/0x110 (unreliable)
>>   [c00000000c7478f0] c000000000198794 update_blocked_averages+0xb4/0x6d0
>>   [c00000000c7479f0] c000000000198e40 update_nohz_stats+0x90/0xd0
>>   [c00000000c747a20] c0000000001a13b4 _nohz_idle_balance+0x164/0x390
>>   [c00000000c747b10] c0000000001a1af8 newidle_balance+0x478/0x610
>>   [c00000000c747be0] c0000000001a1d48 pick_next_task_fair+0x58/0x480
>>   [c00000000c747c40] c000000000eaab5c __schedule+0x12c/0x950
>>   [c00000000c747cd0] c000000000eab3e8 schedule+0x68/0x120
>>   [c00000000c747d00] c00000000016b730 worker_thread+0x130/0x640
>>   [c00000000c747da0] c000000000174d50 kthread+0x1a0/0x1b0
>>   [c00000000c747e10] c00000000000e0f0 ret_from_kernel_thread+0x5c/0x6c
>>
>> This shows that CPU 2, which was idle, woke up and then appears to
>> randomly take an instruction fault on a completely valid area of
>> kernel text.
>>
>> The cause turns out to be the call to hash__mark_rodata_ro(), late in
>> boot. Due to the way we layout text and rodata, that function actually
>> changes the permissions for all of text and rodata to read-only plus
>> execute.
>>
>> To do the permission change we use a hypervisor call, H_PROTECT. On
>> phyp that appears to be implemented by briefly removing the mapping of
>> the kernel text, before putting it back with the updated permissions.
>> If any other CPU is executing during that window, it will see spurious
>> faults on the kernel text and/or data, leading to crashes.
>
> Jordan asked why we saw this on phyp but not under KVM? We had a look at
> book3s_hv_rm_mmu.c but the code is a bit too obtuse for me to reason
> about!
>
> Nick suggests that the KVM hypervisor is invalidating the HPTE, but
> because we run guests in VPM mode, the hypervisor would catch the page
> fault and not reflect it down to the guest. It looks like Linux-as-a-HV
> will take HPTE_V_HVLOCK, and then because it's running in VPM mode, the
> hypervisor will catch the fault and not pass it to the guest.

Yep.

> But if phyp runs with VPM mode off, the guest will see the fault
> before the hypervisor. (we think this is what's going on anyway.)

Yeah. I assumed phyp always ran with VPM=1, but apparently it can run
with it off or on, depending on various configuration settings.

So I'm fairly sure what we're hitting here is VPM=0, where the faults go
straight to the guest.

cheers