[PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

Sat Oct 22 03:30:50 AEDT 2022

>> But maybe it is some RMW instruction ... then, if all the above options didn't happen ... we
>> could get another machine check from the same address. But then we just follow the usual
>> recovery path.

> Let's assume the instruction that causes the COW is in the 63/64 case, i.e.,
> it is writing a different cache line from the poisoned one. But the new_page
> allocated in COW is dropped, right? So it might page fault again?

It can, but this should be no surprise to a user that has a signal handler for
a h/w event (SIGBUS, SIGSEGV, SIGILL) that does nothing to address the
problem, but simply returns to re-execute the same instruction that caused
the original trap.

There may be badly written signal handlers that do this. But they just cause
pain for themselves. Linux can keep taking the traps and fixing things up and
sending a new signal over and over.

In this case that loop may involve taking the machine check again, so some
extra pain for the kernel. But a few generations back, recoverable machine
checks on Intel/x86 switched from broadcast to delivery to just the logical
CPU that tried to consume the poison. So it is only a bit more painful than
a repeated page fault.

-Tony