[RFC] powerpc/powernv/mce: Don't silently restart the machine

Balbir Singh bsingharora at gmail.com
Wed Feb 21 16:02:08 AEDT 2018


On Wed, Feb 21, 2018 at 3:54 PM, Stewart Smith
<stewart at linux.vnet.ibm.com> wrote:
> Balbir Singh <bsingharora at gmail.com> writes:
>> On MCE the current code will restart the machine with
>> ppc_md.restart(). This case was extremely unlikely since
>> prior to that a skiboot call is made and that resulted in
>> a checkstop for analysis.
>>
>> With newer skiboots, on P9 we don't checkstop the box by
>> default, instead we return back to the kernel to extract
>> useful information at the time of the MCE. While we still
>> get this information, this patch converts the restart to
>> a panic(), so that if configured a dump can be taken and
>> we can track and probably debug the potential issue causing
>> the MCE.
>
> I agree with the patch, although I'd be nervous stating that skiboot is
> going to keep this behaviour. In *theory* we should only ever get a
> platform error when there's actually something that isn't the kernel's
> fault.
>
> Like any firmware promise though, it's slightly less reliable than one
> from a politician.
>
> I'd say that in this case deferring to policy on what to do in event of
> panic() is the right thing.
>

You're right, except that with NPUs and coherent device memory, things
change. It is a platform error on the device side; the larger issue is
that the CPU touched this memory. Ideally we want the device and
device driver to handle this error, but it's racy, as the CPU now treats
it as a platform error on the system side. IOW, the definition of
platform has grown, and so has the definition of platform error, and it
can no longer be contained solely within one box's firmware.
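
To make the behaviour change from the patch description quoted above
concrete, here is a rough user-space sketch. It is not the kernel code:
the enum, the struct and the function names are all hypothetical and
only model the decision (restart vs. panic) for an unrecovered MCE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical outcomes for illustration; not kernel identifiers. */
enum mce_action {
	MCE_ACTION_CONTINUE,	/* error was recovered, keep running */
	MCE_ACTION_RESTART,	/* old behaviour: reboot without a dump */
	MCE_ACTION_PANIC	/* new behaviour: defer to panic policy */
};

/* Hypothetical, heavily simplified view of a machine check event. */
struct mce_event {
	bool recovered;
};

/* Models the old path: an unrecovered MCE restarts the machine. */
enum mce_action mce_action_before_patch(const struct mce_event *evt)
{
	assert(evt != NULL);
	return evt->recovered ? MCE_ACTION_CONTINUE : MCE_ACTION_RESTART;
}

/*
 * Models the patched path: an unrecovered MCE panics, so whatever is
 * configured for panic (for example taking a dump) gets to run.
 */
enum mce_action mce_action_after_patch(const struct mce_event *evt)
{
	assert(evt != NULL);
	return evt->recovered ? MCE_ACTION_CONTINUE : MCE_ACTION_PANIC;
}

/* Human-readable name for logging in this sketch. */
const char *mce_action_name(enum mce_action action)
{
	switch (action) {
	case MCE_ACTION_CONTINUE:
		return "continue";
	case MCE_ACTION_RESTART:
		return "restart";
	case MCE_ACTION_PANIC:
		return "panic";
	}
	return "unknown";
}
```

The recovered case is the same on both paths; only the unrecovered case
differs, which is the point of the patch: the machine no longer goes
away silently, and what happens next is left to the panic policy.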

Balbir Singh.


More information about the Linuxppc-dev mailing list