[PATCH v1 0/4] Revisit MCE handling for UE Errors

Balbir Singh bsingharora at gmail.com
Tue Sep 12 17:11:42 AEST 2017


On Tue, Sep 12, 2017 at 3:03 PM, Nicholas Piggin <npiggin at gmail.com> wrote:
> Hi Balbir,
>
> Very cool. How are you testing it? Is it failing memory pages
> and poisoning them out properly?
>

Yep, I tested it and it seems to work correctly so far. I am testing this
on a simulator with injected MCE UE errors for both the data and
instruction side.

> Looks like you have a printk in the machine_check_early path,
> which you shouldn't. I guess because we don't mark that context
> as an NMI. Which we could... but I think you want to put as
> little as possible in that path, so avoiding the print would
> be preferable. Perhaps you could mark the mce event somehow that
> the failure can be reported during processing it?
>

Good point. I did see that printk handles this via printk_nmi_enter/exit,
but it's best avoided. Will spin v2.
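
A rough user-space sketch of the suggestion above: the early path only
records a flag in the event, and the message is printed later during
event processing. All names here (mce_event_sketch, early_handle_ue,
process_queued_events) are hypothetical and are not the kernel's actual
structures or functions; the fixed-size array is just an assumption to
keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a queued MCE event; not the kernel's struct. */
struct mce_event_sketch {
	unsigned long addr;	/* faulting physical address */
	bool recovery_failed;	/* set in the early path instead of printing */
};

#define MAX_EVENTS 8
static struct mce_event_sketch queue[MAX_EVENTS];
static int nr_queued;

/* Early path: record the outcome only, no printing. */
static void early_handle_ue(unsigned long addr, bool recovered)
{
	assert(nr_queued < MAX_EVENTS);
	queue[nr_queued].addr = addr;
	queue[nr_queued].recovery_failed = !recovered;
	nr_queued++;
}

/* Later processing: printing is safe here. Returns failures reported. */
static int process_queued_events(void)
{
	int failures = 0;

	for (int i = 0; i < nr_queued; i++) {
		if (queue[i].recovery_failed) {
			printf("MCE UE: recovery failed for address 0x%lx\n",
			       queue[i].addr);
			failures++;
		}
	}
	nr_queued = 0;
	return failures;
}
```

Calling early_handle_ue() for each injected error and then
process_queued_events() once from a safe context reports only the
events whose recovery failed.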

> Firmware logging is a good question, I could not really see
> where this all gets plumbed through. If this is expected to be
> a common problem for some types of attached memory, then we
> really need to build up a log of these errors that can be used
> to exclude the memory after a reboot too. Do we have anything
> like this capability in firmware?

It's yet to be built. We should log these errors to NVRAM and revisit the
log at every boot to isolate these pages.

Balbir Singh.

