[Skiboot] [PATCH] opal-prd: Do not error out on first failure for soft/hard offline.

Stewart Smith stewart at linux.ibm.com
Fri May 25 10:23:47 AEST 2018


Mahesh J Salgaonkar <mahesh at linux.vnet.ibm.com> writes:
> From: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
>
> Memory errors (CEs and UEs) detected during background memory scrubbing
> are reported asynchronously by PRD to opal-prd, along with the affected
> memory ranges. hservice_memory_error() converts these ranges to page
> granularity before handing them to the soft/hard offlining
> infrastructure.
>
> But the current implementation of hservice_memory_error() stops at the
> first page offline action that fails, so the remaining pages in the range
> are never submitted for soft/hard offlining. For example, hard offline
> can fail for:
>       - Pages that are not part of the buddy-managed pool.
>       - Pages that are reserved by the kernel using memblock_reserve().
>       - Pages that are in use by the kernel.
>
> For pages that are in use by a user space application, however, hard
> offline marks the page as hwpoison, sends SIGBUS to kill the affected
> application as a recovery action, and returns success.
>
> Hence, it is possible that some pages in the memory range are free or in
> use by an application. By stopping on the first error we lose the
> opportunity to hwpoison subsequent pages that may be free or in use by an
> application. This patch fixes this by continuing through the whole range
> after a failure.
>
> Signed-off-by: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
> ---
>  external/opal-prd/opal-prd.c |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
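
For readers without the diff at hand, the behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from opal-prd.c: offline_range(), offline_page_fn, demo_state and demo_offline() are hypothetical names, and the callback stands in for whatever actually submits a page for soft/hard offlining. The point is only that a failure is recorded and the loop carries on to the next page.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: returns 0 if the page at addr was offlined. */
typedef int (*offline_page_fn)(uint64_t addr, void *opaque);

/*
 * Try to offline every page overlapping [start, end] (end inclusive).
 * A failure is counted but does not stop the walk. Returns the number
 * of pages that could not be offlined (0 for an empty or invalid range).
 */
static unsigned int offline_range(uint64_t start, uint64_t end,
				  uint64_t page_size,
				  offline_page_fn offline, void *opaque)
{
	unsigned int failed = 0;
	uint64_t addr;

	if (page_size == 0 || offline == NULL || start > end)
		return 0;

	/* Round the start address down to a page boundary. */
	addr = start - (start % page_size);

	for (;;) {
		if (offline(addr, opaque) != 0)
			failed++;	/* remember the failure, keep going */

		/* Stop once the next page would begin past 'end'. */
		if (end - addr < page_size)
			break;
		addr += page_size;
	}

	return failed;
}

/* Demo stub (assumes 4 KiB pages): fails on odd-numbered pages. */
struct demo_state {
	unsigned int attempts;
};

static int demo_offline(uint64_t addr, void *opaque)
{
	struct demo_state *s = opaque;

	s->attempts++;
	return ((addr >> 12) & 1) ? -1 : 0;
}
```

With the demo stub, a range such as 0x1000..0x4fff is attempted page by page even though the very first page fails, which is the opportunity the old stop-on-first-error behaviour gave up.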

Merged to master as of e9ee7c7d357160a704c8248a1787124f94df8c54.

Should this also head to stable?

-- 
Stewart Smith
OPAL Architect, IBM.


