[Skiboot] [PATCH] opal-prd: Do not error out on first failure for soft/hard offline.
Stewart Smith
stewart at linux.ibm.com
Fri May 25 10:23:47 AEST 2018
Mahesh J Salgaonkar <mahesh at linux.vnet.ibm.com> writes:
> From: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
>
> The memory errors (CEs and UEs) that are detected as part of background
> memory scrubbing are reported by PRD asynchronously to opal-prd along with
> the affected memory ranges. hservice_memory_error() converts these ranges
> into page granularity before hooking them up to the soft/hard offline-ing
> infrastructure.
>
> But the current implementation of hservice_memory_error() does not hook up
> all the pages to soft/hard offline-ing if any of the page offline actions
> fails. For example, hard offline can fail for:
> - Pages that are not part of buddy managed pool.
> - Pages that are reserved by kernel using memblock_reserved().
> - Pages that are in use by kernel.
>
> But for pages that are in use by a user space application, hard
> offline marks the page as hwpoison, sends a SIGBUS signal to kill the
> affected application as a recovery action, and returns success.
>
> Hence, it is possible that some of the pages in that memory range are in
> use by an application or free. By stopping on the first error we lose the
> opportunity to hwpoison the subsequent pages, which may be free or in use
> by an application. This patch fixes this issue.
>
> Signed-off-by: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
> ---
> external/opal-prd/opal-prd.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
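For readers without the diff at hand, the behaviour described above can be sketched roughly as follows. This is not the code from opal-prd.c: offline_range(), offline_fn, struct demo_ctx and demo_offline() are hypothetical names invented for illustration, and the real sysfs write is replaced by a callback. It assumes an inclusive [start, end] range and a power-of-two page size. The point is only that a per-page failure is recorded and the loop carries on.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Hypothetical callback: returns 0 on success, non-zero on failure. */
typedef int (*offline_fn)(uint64_t page_addr, void *ctx);

/*
 * Walk the inclusive range [start, end] one page at a time and try to
 * offline every page. A failing page is logged and remembered, but the
 * loop does not stop. Returns 0 if every page succeeded, otherwise the
 * first non-zero return code seen.
 */
static int offline_range(uint64_t start, uint64_t end, uint64_t page_size,
                         offline_fn offline, void *ctx)
{
    /* Assumption of this sketch: page_size is a non-zero power of two. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    if (start > end)
        return -1;

    uint64_t first = start & ~(page_size - 1);  /* round down to page */
    uint64_t last = end & ~(page_size - 1);     /* page holding 'end' */
    int ret = 0;

    for (uint64_t addr = first; ; addr += page_size) {
        int rc = offline(addr, ctx);
        if (rc != 0) {
            fprintf(stderr, "offline failed for page 0x%016" PRIx64 "\n",
                    addr);
            if (ret == 0)
                ret = rc;   /* remember the failure, keep going */
        }
        if (addr == last)   /* compare before incrementing: no wrap */
            break;
    }
    return ret;
}

/* Hypothetical stand-in for the real offline action, for demonstration. */
struct demo_ctx {
    uint64_t bad_page;   /* the one page that refuses to go offline */
    unsigned attempts;   /* pages we tried */
    unsigned failures;   /* pages that failed */
};

static int demo_offline(uint64_t page_addr, void *ctx)
{
    struct demo_ctx *d = ctx;

    d->attempts++;
    if (page_addr == d->bad_page) {
        d->failures++;
        return -1;
    }
    return 0;
}
```

With 64 KiB pages and the range 0x10000 to 0x3ffff, passing demo_offline with bad_page set to 0x10000 still attempts all three pages even though the first one fails, and the caller gets a non-zero return to signal that not everything went offline.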
Merged to master as of e9ee7c7d357160a704c8248a1787124f94df8c54.
Should this also head to stable?
--
Stewart Smith
OPAL Architect, IBM.