[Skiboot] [PATCH] opal-prd: Do not error out on first failure for soft/hard offline.

Stewart Smith stewart at linux.ibm.com
Mon May 28 12:27:52 AEST 2018

Mahesh Jagannath Salgaonkar <mahesh at linux.vnet.ibm.com> writes:
> On 05/25/2018 05:53 AM, Stewart Smith wrote:
>> Mahesh J Salgaonkar <mahesh at linux.vnet.ibm.com> writes:
>>> From: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
>>> The memory errors (CEs and UEs) that are detected as part of background
>>> memory scrubbing are reported by PRD asynchronously to opal-prd along with
>>> affected memory ranges. hservice_memory_error() converts these ranges into
>>> page granularity before hooking them up to the soft/hard offlining
>>> infrastructure.
>>> But the current implementation of hservice_memory_error() does not hook up
>>> all the pages to soft/hard offlining if any page offline action
>>> fails. For example, hard offline can fail for:
>>>       - Pages that are not part of buddy managed pool.
>>>       - Pages that are reserved by kernel using memblock_reserved()
>>>       - Pages that are in use by kernel.
>>> But for the pages that are in use by user space application, the hard
>>> offline marks the page as hwpoison, sends SIGBUS signal to kill the
>>> affected application as recovery action and returns success.
>>> Hence, it is possible that some of the pages in that memory range are in
>>> use by an application or free. By stopping on the first error we lose the
>>> opportunity to hwpoison the subsequent pages, which may be free or in use
>>> by an application. This patch fixes this issue.
>>> Signed-off-by: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
>>> ---
>>>  external/opal-prd/opal-prd.c |    6 +++---
>>>  1 file changed, 3 insertions(+), 3 deletions(-)
>> Merged to master as of e9ee7c7d357160a704c8248a1787124f94df8c54.
>> Should this also head to stable?
> Yes. We've been broken from day 1.
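
For anyone skimming the thread, the behaviour change described in the commit
message can be sketched roughly as below. This is only an illustration, not
the code from opal-prd.c: offline_range(), offline_fn and demo_offline() are
hypothetical names, and the callback stands in for whatever actually asks the
kernel to soft/hard offline a page. The point is that a failure on one page
is remembered but does not end the loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: returns 0 if the page was offlined, non-zero if not. */
typedef int (*offline_fn)(uint64_t page_addr);

/*
 * Attempt to offline every page in [start, end] (inclusive).
 * A failure is recorded in the return value but the loop keeps going,
 * so later pages still get their chance to be offlined.
 * Returns 0 if every page succeeded, 1 if any page failed or the
 * arguments were invalid. page_size is assumed to be a power of two.
 */
static int offline_range(uint64_t start, uint64_t end, uint64_t page_size,
                         offline_fn offline_page, size_t *attempted)
{
    uint64_t addr;
    int ret = 0;

    if (page_size == 0 || start > end)
        return 1;

    /* Align the first address down to a page boundary. */
    addr = start & ~(page_size - 1);

    for (;;) {
        if (attempted)
            (*attempted)++;

        /* Old behaviour would return here; instead, note it and move on. */
        if (offline_page(addr) != 0)
            ret = 1;

        /* Stop once the next page would start beyond end (also avoids wrap). */
        if (end - addr < page_size)
            break;
        addr += page_size;
    }

    return ret;
}

/* Demo stand-in: pretend the page at 0x2000 is in use by the kernel. */
static int demo_offline(uint64_t page_addr)
{
    return page_addr == 0x2000 ? -1 : 0;
}
```

With demo_offline(), a call covering 0x1000 through 0x3fff with a 0x1000 page
size still attempts all three pages and reports failure overall, rather than
giving up at 0x2000.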

Okay, I've cherry-picked back to:
6.0.x as of 3efceb1691846f450f0541c0156ad7258a57870b
5.10.x as of 6ca368e2e2254eac8682b6af43758ba134aa3763
5.4.x as of 92bd1c4bbebd25886adabe7ff275e3ca0c600234

These seem to be the branches in use by current distros, so I guess we
now have to get them to bump what they're using to pick up this fix.

It's things like this that make me wonder if keeping opal-prd in the same
repository/versioning scheme as skiboot still makes sense...

Stewart Smith
OPAL Architect, IBM.

