[Skiboot] [PATCH] opal-prd: Do not error out on first failure for soft/hard offline.

Fri May 25 15:25:10 AEST 2018

On 05/25/2018 05:53 AM, Stewart Smith wrote:
> Mahesh J Salgaonkar <mahesh at linux.vnet.ibm.com> writes:
>> From: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
>>
>> The memory errors (CEs and UEs) that are detected as part of background
>> memory scrubbing are reported by PRD asynchronously to opal-prd along with
>> affected memory ranges. hservice_memory_error() converts these ranges into
>> page granularity before hooking them up to the soft/hard offlining
>> infrastructure.
>>
>> But the current implementation of hservice_memory_error() does not hook up
>> all the pages to soft/hard offlining if any of the page offline actions
>> fails. E.g. hard offline can fail for:
>>       - Pages that are not part of buddy managed pool.
>>       - Pages that are reserved by kernel using memblock_reserved()
>>       - Pages that are in use by kernel.
>>
>> But for pages that are in use by a user space application, the hard
>> offline marks the page as hwpoison, sends a SIGBUS signal to kill the
>> affected application as a recovery action, and returns success.
>>
>> Hence, it is possible that some of the pages in that memory range are in
>> use by an application or free. By stopping on the first error we lose the
>> opportunity to hwpoison the subsequent pages, which may be free or in use
>> by an application. This patch fixes this issue.
>>
>> Signed-off-by: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
>> ---
>>  external/opal-prd/opal-prd.c |    6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> Merged to master as of e9ee7c7d357160a704c8248a1787124f94df8c54.
> 
> Should this also head to stable?
> 

Yes. We have been broken since day 1.
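
For anyone reading along without the diff, the behaviour change can be
sketched roughly as below. This is only an illustration, not the actual
opal-prd code: offline_range(), offline_page_fn and the demo callback are
hypothetical names, the end address is assumed to be inclusive, and the page
size is assumed to be a power of two. The point is just that a failure on one
page is remembered and the loop keeps going, instead of returning early.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-page offline hook: returns 0 on success, non-zero on
 * failure. In the real daemon this would be the soft/hard offline request
 * for one page; here it is a callback so the loop can be exercised alone. */
typedef int (*offline_page_fn)(uint64_t addr, void *ctx);

/* Try to offline every page in [start, end] (end assumed inclusive).
 * A failing page does not stop the walk; the first error code is kept
 * and returned once all pages have been attempted. */
static int offline_range(uint64_t start, uint64_t end, uint64_t page_size,
                         offline_page_fn offline_page, void *ctx)
{
    int rc = 0;

    assert(page_size != 0);

    /* Round down to a page boundary (assumes power-of-two page size). */
    start &= ~(page_size - 1);

    for (uint64_t addr = start; addr <= end; addr += page_size) {
        int this_rc = offline_page(addr, ctx);

        /* Remember the first failure, but keep going. */
        if (this_rc && !rc)
            rc = this_rc;
    }

    return rc;
}

/* Demo callback (illustrative only): pretends that one page is in use by
 * the kernel and therefore cannot be offlined. */
struct demo_ctx {
    uint64_t kernel_page;
    int attempted;
    int failed;
};

static int demo_offline_page(uint64_t addr, void *ctx)
{
    struct demo_ctx *d = ctx;

    d->attempted++;
    if (addr == d->kernel_page) {
        d->failed++;
        return -1;
    }
    return 0;
}
```

With a three-page range whose first page fails, the two remaining pages are
still attempted and the caller still sees a non-zero return code.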

Thanks,
-Mahesh.