[PATCH v13 2/5] arm64: add support for ARCH_HAS_COPY_MC
Tong Tiangen
tongtiangen at huawei.com
Thu Apr 3 13:36:49 AEDT 2025
On 2025/3/29 1:06, Yeoreum Yun wrote:
> Hi,
>
>>
>>
>> On 2025/2/13 0:21, Catalin Marinas wrote:
>>> (catching up with old threads)
>>>
>>> On Mon, Dec 09, 2024 at 10:42:54AM +0800, Tong Tiangen wrote:
>>>> When the arm64 kernel handles a hardware memory error reported through a
>>>> synchronous notification (do_sea()) and the error is consumed within the
>>>> kernel, the current behaviour is to panic. This is not optimal.
>>>>
>>>> Take copy_from/to_user as an example: if ld* triggers a memory error, even
>>>> in kernel mode, only the associated process is affected. Killing the user
>>>> process and isolating the corrupt page is a better choice.
>>>
>>> I agree that killing the user process and isolating the page is a better
>>> choice but I don't see how the latter happens after this patch. Which
>>> page would be isolated?
>>
>> The SEA is triggered when the page with hardware error is read. After
>> that, the page is isolated in memory_failure() (mf). The processing of
>> mf is mentioned in the comments of do_sea().
>>
>> /*
>> * APEI claimed this as a firmware-first notification.
>> * Some processing deferred to task_work before ret_to_user().
>> */
>>
>> The deferred processing includes mf.
>>
>>>
>>>> Add a new fixup type, EX_TYPE_KACCESS_ERR_ZERO_MEM_ERR, to identify insns
>>>> that can recover from memory errors triggered by access to kernel memory.
>>>> This fixup type is used in __arch_copy_to_user(), so that the regular
>>>> copy_to_user() handles kernel memory errors.
>>>
>>> Is the assumption that the error on accessing kernel memory is
>>> transient? There's no way to isolate the kernel page and also no point
>>> in isolating the destination page either.
>>
>> Yes, it's transient. The kernel page can't be isolated in mf; the
>> transient access (ld) of this kernel page is currently expected to kill
>> the user-mode process to avoid spreading the error.
>
> I'm not sure how this works.
> IIUC, memory_failure() wouldn't kill any process if the page that
> raises the SEA is a kernel page (because it isn't mapped).
right.
>
> But to mark the kernel page as poisoned, I think it is also necessary to
> call apei_claim_sea() in the !user_mode() case.
> What about calling apei_claim_sea() in the !user_mode() case only when
> fix_exception_me() succeeds?
This was discussed with Mark in V12:
https://lore.kernel.org/lkml/20240528085915.1955987-3-tongtiangen@huawei.com/
Sorry for not catching your reply in time :)
Thanks,
Tong.
>
> Thanks.
>>
>> The SEA handles synchronous errors. Only hardware errors on the
>> source page can be detected (through a synchronous ld insn) and processed.
>> The destination page cannot be processed.
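
The point that only loads from the source can report the error synchronously can be illustrated with a toy user-space copy loop. This is a sketch under stated assumptions: model_copy_mc() and MODEL_NO_POISON are hypothetical names, the "poisoned" byte is simulated by an offset argument, and the bytes-not-copied return value is chosen for illustration rather than taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical marker meaning "no byte of the source is poisoned". */
#define MODEL_NO_POISON ((size_t)-1)

/*
 * Copy len bytes from src to dst. A load from src[poison_off] is treated
 * as the access that raises the synchronous error, so the copy stops there.
 * Returns the number of bytes NOT copied (0 on full success).
 */
static size_t model_copy_mc(unsigned char *dst, const unsigned char *src,
			    size_t len, size_t poison_off)
{
	for (size_t i = 0; i < len; i++) {
		if (i == poison_off)
			return len - i;	/* the load "faulted": stop here */
		dst[i] = src[i];
	}
	return 0;
}

/* Usage example: a clean copy, then a copy that hits poison at offset 3. */
static void model_copy_mc_demo(void)
{
	unsigned char src[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
	unsigned char dst[8] = { 0 };

	assert(model_copy_mc(dst, src, sizeof(src), MODEL_NO_POISON) == 0);
	assert(model_copy_mc(dst, src, sizeof(src), 3) == 5);
}
```

The model has no way to express a fault on the store side, which mirrors the statement above that the destination page cannot be processed.
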
>>
>>>
>>
> .