[PATCH] powerpc/eeh: Permanently disable the removed device
Ganesh G R
ganeshgr at linux.ibm.com
Mon Apr 15 16:49:32 AEST 2024
On 4/9/24 14:37, Michael Ellerman wrote:
> Hi Ganesh,
>
> Ganesh Goudar <ganeshgr at linux.ibm.com> writes:
>> When a device is hot removed on powernv, the hotplug
>> driver clears the device's state. However, on pseries,
>> if a device is removed by phyp after reaching the error
>> threshold, the kernel remains unaware, leading to the
>> device not being torn down. This prevents necessary
>> remediation actions like failover.
>>
>> Permanently disable the device if the presence check
>> fails.
> You can wrap your changelogs a bit wider, 70 or 80 columns is fine.
ok
>> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
>> index ab316e155ea9..8d1606406d3f 100644
>> --- a/arch/powerpc/kernel/eeh.c
>> +++ b/arch/powerpc/kernel/eeh.c
>> @@ -508,7 +508,9 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
>> * state, PE is in good state.
>> */
>> if ((ret < 0) ||
>> - (ret == EEH_STATE_NOT_SUPPORT) || eeh_state_active(ret)) {
>> + (ret == EEH_STATE_NOT_SUPPORT &&
>> + dev->error_state == pci_channel_io_perm_failure) ||
>> + eeh_state_active(ret)) {
>> eeh_stats.false_positives++;
>> pe->false_positives++;
>> rc = 0;
> How does this hunk relate the changelog?
>
> This is adding an extra condition to the false positive check, so
> there's a risk this causes devices to go into failure when previously
> they didn't, right? So please explain why it's a good change. The
> comment above the if needs updating too.
We need this change to log the event and get the device removed, I will explain this
in commit message.
>> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
>> index 48773d2d9be3..10317badf471 100644
>> --- a/arch/powerpc/kernel/eeh_driver.c
>> +++ b/arch/powerpc/kernel/eeh_driver.c
>> @@ -867,7 +867,13 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
>> if (!devices) {
>> pr_debug("EEH: Frozen PHB#%x-PE#%x is empty!\n",
>> pe->phb->global_number, pe->addr);
>> - goto out; /* nothing to recover */
> The other cases that go to recover_failed usually print something at
> warn level, so this probably should too. So either make the above a
> pr_warn(), or change it to a warn with a more helpful message.
ok
>> + /*
>> + * The device is removed, Tear down its state,
>> + * On powernv hotplug driver would take care of
>> + * it but not on pseries, Permanently disable the
>> + * card as it is hot removed.
>> + */
> Formatting and punctuation is weird. It can be wider, and capital letter
> is only required after a full stop, not a comma.
ok, i will take care of it.
> Also you say that the powernv hotplug driver "would" take care of it,
> that's past tense, is that what you mean? Does the powernv hotplug
> driver still take care of it after this change? And (how) does that
> driver cope with it happening here also?
Yes, hotplug driver can still remove the device and the removal of
device is covered by pci rescan lock.
>> + goto recover_failed;
>> }
>>
> cheers
More information about the Linuxppc-dev
mailing list