[RFC PATCH] powernv/eeh: Fix oops when probing cxl devices

Frederic Barrat fbarrat at linux.ibm.com
Fri Sep 27 22:45:10 AEST 2019


Recent cleanup in the way EEH support is added to a device causes a
kernel oops when the cxl driver probes a device and creates virtual
devices discovered on the FPGA:

    BUG: Kernel NULL pointer dereference at 0x000000a0
    Faulting instruction address: 0xc000000000048070
    Oops: Kernel access of bad area, sig: 7 [#1]
    ...
    NIP [c000000000048070] eeh_add_device_late.part.9+0x50/0x1e0
    LR [c00000000004805c] eeh_add_device_late.part.9+0x3c/0x1e0
    Call Trace:
    [c000200e43983900] [c00000000079e250] _dev_info+0x5c/0x6c (unreliable)
    [c000200e43983980] [c0000000000d1ad0] pnv_pcibios_bus_add_device+0x60/0xb0
    [c000200e439839f0] [c0000000000606d0] pcibios_bus_add_device+0x40/0x60
    [c000200e43983a10] [c0000000006aa3a0] pci_bus_add_device+0x30/0x100
    [c000200e43983a80] [c0000000006aa4d4] pci_bus_add_devices+0x64/0xd0
    [c000200e43983ac0] [c00800001c429118] cxl_pci_vphb_add+0xe0/0x130 [cxl]
    [c000200e43983b00] [c00800001c4242ac] cxl_probe+0x504/0x5b0 [cxl]
    [c000200e43983bb0] [c0000000006bba1c] local_pci_probe+0x6c/0x110
    [c000200e43983c30] [c000000000159278] work_for_cpu_fn+0x38/0x60

The root cause is that those cxl virtual devices don't have a
representation in the device tree and therefore no associated pci_dn
structure. In eeh_add_device_late(), pdn is NULL, so edev is NULL and
we oops.

We never had explicit support for EEH for those virtual
devices. Instead, EEH events are reported to the (real) pci device and
handled by the cxl driver. Which can then forward to the virtual
devices and handle dependencies. The fact that we try adding EEH
support for the virtual devices is new and a side-effect of the recent
cleanup.

This patch fixes it by skipping adding EEH support on powernv for
devices which don't have a pci_dn structure.

Fixes: b905f8cdca77 ("powerpc/eeh: EEH for pSeries hot plug")
Signed-off-by: Frederic Barrat <fbarrat at linux.ibm.com>
---

Sending as an RFC, as I'm afraid of hiding potential issues and would
be interested in comments. The powernv eeh code expects a struct
pci_dn, so the fix seems safe. I'm wondering if there could be cases
(other than capi virtual devices) where we'd want to blow up and fix
instead of going undetected with this patch.


 arch/powerpc/platforms/powernv/eeh-powernv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 6bc24a47e9ef..6f300ab7f0e9 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -42,7 +42,7 @@ void pnv_pcibios_bus_add_device(struct pci_dev *pdev)
 {
 	struct pci_dn *pdn = pci_get_pdn(pdev);
 
-	if (eeh_has_flag(EEH_FORCE_DISABLED))
+	if (!pdn || eeh_has_flag(EEH_FORCE_DISABLED))
 		return;
 
 	dev_dbg(&pdev->dev, "EEH: Setting up device\n");
-- 
2.21.0



More information about the Linuxppc-dev mailing list