[PATCH v4 22/23] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler

Fri May 26 08:28:23 AEST 2023

On 25.05.23 17:01:01, Bjorn Helgaas wrote:
> On Thu, May 25, 2023 at 11:29:58PM +0200, Robert Richter wrote:
> > eOn 24.05.23 16:32:35, Bjorn Helgaas wrote:
> > > On Tue, May 23, 2023 at 06:22:13PM -0500, Terry Bowman wrote:
> > > > From: Robert Richter <rrichter at amd.com>
> > > > 
> > > > In Restricted CXL Device (RCD) mode a CXL device is exposed as an
> > > > RCiEP, but CXL downstream and upstream ports are not enumerated and
> > > > not visible in the PCIe hierarchy. Protocol and link errors are sent
> > > > to an RCEC.
> > > >
> > > > Restricted CXL host (RCH) downstream port-detected errors are signaled
> > > > as internal AER errors, either Uncorrectable Internal Error (UIE) or
> > > > Corrected Internal Errors (CIE). 
> > > 
> > > From the parallelism with RCD above, I first thought that RCH devices
> > > were non-RCD mode and *were* enumerated as part of the PCIe hierarchy,
> > > but actually I suspect it's more like the following?
> > > 
> > >   ... but CXL downstream and upstream ports are not enumerated and not
> > >   visible in the PCIe hierarchy.
> > > 
> > >   Protocol and link errors from these non-enumerated ports are
> > >   signaled as internal AER errors ... via a CXL RCEC.
> > 
> > Exactly, except the RCEC is standard PCIe and also must not
> > necessarily on the same PCI bus as the CXL RCiEPs are.
> 
> So make it "RCEC" instead of "CXL RCEC", I guess?  PCIe r6.0, sec
> 7.9.10.3, allows an RCEC to be associated with RCiEPs on different
> buses, so nothing to see there.

Yes, nothing special. This makes it more difficult to check if the
RCEC has CXL devices connected, but still it is feasible.

> 
> > > > The error source is the id of the RCEC.
> > > 
> > > This seems odd; I assume this refers to the RCEC's AER Error Source
> > > Identification register, and the ERR_COR or ERR_FATAL/NONFATAL Source
> > > Identification would ordinarily be the Requester ID of the RCiEP that
> > > "sent" the Error Message.  But you're saying it's actually the ID of
> > > the *RCEC*, not the RCiEP?
> > 
> > Right, the downstream port has its own AER ext capability in
> > non-config (io mapped) RCRB register range. Errors originating from
> > there are signaled as internal AER errors via the RCEC *with* the
> > RCEC's Requester ID. Code walks through all associated CXL endpoints,
> > determines the dport and checks its AER.
> > 
> > There is also an RDPAS structure defined in CXL but that is only a
> > different way to provide the RCEC to dport association instead of
> > using the RCEC's Endpoint Association Extended Capability. In the end
> > we get all associated RCHs and check the AER of all their dports.
> > 
> > The upstream port is signaled using the RCiEP's AER. CXL spec is
> > strict here: "Upstream Port RCRB shall not implement the AER Extended
> > Capability." The RCiEP's requestor ID is used then and its config
> > space the AER is in.
> > 
> > CXL.cachemem errors are reported with the RCiEP as requester
> > too. Status is in the CXL RAS cap and the UIE or CIE is set
> > respectively in the AER status of the RCiEP.
> >
> > > We're going to call pci_aer_handle_error() as well, to handle the
> > > non-internal errors, and I'm pretty sure that path expects the RCiEP
> > > ID there.
> > > 
> > > Whatever the answer, I'm not sure this sentence is actually relevant
> > > to this patch, since this patch doesn't read PCI_ERR_ROOT_ERR_SRC or
> > > look at struct aer_err_source.id.
> > 
> > The source id is used in aer_process_err_devices() which finally calls
> > handle_error_source() for the device with the requestor id. This is
> > the place where cxl_rch_handle_error() checks if it is an RCEC that
> > received an internal error and has cxl devices connected to it. Then,
> > the request is forwarded to the cxl_mem handler which also needs to
> > check the dport now. That is, pcie_walk_rcec() in
> > cxl_rch_handle_error() is called with the RCEC's pci handle,
> > cxl_rch_handle_error_iter() with the RCiEP's pci handle.
> 
> I'm still not sure this is relevant.  Isn't that last sentence just
> the way we always use pcie_walk_rcec()?
> 
> If there's something *different* here about CXL, and it's important to
> this patch, sure.  But I don't see that yet.  Maybe a comment in the
> code if you think it's important to clarify something there.

The importance I see is that internal errors of an RCEC indicate an
AER error in an RCH's downstream port. Thus, once that happens, all
involved dports must be checked. Internal errors are typically
non-standard and implementation defined, but here it is CXL standard.

-Robert