[PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices
Alex Williamson
alex.williamson at redhat.com
Tue Sep 23 02:01:43 AEST 2025
On Sat, 20 Sep 2025 14:25:03 -0500 (CDT)
Timothy Pearson <tpearson at raptorengineering.com> wrote:
> ----- Original Message -----
> > From: "Alex Williamson" <alex.williamson at redhat.com>
> > To: "Timothy Pearson" <tpearson at raptorengineering.com>
> > Cc: "kvm" <kvm at vger.kernel.org>, "linuxppc-dev" <linuxppc-dev at lists.ozlabs.org>
> > Sent: Friday, September 19, 2025 5:27:21 PM
> > Subject: Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices
>
> > On Fri, 19 Sep 2025 15:51:14 -0500 (CDT)
> > Timothy Pearson <tpearson at raptorengineering.com> wrote:
> >
> >> ----- Original Message -----
> >> > From: "Alex Williamson" <alex.williamson at redhat.com>
> >> > To: "Timothy Pearson" <tpearson at raptorengineering.com>
> >> > Cc: "kvm" <kvm at vger.kernel.org>, "linuxppc-dev" <linuxppc-dev at lists.ozlabs.org>
> >> > Sent: Friday, September 19, 2025 1:56:03 PM
> >> > Subject: Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices
> >>
> >> > On Tue, 9 Sep 2025 15:48:46 -0500 (CDT)
> >> > Timothy Pearson <tpearson at raptorengineering.com> wrote:
> >> >
> >> >> PCI devices prior to PCI 2.3 use level-triggered interrupts and do not
> >> >> support INTx masking (DisINTx), leading to a failure when such a device is
> >> >> passed through to a KVM guest on at least the ppc64 platform, which does not
> >> >> utilize the resample IRQFD. The failure manifests as the guest receiving and
> >> >> acknowledging a single interrupt while the host physical device's VFIO IRQ
> >> >> is left pending.
> >> >>
> >> >> Level interrupts in general require special handling due to their inherently
> >> >> asynchronous nature; both the host and guest interrupt controller need to
> >> >> remain in synchronization in order to coordinate mask and unmask operations.
> >> >> When lazy IRQ masking is used on DisINTx- hardware, the following sequence
> >> >> occurs:
> >> >>
> >> >> * Level IRQ assertion on host
> >> >> * IRQ trigger within host interrupt controller, routed to VFIO driver
> >> >> * Host EOI with hardware level IRQ still asserted
> >> >> * Software mask of interrupt source by VFIO driver
> >> >> * Generation of event and IRQ trigger in KVM guest interrupt controller
> >> >> * Level IRQ deassertion on host
> >> >> * Guest EOI
> >> >> * Guest IRQ level deassertion
> >> >> * Removal of software mask by VFIO driver
> >> >>
> >> >> Note that no actual state change occurs within the host interrupt controller,
> >> >> unlike what would happen with either DisINTx+ hardware or message-signaled
> >> >> interrupts. Because the host EOI is never issued with the hardware level IRQ
> >> >> deasserted, the level interrupt is never re-armed within the host interrupt
> >> >> controller, leading to an unrecoverable stall of the device.
> >> >>
> >> >> Work around this by disabling lazy IRQ masking for DisINTx- INTx devices.
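> >> >>
> >> >> For reference, the "software mask" step above corresponds roughly to the
> >> >> non-DisINTx branch of the VFIO INTx handler (paraphrased from
> >> >> drivers/vfio/pci/vfio_pci_intrs.c; structure and field names vary by
> >> >> kernel version):
> >> >>
> >> >>   static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
> >> >>   {
> >> >>           struct vfio_pci_core_device *vdev = dev_id;
> >> >>           unsigned long flags;
> >> >>           int ret = IRQ_NONE;
> >> >>
> >> >>           spin_lock_irqsave(&vdev->irqlock, flags);
> >> >>
> >> >>           if (!vdev->pci_2_3) {
> >> >>                   /* DisINTx-: "mask" by disabling the host IRQ line */
> >> >>                   disable_irq_nosync(vdev->pdev->irq);
> >> >>                   ret = IRQ_HANDLED;
> >> >>           } else if (pci_check_and_mask_intx(vdev->pdev)) {
> >> >>                   /* DisINTx+: mask at the device, the line deasserts */
> >> >>                   ret = IRQ_HANDLED;
> >> >>           }
> >> >>
> >> >>           spin_unlock_irqrestore(&vdev->irqlock, flags);
> >> >>
> >> >>           /*
> >> >>            * IRQ_HANDLED paths also signal the trigger eventfd to
> >> >>            * notify userspace (omitted here).
> >> >>            */
> >> >>           return ret;
> >> >>   }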
> >> >
> >> > I'm not really following here. It's claimed above that no actual state
> >> > change occurs within the host interrupt controller, but that's exactly
> >> > what disable_irq_nosync() intends to do: mask the interrupt line at the
> >> > controller.
> >>
> >> While it seems that way on the surface (and this tripped me up
> >> originally), the actual call chain is:
> >>
> >>   disable_irq_nosync()
> >>     __disable_irq_nosync()
> >>       __disable_irq()
> >>         irq_disable()
> >>
> >> Inside irq_disable(), __irq_disable() is gated on
> >> irq_settings_disable_unlazy(). The lazy disable is intended to *not*
> >> touch the interrupt controller itself; instead, lazy mode masks the
> >> interrupt at the device level (DisINTx+ registers). If the IRQ is set
> >> up to run in lazy mode, the interrupt is not disabled at the actual
> >> interrupt controller by disable_irq_nosync().
> >
> > What chip handler are you using? The comment above irq_disable()
> > reiterates the behavior: yes, if the chip doesn't implement irq_disable,
> > the interrupt is marked disabled but the hardware is left unmasked. It
> > does not describe using DisINTx to mask the device, which would be at a
> > different level from the chip. In this case __irq_disable() just calls
> > irq_state_set_disabled(). Only with the change proposed here would we
> > also call mask_irq().
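> >
> > For reference, the relevant logic in kernel/irq/chip.c looks roughly like
> > this (paraphrased; details vary by kernel version):
> >
> >   static void __irq_disable(struct irq_desc *desc, bool mask)
> >   {
> >           if (irqd_irq_disabled(&desc->irq_data)) {
> >                   if (mask)
> >                           mask_irq(desc);
> >           } else {
> >                   irq_state_set_disabled(desc);
> >                   if (mask) {
> >                           irq_state_set_masked(desc);
> >                           mask_irq(desc);
> >                   }
> >           }
> >   }
> >
> >   void irq_disable(struct irq_desc *desc)
> >   {
> >           /* lazy unless IRQ_DISABLE_UNLAZY is set on the line */
> >           __irq_disable(desc, irq_settings_disable_unlazy(desc));
> >   }
> >
> > i.e., without the unlazy flag the line is only marked disabled; the
> > hardware mask is deferred until the flow handler (e.g.
> > handle_fasteoi_irq()) sees a new interrupt arrive on the disabled line
> > and masks it and marks it pending at that point.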
>
> This is all tested on a POWER XIVE controller, i.e. arch/powerpc/sysdev/xive (CONFIG_PPC_XIVE_NATIVE=y).
>
> >> > Disabling the lazy optimization, as proposed here, should only
> >> > change the behavior such that the interrupt is masked at the
> >> > call to disable_irq_nosync() rather than at a subsequent
> >> > re-assertion of the interrupt. In any case, enable_irq() should
> >> > mark the line enabled and reenable the controller if necessary.
> >>
> >> If the interrupt was not disabled at the controller, then reenabling
> >> a level interrupt is not guaranteed to actually do anything (although
> >> it *might*). The interrupt controller hardware will still "see" an
> >> active level assertion for which it already fired an interrupt that
> >> software never acknowledged (or disabled/re-enabled), and it may then
> >> decide not to re-raise the IRQ, on a platform-specific basis.
> >>
> >> The key here is that the interrupt controllers differ somewhat in
> >> behavior across various architectures. On POWER, the controller will
> >> only raise the external processor interrupt once for each level
> >> interrupt when that interrupt changes state to asserted, and will
> >> only re-raise the external processor interrupt once an acknowledge
> >> for that interrupt has been sent to the interrupt controller hardware
> >> while the level interrupt is deasserted. As a result, if the
> >> interrupt handler executes (acknowledging the interrupt), but does
> >> not first clear the interrupt on the device itself, the interrupt
> >> controller will never re-raise that interrupt -- from its
> >> perspective, it has issued another IRQ (because the device level
> >> interrupt was left asserted) and the associated handler has never
> >> completed. Disabling the interrupt causes the controller to reassert
> >> the interrupt if the level interrupt is still asserted when the
> >> interrupt is reenabled at the controller level.
> >
> > This sounds a lot more like the problem than the previous description.
>
> Apologies for that. There are a lot of moving parts here, and I guess I muddled it all up in the first description.
>
> > Is the actual scenario something like the irq is marked disabled, the
> > eventfd is delivered to userspace, userspace handles the device, the
> > interrupt is de-asserted at the device, but then the device re-asserts
> > the interrupt before the unmask ioctl, causing the interrupt chip to
> > mask the interrupt, then enable_irq() from the unmask ioctl doesn't
> > reassert the interrupt?
>
> That is exactly it, yes. This particular scenario only occurs on old
> pre-PCI 2.3 devices that advertise DisINTx-; newer devices that
> advertise DisINTx+ don't trip the fault, since the host interrupt
> handler sets the device interrupt mask (thus deasserting the IRQ at
> the interrupt controller) before exiting.
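>
> For reference, the unmask ioctl ends up in roughly this branch of
> vfio_pci_intx_unmask_handler() (paraphrased; exact structure varies by
> kernel version):
>
>   if (vdev->ctx[0].masked && !vdev->virq_disabled) {
>           if (vdev->pci_2_3) {
>                   /* DisINTx+: clear the mask at the device */
>                   if (!pci_check_and_unmask_intx(pdev))
>                           ret = 1;
>           } else {
>                   /* DisINTx-: re-enable the host IRQ line */
>                   enable_irq(pdev->irq);
>           }
>           vdev->ctx[0].masked = (ret > 0);
>   }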
>
> >> On other platforms the external processor interrupt itself is
> >> disabled until the interrupt handler has finished, and the
> >> controller doesn't auto-mask the level interrupts at the hardware
> >> level; instead, it will happily re-assert the processor interrupt
> >> if the interrupt was not cleared at the device level after IRQ
> >> acknowledge. I suspect on those platforms this bug may be masked
> >> at the expense of a bunch of "spurious" / unwanted interrupts if
> >> the interrupt handler hasn't acked the interrupt at the device
> >> level; as long as the guest interrupt handler is able to somewhat
> >> rapidly clear the device interrupt, performance won't be impacted
> >> too much by the extra interrupt load, further hiding the bug on
> >> these platforms.
> >
> > It seems this is the trade-off the lazy handling makes
> > intentionally: we risk some spurious interrupts while the line is
> > disabled to avoid poking the hardware. So long as we're not
> > triggering the spurious interrupt handling to permanently disable
> > the interrupt line, this is a valid choice.
>
> At least for some platforms, yes, though it's not exactly clear from
> the documentation whether this is intentional or whether the side
> effect exists at all.
>
> > That's also a consideration for this patch: we're slowing down all
> > non-PCI 2.3 INTx handling for the behavior of this platform. Is
> > there some behavior of the interrupt controller we can query to
> > know which path to use? Can this issue be seen with the irqfd INTx
> > handling disabled on other architectures? Do we actually care
> > whether we're making old INTx devices slower? Thanks,
>
> I honestly don't know. In reality, we're talking about amd64 and
> ppc64el as the only two platforms that I know of that both support
> VFIO and such old pre-PCI 2.3 INTx devices, and their respective
> interrupt controller handling in a spurious INTx IRQ context is very
> different between the two.
>
> Personally, I'd argue that such old devices were intended to work
> with much slower host systems, so the slowdown probably doesn't
> matter versus being more correct in terms of interrupt handling.
> In terms of general kernel design, my understanding has always been
> that best practice is to mask, disable, or clear a level interrupt
> before exiting the associated IRQ handler, and the current design
> seems to violate that rule. In that context, I'd personally want to
> see an argument as to why eschewing this traditional IRQ handler
> design is beneficial enough to justify making the VFIO driver
> dependent on platform-specific behavior.
Yep, I kind of agree. The unlazy flag seems to provide the intended
behavior here. It moves the irq chip masking into the fast path,
whereas previously it would have happened asynchronously on a
subsequent interrupt, but the impact is only to ancient devices
operating in INTx mode, so as long as we can verify those still work
on both ppc and x86, I don't think it's worth complicating the code to
make setting the unlazy flag conditional on anything other than the
device support.
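For the DisINTx- case I'd expect something along these lines in
vfio_intx_enable()/vfio_intx_disable() (sketch only; exact placement
and field names depend on the current code):

  /* in vfio_intx_enable(), before request_irq() of vfio_intx_handler: */
  if (!vdev->pci_2_3)
          irq_set_status_flags(pdev->irq, IRQ_DISABLE_UNLAZY);

  /* in vfio_intx_disable(), after free_irq(): */
  if (!vdev->pci_2_3)
          irq_clear_status_flags(pdev->irq, IRQ_DISABLE_UNLAZY);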
Care to send out a new version documenting the actual sequence fixed by
this change and updating the code based on this thread? Note that we
can test non-PCI 2.3 mode for any device/driver that supports INTx
using the nointxmask=1 option for vfio-pci and booting a Linux guest
with pci=nomsi.
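For example (assuming the device is already bound to vfio-pci):

  # host: disable PCI 2.3 style INTx masking, forcing the exclusive,
  # non-DisINTx handling path for all vfio-pci devices
  modprobe vfio-pci nointxmask=1

  # guest: append to the kernel command line so drivers fall back to INTx
  pci=nomsi

Thanks,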
Alex