[PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices
Alex Williamson
alex.williamson at redhat.com
Tue Sep 23 02:01:43 AEST 2025
On Sat, 20 Sep 2025 14:25:03 -0500 (CDT)
Timothy Pearson <tpearson at raptorengineering.com> wrote:
> ----- Original Message -----
> > From: "Alex Williamson" <alex.williamson at redhat.com>
> > To: "Timothy Pearson" <tpearson at raptorengineering.com>
> > Cc: "kvm" <kvm at vger.kernel.org>, "linuxppc-dev" <linuxppc-dev at lists.ozlabs.org>
> > Sent: Friday, September 19, 2025 5:27:21 PM
> > Subject: Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices
>
> > On Fri, 19 Sep 2025 15:51:14 -0500 (CDT)
> > Timothy Pearson <tpearson at raptorengineering.com> wrote:
> >
> >> ----- Original Message -----
> >> > From: "Alex Williamson" <alex.williamson at redhat.com>
> >> > To: "Timothy Pearson" <tpearson at raptorengineering.com>
> >> > Cc: "kvm" <kvm at vger.kernel.org>, "linuxppc-dev" <linuxppc-dev at lists.ozlabs.org>
> >> > Sent: Friday, September 19, 2025 1:56:03 PM
> >> > Subject: Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices
> >>
> >> > On Tue, 9 Sep 2025 15:48:46 -0500 (CDT)
> >> > Timothy Pearson <tpearson at raptorengineering.com> wrote:
> >> >
> >> >> PCI devices prior to PCI 2.3 use level-triggered interrupts and do not
> >> >> support INTx masking (DisINTx), leading to a failure when such a device is
> >> >> passed through to a KVM guest on at least the ppc64 platform, which does not
> >> >> utilize the resample IRQFD. The failure manifests as the guest receiving and
> >> >> acknowledging a single interrupt while the host physical device's VFIO IRQ
> >> >> is left pending.
> >> >>
> >> >> Level interrupts in general require special handling due to their inherently
> >> >> asynchronous nature; both the host and guest interrupt controller need to
> >> >> remain in synchronization in order to coordinate mask and unmask operations.
> >> >> When lazy IRQ masking is used on DisINTx- hardware, the following sequence
> >> >> occurs:
> >> >>
> >> >> * Level IRQ assertion on host
> >> >> * IRQ trigger within host interrupt controller, routed to VFIO driver
> >> >> * Host EOI with hardware level IRQ still asserted
> >> >> * Software mask of interrupt source by VFIO driver
> >> >> * Generation of event and IRQ trigger in KVM guest interrupt controller
> >> >> * Level IRQ deassertion on host
> >> >> * Guest EOI
> >> >> * Guest IRQ level deassertion
> >> >> * Removal of software mask by VFIO driver
> >> >>
> >> >> Note that no actual state change occurs within the host interrupt controller,
> >> >> unlike what would happen with either DisINTx+ hardware or message-signaled
> >> >> interrupts. Because the host EOI is never issued with the hardware level IRQ
> >> >> deasserted, the level interrupt is never re-armed within the host interrupt
> >> >> controller, leading to an unrecoverable stall of the device.
> >> >>
> >> >> Work around this by disabling lazy IRQ masking for DisINTx- INTx devices.
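> >> >>
> >> >> For reference, the "software mask" step above corresponds roughly to the
> >> >> non-DisINTx branch of the VFIO INTx handler (paraphrased from
> >> >> drivers/vfio/pci/vfio_pci_intrs.c; structure and field names vary by
> >> >> kernel version):
> >> >>
> >> >>   static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
> >> >>   {
> >> >>           struct vfio_pci_core_device *vdev = dev_id;
> >> >>           unsigned long flags;
> >> >>           int ret = IRQ_NONE;
> >> >>
> >> >>           spin_lock_irqsave(&vdev->irqlock, flags);
> >> >>
> >> >>           if (!vdev->pci_2_3) {
> >> >>                   /* DisINTx-: "mask" by disabling the host IRQ line */
> >> >>                   disable_irq_nosync(vdev->pdev->irq);
> >> >>                   ret = IRQ_HANDLED;
> >> >>           } else if (pci_check_and_mask_intx(vdev->pdev)) {
> >> >>                   /* DisINTx+: mask at the device, the line deasserts */
> >> >>                   ret = IRQ_HANDLED;
> >> >>           }
> >> >>
> >> >>           spin_unlock_irqrestore(&vdev->irqlock, flags);
> >> >>
> >> >>           /*
> >> >>            * IRQ_HANDLED paths also signal the trigger eventfd to
> >> >>            * notify userspace (omitted here).
> >> >>            */
> >> >>           return ret;
> >> >>   }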
> >> >
> >> > I'm not really following here. It's claimed above that no actual state
> >> > change occurs within the host interrupt controller, but that's exactly
> >> > what disable_irq_nosync() intends to do: mask the interrupt line at the
> >> > controller.
> >>
> >> While it seems that way on the surface (and this tripped me up
> >> originally), the actual call chain is:
> >>
> >>   disable_irq_nosync()
> >>     __disable_irq_nosync()
> >>       __disable_irq()
> >>         irq_disable()
> >>
> >> Inside irq_disable(), __irq_disable() is gated on
> >> irq_settings_disable_unlazy(). The lazy disable is intended to *not*
> >> touch the interrupt controller itself; instead, lazy mode masks the
> >> interrupt at the device level (DisINTx+ registers). If the IRQ is set
> >> up to run in lazy mode, the interrupt is not disabled at the actual
> >> interrupt controller by disable_irq_nosync().
> >
> > What chip handler are you using? The comment above irq_disable()
> > reiterates the behavior: yes, if the chip doesn't implement irq_disable,
> > the interrupt is marked disabled but the hardware is left unmasked. It
> > does not describe using DisINTx to mask the device, which would be at a
> > different level from the chip. In this case __irq_disable() just calls
> > irq_state_set_disabled(). Only with the change proposed here would we
> > also call mask_irq().
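> >
> > For reference, the relevant logic in kernel/irq/chip.c looks roughly like
> > this (paraphrased; details vary by kernel version):
> >
> >   static void __irq_disable(struct irq_desc *desc, bool mask)
> >   {
> >           if (irqd_irq_disabled(&desc->irq_data)) {
> >                   if (mask)
> >                           mask_irq(desc);
> >           } else {
> >                   irq_state_set_disabled(desc);
> >                   if (mask) {
> >                           irq_state_set_masked(desc);
> >                           mask_irq(desc);
> >                   }
> >           }
> >   }
> >
> >   void irq_disable(struct irq_desc *desc)
> >   {
> >           /* lazy unless IRQ_DISABLE_UNLAZY is set on the line */
> >           __irq_disable(desc, irq_settings_disable_unlazy(desc));
> >   }
> >
> > i.e., without the unlazy flag the line is only marked disabled; the
> > hardware mask is deferred until the flow handler (e.g.
> > handle_fasteoi_irq()) sees a new interrupt arrive on the disabled line
> > and masks it and marks it pending at that point.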
>
> This is all tested on a POWER XIVE controller, i.e. arch/powerpc/sysdev/xive (CONFIG_PPC_XIVE_NATIVE=y).
>
> >> > Disabling the lazy optimization, as proposed here, should only
> >> > change the behavior such that the interrupt is masked at the
> >> > call to disable_irq_nosync() rather than at a subsequent
> >> > re-assertion of the interrupt. In any case, enable_irq() should
> >> > mark the line enabled and reenable the controller if necessary.
> >>
> >> If the interrupt was not disabled at the controller, then reenabling
> >> a level interrupt is not guaranteed to actually do anything (although
> >> it *might*). The interrupt controller hardware will still "see" an
> >> active level assertion for which it already fired an interrupt that
> >> software never acknowledged (or disabled/re-enabled), and it may then
> >> decide not to re-raise the IRQ, on a platform-specific basis.
> >>
> >> The key here is that the interrupt controllers differ somewhat in
> >> behavior across various architectures. On POWER, the controller will
> >> only raise the external processor interrupt once for each level
> >> interrupt when that interrupt changes state to asserted, and will
> >> only re-raise the external processor interrupt once an acknowledge
> >> for that interrupt has been sent to the interrupt controller hardware
> >> while the level interrupt is deasserted. As a result, if the
> >> interrupt handler executes (acknowledging the interrupt), but does
> >> not first clear the interrupt on the device itself, the interrupt
> >> controller will never re-raise that interrupt -- from its
> >> perspective, it has issued another IRQ (because the device level
> >> interrupt was left asserted) and the associated handler has never
> >> completed. Disabling the interrupt causes the controller to reassert
> >> the interrupt if the level interrupt is still asserted when the
> >> interrupt is reenabled at the controller level.
> >
> > This sounds a lot more like the problem than the previous description.
>
> Apologies for that. There are a lot of moving parts here, and I guess I muddled it all up in the first description.
>
> > Is the actual scenario something like the irq is marked disabled, the
> > eventfd is delivered to userspace, userspace handles the device, the
> > interrupt is de-asserted at the device, but then the device re-asserts
> > the interrupt before the unmask ioctl, causing the interrupt chip to
> > mask the interrupt, then enable_irq() from the unmask ioctl doesn't
> > reassert the interrupt?
>
> That is exactly it, yes. This particular scenario only occurs on old
> pre-PCI 2.3 devices that advertise DisINTx-; newer devices that
> advertise DisINTx+ don't trip the fault, since the host interrupt
> handler sets the device interrupt mask (thus deasserting the IRQ at
> the interrupt controller) before exiting.
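>
> For reference, the unmask ioctl ends up in roughly this branch of
> vfio_pci_intx_unmask_handler() (paraphrased; exact structure varies by
> kernel version):
>
>   if (vdev->ctx[0].masked && !vdev->virq_disabled) {
>           if (vdev->pci_2_3) {
>                   /* DisINTx+: clear the mask at the device */
>                   if (!pci_check_and_unmask_intx(pdev))
>                           ret = 1;
>           } else {
>                   /* DisINTx-: re-enable the host IRQ line */
>                   enable_irq(pdev->irq);
>           }
>           vdev->ctx[0].masked = (ret > 0);
>   }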
>
> >> On other platforms the external processor interrupt itself is
> >> disabled until the interrupt handler has finished, and the
> >> controller doesn't auto-mask the level interrupts at the hardware
> >> level; instead, it will happily re-assert the processor interrupt
> >> if the interrupt was not cleared at the device level after IRQ
> >> acknowledge. I suspect on those platforms this bug may be masked
> >> at the expense of a bunch of "spurious" / unwanted interrupts if
> >> the interrupt handler hasn't acked the interrupt at the device
> >> level; as long as the guest interrupt handler is able to somewhat
> >> rapidly clear the device interrupt, performance won't be impacted
> >> too much by the extra interrupt load, further hiding the bug on
> >> these platforms.
> >
> > It seems this is the trade-off the lazy handling makes
> > intentionally: we risk some spurious interrupts while the line is
> > disabled to avoid poking the hardware. So long as we're not
> > triggering the spurious interrupt handling to permanently disable
> > the interrupt line, this is a valid choice.
>
> At least for some platforms, yes, though it's not exactly clear from
> the documentation whether this is intentional or whether the side
> effect exists at all.
>
> > That's also a consideration for this patch: we're slowing down all
> > non-PCI 2.3 INTx handling for the behavior of this platform. Is
> > there some behavior of the interrupt controller we can query to
> > know which path to use? Can this issue be seen with the irqfd INTx
> > handling disabled on other architectures? Do we actually care
> > whether we're making old INTx devices slower? Thanks,
>
> I honestly don't know. In reality, we're talking about amd64 and
> ppc64el as the only two platforms that I know of that both support
> VFIO and such old pre-PCI 2.3 INTx devices, and their respective
> interrupt controller handling in a spurious INTx IRQ context is very
> different between the two.
>
> Personally, I'd argue that such old devices were intended to work
> with much slower host systems, so the slowdown probably doesn't
> matter versus being more correct in terms of interrupt handling.
> In terms of general kernel design, my understanding has always been
> that best practice is to mask, disable, or clear a level interrupt
> before exiting the associated IRQ handler, and the current design
> seems to violate that rule. In that context, I'd personally want to
> see an argument as to why eschewing this traditional IRQ handler
> design is beneficial enough to justify making the VFIO driver
> dependent on platform-specific behavior.
Yep, I kind of agree. The unlazy flag seems to provide the intended
behavior here. It moves the irq chip masking into the fast path,
whereas previously it would have happened asynchronously on a
subsequent interrupt, but the impact is only to ancient devices
operating in INTx mode, so as long as we can verify those still work
on both ppc and x86, I don't think it's worth complicating the code to
make setting the unlazy flag conditional on anything other than the
device support.
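For the DisINTx- case I'd expect something along these lines in
vfio_intx_enable()/vfio_intx_disable() (sketch only; exact placement
and field names depend on the current code):

  /* in vfio_intx_enable(), before request_irq() of vfio_intx_handler: */
  if (!vdev->pci_2_3)
          irq_set_status_flags(pdev->irq, IRQ_DISABLE_UNLAZY);

  /* in vfio_intx_disable(), after free_irq(): */
  if (!vdev->pci_2_3)
          irq_clear_status_flags(pdev->irq, IRQ_DISABLE_UNLAZY);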
Care to send out a new version documenting the actual sequence fixed by
this change and updating the code based on this thread? Note that we
can test non-PCI 2.3 mode for any device/driver that supports INTx
using the nointxmask=1 option for vfio-pci and booting a Linux guest
with pci=nomsi.
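For example (assuming the device is already bound to vfio-pci):

  # host: disable PCI 2.3 style INTx masking, forcing the exclusive,
  # non-DisINTx handling path for all vfio-pci devices
  modprobe vfio-pci nointxmask=1

  # guest: append to the kernel command line so drivers fall back to INTx
  pci=nomsi

Thanks,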
Alex