RFC: issues concerning the next NAPI interface

Tue Aug 28 21:21:09 EST 2007

Hi

On Monday 27 August 2007 23:02, David Miller wrote:
> But there are huger fish to fry for you I think.  Talk to your
> platform maintainers and ask for an interface for obtaining
> a flat static distribution of interrupts to cpus in order to
> support multiqueue NAPI better.
> 
> In your previous postings you made arguments saying that the
> automatic placement of interrupts to cpus made everything
> bunch of to a single cpu and you wanted to propagate the
> NAPI work to other cpu's software interrupts from there.
> 
> That logic is bogus, because it merely proves that the hardware
> interrupt distribution is broken.  If it's a bad cpu to run
> software interrupts on, it's also a bad cpu to run hardware
> interrupts on.
> 

As already mentioned some mails were mixed up here. To clarify the interrupt
issue that has nothing to do with the reduction of interrupts:
- Interrupts are distributed the round robin way on IBM POWER6 processors
- Interrupt distribution can be modified by user/daemons (smp_affinity, pinning)
- NAPI is scheduled on CPU where interrupt is catched
- NAPI polls on that CPU as long as poll has packets to process (default)
(David please correct if there is a misunderstanding here)

Problem for multi queue driver with interrupt distribution scheme set to
round robin for this simple example:
Assuming we have 2 SLOW CPUs. CPU_1 is under heavy load (applications). CPU_2
is not under heavy load. Now we receive a lot of packets (high load situation).
Receive queue 1 (RQ1) is scheduled on CPU_1. NAPI-Poll does not manage to empty
RQ1 ever, so it stays on CPU_1. The second receive queue (RQ2) is scheduled on
CPU_2. As that CPU is not under heavy load, RQ2 can be emptied, and the next IRQ
for RQ2 will go to CPU_1. Then both RQs are on CPU_1 and will stay there if
no IRQ is forced at some time as both RQs are never emptied completely.

This is a simplified example to demonstrate the problem. The interrupt scheme
is not bogus, it is just an effect you see if you don't use pinning.
The user can avoid this problem by pinning the interrupts to CPUs.

As pointed out by David, it will be too expensive to schedule NAPI poll on
a different CPU than the one that gets the IRQ. So I guess one solution 
is to "force" an HW interrupt when two many RQs are processed on the same CPU
(when no IRQ pinning is used). This is something the driver has to handle.

Regards,
Jan-Bernd