[PATCH v3 0/9] powerpc/xive: Map one IPI interrupt per node

Fri Apr 2 04:14:07 AEDT 2021

On 4/1/21 2:45 PM, Greg Kurz wrote:
> On Thu, 1 Apr 2021 11:18:10 +0200
> Cédric Le Goater <clg at kaod.org> wrote:
> 
>> Hello,
>>
>> On 4/1/21 10:04 AM, Greg Kurz wrote:
>>> On Wed, 31 Mar 2021 16:45:05 +0200
>>> Cédric Le Goater <clg at kaod.org> wrote:
>>>
>>>>
>>>> Hello,
>>>>
>>>> ipistorm [*] can be used to benchmark the raw interrupt rate of an
>>>> interrupt controller by measuring the number of IPIs a system can
>>>> sustain. When applied to the XIVE interrupt controller of POWER9 and
>>>> POWER10 systems, a significant drop of the interrupt rate can be
>>>> observed when crossing the second node boundary.
>>>>
>>>> This is due to the fact that a single IPI interrupt is used for all
>>>> CPUs of the system. The structure is shared and the cache line updates
>>>> impact greatly the traffic between nodes and the overall IPI
>>>> performance.
>>>>
>>>> As a workaround, the impact can be reduced by deactivating the IRQ
>>>> lockup detector ("noirqdebug") which does a lot of accounting in the
>>>> Linux IRQ descriptor structure and is responsible for most of the
>>>> performance penalty.
>>>>
>>>> As a fix, this proposal allocates an IPI interrupt per node, to be
>>>> shared by all CPUs of that node. This solves the scaling issue: the IRQ
>>>> lockup detector still has an impact, but the XIVE interrupt rate scales
>>>> linearly. It also improves the "noirqdebug" case, as shown in the
>>>> tables below. 
>>>>
>>>
>>> As explained by David and others, NUMA nodes happen to match sockets
>>> with current POWER CPUs but these are really different concepts. NUMA
>>> is about CPU memory access latency,
>>
>> This is exactly our problem. We have cache issues because hw threads
>> on different chips are trying to access the same structure in memory.
>> It happens on virtual platforms and baremetal platforms. This is not
>> restricted to pseries.
>>
> 
> Ok, I get it... the XIVE HW accesses structures in RAM, just like HW threads
> do, so the closer, the better. 

No. That's another problem, related to the XIVE internal tables, which
should be allocated on the chip where they are "mostly" used.

The problem is much simpler. As the commit log says:

 This is due to the fact that a single IPI interrupt is used for all
 CPUs of the system. The structure is shared and the cache line updates
 impact greatly the traffic between nodes and the overall IPI
 performance.

So, we have multiple threads competing for the same IRQ descriptor and
overloading the PowerBUS with cache line update synchronization.
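
To make the contention concrete, here is a minimal user-space sketch in C.
It is not the kernel code from this series. The names (ipi_desc,
ipi_desc_shared, ipi_desc_per_node, cpu_to_node_id), the topology constants
and the 128-byte cache line size are assumptions chosen for illustration.
It only contrasts one descriptor written by every CPU with one descriptor
per node, each placed on its own cache line.

```c
#include <assert.h>
#include <stdalign.h>

#define CACHE_LINE_BYTES 128  /* assumption: 128-byte cache lines */
#define NR_NODES         4    /* hypothetical topology */
#define CPUS_PER_NODE    8
#define NR_CPUS          (NR_NODES * CPUS_PER_NODE)

/* Hypothetical stand-in for the accounting fields of an IRQ descriptor. */
struct ipi_desc {
    alignas(CACHE_LINE_BYTES) unsigned long count;
};

/* Before: a single descriptor, written by every CPU of every node. */
static struct ipi_desc global_ipi;

/* After: one descriptor per node, each on its own cache line. */
static struct ipi_desc node_ipi[NR_NODES];

/* Hypothetical CPU-to-node mapping: CPUs are numbered node by node. */
static int cpu_to_node_id(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    return cpu / CPUS_PER_NODE;
}

/* All CPUs dirty the same cache line, so it bounces between nodes. */
static struct ipi_desc *ipi_desc_shared(int cpu)
{
    (void)cpu;
    return &global_ipi;
}

/* Only CPUs of the same node dirty a given cache line. */
static struct ipi_desc *ipi_desc_per_node(int cpu)
{
    return &node_ipi[cpu_to_node_id(cpu)];
}

/* Single-threaded sketch of the per-interrupt accounting (no atomics). */
static void account_ipi(struct ipi_desc *desc)
{
    desc->count++;
}
```

With the shared layout, a write from any CPU invalidates the line on every
other node. With the per-node layout, updates stay within the node that
owns the descriptor, which is the effect the series is after.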

> This definitely looks NUMA related indeed. So
> yes, the idea of having the XIVE HW only access local in-RAM data when
> handling IPIs between vCPUs in the same NUMA node makes sense.

Yes. That's the goal.

> What is less clear is the exact role of ibm,chip-id actually. This is
> currently used on PowerNV only to pick up a default target on the same
> "chip" as the source if possible. What is the detailed motivation behind
> this?

The "ibm,chip-id" issue is extra noise and not a requirement for this 
patchset.

>>> while in the case of XIVE you
>>> really need to identify a XIVE chip localized in a given socket.
>>>
>>> PAPR doesn't know about sockets, only cores. In other words, a PAPR
>>> compliant guest sees all vCPUs like they all sit in a single socket.
>>
>> There are also NUMA nodes on PAPR.
>>
> 
> Yes, but nothing prevents a NUMA node from spanning multiple sockets,
> or several NUMA nodes from sitting within the same socket, even if this
> isn't the case in practice with current POWER hardware.

Yes. A NUMA node could even be a PCI adapter attached to storage.
I don't know what to say. Maybe we are missing a concept.

>>> Same for the XIVE. Trying to introduce a concept of socket, either
>>> by hijacking OPAL's ibm,chip-id or NUMA node ids, is a kind of
>>> spec violation in this context. If the user cares for locality of
>>> the vCPUs and XIVE on the same socket, then it should bind vCPU
>>> threads to host CPUs from the same socket in the first place.
>>
>> Yes. That's a must-have, of course. You need to reflect the real HW
>> topology in the guest or LPAR if you are after performance, or 
>> restrict the virtual machine to be on a single socket/chip/node.  
>>
>> And this is not only a XIVE problem. XICS has the same problem with
>> a shared single IPI interrupt descriptor but XICS doesn't scale well 
>> by design, so it doesn't show.
>>
>>
>>> Isn't this enough to solve the performance issues this series
>>> wants to fix, without the need for virtual socket ids?
>> What are virtual socket ids? A new concept?
>>
> 
> For now, we have virtual CPUs identified by a virtual CPU id.
> It thus seems natural to speak of a virtual socket id, but
> anyway, the wording isn't really important here and you
> don't answer the question ;-)

If, on the hypervisor, you restrict the virtual machine vCPUs to be
on a single POWER processor/chip, there is no problem. But large
KVM guests or PowerVM LPARs do exist on 16-socket (16s) systems.

C.