[PATCH v3 2/2] cxl: Enable global TLBIs for cxl contexts

Fri Sep 8 17:34:39 AEST 2017

On 08/09/2017 at 08:56, Nicholas Piggin wrote:
> On Sun,  3 Sep 2017 20:15:13 +0200
> Frederic Barrat <fbarrat at linux.vnet.ibm.com> wrote:
> 
>> The PSL and nMMU need to see all TLB invalidations for the memory
>> contexts used on the adapter. For the hash memory model, it is done by
>> making all TLBIs global as soon as the cxl driver is in use. For
>> radix, we need something similar, but we can refine and only convert
>> to global the invalidations for contexts actually used by the device.
>>
>> The new mm_context_add_copro() API increments the 'active_cpus' count
>> for the contexts attached to the cxl adapter. As soon as there's more
>> than 1 active cpu, the TLBIs for the context become global. Active cpu
>> count must be decremented when detaching to restore locality if
>> possible and to avoid overflowing the counter.
>>
>> The hash memory model support is somewhat limited, as we can't
>> decrement the active cpus count when mm_context_remove_copro() is
>> called, because we can't flush the TLB for a mm on hash. So TLBIs
>> remain global on hash.
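
The counting rule described in the commit message above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: `struct mm_model`, `model_add_copro()`, `model_remove_copro()` and `model_tlbi_is_global()` are hypothetical stand-ins, not the kernel's definitions, and the TLB flush that must accompany a detach on radix is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model only: illustrative names, not the kernel's types. */
enum mmu_model { MMU_HASH, MMU_RADIX };

struct mm_model {
	enum mmu_model mmu;
	atomic_int active_cpus;	/* CPUs plus attached coprocessor contexts */
};

static void mm_model_init(struct mm_model *mm, enum mmu_model mmu)
{
	mm->mmu = mmu;
	atomic_init(&mm->active_cpus, 1);	/* assumption: one CPU runs the mm */
}

/* Attaching a context to the adapter counts as one more "active cpu". */
static void model_add_copro(struct mm_model *mm)
{
	atomic_fetch_add(&mm->active_cpus, 1);
}

/*
 * On radix, drop the count so that TLBIs can become local again.
 * On hash the count is left untouched, because the mm cannot be
 * flushed there, so TLBIs stay global.
 */
static void model_remove_copro(struct mm_model *mm)
{
	if (mm->mmu != MMU_RADIX)
		return;

	int old = atomic_fetch_sub(&mm->active_cpus, 1);
	assert(old > 1);	/* a remove must pair with an earlier add */
	(void)old;
}

/* TLBIs are broadcast as soon as more than one "cpu" is active. */
static bool model_tlbi_is_global(struct mm_model *mm)
{
	return atomic_load(&mm->active_cpus) > 1;
}
```

In this model a radix context goes local, then global on attach, then local again on detach, while a hash context stays global after the first attach.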
> 
> Sorry I didn't look at this earlier and just wading in here a bit, but
> what do you think of using mmu notifiers for invalidating nMMU and
> coprocessor caches, rather than put the details into the host MMU
> management? npu-dma.c already looks to have almost everything covered
> with its notifiers (in that it wouldn't have to rely on tlbie coming
> from host MMU code).
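
The shape of the notifier-based alternative suggested above can be sketched as a small userspace model. Everything here is hypothetical (`struct inval_notifier`, `notifier_register()`, `host_invalidate_range()`, `struct copro_listener`); the kernel's real interface is the mmu notifier API, which this does not reproduce. The point is only that the host invalidation path calls registered listeners without knowing what kind of unit sits behind each one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a per-mm invalidation listener. */
struct inval_notifier {
	void (*invalidate_range)(struct inval_notifier *n,
				 unsigned long start, unsigned long end);
	struct inval_notifier *next;
};

struct notif_mm {
	struct inval_notifier *notifiers;	/* singly linked list of listeners */
};

static void notifier_register(struct notif_mm *mm, struct inval_notifier *n)
{
	assert(n->invalidate_range != NULL);
	n->next = mm->notifiers;
	mm->notifiers = n;
}

/*
 * Host invalidation path: the CPU-side TLB invalidation would happen
 * here, then every listener is told which range to drop.
 */
static void host_invalidate_range(struct notif_mm *mm,
				  unsigned long start, unsigned long end)
{
	for (struct inval_notifier *n = mm->notifiers; n; n = n->next)
		n->invalidate_range(n, start, end);
}

/* Example listener: a coprocessor driver recording what it was asked to drop. */
struct copro_listener {
	struct inval_notifier notifier;	/* first member, so the cast below is valid */
	unsigned long ranges_seen;
	unsigned long last_start;
	unsigned long last_end;
};

static void copro_invalidate(struct inval_notifier *n,
			     unsigned long start, unsigned long end)
{
	struct copro_listener *c = (struct copro_listener *)n;

	c->ranges_seen++;
	c->last_start = start;
	c->last_end = end;
}
```

Such a scheme only works if each listener has a software-visible way to invalidate its unit.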

Does npu-dma.c really do MMIO nMMU invalidations? My understanding was
that those atsd_launch operations are really targeted at the device
behind the NPU, i.e. the Nvidia card.
At some point, it was not possible to do MMIO invalidations on the nMMU,
at least on DD1. I'm checking with the nMMU team on the status for DD2.

Alistair: is your code really doing an nMMU invalidation? Considering
you're trying to also reuse the mm_context_add_copro() from this patch, 
I think I know the answer.

There are also other components relying on broadcast invalidations
from hardware: the PSL (for capi FPGA) and the XSL on the Mellanox CX5 
card, when in capi mode. They rely on hardware TLBIs, snooped and 
forwarded to them by the CAPP.
For the PSL, we do have an MMIO interface to do targeted invalidations,
but it was removed from the capi architecture (and left as a debug 
feature for our PSL implementation), because the nMMU would be out of 
sync with the PSL (due to the lack of interface to sync the nMMU, as 
mentioned above).
For the XSL on the Mellanox CX5, it's even more complicated. AFAIK, they 
do have a way to trigger invalidations through software, though the 
interface is private and Mellanox would have to be involved. They've 
also stated the performance is much worse through software invalidation.

Another consideration is performance: which is faster, hardware
broadcast or software invalidation? Without real numbers, it's hard
to know for sure.

So the road of getting rid of hardware invalidations for external 
components, if at all possible or even desirable, may be long.

   Fred

> This change is not too bad today, but if we get to more complicated
> MMU/nMMU TLB management like directed invalidation of particular units,
> then putting more knowledge into the host code will end up being
> complex I think.
> 
> I also want to do optimizations on the core code that assume we
> only have to take care of other CPUs, e.g.,
> 
> https://patchwork.ozlabs.org/patch/811068/
> 
> Or, another example, directed IPI invalidations from the mm_cpumask
> bitmap.
> 
> I realize you want to get something merged! For the merge window and
> backports this seems fine. I think it would be nice soon afterwards to
> get nMMU knowledge out of the core code... Though I also realize that,
> with our tlbie instruction doing everything, it may be tricky to
> make a really optimal notifier.
> 
> Thanks,
> Nick
> 
>>
>> Signed-off-by: Frederic Barrat <fbarrat at linux.vnet.ibm.com>
>> Fixes: f24be42aab37 ("cxl: Add psl9 specific code")
>> ---
>> Changelog:
>> v3: don't decrement active cpus count with hash, as we don't know how to flush
>> v2: Replace flush_tlb_mm() by the new flush_all_mm() to flush the TLBs
>> and PWCs (thanks to Ben)
>