[Skiboot] [PATCH skiboot] npu2: Invalidate entire TCE cache if many entries requested
Alexey Kardashevskiy
aik at ozlabs.ru
Tue Aug 20 16:39:47 AEST 2019
On 20/08/2019 15:21, Alistair Popple wrote:
> On Monday, 19 August 2019 4:17:48 PM AEST Alexey Kardashevskiy wrote:
>> Turned out invalidating entries in NPU TCE cache is so slow that it
>> becomes visible when running a 30+GB guest with GPU+NVlink2 passed
>> through; a 100GB guest takes about 20s to map all 100GB.
>>
>> This falls through to invalidating the entire cache if more than 128
>> TCEs were requested to be invalidated; this reduces the 20s above to
>> less than 1s. The KVM change [1] is required to see this difference.
>>
>> The threshold of 128 is chosen in an attempt not to affect performance
>> much, as it is not clear how expensive it is to repopulate the TCE
>> cache; all we know for sure is that mapping the guest produces
>> invalidation requests of 512 TCEs each.
>>
>> Note TCE cache invalidation in PHB4 is faster and does not require
>> the same workaround.
>
> Do you know why PHB4 is so much faster? I suspect it is because the NPU is
> still doing SCOM reads/writes (which are themselves indirect accesses) to
> perform the TCE invalidate rather than direct MMIO. Does the problem go away
> if you use out_be64(npu->regs + NPU2_ATS_TCE_KILL, ...) instead of
> npu2_write(...)?
Doing this:
- npu2_write(npu, NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
+ out_be64(npu->regs + NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
brings the 10s down to 3-4s for a 100GB guest. Invalidating the entire
cache is still far faster.
>
> - Alistair
>
>> [1] KVM: PPC: vfio/spapr_tce: Split out TCE invalidation from TCE updates
>> https://patchwork.ozlabs.org/patch/1149003/
>> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
>> ---
>> hw/npu2.c | 17 ++++++++++++-----
>> 1 file changed, 12 insertions(+), 5 deletions(-)
>>
>> diff --git a/hw/npu2.c b/hw/npu2.c
>> index 1dba8bb00f85..40583e343a98 100644
>> --- a/hw/npu2.c
>> +++ b/hw/npu2.c
>> @@ -1277,12 +1277,19 @@ static int64_t npu2_tce_kill(struct phb *phb, uint32_t kill_type,
>> return OPAL_PARAMETER;
>> }
>>
>> - while (npages--) {
>> - val = SETFIELD(NPU2_ATS_TCE_KILL_PENUM, dma_addr, pe_number);
>> - npu2_write(npu, NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
>> - dma_addr += tce_size;
>> + if (npages < 128) {
>> + while (npages--) {
>> + val = SETFIELD(NPU2_ATS_TCE_KILL_PENUM, dma_addr, pe_number);
>> + npu2_write(npu, NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
>> + dma_addr += tce_size;
>> + }
>> + break;
>> }
>> - break;
>> + /*
>> + * For too many TCEs, do not bother with the loop above and simply
>> + * flush everything; it is going to be a lot faster.
>> + */
>> + /* Fall through */
>> case OPAL_PCI_TCE_KILL_PE:
>> /*
>> * NPU2 doesn't support killing a PE so fall through
>>
--
Alexey