[Skiboot] [PATCH skiboot] npu2: Invalidate entire TCE cache if many entries requested

Tue Aug 20 15:21:55 AEST 2019

On Monday, 19 August 2019 4:17:48 PM AEST Alexey Kardashevskiy wrote:
> It turned out that invalidating entries in the NPU TCE cache is so slow
> that it becomes visible when running a 30+GB guest with GPU+NVlink2
> passed through; a 100GB guest takes about 20s to map all 100GB.
> 
> This falls through to invalidating the entire cache if 128 or more
> TCEs are requested for invalidation; this reduces the 20s above to
> less than 1s. The KVM change [1] is required to see this difference.
> 
> The threshold of 128 is chosen in an attempt not to affect performance much
> as it is not clear how expensive it is to populate the TCE cache again;
> all we know for sure is that mapping the guest produces invalidation
> requests of 512 TCEs each.
> 
> Note TCE cache invalidation in PHB4 is faster and does not require
> the same workaround.

Do you know why PHB4 is so much faster? I suspect it is because the NPU is
still doing SCOM reads/writes (which are themselves indirect accesses) to
perform the TCE invalidate rather than direct MMIO. Does the problem go away
if you use out_be64(npu->regs + NPU2_ATS_TCE_KILL, ...) instead of
npu2_write(...)?

- Alistair

> [1] KVM: PPC: vfio/spapr_tce: Split out TCE invalidation from TCE updates
> https://patchwork.ozlabs.org/patch/1149003/
> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
> ---
>  hw/npu2.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/npu2.c b/hw/npu2.c
> index 1dba8bb00f85..40583e343a98 100644
> --- a/hw/npu2.c
> +++ b/hw/npu2.c
> @@ -1277,12 +1277,19 @@ static int64_t npu2_tce_kill(struct phb *phb, uint32_t kill_type,
>  			return OPAL_PARAMETER;
>  		}
>  
> -		while (npages--) {
> -			val = SETFIELD(NPU2_ATS_TCE_KILL_PENUM, dma_addr, pe_number);
> -			npu2_write(npu, NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
> -			dma_addr += tce_size;
> +		if (npages < 128) {
> +			while (npages--) {
> +				val = SETFIELD(NPU2_ATS_TCE_KILL_PENUM, dma_addr, pe_number);
> +				npu2_write(npu, NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
> +				dma_addr += tce_size;
> +			}
> +			break;
>  		}
> -		break;
> +		/*
> +		 * For too many TCEs do not bother with the loop above and simply
> +		 * flush everything, going to be lot faster.
> +		 */
> +		/* Fall through */
>  	case OPAL_PCI_TCE_KILL_PE:
>  		/*
>  		 * NPU2 doesn't support killing a PE so fall through
>