[Skiboot] [PATCH skiboot] npu2: Invalidate entire TCE cache if many entries requested
Alistair Popple
alistair at popple.id.au
Tue Aug 20 16:44:01 AEST 2019
On Tuesday, 20 August 2019 4:39:47 PM AEST Alexey Kardashevskiy wrote:
>
> On 20/08/2019 15:21, Alistair Popple wrote:
> > On Monday, 19 August 2019 4:17:48 PM AEST Alexey Kardashevskiy wrote:
> >> It turned out that invalidating entries in the NPU TCE cache is so slow
> >> that it becomes visible when running a 30+GB guest with GPU+NVLink2 passed
> >> through; a 100GB guest takes about 20s to map all 100GB.
> >>
> >> This falls through to the entire cache invalidation if 128 or more
> >> TCEs were requested to invalidate; this reduces the 20s above to
> >> less than 1s. The KVM change [1] is required to see this difference.
> >>
> >> The threshold of 128 is chosen in an attempt not to affect performance
> >> much, as it is not clear how expensive it is to populate the TCE cache again;
> >> all we know for sure is that mapping the guest produces invalidation
> >> requests of 512 TCEs each.
> >>
> >> Note TCE cache invalidation in PHB4 is faster and does not require
> >> the same workaround.
> >
> > Do you know why PHB4 is so much faster? I suspect it is because the NPU is
> > still doing SCOM read/writes (which itself is an indirect access) to perform
> > the TCE invalidate rather than direct MMIO. Does the problem go away if you
> > use out_be64(npu->regs + NPU2_ATS_TCE_KILL, ...) instead of npu2_write(...)?
>
>
> Doing this:
>
> - npu2_write(npu, NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
> + out_be64(npu->regs + NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
>
> brings 10sec down to 3-4sec for a 100GB guest. Invalidating the entire cache
> is still way faster.
Good to know, thanks.
In that case this does look like the best option given TCE based access should
be rare anyway so:
Reviewed-by: Alistair Popple <alistair at popple.id.au>
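
For anyone reading along, the behaviour under review can be sketched outside
skiboot roughly as below. This is only an illustration under stated
assumptions, not the skiboot code: struct fake_npu, fake_kill_one(),
fake_kill_all() and TCE_KILL_THRESHOLD are hypothetical names that count
register writes instead of touching hardware. Only the "npages < 128" test is
taken from the patch.

```c
#include <stdint.h>

/* Hypothetical constant mirroring the 128 used in the patch. */
#define TCE_KILL_THRESHOLD 128

/*
 * Hypothetical stand-in for the NPU: it only counts how many
 * "kill one" and "kill all" writes would have been issued.
 */
struct fake_npu {
	uint64_t kill_one_writes;
	uint64_t kill_all_writes;
	uint64_t last_dma_addr;
};

/* Models a single-entry invalidate (one register write per TCE). */
static void fake_kill_one(struct fake_npu *npu, uint64_t dma_addr)
{
	npu->kill_one_writes++;
	npu->last_dma_addr = dma_addr;
}

/* Models invalidating the entire TCE cache with one write. */
static void fake_kill_all(struct fake_npu *npu)
{
	npu->kill_all_writes++;
}

/*
 * Small requests are invalidated entry by entry; requests of
 * TCE_KILL_THRESHOLD pages or more flush the whole cache instead.
 */
static void tce_kill_pages(struct fake_npu *npu, uint64_t dma_addr,
			   uint32_t tce_size, uint32_t npages)
{
	if (npages < TCE_KILL_THRESHOLD) {
		while (npages--) {
			fake_kill_one(npu, dma_addr);
			dma_addr += tce_size;
		}
		return;
	}

	fake_kill_all(npu);
}
```

With this model, a request for 512 TCEs (the size the commit message says
guest mapping produces) costs one write rather than 512. What the model does
not capture is the cost of repopulating the cache afterwards, which is the
open question behind the choice of threshold.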
>
>
>
>
> >
> > - Alistair
> >
> >> [1] KVM: PPC: vfio/spapr_tce: Split out TCE invalidation from TCE updates
> >> https://patchwork.ozlabs.org/patch/1149003/
> >> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
> >> ---
> >> hw/npu2.c | 17 ++++++++++++-----
> >> 1 file changed, 12 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/hw/npu2.c b/hw/npu2.c
> >> index 1dba8bb00f85..40583e343a98 100644
> >> --- a/hw/npu2.c
> >> +++ b/hw/npu2.c
> >> @@ -1277,12 +1277,19 @@ static int64_t npu2_tce_kill(struct phb *phb, uint32_t kill_type,
> >>              return OPAL_PARAMETER;
> >>          }
> >>
> >> -        while (npages--) {
> >> -            val = SETFIELD(NPU2_ATS_TCE_KILL_PENUM, dma_addr, pe_number);
> >> -            npu2_write(npu, NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
> >> -            dma_addr += tce_size;
> >> +        if (npages < 128) {
> >> +            while (npages--) {
> >> +                val = SETFIELD(NPU2_ATS_TCE_KILL_PENUM, dma_addr, pe_number);
> >> +                npu2_write(npu, NPU2_ATS_TCE_KILL, NPU2_ATS_TCE_KILL_ONE | val);
> >> +                dma_addr += tce_size;
> >> +            }
> >> +            break;
> >>          }
> >> -        break;
> >> +        /*
> >> +         * For too many TCEs do not bother with the loop above and simply
> >> +         * flush everything, going to be lot faster.
> >> +         */
> >> +        /* Fall through */
> >>      case OPAL_PCI_TCE_KILL_PE:
> >>          /*
> >>           * NPU2 doesn't support killing a PE so fall through
> >>