[Skiboot] [PATCH skiboot] npu2: Allow ATSD for LPAR other than 0

Alexey Kardashevskiy aik at ozlabs.ru
Wed Dec 12 11:52:48 AEDT 2018



On 12/12/2018 11:31, Alistair Popple wrote:
> On Tuesday, 11 December 2018 6:07:41 PM AEDT Stewart Smith wrote:
>> Alexey Kardashevskiy <aik at ozlabs.ru> writes:
>>> On 11/12/2018 13:53, Alistair Popple wrote:
>>>> On Wednesday, 5 December 2018 3:52:22 PM AEDT Alexey Kardashevskiy wrote:
>>>>> Each XTS MMIO ATSD# register is accompanied by another register -
>>>>> XTS MMIO ATSD0 LPARID# - which controls LPID filtering for ATSD
>>>>> transactions.
>>>>>
>>>>> When a host system passes a GPU through to a guest, we need to enable
>>>>> some ATSD for an LPAR. At the moment the host assigns one ATSD to
>>>>> a NVLink bridge and this maps it to an LPAR when GPU is assigned to
>>>>> the LPAR. The link number is used for an ATSD index.
>>>>>
>>>>> ATSD6&7 stay mapped to the host (LPAR=0) all the time which seems to be
>>>>> acceptable price for the simplicity.
>>>>>
>>>>> Signed-off-by: Alexey Kardashevskiy <aik at ozlabs.ru>
>>>>> ---
>>>>>
>>>>>  include/npu2-regs.h |  2 ++
>>>>>  hw/npu2.c           | 22 ++++++++++++++++++----
>>>>>  2 files changed, 20 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/include/npu2-regs.h b/include/npu2-regs.h
>>>>> index 8273b2b..ae5e225 100644
>>>>> --- a/include/npu2-regs.h
>>>>> +++ b/include/npu2-regs.h
>>>>> @@ -547,6 +547,8 @@ void npu2_scom_write(uint64_t gcid, uint64_t
>>>>> scom_base,
>>>>>
>>>>>  #define NPU2_XTS_MMIO_ATSD5_LPARID		NPU2_REG_OFFSET(NPU2_STACK_MISC,
>>>>>
>>>>> NPU2_BLOCK_XTS, 0x128) #define
>>>>> NPU2_XTS_MMIO_ATSD6_LPARID		NPU2_REG_OFFSET(NPU2_STACK_MISC,
>>>>> NPU2_BLOCK_XTS, 0x130) #define
>>>>> NPU2_XTS_MMIO_ATSD7_LPARID		NPU2_REG_OFFSET(NPU2_STACK_MISC,
>>>>> NPU2_BLOCK_XTS, 0x138) +#define   NPU2_XTS_MMIO_ATSD_MSR_HV		
> PPC_BIT(51)
>>>>> +#define   NPU2_XTS_MMIO_ATSD_LPARID		PPC_BITMASK(52,63)
>>>>>
>>>>>  #define NPU2_XTS_BDF_MAP			NPU2_REG_OFFSET(NPU2_STACK_MISC,
>>>>
>>>> NPU2_BLOCK_XTS,
>>>>
>>>>> 0x4000) #define   NPU2_XTS_BDF_MAP_VALID		PPC_BIT(0)
>>>>>
>>>>>  #define   NPU2_XTS_BDF_MAP_UNFILT		PPC_BIT(1)
>>>>>
>>>>> diff --git a/hw/npu2.c b/hw/npu2.c
>>>>> index 41563b4..767306f 100644
>>>>> --- a/hw/npu2.c
>>>>> +++ b/hw/npu2.c
>>>>> @@ -2255,9 +2255,14 @@ static int opal_npu_map_lpar(uint64_t phb_id,
>>>>> uint64_t bdf, uint64_t lparid, struct phb *phb = pci_get_phb(phb_id);
>>>>>
>>>>>  	struct npu2 *p;
>>>>>  	struct npu2_dev *ndev = NULL;
>>>>>
>>>>> -	uint64_t xts_bdf_lpar, rc = OPAL_SUCCESS;
>>>>> +	uint64_t xts_bdf_lpar, atsd_lpar, rc = OPAL_SUCCESS;
>>>>>
>>>>>  	int i;
>>>>>  	int id;
>>>>>
>>>>> +	static uint64_t atsd_lpar_regs[] = {
>>>>> +		NPU2_XTS_MMIO_ATSD0_LPARID, NPU2_XTS_MMIO_ATSD1_LPARID,
>>>>> +		NPU2_XTS_MMIO_ATSD2_LPARID, NPU2_XTS_MMIO_ATSD3_LPARID,
>>>>> +		NPU2_XTS_MMIO_ATSD4_LPARID, NPU2_XTS_MMIO_ATSD5_LPARID,
>>>>> +		NPU2_XTS_MMIO_ATSD6_LPARID, NPU2_XTS_MMIO_ATSD7_LPARID };
>>>>>
>>>>>  	if (!phb || phb->phb_type != phb_type_npu_v2)
>>>>>  	
>>>>>  		return OPAL_PARAMETER;
>>>>>
>>>>> @@ -2297,11 +2302,20 @@ static int opal_npu_map_lpar(uint64_t phb_id,
>>>>> uint64_t bdf, uint64_t lparid, xts_bdf_lpar =
>>>>> SETFIELD(NPU2_XTS_BDF_MAP_LPARID, xts_bdf_lpar, lparid); xts_bdf_lpar =
>>>>> SETFIELD(NPU2_XTS_BDF_MAP_LPARSHORT, xts_bdf_lpar, id);
>>>>>
>>>>> -	/* Need to find an NVLink to send the ATSDs for this device over */
>>>>> +	/*
>>>>> +	 * Need to find an NVLink to send the ATSDs for this device over.
>>>>> +	 * Also, the host allocates an ATSD per NVLink, enable filtering now.
>>>>> +	 */
>>>>> +	atsd_lpar = SETFIELD(NPU2_XTS_MMIO_ATSD_LPARID, 0, lparid);
>>>>> +	if (!lparid)
>>>>> +		atsd_lpar = SETFIELD(NPU2_XTS_MMIO_ATSD_MSR_HV, atsd_lpar, 1);
>>>>> +
>>>>>
>>>>>  	for (i = 0; i < p->total_devices; i++) {
>>>>>  	
>>>>>  		if (p->devices[i].nvlink.gpu_bdfn == bdf) {
>>>>>
>>>>> -			ndev = &p->devices[i];
>>>>> -			break;
>>>>> +			if (!ndev)
>>>>> +				ndev = &p->devices[i];
>>>>> +			if (i < ARRAY_SIZE(atsd_lpar_regs))
>>>>> +				npu2_write(p, atsd_lpar_regs[i], atsd_lpar);
>>>>
>>>> I'm not sure I really like this as ATSD resources should be allocated and
>>>> passed through to guests by the hypervisor independently of the NVLink
>>>> devices themselves, and a single ATSD register can serve more than one
>>>> device.> 
>>> Sure it can serve more, and all passed ATSDs are assembled in one vPHB
>>> property and the guest then does not make distinction between them.  I
>>> just pass maximum 6 of 8 ATSDs but until recently it was just a single
>>> ATSD per PHB for all devices and yet nobody complained.
>>>
>>> The problem here is that I need to mmap() ATSD to QEMU, then map them to
>>> KVM and then present this somehow via the device tree.
>>>
>>> Now, I can do mmap() on a fd such as a VFIO device fd or a IOMMU
>>> container fd.  If I choose the dev fd, VFIO in QEMU can mmap it, store
>>> the pointer in the QEMU PCI device, and SPAPR vPHB can put atsd property
>>> in vPHB. If I choose the container fd, then VFIO in QEMU can mmap it but
>>> SPAPR vPHB has no concept of IOMMU/VFIO groups/containers and I'd rather
>>> leave like this.
>>>
>>>> This also creates an implicit assumption that the order of NVLink devices
>>>> in p->devices[] matches ATSD register numbers in some way that the
>>>> hypervisor assumes, which seems like it would be especially brittle as
>>>> devices[] is essentially unordered.
>>>
>>> The order does not matter here, all that matters is that the order does
>>> not change - any ATSD works with any NVLink bridge on that NPU PHB.
>>>
>>>> How does the hypervisor know which of the 8 ATSD registers has been
>>>> assigned to the LPARID when calling opal_npu_map_lpar?
>>>
>>> One ATSD register is exposed to the userspace via mmap() on a
>>> "corresponding" NVLink bridge VFIO device fd, register index == link
>>> index. An NVLink device belongs to an IOMMU group. A KVM pointer (i.e.
>>> LPID) is set to the IOMMU group from an ioctl() called by QEMU.
> 
> Right, but doesn't the hypervisor have to know which of the 8 ATSD registers 
> to pass through to mmap to the guest? Only one of them gets mapped to the LPAR 
> in opal_npu_map_lpar(), so how do you work out which one to mmap?


You got me there :) It is "ibm,npu-link-index" in the hypervisor/vfio
but nothing says the index from npu2::devices[index] is the same thing,
I need this conversion here as "index" is not visible to the host system :-/



> 
>>>> It seems like you might need a
>>>> seperate OPAL call for this along the lines of
>>>> npu2_map_mmio_atsd(atsd_index, lparid).
>>>
>>> I do not see what it buys us. I need an allocator on the host or skiboot
>>> anyway, either static or dynamic, dynamic seems excessive as the
>>> devices[] order or the number of ATSDs do not change.
>>
>> So, of course I spot discussion after having hit the merge
>> button[1]. Maybe I should care/look closer at all the NPU2 things?
> 
> I could of course review things in a more timely and consistent manner :-)
> 
>> (it'd help if there was a sane way to test things that I could plug into
>> op-test)
>>
>> Any recommendation what I should do from here, is keeping the patch okay?
> 
> I thought it was ok but after re-reading the discussion I'm less convinced 
> again.

Yeah, Stewart needs to remove the patch for now.



> 
> - Alistair
> 
>> [1] not an actual button.
> 
> 

-- 
Alexey


More information about the Skiboot mailing list