[PATCH v3 1/3] PCI: Introduce pcibios_ignore_alignment_request

Alexey Kardashevskiy aik at ozlabs.ru
Mon Jun 3 18:35:01 AEST 2019



On 03/06/2019 15:02, Alexey Kardashevskiy wrote:
> 
> 
> On 03/06/2019 12:23, Shawn Anastasio wrote:
>>
>>
>> On 5/30/19 10:56 PM, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 31/05/2019 08:49, Shawn Anastasio wrote:
>>>> On 5/29/19 10:39 PM, Alexey Kardashevskiy wrote:
>>>>>
>>>>>
>>>>> On 28/05/2019 17:39, Shawn Anastasio wrote:
>>>>>>
>>>>>>
>>>>>> On 5/28/19 1:27 AM, Alexey Kardashevskiy wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 28/05/2019 15:36, Oliver wrote:
>>>>>>>> On Tue, May 28, 2019 at 2:03 PM Shawn Anastasio <shawn at anastas.io>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Introduce a new pcibios function pcibios_ignore_alignment_request
>>>>>>>>> which allows the PCI core to defer to platform-specific code to
>>>>>>>>> determine whether or not to ignore alignment requests for PCI
>>>>>>>>> resources.
>>>>>>>>>
>>>>>>>>> The existing behavior is to simply ignore alignment requests when
>>>>>>>>> PCI_PROBE_ONLY is set. This is behavior is maintained by the
>>>>>>>>> default implementation of pcibios_ignore_alignment_request.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Shawn Anastasio <shawn at anastas.io>
>>>>>>>>> ---
>>>>>>>>>     drivers/pci/pci.c   | 9 +++++++--
>>>>>>>>>     include/linux/pci.h | 1 +
>>>>>>>>>     2 files changed, 8 insertions(+), 2 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>>>>>>>> index 8abc843b1615..8207a09085d1 100644
>>>>>>>>> --- a/drivers/pci/pci.c
>>>>>>>>> +++ b/drivers/pci/pci.c
>>>>>>>>> @@ -5882,6 +5882,11 @@ resource_size_t __weak
>>>>>>>>> pcibios_default_alignment(void)
>>>>>>>>>            return 0;
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>> +int __weak pcibios_ignore_alignment_request(void)
>>>>>>>>> +{
>>>>>>>>> +       return pci_has_flag(PCI_PROBE_ONLY);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>>     #define RESOURCE_ALIGNMENT_PARAM_SIZE COMMAND_LINE_SIZE
>>>>>>>>>     static char
>>>>>>>>> resource_alignment_param[RESOURCE_ALIGNMENT_PARAM_SIZE] = {0};
>>>>>>>>>     static DEFINE_SPINLOCK(resource_alignment_lock);
>>>>>>>>> @@ -5906,9 +5911,9 @@ static resource_size_t
>>>>>>>>> pci_specified_resource_alignment(struct pci_dev *dev,
>>>>>>>>>            p = resource_alignment_param;
>>>>>>>>>            if (!*p && !align)
>>>>>>>>>                    goto out;
>>>>>>>>> -       if (pci_has_flag(PCI_PROBE_ONLY)) {
>>>>>>>>> +       if (pcibios_ignore_alignment_request()) {
>>>>>>>>>                    align = 0;
>>>>>>>>> -               pr_info_once("PCI: Ignoring requested alignments
>>>>>>>>> (PCI_PROBE_ONLY)\n");
>>>>>>>>> +               pr_info_once("PCI: Ignoring requested
>>>>>>>>> alignments\n");
>>>>>>>>>                    goto out;
>>>>>>>>>            }
>>>>>>>>
>>>>>>>> I think the logic here is questionable to begin with. If the user
>>>>>>>> has
>>>>>>>> explicitly requested re-aligning a resource via the command line
>>>>>>>> then
>>>>>>>> we should probably do it even if PCI_PROBE_ONLY is set. When it
>>>>>>>> breaks
>>>>>>>> they get to keep the pieces.
>>>>>>>>
>>>>>>>> That said, the real issue here is that PCI_PROBE_ONLY probably
>>>>>>>> shouldn't be set under qemu/kvm. Under the other hypervisor
>>>>>>>> (PowerVM)
>>>>>>>> hotplugged devices are configured by firmware before it's passed to
>>>>>>>> the guest and we need to keep the FW assignments otherwise things
>>>>>>>> break. QEMU however doesn't do any BAR assignments and relies on
>>>>>>>> that
>>>>>>>> being handled by the guest. At boot time this is done by SLOF, but
>>>>>>>> Linux only keeps SLOF around until it's extracted the device-tree.
>>>>>>>> Once that's done SLOF gets blown away and the kernel needs to do
>>>>>>>> it's
>>>>>>>> own BAR assignments. I'm guessing there's a hack in there to make it
>>>>>>>> work today, but it's a little surprising that it works at all...
>>>>>>>
>>>>>>>
>>>>>>> The hack is to run a modified qemu-aware "/usr/sbin/rtas_errd" in the
>>>>>>> guest which receives an event from qemu (RAS_EPOW from
>>>>>>> /proc/interrupts), fetches device tree chunks (and as I understand
>>>>>>> it -
>>>>>>> they come with BARs from phyp but without from qemu) and writes
>>>>>>> "1" to
>>>>>>> "/sys/bus/pci/rescan" which calls pci_assign_resource() eventually:
>>>>>>
>>>>>> Interesting. Does this mean that the PHYP hotplug path doesn't
>>>>>> call pci_assign_resource?
>>>>>
>>>>>
>>>>> I'd expect dlpar_add_slot() to be called under phyp and eventually
>>>>> pci_device_add() which (I think) may or may not trigger later
>>>>> reassignment.
>>>>>
>>>>>
>>>>>> If so it means the patch may not
>>>>>> break that platform after all, though it still may not be
>>>>>> the correct way of doing things.
>>>>>
>>>>>
>>>>> We should probably stop enforcing the PCI_PROBE_ONLY flag - it seems
>>>>> that (unless resource_alignment= is used) the pseries guest should just
>>>>> walk through all allocated resources and leave them unchanged.
>>>>
>>>> If we add a pcibios_default_alignment() implementation like was
>>>> suggested earlier, then it will behave as if the user has
>>>> specified resource_alignment= by default and SLOF's assignments
>>>> won't be honored (I think).
>>>
>>>
>>> I removed pci_add_flags(PCI_PROBE_ONLY) from pSeries_setup_arch and
>>> tried booting with and without pci=resource_alignment= and I can see no
>>> difference - BARs are still aligned to 64K as programmed in SLOF; if I
>>> hack SLOF to align to 4K or 32K - BARs get packed and the guest leaves
>>> them unchanged.
>>>
>>>
>>>> I guess it boils down to one question - is it important that we
>>>> observe SLOF's initial BAR assignments?
>>>
>>> It isn't if it's SLOF but it is if it's phyp. It used to not
>>> allow/support BAR reassignment and even if it does not, I'd rather avoid
>>> touching them.
>>
>> A quick update. I tried removing pci_add_flags(PCI_PROBE_ONLY) which
>> worked, but if I add an implementation of pcibios_default_alignment
>> which simply returns PAGE_SIZE, my VM fails to boot and many errors
>> from the virtio disk driver are printed to the console.
>>
>> After some investigation, it seems that with pcibios_default_alignment
>> present, Linux will reallocate all resources provided by SLOF on
>> boot. I'm still not sure why exactly this causes the virtio driver
>> to fail, but it does indicate that there is a reason to keep
>> SLOF's initial assignments.
>>
>> Anybody have an idea what's causing this?
> 
> With your changes the guest feels the urge to reassign bars (no idea why
> but ok), when it does so, it puts both BARs (one is prefetchable) into
> the 32bit non-prefetchable window of the PHB (SLOF puts the prefetchable
> bar to a 64bit prefetchable window, I have no idea why the guest does it
> different either but this must still work) and then qemu does not
> emulate something properly - unassigned_mem_accepts() is triggered on
> the bar access - no idea why - I am debugging it right now.


Sooo the problem is that resouce::flags has 2 bits to describe 64bit
BARs - PCI_BASE_ADDRESS_MEM_TYPE_64 and IORESOURCE_MEM_64 - and we don't
set IORESOURCE_MEM_64 for 64bit BARs when parsing the fdt.

So the BAR reallocator moves the BAR to 32bit window (which is not
desirable but permitted, I still have to chase it) and then
pci_std_update_resource() writes BAR back but since now it is 32bit BAR,
it does not write to the upper 32bits so that half remains 0x2100, QEMU
does not move BAR to the right window and the MMIO stops working.

Try this in the guest kernel, it seems to keep bars where they were
after slof.


diff --git a/arch/powerpc/kernel/pci_of_scan.c
b/arch/powerpc/kernel/pci_of_scan.c
index 24191ea2d9a7..64ad92016b63 100644
--- a/arch/powerpc/kernel/pci_of_scan.c
+++ b/arch/powerpc/kernel/pci_of_scan.c
@@ -45,6 +45,8 @@ unsigned int pci_parse_of_flags(u32 addr0, int bridge)
        if (addr0 & 0x02000000) {
                flags = IORESOURCE_MEM | PCI_BASE_ADDRESS_SPACE_MEMORY;
                flags |= (addr0 >> 22) & PCI_BASE_ADDRESS_MEM_TYPE_64;
+               if (flags & PCI_BASE_ADDRESS_MEM_TYPE_64)
+                       flags |= IORESOURCE_MEM_64;
                flags |= (addr0 >> 28) & PCI_BASE_ADDRESS_MEM_TYPE_1M;
                if (addr0 & 0x40000000)





-- 
Alexey


More information about the Linuxppc-dev mailing list