[Skiboot] [RFC PATCH] pci: Change bus number assignment policy
Sergey Miroshnichenko
s.miroshnichenko at yadro.com
Thu Mar 14 00:01:42 AEDT 2019
Hello Oliver,
On 3/12/19 9:57 AM, Oliver O'Halloran wrote:
> Change the way we assign bus numbers so that rather than assigning
> bus numbers densely, we spread the available bus numbers across the
> downstream bridges of a given topology.
>
> For PCIe topologies with a wide-fan out this allows hot-plugging
> large numbers of devices into a downstream port near the top of the
> PCIe topology.
>
> This patch changes the bus number assignment policy to evenly split the
> available bus numbers across the bridges on bus. This allows us to
> support hot-adding of large numbers of devices near the root of the
> topology. Currently we only assign a handful of spare buses to each
> hotplug port so if a large number of devices are added at the root
> we won't have enough spare bus numbers for them.
>
> e.g. On a zaius system with a two-switch (HBA switch and drawer switch) topology:
>
> Before:
>
> -+-[0030:00]---00.0-[01-40]----00.0-[02-40]--+-04.0-[03-1e]----00.0-[04-1e]--+-08.0-[05]----00.0
> | | +-09.0-[06]----00.0
> | | +-0a.0-[07-0b]--
> | | +-0b.0-[0c-10]--
> | | +-10.0-[11]----00.0
> | | +-11.0-[12]----00.0
> | | +-12.0-[13-17]--
> | | +-13.0-[18-1c]--
> | | +-15.0-[1d]----00.0
> | | \-16.0-[1e]----00.0
> | +-05.0-[1f-36]----00.0-[20-36]--+-04.0-[21]----00.0
> | | +-05.0-[22]----00.0
> | | +-06.0-[23]----00.0
> | | +-07.0-[24]----00.0
> | | +-0c.0-[25-29]--
> | | +-0d.0-[2a]----00.0
> | | +-0e.0-[2b-2f]--
> | | +-0f.0-[30]----00.0
> | | +-14.0-[31-35]--
> | | \-17.0-[36]----00.0
> | +-06.0-[37-3b]--
> | \-07.0-[3c-40]--
>
> After:
>
> -+-[0030:00]---00.0-[01-fe]----00.0-[02-fd]--+-04.0-[03-41]----00.0-[04-40]--+-08.0-[05-0a]----00.0
> | | +-09.0-[0b-10]----00.0
> | | +-0a.0-[11-16]--
> | | +-0b.0-[17-1c]--
> | | +-10.0-[1d-22]----00.0
> | | +-11.0-[23-28]----00.0
> | | +-12.0-[29-2e]--
> | | +-13.0-[2f-34]--
> | | +-15.0-[35-3a]----00.0
> | | \-16.0-[3b-40]----00.0
> | +-05.0-[42-80]----00.0-[43-7f]--+-04.0-[44-49]----00.0
> | | +-05.0-[4a-4f]--
> | | +-06.0-[50-55]----00.0
> | | +-07.0-[56-5b]----00.0
> | | +-0c.0-[5c-61]--
> | | +-0d.0-[62-67]----00.0
> | | +-0e.0-[68-6d]--
> | | +-0f.0-[6e-73]----00.0
> | | +-14.0-[74-79]--
> | | \-17.0-[7a-7f]----00.0
> | +-06.0-[81-bf]--
> | \-07.0-[c0-fe]--
>
> This does however have the disadvantage that we can't really support
> deep rather than wide topologies. In the above example you can see that
> when we hit the lowest level there is only 5 buses available per port,
> so if we had an architecture where downstream storage was daisy-chained
> we would run out bus numbers pretty quickly.
>
> These are solvable problems, but I figure I should see what people think
> before spending a lot of time on this.
>
> Not-Signed-off-by: Oliver O'Halloran <oohall at gmail.com>
> Cc: Sergey Miroshnichenko <s.miroshnichenko at yadro.com>
> ---
> Sergey, would something like this work for the NVMe drawers you've been
> working with? I think we'll need to support bus-number reassignment at
> eventually, but if we could kick that can down the road a bit it'd be
> helpful.
Reserving more bus numbers per downstream port, covering all the 0-0xff
range, would be definitely helpful for the current state of Linux kernel.
But with the "pci=realloc" command line argument the kernel will
reassign the bus numbers during boot: compactly or based on the
"hpbussize" argument.
I have a patchset for Linux that handles a bridge hot-plugged in the
middle of a dense PCIe tree, we are actively using it internally for
about a year on our Vesnin servers and also Xeons, but we was going to
propose upstreaming it later - after the "Movable BARs" serie. These
patches increment bus numbers of a "tail" of the tree, freeing a space
for the new bridge, while all the working drivers remain working. The
hardest part was rebuilding procfs and sysfs entries and simlinks.
Best regards,
Serge
> ---
> core/pci.c | 44 ++++++++++++++++----------------------------
> 1 file changed, 16 insertions(+), 28 deletions(-)
>
> diff --git a/core/pci.c b/core/pci.c
> index 454b50102e59..d5537b6a376b 100644
> --- a/core/pci.c
> +++ b/core/pci.c
> @@ -743,9 +743,10 @@ uint8_t pci_scan_bus(struct phb *phb, uint8_t bus, uint8_t max_bus,
> bool scan_downstream)
> {
> struct pci_device *pd = NULL, *rc = NULL;
> - uint8_t dev, fn, next_bus, max_sub, save_max;
> + uint8_t dev, fn, next_bus, max_sub;
> uint32_t scan_map;
> bool use_max;
> + int bridges = 0, buses_per_bridge;
>
> /* Decide what to scan */
> scan_map = parent ? parent->scan_map : phb->scan_map;
> @@ -810,7 +811,18 @@ uint8_t pci_scan_bus(struct phb *phb, uint8_t bus, uint8_t max_bus,
>
> next_bus = bus + 1;
> max_sub = bus;
> - save_max = max_bus;
> +
> + list_for_each(list, pd, link)
> + if (pd->is_bridge)
> + bridges++;
> +
> + buses_per_bridge = max_bus - next_bus - 1;
> + if (bridges)
> + buses_per_bridge /= bridges;
> +
> + PCIERR(phb, pd->bdfn, "found %d [%x:%x] downstream bridges, %sscanning down, %d\n",
> + bridges, next_bus, max_bus, scan_downstream ? "" : "not ",
> + buses_per_bridge);
>
> /* Scan down bridges */
> list_for_each(list, pd, link) {
> @@ -819,32 +831,8 @@ uint8_t pci_scan_bus(struct phb *phb, uint8_t bus, uint8_t max_bus,
> if (!pd->is_bridge)
> continue;
>
> - /* We need to figure out a new bus number to start from.
> - *
> - * This can be tricky due to our HW constraints which differ
> - * from bridge to bridge so we are going to let the phb
> - * driver decide what to do. This can return us a maximum
> - * bus number to assign as well
> - *
> - * This function will:
> - *
> - * - Return the bus number to use as secondary for the
> - * bridge or 0 for a failure
> - *
> - * - "max_bus" will be adjusted to represent the max
> - * subordinate that can be associated with the downstream
> - * device
> - *
> - * - "use_max" will be set to true if the returned max_bus
> - * *must* be used as the subordinate bus number of that
> - * bridge (when we need to give aligned powers of two's
> - * on P7IOC). If is is set to false, we just adjust the
> - * subordinate bus number based on what we probed.
> - *
> - */
> - max_bus = save_max;
> - next_bus = phb->ops->choose_bus(phb, pd, next_bus,
> - &max_bus, &use_max);
> + use_max = 1;
> + max_bus = next_bus + buses_per_bridge;
>
> /* Configure the bridge with the returned values */
> if (next_bus <= bus) {
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: OpenPGP digital signature
URL: <http://lists.ozlabs.org/pipermail/skiboot/attachments/20190313/3455d7a6/attachment-0001.sig>
More information about the Skiboot
mailing list