[Skiboot] [RFC PATCH] pci: Change bus number assignment policy

Sergey Miroshnichenko s.miroshnichenko at yadro.com
Thu Mar 14 00:01:42 AEDT 2019


Hello Oliver,

On 3/12/19 9:57 AM, Oliver O'Halloran wrote:
> Change the way we assign bus numbers so that rather than assigning
> bus numbers densely, we spread the available bus numbers across the
> downstream bridges of a given topology.
> 
> For PCIe topologies with a wide-fan out this allows hot-plugging
> large numbers of devices into a downstream port near the top of the
> PCIe topology.
> 
> This patch changes the bus number assignment policy to evenly split the
> available bus numbers across the bridges on bus. This allows us to
> support hot-adding of large numbers of devices near the root of the
> topology. Currently we only assign a handful of spare buses to each
> hotplug port so if a large number of devices are added at the root
> we won't have enough spare bus numbers for them.
> 
> e.g. On a zaius system with a two-switch (HBA switch and drawer switch) topology:
> 
> Before:
> 
> -+-[0030:00]---00.0-[01-40]----00.0-[02-40]--+-04.0-[03-1e]----00.0-[04-1e]--+-08.0-[05]----00.0
>  |                                           |                               +-09.0-[06]----00.0
>  |                                           |                               +-0a.0-[07-0b]--
>  |                                           |                               +-0b.0-[0c-10]--
>  |                                           |                               +-10.0-[11]----00.0
>  |                                           |                               +-11.0-[12]----00.0
>  |                                           |                               +-12.0-[13-17]--
>  |                                           |                               +-13.0-[18-1c]--
>  |                                           |                               +-15.0-[1d]----00.0
>  |                                           |                               \-16.0-[1e]----00.0
>  |                                           +-05.0-[1f-36]----00.0-[20-36]--+-04.0-[21]----00.0
>  |                                           |                               +-05.0-[22]----00.0
>  |                                           |                               +-06.0-[23]----00.0
>  |                                           |                               +-07.0-[24]----00.0
>  |                                           |                               +-0c.0-[25-29]--
>  |                                           |                               +-0d.0-[2a]----00.0
>  |                                           |                               +-0e.0-[2b-2f]--
>  |                                           |                               +-0f.0-[30]----00.0
>  |                                           |                               +-14.0-[31-35]--
>  |                                           |                               \-17.0-[36]----00.0
>  |                                           +-06.0-[37-3b]--
>  |                                           \-07.0-[3c-40]--
> 
> After:
> 
> -+-[0030:00]---00.0-[01-fe]----00.0-[02-fd]--+-04.0-[03-41]----00.0-[04-40]--+-08.0-[05-0a]----00.0
>  |                                           |                               +-09.0-[0b-10]----00.0
>  |                                           |                               +-0a.0-[11-16]--
>  |                                           |                               +-0b.0-[17-1c]--
>  |                                           |                               +-10.0-[1d-22]----00.0
>  |                                           |                               +-11.0-[23-28]----00.0
>  |                                           |                               +-12.0-[29-2e]--
>  |                                           |                               +-13.0-[2f-34]--
>  |                                           |                               +-15.0-[35-3a]----00.0
>  |                                           |                               \-16.0-[3b-40]----00.0
>  |                                           +-05.0-[42-80]----00.0-[43-7f]--+-04.0-[44-49]----00.0
>  |                                           |                               +-05.0-[4a-4f]--
>  |                                           |                               +-06.0-[50-55]----00.0
>  |                                           |                               +-07.0-[56-5b]----00.0
>  |                                           |                               +-0c.0-[5c-61]--
>  |                                           |                               +-0d.0-[62-67]----00.0
>  |                                           |                               +-0e.0-[68-6d]--
>  |                                           |                               +-0f.0-[6e-73]----00.0
>  |                                           |                               +-14.0-[74-79]--
>  |                                           |                               \-17.0-[7a-7f]----00.0
>  |                                           +-06.0-[81-bf]--
>  |                                           \-07.0-[c0-fe]--
> 
> This does however have the disadvantage that we can't really support
> deep rather than wide topologies. In the above example you can see that
> when we hit the lowest level there is only 5 buses available per port,
> so if we had an architecture where downstream storage was daisy-chained
> we would run out bus numbers pretty quickly.
> 
> These are solvable problems, but I figure I should see what people think
> before spending a lot of time on this.
> 
> Not-Signed-off-by: Oliver O'Halloran <oohall at gmail.com>
> Cc: Sergey Miroshnichenko <s.miroshnichenko at yadro.com>
> ---
> Sergey, would something like this work for the NVMe drawers you've been
> working with? I think we'll need to support bus-number reassignment at
> eventually, but if we could kick that can down the road a bit it'd be
> helpful.

Reserving more bus numbers per downstream port, covering all the 0-0xff
range, would be definitely helpful for the current state of Linux kernel.

But with the "pci=realloc" command line argument the kernel will
reassign the bus numbers during boot: compactly or based on the
"hpbussize" argument.

I have a patchset for Linux that handles a bridge hot-plugged in the
middle of a dense PCIe tree, we are actively using it internally for
about a year on our Vesnin servers and also Xeons, but we was going to
propose upstreaming it later - after the "Movable BARs" serie. These
patches increment bus numbers of a "tail" of the tree, freeing a space
for the new bridge, while all the working drivers remain working. The
hardest part was rebuilding procfs and sysfs entries and simlinks.

Best regards,
Serge

> ---
>  core/pci.c | 44 ++++++++++++++++----------------------------
>  1 file changed, 16 insertions(+), 28 deletions(-)
> 
> diff --git a/core/pci.c b/core/pci.c
> index 454b50102e59..d5537b6a376b 100644
> --- a/core/pci.c
> +++ b/core/pci.c
> @@ -743,9 +743,10 @@ uint8_t pci_scan_bus(struct phb *phb, uint8_t bus, uint8_t max_bus,
>  		     bool scan_downstream)
>  {
>  	struct pci_device *pd = NULL, *rc = NULL;
> -	uint8_t dev, fn, next_bus, max_sub, save_max;
> +	uint8_t dev, fn, next_bus, max_sub;
>  	uint32_t scan_map;
>  	bool use_max;
> +	int bridges = 0, buses_per_bridge;
>  
>  	/* Decide what to scan  */
>  	scan_map = parent ? parent->scan_map : phb->scan_map;
> @@ -810,7 +811,18 @@ uint8_t pci_scan_bus(struct phb *phb, uint8_t bus, uint8_t max_bus,
>  
>  	next_bus = bus + 1;
>  	max_sub = bus;
> -	save_max = max_bus;
> +
> +	list_for_each(list, pd, link)
> +		if (pd->is_bridge)
> +			bridges++;
> +
> +	buses_per_bridge = max_bus - next_bus - 1;
> +	if (bridges)
> +		buses_per_bridge /= bridges;
> +
> +	PCIERR(phb, pd->bdfn, "found %d [%x:%x] downstream bridges, %sscanning down, %d\n",
> +		bridges, next_bus, max_bus,  scan_downstream ? "" : "not ",
> +		buses_per_bridge);
>  
>  	/* Scan down bridges */
>  	list_for_each(list, pd, link) {
> @@ -819,32 +831,8 @@ uint8_t pci_scan_bus(struct phb *phb, uint8_t bus, uint8_t max_bus,
>  		if (!pd->is_bridge)
>  			continue;
>  
> -		/* We need to figure out a new bus number to start from.
> -		 *
> -		 * This can be tricky due to our HW constraints which differ
> -		 * from bridge to bridge so we are going to let the phb
> -		 * driver decide what to do. This can return us a maximum
> -		 * bus number to assign as well
> -		 *
> -		 * This function will:
> -		 *
> -		 *  - Return the bus number to use as secondary for the
> -		 *    bridge or 0 for a failure
> -		 *
> -		 *  - "max_bus" will be adjusted to represent the max
> -		 *    subordinate that can be associated with the downstream
> -		 *    device
> -		 *
> -		 *  - "use_max" will be set to true if the returned max_bus
> -		 *    *must* be used as the subordinate bus number of that
> -		 *    bridge (when we need to give aligned powers of two's
> -		 *    on P7IOC). If is is set to false, we just adjust the
> -		 *    subordinate bus number based on what we probed.
> -		 *
> -		 */
> -		max_bus = save_max;
> -		next_bus = phb->ops->choose_bus(phb, pd, next_bus,
> -						&max_bus, &use_max);
> +		use_max = 1;
> +		max_bus = next_bus + buses_per_bridge;
>  
>  		/* Configure the bridge with the returned values */
>  		if (next_bus <= bus) {
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: OpenPGP digital signature
URL: <http://lists.ozlabs.org/pipermail/skiboot/attachments/20190313/3455d7a6/attachment-0001.sig>


More information about the Skiboot mailing list