[RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

Vincent Guittot vincent.guittot at linaro.org
Wed Jul 5 17:52:28 AEST 2023


Le lundi 05 juin 2023 à 10:07:16 (+0200), Tobias Huschle a écrit :
> On 2023-05-16 15:36, Vincent Guittot wrote:
> > On Mon, 15 May 2023 at 13:46, Tobias Huschle <huschle at linux.ibm.com>
> > wrote:
> > > 
> > > The current load balancer implementation implies that scheduler groups,
> > > within the same domain, all host the same number of CPUs. This is
> > > reflected in the condition that a scheduler group which is load
> > > balancing and classified as having spare capacity should pull work
> > > from the busiest group if the local group runs fewer processes than
> > > the busiest one. This implies that these two groups should run the
> > > same number of processes, which is problematic if the groups are not
> > > of the same size.
> > > 
> > > The assumption that scheduler groups within the same scheduler domain
> > > host the same number of CPUs appears to be true for non-s390
> > > architectures. Nevertheless, s390 can have scheduler groups of unequal
> > > size.
> > > 
> > > This introduces a performance degradation in the following scenario:
> > > 
> > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
> > > the remaining 2 are located on another socket:
> > > 
> > > Socket   -----1-----    -2-
> > > CPU      1 2 3 4 5 6    7 8
> > > 
> > > Placing some workload (x = one task) yields the following scenarios:
> > > 
> > > The first 5 tasks are distributed evenly across the two groups.
> > > 
> > > Socket   -----1-----    -2-
> > > CPU      1 2 3 4 5 6    7 8
> > >          x x x          x x
> > > 
> > > Adding a 6th task yields the following distribution:
> > > 
> > > Socket   -----1-----    -2-
> > > CPU      1 2 3 4 5 6    7 8
> > > SMT1     x x x          x x
> > > SMT2                    x
> > 
> > Your description is a bit confusing to me. What you name CPU above
> > should be named core, shouldn't it?
> > 
> > Could you share with us your scheduler topology ?
> > 
> 
> You are correct, it should say core instead of CPU.
> 
> One actual configuration from one of my test machines (lscpu -e):
> 

[...]

> 
> So, 6 cores / 12 CPUs in one group and 2 cores / 4 CPUs in the other.

Thanks for the details.

> 
> If I run stress-ng with 8 CPU stressors on the original code, I get a
> distribution like this:
> 
> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>                 x     x     x     x      x  x  x  x
> 
> Which means that the two cores in the smaller group are running into SMT
> while two cores in the larger group are still idle. This is caused by the
> prefer_sibling path, which really wants to see both groups run the same
> number of tasks.

Yes, and it assumes that there is the same number of CPUs per group.
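
For the 12 CPU / 4 CPU topology above, the numbers make this concrete. A quick
standalone check (plain userspace C, purely illustrative; the variables only
mirror the sum_nr_running and group_weight fields discussed here):

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	/* 8 stressors split 4/4 as in the distribution above */
	unsigned int busiest_nr = 4, busiest_weight = 4;	/* 4-CPU group */
	unsigned int local_nr = 4, local_weight = 12;		/* 12-CPU group */

	/* current prefer_sibling check: assumes equally sized groups */
	bool current = busiest_nr > local_nr + 1;

	/* weighted check from the patch under discussion */
	bool weighted = busiest_nr * local_weight >
			local_nr * busiest_weight + 1;

	/* prints "current=0 weighted=1": only the weighted check lets the
	 * larger (local) group pull a task out of the smaller one */
	printf("current=%d weighted=%d\n", current, weighted);
	return 0;
}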

> 
> > > 
> > > The task is added to the 2nd scheduler group, as the scheduler assumes
> > > that scheduler groups are of the same size, so they should also host
> > > the same number of tasks. This makes core 7 run into SMT, which comes
> > > with a performance penalty. This means that in the window of 6-8
> > > tasks, load balancing is done suboptimally, because SMT is used
> > > although there is no reason to do so, as fully idle cores are still
> > > available.
> > > 
> > > Taking the weight of the scheduler groups into account ensures that
> > > a load balancing CPU within a smaller group will not try to pull tasks
> > > from a bigger group while the bigger group still has idle CPUs
> > > available.
> > > 
> > > Signed-off-by: Tobias Huschle <huschle at linux.ibm.com>
> > > ---
> > >  kernel/sched/fair.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 48b6f0ca13ac..b1307d7e4065 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> > >          * group's child domain.
> > >          */
> > >         if (sds.prefer_sibling && local->group_type == group_has_spare &&
> > > -           busiest->sum_nr_running > local->sum_nr_running + 1)
> > > +           busiest->sum_nr_running * local->group_weight >
> > > +                       local->sum_nr_running * busiest->group_weight + 1)


What you want to test here is that moving 1 task from the busiest to the local
group would help and balance the ratio of tasks per CPU:

(busiest->sum_nr_running - 1) / busiest->group_weight >= (local->sum_nr_running + 1) / local->group_weight

which can be developed into

busiest->sum_nr_running * local->group_weight >= local->sum_nr_running * busiest->group_weight + busiest->group_weight + local->group_weight

and you also have to change how we calculate the imbalance, which currently
just provides half of the diff of nr_running, by something like

(busiest->sum_nr_running * local->group_weight - local->sum_nr_running * busiest->group_weight) / (busiest->group_weight + local->group_weight)
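
Running the same example numbers through that condition and imbalance formula
(again just a standalone userspace sketch, nothing here is kernel code):

#include <stdio.h>

int main(void)
{
	/* 12-CPU local group with 4 tasks, 4-CPU busiest group with 4 tasks */
	unsigned int busiest_nr = 4, busiest_weight = 4;
	unsigned int local_nr = 4, local_weight = 12;

	/* (busiest_nr - 1) / busiest_weight >= (local_nr + 1) / local_weight,
	 * rewritten without integer division */
	int force_balance = busiest_nr * local_weight >=
			    local_nr * busiest_weight +
			    busiest_weight + local_weight;

	/* tasks to move so both groups end up with the same tasks-per-CPU
	 * ratio; note the difference is computed before dividing */
	unsigned int imbalance =
		(busiest_nr * local_weight - local_nr * busiest_weight) /
		(busiest_weight + local_weight);

	/* prints "force_balance=1 imbalance=2": moving 2 tasks gives
	 * 2/4 = 0.5 tasks per CPU in both groups */
	printf("force_balance=%d imbalance=%u\n", force_balance, imbalance);
	return 0;
}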

> > 
> > This is the prefer_sibling path. Could it be that you should disable
> > prefer_sibling between your sockets for such a topology? The default
> > path compares the number of idle CPUs when groups have spare capacity.
> > 
> > 
> 
> If I disable prefer_sibling (I played around with it a bit), I run into the
> problem that the tasks are distributed such that each group has the same
> number of idle CPUs, which yields distributions similar to this:
> 
> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>     x  x  x     x  x     x     x  x
> 
> Now both groups have 4 idle CPUs, which fulfills the criteria imposed by the
> load balancer, but the larger group is now running SMT while the smaller one
> is just idle.
> 
> So, in this asymmetric setup, both criteria seem not to work in an optimal
> way. Going for the same number of idle CPUs or alternatively for the same
> number of running processes both cause a sub-optimal distribution of tasks,
> leading to unnecessary SMT.

There is the same behavior and assumption here too.
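
To put rough numbers on that (illustrative only, same 12 CPU / 4 CPU topology):
with 8 tasks, the "same number of idle CPUs" criterion is already satisfied by
the 8/0 split shown above, whereas a split that keeps the tasks-per-CPU ratio
equal would be 6/2 and would avoid SMT:

#include <stdio.h>

int main(void)
{
	unsigned int tasks = 8, big = 12, small = 4;

	/* the 8/0 split above: both groups report 4 idle CPUs */
	printf("idle CPUs: big=%u small=%u\n", big - tasks, small);

	/* a split proportional to group size would avoid SMT entirely
	 * in this example: 6 tasks on 6 cores, 2 tasks on 2 cores */
	printf("proportional split: big=%u small=%u\n",
	       tasks * big / (big + small), tasks * small / (big + small));
	return 0;
}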


> 
> It also seems to be possible to address the regular load balancing path by
> aiming to have the same unused capacity between groups instead of the same
> number of idle CPUs. This seems to have been considered in the past, but the
> choice went in favor of the number of idle CPUs.

Unused capacity doesn't give the instantaneous state, so a group can be idle
but still show no unused capacity.

> Since this decision was actively taken already, I focused on the
> prefer_sibling path.
> 
> The question now would be how to address this properly (or if I'm missing
> something here). As mentioned in the cover letter, this was the most
> simplistic and least invasive approach I could find; others might be more
> sophisticated but also have some side-effects.
> 
> I have a bit of a hard time leaving this one as-is, as it just introduces
> two additional multiplications with no effect for most architectures. Maybe
> an architecture-specific inline function that the compiler can optimize away
> if not needed?
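
Purely as a sketch of that idea (none of these names exist in the kernel; this
is a hypothetical userspace illustration of the shape such a helper could take):

#include <stdbool.h>
#include <stdio.h>

/* an architecture with asymmetric groups (e.g. s390) would define this to 1 */
#ifndef ARCH_HAS_ASYM_SCHED_GROUPS
#define ARCH_HAS_ASYM_SCHED_GROUPS 0
#endif

/* hypothetical helper: falls back to the current comparison, so the two
 * extra multiplications disappear on architectures that don't opt in */
static inline bool prefer_sibling_should_pull(unsigned int busiest_nr,
					      unsigned int busiest_weight,
					      unsigned int local_nr,
					      unsigned int local_weight)
{
	if (!ARCH_HAS_ASYM_SCHED_GROUPS)
		return busiest_nr > local_nr + 1;

	return busiest_nr * local_weight >
	       local_nr * busiest_weight + 1;
}

int main(void)
{
	/* 4 tasks in the 4-CPU group vs 4 tasks in the 12-CPU group:
	 * prints 0 by default, 1 when built with
	 * -DARCH_HAS_ASYM_SCHED_GROUPS=1 */
	printf("%d\n", prefer_sibling_should_pull(4, 4, 4, 12));
	return 0;
}
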
> 
> > >                 goto force_balance;
> > > 
> > >         if (busiest->group_type != group_overloaded) {
> > > --
> > > 2.34.1
> > > 
> 

