[RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

Tobias Huschle huschle at linux.ibm.com
Fri Jul 7 17:45:15 AEST 2023


On 2023-07-06 19:19, Shrikanth Hegde wrote:
> On 5/15/23 5:16 PM, Tobias Huschle wrote:
>> The current load balancer implementation implies that scheduler groups
>> within the same domain all host the same number of CPUs. This is
>> reflected in the condition that a scheduler group which is load
>> balancing and classified as having spare capacity should pull work
>> from the busiest group if the local group runs fewer processes than
>> the busiest one. This implies that these two groups should run the
>> same number of processes, which is problematic if the groups are not
>> of the same size.
>> 
>> The assumption that scheduler groups within the same scheduler domain
>> host the same number of CPUs appears to be true for non-s390
>> architectures. Nevertheless, s390 can have scheduler groups of unequal
>> size.
>> 
>> This introduces a performance degradation in the following scenario:
>> 
>> Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
>> the remaining 2 are located on another socket:
>> 
>> Socket   -----1-----    -2-
>> CPU      1 2 3 4 5 6    7 8
>> 
>> Placing some workload ( x = one task ) yields the following
>> scenarios:
>> 
>> The first 5 tasks are distributed evenly across the two groups.
>> 
>> Socket   -----1-----    -2-
>> CPU      1 2 3 4 5 6    7 8
>>          x x x          x x
>> 
>> Adding a 6th task yields the following distribution:
>> 
>> Socket   -----1-----    -2-
>> CPU      1 2 3 4 5 6    7 8
>> SMT1     x x x          x x
>> SMT2                    x
>> 
>> The task is added to the 2nd scheduler group, as the scheduler assumes
>> that scheduler groups are of the same size and should therefore host
>> the same number of tasks. This forces CPU 7 to run two SMT threads,
>> which comes with a performance penalty. In the window of 6-8 tasks,
>> load balancing is therefore done suboptimally: SMT is used although
>> there is no reason to, as fully idle CPUs are still available.
>> 
>> Taking the weight of the scheduler groups into account ensures that
>> a load balancing CPU within a smaller group will not try to pull tasks
>> from a bigger group while the bigger group still has idle CPUs
>> available.
>> 
>> Signed-off-by: Tobias Huschle <huschle at linux.ibm.com>
>> ---
>>  kernel/sched/fair.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 48b6f0ca13ac..b1307d7e4065 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>>  	 * group's child domain.
>>  	 */
>>  	if (sds.prefer_sibling && local->group_type == group_has_spare &&
>> -	    busiest->sum_nr_running > local->sum_nr_running + 1)
>> +	    busiest->sum_nr_running * local->group_weight >
>> +			local->sum_nr_running * busiest->group_weight + 1)
>>  		goto force_balance;
>> 
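
To spell out what the new condition is meant to do, here is a standalone
restatement of the check with my own numbers plugged in (counting SMT
threads, so the group weights at this level are 12 and 4); this is just
for illustration, not kernel code:

/*
 * Same check as in the hunk above, pulled out so the arithmetic is easy
 * to verify by hand.
 */
int should_force_balance(unsigned int busiest_nr, unsigned int busiest_weight,
			 unsigned int local_nr, unsigned int local_weight)
{
	return busiest_nr * local_weight > local_nr * busiest_weight + 1;
}

/*
 * 6 tasks, 3 per socket; local = big socket (weight 12), busiest = small
 * socket (weight 4):
 *     old: 3 > 3 + 1            -> false, the SMT contention is left alone
 *     new: 3 * 12 > 3 * 4 + 1   -> true,  the big socket pulls the task
 *
 * Other direction, local = small socket with 2 tasks, busiest = big
 * socket with 4 tasks:
 *     old: 4 > 2 + 1            -> true,  the small socket would pull
 *     new: 4 * 4 > 2 * 12 + 1   -> false, the big socket keeps its idle CPUs
 */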
> 
> 
> I assume it's SMT2 here. Each sched group is written as
> [busy_cpus/idle_cpus/weight].
> 
> Let's take an example; we will begin with the said imbalance.
> [3/9/12] -- local group
> [3/1/4] -- busy group.
> We will evaluate 3*12 > (4*(3+1)), which is true, and proceed further
> to balance. But won't calculate_imbalance then come up with zero as
> the imbalance? In the prefer_sibling case it takes the difference of
> the sum_nr_running of the two groups, which is 3-3 = 0.
> 
> This would need modifications to calculate_imbalance.
> Maybe something like this will help? NOT TESTED AT ALL.
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..e027d4086edc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10296,7 +10296,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                         return;
>                 }
> 
> -               if (busiest->group_weight == 1 || sds->prefer_sibling) {
> +               if (busiest->group_weight == 1 ||
> +                       (sds->prefer_sibling &&
> +                        busiest->group_weight == local->group_weight)) {
>                         unsigned int nr_diff = busiest->sum_nr_running;
>                         /*
>                          * When prefer sibling, evenly spread running tasks on
> @@ -10305,6 +10307,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                         env->migration_type = migrate_task;
>                         lsub_positive(&nr_diff, local->sum_nr_running);
>                         env->imbalance = nr_diff;
> +               }
> +               if (sds->prefer_sibling &&
> +                   busiest->group_weight != local->group_weight) {
> +                       env->migration_type = migrate_task;
> +                       env->imbalance = 1;
>                 } else {
> 

I also had a look at this when Vincent pointed out that this part is
missing. The formula proposed by Vincent works pretty well; it is not even
necessary to add additional if-statements here.
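
Just to sketch the direction with your [3/9/12] vs. [3/1/4] example: a plain
nr_diff comes out as 3 - 3 = 0, while a weight-aware calculation can still
see the busy group as more loaded. The helper below is my own illustration
of such a calculation (solving for the number of tasks that roughly
equalizes tasks per CPU), not necessarily the exact formula Vincent
proposed:

/*
 * Pick n so that (busiest_nr - n) / busiest_weight roughly equals
 * (local_nr + n) / local_weight, using cross-multiplication to stay in
 * integer arithmetic.
 */
unsigned int weighted_nr_imbalance(unsigned int busiest_nr,
				   unsigned int busiest_weight,
				   unsigned int local_nr,
				   unsigned int local_weight)
{
	unsigned int scaled_busiest = busiest_nr * local_weight;
	unsigned int scaled_local = local_nr * busiest_weight;

	if (scaled_busiest <= scaled_local)
		return 0;

	return (scaled_busiest - scaled_local) /
	       (busiest_weight + local_weight);
}

For [3/9/12] local and [3/1/4] busiest this gives (3*12 - 3*4) / (4 + 12) = 1
instead of 0, so one task would be moved off the smaller group.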
> 
> ---------------------------------------------------------------------------------------------------
> On a tangential dimension.
> 
> 
> Since prefer_sibling makes the inherent assumption that all groups have
> equal weight, it will cause complications when group weights are
> different. I think that becomes very tricky when there are more than
> two groups. Depending on which CPU is doing the load_balance, there can
> be unnecessary migrations.
> 
> 
> Currently, even in NUMA, this can be a similar case, right? There will
> be an unequal number of CPUs per node. If we fix that case and remove
> prefer_sibling in your arch, will that work?
> 
> Maybe something like this? NOT TESTED AT ALL.
> 
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..fc6377f48ced 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10514,6 +10514,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>                         goto out_balanced;
> 
>                 if (busiest->group_weight > 1 &&
> +                   busiest->group_weight == local->group_weight &&
>                     local->idle_cpus <= (busiest->idle_cpus + 1))
>                         /*
>                          * If the busiest group is not overloaded
> @@ -10526,6 +10527,11 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>                          */
>                         goto out_balanced;
> 
> +               if ((busiest->group_weight != local->group_weight) &&
> +                     (local->idle_cpus * busiest->group_weight <=
> +                              local->group_weight * (busiest->idle_cpus + 1)))
> +                       goto out_balanced;
> +
>                 if (busiest->sum_h_nr_running == 1)
>                         /*
>                          * busiest doesn't have any tasks waiting to run
> 
> 
> 
> 
> 
>>  	if (busiest->group_type != group_overloaded) {

I played around with alternate solutions as well; yours could be interesting
in order to essentially declare the problematic state as balanced. I wouldn't
be opposed to removing prefer_sibling. The question would be whether it is
necessary to address both scenarios anyway.
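
For reference, this is how I read the proposed weighted out_balanced check
with the same notation as above (my own numbers, 6 tasks on the 6+2 machine,
so local = [3/1/4] when the smaller group balances and busiest = [3/9/12]):

/*
 * Standalone restatement of the proposed check; returns non-zero if the
 * local group should consider the state balanced and not pull.
 */
int weighted_out_balanced(unsigned int local_idle, unsigned int local_weight,
			  unsigned int busiest_idle, unsigned int busiest_weight)
{
	return local_idle * busiest_weight <=
	       local_weight * (busiest_idle + 1);
}

/*
 * Smaller group balancing, local = [3/1/4], busiest = [3/9/12]:
 *     1 * 12 <= 4 * (9 + 1)   ->  true, declared balanced, no pull into SMT
 *
 * Bigger group balancing, local = [3/9/12], busiest = [3/1/4]:
 *     9 * 4  <= 12 * (1 + 1)  ->  false, not declared balanced here, so it
 *                                  can go on and pull the task off the
 *                                  SMT sibling
 */

So at least for this scenario the check seems to cover both directions.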



