[RFC PATCH v2 0/3] sched/fair: introduce new scheduler group type group_parked

Tue Feb 18 16:58:38 AEDT 2025

On 2/17/25 17:02, Tobias Huschle wrote:
> Changes to v1
> 
> parked vs idle
> - parked CPUs are now never considered to be idle
> - a scheduler group is now considered parked iff there are parked CPUs
>    and there are no idle CPUs, i.e. all non parked CPUs are busy or there
>    are only parked CPUs. A scheduler group with parked tasks can be
>    considered to not be parked, if it has idle CPUs which can pick up
>    the parked tasks.
> - idle_cpu_without always returns that the CPU will not be idle if the
>    CPU is parked
> 
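
A minimal sketch of the "parked vs idle" rule above, to make the iff condition concrete. The struct and function names are hypothetical and not taken from the patch; it assumes the per-group counts are already available and that parked CPUs are never included in the idle count.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-group counters; names are illustrative only. */
struct grp_counts {
	unsigned int nr_cpus;	/* all CPUs in the group */
	unsigned int nr_parked;	/* CPUs currently marked parked */
	unsigned int nr_idle;	/* idle CPUs; parked CPUs never count as idle */
};

/*
 * A group is parked iff it contains parked CPUs and has no idle CPU
 * left that could pick up the tasks of the parked ones.
 */
static bool grp_is_parked(const struct grp_counts *g)
{
	/* Parked and idle CPUs are disjoint subsets of the group. */
	assert(g->nr_parked + g->nr_idle <= g->nr_cpus);

	return g->nr_parked > 0 && g->nr_idle == 0;
}
```

With these assumptions, a group of four CPUs with two parked and one idle is not parked, while the same group with no idle CPU is.
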
> active balance, no_hz, queuing
> - should_we_balance always returns true if a scheduler group contains
>    a parked CPU and that CPU has a running task
> - stopping the tick on parked CPUs is now prevented in sched_can_stop_tick
>    if a task is running
> - tasks are prevented from being queued on parked CPUs in ttwu_queue_cond
> 
> cleanup
> - removed duplicate checks for parked CPUs
> 
> CPU capacity
> - added a patch which removes parked CPUs and their capacity from
>    scheduler statistics
> 
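
A rough sketch of what "removing parked CPUs and their capacity from scheduler statistics" could look like. This is not the code from patch 2; the types and the helper name are hypothetical, and 1024 is used in the example only because it matches SCHED_CAPACITY_SCALE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU view; not a kernel structure. */
struct cpu_info {
	unsigned long capacity;	/* e.g. 1024 for a full-capacity CPU */
	bool parked;
};

struct grp_totals {
	unsigned int weight;	/* number of usable CPUs */
	unsigned long capacity;	/* summed capacity of usable CPUs */
};

/* Sum weight and capacity over the group, skipping parked CPUs. */
static struct grp_totals grp_sum_unparked(const struct cpu_info *cpus, size_t n)
{
	struct grp_totals t = { 0, 0 };

	assert(cpus != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (cpus[i].parked)
			continue;	/* parked CPUs contribute nothing */
		t.weight++;
		t.capacity += cpus[i].capacity;
	}
	return t;
}
```

The point of the sketch is only that a group with parked CPUs reports a smaller weight and capacity, so load comparisons are made against the CPUs that can actually run tasks.
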
> 
> Original description:
> 
> Adding a new scheduler group type which allows removing all tasks
> from certain CPUs through load balancing can help in scenarios where
> such CPUs are currently unfavorable to use, for example in a
> virtualized environment.
> 
> Functionally, this works as intended. The question is whether this
> could be considered for inclusion and would be worth pursuing
> further. If so, which areas would need additional attention?
> Some cases are referenced below.
> 
> The underlying concept and the approach of adding a new scheduler
> group type were presented in the Sched MC of the 2024 LPC.
> A short summary:
> 
> Some architectures (e.g. s390) provide virtualization on a firmware
> level. This implies that Linux kernels running on such architectures
> run on virtualized CPUs.
> 
> Like in other virtualized environments, the CPUs are most likely shared
> with other guests on the hardware level. This implies that Linux
> kernels running in such an environment may encounter 'steal time'. In
> other words, instead of being able to use all available time on a
> physical CPU, some of said available time is 'stolen' by other guests.
> 
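
For readers who want to observe steal time from inside a guest, here is a small user-space sketch. It is not part of the series. It assumes the usual /proc/stat layout, where the eighth numeric field of a "cpu" line is steal; the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* First eight counters of a /proc/stat "cpu" or "cpuN" line. */
struct cpu_times {
	unsigned long long user, nice, system, idle, iowait, irq, softirq, steal;
};

/* Parse one "cpu..." line; returns 0 on success, -1 on a short line. */
static int parse_cpu_line(const char *line, struct cpu_times *t)
{
	/* %*s skips the "cpu"/"cpuN" label; trailing guest fields are ignored. */
	int n = sscanf(line, "%*s %llu %llu %llu %llu %llu %llu %llu %llu",
		       &t->user, &t->nice, &t->system, &t->idle,
		       &t->iowait, &t->irq, &t->softirq, &t->steal);

	return n == 8 ? 0 : -1;
}

static unsigned long long cpu_total(const struct cpu_times *t)
{
	return t->user + t->nice + t->system + t->idle +
	       t->iowait + t->irq + t->softirq + t->steal;
}

/* Fraction of the interval between two samples that was stolen. */
static double steal_fraction(const struct cpu_times *prev,
			     const struct cpu_times *cur)
{
	unsigned long long dt = cpu_total(cur) - cpu_total(prev);

	assert(cur->steal >= prev->steal);	/* counters only grow */

	return dt ? (double)(cur->steal - prev->steal) / (double)dt : 0.0;
}
```

Sampling the same line twice, a few seconds apart, and feeding both samples to steal_fraction() gives the share of time the hypervisor ran something else.
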
> This can cause side effects if a guest is interrupted at an unfavorable
> point in time or if the guest is waiting for one of its other virtual
> CPUs to perform certain actions while those are suspended in favour of
> another guest.
> 
> Architectures like s390 address this issue by providing an
> alternative classification for the CPUs seen by the Linux kernel.
> 
> The following example is arch/s390 specific:
> In the default mode (horizontal CPU polarization), all CPUs are treated
> equally and can be subject to steal time equally.
> In the alternate mode (vertical CPU polarization), the underlying
> firmware hypervisor assigns the CPUs, visible to the guest, different
> types, depending on how many CPUs the guest is entitled to use. Said
> entitlement is configured by assigning weights to all active guests.
> The three CPU types are:
>      - vertical high   : On these CPUs, the guest always has the highest
>                          priority over other guests. In particular, if
>                          the guest executes tasks on these CPUs, it
>                          will encounter no steal time.
>      - vertical medium : These CPUs are meant to cover fractions of
>                          entitlement.
>      - vertical low    : These CPUs have no priority when being
>                          scheduled. In particular, while all other
>                          guests are using their full entitlement,
>                          these CPUs might not be run for a
>                          significant amount of time.
> 
> As a consequence, using vertical lows while the underlying hypervisor
> experiences a high load, driven by all defined guests, is to be avoided.
> 
> In order to consistently move tasks off of vertical lows, introduce a
> new type of scheduler groups: group_parked.
> Parked implies that processes should be evacuated as fast as possible
> from these CPUs. This implies that other CPUs should start pulling tasks
> immediately, while the parked CPUs should refuse to pull any tasks
> themselves.
> Adding a group type beyond group_overloaded achieves the expected
> behavior. By making its selection architecture dependent, it has
> no effect on architectures which will not make use of that group type.
> 
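
A simplified sketch of why placing the new type beyond group_overloaded gives the described behavior. The existing enumerators follow the order of enum group_type in kernel/sched/fair.c; group_parked is appended as the cover letter describes. The two helpers are hypothetical, and the tie-break on load is a simplification of what the load balancer actually does.

```c
#include <assert.h>

/* Ordered from least to most urgent candidate for "busiest" group. */
enum group_type {
	group_has_spare = 0,
	group_fully_busy,
	group_misfit_task,
	group_smt_balance,
	group_asym_packing,
	group_imbalanced,
	group_overloaded,
	group_parked,		/* new: ranks above everything else */
};

/* Hypothetical, reduced group statistics. */
struct grp_stats {
	enum group_type type;
	unsigned long load;
};

/* Nonzero if cand should replace cur as the busiest group. */
static int busier_than(const struct grp_stats *cand, const struct grp_stats *cur)
{
	assert(cand && cur);

	if (cand->type != cur->type)
		return cand->type > cur->type;	/* higher type always wins */
	return cand->load > cur->load;		/* simplified tie-break */
}

/* A parked local group refuses to pull tasks towards itself. */
static int may_pull(enum group_type local_type)
{
	return local_type != group_parked;
}
```

Under these assumptions a parked group is chosen as busiest even against a heavily loaded group_overloaded one, so other CPUs start pulling from it right away, while a parked group never acts as the destination.
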
> This approach works very well for many kinds of workloads. Tasks are
> getting migrated back and forth in line with changing the parked
> state of the involved CPUs.
> 
> There are a couple of issues and corner cases which need further
> consideration:
> - rt & dl:      Realtime and deadline scheduling require some additional
>                  attention.

I think we need to address at least rt; there would be some non-percpu
kworker threads which need to move off the parked CPUs.

> - ext:          Probably affected as well. Needs some conceptual
>                  thought first.
> - raciness:     Right now, there are no synchronization efforts. It needs
>                  to be considered whether those might be necessary or if
>                  it is alright that the parked-state of a CPU might change
>                  during load-balancing.
> 
> Patches apply to tip:sched/core
> 
> The s390 patch serves as a simplified implementation example.

Gave it a try on powerpc with the debugfs file. It works for
SCHED_NORMAL tasks.

> 
> Tobias Huschle (3):
>    sched/fair: introduce new scheduler group type group_parked
>    sched/fair: adapt scheduler group weight and capacity for parked CPUs
>    s390/topology: Add initial implementation for selection of parked CPUs
> 
>   arch/s390/include/asm/smp.h    |   2 +
>   arch/s390/kernel/smp.c         |   5 ++
>   include/linux/sched/topology.h |  19 ++++++
>   kernel/sched/core.c            |  13 ++++-
>   kernel/sched/fair.c            | 104 ++++++++++++++++++++++++++++-----
>   kernel/sched/syscalls.c        |   3 +
>   6 files changed, 130 insertions(+), 16 deletions(-)
>