[RFC PATCH v2 0/3] sched/fair: introduce new scheduler group type group_parked
Shrikanth Hegde
sshegde at linux.ibm.com
Tue Feb 18 16:58:38 AEDT 2025
On 2/17/25 17:02, Tobias Huschle wrote:
> Changes to v1
>
> parked vs idle
> - parked CPUs are now never considered to be idle
> - a scheduler group is now considered parked iff there are parked CPUs
> and there are no idle CPUs, i.e. all non parked CPUs are busy or there
> are only parked CPUs. A scheduler group with parked tasks can be
> considered to not be parked, if it has idle CPUs which can pick up
> the parked tasks.
> - idle_cpu_without always returns that the CPU will not be idle if the
> CPU is parked
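A minimal sketch of the parked-group rule described above. The struct and its field names are hypothetical and not taken from the patch; only the "parked CPUs present and no idle CPU" condition comes from the changelog.

```c
#include <stdbool.h>

/* Hypothetical per-group counters; the names are illustrative only. */
struct group_counts {
	unsigned int nr_parked; /* CPUs the architecture currently reports as parked */
	unsigned int nr_idle;   /* idle CPUs; a parked CPU is never counted here */
};

/* Parked iff the group has parked CPUs and no idle CPU left to take their tasks. */
static bool group_is_parked(const struct group_counts *gc)
{
	return gc->nr_parked > 0 && gc->nr_idle == 0;
}
```

Under this reading, a group with two parked CPUs and one idle CPU is not parked, because the idle CPU can pick up the parked tasks.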
>
> active balance, no_hz, queuing
> - should_we_balance always returns true if a scheduler group contains
> a parked CPU and that CPU has a running task
> - stopping the tick on parked CPUs is now prevented in sched_can_stop_tick
> if a task is running
> - tasks are now prevented from being queued on parked CPUs in ttwu_queue_cond
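The tick and queuing items can be sketched as two predicates. This is an illustration under assumptions: the struct and function names are hypothetical, and every other condition that the real sched_can_stop_tick and ttwu_queue_cond check is left out.

```c
#include <stdbool.h>

/* Hypothetical snapshot of one CPU; field names are illustrative. */
struct cpu_state {
	bool parked;
	unsigned int nr_running;
};

/* A parked CPU that still runs a task keeps its tick. */
static bool can_stop_tick(const struct cpu_state *cs)
{
	return !(cs->parked && cs->nr_running > 0);
}

/* A wakeup is not queued onto a parked CPU. */
static bool may_queue_wakeup(const struct cpu_state *cs)
{
	return !cs->parked;
}
```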
>
> cleanup
> - removed duplicate checks for parked CPUs
>
> CPU capacity
> - added a patch which removes parked cpus and their capacity from
> scheduler statistics
>
>
> Original description:
>
> Adding a new scheduler group type which allows to remove all tasks
> from certain CPUs through load balancing can help in scenarios where
> such CPUs are currently unfavorable to use, for example in a
> virtualized environment.
>
> Functionally, this works as intended. The question would be, if this
> could be considered to be added and would be worth going forward
> with. If so, which areas would need additional attention?
> Some cases are referenced below.
>
> The underlying concept and the approach of adding a new scheduler
> group type were presented in the Sched MC of the 2024 LPC.
> A short summary:
>
> Some architectures (e.g. s390) provide virtualization on a firmware
> level. This implies, that Linux kernels running on such architectures
> run on virtualized CPUs.
>
> Like in other virtualized environments, the CPUs are most likely shared
> with other guests on the hardware level. This implies, that Linux
> kernels running in such an environment may encounter 'steal time'. In
> other words, instead of being able to use all available time on a
> physical CPU, some of said available time is 'stolen' by other guests.
>
> This can cause side effects if a guest is interrupted at an unfavorable
> point in time or if the guest is waiting for one of its other virtual
> CPUs to perform certain actions while those are suspended in favour of
> another guest.
>
> Architectures, like arch/s390, address this issue by providing an
> alternative classification for the CPUs seen by the Linux kernel.
>
> The following example is arch/s390 specific:
> In the default mode (horizontal CPU polarization), all CPUs are treated
> equally and can be subject to steal time equally.
> In the alternate mode (vertical CPU polarization), the underlying
> firmware hypervisor assigns the CPUs, visible to the guest, different
> types, depending on how many CPUs the guest is entitled to use. Said
> entitlement is configured by assigning weights to all active guests.
> The three CPU types are:
> - vertical high : On these CPUs, the guest has always highest
> priority over other guests. This means
> especially that if the guest executes tasks on
> these CPUs, it will encounter no steal time.
> - vertical medium : These CPUs are meant to cover fractions of
> entitlement.
> - vertical low : These CPUs will have no priority when being
> scheduled. This implies especially, that while
> all other guests are using their full
> entitlement, these CPUs might not be run for a
> significant amount of time.
>
> As a consequence, using vertical lows while the underlying hypervisor
> experiences a high load, driven by all defined guests, is to be avoided.
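One way to picture the resulting policy, as a sketch only: the enum, the function name, and the hypervisor_contended flag are all hypothetical, and the actual s390 patch may select parked CPUs differently.

```c
#include <stdbool.h>

/* Hypothetical labels for the polarization types described above. */
enum cpu_polarization {
	POLARIZATION_HORIZONTAL,
	POLARIZATION_VERTICAL_LOW,
	POLARIZATION_VERTICAL_MEDIUM,
	POLARIZATION_VERTICAL_HIGH,
};

/* Avoid a vertical low CPU only while the hypervisor is under high load. */
static bool cpu_should_be_parked(enum cpu_polarization pol,
				 bool hypervisor_contended)
{
	return pol == POLARIZATION_VERTICAL_LOW && hypervisor_contended;
}
```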
>
> In order to consequently move tasks off of vertical lows, introduce a
> new type of scheduler groups: group_parked.
> Parked implies, that processes should be evacuated as fast as possible
> from these CPUs. This implies that other CPUs should start pulling tasks
> immediately, while the parked CPUs should refuse to pull any tasks
> themselves.
> Adding a group type beyond group_overloaded achieves the expected
> behavior. By making its selection architecture dependent, it has
> no effect on architectures which will not make use of that group type.
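A sketch of why ranking the new type last gives the described behavior, assuming the load balancer prefers the group with the higher-ranked type. The enum is abbreviated (the intermediate types between group_fully_busy and group_overloaded are omitted), and the helper names are hypothetical.

```c
#include <stdbool.h>

/* Abbreviated ordering; intermediate group types are omitted. */
enum group_type {
	group_has_spare = 0,
	group_fully_busy,
	group_overloaded,
	group_parked, /* ranks above group_overloaded */
};

/* The candidate replaces the current busiest group if its type ranks higher. */
static bool is_busier(enum group_type candidate, enum group_type busiest)
{
	return candidate > busiest;
}

/* A parked destination CPU refuses to pull tasks. */
static bool may_pull(bool dst_cpu_parked)
{
	return !dst_cpu_parked;
}
```

On an architecture that never reports a parked CPU, no group is ever classified as group_parked, so the comparison behaves as before.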
>
> This approach works very well for many kinds of workloads. Tasks are
> getting migrated back and forth in line with changing the parked
> state of the involved CPUs.
>
> There are a couple of issues and corner cases which need further
> considerations:
> - rt & dl: Realtime and deadline scheduling require some additional
> attention.
I think we need to address at least rt; there would be some non-percpu
kworker threads which need to move out of parked CPUs.
> - ext: Probably affected as well. Needs some conceptual
> thoughts first.
> - raciness: Right now, there are no synchronization efforts. It needs
> to be considered whether those might be necessary or if
> it is alright that the parked-state of a CPU might change
> during load-balancing.
>
> Patches apply to tip:sched/core
>
> The s390 patch serves as a simplified implementation example.
Gave it a try on powerpc with the debugfs file. It works for
SCHED_NORMAL tasks.
>
> Tobias Huschle (3):
> sched/fair: introduce new scheduler group type group_parked
> sched/fair: adapt scheduler group weight and capacity for parked CPUs
> s390/topology: Add initial implementation for selection of parked CPUs
>
> arch/s390/include/asm/smp.h | 2 +
> arch/s390/kernel/smp.c | 5 ++
> include/linux/sched/topology.h | 19 ++++++
> kernel/sched/core.c | 13 ++++-
> kernel/sched/fair.c | 104 ++++++++++++++++++++++++++++-----
> kernel/sched/syscalls.c | 3 +
> 6 files changed, 130 insertions(+), 16 deletions(-)
>