[PATCH v2 1/2] sched/core: Add option to control whether steal time updates CPU capacity

Srikar Dronamraju srikar at linux.ibm.com
Wed Oct 29 17:07:56 AEDT 2025


At present, the scheduler scales CPU capacity for fair tasks based on the
time spent handling irqs and on steal time. If a CPU sees irq or steal
time, its capacity for fair tasks decreases, causing tasks to migrate to
other CPUs that are not affected by irq or steal time. All of this is
gated by the scheduler feature NONTASK_CAPACITY.
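
For reference, the irq/steal contribution shrinks the capacity left for
fair (CFS) tasks roughly in proportion to the time accounted to the rq's
irq load average. The sketch below is a simplified illustration modelled
on scale_irq_capacity() in kernel/sched/sched.h, not a verbatim copy of
the kernel code; fair_capacity() and its parameters are made up for this
example:

	/*
	 * Simplified sketch: 'cap' is the capacity remaining after RT/DL
	 * utilization, 'irq_avg' the rq's irq+steal load average, 'max'
	 * the CPU's maximum capacity.
	 */
	static unsigned long fair_capacity(unsigned long cap,
					   unsigned long irq_avg,
					   unsigned long max)
	{
		if (irq_avg >= max)
			return 1;	/* CPU almost fully consumed by non-task work */

		/* mirrors scale_irq_capacity(): cap * (max - irq) / max */
		return cap * (max - irq_avg) / max;
	}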

In virtualized setups, a CPU that reports steal time (time taken by the
hypervisor) can cause tasks to migrate unnecessarily to sibling CPUs that
appear less busy, only for the situation to reverse shortly afterwards.

To mitigate this ping-pong behaviour, this change introduces a new static
branch, sched_acct_steal_cap, which controls whether steal time
contributes to the non-task capacity adjustments used for fair
scheduling.
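
With the static branch enabled (the default), behaviour is unchanged;
once disabled, only irq time feeds the non-task capacity adjustment. For
illustration only, a platform that considers its steal time too noisy for
capacity scaling could opt out at init time roughly as below; the
initcall and the my_platform_is_shared_guest() check are hypothetical,
and the real user of this interface is not part of this patch:

	#include <linux/init.h>
	#include <linux/sched/topology.h>

	static int __init my_platform_steal_cap_setup(void)
	{
		/* hypothetical condition, e.g. running as a shared-processor guest */
		if (my_platform_is_shared_guest())
			sched_disable_steal_acct();
		return 0;
	}
	late_initcall(my_platform_steal_cap_setup);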

Signed-off-by: Srikar Dronamraju <srikar at linux.ibm.com>
---
Changelog v1->v2:
v1: https://lkml.kernel.org/r/20251028104255.1892485-1-srikar@linux.ibm.com
Peter suggested using a static branch instead of a sched_feat()

 include/linux/sched/topology.h |  6 ++++++
 kernel/sched/core.c            | 15 +++++++++++++--
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 198bb5cc1774..88e34c60cffd 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -285,4 +285,10 @@ static inline int task_node(const struct task_struct *p)
 	return cpu_to_node(task_cpu(p));
 }
 
+#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
+extern void sched_disable_steal_acct(void);
+#else
+static __always_inline void sched_disable_steal_acct(void) { }
+#endif
+
 #endif /* _LINUX_SCHED_TOPOLOGY_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 81c6df746df1..09884da6b085 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -738,6 +738,14 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 /*
  * RQ-clock updating methods:
  */
+#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
+static DEFINE_STATIC_KEY_TRUE(sched_acct_steal_cap);
+
+void sched_disable_steal_acct(void)
+{
+	static_branch_disable(&sched_acct_steal_cap);
+}
+#endif
 
 static void update_rq_clock_task(struct rq *rq, s64 delta)
 {
@@ -792,8 +800,11 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 	rq->clock_task += delta;
 
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
-	if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
-		update_irq_load_avg(rq, irq_delta + steal);
+	if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY)) {
+		if (steal && static_branch_likely(&sched_acct_steal_cap))
+			irq_delta += steal;
+		update_irq_load_avg(rq, irq_delta);
+	}
 #endif
 	update_rq_clock_pelt(rq, delta);
 }
-- 
2.47.3


