[patch V2 16/20] sched/mmcid: Provide new scheduler CID mechanism

Shrikanth Hegde sshegde at linux.ibm.com
Mon Oct 27 16:11:02 AEDT 2025


Hi Thomas,

On 10/22/25 6:25 PM, Thomas Gleixner wrote:
> The MM CID management has two fundamental requirements:
> 
>    1) It has to guarantee that at no given point in time the same CID is
>       used by concurrent tasks in userspace.
> 
>    2) The CID space must not exceed the number of possible CPUs in a
>       system. While most allocators (glibc, tcmalloc, jemalloc) do not
>       care about that, there seems to be at least some LTTng library
>       depending on it.
> 
> The CID space compaction itself is not a functional correctness
> requirement, it is only a useful optimization mechanism to reduce the
> memory footprint in unused user space pools.
> 

Just wondering: if no user space has requested a CID, should this whole mechanism
be guarded by a static key check to avoid any overhead? See the rough sketch below
of what I mean.

(Trying to understand this change. Very interesting, so please ignore if this is
not applicable.)
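
Something along these lines is what I had in mind - purely illustrative, the
key name and the hook are made up by me and not part of this series:

	#include <linux/jump_label.h>

	/* Hypothetical key, enabled once user space actually asks for a CID */
	DEFINE_STATIC_KEY_FALSE(mm_cid_in_use);

	static inline void mm_cid_sched_in_guarded(void)
	{
		/* Compiles down to a NOP until the key is flipped */
		if (!static_branch_unlikely(&mm_cid_in_use))
			return;
		/* ... the actual CID assignment work would go here ... */
	}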

> The optimal CID space is:
> 
>      min(nr_tasks, nr_cpus_allowed);
> 
> Where @nr_tasks is the number of actual user space threads associated with
> the mm and @nr_cpus_allowed is the superset of all task affinities. It is
> grow-only, as it would be insane to take a racy snapshot of all task
> affinities when the affinity of one task changes, just to redo it 2
> milliseconds later when the next task changes its affinity.
> 
> That means that as long as the number of tasks is lower than or equal to
> the number of CPUs allowed, each task owns a CID. If the number of tasks
> exceeds the number of CPUs allowed, it switches to per CPU mode, where the
> CPUs own the CIDs and the tasks borrow them as long as they are scheduled
> in.
> 
> For transition periods CIDs can go beyond the optimal space as long as they
> don't go beyond the number of possible CPUs.
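
Just to check my understanding of the ownership rules, here is how I read the
mode selection and the CID space bound - a minimal sketch, all names below are
mine and not from the patch:

	#include <linux/minmax.h>

	/*
	 * Illustrative only: @nr_tasks is the number of user space threads
	 * attached to the mm, @nr_cpus_allowed the grow-only superset of
	 * all task affinities.
	 */
	static inline unsigned int mm_cid_optimal_space(unsigned int nr_tasks,
							unsigned int nr_cpus_allowed)
	{
		return min(nr_tasks, nr_cpus_allowed);
	}

	/* Per task ownership as long as every task can own a CID */
	static inline bool mm_cid_task_mode(unsigned int nr_tasks,
					    unsigned int nr_cpus_allowed)
	{
		return nr_tasks <= nr_cpus_allowed;
	}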
> 
> The current upstream implementation tries to keep the CID with the task
> even in overcommit situations, which complicates task migration. It also
> has to do the CID space consolidation work from a task work in the exit to
> user space path. As that work is assigned to a random task related to an
> MM, this can inflict unwanted exit latencies.
> 
> Implement the context switch parts of a strict ownership mechanism to
> address this.
> 
> This removes most of the work from the task which schedules out. Only
> during the transition from per CPU to per task ownership is it required to
> drop the CID when leaving the CPU to prevent CID space exhaustion. Other
> than that, scheduling out is just a single check and branch.
> 
> The task which schedules in has to check whether:
> 
>      1) The ownership mode changed
>      2) The CID is within the optimal CID space
> 
> In stable situations this results in zero work. The only short disruption
> is when ownership mode changes or when the associated CID is not in the
> optimal CID space. The latter only happens when tasks exit and therefore
> the optimal CID space shrinks.
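
If I read the schedule-in fast path correctly, in the stable case it boils
down to two compares and branches, something like the sketch below (only
max_cids is taken from the patch, the other names are mine):

	#include <linux/compiler.h>

	struct mm_cid_state {
		unsigned int	mode;		/* per task vs. per CPU ownership */
		unsigned int	max_cids;	/* current optimal CID space */
	};

	/* Returns true only when the schedule-in slow path has to run */
	static inline bool mm_cid_needs_update(struct mm_cid_state *mc,
					       unsigned int task_mode,
					       unsigned int task_cid)
	{
		/* 1) Did the ownership mode change? */
		if (task_mode != READ_ONCE(mc->mode))
			return true;
		/* 2) Is the CID still within the optimal CID space? */
		return task_cid >= READ_ONCE(mc->max_cids);
	}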
> 
> That mechanism is strictly optimized for the common case where no change
> happens. The only case where it actually causes a temporary one time spike
> is on mode changes, when and only when a lot of tasks related to an MM
> schedule at exactly the same time and eventually have to compete for
> allocating a CID from the bitmap.
> 
> In the sysbench test case which triggered the spinlock contention in the
> initial CID code, __schedule() drops significantly in perf top on a 128
> Core (256 threads) machine when running sysbench with 255 threads, which
> fits into the task mode limit of 256 together with the parent thread:
> 
>    Upstream  rseq/perf branch  +CID rework
>    0.42%     0.37%             0.32%          [k] __schedule
> 
> Increasing the number of threads to 256, which puts the test process into
> per CPU mode, looks about the same.
> 
> Signed-off-by: Thomas Gleixner <tglx at linutronix.de>
> ---

[...]   
> +static inline unsigned int __mm_get_cid(struct mm_struct *mm, unsigned int max_cids)
> +{
> +	unsigned int cid = find_first_zero_bit(mm_cidmask(mm), max_cids);
> +
> +	if (cid >= max_cids)
> +		return MM_CID_UNSET;
> +	if (test_and_set_bit(cid, mm_cidmask(mm)))
> +		return MM_CID_UNSET;
> +	return cid;
> +}
> +
> +static inline unsigned int mm_get_cid(struct mm_struct *mm)
> +{
> +	unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
> +
> +	for (; cid == MM_CID_UNSET; cpu_relax())

This triggers a compile error on ppc64le.

In file included from ./include/vdso/processor.h:10,
                  from ./arch/powerpc/include/asm/processor.h:9,
                  from ./include/linux/sched.h:13,
                  from ./include/linux/sched/affinity.h:1,
                  from kernel/sched/sched.h:8,
                  from kernel/sched/rq-offsets.c:5:
kernel/sched/sched.h: In function ‘mm_get_cid’:
./arch/powerpc/include/asm/vdso/processor.h:26:9: error: expected expression before ‘asm’
    26 |         asm volatile(ASM_FTR_IFCLR(                                     \
       |         ^~~
kernel/sched/sched.h:3615:37: note: in expansion of macro ‘cpu_relax’
  3615 |         for (; cid == MM_CID_UNSET; cpu_relax())


+linuxppc-dev.


Rewriting it as an equivalent while loop seems to make the compiler happy.
Presumably that is because cpu_relax() on powerpc expands to a bare asm
statement rather than an expression, so it cannot be used as the increment
clause of a for loop. Still need to confirm, but thought of informing first.

  kernel/sched/sched.h | 6 ++++--
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 50bc0eb08a66..8a6ad4fc9534 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3612,8 +3612,10 @@ static inline unsigned int mm_get_cid(struct mm_struct *mm)
  {
  	unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
  
-	for (; cid == MM_CID_UNSET; cpu_relax())
-		cid = __mm_get_cid(mm, num_possible_cpus());
+	while (cid == MM_CID_UNSET) {
+		cid = __mm_get_cid(mm, num_possible_cpus());
+		cpu_relax();
+	}
  
  	return cid;
  }




> +		cid = __mm_get_cid(mm, num_possible_cpus());
> +
> +	return cid;
> +}
> +

