[patch V2 16/20] sched/mmcid: Provide new scheduler CID mechanism
Shrikanth Hegde
sshegde at linux.ibm.com
Mon Oct 27 16:11:02 AEDT 2025
Hi Thomas,
On 10/22/25 6:25 PM, Thomas Gleixner wrote:
> The MM CID management has two fundamental requirements:
>
> 1) It has to guarantee that at no point in time is the same CID used by
>    concurrent tasks in userspace.
>
> 2) The CID space must not exceed the number of possible CPUs in a
> system. While most allocators (glibc, tcmalloc, jemalloc) do not
> care about that, there seems to be at least some LTTng library
> depending on it.
>
> The CID space compaction itself is not a functional correctness
> requirement; it is only a useful optimization mechanism to reduce the
> memory footprint in unused user space pools.
>
Just wondering: if there is no user space request for a CID, should this whole
mechanism sit behind a static check to avoid any overhead?
(Trying to understand this change. Very interesting. Ignore if this is not applicable.)
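For context, what I had in mind is gating the hot path behind a static branch
until the first mm_cid user shows up. A rough sketch of the idea only; the key
name, the helpers and the enable site are made up, not taken from the patch:

#include <linux/jump_label.h>
#include <linux/sched.h>

/* Hypothetical key, flipped once the first task actually uses mm_cid */
DEFINE_STATIC_KEY_FALSE(mm_cid_in_use);

static inline void mm_cid_hot_path(struct task_struct *t)
{
	/* Nobody registered for mm_cid yet: skip all CID work */
	if (!static_branch_unlikely(&mm_cid_in_use))
		return;

	/* ... actual CID management would go here ... */
}

/* Hypothetical enable site, e.g. on the first rseq/mm_cid registration */
static void mm_cid_mark_in_use(void)
{
	static_branch_enable(&mm_cid_in_use);
}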
> The optimal CID space is:
>
> min(nr_tasks, nr_cpus_allowed);
>
> Where @nr_tasks is the number of actual user space threads associated to
> the mm and @nr_cpus_allowed is the superset of all task affinities. It is
> grow-only, as it would be insane to take a racy snapshot of all task
> affinities when the affinity of one task changes just to redo it 2
> milliseconds later when the next task changes its affinity.
>
> That means that as long as the number of tasks is less than or equal to the
> number of CPUs allowed, each task owns a CID. If the number of tasks
> exceeds the number of CPUs allowed it switches to per CPU mode, where the
> CPUs own the CIDs and the tasks borrow them as long as they are scheduled
> in.
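To check my understanding with some concrete numbers (my own example, not from
the patch), assume the combined affinity of all threads covers 16 CPUs:

   8 threads: min(8, 16)  =  8  -> per task mode, every thread owns its CID
  16 threads: min(16, 16) = 16  -> still per task mode, CID space fully used
  24 threads: min(24, 16) = 16  -> per CPU mode, the 16 CPUs own the CIDs and
                                   the threads borrow one while scheduled in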
>
> For transition periods CIDs can go beyond the optimal space as long as they
> don't go beyond the number of possible CPUs.
>
> The current upstream implementation tries to keep the CID with the task
> even in overcommit situations, which complicates task migration. It also
> has to do the CID space consolidation work from a task work in the exit to
> user space path. As that work is assigned to a random task related to a MM
> this can inflict unwanted exit latencies.
>
> Implement the context switch parts of a strict ownership mechanism to
> address this.
>
> This removes most of the work from the task which schedules out. Only
> during the transition from per CPU to per task ownership is it required to
> drop the CID when leaving the CPU to prevent CID space exhaustion. Other
> than that scheduling out is just a single check and branch.
>
> The task which schedules in has to check whether:
>
> 1) The ownership mode changed
> 2) The CID is within the optimal CID space
>
> In stable situations this results in zero work. The only short disruption
> is when ownership mode changes or when the associated CID is not in the
> optimal CID space. The latter only happens when tasks exit and therefore
> the optimal CID space shrinks.
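If I read this right, the stable-state schedule-in path is essentially two
compares. A self-contained toy illustration of those two checks; all names and
types below are invented for illustration and are not the patch code:

/* Toy types, not the kernel's: just to spell out the two checks */
enum cid_mode { CID_PER_TASK, CID_PER_CPU };

struct toy_mm {
	enum cid_mode	mode;		/* current ownership mode of the MM */
	unsigned int	max_cids;	/* optimal CID space */
};

struct toy_task {
	enum cid_mode	mode;		/* mode the task's CID was handed out under */
	unsigned int	cid;
};

/* Returns 1 when the task keeps its CID as is (the common zero-work case) */
static int cid_still_fits(const struct toy_task *t, const struct toy_mm *mm)
{
	return t->mode == mm->mode && t->cid < mm->max_cids;
}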
>
> That mechanism is strictly optimized for the common case where no change
> happens. The only case where it actually causes a temporary one-time spike
> is on mode changes, when and only when a lot of tasks related to a MM
> schedule exactly at the same time and eventually have to compete for
> allocating a CID from the bitmap.
>
> In the sysbench test case which triggered the spinlock contention in the
> initial CID code, __schedule() drops significantly in perf top on a 128
> Core (256 threads) machine when running sysbench with 255 threads, which
> fits into the task mode limit of 256 together with the parent thread:
>
>               Upstream   rseq/perf branch   +CID rework
>                  0.42%              0.37%         0.32%   [k] __schedule
>
> Increasing the number of threads to 256, which puts the test process into
> per CPU mode, looks about the same.
>
> Signed-off-by: Thomas Gleixner <tglx at linutronix.de>
> ---
[...]
> +static inline unsigned int __mm_get_cid(struct mm_struct *mm, unsigned int max_cids)
> +{
> +	unsigned int cid = find_first_zero_bit(mm_cidmask(mm), max_cids);
> +
> +	if (cid >= max_cids)
> +		return MM_CID_UNSET;
> +	if (test_and_set_bit(cid, mm_cidmask(mm)))
> +		return MM_CID_UNSET;
> +	return cid;
> +}
> +
> +static inline unsigned int mm_get_cid(struct mm_struct *mm)
> +{
> +	unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
> +
> +	for (; cid == MM_CID_UNSET; cpu_relax())
This triggers a compile error on ppc64le.
In file included from ./include/vdso/processor.h:10,
                 from ./arch/powerpc/include/asm/processor.h:9,
                 from ./include/linux/sched.h:13,
                 from ./include/linux/sched/affinity.h:1,
                 from kernel/sched/sched.h:8,
                 from kernel/sched/rq-offsets.c:5:
kernel/sched/sched.h: In function ‘mm_get_cid’:
./arch/powerpc/include/asm/vdso/processor.h:26:9: error: expected expression before ‘asm’
   26 |         asm volatile(ASM_FTR_IFCLR(                             \
      |         ^~~
kernel/sched/sched.h:3615:37: note: in expansion of macro ‘cpu_relax’
 3615 |         for (; cid == MM_CID_UNSET; cpu_relax())
+linuxppc-dev.
Moving it into a while loop seems to make the compiler happy.
Need to see why, but thought of informing first.
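My best guess at the why: the ppc64 cpu_relax() expands to a bare
asm volatile(...) statement, and the third clause of a for() has to be an
expression, so the statement does not parse in that position (on x86
cpu_relax() ends up as an inline function call, which is an expression, so it
compiles there). A standalone illustration of that guess, not kernel code:

/* A statement-style relax macro, similar in shape to the ppc64 cpu_relax() */
#define relax_stmt()	asm volatile("" ::: "memory")

int spin_down(int x)
{
#if 0	/* does not build: "error: expected expression before 'asm'" */
	for (; x; relax_stmt())
		x--;
#else	/* statement context is fine, hence the while () form builds */
	while (x) {
		x--;
		relax_stmt();
	}
#endif
	return x;
}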
kernel/sched/sched.h | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 50bc0eb08a66..8a6ad4fc9534 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3612,8 +3612,11 @@ static inline unsigned int mm_get_cid(struct mm_struct *mm)
 {
 	unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
 
-	for (; cid == MM_CID_UNSET; cpu_relax())
-		cid = __mm_get_cid(mm, num_possible_cpus());
+	while (cid == MM_CID_UNSET) {
+		cid = __mm_get_cid(mm, num_possible_cpus());
+		if (cid == MM_CID_UNSET)
+			cpu_relax();
+	}
 
 	return cid;
 }
> +		cid = __mm_get_cid(mm, num_possible_cpus());
> +
> +	return cid;
> +}
> +