[RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

Laurent Vivier lvivier at redhat.com
Fri Mar 15 22:12:45 AEDT 2019


On 04/03/2019 20:59, Laurent Vivier wrote:
> When we hotplug a CPU in a memoryless/cpuless node,
> the kernel crashes when it rebuilds the sched_domains data.
> 
> I reproduce this problem on POWER and with a pseries VM, with the following
> QEMU parameters:
> 
>   -machine pseries -enable-kvm -m 8192 \
>   -smp 2,maxcpus=8,sockets=4,cores=2,threads=1 \
>   -numa node,nodeid=0,cpus=0-1,mem=0 \
>   -numa node,nodeid=1,cpus=2-3,mem=8192 \
>   -numa node,nodeid=2,cpus=4-5,mem=0 \
>   -numa node,nodeid=3,cpus=6-7,mem=0
> 
> Then I can trigger the crash by hotplugging a CPU on node-id 3:
> 
>   (qemu) device_add host-spapr-cpu-core,core-id=7,node-id=3
> 
>     Built 2 zonelists, mobility grouping on.  Total pages: 130162
>     Policy zone: Normal
>     WARNING: workqueue cpumask: online intersect > possible intersect
>     BUG: Kernel NULL pointer dereference at 0x00000400
>     Faulting instruction address: 0xc000000000170edc
>     Oops: Kernel access of bad area, sig: 11 [#1]
>     LE SMP NR_CPUS=2048 NUMA pSeries
>     Modules linked in: ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter xts vmx_crypto ip_tables xfs libcrc32c virtio_net net_failover failover virtio_blk virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
>     CPU: 2 PID: 5661 Comm: kworker/2:0 Not tainted 5.0.0-rc6+ #20
>     Workqueue: events cpuset_hotplug_workfn
>     NIP:  c000000000170edc LR: c000000000170f98 CTR: 0000000000000000
>     REGS: c000000003e931a0 TRAP: 0380   Not tainted  (5.0.0-rc6+)
>     MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 22284028  XER: 00000000
>     CFAR: c000000000170f20 IRQMASK: 0
>     GPR00: c000000000170f98 c000000003e93430 c0000000011ac500 c0000001efe22000
>     GPR04: 0000000000000001 0000000000000000 0000000000000000 0000000000000010
>     GPR08: 0000000000000001 0000000000000400 ffffffffffffffff 0000000000000000
>     GPR12: 0000000000008800 c00000003fffd680 c0000001f14b0000 c0000000011e1bf0
>     GPR16: c0000000011e61f4 c0000001efe22200 c0000001efe22020 c0000001fba80000
>     GPR20: c0000001ff567a80 0000000000000001 c000000000e27a80 ffffffffffffe830
>     GPR24: ffffffffffffec30 000000000000102f 000000000000102f c0000001efca1000
>     GPR28: c0000001efca0400 c0000001efe22000 c0000001efe23bff c0000001efe22a00
>     NIP [c000000000170edc] free_sched_groups+0x5c/0xf0
>     LR [c000000000170f98] destroy_sched_domain+0x28/0x90
>     Call Trace:
>     [c000000003e93430] [000000000000102f] 0x102f (unreliable)
>     [c000000003e93470] [c000000000170f98] destroy_sched_domain+0x28/0x90
>     [c000000003e934a0] [c0000000001716e0] cpu_attach_domain+0x100/0x920
>     [c000000003e93600] [c000000000173128] build_sched_domains+0x1228/0x1370
>     [c000000003e93740] [c00000000017429c] partition_sched_domains+0x23c/0x400
>     [c000000003e937e0] [c0000000001f5ec8] rebuild_sched_domains_locked+0x78/0xe0
>     [c000000003e93820] [c0000000001f9ff0] rebuild_sched_domains+0x30/0x50
>     [c000000003e93850] [c0000000001fa1c0] cpuset_hotplug_workfn+0x1b0/0xb70
>     [c000000003e93c80] [c00000000012e5a0] process_one_work+0x1b0/0x480
>     [c000000003e93d20] [c00000000012e8f8] worker_thread+0x88/0x540
>     [c000000003e93db0] [c00000000013714c] kthread+0x15c/0x1a0
>     [c000000003e93e20] [c00000000000b55c] ret_from_kernel_thread+0x5c/0x80
>     Instruction dump:
>     2e240000 f8010010 f821ffc1 409e0014 48000080 7fbdf040 7fdff378 419e0074
>     ebdf0000 4192002c e93f0010 7c0004ac <7d404828> 314affff 7d40492d 40c2fff4
>     ---[ end trace f992c4a7d47d602a ]---
> 
>     Kernel panic - not syncing: Fatal exception
> 
> This happens in free_sched_groups() because the linked list of
> sched_groups is corrupted. Here is what happens when we hotplug the CPU:
> 
>  - build_sched_groups() builds a sched_groups linked list for
>    sched_domain D1, with only one entry A, refcount=1
> 
>    D1: A(ref=1)
> 
>  - build_sched_groups() builds a sched_groups linked list for
>    sched_domain D2, with the same entry A
> 
>    D2: A(ref=2)
> 
>  - build_sched_groups() builds a sched_groups linked list for
>    sched_domain D3, with the same entry A and a new entry B:
> 
>    D3: A(ref=3) -> B(ref=1)
> 
>  - destroy_sched_domain() is called for D1:
> 
>    D1: A(ref=3) -> B(ref=1) and as B's ref is 1, dropping it frees
>                                  B's memory, but A->next still points to B
> 
>  - destroy_sched_domain() is called for D3:
> 
>    D3: A(ref=2) -> B(ref=0)
> 
> The kernel crashes when it tries to use data inside B: its memory has
> already been freed, so the data is corrupted and the linked list (next)
> is broken too.
> 
> This problem appears with commit 051f3ca02e46
> ("sched/topology: Introduce NUMA identity node sched domain").
> 
> If I compare the function call sequences before and after this commit,
> I can see that in the working case build_overlap_sched_groups() is
> called instead of build_sched_groups(); in that case the reference
> counters all have the same value and the linked list can be correctly
> freed.
> The involved commit introduced the node domain, and on powerpc the
> node domains can overlap, which should not happen.
> 
> This happens because the powerpc code initially computes
> sched_domains_numa_masks of offline nodes as if they were merged with
> node 0 (because the firmware doesn't provide distance information for
> memoryless/cpuless nodes):
> 
>   node   0   1   2   3
>     0:  10  40  10  10
>     1:  40  10  40  40
>     2:  10  40  10  10
>     3:  10  40  10  10
> 
> We should have:
> 
>   node   0   1   2   3
>     0:  10  40  40  40
>     1:  40  10  40  40
>     2:  40  40  10  40
>     3:  40  40  40  10
> 
> And once a new CPU is added and the node is onlined, the NUMA masks
> are updated, but the initially set bits are not cleared. This explains
> why nodes can overlap.
> 
> This patch changes the initialization code so that it does not set the
> distance for offline nodes. The distances will be updated when a node
> becomes online (on CPU hotplug), as is already done.
> 
> This patch has been tested on powerpc but not on the other
> architectures. They are affected because the modified part is in common
> code. All comments are welcome (e.g. how to move the change to
> powerpc-specific code, or whether the other architectures can work with
> this change).
> 
> Fixes: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain")
> Cc: Suravee Suthikulpanit <suravee.suthikulpanit at amd.com>
> Cc: Srikar Dronamraju <srikar at linux.vnet.ibm.com>
> Cc: Borislav Petkov <bp at suse.de>
> Cc: David Gibson <david at gibson.dropbear.id.au>
> Cc: Michael Ellerman <mpe at ellerman.id.au>
> Cc: Nathan Fontenot <nfont at linux.vnet.ibm.com>
> Cc: Michael Bringmann <mwb at linux.vnet.ibm.com>
> Cc: linuxppc-dev at lists.ozlabs.org
> Cc: Ingo Molnar <mingo at redhat.com>
> Cc: Peter Zijlstra <peterz at infradead.org>
> Signed-off-by: Laurent Vivier <lvivier at redhat.com>
> ---
> 
> Notes:
>     v3: fix the root cause of the problem (sched numa mask initialization)
>     v2: add scheduler maintainers in the CC: list
> 
>  kernel/sched/topology.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 3f35ba1d8fde..24831b86533b 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1622,8 +1622,10 @@ void sched_init_numa(void)
>  				return;
>  
>  			sched_domains_numa_masks[i][j] = mask;
> +			if (!node_state(j, N_ONLINE))
> +				continue;
>  
> -			for_each_node(k) {
> +			for_each_online_node(k) {
>  				if (node_distance(j, k) > sched_domains_numa_distance[i])
>  					continue;
>  
> 

Another way to avoid the offline nodes overlapping at startup is to
ensure the default values don't define a distance that merges all
offline nodes into node 0.

A powerpc-specific patch can work around the kernel crash by doing this:

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 87f0dd0..3ba29bb 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -623,6 +623,7 @@ static int __init parse_numa_properties(void)
        struct device_node *memory;
        int default_nid = 0;
        unsigned long i;
+       int nid, dist;

        if (numa_enabled == 0) {
                printk(KERN_WARNING "NUMA disabled by user\n");
@@ -636,6 +637,10 @@ static int __init parse_numa_properties(void)

        dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);

+       for (nid = 0; nid < MAX_NUMNODES; nid++)
+               for (dist = 0; dist < MAX_DISTANCE_REF_POINTS; dist++)
+                       distance_lookup_table[nid][dist] = nid;
+
        /*
         * Even though we connect cpus to numa domains later in SMP
         * init, we need to know the node ids now. This is because

Any comments?

If this is not the right way to fix it, does someone have a better idea
how to avoid the kernel crash?

Thanks,
Laurent


