Very unreliable booting (WARNING: CPU: 0 PID: 1 at kernel/context_tracking.c:215 ct_nmi_exit+0xa0/0xc0) with PPC_EARLY_DEBUG_G5 set on a PowerMac G5, kernel 6.7-rc6

Erhard Furtner erhard_f at mailbox.org
Thu Dec 21 09:20:06 AEDT 2023


Greetings!

I wanted to check whether there are any changes on issue https://lore.kernel.org/all/20231114003721.4a9bfd37@yea/T/ on kernel 6.7-rc. KCSAN enabled kernels still won't boot on this G5 it seems.

What I also found out is that booting with an PPC_EARLY_DEBUG_G5=y enabled kernel is very unreliable. It works, but only on 1 ot 10 attemps or so. On the failed attemps the G5 freezes on the white screen only showing 3 lines that the kernel will be booted, no additional output.

On a successful boot with PPC_EARLY_DEBUG_G5=y the machine boots up fine, showing no side effects and is perfectly usable. In this case the dmesg throws up this:

[...]
PowerMac SMP probe found 2 cpus
KeyWest i2c @0xf8001003 irq 16 /u4 at 0,f8000000/i2c at f8001000
 channel 1 bus i2c-bus at 1
KeyWest i2c @0x80018000 irq 27 /ht at 0,f2000000/pci at 8/mac-io at 7/i2c at 18000
 channel 0 bus i2c-bus at 0
 channel 0 bus i2c-bus at 0
SMU i2c /smu at 0,0/smu-i2c-control at 0
 channel b bus /smu at 0,0/smu-i2c-control at 0/i2c-bus at b
 channel e bus /smu at 0,0/smu-i2c-control at 0/i2c-bus at e
Processor timebase sync using platform function
mpic: requesting IPIs...
PPC970/FX/MP performance monitor hardware support registered
rcu: Hierarchical SRCU implementation.
rcu: 	Max phase no-delay instances is 1000.
smp: Bringing up secondary CPUs ...
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1 at kernel/context_tracking.c:215 ct_nmi_exit+0xa0/0xc0
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.7.0-rc6-PMacG5 #4
Hardware name: PowerMac11,2 PPC970MP 0x440101 PowerMac
NIP:  c000000000bbf320 LR: c0000000000ab040 CTR: 0000000000000000
REGS: c000000003103300 TRAP: 0700   Not tainted  (6.7.0-rc6-PMacG5)
MSR:  9000000000029032 <SF,HV,EE,ME,IR,DR,RI>  CR: 24040840  XER: 20000000
IRQMASK: 1 
GPR00: c0000000000ab068 c0000000031035a0 c000000000e43500 0000000000000000 
GPR04: c00000000fffbeb0 c00000000fffbeb8 0000000000000000 000000047ef66000 
GPR08: c00000047fe059b0 c000000000e9f9b0 c000000000e9f9c0 0000000000000000 
GPR12: c000000000149e10 c00000000110a000 c00000000000dc10 0000000000000000 
GPR16: 0000000000000000 0000000000000000 c00000000108b9c8 0000000000000008 
GPR20: 000000000000005d 0000000000000000 0000000000000000 0000000000000001 
GPR24: c0000000010d04c4 c00000000108b900 c0000000010d04c0 000000047ef66000 
GPR28: c000000003103650 c000000000e9a400 000000000000006e c000000003170000 
NIP [c000000000bbf320] ct_nmi_exit+0xa0/0xc0
LR [c0000000000ab040] irq_exit+0x90/0x110
Call Trace:
[c0000000031035a0] [c0000000000ab068] irq_exit+0xb8/0x110 (unreliable)
[c0000000031035d0] [c0000000000196f0] timer_interrupt+0x1b0/0x370
[c000000003103620] [c0000000000089f4] decrementer_common_virt+0x214/0x220
--- interrupt: 900 at real_readb+0x44/0x68
NIP:  c00000000001f24c LR: c000000000074258 CTR: c0000000000087c0
REGS: c000000003103650 TRAP: 0900   Not tainted  (6.7.0-rc6-PMacG5)
MSR:  9000000000009022 <SF,HV,EE,ME,IR,RI>  CR: 24040242  XER: 20000000
IRQMASK: 0 
GPR00: 9000000000009022 c0000000031038f0 c000000000e43500 0000000000000071 
GPR04: 0000000080013030 0000011000000040 0000001000000040 9000000000009032 
GPR08: 0000000000000000 c0000000010d3500 000000047ef66000 0000000052454753 
GPR12: 9000000000009032 c00000000110a000 c00000000000dc10 0000000000000000 
GPR16: 0000000000000000 0000000000000000 c00000000108b9c8 0000000000000008 
GPR20: 000000000000005d 0000000000000000 0000000000000000 0000000000000001 
GPR24: c0000000010d04c4 c00000000108b900 c0000000010d04c0 0000000000000001 
GPR28: 000000007db14ba6 0000000000000000 000000000000006e c0000000010d3b18 
NIP [c00000000001f24c] real_readb+0x44/0x68
LR [c000000000074258] udbg_real_scc_putc+0x38/0x80
--- interrupt: 900
[c0000000031038f0] [c000000002611294] 0xc000000002611294 (unreliable)
[c000000003103920] [c0000000000743c0] udbg_adb_putc+0x30/0x50
[c000000003103940] [c00000000001e7d4] udbg_puts+0x64/0xb0
[c000000003103970] [c000000000e46124] udbg_progress+0x18/0x30
[c000000003103990] [c000000000075a20] smp_core99_kick_cpu+0x140/0x180
[c000000003103a10] [c00000000003371c] __cpu_up+0x12c/0x3c0
[c000000003103ad0] [c0000000000a3068] bringup_cpu+0x68/0x200
[c000000003103b30] [c0000000000a1bd0] cpuhp_invoke_callback+0x170/0x300
[c000000003103b80] [c0000000000a4598] _cpu_up.constprop.0+0x308/0x730
[c000000003103c10] [c0000000000a4aec] cpu_up+0x12c/0x180
[c000000003103ca0] [c000000000e5371c] bringup_nonboot_cpus+0x7c/0xec
[c000000003103cf0] [c000000000e5a30c] smp_init+0x40/0xa4
[c000000003103d50] [c000000000e41300] kernel_init_freeable+0x188/0x358
[c000000003103de0] [c00000000000dc38] kernel_init+0x30/0x158
[c000000003103e50] [c00000000000bf94] ret_from_kernel_user_thread+0x14/0x1c
--- interrupt: 0 at 0x0
Code: 39200000 38600004 f9280010 4bfffd4d e92d0128 38210020 81290000 e8010010 7c0803a6 4e800020 60000000 60000000 <0fe00000> 4bffff9c 60000000 60000000 
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit.constprop.0+0xc0/0xd0
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W          6.7.0-rc6-PMacG5 #4
Hardware name: PowerMac11,2 PPC970MP 0x440101 PowerMac
NIP:  c000000000bbf260 LR: c000000000bbf594 CTR: c0000000001541a0
REGS: c00000000107fb70 TRAP: 0700   Tainted: G        W           (6.7.0-rc6-PMacG5)
MSR:  9000000000029032 <SF,HV,EE,ME,IR,DR,RI>  CR: 48000284  XER: 00000000
IRQMASK: 1 
GPR00: c0000000000feb2c c00000000107fe10 c000000000e43500 0000000000000000 
GPR04: c00000047fe04f50 00000000ffffffff 0000000196e6aa55 c00000000110a000 
GPR08: c000000000e9f9b0 c000000000e9f9b0 ffffffffffffffe8 0000000000002200 
GPR12: c0000000001541a0 c00000000110a000 000000000014aa88 0000000000000000 
GPR16: 00000000ff9aac70 0000000002117cd8 0000000002118040 0000000002117da0 
GPR20: 00000000021182d0 00000000021a5b08 00000000021182f8 00000000024b3500 
GPR24: 00000000024b3500 c000000000d4bd78 c00000000108ba88 0000000000000001 
GPR28: c000000001062b00 c000000001062a80 c00000000108ba38 c00000047fe059b0 
NIP [c000000000bbf260] ct_kernel_exit.constprop.0+0xc0/0xd0
LR [c000000000bbf594] default_idle_call+0x44/0x74
Call Trace:
[c00000000107fe10] [c00000000107fe60] 0xc00000000107fe60 (unreliable)
[c00000000107fe40] [c00000000108ba38] 0xc00000000108ba38
[c00000000107fe60] [c0000000000feb2c] do_idle+0x10c/0x1d0
[c00000000107feb0] [c0000000000fee54] cpu_startup_entry+0x44/0x50
[c00000000107fee0] [c00000000000dc04] rest_init+0xe4/0xe8
[c00000000107ff10] [c000000000e4094c] arch_post_acpi_subsys_init+0x0/0x4
[c00000000107ff30] [c000000000e410e0] start_kernel+0x780/0x784
[c00000000107ffe0] [c00000000000cb18] start_here_common+0x1c/0x20
Code: 4b579829 60000000 fbdf0008 38600005 e8010040 38210030 ebc1fff0 ebe1fff8 7c0803a6 4bfffdf4 60000000 60000000 <0fe00000> 4bffff7c 60000000 60000000 
---[ end trace 0000000000000000 ]---
smp: Brought up 1 node, 2 CPUs


Looks like SMP or cpu bringup is messing around with early G5 debug console. Hopefully this output is of some help to get insight what the problem may be.

Without PPC_EARLY_DEBUG_G5=y working more reliable it seems impossible to get more data on why KCSAN would freeze the machine that early. Debugging via serial console or netconsole kicks in too late for getting any output in this case.

Kernel .config and full dmesg attached.

Regards,
Erhard
-------------- next part --------------
A non-text attachment was scrubbed...
Name: config_67-rc6_g5
Type: application/octet-stream
Size: 102536 bytes
Desc: not available
URL: <http://lists.ozlabs.org/pipermail/linuxppc-dev/attachments/20231220/decad7bf/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dmesg_67-rc6_g5
Type: application/octet-stream
Size: 50127 bytes
Desc: not available
URL: <http://lists.ozlabs.org/pipermail/linuxppc-dev/attachments/20231220/decad7bf/attachment-0003.obj>


More information about the Linuxppc-dev mailing list