[Bug 206669] Little-endian kernel crashing on POWER8 on heavy big-endian PowerKVM load

Wed Feb 26 20:29:19 AEDT 2020

https://bugzilla.kernel.org/show_bug.cgi?id=206669

--- Comment #3 from npiggin at gmail.com ---
bugzilla-daemon at bugzilla.kernel.org's on February 26, 2020 5:26 pm:
> https://bugzilla.kernel.org/show_bug.cgi?id=206669
> 
> --- Comment #2 from John Paul Adrian Glaubitz (glaubitz at physik.fu-berlin.de)
> ---
> (In reply to npiggin from comment #1)
>> Thanks for the report, we need to get more data about the first BUG if 
>> we can. What function in your vmlinux contains address 
>> 0xc00000000017a778? (use nm or objdump etc)
> 
> Seems to be t select_task_rq_fair:
> 
> root@watson:/boot# nm vmlinux-5.4.0-0.bpo.3-powerpc64le | grep -C5 c00000000017a
> c000000000448550 T select_estimate_accuracy
> c000000000170d20 t select_fallback_rq
> c000000000e4c940 D select_idle_mask
> c000000000179f10 t select_idle_sibling
> c00000000018fd80 t select_task_rq_dl
> c00000000017a640 t select_task_rq_fair
> c000000000177f50 t select_task_rq_idle
> c00000000018c9e0 t select_task_rq_rt
> c00000000019c800 t select_task_rq_stop
> c000000000927710 t selem_alloc.isra.6
> c000000000926e50 t selem_link_map
> root@watson:/boot#
> 
>> Is that the first message you get,
>> No warnings or anything else earlier in the dmesg?
> 
> Correct. You can see the login prompt of the host VM watson directly after
> booting up.
> 
>> Also 0xc0000000002659a0 would be interesting.
> 
> Looks like that's ring_buffer_record_off:
> 
> root@watson:/boot# nm vmlinux-5.4.0-0.bpo.3-powerpc64le | grep -C5 c0000000002659
> c0000000002667e0 T ring_buffer_read_finish
> c00000000026b4b0 T ring_buffer_read_page
> c000000000265e10 T ring_buffer_read_prepare
> c000000000265ef0 T ring_buffer_read_prepare_sync
> c000000000269ae0 T ring_buffer_read_start
> c000000000265950 T ring_buffer_record_disable
> c000000000266070 T ring_buffer_record_disable_cpu
> c000000000265970 T ring_buffer_record_enable
> c0000000002660c0 T ring_buffer_record_enable_cpu
> c00000000026d470 T ring_buffer_record_is_on
> c00000000026d480 T ring_buffer_record_is_set_on
> c000000000265990 T ring_buffer_record_off
> c000000000265a10 T ring_buffer_record_on
> c000000000266da0 T ring_buffer_reset
> c000000000266a90 T ring_buffer_reset_cpu
> c000000000267cd0 T ring_buffer_resize
> c00000000026d400 T ring_buffer_set_clock
> root@watson:/boot#

Thanks.
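
To double-check the two lookups, here is a minimal sketch in Python that
resolves an address to the nearest symbol at or below it. The symbol lines
are copied from the nm output quoted above; the helper names (parse_nm,
resolve) are made up for illustration. It is only as reliable as the symbol
list it is given: with a partial list, the nearest lower symbol is not
necessarily the function that really contains the address.

```python
import bisect

# Symbol lines copied from the nm output quoted in this thread.
NM_OUTPUT = """\
c000000000179f10 t select_idle_sibling
c00000000017a640 t select_task_rq_fair
c000000000265990 T ring_buffer_record_off
c000000000265a10 T ring_buffer_record_on
"""


def parse_nm(text):
    """Return a list of (address, name) tuples sorted by address."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        addr, _sym_type, name = parts
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def resolve(symbols, addr):
    """Return (name, offset) for the closest symbol at or below addr, else None."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None
    base, name = symbols[i]
    return name, addr - base


symbols = parse_nm(NM_OUTPUT)
for pc in (0xc00000000017a778, 0xc0000000002659a0):
    name, offset = resolve(symbols, pc)
    print(f"{pc:#x} -> {name}+{offset:#x}")
```

With these four symbols the two addresses come out as
select_task_rq_fair+0x138 and ring_buffer_record_off+0x10.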

Okay, it looks like what's happening here is that something crashes in
select_task_rq_fair (kernel data access fault). It's then able to
print out those first two lines, but then it calls die(), which
ends up calling oops_enter(), which calls tracing_off(), which calls
tracer_tracing_off() and crashes there. That goes around the same
cycle, only ever printing out the first two lines.

Nothing obvious as to why those accesses in particular are crashing.
The first data address is 0xc000000002bfd038, the second is
0xc0000007f9070c08. Not vmalloc space, not above the 1TB segment.
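
The 1TB part of that check is just arithmetic on the two fault addresses.
A rough sketch, assuming the linear mapping starts at 0xc000000000000000
and reading "not above the 1TB segment" as "offset into the linear mapping
is below 1 << 40" (both are assumptions of this sketch; the vmalloc check
is not reproduced because the region boundaries depend on the kernel
version and MMU mode):

```python
# Assumed start of the ppc64 kernel linear mapping.
PAGE_OFFSET = 0xc000000000000000
ONE_TB = 1 << 40

# The two faulting data addresses quoted above.
FAULT_ADDRS = (0xc000000002bfd038, 0xc0000007f9070c08)


def linear_offset(addr):
    """Offset of addr from the assumed start of the linear mapping."""
    return addr - PAGE_OFFSET


for addr in FAULT_ADDRS:
    off = linear_offset(addr)
    below = 0 <= off < ONE_TB
    print(f"{addr:#x}: offset {off:#x} ({off / 2**30:.2f} GiB), below 1TB: {below}")
```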

Do you have tracing / ftrace enabled in the host kernel for any
reason? Turning that off might let the oops message get printed.
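
One way to check is to read the tracefs control files on the host. A small
sketch, assuming tracefs is mounted at /sys/kernel/tracing (on some setups
it is under /sys/kernel/debug/tracing instead); the function name
tracing_state is made up for illustration:

```python
from pathlib import Path


def tracing_state(tracefs="/sys/kernel/tracing"):
    """Return (tracing_on, current_tracer) as strings, or None where unreadable."""

    def read(name):
        try:
            return (Path(tracefs) / name).read_text().strip()
        except OSError:
            return None  # file missing, tracefs not mounted, or no permission

    return read("tracing_on"), read("current_tracer")


if __name__ == "__main__":
    on, tracer = tracing_state()
    print(f"tracing_on={on} current_tracer={tracer}")
```

Note that tracing_on normally reads 1 even when no tracer is in use, so
the current_tracer value (nop when nothing is selected) is the more
telling of the two.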

> 
> FWIW, the kernel image comes from this Debian package:
> 
>>
>>
>> http://snapshot.debian.org/archive/debian/20200211T210433Z/pool/main/l/linux/linux-image-5.4.0-0.bpo.3-powerpc64le_5.4.13-1%7Ebpo10%2B1_ppc64el.deb

Okay. Any chance you could test an upstream kernel? 
> 
>> When reproducing, do you ever get a clean trace of the first bug?
> 
> I have logged everything that showed in the console during and after the
> crash.
> After that, the machine no longer responds and has to be hard-reset.
> 
>> Could you try setting /proc/sys/kernel/panic_on_oops and reproducing?
> 
> I will try that.

Don't bother testing that after the above -- panic_on_oops happens
after oops_begin(), so it won't help unfortunately.

Attempting to get into xmon might though, if you boot with xmon=on.
Try that if tracing wasn't enabled, or disabling it doesn't help.

> 
> Anything to be considered for the kernel running inside the big-endian VM?
> 

Not that I'm aware of really. Certainly it shouldn't be able to crash
the host even if the guest was doing something stupid.

Thanks,
Nick
