[v4.12-rc1 regression] mount ext4 fs results in kernel crash on PPC64le host

Thu Jun 29 14:54:47 AEST 2017

Hi Eryu,

Thanks for the bug report.

Eryu Guan <eguan at redhat.com> writes:
> Hi all,
>
> Li Wang and I are constantly seeing ppc64le hosts crashing due to bad

I'm curious why you're seeing this and not other folks. What compiler
are you using?

> page access. It doesn't reproduce on every ppc64le host we've
> tested, but it usually happens during filesystem testing.
>
> [  207.403459] Unable to handle kernel paging request for unaligned access at address 0xc0000001c52c5e7f
                                                            ^^^^^^^^^                                    ^

> [  207.403470] Faulting instruction address: 0xc0000000004d470c

Which is:

ldarx   r10,0,r5

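r5 = c0000001c52c5e7f (GPR05 in the dump below, same value as the DAR)

So that makes sense: ldarx needs a naturally aligned (8-byte) address,
and this one ends in 0x7f, so you get an alignment fault (TRAP 0600).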

> [  207.403475] Oops: Kernel access of bad area, sig: 7 [#1]
> [  207.403477] SMP NR_CPUS=2048
> [  207.403478] NUMA
> [  207.403480] pSeries
> [  207.403483] Modules linked in: ext4 jbd2 mbcache sg pseries_rng ghash_generic gf128mul xts vmx_crypto nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod ibmveth ibmvscsi scsi_transport_srp
> [  207.403503] CPU: 0 PID: 2263 Comm: mount Not tainted 4.12.0-rc7 #26
> [  207.403506] task: c0000003ef2fde00 task.stack: c0000003de394000
> [  207.403509] NIP: c0000000004d470c LR: c00000000011cd24 CTR: c000000000130de0
> [  207.403512] REGS: c0000003de397450 TRAP: 0600   Not tainted  (4.12.0-rc7)
> [  207.403515] MSR: 800000010280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>
> [  207.403521]   CR: 28028844  XER: 00000001
> [  207.403525] CFAR: c00000000011cd20 DAR: c0000001c52c5e7f DSISR: 00000000 SOFTE: 0
> [  207.403525] GPR00: c00000000011cce8 c0000003de3976d0 c000000001049500 c0000003f2c6ec20
> [  207.403525] GPR04: c0000003f2c6ec20 c0000001c52c5e7f 0000000000000000 0000000000000001
> [  207.403525] GPR08: 000c5543cab19830 0000000198e19900 0000000000000008 0000000000000000
> [  207.403525] GPR12: c000000000130de0 c00000000fac0000 0000000000000000 c0000003f1328000
> [  207.403525] GPR16: 0000000000000000 c0000003de700400 0000000000000000 c0000003de700594
> [  207.403525] GPR20: 0000000000000002 0000000000000000 0000000000004000 c000000000cc5780
> [  207.403525] GPR24: 00000001c45ffc5f 0000000000000000 00000001c45ffc5f c00000000107dd00
> [  207.403525] GPR28: c0000003f2c6f434 0000000000000004 0000000000000800 c0000003f2c6ec00
> [  207.403567] NIP [c0000000004d470c] llist_add_batch+0xc/0x40

bool llist_add_batch(struct llist_node *new_first, struct llist_node *new_last,
		     struct llist_head *head)
{
	struct llist_node *first;

	do {
		new_last->next = first = ACCESS_ONCE(head->first);
	} while (cmpxchg(&head->first, first, new_first) != first);

So it's the cmpxchg().

__cmpxchg_u64(volatile unsigned long *p, unsigned long old, unsigned long new)
{
	unsigned long prev;

	__asm__ __volatile__ (
	PPC_ATOMIC_ENTRY_BARRIER
"1:	ldarx	%0,0,%2		# __cmpxchg_u64\n\

> [  207.403571] LR [c00000000011cd24] try_to_wake_up+0x4a4/0x5b0

try_to_wake_up(p, ..)
  -> ttwu_queue(p, cpu, wake_flags);
     -> ttwu_queue_remote(p, cpu, wake_flags);

static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
{
	struct rq *rq = cpu_rq(cpu);

	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);

	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {

static inline bool llist_add(struct llist_node *new, struct llist_head *head)
{
	return llist_add_batch(new, new, head);
}

So the cmpxchg is:

        cmpxchg(&head->first, first, new_first) != first

Where head is &cpu_rq(cpu)->wake_list.

cpu came from try_to_wake_up() which did:

	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);

So possibly the cpu value is bogus. Or ..

#define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))

That runqueues variable has become corrupted?

You might be able to work it out from the register dump and the full
disassembly of the kernel. Or you could add some printk()s in there and
reproduce it.

cheers

> [  207.403573] Call Trace:
> [  207.403576] [c0000003de3976d0] [c00000000011cce8] try_to_wake_up+0x468/0x5b0 (unreliable)
> [  207.403581] [c0000003de397750] [c000000000102cc8] create_worker+0x148/0x250
> [  207.403585] [c0000003de3977f0] [c000000000105e7c] alloc_unbound_pwq+0x3bc/0x4c0
> [  207.403589] [c0000003de397850] [c0000000001064bc] apply_wqattrs_prepare+0x2ac/0x320
> [  207.403593] [c0000003de3978c0] [c00000000010656c] apply_workqueue_attrs_locked+0x3c/0xa0
> [  207.403597] [c0000003de3978f0] [c000000000106acc] apply_workqueue_attrs+0x4c/0x80
> [  207.403601] [c0000003de397930] [c00000000010866c] __alloc_workqueue_key+0x16c/0x4e0
> [  207.403615] [c0000003de3979f0] [d000000013de5ce0] ext4_fill_super+0x1c70/0x3390 [ext4]
> [  207.403620] [c0000003de397b30] [c00000000031739c] mount_bdev+0x21c/0x250
> [  207.403633] [c0000003de397bd0] [d000000013dddb80] ext4_mount+0x20/0x40 [ext4]
> [  207.403637] [c0000003de397bf0] [c000000000318944] mount_fs+0x74/0x210
> [  207.403641] [c0000003de397ca0] [c000000000340638] vfs_kern_mount+0x68/0x1d0
> [  207.403644] [c0000003de397d10] [c000000000345348] do_mount+0x278/0xef0
> [  207.403648] [c0000003de397de0] [c0000000003463e4] SyS_mount+0x94/0x100
> [  207.403652] [c0000003de397e30] [c00000000000af84] system_call+0x38/0xe0
> [  207.403655] Instruction dump:
> [  207.403658] 60420000 38600000 4e800020 60000000 60420000 7c832378 4e800020 60000000
> [  207.403663] 60000000 e9250000 f9240000 7c0004ac <7d4028a8> 7c2a4800 40c20010 7c6029ad
> [  207.403669] ---[ end trace 4fa94bf890f28f69 ]---
>
> Today I finally found a host that reliably triggers the crash when
> mounting an ext4 filesystem, and I ran a git bisect. It pointed to
> this as the first bad commit:
>
> commit 9c355917fcf006af47ffaa5ae43a1a804764a6f6
> Author: Balbir Singh <bsingharora at gmail.com>
> Date:   Wed Apr 12 16:35:19 2017 +1000
>
>     powerpc/tracing: Allow tracing of mmap syscalls
>     
>     Currently sys_mmap() and sys_mmap2() (32-bit only), are not visible to the
>     syscall tracing machinery. This means users are not able to see the execution of
>     mmap() syscalls using the syscall tracer.
>     
>     Fix that by using SYSCALL_DEFINE6 for sys_mmap() and sys_mmap2() so that the
>     meta-data associated with these syscalls is visible to the syscall tracer.
>     
>     A side-effect of this change is that the return type has changed from unsigned
>     long to long. However this should have no effect, the only code in the kernel
>     which uses the result of these syscalls is in the syscall return path, which is
>     written in asm and treats the result as unsigned regardless.
>     
>     Example output:
>       cat-3399  [001] ....   196.542410: sys_mmap(addr: 7fff922a0000, len: 20000, prot: 3, flags: 812, fd: 3, offset: 1b0000)
>       cat-3399  [001] ....   196.542443: sys_mmap -> 0x7fff922a0000
>       cat-3399  [001] ....   196.542668: sys_munmap(addr: 7fff922c0000, len: 6d2c)
>       cat-3399  [001] ....   196.542677: sys_munmap -> 0x0
>     
>     Signed-off-by: Balbir Singh <bsingharora at gmail.com>
>     [mpe: Massage change log, add detail on return type change]
>     Signed-off-by: Michael Ellerman <mpe at ellerman.id.au>
>
> And I've confirmed that reverting the above commit 'resolves' the crash.
> I've appended the memory and CPU information of the host to the end of
> this email. If you need more detailed information, please let me know.
>
> Thanks,
> Eryu
>
> [root at ibm-p8-03-lp6 ~]# free
>               total        used        free      shared  buff/cache   available
> Mem:       18756864      399552    17880704       12672      476608    17470592
> Swap:       7864256           0     7864256
> [root at ibm-p8-03-lp6 ~]# lscpu
> Architecture:          ppc64le
> Byte Order:            Little Endian
> CPU(s):                16
> On-line CPU(s) list:   0-15
> Thread(s) per core:    8
> Core(s) per socket:    1
> Socket(s):             2
> NUMA node(s):          3
> Model:                 2.1 (pvr 004b 0201)
> Model name:            POWER8 (architected), altivec supported
> Hypervisor vendor:     (null)
> Virtualization type:   full
> L1d cache:             64K
> L1i cache:             32K
> NUMA node0 CPU(s):     0-7
> NUMA node2 CPU(s):     8-15
> NUMA node3 CPU(s):