[OOPS] hugetlbfs tests with 2.6.30-rc8-git1

Sat Jun 6 06:17:42 EST 2009

On Fri, 2009-06-05 at 16:59 +0530, Sachin Sant wrote:
> While executing Hugetlbfs tests against 2.6.30-rc8-git1 on a
> Power6 box, I observed the following OOPS message.

> NIP [c000000000038240] .hpte_need_flush+0x1bc/0x2d8
> LR [c0000000000380f0] .hpte_need_flush+0x6c/0x2d8

Weird. I don't really see what happened there.

> Call Trace:
> [c0000000fa8ff710] [c000000000038264] .hpte_need_flush+0x1e0/0x2d8 (unreliable)
> [c0000000fa8ff7d0] [c000000000039fa4] .huge_ptep_get_and_clear+0x40/0x5c
> [c0000000fa8ff850] [c00000000012d46c] .__unmap_hugepage_range+0x178/0x2b8
> [c0000000fa8ff940] [c00000000012d600] .unmap_hugepage_range+0x54/0x88
> [c0000000fa8ff9e0] [c0000000001173a0] .unmap_vmas+0x178/0x8f4
> [c0000000fa8ffb30] [c00000000011cab8] .unmap_region+0xfc/0x1e4
> [c0000000fa8ffc00] [c00000000011e248] .do_munmap+0x2f4/0x38c
> [c0000000fa8ffcc0] [c0000000002f6d74] .SyS_shmdt+0xc0/0x188
> [c0000000fa8ffd70] [c00000000000c430] .sys_ipc+0x274/0x2fc
> [c0000000fa8ffe30] [c000000000008534] syscall_exit+0x0/0x40
> Instruction dump:
> 78090220 2fbd0000 409e0010 7929e0e4 7be00120 4800000c 792945c6 7be00600 
> 7d3f0378 7c1cb82e 3d360001 2f800000 <eb898000> 409e0028 7fe3fb78 7f24cb78 

The call trace looks rather ordinary. In fact, the DAR address doesn't
even look that bad, though that depends on how much RAM you have in
this partition, I suppose.

> I first noticed this with 2.6.30-rc7-git3 on a power6 machine,
> but could not recreate again on the same machine. Now the problem
> has resurfaced again with 2.6.30-rc8 (and with git1 as well) on
> another Power6 box.
> 
> I had seen similar failures (although the back trace was different,
> the crash point was the same) with older kernels and Mel submitted a
> patch to fix that issue. Here is the link to that patch.
> 
> http://lists.ozlabs.org/pipermail/linuxppc-dev/2009-May/071395.html
> 
> I have attached the .config.

No, Mel's patch is for a different problem, which has already been
fixed upstream. This is more concerning... I'm not sure what's up, but
would you be able to send a disassembly of the hpte_need_flush()
function from your kernel binary, so I can see precisely which access
caused the fault?
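
As a rough first pass before the full disassembly arrives, the word
marked <...> in the instruction dump and the one just before it can be
decoded by hand. The sketch below is only that: it handles just the two
PowerPC encodings that appear here (addis, and the DS-form
ld/ldu/lwa), the constants are copied from the oops above, and the
helper names are made up for illustration.

```python
# Values copied from the oops report quoted above.
NIP = 0xC000000000038240      # faulting instruction address
NIP_OFFSET = 0x1BC            # from ".hpte_need_flush+0x1bc/0x2d8"
FUNC_SIZE = 0x2D8             # from the same symbol+offset/size string
FAULT_WORD = 0xEB898000       # the <...> word in the instruction dump
PREV_WORD = 0x3D360001        # the word immediately before it


def sign_extend16(value):
    """Interpret a 16-bit field as a signed integer."""
    return value - 0x10000 if value & 0x8000 else value


def decode(word):
    """Decode addis and DS-form loads only; anything else is shown raw."""
    opcode = word >> 26
    rt = (word >> 21) & 0x1F
    ra = (word >> 16) & 0x1F
    if opcode == 15:
        # D-form: addis RT,RA,SI
        return f"addis r{rt},r{ra},{sign_extend16(word & 0xFFFF)}"
    if opcode == 58:
        # DS-form: the low two bits select ld / ldu / lwa,
        # the remaining 14 bits are the displacement (already scaled by 4).
        mnemonic = {0: "ld", 1: "ldu", 2: "lwa"}.get(word & 0x3)
        if mnemonic is not None:
            disp = sign_extend16(word & 0xFFFC)
            return f"{mnemonic} r{rt},{disp}(r{ra})"
    return f".long 0x{word:08x}"


# Address range of hpte_need_flush, derived from NIP and the symbol offset.
func_start = NIP - NIP_OFFSET
func_end = func_start + FUNC_SIZE

if __name__ == "__main__":
    print(f"hpte_need_flush: 0x{func_start:x} .. 0x{func_end:x}")
    print(f"previous insn:   {decode(PREV_WORD)}")
    print(f"faulting insn:   {decode(FAULT_WORD)}")
```

If that decoding is right, the faulting access is a 64-bit load at
r22 + 0x8000 (addis adds 0x10000 to r22, the load subtracts 0x8000).
What r22 holds at that point is exactly what the disassembly should
tell us. The computed range can also be passed to objdump's
--start-address and --stop-address options to dump only this function
from vmlinux.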

Cheers,
Ben.