[Lguest] pae bug

Fri Mar 27 10:46:53 EST 2009

On Friday 27 March 2009 05:23:43 Matias Zabaljauregui wrote:
> hello everybody, 
> 
> due to my lack of kernel debugging skills I'm having a hard time trying to find a bug in my PAE code.
> I don't want to bother you with code, but maybe you can give me some hints on how to debug this.
> 
> Depending on 
> 
>   a) the size of the struct pgdir pgdirs[4] array (if I use 16 slots, for example, my guest will work for some time)
>   b) the number of processes running on the guest (I don't have any problems with very simple guests, like initrd guests)
> 
> my PAE guests eventually die like this:
> 
> 
> [   79.257627] BUG: unable to handle kernel NULL pointer dereference at 0000000c
> [   79.257627] IP: [<c01021ea>] __switch_to+0xe/0x16c
> [   79.257627] *pdpt = 0000000005a9f001 *pde = 0000000000000000
> [   79.257627] Oops: 0000 [#1]
> [   79.257627] last sysfs file: /sys/kernel/uevent_seqnum
> [   79.257627] Modules linked in:
> [   79.257627]
> [   79.257627] Pid: 806, comm: find Not tainted (2.6.29-rc8 #27)
> [   79.257627] EIP: 0061:[<c01021ea>] EFLAGS: 00000092 CPU: 0
> [   79.257627] EIP is at __switch_to+0xe/0x16c
> [   79.257627] EAX: 00000000 EBX: c59d9660 ECX: 00000004 EDX: c59d9660
> [   79.257627] ESI: c5a53e00 EDI: c5aca200 EBP: c59d9000 ESP: c5b35edc
> [   79.257627]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0069
> [   79.257627] Process find (pid: 806, ti=c5b34000 task=c59d9000 task.ti=c5b50000)
> [   79.257627] Stack:
> [   79.257627]  00000000 00000001 c59d9660 c59d9000 c5aca200 c59d9000 c010194d 00000004
> [   79.257627]  c5aa0040 c59d9660 c5aca200 c59d9000 c02b5631 c59d9660 00000282 c59d9000
> [   79.257627]  c59d9660 c59d9740 c5b34000 c0118019 c5b35f70 00000003 c59d9658 c59d9660
> [   79.257627] Call Trace:
> [   79.257627]  [<c010194d>] lazy_hcall1+0x11/0xc8
> [   79.257627]  [<c02b5631>] schedule+0x1bd/0x2d0
> [   79.257627]  [<c0118019>] do_wait+0x105/0x35c
> [   79.257627]  [<c0118158>] do_wait+0x244/0x35c
> [   79.257627]  [<c011223c>] default_wake_function+0x0/0x8
> [   79.257627]  [<c01182c1>] sys_wait4+0x51/0xa0
> [   79.257627]  [<c0118323>] sys_waitpid+0x13/0x18
> [   79.257627]  [<c0103b7a>] syscall_call+0x7/0xb
> [   79.257627] Code: 00 6a 00 6a 00 8d 4c 24 10 31 d2 89 f0 e8 2f 31 01 00 83 c4 50 5b 5e 5f 5d c3 8d 76 00 55 57 56 53 83 ec 08 89 c6 89 d3 8b 40 04 <8b> 40 0c a8 01 74 3f a8 10 0f 85 e3 00 00 00 8b 86 2c 02 00 00
> [   79.257627] EIP: [<c01021ea>] __switch_to+0xe/0x16c SS:ESP 0069:c5b35edc
> [   79.257627] ---[ end trace 0261563366a297b4 ]---
> 
> 
> So, if I get it right, during __switch_to() the guest kernel accesses guest virtual address 0000000c (I don't even know how this can happen!),
> and this seems to happen after the guest issues a wait() system call. I guess the lazy_hcall1 corresponds to lguest_write_cr3().
> 
> 
> any ideas, or techniques to further debug this, or any words of inspiration will be very helpful 

Yep!  There's a bug.

I tracked it down yesterday; the patch below should help quite a lot!
Rusty.

lguest: wire up pte_update/pte_update_defer

Impact: intermittent guest segv/crash fix

I've been seeing random guest bad address crashes and segmentation faults:
bisect led to 4f98a2fee8 (vmscan: split LRU lists into anon & file sets),
but that's a red herring.

It turns out that lguest never hooked up the pte_update/pte_update_defer
calls, so our ptes were not always in sync.  After the vmscan commit, the
bug became reproducible; now a fsck in a 64MB guest causes reproducible
pagetable corruption.

Signed-off-by: Rusty Russell <rusty at rustcorp.com.au>
Cc: jeremy at xensource.com
Cc: virtualization at lists.osdl.org
Cc: stable at kernel.org

diff --git a/arch/x86/lguest/boot.c b/arch/x86/lguest/boot.c
index 65f0b8a..c3bdf0b 100644
--- a/arch/x86/lguest/boot.c
+++ b/arch/x86/lguest/boot.c
@@ -475,11 +480,17 @@ static void lguest_write_cr4(unsigned long val)
  * into a process' address space.  We set the entry then tell the Host the
  * toplevel and address this corresponds to.  The Guest uses one pagetable per
  * process, so we need to tell the Host which one we're changing (mm->pgd). */
+static void lguest_pte_update(struct mm_struct *mm, unsigned long addr,
+			       pte_t *ptep)
+{
+	lazy_hcall(LHCALL_SET_PTE, __pa(mm->pgd), addr, ptep->pte_low);
+}
+
 static void lguest_set_pte_at(struct mm_struct *mm, unsigned long addr,
 			      pte_t *ptep, pte_t pteval)
 {
 	*ptep = pteval;
-	lazy_hcall(LHCALL_SET_PTE, __pa(mm->pgd), addr, pteval.pte_low);
+	lguest_pte_update(mm, addr, ptep);
 }
 
 /* The Guest calls this to set a top-level entry.  Again, we set the entry then
@@ -1018,6 +1046,8 @@ __init void lguest_init(void)
 	pv_mmu_ops.read_cr3 = lguest_read_cr3;
 	pv_mmu_ops.lazy_mode.enter = paravirt_enter_lazy_mmu;
 	pv_mmu_ops.lazy_mode.leave = lguest_leave_lazy_mode;
+	pv_mmu_ops.pte_update = lguest_pte_update;
+	pv_mmu_ops.pte_update_defer = lguest_pte_update;
 
 #ifdef CONFIG_X86_LOCAL_APIC
 	/* apic read/write intercepts */