[Lguest] PAE and lguest

Rusty Russell rusty at rustcorp.com.au
Thu May 28 17:23:47 EST 2009


On Thu, 28 May 2009 02:18:33 am Matias Zabaljauregui wrote:
> Rusty, some updated news about pae patch:
> >    That's excellent news!  I'd be happy to apply it with the NX support
> > caveat (we can, if we have to, just disable it in the guest bootup code
> > somehow for the moment), but I too would like to know what the issue is
> > with sysenter.
>
> 1) mapping the switcher with  PAGE_KERNEL_EXEC  both in host and guest
>    solves the need for noexec=off in host kernel params

Oops, we should have been doing that anyway of course!

Good catch,

> 2) trapping and emulating rdmsr and wrmsr solved the rest of the problems I
> had. when kernel is compiled with PAE, set_nx()  reads and write EFER.NXE
> and this was generating a GPF not handled by lguest.

Right, or we can use paravirt ops in the guest to override them, which I'd 
prefer if we can.

> I'd like to know if you think this is an acceptable solution regarding NX
> for THIS version of PAE patch: I just clear NX feature from cpuid when
> guest invokes function 80000001, so guest won't even try to use NX bit.

Yep, that makes sense.  I've done it for other bits in the past.

> My main concern now (beside all the bugs you guys will find, so actually
> that's not a concern anymore :) ) is performance. Just one example :
>
> +void guest_set_pmd(struct lguest *lg, unsigned long pmdp, u32 idx)
> +{
> +	guest_pagetable_clear_all(&lg->cpus[0]);
> +}

Correctness first, performance second.  And it's usually best to have some 
benchmark to indicate whether the extra complexity is worth it.

> Finally, Rusty I'd like to know your opinion about Jeremy's suggestion
> in relation with hypercall names used:
> (This is not a PAE subject exclusively)
> http://ozlabs.org/pipermail/lguest/2009-May/001502.html

He raises a good point.  We should rename the existing hcall in a patch which 
precedes PAE support.

> --- a/arch/x86/lguest/boot.c
> +++ b/arch/x86/lguest/boot.c
> @@ -375,8 +375,12 @@ static void lguest_cpuid(unsigned int *ax, unsigned
> int *bx, case 1:	/* Basic feature request. */
>  		/* We only allow kernel to see SSE3, CMPXCHG16B and SSSE3 */
>  		*cx &= 0x00002201;
> -		/* SSE, SSE2, FXSR, MMX, CMOV, CMPXCHG8B, TSC, FPU. */
> +		/* SSE, SSE2, FXSR, MMX, CMOV, CMPXCHG8B, TSC, FPU, PAE. */
> +#ifdef CONFIG_X86_PAE
> +		*dx &= 0x07808151;
> +#else
>  		*dx &= 0x07808111;
> +#endif

Since PAE is a compile-time requirement for Linux, I think we can just 
advertise it all the time and avoid the #ifdef here.

> @@ -395,6 +399,13 @@ static void lguest_cpuid(unsigned int *ax, unsigned
> int *bx, if (*ax > 0x80000008)
>  			*ax = 0x80000008;
>  		break;
> +#ifdef CONFIG_X86_PAE
> +	case 0x80000001:
> +		/* Here we should fix nx cap depending on host. */
> +		/* For this version of PAE, we just clear NX bit. */
> +	       	*dx &= ~(1 << 20);
> +		break;
> +#endif

Probably don't need #ifdef here either.  Linux only looks for it when PAE is 
enabled, but that's not really a concern here.

>  static void lguest_set_pte_at(struct mm_struct *mm, unsigned long addr,
>  			      pte_t *ptep, pte_t pteval)
>  {
> -	*ptep = pteval;
> +	native_set_pte(ptep, pteval);
>  	lguest_pte_update(mm, addr, ptep);
>  }

Many of these are really nice cleanups: a separate patch for them would help 
simplify the transition as well.

> index a3d3cba..8f63845 100644
> --- a/drivers/lguest/Kconfig
> +++ b/drivers/lguest/Kconfig
> @@ -1,6 +1,6 @@
>  config LGUEST
>  	tristate "Linux hypervisor example code"
> -	depends on X86_32 && EXPERIMENTAL && !X86_PAE && FUTEX
> +	depends on X86_32 && EXPERIMENTAL && FUTEX

Hmm, I don't think we depend on futex any more.  But we do depend on
eventfd, so I've fixed that in one of my patches.

> @@ -171,7 +233,7 @@ static void release_pte(pte_t pte)
>  	/* Remember that get_user_pages_fast() took a reference to the page, in
>  	 * get_pfn()?  We have to put it back now. */
>  	if (pte_flags(pte) & _PAGE_PRESENT)
> -		put_page(pfn_to_page(pte_pfn(pte)));
> +		put_page(pte_page(pte));
>  }
>  /*:*/
>

Another nice cleanup.

>  /*H:450 If we chase down the release_pgd() code, it looks like this: */
> -static void release_pgd(struct lguest *lg, pgd_t *spgd)
> +static void release_pgd(pgd_t *spgd)
>  {

This one too!

> @@ -752,21 +1087,21 @@ static __init void
> populate_switcher_pte_page(unsigned int cpu,
>
>  	/* The first entries are easy: they map the Switcher code. */
>  	for (i = 0; i < pages; i++) {
> -		pte[i] = mk_pte(switcher_page[i],
> -				__pgprot(_PAGE_PRESENT|_PAGE_ACCESSED));
> +		native_set_pte(&pte[i], mk_pte(switcher_page[i],
> +				__pgprot(_PAGE_PRESENT|_PAGE_ACCESSED)));
>  	}
>
>  	/* The only other thing we map is this CPU's pair of pages. */
>  	i = pages + cpu*2;
>
>  	/* First page (Guest registers) is writable from the Guest */
> -	pte[i] = pfn_pte(page_to_pfn(switcher_page[i]),
> -			 __pgprot(_PAGE_PRESENT|_PAGE_ACCESSED|_PAGE_RW));
> +	native_set_pte(&pte[i], pfn_pte(page_to_pfn(switcher_page[i]),
> +			 __pgprot(_PAGE_PRESENT|_PAGE_ACCESSED|_PAGE_RW)));
>
>  	/* The second page contains the "struct lguest_ro_state", and is
>  	 * read-only. */
> -	pte[i+1] = pfn_pte(page_to_pfn(switcher_page[i+1]),
> -			   __pgprot(_PAGE_PRESENT|_PAGE_ACCESSED));
> +	native_set_pte(&pte[i+1],pfn_pte(page_to_pfn(switcher_page[i+1]),
> +			   __pgprot(_PAGE_PRESENT|_PAGE_ACCESSED)) );
>  }

More good cleanups...

Nice to have a cleanup patch which does all the preparation work (or maybe 
several patches), then the one which actually allows PAE.

Thanks!
Rusty.



More information about the Lguest mailing list