[2/5] Cleanup management of kmem_caches for pagetables

David Gibson david at gibson.dropbear.id.au
Mon Sep 14 14:56:36 EST 2009


On Mon, Sep 14, 2009 at 08:28:11AM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2009-09-09 at 15:59 +1000, David Gibson wrote:
> 
> > 6 files changed, 73 insertions(+), 108 deletions(-)
> 
> That's a pretty good start :-)

Isn't it :)

> > +struct kmem_cache *pgtable_cache[PGF_SHIFT_MASK];
> > +
> > +void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
> > +{
> > +	char *name;
> > +	unsigned long table_size = sizeof(void *) << shift;
> > +	struct kmem_cache *new;
> > +
> > +	BUG_ON((shift < 1) || (shift > PGF_SHIFT_MASK));
> > +	if (PGT_CACHE(shift))
> > +		return; /* Already have a cache of this size */
> > +	name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift);
> > +	new = kmem_cache_create(name, table_size, table_size, 0, ctor);
> > +	PGT_CACHE(shift) = new;
> > +}
> 
> I'm getting partial to verbose boot nowadays :-) At least a pr_debug if
> not a pr_info up there "Allocated pgtable for order %d" or something
> like that might end up being of some use when debugging things.

Ok.
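
Something like this at the end of pgtable_cache_add(), I suppose (just
a sketch, the exact message wording is illustrative):

	new = kmem_cache_create(name, table_size, table_size, 0, ctor);
	PGT_CACHE(shift) = new;
	pr_debug("Allocated pgtable cache for order %d\n", shift);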

> >  void pgtable_cache_init(void)
> >  {
> > -	pgtable_cache[0] = kmem_cache_create(pgtable_cache_name[0], PGD_TABLE_SIZE, PGD_TABLE_SIZE, SLAB_PANIC, pgd_ctor);
> > -	pgtable_cache[1] = kmem_cache_create(pgtable_cache_name[1], PMD_TABLE_SIZE, PMD_TABLE_SIZE, SLAB_PANIC, pmd_ctor);
> > +	pgtable_cache_add(PGD_INDEX_SIZE, pgd_ctor);
> > +	pgtable_cache_add(PMD_INDEX_SIZE, pmd_ctor);
> > +	if (!PGT_CACHE(PGD_INDEX_SIZE) || !PGT_CACHE(PMD_INDEX_SIZE))
> > +		panic("Couldn't allocate pgtable caches");
> > +	BUG_ON(!PGT_CACHE(PUD_INDEX_SIZE));
> 
> What if PUD_INDEX_SIZE is 0 ? (64k pages)

Uh.. good question.  Um.. I booted with 64k pages, how did that
work...

> Couldn't we just do a
> 
> 	if (PUD_INDEX_SIZE)
> 		pgtable_cache_add(PUD_INDEX_SIZE...)
> 
> If it's the same size as another cache it would just not do anything... 

Yeah, that makes sense.
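
So pgtable_cache_init() would end up something like this (sketch only;
it relies on PUD_INDEX_SIZE, when non-zero, matching one of the other
index sizes, which I believe is true of all the current configs):

	void pgtable_cache_init(void)
	{
		pgtable_cache_add(PGD_INDEX_SIZE, pgd_ctor);
		pgtable_cache_add(PMD_INDEX_SIZE, pmd_ctor);
		if (!PGT_CACHE(PGD_INDEX_SIZE) || !PGT_CACHE(PMD_INDEX_SIZE))
			panic("Couldn't allocate pgtable caches");
		/* PUD_INDEX_SIZE is 0 with 64k pages; when it's non-zero
		 * it matches one of the sizes above, so the cache must
		 * already exist */
		if (PUD_INDEX_SIZE)
			BUG_ON(!PGT_CACHE(PUD_INDEX_SIZE));
	}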

> > -static inline void pgtable_free(pgtable_free_t pgf)
> > +static inline void pgtable_free(void *table, unsigned index_size)
> >  {
> > -	void *p = (void *)(pgf.val & ~PGF_CACHENUM_MASK);
> > -	int cachenum = pgf.val & PGF_CACHENUM_MASK;
> > -
> > -	if (cachenum == PTE_NONCACHE_NUM)
> > -		free_page((unsigned long)p);
> > -	else
> > -		kmem_cache_free(pgtable_cache[cachenum], p);
> > +	if (!index_size)
> > +		free_page((unsigned long)table);
> > +	else {
> > +		BUG_ON(index_size > PGF_SHIFT_MASK);
> > +		kmem_cache_free(PGT_CACHE(index_size), table);
> > +	}
> >  }
> 
> Out of curiosity, what is the index_size == 0 case for ? Do we still use
> it ? Agh... found it ... we don't put PTE pages in a cache, just use
> gfp. I suppose that's lower overhead.

Well, and I didn't want to have the extra churn of converting it to
use a kmem_cache in any case.
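
At the call sites the distinction just shows up in the index_size
argument, roughly like this (illustrative only, not the exact call
sites):

	/* PTE pages still come straight from the page allocator... */
	pgtable_free(pte, 0);

	/* ...while pmd/pud/pgd tables come from their kmem_caches */
	pgtable_free(pmd, PMD_INDEX_SIZE);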

> >  /* This needs to be big enough to allow for MMU_PAGE_COUNT + 2 to be stored
> >   * and small enough to fit in the low bits of any naturally aligned page
> >   * table cache entry. Arbitrarily set to 0x1f, that should give us some
> >   * room to grow
> >   */
> 
> The comment above will need updating (don't just remove it, please -do-
> explain what it's all about :-)

Ah, yes, oops.
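
Something along these lines, maybe (a sketch of the replacement
comment, wording to be tidied up):

	/* The low bits of a pointer in a pgtable free batch now hold
	 * the index size (shift) of the table being freed, rather than
	 * a cache number.  It still has to be small enough to fit in
	 * the low bits of any naturally aligned page table chunk, and
	 * big enough to cover the largest index size we use. */
	#define PGF_SHIFT_MASK		0xf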

> > -#define PGF_CACHENUM_MASK	0x1f
> > -
> > -static inline pgtable_free_t pgtable_free_cache(void *p, int cachenum,
> > -						unsigned long mask)
> > -{
> > -	BUG_ON(cachenum > PGF_CACHENUM_MASK);
> > -
> > -	return (pgtable_free_t){.val = ((unsigned long) p & ~mask) | cachenum};
> > -}
> > +#define PGF_SHIFT_MASK		0xf
> 
> That does bring one question tho... You still fill the batch by sticking
> the shift into the low bits of the table right ? Which means that your
> table must have at least 4 bits 0 at the bottom, ie, it must at least
> have 2 pointers on 64-bit and 4 on 32-bit. Maybe you should add some
> runtime test for the latter (your comparisons to 1 do the job for 64-bit
> but not for 32-bit, or did I miss something ?)

Ah.. yes, I believe that's a bug, though it will actually go away as a
side-effect of the next patch.  The new hugepage pagetable format
introduces a similar alignment requirement on the pagetable chunks,
because the hugepage pgd/pud/pmd pointers also contain a shift.

That patch includes a change to pgtable_cache_add() to force the
alignment of each new kmem_cache up to that minimum.  But yes, it
should be adjusted to cover the pgtable free batch requirement as
well, and moved back into this patch.
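
Folded into this patch's version, it would look roughly like this (a
sketch only, using PGF_SHIFT_MASK + 1 as the minimum alignment rather
than the hugepage mask from the next patch):

	void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
	{
		char *name;
		unsigned long table_size = sizeof(void *) << shift;
		/* The free batch code sticks the index size into the low
		 * bits of the table pointer, so every table needs at
		 * least this alignment, even on 32-bit where a 2-pointer
		 * table would otherwise only be 8-byte aligned */
		unsigned long align = max_t(unsigned long, table_size,
					    PGF_SHIFT_MASK + 1);
		struct kmem_cache *new;

		BUG_ON((shift < 1) || (shift > PGF_SHIFT_MASK));
		if (PGT_CACHE(shift))
			return; /* Already have a cache of this size */
		name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift);
		new = kmem_cache_create(name, table_size, align, 0, ctor);
		PGT_CACHE(shift) = new;
	}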

> Overall, a really nice cleanup... the previous stuff was hairy and prone
> to breakage (see how it broke twice due to the misaligned cache names
> array during the build-up of the book3e support). I'm actually tempted,
> with a bit more testing, to sneak that into .32 despite arriving a bit
> late, because the current code is really fragile.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

