[2.4] [PATCH] hash_page rework, take 2

olof at austin.ibm.com
Wed Feb 4 15:18:37 EST 2004


On Tue, 3 Feb 2004, Julie DeWandel wrote:

> Hi Olof,
>
> The patch wasn't attached to your email so I included it along with
> comments below (my comments preceded by "JSD:").

Ugh. This is the second time I've forgotten to attach a patch. Not sure
what is going on here...

Thanks for taking the time to review! My comments are below each remark,
and the new patch is attached (this time for real).


-Olof




+static inline void pSeries_unlock_hpte(HPTE *hptep)
+{
+	unsigned long *word = &hptep->dw0.dword0;
+
+	asm volatile("lwsync":::"memory");
+	clear_bit(HPTE_LOCK_BIT, word);
+}

JSD: Other places within the kernel do an smp_mb__before_clear_bit() when
JSD: clearing a bit representing a lock. Would that be a better choice here
JSD: (it resolves to a sync)?

I just copied the 2.6 code here. lwsync is cheaper on Power4 and
beyond, and on Power3/RS64 it equals a sync. The cheapest sufficient
synchronization is always to be preferred.
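
For reference, the two variants would look roughly like this (sketch
only; the _lwsync/_sync names are just for illustration, and
smp_mb__before_clear_bit() resolving to a sync is per your note above):

static inline void pSeries_unlock_hpte_lwsync(HPTE *hptep)
{
	unsigned long *word = &hptep->dw0.dword0;

	/* lwsync orders all earlier stores before the bit clear;
	 * cheaper than sync on Power4+, equal to sync on Power3/RS64 */
	asm volatile("lwsync" : : : "memory");
	clear_bit(HPTE_LOCK_BIT, word);
}

static inline void pSeries_unlock_hpte_sync(HPTE *hptep)
{
	unsigned long *word = &hptep->dw0.dword0;

	/* smp_mb__before_clear_bit() is a full sync on ppc64: stronger
	 * (and more expensive) than this unlock actually needs */
	smp_mb__before_clear_bit();
	clear_bit(HPTE_LOCK_BIT, word);
}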

JSD: I should have added that the clearing of the _PAGE_BUSY bit should be
JSD: done using an atomic op (clear_bit() routine)

Not needed, since no one modifies the contents of the word while the
bit is set, i.e. it's similar to spin_unlock().

JSD: The inline assembly code will not set _PAGE_BUSY if it finds that
JSD: the user's access rights doesn't allow access to the page.
JSD: So, if access_ok = 0, _PAGE_BUSY is not set, and the code may
JSD: branch to out_unlock where _PAGE_BUSY is unconditionally cleared.
JSD: It could be possible for another processor to coincidently have
JSD: set _PAGE_BUSY in the PTE but then have this processor clear it
JSD: before the other processor wanted it clear. Do you agree this can
JSD: happen?

Yes, it's a small window but it might happen. See below.

JSD: Furthermore, the "ea" condition check (above) might yield false and
JSD: the code then continue as though access_ok were true, modifying the
JSD: PTE without the _PAGE_BUSY bit set. Seems bad.

Right. The proper thing to do is to always set _PAGE_BUSY and clear it
if access is denied. Fixed.
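
In C terms, the fixed sequence does roughly this (sketch only; the real
code is the ldarx/stdcx. block in hash_page() below, and
atomically_store() is just a hypothetical stand-in for that store):

	/* Spin until _PAGE_BUSY is clear, then set it atomically.
	 * atomically_store() stands in for the ldarx/stdcx. pair. */
	for (;;) {
		old_pte = *ptep;
		if (pte_val(old_pte) & _PAGE_BUSY)
			continue;
		if (atomically_store(ptep, pte_val(old_pte) | _PAGE_BUSY))
			break;
	}

	/* Only now check access rights; on denial we go through
	 * out_unlock, which clears _PAGE_BUSY again. */
	access_ok = !(access & ~pte_val(old_pte));
	if (!access_ok) {
		ret = 1;
		goto out_unlock;
	}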

JSD: NOTE: at this point, the *ptep = new_pte just unlocked the PTE by
JSD: clearing _PAGE_BUSY. The code then goes on to clear it again, thereby
JSD: possibly rendering unsafe updates that some other processor might be
JSD: doing after it thought it had set _PAGE_BUSY. There is a small window
JSD: here.

Fixed as well. There's only a window at the second update of *ptep;
in the first one, _PAGE_BUSY is still set.

-       spin_unlock(&hash_table_lock[lock_slot].lock);
+out_unlock:
+       smp_wmb();

JSD: I believe an smp_mb__before_clear_bit() would be a better choice, or
JSD: at least an lwsync because this is an unlock.

I'm not sure we really need any synchronization at all, since the only
memory that's protected by the _PAGE_BUSY bit is the word itself, so
the lock and the contents change will be seen at the same time (or at
least in the right order).

smp_wmb() is cheaper (eieio) than smp_mb() (sync), so either the
smp_wmb() should stay or it should be taken out altogether. I'd rather
err on the side of caution, so I'll keep it.

JSD: In the above two hunks of code, a pte_update is done which returns
JSD: the old pte value. This value is checked to determine if the hpte
JSD: should be invalidated. However, there is no lock held between the
JSD: time the pte value is read and the time the hpte is invalidated.
JSD: The hpte_invalidate routine doesn't check to make sure the va
JSD: passed in is really the one being invalidated in the slot -- it just
JSD: assumes the slot, etc are enough to locate it. So we might be
JSD: invalidating the wrong thing here. Probably a don't care, but thought
JSD: I'd ask.

Yes, but this is a general behaviour, since Linux PTEs are not updated
whenever their entries are thrown out of the hash table. We always risk
invalidating an entry that's been reused. In the grand scheme of things
it's not a big problem, since it should happen fairly rarely, and when it
does the invalidated entry will be faulted right back in. 2.6 behaves the
same way.
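
If we ever wanted to close that window, the invalidate path could
recheck the entry under the HPTE lock before zeroing it, something
like the sketch below (not in this patch; it reuses the avpn/v/h
checks that hpte_updatepp already does):

	pSeries_lock_hpte(hptep);
	dw0 = hptep->dw0.dw0;
	if (dw0.v && (dw0.avpn == (vpn >> 11)) && (dw0.h == secondary))
		hptep->dw0.dword0 = 0;	/* also drops the lock bit */
	else
		pSeries_unlock_hpte(hptep);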

 	/* Invalidate the hpte. */
 	hptep->dw0.dword0 = 0;

JSD: It would really be nice to add a comment here that the above
JSD: assignment statement is also unlocking the hpte as well.

Done.

JSD: Two questions here. (1) Shouldn't interrupts be disabled for the
JSD: write_lock/unlock here regardless of what processor we are running
JSD: on? (2) I don't see how this code is preventing another processor
JSD: from grabbing the read_lock immediately after this processor has
JSD: checked to make sure it isn't held.

JSD: Better question for (1) is why are interrupts being disabled here?
JSD: Can this routine be called from interrupt context?

Without disabling interrupts, there's a risk of deadlock if the
processor gets interrupted and the interrupt handler causes a page fault
that needs to be resolved. Since the lock is held for writing, the handler
would wait forever trying to take it for reading. This is actually similar
to the original deadlock that this whole patch is meant to remove, but the
window is now really small (just a few instructions). Likewise, an
interrupt on a different processor is not a problem, since forward progress
is still guaranteed on the processor holding the lock for writing, so the
reader will eventually get the lock.

And on the second question: this is the trick with using the rwlock.
There's no need to _prevent_ reading; all I need to know is that all
readers that started before pte_free_sync() have completed:

No one can get a reference to a PTE after it has been freed, so the only
risk is if someone walked the table during pte_free() and is still
holding a reference to it. hash_page() will hold the lock for reading
while this takes place, so all we need to know is that we _could_ take
the lock for writing (i.e. there are no readers of the table). Even if
another CPU comes in and traverses the tree afterwards, we're safe since
there's no way it can end up at that PTE (it's already been removed from
the tree).

This is all an ad-hoc solution since there's no RCU in 2.4, so I needed
another lightweight synchronization method.
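
In outline, the pattern looks like this (a sketch, condensed from the
hash_page() and pte_free_sync() hunks below):

	/* Reader side, hash_page(): pin the page tables on this CPU
	 * while walking and using the PTE */
	read_lock(&pte_hash_lock[smp_processor_id()]);
	ptep = find_linux_pte(pgdir, ea);
	/* ... use ptep ... */
	read_unlock(&pte_hash_lock[smp_processor_id()]);

	/* Free side, pte_free_sync(): wait for any CPU that was reading
	 * when the PTE page was unlinked to drop its read lock. Taking
	 * and immediately releasing the write lock is enough; irqs are
	 * disabled around the local CPU's lock so an interrupt-time
	 * hash_page() can't deadlock against us */
	for (i = 0; i < smp_num_cpus; i++)
		if (is_read_locked(&pte_hash_lock[i])) {
			if (i == smp_processor_id())
				local_irq_save(flags);
			write_lock(&pte_hash_lock[i]);
			write_unlock(&pte_hash_lock[i]);
			if (i == smp_processor_id())
				local_irq_restore(flags);
		}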

JSD: Since the pte_freelist_cur is a per-processor structure, I don't
JSD: think you need the batch->lock at all. What other thing could be
JSD: running on this same processor at the same time?

I thought there was a risk that pte_free() could be called from interrupt
context through pte_alloc(), but on closer examination it seems like that
shouldn't happen. If so, the locks can come out.

JSD: Why were these spinlocks changed to _irq? I noticed this change was
JSD: not present in the 2.6 code.

No, that's my mistake. They were part of the workaround patch and for some
reason got left in the patch I attached to the bug. They were not in the
patch that I meant to post to the list.

JSD: Same question about the _irq addition. Doesn't seem necessary. I'm
JSD: probably missing something -- please explain.

Same answer as above; dirty patch.

JSD: Why was the UL (unsigned long) dropped from the bit definitions?

To sync with 2.6: at one time I used these definitions from assembly, and
the UL suffix is illegal there. It's no longer needed.
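
For example (illustration only, not from the patch), with the definitions
shared between C and .S files,

	#define _PAGE_BUSY	0x0200UL

breaks an assembly user such as "andi. r5,r4,_PAGE_BUSY", since gas
doesn't understand the C "UL" suffix; plain 0x0200 works in both.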

-------------- next part --------------
===== arch/ppc64/kernel/htab.c 1.11 vs edited =====
--- 1.11/arch/ppc64/kernel/htab.c	Thu Dec 18 16:13:25 2003
+++ edited/arch/ppc64/kernel/htab.c	Tue Feb  3 22:01:59 2004
@@ -48,6 +48,29 @@
 #include <asm/iSeries/HvCallHpt.h>
 #include <asm/cputable.h>
 
+#define HPTE_LOCK_BIT 3
+
+static inline void pSeries_lock_hpte(HPTE *hptep)
+{
+	unsigned long *word = &hptep->dw0.dword0;
+
+	while (1) {
+		if (!test_and_set_bit(HPTE_LOCK_BIT, word))
+			break;
+		while(test_bit(HPTE_LOCK_BIT, word))
+			cpu_relax();
+	}
+}
+
+static inline void pSeries_unlock_hpte(HPTE *hptep)
+{
+	unsigned long *word = &hptep->dw0.dword0;
+
+	asm volatile("lwsync":::"memory");
+	clear_bit(HPTE_LOCK_BIT, word);
+}
+
+
 /*
  * Note:  pte   --> Linux PTE
  *        HPTE  --> PowerPC Hashed Page Table Entry
@@ -64,6 +87,7 @@
 
 extern unsigned long _SDR1;
 extern unsigned long klimit;
+extern rwlock_t pte_hash_lock[] __cacheline_aligned_in_smp;
 
 void make_pte(HPTE *htab, unsigned long va, unsigned long pa,
 	      int mode, unsigned long hash_mask, int large);
@@ -320,51 +344,73 @@
 	unsigned long va, vpn;
 	unsigned long newpp, prpn;
 	unsigned long hpteflags, lock_slot;
+	unsigned long access_ok, tmp;
 	long slot;
 	pte_t old_pte, new_pte;
+	int ret = 0;
 
 	/* Search the Linux page table for a match with va */
 	va = (vsid << 28) | (ea & 0x0fffffff);
 	vpn = va >> PAGE_SHIFT;
 	lock_slot = get_lock_slot(vpn); 
 
-	/* Acquire the hash table lock to guarantee that the linux
-	 * pte we fetch will not change
+	/* 
+	 * Check the user's access rights to the page.  If access should be
+	 * prevented then send the problem up to do_page_fault.
 	 */
-	spin_lock(&hash_table_lock[lock_slot].lock);
-	
+
 	/* 
 	 * Check the user's access rights to the page.  If access should be
 	 * prevented then send the problem up to do_page_fault.
 	 */
-#ifdef CONFIG_SHARED_MEMORY_ADDRESSING
+
 	access |= _PAGE_PRESENT;
-	if (unlikely(access & ~(pte_val(*ptep)))) {
+
+	/* We'll do access checking and _PAGE_BUSY setting in assembly, since
+	 * it needs to be atomic. 
+	 */
+
+	__asm__ __volatile__ ("\n\
+	1:	ldarx	%0,0,%3\n\
+		# Check if PTE is busy\n\
+		andi.	%1,%0,%4\n\
+		bne-	1b\n\
+		ori	%0,%0,%4\n\
+		# Write the linux PTE atomically (setting busy)\n\
+		stdcx.	%0,0,%3\n\
+		bne-	1b\n\
+		# Check access rights (access & ~(pte_val(*ptep)))\n\
+		andc.	%1,%2,%0\n\
+		bne-	2f\n\
+		li	%1,1\n\
+		b	3f\n\
+	2:	li	%1,0\n\
+	3:"
+	: "=&r" (old_pte), "=&r" (access_ok)
+	: "r" (access), "r" (ptep), "i" (_PAGE_BUSY)
+	: "cr0", "memory");
+
+#ifdef CONFIG_SHARED_MEMORY_ADDRESSING
+	if (unlikely(!access_ok)) {
 		if(!(((ea >> SMALLOC_EA_SHIFT) == 
 		      (SMALLOC_START >> SMALLOC_EA_SHIFT)) &&
 		     ((current->thread.flags) & PPC_FLAG_SHARED))) {
-			spin_unlock(&hash_table_lock[lock_slot].lock);
-			return 1;
+			ret = 1;
+			goto out_unlock;
 		}
 	}
 #else
-	access |= _PAGE_PRESENT;
-	if (unlikely(access & ~(pte_val(*ptep)))) {
-		spin_unlock(&hash_table_lock[lock_slot].lock);
-		return 1;
+	if (unlikely(!access_ok)) {
+		ret = 1;
+		goto out_unlock;
 	}
 #endif
 
 	/* 
-	 * We have found a pte (which was present).
-	 * The spinlocks prevent this status from changing
-	 * The hash_table_lock prevents the _PAGE_HASHPTE status
-	 * from changing (RPN, DIRTY and ACCESSED too)
-	 * The page_table_lock prevents the pte from being 
-	 * invalidated or modified
-	 */
-
-	/*
+	 * We have found a proper pte. The hash_table_lock protects
+	 * the pte from deallocation and the _PAGE_BUSY bit protects
+	 * the contents of the PTE from changing.
+	 *
 	 * At this point, we have a pte (old_pte) which can be used to build
 	 * or update an HPTE. There are 2 cases:
 	 *
@@ -385,7 +431,7 @@
 	else
 		pte_val(new_pte) |= _PAGE_ACCESSED;
 
-	newpp = computeHptePP(pte_val(new_pte));
+	newpp = computeHptePP(pte_val(new_pte) & ~_PAGE_BUSY);
 	
 	/* Check if pte already has an hpte (case 2) */
 	if (unlikely(pte_val(old_pte) & _PAGE_HASHPTE)) {
@@ -400,12 +446,13 @@
 		slot = (hash & htab_data.htab_hash_mask) * HPTES_PER_GROUP;
 		slot += (pte_val(old_pte) & _PAGE_GROUP_IX) >> 12;
 
-		/* XXX fix large pte flag */
+	        /* XXX fix large pte flag */
 		if (ppc_md.hpte_updatepp(slot, secondary, 
 					 newpp, va, 0) == -1) {
 			pte_val(old_pte) &= ~_PAGE_HPTEFLAGS;
 		} else {
 			if (!pte_same(old_pte, new_pte)) {
+				/* _PAGE_BUSY is still set in new_pte */
 				*ptep = new_pte;
 			}
 		}
@@ -425,12 +472,19 @@
 		pte_val(new_pte) |= ((slot<<12) & 
 				     (_PAGE_GROUP_IX | _PAGE_SECONDARY));
 
+		smp_wmb();
+		/* _PAGE_BUSY is not set in new_pte */
 		*ptep = new_pte;
+
+		return 0;
 	}
 
-	spin_unlock(&hash_table_lock[lock_slot].lock);
+out_unlock:
+	smp_wmb();
 
-	return 0;
+	pte_val(*ptep) &= ~_PAGE_BUSY;
+
+	return ret;
 }
 
 /*
@@ -497,11 +551,14 @@
 	pgdir = mm->pgd;
 	if (pgdir == NULL) return 1;
 
-	/*
-	 * Lock the Linux page table to prevent mmap and kswapd
-	 * from modifying entries while we search and update
+	/* The pte_hash_lock is used to block any PTE deallocations
+	 * while we walk the tree and use the entry. While technically
+	 * we both read and write the PTE entry while holding the read
+	 * lock, the _PAGE_BUSY bit will block pte_update()s to the
+	 * specific entry.
 	 */
-	spin_lock(&mm->page_table_lock);
+	
+	read_lock(&pte_hash_lock[smp_processor_id()]);
 
 	ptep = find_linux_pte(pgdir, ea);
 	/*
@@ -514,8 +571,7 @@
 		/* If no pte, send the problem up to do_page_fault */
 		ret = 1;
 	}
-
-	spin_unlock(&mm->page_table_lock);
+	read_unlock(&pte_hash_lock[smp_processor_id()]);
 
 	return ret;
 }
@@ -540,8 +596,6 @@
 	lock_slot = get_lock_slot(vpn); 
 	hash = hpt_hash(vpn, large);
 
-	spin_lock_irqsave(&hash_table_lock[lock_slot].lock, flags);
-
 	pte = __pte(pte_update(ptep, _PAGE_HPTEFLAGS, 0));
 	secondary = (pte_val(pte) & _PAGE_SECONDARY) >> 15;
 	if (secondary) hash = ~hash;
@@ -551,8 +605,6 @@
 	if (pte_val(pte) & _PAGE_HASHPTE) {
 		ppc_md.hpte_invalidate(slot, secondary, va, large, local);
 	}
-
-	spin_unlock_irqrestore(&hash_table_lock[lock_slot].lock, flags);
 }
 
 long plpar_pte_enter(unsigned long flags,
@@ -787,6 +839,8 @@
 
 	avpn = vpn >> 11;
 
+	pSeries_lock_hpte(hptep);
+
 	dw0 = hptep->dw0.dw0;
 
 	/*
@@ -794,9 +848,13 @@
 	 * the AVPN, hash group, and valid bits.  By doing it this way,
 	 * it is common with the pSeries LPAR optimal path.
 	 */
-	if (dw0.bolted) return;
+	if (dw0.bolted) {
+		pSeries_unlock_hpte(hptep);
 
-	/* Invalidate the hpte. */
+		return;
+	}
+
+	/* Invalidate the hpte. This clears the lock as well. */
 	hptep->dw0.dword0 = 0;
 
 	/* Invalidate the tlb */
@@ -875,6 +933,8 @@
 
 	avpn = vpn >> 11;
 
+	pSeries_lock_hpte(hptep);
+
 	dw0 = hptep->dw0.dw0;
 	if ((dw0.avpn == avpn) && 
 	    (dw0.v) && (dw0.h == secondary)) {
@@ -900,10 +960,14 @@
 		hptep->dw0.dw0 = dw0;
 		
 		__asm__ __volatile__ ("ptesync" : : : "memory");
+
+		pSeries_unlock_hpte(hptep);
 		
 		return 0;
 	}
 
+	pSeries_unlock_hpte(hptep);
+
 	return -1;
 }
 
@@ -1062,9 +1126,11 @@
 		dw0 = hptep->dw0.dw0;
 		if (!dw0.v) {
 			/* retry with lock held */
+			pSeries_lock_hpte(hptep);
 			dw0 = hptep->dw0.dw0;
 			if (!dw0.v)
 				break;
+			pSeries_unlock_hpte(hptep);
 		}
 		hptep++;
 	}
@@ -1079,9 +1145,11 @@
 			dw0 = hptep->dw0.dw0;
 			if (!dw0.v) {
 				/* retry with lock held */
+				pSeries_lock_hpte(hptep);
 				dw0 = hptep->dw0.dw0;
 				if (!dw0.v)
 					break;
+				pSeries_unlock_hpte(hptep);
 			}
 			hptep++;
 		}
@@ -1304,9 +1372,11 @@
 
 		if (dw0.v && !dw0.bolted) {
 			/* retry with lock held */
+			pSeries_lock_hpte(hptep);
 			dw0 = hptep->dw0.dw0;
 			if (dw0.v && !dw0.bolted)
 				break;
+			pSeries_unlock_hpte(hptep);
 		}
 
 		slot_offset++;
===== arch/ppc64/mm/init.c 1.8 vs edited =====
--- 1.8/arch/ppc64/mm/init.c	Tue Jan  6 17:54:44 2004
+++ edited/arch/ppc64/mm/init.c	Tue Feb  3 21:49:59 2004
@@ -104,16 +104,94 @@
  */
 mmu_gather_t     mmu_gathers[NR_CPUS];
 
+/* PTE free batching structures. We need a lock since not all
+ * operations take place under page_table_lock. Keep it per-CPU
+ * to avoid bottlenecks.
+ */
+
+struct pte_freelist_batch ____cacheline_aligned pte_freelist_cur[NR_CPUS] __cacheline_aligned_in_smp;
+rwlock_t pte_hash_lock[NR_CPUS] __cacheline_aligned_in_smp = { [0 ... NR_CPUS-1] = RW_LOCK_UNLOCKED };
+
+unsigned long pte_freelist_forced_free;
+
+static inline void pte_free_sync(void)
+{
+	unsigned long flags;
+	int i;
+
+	/* All we need to know is that we can get the write lock if
+	 * we wanted to, i.e. that no hash_page()s are holding it for reading.
+	 * If none are reading, that means there's no currently executing
+	 * hash_page() that might be working on one of the PTE's that will
+	 * be deleted. Likewise, if there is a reader, we need to get the
+	 * write lock to know when it releases the lock.
+	 */
+
+	for (i = 0; i < smp_num_cpus; i++)
+		if (is_read_locked(&pte_hash_lock[i])) {
+			/* So we don't deadlock with a reader on current cpu */
+			if(i == smp_processor_id())
+				local_irq_save(flags);
+
+			write_lock(&pte_hash_lock[i]);
+			write_unlock(&pte_hash_lock[i]);
+
+			if(i == smp_processor_id())
+				local_irq_restore(flags);
+		}
+}
+
+
+/* This is only called when we are critically out of memory
+ * (and fail to get a page in pte_free_tlb).
+ */
+void pte_free_now(pte_t *pte)
+{
+	pte_freelist_forced_free++;
+
+	pte_free_sync();
+
+	pte_free_kernel(pte);
+}
+
+/* Deallocates the pte-free batch after synchronizing with readers of
+ * any page tables.
+ */
+void pte_free_batch(void **batch, int size)
+{
+	unsigned int i;
+
+	pte_free_sync();
+
+	for (i = 0; i < size; i++)
+		pte_free_kernel(batch[i]);
+
+	free_page((unsigned long)batch);
+}
+
+
 int do_check_pgt_cache(int low, int high)
 {
 	int freed = 0;
+	struct pte_freelist_batch *batch;
+
+	/* We use this function to push the current pte free batch to be
+	 * deallocated, since do_check_pgt_cache() is called at the end of each
+	 * free_one_pgd() and other parts of the VM rely on all PTEs being
+	 * properly freed upon return from that function.
+	 */
+
+	batch = &pte_freelist_cur[smp_processor_id()];
+
+	if(batch->entry) {
+		pte_free_batch(batch->entry, batch->index);
+		batch->entry = NULL;
+	}
 
 	if (pgtable_cache_size > high) {
 		do {
 			if (pgd_quicklist)
 				free_page((unsigned long)pgd_alloc_one_fast(0)), ++freed;
-			if (pmd_quicklist)
-				free_page((unsigned long)pmd_alloc_one_fast(0, 0)), ++freed;
 			if (pte_quicklist)
 				free_page((unsigned long)pte_alloc_one_fast(0, 0)), ++freed;
 		} while (pgtable_cache_size > low);
@@ -290,7 +368,9 @@
 void
 local_flush_tlb_mm(struct mm_struct *mm)
 {
-	spin_lock(&mm->page_table_lock);
+	unsigned long flags;
+
+	spin_lock_irqsave(&mm->page_table_lock, flags);
 
 	if ( mm->map_count ) {
 		struct vm_area_struct *mp;
@@ -298,7 +378,7 @@
 			local_flush_tlb_range( mm, mp->vm_start, mp->vm_end );
 	}
 
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock_irqrestore(&mm->page_table_lock, flags);
 }
 
 /*
===== include/asm-ppc64/pgalloc.h 1.2 vs edited =====
--- 1.2/include/asm-ppc64/pgalloc.h	Tue Apr  9 06:31:08 2002
+++ edited/include/asm-ppc64/pgalloc.h	Tue Feb  3 21:50:36 2004
@@ -15,7 +15,6 @@
 #define quicklists      get_paca()
 
 #define pgd_quicklist 		(quicklists->pgd_cache)
-#define pmd_quicklist 		(quicklists->pmd_cache)
 #define pte_quicklist 		(quicklists->pte_cache)
 #define pgtable_cache_size 	(quicklists->pgtable_cache_sz)
 
@@ -60,10 +59,10 @@
 static inline pmd_t*
 pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
 {
-	unsigned long *ret = (unsigned long *)pmd_quicklist;
+	unsigned long *ret = (unsigned long *)pte_quicklist;
 
 	if (ret != NULL) {
-		pmd_quicklist = (unsigned long *)(*ret);
+		pte_quicklist = (unsigned long *)(*ret);
 		ret[0] = 0;
 		--pgtable_cache_size;
 	}
@@ -80,14 +79,6 @@
 	return pmd;
 }
 
-static inline void
-pmd_free (pmd_t *pmd)
-{
-	*(unsigned long *)pmd = (unsigned long) pmd_quicklist;
-	pmd_quicklist = (unsigned long *) pmd;
-	++pgtable_cache_size;
-}
-
 #define pmd_populate(MM, PMD, PTE)	pmd_set(PMD, PTE)
 
 static inline pte_t*
@@ -115,12 +106,54 @@
 }
 
 static inline void
-pte_free (pte_t *pte)
+pte_free_kernel (pte_t *pte)
 {
 	*(unsigned long *)pte = (unsigned long) pte_quicklist;
 	pte_quicklist = (unsigned long *) pte;
 	++pgtable_cache_size;
 }
+
+
+/* Use the PTE functions for freeing PMD as well, since the same
+ * problem with tree traversals apply. Since pmd pointers are always
+ * virtual, no need for a page_address() translation.
+ */
+ 
+#define pte_free(pte_page)      __pte_free(pte_page)
+#define pmd_free(pmd)           __pte_free(pmd)
+ 
+struct pte_freelist_batch
+{
+	unsigned int	index;
+	void	      **entry;
+};
+ 
+#define PTE_FREELIST_SIZE	(PAGE_SIZE / sizeof(void *))
+ 
+extern void pte_free_now(pte_t *pte);
+extern void pte_free_batch(void **batch, int size);
+extern struct ____cacheline_aligned pte_freelist_batch pte_freelist_cur[] __cacheline_aligned_in_smp;
+ 
+static inline void __pte_free(pte_t *pte)
+{
+	struct pte_freelist_batch *batchp = &pte_freelist_cur[smp_processor_id()];
+ 
+	if (batchp->entry == NULL) {
+		batchp->entry = (void **)__get_free_page(GFP_ATOMIC);
+		if (batchp->entry == NULL) {
+			pte_free_now(pte);
+			return;
+		}
+		batchp->index = 0;
+	}
+ 
+	batchp->entry[batchp->index++] = pte;
+	if (batchp->index == PTE_FREELIST_SIZE) {
+		pte_free_batch(batchp->entry, batchp->index);
+		batchp->entry = NULL;
+	}
+}
+
 
 extern int do_check_pgt_cache(int, int);
 
===== include/asm-ppc64/pgtable.h 1.7 vs edited =====
--- 1.7/include/asm-ppc64/pgtable.h	Mon Aug 25 23:47:52 2003
+++ edited/include/asm-ppc64/pgtable.h	Tue Feb  3 21:33:22 2004
@@ -88,22 +88,22 @@
  * Bits in a linux-style PTE.  These match the bits in the
  * (hardware-defined) PowerPC PTE as closely as possible.
  */
-#define _PAGE_PRESENT	0x001UL	/* software: pte contains a translation */
-#define _PAGE_USER	0x002UL	/* matches one of the PP bits */
-#define _PAGE_RW	0x004UL	/* software: user write access allowed */
-#define _PAGE_GUARDED	0x008UL
-#define _PAGE_COHERENT	0x010UL	/* M: enforce memory coherence (SMP systems) */
-#define _PAGE_NO_CACHE	0x020UL	/* I: cache inhibit */
-#define _PAGE_WRITETHRU	0x040UL	/* W: cache write-through */
-#define _PAGE_DIRTY	0x080UL	/* C: page changed */
-#define _PAGE_ACCESSED	0x100UL	/* R: page referenced */
-#define _PAGE_HPTENOIX	0x200UL /* software: pte HPTE slot unknown */
-#define _PAGE_HASHPTE	0x400UL	/* software: pte has an associated HPTE */
-#define _PAGE_EXEC	0x800UL	/* software: i-cache coherence required */
-#define _PAGE_SECONDARY 0x8000UL /* software: HPTE is in secondary group */
-#define _PAGE_GROUP_IX  0x7000UL /* software: HPTE index within group */
+#define _PAGE_PRESENT	0x0001 /* software: pte contains a translation */
+#define _PAGE_USER	0x0002 /* matches one of the PP bits */
+#define _PAGE_RW	0x0004 /* software: user write access allowed */
+#define _PAGE_GUARDED	0x0008
+#define _PAGE_COHERENT	0x0010 /* M: enforce memory coherence (SMP systems) */
+#define _PAGE_NO_CACHE	0x0020 /* I: cache inhibit */
+#define _PAGE_WRITETHRU	0x0040 /* W: cache write-through */
+#define _PAGE_DIRTY	0x0080 /* C: page changed */
+#define _PAGE_ACCESSED	0x0100 /* R: page referenced */
+#define _PAGE_BUSY	0x0200 /* software: pte & hash are busy */
+#define _PAGE_HASHPTE	0x0400 /* software: pte has an associated HPTE */
+#define _PAGE_EXEC	0x0800 /* software: i-cache coherence required */
+#define _PAGE_GROUP_IX  0x7000 /* software: HPTE index within group */
+#define _PAGE_SECONDARY 0x8000 /* software: HPTE is in secondary group */
 /* Bits 0x7000 identify the index within an HPT Group */
-#define _PAGE_HPTEFLAGS (_PAGE_HASHPTE | _PAGE_HPTENOIX | _PAGE_SECONDARY | _PAGE_GROUP_IX)
+#define _PAGE_HPTEFLAGS (_PAGE_BUSY | _PAGE_HASHPTE | _PAGE_SECONDARY | _PAGE_GROUP_IX)
 /* PAGE_MASK gives the right answer below, but only by accident */
 /* It should be preserving the high 48 bits and then specifically */
 /* preserving _PAGE_SECONDARY | _PAGE_GROUP_IX */
@@ -281,13 +281,15 @@
 	unsigned long old, tmp;
 
 	__asm__ __volatile__("\n\
-1:	ldarx	%0,0,%3	\n\
+1:	ldarx	%0,0,%3 \n\
+        andi.   %1,%0,%7 # loop on _PAGE_BUSY set\n\
+        bne-    1b \n\
 	andc	%1,%0,%4 \n\
 	or	%1,%1,%5 \n\
 	stdcx.	%1,0,%3 \n\
 	bne-	1b"
 	: "=&r" (old), "=&r" (tmp), "=m" (*p)
-	: "r" (p), "r" (clr), "r" (set), "m" (*p)
+	: "r" (p), "r" (clr), "r" (set), "m" (*p), "i" (_PAGE_BUSY)
 	: "cc" );
 	return old;
 }

