[RFC] implicit hugetlb pages (hugetlb_implicit)

David Gibson david at gibson.dropbear.id.au
Tue Jan 13 10:57:08 EST 2004


On Mon, Jan 12, 2004 at 03:38:00PM -0800, Adam Litke wrote:
> Thank you for your comments and suggestions. They are proving very
> helpful as I work to clean this up.

Glad to hear it :)

> On Sun, 2004-01-11 at 20:19, David Gibson wrote:
> > On Fri, Jan 09, 2004 at 01:27:20PM -0800, Adam Litke wrote:
> > >
> > > hugetlb_implicit (2.6.0):
> > >    This patch includes the anonymous mmap work from Dave Gibson
> > > (right?)
> >
> > I'm not sure what you're referring to here. My patches for lbss
> > support also include support for copy-on-write of hugepages and
> > various other changes which can make them act kind of like anonymous
> > pages.
> >
> > But I don't see much in this patch that looks familiar.
>
> Hmm. Could the original author of hugetlb for anonymous mmap claim
> credit for the initial code?

I think I once knew who it was, but I've forgotten, sorry.

Incidentally, you probably do want to fold in my hugepage-COW stuff
(although it does mean some more generic changes).  Otherwise
hugepages are always MAP_SHARED, which means that with an implicit
hugepage mmap() certain regions of memory will silently have totally
different semantics from what you expect - it could get very weird
across a fork().
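
To make the fork() weirdness concrete, here's a quick userspace
sketch (mine, not from the patch; it assumes the 16MB request below
happens to go down the implicit hugepage path):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	/* An ordinary anonymous MAP_PRIVATE mapping.  If the implicit
	 * path silently turns this into a MAP_SHARED hugepage mapping,
	 * the child's write below becomes visible to the parent. */
	char *p = mmap(NULL, 16 * 1024 * 1024, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	p[0] = 'A';

	if (fork() == 0) {
		p[0] = 'B';	/* should only touch the child's private copy */
		_exit(0);
	}
	wait(NULL);

	/* With real MAP_PRIVATE semantics this is still 'A'; with a
	 * silently shared hugepage mapping it is now 'B'. */
	return (p[0] == 'A') ? 0 : 2;
}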

And for that matter there's at least one plain-old-bug in the current
hugepage code which is addressed in my patch (the LOW_HPAGES bit isn't
propagated correctly across a fork()).

I'll attach my patch, which also includes the hugepage ELF segment
stuff.  I'm afraid I haven't had a chance to separate out those parts
of the patch yet.

> > > +	/* Do we have enough free huge pages? */
> > > +	if (!is_hugepage_mem_enough(len))
> > > +		return 0;
> >
> > Is this test safe/necessary? i.e. a) is there any potential race
> > which could cause the mmap() to fail because it's short of memory
> > despite passing the test here and b) can't we just let the mmap
> > fail and fall back then rather than checking beforehand?
>
> You're right. Now that safe fallback is working, we might as well
> defer this test to get_unmapped_area.

Ok.

> > Do we need/want any consideration of the given "hint" address here?
>
> I am trying to do what the kernel does for normal mmaps here. If
> someone hints at an address, they hopefully have a good reason for it.
> I wouldn't want to override it just so I can do implicit hugetlb. Most
> applications pass NULL for the hint, right?

That's kind of my point:  what if someone gives a hugepage-aligned
size with a non-aligned hint address?  Currently the test is only on
the size.  We either have to map somewhere other than the hint
address (which is what the patch does now, I think), or only attempt a
hugepage map if the hint address is also aligned.

This is for the case where the hint really is a hint, of course - so
we don't have to obey it.  If it's MAP_FIXED it's a different code
path, and we never attempt a hugepage mapping (unless it's explicitly
from hugetlbfs).  Perhaps we should, though.
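
In code, the second option would look something like this (a sketch
only, reusing the helper names from the quoted hunk, not the actual
cleanup):

/* Sketch: treat an unaligned hint as disqualifying the implicit
 * hugepage attempt, rather than quietly mapping somewhere else.
 * A hint of 0 means "no preference", so alignment doesn't apply. */
static int implicit_hugepage_ok(unsigned long addr, unsigned long len)
{
	if (!mmap_hugetlb_implicit(len))
		return 0;
	if (addr && (addr & ~HPAGE_MASK))
		return 0;
	return 1;
}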

> > > +	/* Explicit requests for huge pages are allowed to return errors */
> > > +	if (*flags & MAP_HUGETLB) {
> > > +		if (pre_error)
> > > +			return pre_error;
> > > +		return hugetlb_get_unmapped_area(NULL, addr, len, pgoff, *flags);
> > > +	}
> > > +
> > > +	/*
> > > +	 * When implicit request fails, return 0 so we can
> > > +	 * retry later with regular pages.
> > > +	 */
> > > +	if (mmap_hugetlb_implicit(len)) {
> > > +		if (pre_error)
> > > +			goto out;
> > > +		addr = hugetlb_get_unmapped_area(NULL, addr, len, pgoff, *flags);
> > > +		if (IS_ERR((void *)addr))
> > > +			goto out;
> > > +		else {
> > > +			*flags |= MAP_HUGETLB;
> > > +			return addr;
> > > +		}
> > > +	}
> > > +
> > > +out:
> > > +	*flags &= ~MAP_HUGETLB;
> > > +	return 0;
> > > +}
> >
> > This does assume that 0 is never a valid address returned for
> > a hugepage range. That's true now, but it makes me slightly
> > uncomfortable, since there's no inherent reason we couldn't make
> > segment zero a hugepage segment.
>
> You definitely found an ugly part of the patch. Cleanup in progress.

Excellent.
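
One way to get rid of the 0-as-sentinel convention would be a
tri-state return, with the address handed back separately - purely a
sketch, and the enum, helper name and out-parameter below are made
up, not part of the patch:

/* Sketch only: report "mapped", "fall back" and "hard error"
 * explicitly instead of overloading address 0, so a hugepage
 * segment at address zero would not be ambiguous. */
enum htlb_result { HTLB_MAPPED, HTLB_FALLBACK, HTLB_ERROR };

static enum htlb_result try_hugetlb_area(unsigned long *addr,
					 unsigned long len,
					 unsigned long pgoff,
					 unsigned long *flags)
{
	unsigned long a;

	if (!(*flags & MAP_HUGETLB) && !mmap_hugetlb_implicit(len))
		return HTLB_FALLBACK;

	a = hugetlb_get_unmapped_area(NULL, *addr, len, pgoff, *flags);
	if (IS_ERR((void *)a))
		return (*flags & MAP_HUGETLB) ? HTLB_ERROR : HTLB_FALLBACK;

	*flags |= MAP_HUGETLB;
	*addr = a;
	return HTLB_MAPPED;
}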

> > > +#ifdef CONFIG_HUGETLBFS
> > > +int shm_with_hugepages(int shmflag, size_t size)
> > > +{
> > > +	/* flag specified explicitly */
> > > +	if (shmflag & SHM_HUGETLB)
> > > +		return 1;
> > > +	/* Are we disabled? */
> > > +	if (!shm_use_hugepages)
> > > +		return 0;
> > > +	/* Must be HPAGE aligned */
> > > +	if (size & ~HPAGE_MASK)
> > > +		return 0;
> > > +	/* Are we under the max per file? */
> > > +	if ((size >> HPAGE_SHIFT) > shm_hugepages_per_file)
> > > +		return 0;
> >
> > I don't really understand this per-file restriction. More comments
> > below.
>
> Since hugetlb pages are a relatively scarce resource, this is a
> rudimentary method to ensure that one application doesn't allocate
> more than its fair share of hugetlb memory.

Ah, ok.  It's probably worth adding a comment or two to that effect.
At the moment I don't think this is particularly necessary, since you
need root (well, CAP_IPC_LOCK) to allocate hugepages.  But we may well
want to change that, so some sort of limit is probably a good idea.  I
wonder if there is a more direct way of accomplishing this.
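
A more direct way might be to charge huge pages to the allocating mm
against a sysctl, rather than capping each file - a very rough
sketch, and every name in it (hugepages_charged, hugepages_per_task)
is invented, not something in the patch or the kernel:

/* Rough sketch of per-task accounting instead of a per-file cap. */
static int charge_hugepages(struct mm_struct *mm, unsigned long size)
{
	unsigned long pages = size >> HPAGE_SHIFT;
	int ret = 0;

	spin_lock(&mm->page_table_lock);
	if (mm->hugepages_charged + pages > hugepages_per_task)
		ret = -ENOMEM;
	else
		mm->hugepages_charged += pages;
	spin_unlock(&mm->page_table_lock);

	return ret;
}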

> > > +	/* Do we have enough free huge pages? */
> > > +	if (!is_hugepage_mem_enough(size))
> > > +		return 0;
> >
> > Same concerns with this test as in the mmap case.
>
> You're right. This is racy. I haven't given the shared mem part of the
> patch nearly as much attention as the mmap part. I am going to leave
> this partially broken until I clean up the fallback code for mmaps so
> I can put that here as well.

Fair enough.

> > > @@ -501,8 +505,17 @@ unsigned long do_mmap_pgoff(struct file
> > >
> > >  	/* Obtain the address to map to. we verify (or select) it and ensure
> > >  	 * that it represents a valid section of the address space.
> > > +	 * VM_HUGETLB will never appear in vm_flags when CONFIG_HUGETLB is
> > > +	 * unset.
> > >  	 */
> > > -	addr = get_unmapped_area(file, addr, len, pgoff, flags);
> > > +#ifdef CONFIG_HUGETLBFS
> > > +	addr = try_hugetlb_get_unmapped_area(NULL, addr, len, pgoff, &flags);
> > > +	if (IS_ERR((void *)addr))
> > > +		return addr;
> >
> > This doesn't look right - we don't fall back if try_hugetlb...()
> > fails. But it can fail if we don't have the right permissions, for
> > one thing, in which case we certainly do want to fall back.
>
> I admit this is messy and I am working on cleaning it up.

Great.
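
For the record, the fallback rule I'd expect after the cleanup is
roughly the following (a sketch reusing the names from the quoted
hunk, not the actual fix):

#ifdef CONFIG_HUGETLBFS
	/* Sketch: an error from the hugepage path (no permission, no
	 * free huge pages, ...) fails the mmap() only when the caller
	 * explicitly asked for MAP_HUGETLB; any implicit attempt that
	 * goes wrong just drops back to the regular allocator. */
	addr = try_hugetlb_get_unmapped_area(NULL, addr, len, pgoff, &flags);
	if (IS_ERR((void *)addr)) {
		if (flags & MAP_HUGETLB)
			return addr;
		addr = 0;	/* note: the original hint is lost here */
	}
	if (addr == 0)
		addr = get_unmapped_area(file, addr, len, pgoff, flags);
#else
	addr = get_unmapped_area(file, addr, len, pgoff, flags);
#endif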

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson
-------------- next part --------------
diff -urN ppc64-linux-2.5/arch/ppc64/mm/hugetlbpage.c linux-gogogo/arch/ppc64/mm/hugetlbpage.c
--- ppc64-linux-2.5/arch/ppc64/mm/hugetlbpage.c	2003-10-14 22:33:33.000000000 +1000
+++ linux-gogogo/arch/ppc64/mm/hugetlbpage.c	2003-11-25 17:04:25.000000000 +1100
@@ -118,6 +118,16 @@
 #define hugepte_page(x)	pfn_to_page(hugepte_pfn(x))
 #define hugepte_none(x)	(!(hugepte_val(x) & _HUGEPAGE_PFN))

+#define hugepte_write(x) (hugepte_val(x) & _HUGEPAGE_RW)
+#define hugepte_same(A,B) \
+	(((hugepte_val(A) ^ hugepte_val(B)) & ~_HUGEPAGE_HPTEFLAGS) == 0)
+
+static inline hugepte_t hugepte_mkwrite(hugepte_t pte)
+{
+	hugepte_val(pte) |= _HUGEPAGE_RW;
+	return pte;
+}
+

 static void free_huge_page(struct page *page);
 static void flush_hash_hugepage(mm_context_t context, unsigned long ea,
@@ -219,20 +229,6 @@
 		pmd_clear((pmd_t *)(ptep+i));
 }

-/*
- * This function checks for proper alignment of input addr and len parameters.
- */
-int is_aligned_hugepage_range(unsigned long addr, unsigned long len)
-{
-	if (len & ~HPAGE_MASK)
-		return -EINVAL;
-	if (addr & ~HPAGE_MASK)
-		return -EINVAL;
-	if (! is_hugepage_only_range(addr, len))
-		return -EINVAL;
-	return 0;
-}
-
 static void do_slbia(void *unused)
 {
 	asm volatile ("isync; slbia; isync":::"memory");
@@ -251,8 +247,11 @@
 	/* Check no VMAs are in the region */
 	vma = find_vma(mm, TASK_HPAGE_BASE_32);

-	if (vma && (vma->vm_start < TASK_HPAGE_END_32))
+	if (vma && (vma->vm_start < TASK_HPAGE_END_32)) {
+		printk(KERN_DEBUG "Low HTLB region busy: PID=%d  vma @ %lx-%lx\n",
+		       current->pid, vma->vm_start, vma->vm_end);
 		return -EBUSY;
+	}

 	/* Clean up any leftover PTE pages in the region */
 	spin_lock(&mm->page_table_lock);
@@ -293,6 +292,43 @@
 	return 0;
 }

+int is_aligned_hugepage_range(unsigned long addr, unsigned long len)
+{
+	if (len & ~HPAGE_MASK)
+		return -EINVAL;
+	if (addr & ~HPAGE_MASK)
+		return -EINVAL;
+	if (! is_hugepage_only_range(addr, len))
+		return -EINVAL;
+	return 0;
+}
+
+int is_potential_hugepage_range(unsigned long addr, unsigned long len)
+{
+	if (len & ~HPAGE_MASK)
+		return -EINVAL;
+	if (addr & ~HPAGE_MASK)
+		return -EINVAL;
+	if (! is_hugepage_potential_range(addr, len))
+		return -EINVAL;
+	return 0;
+}
+
+
+int prepare_hugepage_range(unsigned long addr, unsigned long len)
+{
+	int ret;
+
+	BUG_ON(is_potential_hugepage_range(addr, len) != 0);
+
+	if (is_hugepage_low_range(addr, len)) {
+		ret = open_32bit_htlbpage_range(current->mm);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
 int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			struct vm_area_struct *vma)
 {
@@ -300,6 +336,16 @@
 	struct page *ptepage;
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
+	cpumask_t tmp;
+	int cow;
+	int local = 0;
+
+	/* XXX are there races with checking cpu_vm_mask? - Anton */
+	tmp = cpumask_of_cpu(smp_processor_id());
+	if (cpus_equal(vma->vm_mm->cpu_vm_mask, tmp))
+		local = 1;
+
+	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;

 	while (addr < end) {
 		BUG_ON(! in_hugepage_area(src->context, addr));
@@ -310,6 +356,17 @@
 			return -ENOMEM;

 		src_pte = hugepte_offset(src, addr);
+
+		if (cow) {
+			entry = __hugepte(hugepte_update(src_pte,
+							 _HUGEPAGE_RW
+							 | _HUGEPAGE_HPTEFLAGS,
+							 0));
+			if ((addr % HPAGE_SIZE) == 0)
+				flush_hash_hugepage(src->context, addr,
+						    entry, local);
+		}
+
 		entry = *src_pte;

 		if ((addr % HPAGE_SIZE) == 0) {
@@ -483,12 +540,16 @@
 	struct mm_struct *mm = current->mm;
 	unsigned long addr;
 	int ret = 0;
+	int writable;

 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON((vma->vm_start % HPAGE_SIZE) != 0);
 	BUG_ON((vma->vm_end % HPAGE_SIZE) != 0);

 	spin_lock(&mm->page_table_lock);
+
+	writable = (vma->vm_flags & VM_WRITE) && (vma->vm_flags & VM_SHARED);
+
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
 		unsigned long idx;
 		hugepte_t *pte = hugepte_alloc(mm, addr);
@@ -518,15 +579,25 @@
 				ret = -ENOMEM;
 				goto out;
 			}
-			ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
-			unlock_page(page);
+			/* This is a new page, all full of zeroes.  If
+			 * we're MAP_SHARED, the page needs to go into
+			 * the page cache.  If it's MAP_PRIVATE it
+			 * might as well be made "anonymous" now or
+			 * we'll just have to copy it on the first
+			 * write. */
+			if (vma->vm_flags & VM_SHARED) {
+				ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
+				unlock_page(page);
+			} else {
+				writable = (vma->vm_flags & VM_WRITE);
+			}
 			if (ret) {
 				hugetlb_put_quota(mapping);
 				free_huge_page(page);
 				goto out;
 			}
 		}
-		setup_huge_pte(mm, page, pte, vma->vm_flags & VM_WRITE);
+		setup_huge_pte(mm, page, pte, writable);
 	}
 out:
 	spin_unlock(&mm->page_table_lock);
@@ -659,10 +730,9 @@
 	if (!in_hugepage_area(mm->context, ea))
 		return -1;

-	ea &= ~(HPAGE_SIZE-1);
-
 	/* We have to find the first hugepte in the batch, since
 	 * that's the one that will store the HPTE flags */
+	ea &= HPAGE_MASK;
 	ptep = hugepte_offset(mm, ea);

 	/* Search the Linux page table for a match with va */
@@ -683,7 +753,7 @@
 	 * prevented then send the problem up to do_page_fault.
 	 */
 	is_write = access & _PAGE_RW;
-	if (unlikely(is_write && !(hugepte_val(*ptep) & _HUGEPAGE_RW)))
+	if (unlikely(is_write && !hugepte_write(*ptep)))
 		return 1;

 	/*
@@ -886,10 +956,11 @@
 			spin_unlock(&htlbpage_lock);
 		}
 		htlbpage_max = htlbpage_free = htlbpage_total = i;
-		printk("Total HugeTLB memory allocated, %d\n", htlbpage_free);
+		printk(KERN_INFO "Total HugeTLB memory allocated, %d\n",
+		       htlbpage_free);
 	} else {
 		htlbpage_max = 0;
-		printk("CPU does not support HugeTLB\n");
+		printk(KERN_INFO "CPU does not support HugeTLB\n");
 	}

 	return 0;
@@ -914,6 +985,121 @@
 	return (size + ~HPAGE_MASK)/HPAGE_SIZE <= htlbpage_free;
 }

+static int hugepage_cow(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, hugepte_t *ptep, hugepte_t pte)
+{
+	struct page *old_page, *new_page;
+	int i;
+	cpumask_t tmp;
+	int local = 0;
+
+	BUG_ON(!pfn_valid(hugepte_pfn(*ptep)));
+
+	old_page = hugepte_page(*ptep);
+
+	/* XXX are there races with checking cpu_vm_mask? - Anton */
+	tmp = cpumask_of_cpu(smp_processor_id());
+	if (cpus_equal(vma->vm_mm->cpu_vm_mask, tmp))
+		local = 1;
+
+	/* If no-one else is actually using this page, avoid the copy
+	 * and just make the page writable */
+	if (!TestSetPageLocked(old_page)) {
+		int avoidcopy = (page_count(old_page) == 1);
+		unlock_page(old_page);
+		if (avoidcopy) {
+			for (i = 0; i < HUGEPTE_BATCH_SIZE; i++)
+				set_hugepte(ptep+i, hugepte_mkwrite(pte));
+
+
+			pte = __hugepte(hugepte_update(ptep, _HUGEPAGE_HPTEFLAGS, 0));
+			if (hugepte_val(pte) & _HUGEPAGE_HASHPTE)
+				flush_hash_hugepage(mm->context, address,
+						    pte, local);
+			spin_unlock(&mm->page_table_lock);
+			return VM_FAULT_MINOR;
+		}
+	}
+
+	page_cache_get(old_page);
+
+	spin_unlock(&mm->page_table_lock);
+
+	new_page = alloc_hugetlb_page();
+	if (! new_page) {
+		page_cache_release(old_page);
+
+		/* Logically this is OOM, not a SIGBUS, but an OOM
+		 * could cause the kernel to go killing other
+		 * processes which won't help the hugepage situation
+		 * at all (?) */
+		return VM_FAULT_SIGBUS;
+	}
+
+	for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++)
+		copy_user_highpage(new_page + i, old_page + i, address + i*PAGE_SIZE);
+
+	spin_lock(&mm->page_table_lock);
+
+	/* XXX are there races with checking cpu_vm_mask? - Anton */
+	tmp = cpumask_of_cpu(smp_processor_id());
+	if (cpus_equal(vma->vm_mm->cpu_vm_mask, tmp))
+		local = 1;
+
+	ptep = hugepte_offset(mm, address);
+	if (hugepte_same(*ptep, pte)) {
+		/* Break COW */
+		for (i = 0; i < HUGEPTE_BATCH_SIZE; i++)
+			hugepte_update(ptep, ~0,
+				       hugepte_val(mk_hugepte(new_page, 1)));
+
+		if (hugepte_val(pte) & _HUGEPAGE_HASHPTE)
+			flush_hash_hugepage(mm->context, address,
+					    pte, local);
+
+		/* Make the old page be freed below */
+		new_page = old_page;
+	}
+	page_cache_release(new_page);
+	page_cache_release(old_page);
+	spin_unlock(&mm->page_table_lock);
+	return VM_FAULT_MINOR;
+}
+
+int handle_hugetlb_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma,
+			    unsigned long address, int write_access)
+{
+	hugepte_t *ptep;
+	int rc = VM_FAULT_SIGBUS;
+
+	spin_lock(&mm->page_table_lock);
+
+	ptep = hugepte_offset(mm, address & HPAGE_MASK);
+
+	if ( (! ptep) || hugepte_none(*ptep))
+		goto fail;
+
+	/* Otherwise, there ought to be a real hugepte here */
+	BUG_ON(hugepte_bad(*ptep));
+
+	rc = VM_FAULT_MINOR;
+
+	if (! (write_access && !hugepte_write(*ptep))) {
+		printk(KERN_WARNING "Unexpected hugepte fault (wr=%d hugepte=%08lx)\n",
+		       write_access, hugepte_val(*ptep));
+		goto fail;
+	}
+
+	/* The only faults we should actually get are COWs */
+	/* this drops the page_table_lock */
+	return hugepage_cow(mm, vma, address, ptep, *ptep);
+
+ fail:
+	spin_unlock(&mm->page_table_lock);
+
+	return rc;
+}
+
 /*
  * We cannot handle pagefaults against hugetlb pages at all.  They cause
  * handle_mm_fault() to try to instantiate regular-sized pages in the
diff -urN ppc64-linux-2.5/arch/ppc64/mm/init.c linux-gogogo/arch/ppc64/mm/init.c
--- ppc64-linux-2.5/arch/ppc64/mm/init.c	2003-10-24 09:50:18.000000000 +1000
+++ linux-gogogo/arch/ppc64/mm/init.c	2003-11-25 14:29:53.000000000 +1100
@@ -549,7 +549,11 @@
 						++ptep;
 					} while (start < pmd_end);
 				} else {
-					WARN_ON(pmd_hugepage(*pmd));
+					/* We don't need to flush huge
+					 * pages here, because that's
+					 * done in
+					 * copy_hugetlb_page_range()
+					 * if necessary */
 					start = pmd_end;
 				}
 				++pmd;
diff -urN ppc64-linux-2.5/fs/binfmt_elf.c linux-gogogo/fs/binfmt_elf.c
--- ppc64-linux-2.5/fs/binfmt_elf.c	2003-10-23 08:29:46.000000000 +1000
+++ linux-gogogo/fs/binfmt_elf.c	2003-11-27 15:58:12.000000000 +1100
@@ -265,11 +265,81 @@

 #ifndef elf_map

+#ifdef CONFIG_HUGETLBFS
+#include <linux/hugetlb.h>
+
+static unsigned long elf_htlb_map(struct file *filep, unsigned long addr,
+				  struct elf_phdr *eppnt, int prot, int type)
+{
+	struct file *htlbfile;
+	unsigned long start, len;
+	unsigned long map_addr;
+	int retval;
+
+	printk(KERN_DEBUG "Found HTLB ELF segment %lx-%lx\n",
+	       addr, addr + eppnt->p_memsz);
+	start = addr & HPAGE_MASK;
+	len = ALIGN(eppnt->p_memsz + (addr & ~HPAGE_MASK), HPAGE_SIZE);
+
+	/* If we have data from the file to put in the segment, we
+	 * have to make it writable, so that we can read it in there
+	 * (mprotect() doesn't work on hugepages). */
+	if (eppnt->p_filesz != 0)
+		prot |= PROT_WRITE;
+
+	if (is_potential_hugepage_range(start, len) != 0) {
+		printk(KERN_WARNING "HTLB ELF segment is not a valid hugepage range\n");
+		return -EINVAL;
+	}
+
+	htlbfile = hugetlb_zero_setup(eppnt->p_memsz);
+	if (IS_ERR(htlbfile)) {
+		printk(KERN_WARNING "Unable to allocate HTLB ELF segment (%ld)\n",
+		       PTR_ERR(htlbfile));
+		return PTR_ERR(htlbfile);
+	}
+	set_file_hugepages(htlbfile);
+	down_write(&current->mm->mmap_sem);
+	map_addr = do_mmap(htlbfile, start, len, prot, type, 0);
+	up_write(&current->mm->mmap_sem);
+	fput(htlbfile);
+
+	if (eppnt->p_filesz != 0) {
+		loff_t pos = eppnt->p_offset;
+
+		printk("Reading %lu bytes of file data into HTLB segment\n",
+		       (unsigned long) eppnt->p_filesz);
+		retval = vfs_read(filep, (void __user *)addr, eppnt->p_filesz, &pos);
+		printk("HTLB read returned %d\n", retval);
+		if (retval < 0) {
+			extern asmlinkage long sys_munmap(unsigned long, size_t);
+			sys_munmap(start, len);
+			return retval;
+		}
+	}
+
+
+	return map_addr;
+}
+#else
+static inline int elf_htlb_map(struct file *filep, unsigned long addr,
+			       struct elf_phdr *eppnt, int prot, int type)
+{
+	return -ENOSYS;
+}
+#endif
 static unsigned long elf_map(struct file *filep, unsigned long addr,
 			struct elf_phdr *eppnt, int prot, int type)
 {
 	unsigned long map_addr;

+	if (eppnt->p_flags & PF_LINUX_HTLB) {
+		map_addr = elf_htlb_map(filep, addr, eppnt, prot, type);
+		if (map_addr < (unsigned long)(-1024))
+			return map_addr;
+		printk(KERN_DEBUG "Falling back to non HTLB allocation\n");
+	}
+
 	down_write(&current->mm->mmap_sem);
 	map_addr = do_mmap(filep, ELF_PAGESTART(addr),
 			   eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr), prot, type,
diff -urN ppc64-linux-2.5/include/asm-ppc64/mmu_context.h linux-gogogo/include/asm-ppc64/mmu_context.h
--- ppc64-linux-2.5/include/asm-ppc64/mmu_context.h	2003-09-12 21:06:51.000000000 +1000
+++ linux-gogogo/include/asm-ppc64/mmu_context.h	2003-11-25 13:07:49.000000000 +1100
@@ -80,6 +80,8 @@
 {
 	long head;
 	unsigned long flags;
+	/* This does the right thing across a fork (I hope) */
+	unsigned long low_hpages = mm->context & CONTEXT_LOW_HPAGES;

 	spin_lock_irqsave(&mmu_context_queue.lock, flags);

@@ -90,6 +92,7 @@

 	head = mmu_context_queue.head;
 	mm->context = mmu_context_queue.elements[head];
+	mm->context |= low_hpages;

 	head = (head < LAST_USER_CONTEXT-1) ? head+1 : 0;
 	mmu_context_queue.head = head;
diff -urN ppc64-linux-2.5/include/asm-ppc64/page.h linux-gogogo/include/asm-ppc64/page.h
--- ppc64-linux-2.5/include/asm-ppc64/page.h	2003-09-12 21:06:51.000000000 +1000
+++ linux-gogogo/include/asm-ppc64/page.h	2003-11-24 18:00:54.000000000 +1100
@@ -37,11 +37,22 @@
 #define TASK_HPAGE_END_32	(0xc0000000UL)

 #define ARCH_HAS_HUGEPAGE_ONLY_RANGE
+#define ARCH_HAS_PREPARE_HUGEPAGE_RANGE
+
+#define is_hugepage_low_range(addr, len) \
+	(((addr) > (TASK_HPAGE_BASE_32-(len))) && ((addr) < TASK_HPAGE_END_32))
+#define is_hugepage_high_range(addr, len) \
+	(((addr) > (TASK_HPAGE_BASE-(len))) && ((addr) < TASK_HPAGE_END))
+
+#define is_hugepage_potential_range(addr, len) \
+	(is_hugepage_high_range(addr, len) || is_hugepage_low_range(addr, len))
 #define is_hugepage_only_range(addr, len) \
-	( ((addr > (TASK_HPAGE_BASE-len)) && (addr < TASK_HPAGE_END)) || \
-	  ((current->mm->context & CONTEXT_LOW_HPAGES) && \
-	   (addr > (TASK_HPAGE_BASE_32-len)) && (addr < TASK_HPAGE_END_32)) )
+	(is_hugepage_high_range((addr), (len)) || \
+	  ( (current->mm->context & CONTEXT_LOW_HPAGES) && \
+	      is_hugepage_low_range((addr), (len)) ) )
+
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
+#define ARCH_HANDLES_HUGEPAGE_FAULTS

 #define in_hugepage_area(context, addr) \
 	((cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE) && \
diff -urN ppc64-linux-2.5/include/linux/elf.h linux-gogogo/include/linux/elf.h
--- ppc64-linux-2.5/include/linux/elf.h	2003-10-07 11:38:42.000000000 +1000
+++ linux-gogogo/include/linux/elf.h	2003-11-18 16:46:12.000000000 +1100
@@ -271,6 +271,11 @@
 #define PF_W		0x2
 #define PF_X		0x1

+#define PF_MASKOS	0x0ff00000
+#define PF_MASKPROC	0xf0000000
+
+#define PF_LINUX_HTLB	0x00100000
+
 typedef struct elf32_phdr{
   Elf32_Word	p_type;
   Elf32_Off	p_offset;
diff -urN ppc64-linux-2.5/include/linux/hugetlb.h linux-gogogo/include/linux/hugetlb.h
--- ppc64-linux-2.5/include/linux/hugetlb.h	2003-09-27 22:48:37.000000000 +1000
+++ linux-gogogo/include/linux/hugetlb.h	2003-11-25 15:04:35.000000000 +1100
@@ -41,6 +41,22 @@
 #define is_hugepage_only_range(addr, len)	0
 #endif

+#ifndef ARCH_HAS_PREPARE_HUGEPAGE_RANGE
+#define is_potential_hugepage_range(addr, len)	\
+	(is_aligned_hugepage_range((addr), (len)))
+#define prepare_hugepage_range(addr, len)	(0)
+#else
+int is_potential_hugepage_range(unsigned long addr, unsigned long len);
+int prepare_hugepage_range(unsigned long addr, unsigned long len);
+#endif
+
+#ifndef ARCH_HANDLES_HUGEPAGE_FAULTS
+#define handle_hugetlb_mm_fault(mm, vma, a, w)	(VM_FAULT_SIGBUS)
+#else
+int handle_hugetlb_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma,
+			    unsigned long address, int write_access);
+#endif
+
 #else /* !CONFIG_HUGETLB_PAGE */

 static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
@@ -61,6 +77,8 @@
 #define mark_mm_hugetlb(mm, vma)		do { } while (0)
 #define follow_huge_pmd(mm, addr, pmd, write)	0
 #define is_aligned_hugepage_range(addr, len)	0
+#define is_potential_hugepage_range(addr, len)	0
+#define prepare_hugepage_range(addr, len)	(-EINVAL)
 #define pmd_huge(x)	0
 #define is_hugepage_only_range(addr, len)	0

diff -urN ppc64-linux-2.5/mm/memory.c linux-gogogo/mm/memory.c
--- ppc64-linux-2.5/mm/memory.c	2003-11-17 11:20:18.000000000 +1100
+++ linux-gogogo/mm/memory.c	2003-11-18 12:42:34.000000000 +1100
@@ -1603,7 +1603,8 @@
 	inc_page_state(pgfault);

 	if (is_vm_hugetlb_page(vma))
-		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */
+		/* mapping truncation can do this. */
+		return handle_hugetlb_mm_fault(mm, vma, address, write_access);

 	/*
 	 * We need the page table lock to synchronize with kswapd
diff -urN ppc64-linux-2.5/mm/mmap.c linux-gogogo/mm/mmap.c
--- ppc64-linux-2.5/mm/mmap.c	2003-10-23 08:29:46.000000000 +1000
+++ linux-gogogo/mm/mmap.c	2003-11-25 15:04:49.000000000 +1100
@@ -787,7 +787,9 @@
 			/*
 			 * Make sure that addr and length are properly aligned.
 			 */
-			ret = is_aligned_hugepage_range(addr, len);
+			ret = is_potential_hugepage_range(addr, len);
+			if (ret == 0)
+				ret = prepare_hugepage_range(addr, len);
 		} else {
 			/*
 			 * Ensure that a normal request is not falling in a

