[PATCH 1/1] PPC32 : Huge-page support for ppc440 - 2.6.19-rc4 - revised
Edi Shmueli
edi at linux.vnet.ibm.com
Wed Dec 13 03:04:07 EST 2006
David Gibson wrote:
> On Fri, Dec 01, 2006 at 12:00:19PM -0500, Edi Shmueli wrote:
>> From: Edi Shmueli <edi at linux.vnet.ibm.com>
>>
>> Following requests to test the patch under the latest kernel, here
>> it is again, tested for 2.6.19-rc4. This patch enables applications
>> to exploit the PPC440 TLB support for huge-page mapping, to minimize
>> TLB thrashing. Applications with a large memory footprint that
>> exploit this support experience minimal TLB misses and a boost in
>> performance. NAS benchmarks tested with this patch show
>> performance improvements of hundreds of percent.
>
> Ok, I'm still in the process of getting our Ebony set up again so I
> can test this. In the meantime some observations..
>
> First things first: your patch has been hopelessly whitespace
> mangled, probably by Notes. You'll need to resend with a patch-safe
> mailer.
>
> Have you attempted to run the testsuite from libhugetlbfs with this?
> It will require some tweaking, since it's not previously been needed
> on ppc32, but it has testcases for a whole pile of potential kernel
> hugepage bugs.
>
> Also have you looked at how this code compares to the large page
> support on PPC 40x. 40x doesn't support hugetlbfs, but it does have
> about half the necessary bits, since it stores a page size in the PTE
> for implementing large page mapping of the linear mapping. I'm not
> sure how applicable any of that stuff is to 440.
>
>> Signed-off-by: Edi Shmueli <edi at linux.vnet.ibm.com>
>> -----
>>
>> Benchmarks and Implementation comments
>> ======================================
>> Below are the NAS IS benchmark results, executing under Linux, with and
>> without this huge-page mapping support.
>> IS Benchmark          4KB pages     16MB pages
>> =======================================================
>> Class            =    A             A
>> Size             =    8388608       8388608
>> Iterations       =    10            10
>> Time in seconds  =    24.44         6.38
>> Mop/s total      =    3.43          13.15
>> Operation type   =    keys ranked   keys ranked
>> Verification     =    SUCCESSFUL    SUCCESSFUL
>
>> Implementation details:
>> =======================
>
>> This patch is ppc440 architecture-specific. It enables the use of
>> huge-pages by processes executing under the 2.6.16 kernel on the
>> ppc440 processors. Huge-pages are accessible to user processes
>> using either the hugetlbfs or shared memory. See
>> Documentation/vm/hugetlbpage.txt.
>
>> The ppc 32bit kernel uses 64bit PTEs (set by CONFIG_PTE_64BIT). I
>> exploit a "hole" of 4 unused bits in the PTE MS word (bits 24-27)
>> and code the page size information in those bits. I then modified
>> the TLB miss handler (head_44x.S) to stop using the constant
>> PPC44x_TLB_4K to set the page size in the TLB. Instead, when a TLB
>> miss happens, the miss handler reads the size information from the
>> PTE and sets that size in the TLB entry. This way, different TLB
>> entries get to map different page sizes (e.g., 4KB or 16MB). The
>> TLB replacement policy remains RR. This means that a huge-page entry
>> in the TLB may be overwritten if not used for a long time, but when
>> accessed it will be set again by the TLB miss handler, with the
>> correct size, as set in the PTE.
>
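A minimal sketch of the size lookup described above, in plain user-space C rather than the actual head_44x.S assembly. It assumes the 4-bit field in bits 24-27 of the MS word (IBM bit numbering) lands at bit 36 of the 64-bit PTE, which matches the SIZE=4K value 0x0000001000000000ULL in the patch, and that size code n means a page of 1KB * 4^n. The helper names and the upper bound on the code are hypothetical, for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the page-size field occupies bits 36..39 of the 64-bit PTE,
 * i.e. mask 0xF0 of the most-significant 32-bit word. */
#define PTE_SIZE_SHIFT 36
#define PTE_SIZE_MASK  (0xFULL << PTE_SIZE_SHIFT)

/* Hypothetical helper: extract the 4-bit size code from a PTE. */
unsigned int pte_size_code(uint64_t pte)
{
	return (unsigned int)((pte & PTE_SIZE_MASK) >> PTE_SIZE_SHIFT);
}

/* Hypothetical helper: page size in bytes, assuming code n = 1KB * 4^n. */
unsigned long pte_page_bytes(uint64_t pte)
{
	unsigned int code = pte_size_code(pte);

	/* Assumed upper bound (256MB) so the shift fits in 32 bits. */
	assert(code <= 9);
	return 1024UL << (2 * code);
}
```

With these assumptions, pte_page_bytes(0x0000001000000000ULL) gives 4096, and a PTE carrying size code 7 gives 16777216 (16MB).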
>> In arch/ppc/mm/hugetlbpage.c is where page table entries are set to
>> map huge pages:
>> By default, each process has two-level page tables. It has 2048
>> 32-bit PMD (or PGD) entries at the higher level, each mapping 2MB of
>> the process address space, and 512 64-bit PTE entries in each
>> lower-level page table. When a TLB miss happens and no PTE is found
>> by the miss handler to offer the translation, a check is made
>> (memory.c) on whether the faulting address belongs to a huge-page VM
>> region. If so, the code in set_huge_pte_at() will set the required
>> number of PMDs (e.g., 8 PMDs for huge-pages of size 16MB, or 1 PMD
>> for huge-pages of size 2MB or less) to point to the *same*
>> lower-level PTE page table. Within the lower-level page table, it
>> will set the required number of PTEs (e.g., all 512 PTEs for huge-pages
>> larger than 2MB, or 256 PTEs for huge-pages of size 1MB etc.) to
>> point to the *same* physical huge-page frame. All these PTEs will
>> be *identical* and have the page-size coded in their MS word as
>> described above.
>
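The PMD and PTE counts above follow from the 2MB-per-PMD and 4KB-per-PTE figures. A small user-space sketch of that arithmetic, not the patch's set_huge_pte_at() code; the helper names and the PMD_SHIFT value of 21 are assumptions derived from the description:

```c
#include <assert.h>

#define PAGE_SHIFT      12  /* 4KB base pages, as in asm-ppc/page.h */
#define PMD_SHIFT       21  /* assumed: each PMD maps 2MB (512 PTEs x 4KB) */
#define PTES_PER_TABLE  (1UL << (PMD_SHIFT - PAGE_SHIFT))

/* Hypothetical helper: number of PMDs that point at the shared PTE table. */
unsigned long huge_pmd_count(unsigned int hpage_shift)
{
	assert(hpage_shift >= PAGE_SHIFT);
	return hpage_shift > PMD_SHIFT ? 1UL << (hpage_shift - PMD_SHIFT) : 1UL;
}

/* Hypothetical helper: number of identical PTEs written into that table. */
unsigned long huge_pte_count(unsigned int hpage_shift)
{
	assert(hpage_shift >= PAGE_SHIFT);
	return hpage_shift >= PMD_SHIFT ? PTES_PER_TABLE
					: 1UL << (hpage_shift - PAGE_SHIFT);
}
```

For HPAGE_SHIFT 24 (16MB) this gives 8 PMDs and 512 PTEs; for a shift of 20 (1MB) it gives 1 PMD and 256 PTEs, matching the numbers quoted above.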
> Creating an all-equal PTE page even when using pages of a size greater
> than or equal to that mapped by a single PMD seems very wasteful.
>
>> Once the TLB miss handler copies the mapping (and the size) from the
>> PTE into one of the TLB entries, the process will not suffer any TLB
>> misses for that huge-page. If the mapping is overwritten by the
>> TLB RR replacement policy, it will be re-loaded (probably into a
>> different TLB entry) when the process re-accesses that huge-page.
>
>> diff -uprP -X linux-2.6.19-rc4-vanilla/Documentation/dontdiff linux-2.6.19-rc4-vanilla/arch/ppc/kernel/head_44x.S my_linux/arch/ppc/kernel/head_44x.S
>> --- linux-2.6.19-rc4-vanilla/arch/ppc/kernel/head_44x.S 2006-11-14 11:16:29.000000000 -0500
>> +++ my_linux/arch/ppc/kernel/head_44x.S 2006-11-14 17:26:13.000000000 -0500
>> @@ -21,12 +21,14 @@
>> * debbie_chu at mvista.com
>> * Copyright 2002-2005 MontaVista Software, Inc.
>> * PowerPC 44x support, Matt Porter <mporter at kernel.crashing.org>
>> + * Copyright (C) 2006 Edi Shmueli, IBM Corporation.
>> + * PowerPC 44x handling of huge-page misses.
>> *
>> * This program is free software; you can redistribute it and/or modify it
>> * under the terms of the GNU General Public License as published by the
>> * Free Software Foundation; either version 2 of the License, or (at your
>> * option) any later version.
>> - */
>> +*/
>
> Please try to avoid extraneous changes in your patch.
>
> [snip]
>
>> diff -uprP -X linux-2.6.19-rc4-vanilla/Documentation/dontdiff linux-2.6.19-rc4-vanilla/arch/ppc/mm/hugetlbpage.c my_linux/arch/ppc/mm/hugetlbpage.c
>> --- linux-2.6.19-rc4-vanilla/arch/ppc/mm/hugetlbpage.c 1969-12-31 19:00:00.000000000 -0500
>> +++ my_linux/arch/ppc/mm/hugetlbpage.c 2006-11-15 11:44:43.297682864 -0500
>
> Since this patch is 440 specific, and 40x at least could also support
> hugepages, this should probably go in hugetlbpage_44x.c
>
>> @@ -0,0 +1,185 @@
>> +/*
>> + * PPC32 (440) Huge TLB Page Support for Kernel.
>> + *
>> + * Copyright (C) 2006 Edi Shmueli, IBM Corporation.
>> + *
>> + * Based on the IA-32 version:
>> + * Copyright (C) 2002, Rohit Seth <rohit.seth at intel.com>
>> + *
>> + */
>> +
>> +#include <linux/init.h>
>> +#include <linux/fs.h>
>> +#include <linux/mm.h>
>> +#include <linux/hugetlb.h>
>> +#include <linux/pagemap.h>
>> +#include <linux/smp_lock.h>
>> +#include <linux/slab.h>
>> +#include <linux/err.h>
>> +#include <linux/sysctl.h>
>> +#include <asm/mman.h>
>> +#include <asm/tlb.h>
>> +#include <asm/tlbflush.h>
>> +
>> +pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
>> +{
>> + pgd_t *pgd;
>
> I'm not sure if the bogus indentation here is yours, or the result of
> Notes whitespace mangling. If the former, please make sure your code
> is formatted as per CodingStyle.
>
> [snip]
>> + return pte;
>> +}
>> +
>> +#ifdef ARCH_HAS_SETCLEAR_HUGE_PTE
>
> This #ifdef makes no sense here. You already know that this arch will
> need this code.
>
>> +void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte){
>
> Open brace on the next line, as per CodingStyle, please.
>
> [snip]
>> +int is_aligned_hugepage_range(unsigned long addr, unsigned long len)
>> +{
>> + if (len & ~HPAGE_MASK)
>> + return -EINVAL;
>> + if (addr & ~HPAGE_MASK)
>> + return -EINVAL;
>> + return 0;
>> +}
>
> The is_aligned_hugepage_range() callback was removed quite some time
> ago, please remove.
>
> [snip]
>> diff -uprP -X linux-2.6.19-rc4-vanilla/Documentation/dontdiff linux-2.6.19-rc4-vanilla/fs/Kconfig my_linux/fs/Kconfig
>> --- linux-2.6.19-rc4-vanilla/fs/Kconfig 2006-11-14 11:17:37.000000000 -0500
>> +++ my_linux/fs/Kconfig 2006-11-14 17:26:13.000000000 -0500
>> @@ -1008,7 +1008,7 @@ config TMPFS_POSIX_ACL
>>
>> config HUGETLBFS
>> bool "HugeTLB file system support"
>> - depends X86 || IA64 || PPC64 || SPARC64 || SUPERH || BROKEN
>> + depends X86 || IA64 || PPC || PPC64 || SPARC64 || SUPERH || BROKEN
>> help
>> hugetlbfs is a filesystem backing for HugeTLB pages, based on
>> ramfs. For architectures that support it, say Y here and read
>
> This needs a test for 44x as well, or this option will be available
> and horribly broken for other ppc32 machines.
>
>> diff -uprP -X linux-2.6.19-rc4-vanilla/Documentation/dontdiff linux-2.6.19-rc4-vanilla/include/asm-ppc/page.h my_linux/include/asm-ppc/page.h
>> --- linux-2.6.19-rc4-vanilla/include/asm-ppc/page.h 2006-11-14 11:18:00.000000000 -0500
>> +++ my_linux/include/asm-ppc/page.h 2006-11-14 17:26:14.000000000 -0500
>> @@ -7,6 +7,12 @@
>> #define PAGE_SHIFT 12
>> #define PAGE_SIZE (ASM_CONST(1) << PAGE_SHIFT)
>>
>> +#ifdef CONFIG_HUGETLB_PAGE
>> +#define HPAGE_SHIFT 24
>> +#define HPAGE_SIZE ((1UL) << HPAGE_SHIFT)
>> +#define HPAGE_MASK (~(HPAGE_SIZE - 1))
>> +#define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
>> +#endif
>> /*
>> * Subtle: this is an int (not an unsigned long) and so it
>> * gets extended to 64 bits the way want (i.e. with 1s). -- paulus
>> diff -uprP -X linux-2.6.19-rc4-vanilla/Documentation/dontdiff linux-2.6.19-rc4-vanilla/include/asm-ppc/pgtable.h my_linux/include/asm-ppc/pgtable.h
>> --- linux-2.6.19-rc4-vanilla/include/asm-ppc/pgtable.h 2006-11-14 11:18:00.000000000 -0500
>> +++ my_linux/include/asm-ppc/pgtable.h 2006-11-15 11:40:45.332514013 -0500
>> @@ -263,6 +263,40 @@ extern unsigned long ioremap_bot, iorema
>> #define _PAGE_NO_CACHE 0x00000400 /* H: I bit */
>> #define _PAGE_WRITETHRU 0x00000800 /* H: W bit */
>>
>> +#if HPAGE_SHIFT == 10 /*Unsupported*/
>> +#define _PAGE_HUGE 0x0000000000000000ULL /* H: SIZE=1K bytes */
>> +#define _PTE_MASK 0xfffffff8UL
>> +#define _PTE_CNT ((1UL) << (PTE_SHIFT - 9))
>> +#elif HPAGE_SHIFT == 12
>> +#define _PAGE_HUGE 0x0000001000000000ULL /* H: SIZE=4K bytes */
>> +#define _PTE_MASK 0xfffffff8UL
>> +#define _PTE_CNT ((1UL) << (PTE_SHIFT - 9))
>
> Yikes! Please use computed values here, not this ghastly string of
> #ifdefs.
>
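One possible shape for computed values, sketched in user-space C. It assumes the size code is (HPAGE_SHIFT - 10) / 2 and sits at bit 36 of the 64-bit PTE, which reproduces the two _PAGE_HUGE constants visible above (0 for 1K, 0x0000001000000000ULL for 4K). The macro and function names are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: size code n selects a page of 1KB * 4^n, and the code is
 * stored at bit 36 of the 64-bit PTE. */
#define HUGE_SIZE_CODE(shift)  (((shift) - 10) / 2)
#define PAGE_HUGE_BITS(shift)  ((uint64_t)HUGE_SIZE_CODE(shift) << 36)

/* Hypothetical helper: PTE size bits for a given huge-page shift. */
uint64_t page_huge_bits(unsigned int hpage_shift)
{
	/* Only shifts of the form 10 + 2n map onto a size code. */
	assert(hpage_shift >= 10 && (hpage_shift - 10) % 2 == 0);
	return PAGE_HUGE_BITS(hpage_shift);
}
```

Under these assumptions, HPAGE_SHIFT == 24 yields 0x0000007000000000ULL, and no per-size #if branch is needed for that part of the definition.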
> --
> David Gibson | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
> | _way_ _around_!
> http://www.ozlabs.org/~dgibson
Thanks David,
One step at a time, let's start with libhugetlbfs :-)
I'm able to run most of my tests using the library, backing my data, text and BSS with huge-pages.
There is a major improvement in performance, similar to what I reported above.
Good job with the library!
There is a problem, though, when a program calls "fopen".
I see hugetlbfs does the unmapping/mapping, moves control to main(),
and then the program crashes with the following error:
"*** glibc detected *** free(): invalid pointer: 0x3002a008 ***"
This happens inside fopen(), which never returns.
Here is the detailed output.
/bgd-public/edi/IS # ./is.A.linux_ser_hugetlbfs
libhugetlbfs: Hugepage segment 0 (phdr 2): 0x10000000-0x10001b70 (filesz=0x1b70) (prot = 0x5)
libhugetlbfs: Hugepage segment 1 (phdr 3): 0x11000000-0x170006e8 (filesz=0x274) (prot = 0x7)
libhugetlbfs: HUGETLB_SHARE=0, sharing disabled
libhugetlbfs: Got unshared fd as expected -- Preparing
libhugetlbfs: Mapped hugeseg at 0x31000000. Copying 0x1b70 bytes from 0x10000000...
done
libhugetlbfs: Minimal copy was not performed
libhugetlbfs: Prepare succeeded
libhugetlbfs: HUGETLB_SHARE=0, sharing disabled
libhugetlbfs: Got unshared fd as expected -- Preparing
libhugetlbfs: Mapped hugeseg at 0x31000000. Copying 0x274 bytes from 0x11000000...
done
libhugetlbfs: Minimal copy was not performed
libhugetlbfs: Prepare succeeded
*** glibc detected *** free(): invalid pointer: 0x3002a008 ***
Aborted