[PATCH -V10 00/15] THP support for PPC64
Aneesh Kumar K.V
aneesh.kumar at linux.vnet.ibm.com
Thu Jun 6 01:28:24 EST 2013
Hi,
This is the second patchset needed to support THP on ppc64. Some of the changes
included in this series are tricky in that they subtly change the powerpc Linux
page table walk. We also overload a few of the pte flags for ptes at PMD level
(huge page PTEs).
The related mm/ changes are already merged to Andrew's -mm tree.
Some numbers:
Numbers from Anton's latency measurement code, found at
http://ozlabs.org/~anton/junkcode/latency2001.c
64K page size (With THP support)
--------------------------
[root at llmp24l02 test]# ./latency2001 8G
8589934592 428.49 cycles 120.50 ns
[root at llmp24l02 test]# ./latency2001 -l 8G
8589934592 471.16 cycles 132.50 ns
[root at llmp24l02 test]# echo never > /sys/kernel/mm/transparent_hugepage/enabled
[root at llmp24l02 test]# ./latency2001 8G
8589934592 766.52 cycles 215.56 ns
[root at llmp24l02 test]#
4K page size (No THP support for 4K)
----------------------------
[root at llmp24l02 test]# ./latency2001 8G
8589934592 814.88 cycles 229.16 ns
[root at llmp24l02 test]# ./latency2001 -l 8G
8589934592 463.69 cycles 130.40 ns
[root at llmp24l02 test]#
We are close to hugetlbfs in latency and we can achieve this with zero
config/page reservation. Most of the allocations above are fault allocated.
Another test that does 50000000 random accesses over a 1GB area goes from
2.65 seconds to 1.07 seconds with this patchset.
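For reference, a random access test of this kind could look roughly like the
sketch below. This is not the test that produced the numbers above; the
function names and the choice of an xorshift PRNG are assumptions made purely
for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One step of a xorshift64 PRNG; the state must be non-zero. */
static uint64_t xorshift64(uint64_t *state)
{
	uint64_t x = *state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	*state = x;
	return x;
}

/*
 * Read `count` bytes at pseudo-random offsets inside buf[0..size) and
 * return their sum, so the compiler cannot optimise the loads away.
 */
static uint64_t random_reads(const volatile unsigned char *buf, size_t size,
			     size_t count, uint64_t seed)
{
	uint64_t sum = 0;

	assert(size > 0 && seed != 0);
	for (size_t n = 0; n < count; n++)
		sum += buf[xorshift64(&seed) % size];
	return sum;
}
```

A caller would map a 1GB buffer, fault it in, and time a call such as
random_reads(buf, size, 50000000, seed) with THP enabled and disabled.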
split_huge_page impact:
---------------------
To look at the performance impact of large page invalidate, I tried the
experiment below. The test involved accessing a large contiguous region of
memory as follows:
	for (i = 0; i < size; i += PAGE_SIZE)
		data[i] = i;
We wanted to access the data in sequential order so that we look at the
worst-case THP performance. Accessing the data in sequential order implies
that the page table is cached and the TLB miss overhead is as small as
possible. We also don't touch the entire page, because that can result in
cache eviction.
After touching the full range as above, we call mprotect on each of those
pages. An mprotect will result in a hugepage split. This should allow us to
measure the impact of hugepage split.
	for (i = 0; i < size; i += PAGE_SIZE)
		mprotect(&data[i], PAGE_SIZE, PROT_READ);
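The two loops above can be put together roughly as follows. This is only a
sketch of such a test, not the actual split-huge-page-mpro source; the helper
names map_region, touch_pages and protect_pages are made up for illustration,
and the base page size is passed in by the caller (for example from
sysconf(_SC_PAGESIZE)).

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map an anonymous private read/write region; returns NULL on failure. */
static char *map_region(size_t size)
{
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : (char *)p;
}

/*
 * Write one byte in every base page of [data, data + size).
 * Returns the number of pages written.
 */
static size_t touch_pages(volatile char *data, size_t size, size_t page_size)
{
	size_t touched = 0;

	assert(page_size > 0);
	for (size_t i = 0; i < size; i += page_size) {
		data[i] = (char)i;
		touched++;
	}
	return touched;
}

/*
 * Make every base page read-only, one mprotect call per page.
 * Returns the number of successful calls, or -1 on the first failure.
 */
static long protect_pages(char *data, size_t size, size_t page_size)
{
	long done = 0;

	assert(page_size > 0);
	for (size_t i = 0; i < size; i += page_size) {
		if (mprotect(data + i, page_size, PROT_READ) != 0)
			return -1;
		done++;
	}
	return done;
}
```

A caller would time touch_pages() and protect_pages() separately, e.g. with
clock_gettime(CLOCK_MONOTONIC), once with THP enabled and once with it set
to never.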
Split hugepage impact:
---------------------
THP enabled: 2.851561705 seconds for test completion
THP disabled: 3.599146098 seconds for test completion
We are 20.7% better than the non-THP case even when all the large pages are split.
Detailed output:
THP enabled:
---------------------------------------
[root at llmp24l02 ~]# cat /proc/vmstat | grep thp
thp_fault_alloc 0
thp_fault_fallback 0
thp_collapse_alloc 0
thp_collapse_alloc_failed 0
thp_split 0
thp_zero_page_alloc 0
thp_zero_page_alloc_failed 0
[root at llmp24l02 ~]# /root/thp/tools/perf/perf stat -e page-faults,dTLB-load-misses ./split-huge-page-mpro 20G
time taken to touch all the data in ns: 2763096913
Performance counter stats for './split-huge-page-mpro 20G':
1,581 page-faults
3,159 dTLB-load-misses
2.851561705 seconds time elapsed
[root at llmp24l02 ~]#
[root at llmp24l02 ~]# cat /proc/vmstat | grep thp
thp_fault_alloc 1279
thp_fault_fallback 0
thp_collapse_alloc 0
thp_collapse_alloc_failed 0
thp_split 1279
thp_zero_page_alloc 0
thp_zero_page_alloc_failed 0
[root at llmp24l02 ~]#
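The before/after counter comparison above can also be done from inside a test
program. The sketch below is an assumption about how one might do it, not
part of this series; vmstat_lookup is a hypothetical helper that relies only
on the "name value" per-line layout shown in the /proc/vmstat output above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in /proc/vmstat-style text ("name value" per line).
 * Returns 0 and stores the value in *out on success, -1 if the name is
 * not present.
 */
static int vmstat_lookup(FILE *f, const char *name, unsigned long long *out)
{
	char key[128];
	unsigned long long val;

	assert(f && name && out);
	rewind(f);	/* always scan from the start of the file */
	while (fscanf(f, "%127s %llu", key, &val) == 2) {
		if (strcmp(key, name) == 0) {
			*out = val;
			return 0;
		}
	}
	return -1;
}
```

Reading thp_split via fopen("/proc/vmstat", "r") before and after the
mprotect loop and subtracting gives the number of splits caused by the run
(1279 in the output above).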
77.05% split-huge-page [kernel.kallsyms] [k] .clear_user_page
7.10% split-huge-page [kernel.kallsyms] [k] .perf_event_mmap_ctx
1.51% split-huge-page split-huge-page-mpro [.] 0x0000000000000a70
0.96% split-huge-page [unknown] [H] 0x000000000157e3bc
0.81% split-huge-page [kernel.kallsyms] [k] .up_write
0.76% split-huge-page [kernel.kallsyms] [k] .perf_event_mmap
0.76% split-huge-page [kernel.kallsyms] [k] .down_write
0.74% split-huge-page [kernel.kallsyms] [k] .lru_add_page_tail
0.61% split-huge-page [kernel.kallsyms] [k] .split_huge_page
0.59% split-huge-page [kernel.kallsyms] [k] .change_protection
0.51% split-huge-page [kernel.kallsyms] [k] .release_pages
0.96% split-huge-page [unknown] [H] 0x000000000157e3bc
|
|--79.44%-- reloc_start
| |
| |--86.54%-- .__pSeries_lpar_hugepage_invalidate
| | .pSeries_lpar_hugepage_invalidate
| | .hpte_need_hugepage_flush
| | .split_huge_page
| | .__split_huge_page_pmd
| | .vma_adjust
| | .vma_merge
| | .mprotect_fixup
| | .SyS_mprotect
THP disabled:
---------------
[root at llmp24l02 ~]# echo never > /sys/kernel/mm/transparent_hugepage/enabled
[root at llmp24l02 ~]# /root/thp/tools/perf/perf stat -e page-faults,dTLB-load-misses ./split-huge-page-mpro 20G
time taken to touch all the data in ns: 3513767220
Performance counter stats for './split-huge-page-mpro 20G':
3,27,726 page-faults
3,29,654 dTLB-load-misses
3.599146098 seconds time elapsed
[root at llmp24l02 ~]#
Changes from V9:
* Rebased to 3.10-rc4
* Added new patch "powerpc/mm: handle hugepage size correctly when invalidating hpte entries"
* Ran the compile regression on PowerNV platform.
Changes from V8:
* rebase to 3.10-rc2
* make subpage protection syscall work with transparent hugepage.
* Add proper barriers when reading pgtable content stashed in the second half of PMD.
Changes from V7:
* Address review feedback.
* mm/ patches also posted as a separate series to linux-mm list
* Fixes for races against split and page table walk.
* Updated comments regarding locking details
Changes from V6:
* split the patch series into two patchsets.
* Address review feedback.
Changes from V5:
* Address review comments
* Added new patch to not use hugepd for explicit hugepages. Explicit hugepages
  now use PTE format similar to transparent hugepages.
* We don't use page->_mapcount for tracking free PTE frags in a PTE page.
* rebased to a86d52667d8eda5de39393ce737794403bdce1eb
* Tested with libhugetlbfs test suite
Changes from V4:
* Fix bad page error in page_table_alloc
BUG: Bad page state in process stream pfn:f1a59
page:f0000000034dc378 count:1 mapcount:0 mapping: (null) index:0x0
[c000000f322c77d0] [c00000000015e198] .bad_page+0xe8/0x140
[c000000f322c7860] [c00000000015e3c4] .free_pages_prepare+0x1d4/0x1e0
[c000000f322c7910] [c000000000160450] .free_hot_cold_page+0x50/0x230
[c000000f322c79c0] [c00000000003ad18] .page_table_alloc+0x168/0x1c0
Changes from V3:
* PowerNV boot fixes
Changes from V2:
* Change patch "powerpc: Reduce PTE table memory wastage" to use a much simpler
  approach for PTE page sharing.
* Changes to handle huge pages in KVM code.
* Address other review comments
Changes from V1:
* Address review comments
* More patch split
* Add batch hpte invalidate for hugepages.
Changes from RFC V2:
* Address review comments
* More code cleanup and patch split
Changes from RFC V1:
* HugeTLB fs now works
* Compile issues fixed
* rebased to v3.8
* Patch series reordered so that ppc64 cleanups and MM THP changes are moved
  early in the series. This should help in picking those patches early.
Thanks,
-aneesh