[PATCH v1 2/3] powerpc/mm/radix: Fix PTE/PMD fragment count for early page table mappings

Bharata B Rao bharata at linux.ibm.com
Tue Jun 23 23:42:02 AEST 2020


On Tue, Jun 23, 2020 at 04:07:34PM +0530, Aneesh Kumar K.V wrote:
> Bharata B Rao <bharata at linux.ibm.com> writes:
> 
> > We can hit the following BUG_ON during memory unplug:
> >
> > kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:342!
> > Oops: Exception in kernel mode, sig: 5 [#1]
> > LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> > NIP [c000000000093308] pmd_fragment_free+0x48/0xc0
> > LR [c00000000147bfec] remove_pagetable+0x578/0x60c
> > Call Trace:
> > 0xc000008050000000 (unreliable)
> > remove_pagetable+0x384/0x60c
> > radix__remove_section_mapping+0x18/0x2c
> > remove_section_mapping+0x1c/0x3c
> > arch_remove_memory+0x11c/0x180
> > try_remove_memory+0x120/0x1b0
> > __remove_memory+0x20/0x40
> > dlpar_remove_lmb+0xc0/0x114
> > dlpar_memory+0x8b0/0xb20
> > handle_dlpar_errorlog+0xc0/0x190
> > pseries_hp_work_fn+0x2c/0x60
> > process_one_work+0x30c/0x810
> > worker_thread+0x98/0x540
> > kthread+0x1c4/0x1d0
> > ret_from_kernel_thread+0x5c/0x74
> >
> > This occurs when unplug is attempted for memory that has
> > been mapped using memblock pages as part of early kernel page
> > table setup. We wouldn't have initialized the PMD or PTE fragment
> > count for those PMD or PTE pages.
> >
> > Fixing this includes 3 parts:
> >
> > - Re-walk the init_mm page tables from mem_init() and initialize
> >   the PMD and PTE fragment count to 1.
> > - When freeing PUD, PMD and PTE page table pages, check explicitly
> >   if they come from memblock and if so free them appropriately.
> > - When we do early memblock based allocation of PMD and PUD pages,
> >   allocate in PAGE_SIZE granularity so that we are sure the
> >   complete page is used as pagetable page.
> >
> > Since we now do PAGE_SIZE allocations for both PUD table and
> > PMD table (Note that PTE table allocation is already of PAGE_SIZE),
> > we end up allocating more memory for the same amount of system RAM.
> > Here is a comparison of how much more we need for a 64T and 2G
> > system after this patch:
> >
> > 1. 64T system
> > -------------
> > 64T RAM would need 64G for vmemmap with struct page size being 64B.
> >
> > 128 PUD tables for 64T memory (1G mappings)
> > 1 PUD table and 64 PMD tables for 64G vmemmap (2M mappings)
> >
> > With default PUD[PMD]_TABLE_SIZE(4K), (128+1+64)*4K=772K
> > With PAGE_SIZE(64K) table allocations, (128+1+64)*64K=12352K
> >
> > 2. 2G system
> > ------------
> > 2G RAM would need 2M for vmemmap with struct page size being 64B.
> >
> > 1 PUD table for 2G memory (1G mapping)
> > 1 PUD table and 1 PMD table for 2M vmemmap (2M mappings)
> >
> > With default PUD[PMD]_TABLE_SIZE(4K), (1+1+1)*4K=12K
> > With new PAGE_SIZE(64K) table allocations, (1+1+1)*64K=192K
> 
> How about we just do
> 
> void pmd_fragment_free(unsigned long *pmd)
> {
> 	struct page *page = virt_to_page(pmd);
> 
> 	/*
> 	 * Early pmd pages allocated via memblock
> 	 * allocator need to be freed differently
> 	 */
> 	if (PageReserved(page))
> 		return free_reserved_page(page);
> 
> 	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
> 	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
> 		pgtable_pmd_page_dtor(page);
> 		__free_page(page);
> 	}
> }
> 
> That way we could avoid the fixup_pgtable_fragments completely?

Yes we could, by doing the same for pte_fragment_free() too.
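
To make the control flow concrete, here is a minimal user-space model of
the check proposed above. It is only a sketch: struct model_page and
model_fragment_free() are hypothetical stand-ins, not kernel definitions,
and the same shape is assumed to apply to both the PMD and PTE variants.

```c
/*
 * User-space model only. struct model_page and its fields are
 * hypothetical stand-ins for struct page; nothing here is kernel code.
 */
#include <assert.h>
#include <stdbool.h>

struct model_page {
	bool reserved;        /* models PageReserved(): page came from memblock */
	int pt_frag_refcount; /* models page->pt_frag_refcount */
	bool freed;           /* set once the page has been given back */
};

/* Returns true if this call released the page. */
bool model_fragment_free(struct model_page *page)
{
	/* Early page table pages never had a fragment count set up. */
	if (page->reserved) {
		page->freed = true; /* models free_reserved_page() */
		return true;
	}

	assert(page->pt_frag_refcount > 0);  /* models the BUG_ON */
	if (--page->pt_frag_refcount == 0) { /* models atomic_dec_and_test() */
		page->freed = true;          /* models the dtor + __free_page() */
		return true;
	}
	return false;
}
```

In this model a reserved page is released on the first call regardless of
its count, while a regular page is released only when the last fragment
reference is dropped.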

However, right from the early versions, we were going in the direction of
making the handling and behaviour of both early page tables and later
page tables as similar to each other as possible. Hence we started with
"fixing up" the early page tables.

If that's not a significant consideration, we can do away with the fixup
and retain the other parts (PAGE_SIZE allocations and conditional
freeing) and still fix the bug.
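
For reference, the cost of retaining the PAGE_SIZE allocations can be
rechecked with a small user-space calculation that reproduces the figures
from the patch description quoted above. This is a sketch under stated
assumptions: 512 entries per PUD/PMD table (a 4K table of 8-byte entries),
64K pages and a 64-byte struct page. The helper names are made up for
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define KiB 1024ULL
#define MiB (1024 * KiB)
#define GiB (1024 * MiB)
#define TiB (1024 * GiB)

/* Assumption: a 4K table of 8-byte entries, i.e. 512 entries per table. */
#define ENTRIES_PER_TABLE 512ULL

/* Tables needed to map `bytes` when one entry maps `bytes_per_entry`. */
uint64_t tables_for(uint64_t bytes, uint64_t bytes_per_entry)
{
	uint64_t span = ENTRIES_PER_TABLE * bytes_per_entry;

	return (bytes + span - 1) / span; /* round up */
}

/* PUD and PMD tables counted in the quoted patch description. */
uint64_t early_table_count(uint64_t ram_bytes)
{
	/* One 64-byte struct page per 64K page of RAM. */
	uint64_t vmemmap = ram_bytes / (64 * KiB) * 64;

	return tables_for(ram_bytes, GiB)     /* PUD tables, linear map, 1G entries */
	     + tables_for(vmemmap, GiB)       /* PUD tables, vmemmap */
	     + tables_for(vmemmap, 2 * MiB);  /* PMD tables, vmemmap, 2M entries */
}

/* Memory used by those tables, in KiB, for a given per-table allocation. */
uint64_t table_memory_kib(uint64_t ram_bytes, uint64_t table_bytes)
{
	return early_table_count(ram_bytes) * table_bytes / KiB;
}
```

Calling table_memory_kib(64 * TiB, 4 * KiB) and
table_memory_kib(64 * TiB, 64 * KiB) gives the two 64T figures, and the
same calls with 2 * GiB give the 2G figures.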

Regards,
Bharata.

