[PATCH] powerpc/pagetable: Add option to dump kernel pagetable

Anshuman Khandual khandual at linux.vnet.ibm.com
Tue Feb 23 16:30:44 AEDT 2016


On 02/23/2016 03:57 AM, Rashmica wrote:
> Hi Anshuman,
> 
> Thanks for the feedback!
> 
> On 22/02/16 21:13, Anshuman Khandual wrote:
>> On 02/22/2016 11:32 AM, Rashmica Gupta wrote:
>>> Useful to be able to dump the kernel page tables to check permissions
>>> and
>>> memory types - derived from arm64's implementation.
>>>
>>> Add a debugfs file to check the page tables. To use this the PPC_PTDUMP
>>> config option must be selected.
>>>
>>> Tested on 64BE and 64LE with both 4K and 64K page sizes.
>>> ---
>> This statement above must be after the ---- line else it will be part of
>> the commit message or you wanted the test note as part of commit message
>> itself ?
>>
>> The patch seems to contain some white space problems. Please clean
>> them up.
> Will do!
>>>   arch/powerpc/Kconfig.debug |  12 ++
>>>   arch/powerpc/mm/Makefile   |   1 +
>>>   arch/powerpc/mm/dump.c     | 364
>>> +++++++++++++++++++++++++++++++++++++++++++++
>>>   3 files changed, 377 insertions(+)
>>>   create mode 100644 arch/powerpc/mm/dump.c
>>>
>>> diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
>>> index 638f9ce740f5..e4883880abe3 100644
>>> --- a/arch/powerpc/Kconfig.debug
>>> +++ b/arch/powerpc/Kconfig.debug
>>> @@ -344,4 +344,16 @@ config FAIL_IOMMU
>>>           If you are unsure, say N.
>>>   +config PPC_PTDUMP
>>> +        bool "Export kernel pagetable layout to userspace via debugfs"
>>> +        depends on DEBUG_KERNEL
>>> +        select DEBUG_FS
>>> +        help
>>> +          This options dumps the state of the kernel pagetables in a
>>> debugfs
>>> +          file. This is only useful for kernel developers who are
>>> working in
>>> +          architecture specific areas of the kernel - probably not a
>>> good idea to
>>> +          enable this feature in a production kernel.
>> Just clean the paragraph alignment here
>> ......................................^^^^^^^^
>>
>>> +
>>> +          If you are unsure, say N.
>>> +
>>>   endmenu
>>> diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
>>> index 1ffeda85c086..16f84bdd7597 100644
>>> --- a/arch/powerpc/mm/Makefile
>>> +++ b/arch/powerpc/mm/Makefile
>>> @@ -40,3 +40,4 @@ obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
>>>   obj-$(CONFIG_HIGHMEM)        += highmem.o
>>>   obj-$(CONFIG_PPC_COPRO_BASE)    += copro_fault.o
>>>   obj-$(CONFIG_SPAPR_TCE_IOMMU)    += mmu_context_iommu.o
>>> +obj-$(CONFIG_PPC_PTDUMP)    += dump.o
>> File name like "[kernel_]pgtable_dump.c" will sound better ? Or
>> just use like the X86 one "dump_pagetables.c". "dump.c" sounds
>> very generic.
> Yup, good point.
>>> diff --git a/arch/powerpc/mm/dump.c b/arch/powerpc/mm/dump.c
>>> new file mode 100644
>>> index 000000000000..937b10fc40cc
>>> --- /dev/null
>>> +++ b/arch/powerpc/mm/dump.c
>>> @@ -0,0 +1,364 @@
>>> +/*
>>> + * Copyright 2016, Rashmica Gupta, IBM Corp.
>>> + *
>>> + * Debug helper to dump the current kernel pagetables of the system
>>> + * so that we can see what the various memory ranges are set to.
>>> + *
>>> + * Derived from the arm64 implementation:
>>> + * Copyright (c) 2014, The Linux Foundation, Laura Abbott.
>>> + * (C) Copyright 2008 Intel Corporation, Arjan van de Ven.
>>> + *
>>> + * This program is free software; you can redistribute it and/or
>>> + * modify it under the terms of the GNU General Public License
>>> + * as published by the Free Software Foundation; version 2
>>> + * of the License.
>>> + */
>>> +#include <linux/debugfs.h>
>>> +#include <linux/fs.h>
>>> +#include <linux/io.h>
>>> +#include <linux/mm.h>
>>> +#include <linux/sched.h>
>>> +#include <linux/seq_file.h>
>>> +#include <asm/fixmap.h>
>>> +#include <asm/pgtable.h>
>>> +#include <linux/const.h>
>>> +#include <asm/page.h>
>>> +
>>> +#define PUD_TYPE_MASK           (_AT(u64, 3) << 0)
>>> +#define PUD_TYPE_SECT           (_AT(u64, 1) << 0)
>>> +#define PMD_TYPE_MASK           (_AT(u64, 3) << 0)
>>> +#define PMD_TYPE_SECT           (_AT(u64, 1) << 0)
>>> +
>>> +
>>> +#if CONFIG_PGTABLE_LEVELS == 2
>>> +#include <asm-generic/pgtable-nopmd.h>
>>> +#elif CONFIG_PGTABLE_LEVELS == 3
>>> +#include <asm-generic/pgtable-nopud.h>
>>> +#endif
>> Really ? Do we have any platform with only 2 level of page table ?
>>   
> I'm not sure - was trying to cover all the bases. If you're
> confident that we don't, I can remove it.

I am not sure though, may be Michael or Mikey can help here.

>>> +
>>> +#define pmd_sect(pmd)  ((pmd_val(pmd) & PMD_TYPE_MASK) ==
>>> PMD_TYPE_SECT)
>>> +#ifdef CONFIG_PPC_64K_PAGES
>>> +#define pud_sect(pud)           (0)
>>> +#else
>>> +#define pud_sect(pud)           ((pud_val(pud) & PUD_TYPE_MASK) == \
>>> +                                               PUD_TYPE_SECT)
>>> +#endif
>> Can you please explain the use of pmd_sect() and pud_sect() defines ?
>>
>>> +   
>>> +
>>> +struct addr_marker {
>>> +    unsigned long start_address;
>>> +    const char *name;
>>> +};
>> All the architectures are using the same structure addr_marker.
>> Cannot we just move it to a generic header file ? There are
>> other such common structures like these in the file which are
>> used across architectures and can be moved to some where common ?
> Could do that. Where do you think would be the appropriate place
> for such a header file?

We can start at include/linux/mmdebug.h header file.

>>> +
>>> +enum address_markers_idx {
>>> +    VMALLOC_START_NR = 0,
>>> +    VMALLOC_END_NR,
>>> +    ISA_IO_START_NR,
>>> +    ISA_IO_END_NR,
>>> +    PHB_IO_START_NR,
>>> +    PHB_IO_END_NR,
>>> +    IOREMAP_START_NR,
>>> +    IOREMP_END_NR,
>>> +};
>> Where these are used ? ^^^^^^^^^ I dont see any where.
> Whoops, yes those are not used anymore.

Lets remove them.

>>
>> Also as it dumps only the kernel virtual mapping, should not we
>> mention about it some where.
> See my question below...
>>> +
>>> +static struct addr_marker address_markers[] = {
>>> +    { VMALLOC_START,    "vmalloc() Area" },
>>> +    { VMALLOC_END,        "vmalloc() End" },
>>> +    { ISA_IO_BASE,        "isa I/O start" },
>>> +    { ISA_IO_END,        "isa I/O end" },
>>> +    { PHB_IO_BASE,        "phb I/O start" },
>>> +    { PHB_IO_END,        "phb I/O end" },
>>> +    { IOREMAP_BASE,        "I/O remap start" },
>>> +    { IOREMAP_END,        "I/O remap end" },
>>> +    { -1,            NULL },
>>> +};
>>
>> I understand that VMEMMAP range is not covered under the kernel virtual
>> mapping page table. But we can hook it up looking into the linked list
>> we track for the VA-PA mappings in there. Just a suggestion.
> I'm not sure that I understand, can you please explain why we
> would want to do that?

Because, we are dumping every thing that kernel virtual memory
range maps. Kernel virtual memory has two or may be three
components(0xe0000 ? not sure about that).

(1) 0xd000 range (which covers all vmalloc & all IO mappings ranges)
(2) 0xf000 range (vmemmap sparse memory)

Now IIUC 0xd000 is completely managed by the kernel page table. But the
0xf00 range is managed in a simpler form of a linked list. Because we
are dumping all virtual memory ranges here, IMHO we should also dump
all *valid* VA --> PA mappings there to make it complete. Look into
the file arch/powerpc/mm/init_64.c where the linked list is maintained.
It can be a small separate function after you are done walking the
kernel page table.

>>
>>> +
>>> +/*
>>> + * The page dumper groups page table entries of the same type into a
>>> single
>>> + * description. It uses pg_state to track the range information while
>>> + * iterating over the pte entries. When the continuity is broken it
>>> then
>>> + * dumps out a description of the range.
>>> + */
>> This needs to be more detailed. Some where at the starting point of the
>> file we need to write detailed description about what all information it
>> is dumping and how.
> Will do!
>>
>>> +struct pg_state {
>>> +    struct seq_file *seq;
>>> +    const struct addr_marker *marker;
>>> +    unsigned long start_address;
>>> +    unsigned level;
>>> +    u64 current_prot;
>>> +};
>>> +
>>> +struct prot_bits {
>>> +    u64        mask;
>>> +    u64        val;
>>> +    const char    *set;
>>> +    const char    *clear;
>>> +};
>> This structure encapsulates various PTE flags bit information.
>> Then why it is named as "prot_bits" ?
> Yup, a lazy oversight on my part.
>>> +
>>> +static const struct prot_bits pte_bits[] = {
>>> +    {
>>> +        .mask    = _PAGE_USER,
>>> +        .val    = _PAGE_USER,
>>> +        .set    = "user",
>>> +        .clear    = "    ",
>>> +    }, {
>>> +        .mask    = _PAGE_RW,
>>> +        .val    = _PAGE_RW,
>>> +        .set    = "rw",
>>> +        .clear    = "ro",
>>> +    }, {
>>> +        .mask    = _PAGE_EXEC,
>>> +        .val    = _PAGE_EXEC,
>>> +        .set    = " X ",
>>> +        .clear    = "   ",
>>> +    }, {
>>> +        .mask    = _PAGE_PTE,
>>> +        .val    = _PAGE_PTE,
>>> +        .set    = "pte",
>>> +        .clear    = "   ",
>>> +    }, {
>>> +        .mask    = _PAGE_PRESENT,
>>> +        .val    = _PAGE_PRESENT,
>>> +        .set    = "present",
>>> +        .clear    = "       ",
>>> +    }, {
>>> +        .mask    = _PAGE_HASHPTE,
>>> +        .val    = _PAGE_HASHPTE,
>>> +        .set    = "htpe",
>>> +        .clear    = "    ",
>>> +    }, {
>>> +        .mask    = _PAGE_GUARDED,
>>> +        .val    = _PAGE_GUARDED,
>>> +        .set    = "guarded",
>>> +        .clear    = "       ",
>>> +    }, {
>>> +        .mask    = _PAGE_DIRTY,
>>> +        .val    = _PAGE_DIRTY,
>>> +        .set    = "dirty",
>>> +        .clear    = "     ",
>>> +    }, {
>>> +        .mask    = _PAGE_ACCESSED,
>>> +        .val    = _PAGE_ACCESSED,
>>> +        .set    = "accessed",
>>> +        .clear    = "        ",
>>> +    }, {
>>> +        .mask    = _PAGE_WRITETHRU,
>>> +        .val    = _PAGE_WRITETHRU,
>>> +        .set    = "write through",
>>> +        .clear    = "             ",
>>> +    }, {
>>> +        .mask    = _PAGE_NO_CACHE,
>>> +        .val    = _PAGE_NO_CACHE,
>>> +        .set    = "no cache",
>>> +        .clear    = "        ",
>>> +    }, {
>>> +        .mask    = _PAGE_BUSY,
>>> +        .val    = _PAGE_BUSY,
>>> +        .set    = "busy",
>>> +    }, {
>>> +        .mask    = _PAGE_F_GIX,
>>> +        .val    = _PAGE_F_GIX,
>>> +        .set    = "gix",
>>> +    }, {
>>> +        .mask    = _PAGE_F_SECOND,
>>> +        .val    = _PAGE_F_SECOND,
>>> +        .set    = "second",
>>> +    }, {
>>> +        .mask    = _PAGE_SPECIAL,
>>> +        .val    = _PAGE_SPECIAL,
>>> +        .set    = "special",
>>> +    }
>>> +};
>>> +
>>> +struct pg_level {
>>> +    const struct prot_bits *bits;
>>> +    size_t num;
>>> +    u64 mask;
>>> +};
>>> +
>> It describes each level of the page table and what all kind of
>> PTE flags can be there at each of these levels and their combined
>> mask. The structure and it's elements must be named accordingly.
>> Its confusing right now.
>>
>>> +static struct pg_level pg_level[] = {
>>> +    {
>>> +    }, { /* pgd */
>>> +        .bits    = pte_bits,
>>> +        .num    = ARRAY_SIZE(pte_bits),
>>> +    }, { /* pud */
>>> +        .bits    = pte_bits,
>>> +        .num    = ARRAY_SIZE(pte_bits),
>>> +    }, { /* pmd */
>>> +        .bits    = pte_bits,
>>> +        .num    = ARRAY_SIZE(pte_bits),
>>> +    }, { /* pte */
>>> +        .bits    = pte_bits,
>>> +        .num    = ARRAY_SIZE(pte_bits),
>>> +    },
>>> +};
>>> +
>>> +static void dump_prot(struct pg_state *st, const struct prot_bits
>>> *bits,
>>> +            size_t num)
>>> +{
>>> +    unsigned i;
>>> +
>>> +    for (i = 0; i < num; i++, bits++) {
>>> +        const char *s;
>>> +
>>> +        if ((st->current_prot & bits->mask) == bits->val)
>>> +            s = bits->set;
>>> +        else
>>> +            s = bits->clear;
>>> +
>>> +        if (s)
>>> +            seq_printf(st->seq, " %s", s);
>>> +    }
>>> +}
>>> +
>>> +static void note_page(struct pg_state *st, unsigned long addr,
>>> unsigned level,
>>> +                u64 val)
>>> +{
>>> +    static const char units[] = "KMGTPE";
>>> +    u64 prot = val & pg_level[level].mask;
>>> +
>>> +    /* At first no level is set */
>>> +    if (!st->level) {
>>> +        st->level = level;
>>> +        st->current_prot = prot;
>>> +        st->start_address = addr;
>>> +        seq_printf(st->seq, "---[ %s ]---\n", st->marker->name);
>>> +    /* We are only interested in dumping when something (protection,
>>> +     *  level of PTE or the section of vmalloc) has changed */
>> Again, these are PTE flags not protection bits. Please change these
>> comments accordingly. Also this function does a lot of things, can
>> we split it up ?
> Do you think that it would make it easier to understand by
> splitting it up?

I think so.

>>> +    } else if (prot != st->current_prot || level != st->level ||
>>> +           addr >= st->marker[1].start_address) {
>>> +        const char *unit = units;
>>> +        unsigned long delta;
>>> +
>>> +        /* Check protection on PTE */
>>> +        if (st->current_prot) {
>>> +            seq_printf(st->seq, "0x%016lx-0x%016lx   ",
>>> +                   st->start_address, addr-1);
>>> +
>>> +            delta = (addr - st->start_address) >> 10;
>>> +            /* Work out what appropriate unit to use */
>>> +            while (!(delta & 1023) && unit[1]) {
>>> +                delta >>= 10;
>>> +                unit++;
>>> +            }
>>> +            seq_printf(st->seq, "%9lu%c", delta, *unit);
>>> +            /* Dump all the protection flags */
>>> +            if (pg_level[st->level].bits)
>>> +                dump_prot(st, pg_level[st->level].bits,
>>> +                      pg_level[st->level].num);
>>> +            seq_puts(st->seq, "\n");
>>> +        }
>>> +        /* Address indicates we have passed the end of the
>>> +         * current section of vmalloc */
>>> +        while (addr >= st->marker[1].start_address) {
>>> +            st->marker++;
>>> +            seq_printf(st->seq, "---[ %s ]---\n", st->marker->name);
>>> +        }
>> Right now with this patch, it does not print anything after [
>> vmalloc() End ].
>> But I would expect it to go over all other sections one by one. The above
>> while loop just goes over the list and prints the remaining address range
>> names. Why ?
>>
>> 0xd00007fffff00000-0xd00007fffff0ffff          64K      rw     pte
>> present htpe         dirty accessed
>> 0xd00007fffff10000-0xd00007fffff3ffff         192K      rw     pte
>> present              dirty accessed
>> 0xd00007fffff40000-0xd00007fffff4ffff          64K      rw     pte
>> present htpe         dirty accessed
>> 0xd00007fffff50000-0xd00007fffff7ffff         192K      rw     pte
>> present              dirty accessed
>> 0xd00007fffff80000-0xd00007fffff8ffff          64K      rw     pte
>> present htpe         dirty accessed
>> 0xd00007fffff90000-0xd00007fffffbffff         192K      rw     pte
>> present              dirty accessed
>> 0xd00007fffffc0000-0xd00007fffffcffff          64K      rw     pte
>> present htpe         dirty accessed
>> 0xd00007fffffd0000-0xd00007ffffffffff         192K      rw     pte
>> present              dirty accessed
>> ---[ vmalloc() End ]---
>> ---[ isa I/O start ]---
>> ---[ isa I/O end ]---
>> ---[ phb I/O start ]---
>> ---[ phb I/O end ]---
>> ---[ I/O remap start ]---
>> ---[ I/O remap end ]---
>>
> What setup are you using? My output looked something similar to this for
> all BE and LE
> with both 4K and 64K pages:
> 0xd00007fffff60000-0xd00007fffff7ffff         128K      rw     pte
> present              dirty
> accessed
> 0xd00007fffff80000-0xd00007fffff9ffff         128K      rw     pte
> present htpe         dirty
> accessed
> 0xd00007fffffa0000-0xd00007fffffbffff         128K      rw     pte
> present              dirty
> accessed
> 0xd00007fffffc0000-0xd00007fffffdffff         128K      rw     pte
> present htpe         dirty
> accessed
> 0xd00007fffffe0000-0xd00007ffffffffff         128K      rw     pte
> present              dirty
> accessed
> ---[ vmalloc() End ]---
> ---[ isa I/O start ]---
> ---[ isa I/O end ]---
> ---[ phb I/O start ]---
> 0xd000080000010000-0xd00008000001ffff          64K      rw     pte
> present htpe guarded
> dirty accessed               no cache
> ---[ phb I/O end ]---
> ---[ I/O remap start ]---
> 0xd000080080020000-0xd00008008002ffff          64K      rw     pte
> present htpe guarded
> dirty accessed               no cache
>  0xd000080080040000-0xd00008008004ffff          64K      rw     pte
> present htpe guarded
>  dirty accessed               no cache
> 0xd000080080060000-0xd00008008006ffff          64K      rw     pte
> present htpe guarded
> dirty accessed               no cache
> ---[ I/O remap end ]---


Yeah it looks better. May be my LPAR does not have these ranges used at all.
May be I am missing something about the while loop.



More information about the Linuxppc-dev mailing list