[PATCH 2/2] v1 powerpc/powernv: Enable removal of memory for in memory tracing

Tue May 9 17:06:56 AEST 2017

Sorry for the late reply, I somehow missed this.

On 03/05/17 21:56, Anshuman Khandual wrote:
> On 05/03/2017 09:22 AM, Rashmica Gupta wrote:
>> On 28/04/17 19:52, Anshuman Khandual wrote:
>>> On 04/28/2017 11:12 AM, Rashmica Gupta wrote:
>>>> Some powerpc hardware features may want to gain access to a chunk of
>>> What kind of features ? Please add specifics.
>>>
>>>> undisturbed real memory.  This update provides a means to unplug said
>>>> memory
>>> Undisturbed ? Meaning part of memblock and currently inside the buddy
>>> allocator which we are trying to hot unplug out ?
>>>
>>>> from the kernel with a set of debugfs calls.  By writing an integer
>>>> containing
>>>>    the size of memory to be unplugged into
>>> Does the size has some constraints like aligned with memblock section
>>> size ? LMB size ? page block size ? etc. Please add the details.
>> Will do.
>>
>>>> /sys/kernel/debug/powerpc/memtrace/enable, the code will remove that
>>>> much
>>>> memory from the end of each available chip's memory space (ie each
>>>> memory node).
>>> <size> amount (I guess bytes in this case) of memory will be removed
>>> from the end of the NUMA node ? Whats the guarantee that they would be
>>> free at that time and not being pinned by some process ? If its not
>>> guaranteed to be freed, then interface description should state that
>>> clearly.
>> We start looking from the end of the NUMA node but of course there is no
>> guarantee that we will always be able to find some memory there that we
>> are able to remove.
>
> Okay. Do we have interface for giving this memory back to the buddy
> allocator again when we are done with HW tracing ? If not we need to
> add one.

Not at the moment. Last time I spoke to Anton he said something along the
lines of it not being too important: if you are getting the hardware traces
for debugging purposes, you are probably not worried about a bit of memory
being out of action.

However I can't see why having an interface to online the memory would be a
bad thing, so I'll look into it.

>>>> In addition, the means to read out the contents of the unplugged
>>>> memory is also
>>>> provided by reading out the
>>>> /sys/kernel/debug/powerpc/memtrace/<chip-id>/trace
>>>> file.
>>> All of the debugfs file interfaces added here should be documented some
>>> where in detail.
>>>
>>>> Signed-off-by: Anton Blanchard <anton at samba.org>
>>>> Signed-off-by: Rashmica Gupta <rashmica.g at gmail.com>
>>>>
>>>> ---
>>>> This requires the 'Wire up hpte_removebolted for powernv' patch.
>>>>
>>>> RFC -> v1: Added in two missing locks. Replaced the open-coded
>>>> flush_memory_region() with the existing
>>>> flush_inval_dcache_range(start, end).
>>>>
>>>> memtrace_offline_pages() is open-coded because offline_pages is
>>>> designed to be
>>>> called through the sysfs interface - not directly.
>>>>
>>>> We could move the offlining of pages to userspace, which removes some
>>>> of this
>>>> open-coding. This would then require passing info to the kernel such
>>>> that it
>>>> can then remove the memory that has been offlined. This could be done
>>>> using
>>>> notifiers, but this isn't simple due to locking (remove_memory needs
>>>> mem_hotplug_begin() which the sysfs interface already has). This
>>>> could also be
>>>> done through the debugfs interface (similar to what is done here).
>>>> Either way,
>>>> this would require the process that needs the memory to have
>>>> open-coded code
>>>> which it shouldn't really be involved with.
>>>>
>>>> As the current remove_memory() function requires the memory to
>>>> already be
>>>> offlined, it makes sense to keep the offlining and removal of memory
>>>> functionality grouped together so that a process can simply make one
>>>> request to
>>>> unplug some memory. Ideally there would be a kernel function we could
>>>> call that
>>>> would offline the memory and then remove it.
>>>>
>>>>
>>>>    arch/powerpc/platforms/powernv/memtrace.c | 276
>>>> ++++++++++++++++++++++++++++++
>>>>    1 file changed, 276 insertions(+)
>>>>    create mode 100644 arch/powerpc/platforms/powernv/memtrace.c
>>>>
>>>> diff --git a/arch/powerpc/platforms/powernv/memtrace.c
>>>> b/arch/powerpc/platforms/powernv/memtrace.c
>>>> new file mode 100644
>>>> index 0000000..86184b1
>>>> --- /dev/null
>>>> +++ b/arch/powerpc/platforms/powernv/memtrace.c
>>>> @@ -0,0 +1,276 @@
>>>> +/*
>>>> + * This program is free software; you can redistribute it and/or modify
>>>> + * it under the terms of the GNU General Public License as published by
>>>> + * the Free Software Foundation; either version 2 of the License, or
>>>> + * (at your option) any later version.
>>>> + *
>>>> + * This program is distributed in the hope that it will be useful,
>>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>>> + * GNU General Public License for more details.
>>>> + *
>>>> + * Copyright (C) IBM Corporation, 2014
>>>> + *
>>>> + * Author: Anton Blanchard <anton at au.ibm.com>
>>>> + */
>>>> +
>>>> +#define pr_fmt(fmt) "powernv-memtrace: " fmt
>>>> +
>>>> +#include <linux/bitops.h>
>>>> +#include <linux/string.h>
>>>> +#include <linux/memblock.h>
>>>> +#include <linux/init.h>
>>>> +#include <linux/moduleparam.h>
>>>> +#include <linux/fs.h>
>>>> +#include <linux/debugfs.h>
>>>> +#include <linux/slab.h>
>>>> +#include <linux/memory.h>
>>>> +#include <linux/memory_hotplug.h>
>>>> +#include <asm/machdep.h>
>>>> +#include <asm/debugfs.h>
>>>> +#include <asm/cacheflush.h>
>>>> +
>>>> +struct memtrace_entry {
>>>> +    void *mem;
>>>> +    u64 start;
>>>> +    u64 size;
>>>> +    u32 nid;
>>>> +    struct dentry *dir;
>>>> +    char name[16];
>>>> +};
>>> Little bit of description about the structure here will help.
>> Something like 'this enables us to keep track of the memory removed from
>> each node'?
> Right, something like that.
>
>>>> +
>>>> +static struct memtrace_entry *memtrace_array;
>>>> +static unsigned int memtrace_array_nr;
>>>> +
>>>> +static ssize_t memtrace_read(struct file *filp, char __user *ubuf,
>>>> +                 size_t count, loff_t *ppos)
>>>> +{
>>>> +    struct memtrace_entry *ent = filp->private_data;
>>>> +
>>>> +    return simple_read_from_buffer(ubuf, count, ppos, ent->mem,
>>>> ent->size);
>>>> +}
>>>> +
>>>> +static bool valid_memtrace_range(struct memtrace_entry *dev,
>>>> +                 unsigned long start, unsigned long size)
>>>> +{
>>>> +    if ((dev->start <= start) &&
>>> Switch the position of start and dev->start above. Will make
>>> it easy while reading.
>>>
>>>> +        ((start + size) <= (dev->start + dev->size)))
>>>> +        return true;
>>>> +
>>>> +    return false;
>>>> +}
>>>> +
>>>> +static int memtrace_mmap(struct file *filp, struct vm_area_struct *vma)
>>>> +{
>>>> +    unsigned long size = vma->vm_end - vma->vm_start;
>>>> +    struct memtrace_entry *dev = filp->private_data;
>>>> +
>>>> +    if (!valid_memtrace_range(dev, vma->vm_pgoff << PAGE_SHIFT, size))
>>>> +        return -EINVAL;
>>>> +
>>>> +    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
>>> Why we do this ? Its coming from real RAM not IO memory. Then the page
>>> protection still needs changes ?
>> Once the memory is removed from the kernel mappings we want to mark it as
>> uncachable.
> Got it but why ? Uncachable marking are for pages which will be mapped
> to IO ranges which should not be cached just to prevent the possibility
> of stale data.
>>>> +
>>>> +    if (io_remap_pfn_range(vma, vma->vm_start,
>>>> +                   vma->vm_pgoff + (dev->start >> PAGE_SHIFT),
>>>> +                   size, vma->vm_page_prot))
>>> You can just call remap_pfn_rang() instead though they are all the same.
>>> There is nothing I/O here should be explicit.
>> Good point.
>>
>>>> +        return -EAGAIN;
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static const struct file_operations memtrace_fops = {
>>>> +    .llseek = default_llseek,
>>>> +    .read    = memtrace_read,
>>>> +    .mmap    = memtrace_mmap,
>>>> +    .open    = simple_open,
>>>> +};
>>>> +
>>>> +static int check_memblock_online(struct memory_block *mem, void *arg)
>>>> +{
>>>> +    if (mem->state != MEM_ONLINE)
>>>> +        return -1;
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int change_memblock_state(struct memory_block *mem, void *arg)
>>>> +{
>>>> +    unsigned long state = (unsigned long)arg;
>>>> +
>>>> +    mem->state = state;
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64
>>>> nr_pages)
>>>> +{
>>>> +    u64 end_pfn = start_pfn + nr_pages - 1;
>>>> +
>>>> +    if (walk_memory_range(start_pfn, end_pfn, NULL,
>>>> +        check_memblock_online))
>>>> +        return false;
>>>> +
>>>> +    walk_memory_range(start_pfn, end_pfn, (void *)MEM_GOING_OFFLINE,
>>>> +              change_memblock_state);
>>>> +
>>> walk_memory_range() might be expensive, cant we just change the state
>>> to MEM_GOING_OFFLINE while checking the state for MEM_ONLINE during
>>> the first loop and bail out if any of the memblock is not in MEM_ONLINE
>>> in the first place.
>> Good idea.
>>
>>>> +    mem_hotplug_begin();
>>>> +    if (offline_pages(start_pfn, nr_pages)) {
>>>> +        walk_memory_range(start_pfn, end_pfn, (void *)MEM_ONLINE,
>>>> +                  change_memblock_state);
>>>> +        mem_hotplug_done();
>>> Right, this can remain as is. If we fail to offline pages, mark the
>>> memory blocks as MEM_ONLINE again.
>>>
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    walk_memory_range(start_pfn, end_pfn, (void *)MEM_OFFLINE,
>>>> +              change_memblock_state);
>>>> +    mem_hotplug_done();
>>> Right.
>>>
>>>> +
>>>> +    /* Clear the dcache to remove any references to the memory */
>>>> +    flush_inval_dcache_range((u64)__va(start_pfn << PAGE_SHIFT),
>>>> +                   (u64)__va(end_pfn << PAGE_SHIFT));
>>> I am wondering why this is required now when we dont do anything for
>>> cache flushing calls from core VM. If its really required now then
>>> it also should be required during memory hot unplug operations in
>>> general as well.
>> I could not see if this was being done when removing memory so figured
>> that it was better to put it in than not do it.
> Looking at the definitions I had pointed out before which gets
> called from core VM, powerpc does not need to do anything specific
> for cache invalidation or flushing. But I am not really sure on
> this. So let it be.
>
>>> /*
>>>    * No cache flushing is required when address mappings are changed,
>>>    * because the caches on PowerPCs are physically addressed.
>>>    */
>>> #define flush_cache_all()            do { } while (0)
>>> #define flush_cache_mm(mm)            do { } while (0)
>>> #define flush_cache_dup_mm(mm)            do { } while (0)
>>> #define flush_cache_range(vma, start, end)    do { } while (0)
>>> #define flush_cache_page(vma, vmaddr, pfn)    do { } while (0)
>>> #define flush_icache_page(vma, page)        do { } while (0)
>>> #define flush_cache_vmap(start, end)        do { } while (0)
>>> #define flush_cache_vunmap(start, end)        do { } while (0)
>>>
>>>
>>>> +
>>>> +    /* Now remove memory from the mappings */
>>>> +    lock_device_hotplug();
>>>> +    remove_memory(nid, start_pfn << PAGE_SHIFT, nr_pages <<
>>>> PAGE_SHIFT);
>>>> +    unlock_device_hotplug();
>>> Right. Now we have successfully taken down the memory.
>>>
>>>> +
>>>> +    return true;
>>>> +}
>>>> +
>>>> +static u64 memtrace_alloc_node(u32 nid, u64 size)
>>>> +{
>>>> +    u64 start_pfn, end_pfn, nr_pages;
>>>> +    u64 base_pfn;
>>>> +
>>>> +    if (!NODE_DATA(nid) || !node_spanned_pages(nid))
>>>> +        return 0;
>>> Why NODE_DATA check is required here ? Each node should have one
>>> allocated and initialized by now, else we have bigger problems.
>>> Is there any specific reason to check for spanned pages instead
>>> of present/managed pages.
>> Anton wrote this check, so will need to confirm with him. I assume
>> we check node_spanned_pages() rather than node_present_pages()
>> because in arch/powerpc/mm/numa.c we set node_spanned_pages() and
>> not node_present_pages()?
> I guess any thing is okay but NODE_DATA() seems redundant though.

Agreed.

>
> struct pglist_data {
>          ..........................
> 	unsigned long node_present_pages; /* total number of physical pages */
> 	unsigned long node_spanned_pages; /* total size of physical page
> 					     range, including holes */
>
> }
>
>>>> +
>>>> +    start_pfn = node_start_pfn(nid);
>>>> +    end_pfn = node_end_pfn(nid);
>>>> +    nr_pages = size >> PAGE_SHIFT;
>>>> +
>>>> +    /* Trace memory needs to be aligned to the size */
>>>> +    end_pfn = round_down(end_pfn - nr_pages, nr_pages);
>>>> +
>>>> +    for (base_pfn = end_pfn; base_pfn > start_pfn; base_pfn -=
>>>> nr_pages) {
>>>> +        if (memtrace_offline_pages(nid, base_pfn, nr_pages) == true)
>>>> +            return base_pfn << PAGE_SHIFT;
>>>> +    }
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int memtrace_init_regions_runtime(u64 size)
>>>> +{
>>>> +    u64 m;
>>>> +    u32 nid;
>>>> +
>>>> +    memtrace_array = kzalloc(sizeof(struct memtrace_entry) *
>>>> +                num_online_nodes(), GFP_KERNEL);
>>>> +    if (!memtrace_array) {
>>>> +        pr_err("Failed to allocate memtrace_array\n");
>>>> +        return -EINVAL;
>>>> +    }
>>>> +
>>>> +    for_each_online_node(nid) {
>>>> +        m = memtrace_alloc_node(nid, size);
>>>> +        /*
>>>> +         * A node might not have any local memory, so warn but
>>>> +         * continue on.
>>>> +         */
>>>> +        if (!m) {
>>>> +            pr_err("Failed to allocate trace memory on node %d\n",
>>>> +                 nid);
>>>> +        } else {
>>>> +            pr_info("Allocated trace memory on node %d at 0x%016llx\n",
>>>> +                 nid, m);
>>>> +
>>>> +            memtrace_array[memtrace_array_nr].start = m;
>>>> +            memtrace_array[memtrace_array_nr].size = size;
>>>> +            memtrace_array[memtrace_array_nr].nid = nid;
>>>> +            memtrace_array_nr++;
>>>> +        }
>>>> +    }
>>>> +    return 0;
>>>> +}
>>> All the pr_info() and pr_err() prints should have a "memtrace :"
>>> before the
>>> actual string to make it clear that its coming from this feature.
>>>
>> Good point!
>>>> +
>>>> +static struct dentry *memtrace_debugfs_dir;
>>>> +
>>>> +static int memtrace_init_debugfs(void)
>>>> +{
>>>> +    int ret = 0;
>>>> +    int i;
>>>> +
>>>> +    for (i = 0; i < memtrace_array_nr; i++) {
>>>> +        struct dentry *dir;
>>>> +        struct memtrace_entry *ent = &memtrace_array[i];
>>>> +
>>>> +        ent->mem = ioremap(ent->start, ent->size);
>>>> +        /* Warn but continue on */
>>>> +        if (!ent->mem) {
>>>> +            pr_err("Failed to map trace memory at 0x%llx\n",
>>>> +                 ent->start);
>>>> +            ret = -1;
>>>> +            continue;
>>>> +        }
>>>> +
>>>> +        snprintf(ent->name, 16, "%08x", ent->nid);
>>>> +        dir = debugfs_create_dir(ent->name, memtrace_debugfs_dir);
>>>> +        if (!dir)
>>>> +            return -1;
>>>> +
>>>> +        ent->dir = dir;
>>>> +        debugfs_create_file("trace", 0400, dir, ent, &memtrace_fops);
>>>> +        debugfs_create_x64("start", 0400, dir, &ent->start);
>>>> +        debugfs_create_x64("size", 0400, dir, &ent->size);
>>>> +        debugfs_create_u32("node", 0400, dir, &ent->nid);
>>>> +    }
>>> Oh okay, its creating all the four files. Please create corresponding
>>> to each of the files some where. Documentation/ABI/testing lists the
>>> actual system ABI on /sys/ not the sys/kernel/debug/ ones I guess.
>> I'm not exactly sure what you are saying here... Seeing that there is
>> documentation about debugfs files in Documentation/ABI/testing, I'll
>> follow suit and put it there.
> I meant the same.
>
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static u64 memtrace_size;
>>>> +
>>>> +static int memtrace_enable_set(void *data, u64 val)
>>>> +{
>>>> +    if (memtrace_size)
>>>> +        return -EINVAL;
>>>> +
>>>> +    if (!val)
>>>> +        return -EINVAL;
>>>> +
>>>> +    /* Make sure size is aligned to a memory block */
>>>> +    if (val & (memory_block_size_bytes()-1))
>>> As I have mentioned earlier, this should be mentioned in the interface
>>> description some where.
>>>
>>>> +        return -EINVAL;
>>>> +
>>>> +    if (memtrace_init_regions_runtime(val))
>>>> +        return -EINVAL;
>>>> +
>>>> +    if (memtrace_init_debugfs())
>>>> +        return -EINVAL;
>>>> +
>>>> +    memtrace_size = val;
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int memtrace_enable_get(void *data, u64 *val)
>>>> +{
>>>> +    *val = memtrace_size;
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +DEFINE_SIMPLE_ATTRIBUTE(memtrace_init_fops, memtrace_enable_get,
>>>> memtrace_enable_set, "0x%016llx\n");
>>>> +
>>>> +static int memtrace_init(void)
>>>> +{
>>>> +    memtrace_debugfs_dir = debugfs_create_dir("memtrace",
>>>> powerpc_debugfs_root);
>>>> +    if (!memtrace_debugfs_dir)
>>>> +        return -1;
>>>> +
>>>> +    debugfs_create_file("enable", 0600, memtrace_debugfs_dir,
>>>> +                NULL, &memtrace_init_fops);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +machine_device_initcall(powernv, memtrace_init);
>>>> +
>>>>
>>> BTW how we start the tracing process for the trace to be collected in the
>>> interface before we can read them ? This interface does not seem to have
>>> a handler. When it directs the HW to start collecting the traces ?
>>>
>>> debugfs_create_x64("start", 0400, dir, &ent->start);
>>>
>>>
>> I think you're asking 'what is actually going to call this code and do
>> the tracing'?
> No, when you call this interface, where is the routine to start the
> actual tracing invoking appropriate platform functions or HW
> instructions ? I dont see such a function associated with 'start'
> interface mentioned above.

Essentially,
DEFINE_SIMPLE_ATTRIBUTE(memtrace_init_fops, memtrace_enable_get,
memtrace_enable_set, "0x%016llx\n");

means that when a number is written to the memtrace/enable file, the number
is parsed as a u64 and passed to memtrace_enable_set(). That function then
removes the memory from each node and creates the per-node debugfs files.
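
To make the write path concrete, here is a rough userspace model of it. This
is only a sketch, not the kernel code: the model_* names are made up for
illustration, MODEL_BLOCK_SIZE is an assumed stand-in for whatever
memory_block_size_bytes() returns on a given system, and the string parsing
only approximates what the simple attribute helpers do. The checks in
model_enable_set() mirror the ones in memtrace_enable_set() from the patch;
the actual memory removal and debugfs setup are left out.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed value; the real one comes from memory_block_size_bytes(). */
#define MODEL_BLOCK_SIZE (256ULL << 20)

/* Models the static memtrace_size variable in the patch. */
uint64_t model_memtrace_size;

/*
 * Mirrors the checks in memtrace_enable_set(). The calls that offline
 * memory and create the debugfs files are omitted in this model.
 */
int model_enable_set(uint64_t val)
{
	if (model_memtrace_size)		/* already enabled once */
		return -EINVAL;
	if (!val)				/* zero size makes no sense */
		return -EINVAL;
	if (val & (MODEL_BLOCK_SIZE - 1))	/* not memory block aligned */
		return -EINVAL;

	model_memtrace_size = val;
	return 0;
}

/*
 * Approximates the attribute write: the text written to the file is
 * parsed as an unsigned 64-bit number (base auto-detected, so "0x..."
 * works) and handed to the set callback.
 */
int model_attr_write(const char *buf)
{
	char *end;
	unsigned long long val;

	errno = 0;
	val = strtoull(buf, &end, 0);
	if (errno || end == buf)
		return -EINVAL;

	return model_enable_set(val);
}
```

In this model, model_attr_write("0x10000000\n") plays the role of echoing
that value into memtrace/enable. An unaligned size, a zero size, or a second
write after a successful one all come back as -EINVAL, which matches how the
patch behaves.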

>
>> There is some userspace code in the works to do this. The person who was
>> writing it is doing something else now, and I think it's a bit gross so
>> I'm trying to tidy it up a little.
> Okay, please do provide some pointers when the code is ready which will
> help in better understanding of this interface.
>
Will do!