[PATCH 2/2] v1 powerpc/powernv: Enable removal of memory for in memory tracing
Anshuman Khandual
khandual at linux.vnet.ibm.com
Wed May 3 21:56:04 AEST 2017
On 05/03/2017 09:22 AM, Rashmica Gupta wrote:
> On 28/04/17 19:52, Anshuman Khandual wrote:
>> On 04/28/2017 11:12 AM, Rashmica Gupta wrote:
>>> Some powerpc hardware features may want to gain access to a chunk of
>> What kind of features ? Please add specifics.
>>
>>> undisturbed real memory. This update provides a means to unplug said
>>> memory
>> Undisturbed ? Meaning part of memblock and currently inside the buddy
>> allocator which we are trying to hot unplug out ?
>>
>>> from the kernel with a set of debugfs calls. By writing an integer
>>> containing
>>> the size of memory to be unplugged into
>> Does the size has some constraints like aligned with memblock section
>> size ? LMB size ? page block size ? etc. Please add the details.
>
> Will do.
>
>>
>>> /sys/kernel/debug/powerpc/memtrace/enable, the code will remove that
>>> much
>>> memory from the end of each available chip's memory space (ie each
>>> memory node).
>> <size> amount (I guess bytes in this case) of memory will be removed
>> from the end of the NUMA node ? Whats the guarantee that they would be
>> free at that time and not being pinned by some process ? If its not
>> guaranteed to be freed, then interface description should state that
>> clearly.
>
> We start looking from the end of the NUMA node but of course there is no
> guarantee
> that we will always be able to find some memory there that we are able
> to remove.
Okay. Do we have interface for giving this memory back to the buddy
allocator again when we are done with HW tracing ? If not we need to
add one.
>
>>> In addition, the means to read out the contents of the unplugged
>>> memory is also
>>> provided by reading out the
>>> /sys/kernel/debug/powerpc/memtrace/<chip-id>/trace
>>> file.
>> All of the debugfs file interfaces added here should be documented some
>> where in detail.
>>
>>> Signed-off-by: Anton Blanchard <anton at samba.org>
>>> Signed-off-by: Rashmica Gupta <rashmica.g at gmail.com>
>>>
>>> ---
>>> This requires the 'Wire up hpte_removebolted for powernv' patch.
>>>
>>> RFC -> v1: Added in two missing locks. Replaced the open-coded
>>> flush_memory_region() with the existing
>>> flush_inval_dcache_range(start, end).
>>>
>>> memtrace_offline_pages() is open-coded because offline_pages is
>>> designed to be
>>> called through the sysfs interface - not directly.
>>>
>>> We could move the offlining of pages to userspace, which removes some
>>> of this
>>> open-coding. This would then require passing info to the kernel such
>>> that it
>>> can then remove the memory that has been offlined. This could be done
>>> using
>>> notifiers, but this isn't simple due to locking (remove_memory needs
>>> mem_hotplug_begin() which the sysfs interface already has). This
>>> could also be
>>> done through the debugfs interface (similar to what is done here).
>>> Either way,
>>> this would require the process that needs the memory to have
>>> open-coded code
>>> which it shouldn't really be involved with.
>>>
>>> As the current remove_memory() function requires the memory to
>>> already be
>>> offlined, it makes sense to keep the offlining and removal of memory
>>> functionality grouped together so that a process can simply make one
>>> request to
>>> unplug some memory. Ideally there would be a kernel function we could
>>> call that
>>> would offline the memory and then remove it.
>>>
>>>
>>> arch/powerpc/platforms/powernv/memtrace.c | 276
>>> ++++++++++++++++++++++++++++++
>>> 1 file changed, 276 insertions(+)
>>> create mode 100644 arch/powerpc/platforms/powernv/memtrace.c
>>>
>>> diff --git a/arch/powerpc/platforms/powernv/memtrace.c
>>> b/arch/powerpc/platforms/powernv/memtrace.c
>>> new file mode 100644
>>> index 0000000..86184b1
>>> --- /dev/null
>>> +++ b/arch/powerpc/platforms/powernv/memtrace.c
>>> @@ -0,0 +1,276 @@
>>> +/*
>>> + * This program is free software; you can redistribute it and/or modify
>>> + * it under the terms of the GNU General Public License as published by
>>> + * the Free Software Foundation; either version 2 of the License, or
>>> + * (at your option) any later version.
>>> + *
>>> + * This program is distributed in the hope that it will be useful,
>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>>> + * GNU General Public License for more details.
>>> + *
>>> + * Copyright (C) IBM Corporation, 2014
>>> + *
>>> + * Author: Anton Blanchard <anton at au.ibm.com>
>>> + */
>>> +
>>> +#define pr_fmt(fmt) "powernv-memtrace: " fmt
>>> +
>>> +#include <linux/bitops.h>
>>> +#include <linux/string.h>
>>> +#include <linux/memblock.h>
>>> +#include <linux/init.h>
>>> +#include <linux/moduleparam.h>
>>> +#include <linux/fs.h>
>>> +#include <linux/debugfs.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/memory.h>
>>> +#include <linux/memory_hotplug.h>
>>> +#include <asm/machdep.h>
>>> +#include <asm/debugfs.h>
>>> +#include <asm/cacheflush.h>
>>> +
>>> +struct memtrace_entry {
>>> + void *mem;
>>> + u64 start;
>>> + u64 size;
>>> + u32 nid;
>>> + struct dentry *dir;
>>> + char name[16];
>>> +};
>> Little bit of description about the structure here will help.
>
> Something like 'this enables us to keep track of the memory removed from
> each node'?
Right, something like that.
>
>>> +
>>> +static struct memtrace_entry *memtrace_array;
>>> +static unsigned int memtrace_array_nr;
>>> +
>>> +static ssize_t memtrace_read(struct file *filp, char __user *ubuf,
>>> + size_t count, loff_t *ppos)
>>> +{
>>> + struct memtrace_entry *ent = filp->private_data;
>>> +
>>> + return simple_read_from_buffer(ubuf, count, ppos, ent->mem,
>>> ent->size);
>>> +}
>>> +
>>> +static bool valid_memtrace_range(struct memtrace_entry *dev,
>>> + unsigned long start, unsigned long size)
>>> +{
>>> + if ((dev->start <= start) &&
>> Switch the position of start and dev->start above. Will make
>> it easy while reading.
>>
>>> + ((start + size) <= (dev->start + dev->size)))
>>> + return true;
>>> +
>>> + return false;
>>> +}
>>> +
>>> +static int memtrace_mmap(struct file *filp, struct vm_area_struct *vma)
>>> +{
>>> + unsigned long size = vma->vm_end - vma->vm_start;
>>> + struct memtrace_entry *dev = filp->private_data;
>>> +
>>> + if (!valid_memtrace_range(dev, vma->vm_pgoff << PAGE_SHIFT, size))
>>> + return -EINVAL;
>>> +
>>> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
>> Why we do this ? Its coming from real RAM not IO memory. Then the page
>> protection still needs changes ?
>
> Once the memory is removed from the kernel mappings we want to mark it as
> uncachable.
Got it but why ? Uncachable marking are for pages which will be mapped
to IO ranges which should not be cached just to prevent the possibility
of stale data.
>
>>> +
>>> + if (io_remap_pfn_range(vma, vma->vm_start,
>>> + vma->vm_pgoff + (dev->start >> PAGE_SHIFT),
>>> + size, vma->vm_page_prot))
>> You can just call remap_pfn_rang() instead though they are all the same.
>> There is nothing I/O here should be explicit.
>
> Good point.
>
>>> + return -EAGAIN;
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static const struct file_operations memtrace_fops = {
>>> + .llseek = default_llseek,
>>> + .read = memtrace_read,
>>> + .mmap = memtrace_mmap,
>>> + .open = simple_open,
>>> +};
>>> +
>>> +static int check_memblock_online(struct memory_block *mem, void *arg)
>>> +{
>>> + if (mem->state != MEM_ONLINE)
>>> + return -1;
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static int change_memblock_state(struct memory_block *mem, void *arg)
>>> +{
>>> + unsigned long state = (unsigned long)arg;
>>> +
>>> + mem->state = state;
>>> + return 0;
>>> +}
>>> +
>>> +static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64
>>> nr_pages)
>>> +{
>>> + u64 end_pfn = start_pfn + nr_pages - 1;
>>> +
>>> + if (walk_memory_range(start_pfn, end_pfn, NULL,
>>> + check_memblock_online))
>>> + return false;
>>> +
>>> + walk_memory_range(start_pfn, end_pfn, (void *)MEM_GOING_OFFLINE,
>>> + change_memblock_state);
>>> +
>> walk_memory_range() might be expensive, cant we just change the state
>> to MEM_GOING_OFFLINE while checking the state for MEM_ONLINE during
>> the first loop and bail out if any of the memblock is not in MEM_ONLINE
>> in the first place.
>
> Good idea.
>
>>> + mem_hotplug_begin();
>>> + if (offline_pages(start_pfn, nr_pages)) {
>>> + walk_memory_range(start_pfn, end_pfn, (void *)MEM_ONLINE,
>>> + change_memblock_state);
>>> + mem_hotplug_done();
>> Right, this can remain as is. If we fail to offline pages, mark the
>> memory blocks as MEM_ONLINE again.
>>
>>> + return false;
>>> + }
>>> +
>>> + walk_memory_range(start_pfn, end_pfn, (void *)MEM_OFFLINE,
>>> + change_memblock_state);
>>> + mem_hotplug_done();
>> Right.
>>
>>> +
>>> + /* Clear the dcache to remove any references to the memory */
>>> + flush_inval_dcache_range((u64)__va(start_pfn << PAGE_SHIFT),
>>> + (u64)__va(end_pfn << PAGE_SHIFT));
>> I am wondering why this is required now when we dont do anything for
>> cache flushing calls from core VM. If its really required now then
>> it also should be required during memory hot unplug operations in
>> general as well.
>
> I could not see if this was being done when removing memory so figured
> that it was better to put it in than not do it.
Looking at the definitions I had pointed out before which gets
called from core VM, powerpc does not need to do anything specific
for cache invalidation or flushing. But I am not really sure on
this. So let it be.
>
>>
>> /*
>> * No cache flushing is required when address mappings are changed,
>> * because the caches on PowerPCs are physically addressed.
>> */
>> #define flush_cache_all() do { } while (0)
>> #define flush_cache_mm(mm) do { } while (0)
>> #define flush_cache_dup_mm(mm) do { } while (0)
>> #define flush_cache_range(vma, start, end) do { } while (0)
>> #define flush_cache_page(vma, vmaddr, pfn) do { } while (0)
>> #define flush_icache_page(vma, page) do { } while (0)
>> #define flush_cache_vmap(start, end) do { } while (0)
>> #define flush_cache_vunmap(start, end) do { } while (0)
>>
>>
>>> +
>>> + /* Now remove memory from the mappings */
>>> + lock_device_hotplug();
>>> + remove_memory(nid, start_pfn << PAGE_SHIFT, nr_pages <<
>>> PAGE_SHIFT);
>>> + unlock_device_hotplug();
>> Right. Now we have successfully taken down the memory.
>>
>>> +
>>> + return true;
>>> +}
>>> +
>>> +static u64 memtrace_alloc_node(u32 nid, u64 size)
>>> +{
>>> + u64 start_pfn, end_pfn, nr_pages;
>>> + u64 base_pfn;
>>> +
>>> + if (!NODE_DATA(nid) || !node_spanned_pages(nid))
>>> + return 0;
>> Why NODE_DATA check is required here ? Each node should have one
>> allocated and initialized by now, else we have bigger problems.
>> Is there any specific reason to check for spanned pages instead
>> of present/managed pages.
>
> Anton wrote this check, so will need to confirm with him. I assume
> we check node_spanned_pages() rather than node_present_pages()
> because in arch/powerpc/mm/numa.c we set node_spanned_pages() and
> not node_present_pages()?
I guess any thing is okay but NODE_DATA() seems redundant though.
struct pglist_data {
..........................
unsigned long node_present_pages; /* total number of physical pages */
unsigned long node_spanned_pages; /* total size of physical page
range, including holes */
}
>>
>>> +
>>> + start_pfn = node_start_pfn(nid);
>>> + end_pfn = node_end_pfn(nid);
>>> + nr_pages = size >> PAGE_SHIFT;
>>> +
>>> + /* Trace memory needs to be aligned to the size */
>>> + end_pfn = round_down(end_pfn - nr_pages, nr_pages);
>>> +
>>> + for (base_pfn = end_pfn; base_pfn > start_pfn; base_pfn -=
>>> nr_pages) {
>>> + if (memtrace_offline_pages(nid, base_pfn, nr_pages) == true)
>>> + return base_pfn << PAGE_SHIFT;
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static int memtrace_init_regions_runtime(u64 size)
>>> +{
>>> + u64 m;
>>> + u32 nid;
>>> +
>>> + memtrace_array = kzalloc(sizeof(struct memtrace_entry) *
>>> + num_online_nodes(), GFP_KERNEL);
>>> + if (!memtrace_array) {
>>> + pr_err("Failed to allocate memtrace_array\n");
>>> + return -EINVAL;
>>> + }
>>> +
>>> + for_each_online_node(nid) {
>>> + m = memtrace_alloc_node(nid, size);
>>> + /*
>>> + * A node might not have any local memory, so warn but
>>> + * continue on.
>>> + */
>>> + if (!m) {
>>> + pr_err("Failed to allocate trace memory on node %d\n",
>>> + nid);
>>> + } else {
>>> + pr_info("Allocated trace memory on node %d at 0x%016llx\n",
>>> + nid, m);
>>> +
>>> + memtrace_array[memtrace_array_nr].start = m;
>>> + memtrace_array[memtrace_array_nr].size = size;
>>> + memtrace_array[memtrace_array_nr].nid = nid;
>>> + memtrace_array_nr++;
>>> + }
>>> + }
>>> + return 0;
>>> +}
>> All the pr_info() and pr_err() prints should have a "memtrace :"
>> before the
>> actual string to make it clear that its coming from this feature.
>>
> Good point!
>>> +
>>> +static struct dentry *memtrace_debugfs_dir;
>>> +
>>> +static int memtrace_init_debugfs(void)
>>> +{
>>> + int ret = 0;
>>> + int i;
>>> +
>>> + for (i = 0; i < memtrace_array_nr; i++) {
>>> + struct dentry *dir;
>>> + struct memtrace_entry *ent = &memtrace_array[i];
>>> +
>>> + ent->mem = ioremap(ent->start, ent->size);
>>> + /* Warn but continue on */
>>> + if (!ent->mem) {
>>> + pr_err("Failed to map trace memory at 0x%llx\n",
>>> + ent->start);
>>> + ret = -1;
>>> + continue;
>>> + }
>>> +
>>> + snprintf(ent->name, 16, "%08x", ent->nid);
>>> + dir = debugfs_create_dir(ent->name, memtrace_debugfs_dir);
>>> + if (!dir)
>>> + return -1;
>>> +
>>> + ent->dir = dir;
>>> + debugfs_create_file("trace", 0400, dir, ent, &memtrace_fops);
>>> + debugfs_create_x64("start", 0400, dir, &ent->start);
>>> + debugfs_create_x64("size", 0400, dir, &ent->size);
>>> + debugfs_create_u32("node", 0400, dir, &ent->nid);
>>> + }
>> Oh okay, its creating all the four files. Please create corresponding
>> to each of the files some where. Documentation/ABI/testing lists the
>> actual system ABI on /sys/ not the sys/kernel/debug/ ones I guess.
>
> I'm not exactly sure what you are saying here... Seeing that there is
> documentation about debugfs files in Documentation/ABI/testing, I'll
> follow suit
> and put it there.
I meant the same.
>
>>> +
>>> + return ret;
>>> +}
>>> +
>>> +static u64 memtrace_size;
>>> +
>>> +static int memtrace_enable_set(void *data, u64 val)
>>> +{
>>> + if (memtrace_size)
>>> + return -EINVAL;
>>> +
>>> + if (!val)
>>> + return -EINVAL;
>>> +
>>> + /* Make sure size is aligned to a memory block */
>>> + if (val & (memory_block_size_bytes()-1))
>> As I have mentioned earlier, this should be mentioned in the interface
>> description some where.
>>
>>> + return -EINVAL;
>>> +
>>> + if (memtrace_init_regions_runtime(val))
>>> + return -EINVAL;
>>> +
>>> + if (memtrace_init_debugfs())
>>> + return -EINVAL;
>>> +
>>> + memtrace_size = val;
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static int memtrace_enable_get(void *data, u64 *val)
>>> +{
>>> + *val = memtrace_size;
>>> + return 0;
>>> +}
>>> +
>>> +DEFINE_SIMPLE_ATTRIBUTE(memtrace_init_fops, memtrace_enable_get,
>>> memtrace_enable_set, "0x%016llx\n");
>>> +
>>> +static int memtrace_init(void)
>>> +{
>>> + memtrace_debugfs_dir = debugfs_create_dir("memtrace",
>>> powerpc_debugfs_root);
>>> + if (!memtrace_debugfs_dir)
>>> + return -1;
>>> +
>>> + debugfs_create_file("enable", 0600, memtrace_debugfs_dir,
>>> + NULL, &memtrace_init_fops);
>>> +
>>> + return 0;
>>> +}
>>> +machine_device_initcall(powernv, memtrace_init);
>>> +
>>>
>> BTW how we start the tracing process for the trace to be collected in the
>> interface before we can read them ? This interface does not seem to have
>> a handler. When it directs the HW to start collecting the traces ?
>>
>> debugfs_create_x64("start", 0400, dir, &ent->start);
>>
>>
> I think you're asking 'what is actually going to call this code and do
> the tracing'?
No, when you call this interface, where is the routine to start the
actual tracing invoking appropriate platform functions or HW
instructions ? I dont see such a function associated with 'start'
interface mentioned above.
> There is some userspace code in the works to do this. The person who was
> writing it
> is doing something else now, and I think it's a bit gross so I'm trying
> to tidy it
> up a little.
Okay, please do provide some pointers when the code is ready which will
help in better understanding of this interface.
More information about the Linuxppc-dev
mailing list