[PATCH 2/2] v1 powerpc/powernv: Enable removal of memory for in memory tracing

Anshuman Khandual khandual at linux.vnet.ibm.com
Mon May 15 14:35:17 AEST 2017


On 05/15/2017 09:04 AM, Rashmica Gupta wrote:
> 
> 
> On 14/05/17 14:55, Anshuman Khandual wrote:
>> On 05/09/2017 12:36 PM, Rashmica Gupta wrote:
>>> Sorry for the late reply, I somehow missed this.
>>>
>>>
>>> On 03/05/17 21:56, Anshuman Khandual wrote:
>>>> On 05/03/2017 09:22 AM, Rashmica Gupta wrote:
>>>>> On 28/04/17 19:52, Anshuman Khandual wrote:
>>>>>> On 04/28/2017 11:12 AM, Rashmica Gupta wrote:
>>>>>>> Some powerpc hardware features may want to gain access to a chunk of
>>>>>> What kind of features? Please add specifics.
>>>>>>
>>>>>>> undisturbed real memory.  This update provides a means to unplug said memory
>>>>>> Undisturbed? Meaning part of memblock and currently inside the buddy
>>>>>> allocator, which we are trying to hot-unplug?
>>>>>>
>>>>>>> from the kernel with a set of debugfs calls.  By writing an integer containing
>>>>>>> the size of memory to be unplugged into
>>>>>> Does the size have some constraints, like being aligned to the memblock
>>>>>> section size? LMB size? Page block size? Please add the details.
>>>>> Will do.
>>>>>
>>>>>>> /sys/kernel/debug/powerpc/memtrace/enable, the code will remove that much
>>>>>>> memory from the end of each available chip's memory space (ie each
>>>>>>> memory node).
>>>>>> <size> amount (I guess bytes in this case) of memory will be removed
>>>>>> from the end of the NUMA node? What's the guarantee that it would be
>>>>>> free at that time and not pinned by some process? If it's not
>>>>>> guaranteed to be freed, then the interface description should state that
>>>>>> clearly.
>>>>> We start looking from the end of the NUMA node, but of course there is no
>>>>> guarantee that we will always be able to find some memory there that we
>>>>> are able to remove.
>>>> Okay. Do we have an interface for giving this memory back to the buddy
>>>> allocator again when we are done with HW tracing? If not, we need to
>>>> add one.
>>> Not at the moment. Last time I spoke to Anton he said something along the
>>> lines of it not being too important, as if you are getting the hardware
>>> traces for debugging purposes you are probably not worried about a bit of
>>> memory being out of action.
>>>
>>> However I can't see why having an interface to online the memory would be
>>> a bad thing, so I'll look into it.
>> Yes, the interface to put the memory back into the buddy allocator is
>> important even if the amount of memory used for tracing is small. We just
>> need to trigger the hotplug and online procedure to put it back.
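For reference, a rough sketch of what such a free/re-online path could look
like, assuming add_memory() plus onlining each memory block again and the
same device_hotplug locking used by the offline path below. The function
names online_mem_block() and memtrace_free_node() are purely illustrative:

static int online_mem_block(struct memory_block *mem, void *arg)
{
	/* Bring one memory block back online so buddy can use it again */
	return device_online(&mem->dev);
}

/* Hypothetical inverse of memtrace_alloc_node(): give the chunk back */
static int memtrace_free_node(int nid, u64 start, u64 size)
{
	int ret;

	lock_device_hotplug();
	ret = add_memory(nid, start, size);
	if (!ret)
		ret = walk_memory_range(PFN_DOWN(start),
					PFN_DOWN(start + size) - 1,
					NULL, online_mem_block);
	unlock_device_hotplug();

	return ret;
}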
>>
>>>>>>> In addition, the means to read out the contents of the unplugged memory
>>>>>>> is also provided by reading out the
>>>>>>> /sys/kernel/debug/powerpc/memtrace/<chip-id>/trace
>>>>>>> file.
>>>>>> All of the debugfs file interfaces added here should be documented
>>>>>> somewhere in detail.
>>>>>>
>>>>>>> Signed-off-by: Anton Blanchard <anton at samba.org>
>>>>>>> Signed-off-by: Rashmica Gupta <rashmica.g at gmail.com>
>>>>>>>
>>>>>>> ---
>>>>>>> This requires the 'Wire up hpte_removebolted for powernv' patch.
>>>>>>>
>>>>>>> RFC -> v1: Added in two missing locks. Replaced the open-coded
>>>>>>> flush_memory_region() with the existing flush_inval_dcache_range(start, end).
>>>>>>>
>>>>>>> memtrace_offline_pages() is open-coded because offline_pages() is designed
>>>>>>> to be called through the sysfs interface - not directly.
>>>>>>>
>>>>>>> We could move the offlining of pages to userspace, which removes some of
>>>>>>> this open-coding. This would then require passing info to the kernel such
>>>>>>> that it can then remove the memory that has been offlined. This could be
>>>>>>> done using notifiers, but this isn't simple due to locking (remove_memory
>>>>>>> needs mem_hotplug_begin(), which the sysfs interface already has). This
>>>>>>> could also be done through the debugfs interface (similar to what is done
>>>>>>> here). Either way, this would require the process that needs the memory to
>>>>>>> have open-coded code which it shouldn't really be involved with.
>>>>>>>
>>>>>>> As the current remove_memory() function requires the memory to already be
>>>>>>> offlined, it makes sense to keep the offlining and removal of memory
>>>>>>> functionality grouped together so that a process can simply make one
>>>>>>> request to unplug some memory. Ideally there would be a kernel function we
>>>>>>> could call that would offline the memory and then remove it.
>>>>>>>
>>>>>>>
>>>>>>>     arch/powerpc/platforms/powernv/memtrace.c | 276 ++++++++++++++++++++++++++++++
>>>>>>>     1 file changed, 276 insertions(+)
>>>>>>>     create mode 100644 arch/powerpc/platforms/powernv/memtrace.c
>>>>>>>
>>>>>>> diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
>>>>>>> new file mode 100644
>>>>>>> index 0000000..86184b1
>>>>>>> --- /dev/null
>>>>>>> +++ b/arch/powerpc/platforms/powernv/memtrace.c
>>>>>>> @@ -0,0 +1,276 @@
>>>>>>> +/*
>>>>>>> + * This program is free software; you can redistribute it and/or modify
>>>>>>> + * it under the terms of the GNU General Public License as published by
>>>>>>> + * the Free Software Foundation; either version 2 of the License, or
>>>>>>> + * (at your option) any later version.
>>>>>>> + *
>>>>>>> + * This program is distributed in the hope that it will be useful,
>>>>>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>>>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>>>>>> + * GNU General Public License for more details.
>>>>>>> + *
>>>>>>> + * Copyright (C) IBM Corporation, 2014
>>>>>>> + *
>>>>>>> + * Author: Anton Blanchard <anton at au.ibm.com>
>>>>>>> + */
>>>>>>> +
>>>>>>> +#define pr_fmt(fmt) "powernv-memtrace: " fmt
>>>>>>> +
>>>>>>> +#include <linux/bitops.h>
>>>>>>> +#include <linux/string.h>
>>>>>>> +#include <linux/memblock.h>
>>>>>>> +#include <linux/init.h>
>>>>>>> +#include <linux/moduleparam.h>
>>>>>>> +#include <linux/fs.h>
>>>>>>> +#include <linux/debugfs.h>
>>>>>>> +#include <linux/slab.h>
>>>>>>> +#include <linux/memory.h>
>>>>>>> +#include <linux/memory_hotplug.h>
>>>>>>> +#include <asm/machdep.h>
>>>>>>> +#include <asm/debugfs.h>
>>>>>>> +#include <asm/cacheflush.h>
>>>>>>> +
>>>>>>> +struct memtrace_entry {
>>>>>>> +    void *mem;
>>>>>>> +    u64 start;
>>>>>>> +    u64 size;
>>>>>>> +    u32 nid;
>>>>>>> +    struct dentry *dir;
>>>>>>> +    char name[16];
>>>>>>> +};
>>>>>> A little bit of description of the structure here will help.
>>>>> Something like 'this enables us to keep track of the memory removed from
>>>>> each node'?
>>>> Right, something like that.
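For illustration only, a comment along these lines would probably do (the
wording is just a suggestion, not the final text):

/*
 * struct memtrace_entry - one chunk of memory removed from a NUMA node
 * for hardware tracing: its physical start address and size, the owning
 * node, and the debugfs directory and mapping used to read it back out.
 */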
>>>>
>>>>>>> +
>>>>>>> +static struct memtrace_entry *memtrace_array;
>>>>>>> +static unsigned int memtrace_array_nr;
>>>>>>> +
>>>>>>> +static ssize_t memtrace_read(struct file *filp, char __user *ubuf,
>>>>>>> +                 size_t count, loff_t *ppos)
>>>>>>> +{
>>>>>>> +    struct memtrace_entry *ent = filp->private_data;
>>>>>>> +
>>>>>>> +    return simple_read_from_buffer(ubuf, count, ppos, ent->mem, ent->size);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static bool valid_memtrace_range(struct memtrace_entry *dev,
>>>>>>> +                 unsigned long start, unsigned long size)
>>>>>>> +{
>>>>>>> +    if ((dev->start <= start) &&
>>>>>> Switch the positions of start and dev->start above; it will read more
>>>>>> naturally (see the snippet after this function).
>>>>>>
>>>>>>> +        ((start + size) <= (dev->start + dev->size)))
>>>>>>> +        return true;
>>>>>>> +
>>>>>>> +    return false;
>>>>>>> +}
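For example, flipping the operands as suggested above gives the same logic,
arguably easier to read:

    if ((start >= dev->start) &&
        ((start + size) <= (dev->start + dev->size)))
        return true;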
>>>>>>> +
>>>>>>> +static int memtrace_mmap(struct file *filp, struct vm_area_struct *vma)
>>>>>>> +{
>>>>>>> +    unsigned long size = vma->vm_end - vma->vm_start;
>>>>>>> +    struct memtrace_entry *dev = filp->private_data;
>>>>>>> +
>>>>>>> +    if (!valid_memtrace_range(dev, vma->vm_pgoff << PAGE_SHIFT, size))
>>>>>>> +        return -EINVAL;
>>>>>>> +
>>>>>>> +    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
>>>>>> Why do we do this? It's coming from real RAM, not IO memory. Does the
>>>>>> page protection still need changing?
>>>>> Once the memory is removed from the kernel mappings we want to mark it as
>>>>> uncachable.
>>>> Got it, but why? Uncachable markings are for pages which will be mapped
>>>> to IO ranges, which should not be cached just to prevent the possibility
>>>> of stale data.
>>>>>>> +
>>>>>>> +    if (io_remap_pfn_range(vma, vma->vm_start,
>>>>>>> +                   vma->vm_pgoff + (dev->start >> PAGE_SHIFT),
>>>>>>> +                   size, vma->vm_page_prot))
>>>>>> You can just call remap_pfn_range() instead, though they are all the same
>>>>>> here. There is nothing I/O about this, which should be explicit.
>>>>> Good point.
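For illustration, the call could simply become the following. This is only a
sketch of the substitution; behaviour is unchanged here, since (as noted
above) io_remap_pfn_range() falls back to remap_pfn_range() on this platform:

    if (remap_pfn_range(vma, vma->vm_start,
                        vma->vm_pgoff + (dev->start >> PAGE_SHIFT),
                        size, vma->vm_page_prot))
        return -EAGAIN;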
>>>>>
>>>>>>> +        return -EAGAIN;
>>>>>>> +
>>>>>>> +    return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static const struct file_operations memtrace_fops = {
>>>>>>> +    .llseek = default_llseek,
>>>>>>> +    .read    = memtrace_read,
>>>>>>> +    .mmap    = memtrace_mmap,
>>>>>>> +    .open    = simple_open,
>>>>>>> +};
>>>>>>> +
>>>>>>> +static int check_memblock_online(struct memory_block *mem, void *arg)
>>>>>>> +{
>>>>>>> +    if (mem->state != MEM_ONLINE)
>>>>>>> +        return -1;
>>>>>>> +
>>>>>>> +    return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static int change_memblock_state(struct memory_block *mem, void *arg)
>>>>>>> +{
>>>>>>> +    unsigned long state = (unsigned long)arg;
>>>>>>> +
>>>>>>> +    mem->state = state;
>>>>>>> +    return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
>>>>>>> +{
>>>>>>> +    u64 end_pfn = start_pfn + nr_pages - 1;
>>>>>>> +
>>>>>>> +    if (walk_memory_range(start_pfn, end_pfn, NULL,
>>>>>>> +        check_memblock_online))
>>>>>>> +        return false;
>>>>>>> +
>>>>>>> +    walk_memory_range(start_pfn, end_pfn, (void *)MEM_GOING_OFFLINE,
>>>>>>> +              change_memblock_state);
>>>>>>> +
>>>>>> walk_memory_range() might be expensive; can't we just change the state to
>>>>>> MEM_GOING_OFFLINE while checking for MEM_ONLINE during the first loop,
>>>>>> and bail out if any of the memblocks is not MEM_ONLINE in the first
>>>>>> place?
>>>>> Good idea.
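A rough sketch of how the two walks could collapse into one (the callback
names below are made up). One caveat: if the check fails partway through,
the blocks already marked need to be put back to MEM_ONLINE, so the rollback
callback only touches blocks this walk actually marked:

static int check_and_mark_going_offline(struct memory_block *mem, void *arg)
{
	if (mem->state != MEM_ONLINE)
		return -1;	/* bail out: something isn't online */

	mem->state = MEM_GOING_OFFLINE;
	return 0;
}

static int rollback_going_offline(struct memory_block *mem, void *arg)
{
	/* Only undo blocks this walk marked, leave others untouched */
	if (mem->state == MEM_GOING_OFFLINE)
		mem->state = MEM_ONLINE;
	return 0;
}

	...
	/* Single pass: verify MEM_ONLINE and mark MEM_GOING_OFFLINE together */
	if (walk_memory_range(start_pfn, end_pfn, NULL,
			      check_and_mark_going_offline)) {
		walk_memory_range(start_pfn, end_pfn, NULL,
				  rollback_going_offline);
		return false;
	}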
>>>>>
>>>>>>> +    mem_hotplug_begin();
>>>>>>> +    if (offline_pages(start_pfn, nr_pages)) {
>>>>>>> +        walk_memory_range(start_pfn, end_pfn, (void *)MEM_ONLINE,
>>>>>>> +                  change_memblock_state);
>>>>>>> +        mem_hotplug_done();
>>>>>> Right, this can remain as is. If we fail to offline pages, mark the
>>>>>> memory blocks as MEM_ONLINE again.
>>>>>>
>>>>>>> +        return false;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    walk_memory_range(start_pfn, end_pfn, (void *)MEM_OFFLINE,
>>>>>>> +              change_memblock_state);
>>>>>>> +    mem_hotplug_done();
>>>>>> Right.
>>>>>>
>>>>>>> +
>>>>>>> +    /* Clear the dcache to remove any references to the memory */
>>>>>>> +    flush_inval_dcache_range((u64)__va(start_pfn << PAGE_SHIFT),
>>>>>>> +                   (u64)__va(end_pfn << PAGE_SHIFT));
>>>>>> I am wondering why this is required now when we don't do anything for
>>>>>> cache flushing calls from core VM. If it's really required now then
>>>>>> it should also be required during memory hot unplug operations in
>>>>>> general.
>>>>> I could not see this being done when removing memory, so I figured it
>>>>> was better to put it in than not.
>>>> Looking at the definitions I had pointed out before, which get called
>>>> from core VM, powerpc does not need to do anything specific for cache
>>>> invalidation or flushing. But I am not really sure about this, so let
>>>> it be.
>>>>
>>>>>> /*
>>>>>>     * No cache flushing is required when address mappings are changed,
>>>>>>     * because the caches on PowerPCs are physically addressed.
>>>>>>     */
>>>>>> #define flush_cache_all()            do { } while (0)
>>>>>> #define flush_cache_mm(mm)            do { } while (0)
>>>>>> #define flush_cache_dup_mm(mm)            do { } while (0)
>>>>>> #define flush_cache_range(vma, start, end)    do { } while (0)
>>>>>> #define flush_cache_page(vma, vmaddr, pfn)    do { } while (0)
>>>>>> #define flush_icache_page(vma, page)        do { } while (0)
>>>>>> #define flush_cache_vmap(start, end)        do { } while (0)
>>>>>> #define flush_cache_vunmap(start, end)        do { } while (0)
>>>>>>
>>>>>>
>>>>>>> +
>>>>>>> +    /* Now remove memory from the mappings */
>>>>>>> +    lock_device_hotplug();
>>>>>>> +    remove_memory(nid, start_pfn << PAGE_SHIFT, nr_pages << PAGE_SHIFT);
>>>>>>> +    unlock_device_hotplug();
>>>>>> Right. Now we have successfully taken down the memory.
>>>>>>
>>>>>>> +
>>>>>>> +    return true;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static u64 memtrace_alloc_node(u32 nid, u64 size)
>>>>>>> +{
>>>>>>> +    u64 start_pfn, end_pfn, nr_pages;
>>>>>>> +    u64 base_pfn;
>>>>>>> +
>>>>>>> +    if (!NODE_DATA(nid) || !node_spanned_pages(nid))
>>>>>>> +        return 0;
>>>>>> Why is the NODE_DATA check required here? Each node should have one
>>>>>> allocated and initialized by now, else we have bigger problems.
>>>>>> Is there any specific reason to check for spanned pages instead
>>>>>> of present/managed pages?
>>>>> Anton wrote this check, so I will need to confirm with him. I assume
>>>>> we check node_spanned_pages() rather than node_present_pages()
>>>>> because in arch/powerpc/mm/numa.c we set node_spanned_pages() and
>>>>> not node_present_pages()?
>>>> I guess anything is okay, but NODE_DATA() seems redundant.
>>> Agreed.
>>>
>>>> struct pglist_data {
>>>>           ..........................
>>>>      unsigned long node_present_pages; /* total number of physical pages */
>>>>      unsigned long node_spanned_pages; /* total size of physical page
>>>>                           range, including holes */
>>>>
>>>> }
>>>>
>>>>>>> +
>>>>>>> +    start_pfn = node_start_pfn(nid);
>>>>>>> +    end_pfn = node_end_pfn(nid);
>>>>>>> +    nr_pages = size >> PAGE_SHIFT;
>>>>>>> +
>>>>>>> +    /* Trace memory needs to be aligned to the size */
>>>>>>> +    end_pfn = round_down(end_pfn - nr_pages, nr_pages);
>>>>>>> +
>>>>>>> +    for (base_pfn = end_pfn; base_pfn > start_pfn; base_pfn -= nr_pages) {
>>>>>>> +        if (memtrace_offline_pages(nid, base_pfn, nr_pages) == true)
>>>>>>> +            return base_pfn << PAGE_SHIFT;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static int memtrace_init_regions_runtime(u64 size)
>>>>>>> +{
>>>>>>> +    u64 m;
>>>>>>> +    u32 nid;
>>>>>>> +
>>>>>>> +    memtrace_array = kzalloc(sizeof(struct memtrace_entry) *
>>>>>>> +                num_online_nodes(), GFP_KERNEL);
>>>>>>> +    if (!memtrace_array) {
>>>>>>> +        pr_err("Failed to allocate memtrace_array\n");
>>>>>>> +        return -EINVAL;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    for_each_online_node(nid) {
>>>>>>> +        m = memtrace_alloc_node(nid, size);
>>>>>>> +        /*
>>>>>>> +         * A node might not have any local memory, so warn but
>>>>>>> +         * continue on.
>>>>>>> +         */
>>>>>>> +        if (!m) {
>>>>>>> +            pr_err("Failed to allocate trace memory on node %d\n",
>>>>>>> +                 nid);
>>>>>>> +        } else {
>>>>>>> +            pr_info("Allocated trace memory on node %d at 0x%016llx\n",
>>>>>>> +                 nid, m);
>>>>>>> +
>>>>>>> +            memtrace_array[memtrace_array_nr].start = m;
>>>>>>> +            memtrace_array[memtrace_array_nr].size = size;
>>>>>>> +            memtrace_array[memtrace_array_nr].nid = nid;
>>>>>>> +            memtrace_array_nr++;
>>>>>>> +        }
>>>>>>> +    }
>>>>>>> +    return 0;
>>>>>>> +}
>>>>>> All the pr_info() and pr_err() prints should have a "memtrace :" prefix
>>>>>> before the actual string, to make it clear that it's coming from this
>>>>>> feature.
>>>>>>
>>>>> Good point!
>>>>>>> +
>>>>>>> +static struct dentry *memtrace_debugfs_dir;
>>>>>>> +
>>>>>>> +static int memtrace_init_debugfs(void)
>>>>>>> +{
>>>>>>> +    int ret = 0;
>>>>>>> +    int i;
>>>>>>> +
>>>>>>> +    for (i = 0; i < memtrace_array_nr; i++) {
>>>>>>> +        struct dentry *dir;
>>>>>>> +        struct memtrace_entry *ent = &memtrace_array[i];
>>>>>>> +
>>>>>>> +        ent->mem = ioremap(ent->start, ent->size);
>>>>>>> +        /* Warn but continue on */
>>>>>>> +        if (!ent->mem) {
>>>>>>> +            pr_err("Failed to map trace memory at 0x%llx\n",
>>>>>>> +                 ent->start);
>>>>>>> +            ret = -1;
>>>>>>> +            continue;
>>>>>>> +        }
>>>>>>> +
>>>>>>> +        snprintf(ent->name, 16, "%08x", ent->nid);
>>>>>>> +        dir = debugfs_create_dir(ent->name, memtrace_debugfs_dir);
>>>>>>> +        if (!dir)
>>>>>>> +            return -1;
>>>>>>> +
>>>>>>> +        ent->dir = dir;
>>>>>>> +        debugfs_create_file("trace", 0400, dir, ent, &memtrace_fops);
>>>>>>> +        debugfs_create_x64("start", 0400, dir, &ent->start);
>>>>>>> +        debugfs_create_x64("size", 0400, dir, &ent->size);
>>>>>>> +        debugfs_create_u32("node", 0400, dir, &ent->nid);
>>>>>>> +    }
>>>>>> Oh okay, it's creating all four files. Please add documentation
>>>>>> corresponding to each of the files somewhere. Documentation/ABI/testing
>>>>>> lists the actual system ABI on /sys/, not the /sys/kernel/debug/ ones,
>>>>>> I guess.
>>>>> I'm not exactly sure what you are saying here... Seeing that there is
>>>>> documentation about debugfs files in Documentation/ABI/testing, I'll
>>>>> follow suit and put it there.
>>>> I meant the same.
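Something roughly like this, perhaps (the fields, wording and location are
only a guess at what such an entry could look like, not a final draft):

What:		/sys/kernel/debug/powerpc/memtrace/enable
Date:		May 2017
Contact:	linuxppc-dev@lists.ozlabs.org
Description:	Write a size in bytes to remove that much memory from the end
		of each online NUMA node for hardware tracing. The size must
		be a multiple of the memory block size and can only be set
		once. Reading returns the currently reserved size.

What:		/sys/kernel/debug/powerpc/memtrace/<node>/trace
Description:	Read-only contents of the memory removed from that node.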
>>>>
>>>>>>> +
>>>>>>> +    return ret;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static u64 memtrace_size;
>>>>>>> +
>>>>>>> +static int memtrace_enable_set(void *data, u64 val)
>>>>>>> +{
>>>>>>> +    if (memtrace_size)
>>>>>>> +        return -EINVAL;
>>>>>>> +
>>>>>>> +    if (!val)
>>>>>>> +        return -EINVAL;
>>>>>>> +
>>>>>>> +    /* Make sure size is aligned to a memory block */
>>>>>>> +    if (val & (memory_block_size_bytes()-1))
>>>>>> As I have mentioned earlier, this should be mentioned in the interface
>>>>>> description somewhere.
>>>>>>
>>>>>>> +        return -EINVAL;
>>>>>>> +
>>>>>>> +    if (memtrace_init_regions_runtime(val))
>>>>>>> +        return -EINVAL;
>>>>>>> +
>>>>>>> +    if (memtrace_init_debugfs())
>>>>>>> +        return -EINVAL;
>>>>>>> +
>>>>>>> +    memtrace_size = val;
>>>>>>> +
>>>>>>> +    return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static int memtrace_enable_get(void *data, u64 *val)
>>>>>>> +{
>>>>>>> +    *val = memtrace_size;
>>>>>>> +    return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> +DEFINE_SIMPLE_ATTRIBUTE(memtrace_init_fops, memtrace_enable_get, memtrace_enable_set, "0x%016llx\n");
>>>>>>> +
>>>>>>> +static int memtrace_init(void)
>>>>>>> +{
>>>>>>> +    memtrace_debugfs_dir = debugfs_create_dir("memtrace", powerpc_debugfs_root);
>>>>>>> +    if (!memtrace_debugfs_dir)
>>>>>>> +        return -1;
>>>>>>> +
>>>>>>> +    debugfs_create_file("enable", 0600, memtrace_debugfs_dir,
>>>>>>> +                NULL, &memtrace_init_fops);
>>>>>>> +
>>>>>>> +    return 0;
>>>>>>> +}
>>>>>>> +machine_device_initcall(powernv, memtrace_init);
>>>>>>> +
>>>>>>>
>>>>>> BTW, how do we start the tracing process so that the trace is collected
>>>>>> before we can read it through this interface? This interface does not
>>>>>> seem to have a handler. Where does it direct the HW to start collecting
>>>>>> the traces?
>>>>>>
>>>>>> debugfs_create_x64("start", 0400, dir, &ent->start);
>>>>>>
>>>>>>
>>>>> I think you're asking 'what is actually going to call this code and do the
>>>>> tracing'?
>>>> No, when you call this interface, where is the routine to start the
>>>> actual tracing by invoking the appropriate platform functions or HW
>>>> instructions? I don't see such a function associated with the 'start'
>>>> interface mentioned above.
>>> Essentially,
>>> DEFINE_SIMPLE_ATTRIBUTE(memtrace_init_fops, memtrace_enable_get,
>>> memtrace_enable_set, "0x%016llx\n");
>>>
>>> means that when a number is written to the memtrace/enable file, the number
>>> is read as a u64 and passed to memtrace_enable_set().
>> Right, then we create the following interfaces for each entry of the memory
>> trace. By now all the memory ranges are ioremapped.
>>
>> +        debugfs_create_file("trace", 0400, dir, ent, &memtrace_fops);
>> +        debugfs_create_x64("start", 0400, dir, &ent->start);
>> +        debugfs_create_x64("size", 0400, dir, &ent->size);
>> +        debugfs_create_u32("node", 0400, dir, &ent->nid);
>>
>> Then what really starts the HW trace? Where do we tell the HW to start
>> tracing and put the trace details in the memory buffer that was allocated
>> and ioremapped? IIUC, just ioremapping the memory won't start the trace
>> automatically.
>>
> 
> Ah sorry, I misunderstood you. Yes, this patch is only to remove the
> memory from the kernel mappings so the hardware tracing can use it - not the
> actual code that invokes the tracing. That code involves some hardware
> information that hasn't been made publicly accessible, so it is not my
> place to share it. The hardware trace will be written to the 'removed'
> memory directly by the hardware, and the debugfs files allow us to read
> the trace without having to re-online the memory.

It makes sense now. Thanks Rashmica.
