[PATCH] of: introduce event tracepoints for dynamic device_node lifecyle

Tyrel Datwyler turtle.in.the.kernel at gmail.com
Fri Apr 21 02:51:55 AEST 2017


On 04/19/2017 09:43 PM, Frank Rowand wrote:

<snip>

>> Something else to keep in mind is that while pr_debugs could be used to
>> provide feedback on the reference counts and of_reconfig events they
>> don't in anyway tell us where they are happening in the kernel. The
> 
> Yes, that is critical information.  When there are refcount issues, the
> root cause is at varying levels back in the call stack.
> 
> 
>> trace infrastructure provides the ability to stack trace those events.
>> The following example provides me a lot more information about who is
>> doing what and where after I hot-add an ethernet adapter:
>>
>> # echo stacktrace > /sys/kernel/debug/tracing/trace_options
>> # cat trace | grep -A6 "/pci at 800000020000018"
>> ...
>>            drmgr-7349  [006] d...  7138.821875: of_node_get: refcount=8,
>> dn->full_name=/pci at 800000020000018
>>            drmgr-7349  [006] d...  7138.821876: <stack trace>
>>  => .msi_quota_for_device
>>  => .rtas_setup_msi_irqs
>>  => .arch_setup_msi_irqs
>>  => .__pci_enable_msix
>>  => .pci_enable_msix_range
> 
> Nice!  It is great to have function names in the call stack.

Agreed.

<snip>

>> To get that same info as far as I know is to add a dump_stack() after
>> each pr_debug.
> 
> Here is a patch that I have used.  It is not as user friendly in terms
> of human readable stack traces (though a very small user space program
> should be able to fix that).  The patch is cut and pasted into this
> email, so probably white space damaged.
> 
> Instead of dumping the stack, each line in the "report" contains
> the top six addresses in the call stack.  If interesting, they
> can be post-processed (as I will show in some examples below).

Very cool. I wasn't familiar with the CALLER_ADDR* defines. Thanks for
sharing.

> 
> ---
>  drivers/of/dynamic.c |   29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> Index: b/drivers/of/dynamic.c
> ===================================================================
> --- a/drivers/of/dynamic.c
> +++ b/drivers/of/dynamic.c
> @@ -13,6 +13,7 @@
>  #include <linux/slab.h>
>  #include <linux/string.h>
>  #include <linux/proc_fs.h>
> +#include <linux/ftrace.h>
>  
>  #include "of_private.h"
>  
> @@ -27,6 +28,20 @@ struct device_node *of_node_get(struct d
>  {
>  	if (node)
>  		kobject_get(&node->kobj);
> +
> +	if (node) {
> +		int k;
> +		int refcount = refcount_read(&node->kobj.kref.refcount);
> +		pr_err("XXX get 0x%p %3d [0x%08lx 0x%08lx 0x%08lx 0x%08lx 0x%08lx 0x%08lx] ",
> +			node, refcount,
> +			CALLER_ADDR5, CALLER_ADDR4, CALLER_ADDR3,
> +			CALLER_ADDR2, CALLER_ADDR1, CALLER_ADDR0);
> +		// refcount = (refcount > 20) ? 20 : refcount; /* clamp max */
> +		for (k = 0; k < refcount; k++)
> +			pr_cont("+");
> +		pr_cont(" %s\n", of_node_full_name(node));
> +	}
> +
>  	return node;
>  }
>  EXPORT_SYMBOL(of_node_get);
> @@ -38,8 +53,22 @@ EXPORT_SYMBOL(of_node_get);
>   */
>  void of_node_put(struct device_node *node)
>  {
> +	if (node) {
> +		int k;
> +		int refcount = refcount_read(&node->kobj.kref.refcount);
> +		pr_err("XXX put 0x%p %3d [0x%08lx 0x%08lx 0x%08lx 0x%08lx 0x%08lx 0x%08lx] ",
> +			node, refcount,
> +			CALLER_ADDR5, CALLER_ADDR4, CALLER_ADDR3,
> +			CALLER_ADDR2, CALLER_ADDR1, CALLER_ADDR0);
> +		// refcount = (refcount > 20) ? 20 : refcount; /* clamp max */
> +		for (k = 0; k < refcount; k++)
> +			pr_cont("-");
> +		pr_cont(" %s\n", of_node_full_name(node));
> +	}
> +
>  	if (node)
>  		kobject_put(&node->kobj);
> +
>  }
>  EXPORT_SYMBOL(of_node_put);
> 
> 
> I have used this mostly in the context of boot time, so there is a lot
> of output.  My notes on configuration needed for my ARM boards are:
> 
>    FIXME: Currently using pr_err() so I don't need to set loglevel on boot.
> 
>           So obviously not a user friendly tool!!!
>           The process is:
>              - apply patch
>              - configure, build, boot kernel
>              - analyze data
>              - remove patch
> 
>    LOG_BUF_SHIFT (was 17)
>    General Setup ->
>       [select 21] Kernel log buffer size (16 => 64KB, 17 => 128KB)
> 
> 
>    Device Drivers ->
>       Device Tree and Open Firmware support ->
>         Device Tree overlays
> 
> 
> Want CONFIG_FRAME_POINTER so that CALLER_ADDR* will work.
> To be able to enable CONFIG_FRAME_POINTER, need to disable CONFIG_ARM_UNWIND.
> 
>    Kernel hacking ->
>       [unselect] Enable stack unwinding support (EXPERIMENTAL)
>       CONFIG_FRAME_POINTER will then be selected automatically
> 
> 
> The output looks like:
> 
> [    0.231430] OF: XXX get 0xeefeb5dc   2 [0xc0814c58 0xc08148b0 0xc080e970 0xc080e894 0xc080e678 0xc080de7c] ++ /smp2p-adsp/slave-kernel
> [    0.231457] OF: XXX get 0xeefeb5dc   3 [0xc08103c4 0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814188] +++ /smp2p-adsp/slave-kernel
> [    0.231495] OF: XXX get 0xeefeb5dc   4 [0xc08103c4 0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814258] ++++ /smp2p-adsp/slave-kernel
> [    0.231537] OF: XXX get 0xeefeb244   4 [0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814278 0xc080ccc8] ++++ /smp2p-adsp
> [    0.231568] OF: XXX put 0xeefeb5dc   4 [0xc08103c4 0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814284] ---- /smp2p-adsp/slave-kernel
> [    0.231610] OF: XXX get 0xeefe759c  23 [0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814278 0xc080ccc8] +++++++++++++++++++++++ /
> [    0.231702] OF: XXX put 0xeefeb244   4 [0xc08103c4 0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814284] ---- /smp2p-adsp
> [    0.231744] OF: XXX put 0xeefe759c  23 [0xc08103c4 0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814284] ----------------------- /
> [    0.231881] OF: XXX get 0xeefecb2c  22 [0xc0814c58 0xc08148b0 0xc080e970 0xc080e86c 0xc080e678 0xc080de7c] ++++++++++++++++++++++ /soc/interrupt-controller at f9000000
> [    0.231972] OF: XXX put 0xeefecb2c  22 [0xc080fd94 0xc0814c58 0xc08148b0 0xc080e970 0xc080e894 0xc080e61c] ---------------------- /soc/interrupt-controller at f9000000
> [    0.232101] OF: XXX get 0xeefeb5dc   4 [0xc0814c58 0xc08148b0 0xc080e970 0xc080e894 0xc080e678 0xc080de7c] ++++ /smp2p-adsp/slave-kernel
> [    0.232134] OF: XXX put 0xeefeb5dc   4 [0xc080fd94 0xc0814c58 0xc08148b0 0xc080e970 0xc080e894 0xc080e61c] ---- /smp2p-adsp/slave-kernel
> [    0.232178] OF: XXX get 0xeefeb5dc   4 [0xc0814c58 0xc08148b0 0xc080e970 0xc080e894 0xc080e678 0xc080de7c] ++++ /smp2p-adsp/slave-kernel
> [    0.232211] OF: XXX get 0xeefeb5dc   5 [0xc08103c4 0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814188] +++++ /smp2p-adsp/slave-kernel
> [    0.232257] OF: XXX get 0xeefeb5dc   6 [0xc08103c4 0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814258] ++++++ /smp2p-adsp/slave-kernel
> [    0.232308] OF: XXX get 0xeefeb244   4 [0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814278 0xc080ccc8] ++++ /smp2p-adsp
> [    0.232339] OF: XXX put 0xeefeb5dc   6 [0xc08103c4 0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814284] ------ /smp2p-adsp/slave-kernel
> [    0.232390] OF: XXX get 0xeefe759c  23 [0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814278 0xc080ccc8] +++++++++++++++++++++++ /
> [    0.232482] OF: XXX put 0xeefeb244   4 [0xc08103c4 0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814284] ---- /smp2p-adsp
> 
> 
> 
> But I normally strip off the timestamp, and grep for the "OF: XXX ",
> which gets me only the get and put info.  It is also easy to grep
> for a single node of interest.
> 
> The data fields are:
>   get or put
>   the struct device_node address
>   refcount
>   a 6 caller deep call stack
>   for get, one '+' per refcount or for put, one '-' per refcount
>   the full node name
> 
> The refcount for get is the post get value, for put is the pre put value,
> so they are easy to match up for human scanning. The length of the "++++"
> and "----" patterns on the end are also intended for easy human scanning.
> 
> Here are two actual refcount issues for the root node on my 4.11-rc1:
> 
> OF: XXX get 0xeefe759c   2 [0xc08138bc 0xc08137fc 0xc08136e4 0xc08136b4 0xc0813070 0xc080ccc8] ++ /
> OF: XXX put 0xeefe759c   2 [0xc031aa28 0xc08138bc 0xc08137fc 0xc08136e4 0xc08136b4 0xc0813014] -- /
> OF: XXX get 0xeefe759c   2 [0xc08138bc 0xc08137fc 0xc08136e4 0xc08136b4 0xc0813070 0xc080ccc8] ++ /
> OF: XXX put 0xeefe759c   2 [0xc031aa40 0xc08138bc 0xc08137fc 0xc08136e4 0xc08136b4 0xc0813014] -- /
> OF: XXX get 0xeefe759c  22 [0xc0308518 0xc09330e0 0xc0d00e3c 0xc03017d0 0xc0d3a1fc 0xc080d948] ++++++++++++++++++++++ /
> OF: XXX put 0xeefe759c  22 [0xc0308518 0xc09330e0 0xc0d00e3c 0xc03017d0 0xc0d3a1fc 0xc080d8c8] ---------------------- /
> OF: XXX get 0xeefe759c  22 [0xc0d00e3c 0xc03017d0 0xc0d3a234 0xc0810684 0xc081061c 0xc080d928] ++++++++++++++++++++++ /
> OF: XXX get 0xeefe759c  23 [0xc08103c4 0xc0810024 0xc080fd84 0xc08137b4 0xc0812c88 0xc080ccc8] +++++++++++++++++++++++ /
> OF: XXX put 0xeefe759c  23 [0xc08105d8 0xc08103c4 0xc0810024 0xc080fd84 0xc08137b4 0xc0812cb4] ----------------------- /
> 
> The call stack could easily be post-processed, for example using addr2line.
> Here is the call stack for when the refcount incremented to 23 from 22 (or
> more accurately, to 22 from 21):
> 
> 0xc0d00e3c Line 857 of "init/main.c"
> 0xc03017d0 Line 792 of "init/main.c"
> 0xc0d3a234 Line 528 of "drivers/of/platform.c"
> 0xc0810684 Line 503 of "drivers/of/platform.c"
> 0xc081061c Line 267 of "include/linux/of.h"
> 0xc080d928 Line 815 of "drivers/of/base.c"
> 
> Which ends up being this code:
> 
>    of_platform_default_populate_init()
>       of_platform_default_populate()
>          of_platform_populate()
>             [[ of_find_node_by_path("/") ]]
>                [[ of_find_node_opts_by_path(path, NULL) ]]
>                   of_node_get(of_root)
> 
> Note that some functions can be left out of the ARM call stack, with
> a return going back more than one level.  The functions in the call
> list above that are enclosed in '[[' and ']]' were found by source
> inspection in those cases.

The same thing is encountered in ppc64 stack traces. I assume it is
generally inlining of small functions, but I've never actually verified
that theory. Probably should take the time to investigate, or just ask
someone.

> 
> It looks like a put is missing, but about 250 get/put pairs later,
> of_platform_populate() does the required put on node "/".
> 
> Then quite a bit later, after lots of balanced gets and puts, there is an
> initcall that does a get on the root without a corresponding put.
> 
> 
> The jump from refcount 2 to refcount 22 is an interesting case, insofar as it
> is not the result of of_node_get().  It is instead inside a series of calls to
> kobject_add():

Took me a hot minute to track this one down when I first encountered the
extreme reference count jump on the root node with no associated
of_node_gets. Luckily, the other nodes with children sent me looking in
the right direction.

> 
>    kernel_init()
>       kernel_init_freeable()
>         do_basic_setup()
>             driver_init()
>                of_core_init()
>                   for_each_of_allnodes(np)
>                      __of_attach_node_sysfs(np)
>                         kobject_add()
> 
> 
> Filtering and reformatting is "easily done" with grep and other
> normal unix tools.
> 
> For example, a simple stream of command line tools can show a
> streamlined report of the refcounts of a single node (in this
> case the root node), which can easily be scanned for interesting
> events:
> 
> [    0.199569]    2  ++ /
> [    0.199629]    2  -- /
> [    0.199826]    2  ++ /
> [    0.199886]    2  -- /
> [    0.212549]   22  ++++++++++++++++++++++ /
> [    0.212855]   22  ---------------------- /
> [    0.213087]   22  ++++++++++++++++++++++ /
> [    0.213700]   23  +++++++++++++++++++++++ /
> [    0.213797]   23  ----------------------- /
> [    0.213973]   23  +++++++++++++++++++++++ /
> 
> ... hundreds of boring put/get pairs
> 
> [    0.458737]   23  ----------------------- /
> [    0.458909]   23  +++++++++++++++++++++++ /
> [    0.459035]   23  ----------------------- /
> [    0.459305]   22  ---------------------- /
> [    0.470255]   22  ++++++++++++++++++++++ /
> 
> ... hundreds of boring put/get pairs
> 
> [   93.110548]   22  ++++++++++++++++++++++ /
> [   93.140046]   22  ---------------------- /
> [   93.264639]   22  ++++++++++++++++++++++ /
> [   93.389530]   23  +++++++++++++++++++++++ /
> [   93.414269]   23  ----------------------- /
> 
> 
> You might have noticed that the call trace is not interesting for
> most of the gets and puts.  There are over 350 get/put pairs for
> the root node in the boot that I collected the above examples on,
> but only a few instances where the trace matters.  Thus leaving
> the call stack in a compact format until needed is a feature.
> 
> I will be the first to admit that the tool is not polished and not
> easy to use, though it is easily extended with post-processing.
> 
> I wrote the patch as a proof of concept a while ago and have not
> fleshed it out.  In fact, calling it a "tool" is overstating what
> it is.
> 
> 
>> Further, filters can be set on the tracepoint event fields such that
>> trace data could be restricted to a particular device_node or refcount
>> threshold. For example:
>>
>> # cd /sys/kernel/debug/tracing# cat events/of/of_node_get/format
>> # echo "dn_name == /pci at 800000020000018" > events/of/filter
>>
>> # cat trace
>>            drmgr-10542 [003] ....  9630.677001: of_node_put: refcount=5,
>> dn->full_name=/pci at 800000020000018
>>            drmgr-10542 [003] d...  9631.677368: of_node_get: refcount=6,
>> dn->full_name=/pci at 800000020000018
>>            drmgr-10542 [003] ....  9631.677389: of_node_put: refcount=5,
>> dn->full_name=/pci at 800000020000018
>>            drmgr-10542 [003] ....  9631.677390: of_reconfig_notify:
>> action=DETACH_NODE, dn->full_name=/pci at 800000020000018, prop->name=null,
>> old_prop->name=null
>>            drmgr-10542 [003] .n..  9632.025656: of_node_put: refcount=4,
>> dn->full_name=/pci at 800000020000018
>>            drmgr-10542 [003] .n..  9632.025657: of_node_put: refcount=3,
>> dn->full_name=/pci at 800000020000018
>>
>> After setting the filter and doing a hot-remove of the pci device in
>> question the trace quickly tells me 3 references are being leaked. In
>> combination with the stacktrace option I can quickly correlate call
>> sites that take references without releasing them.
> 
> Thanks for sharing that.  It is nice seeing your results.

Ditto.

-Tyrel


More information about the Linuxppc-dev mailing list