[Skiboot] [PATCH] mem_region: Merge similar allocations when dumping

Tue Jul 24 17:46:46 AEST 2018

Oliver <oohall at gmail.com> writes:
> On Mon, Jul 23, 2018 at 7:15 PM, Stewart Smith <stewart at linux.ibm.com> wrote:
>> "Oliver O'Halloran" <oohall at gmail.com> writes:
>>> Currently we print one line for each allocation done at runtime when
>>> dumping the memory allocations. We do a few thousand allocations at
>>> boot so this can result in a huge amount of text being printed which
>>> is a) slow to print, and b) Can result in the log buffer overflowing
>>> which destroys otherwise useful information.
>>>
>>> This patch adds a de-duplication to this memory allocation dump by
>>> merging "similar" allocations (same location, same size) into one.
>>
>> Seems like a good idea, it's pretty damn verbose currently.
>>
>>> Unfortunately, the algorithm used to do the de-duplication is quadratic,
>>> but considering we only dump the allocations in the event of a fatal
>>> error I think this is acceptable. I also did some benchmarking and found
>>> that on a ZZ it takes ~3ms to do a dump with 12k allocations. On a Zaius
>>> it's slightly longer at about ~10ms for 10k allocs. However, the
>>> difference there was due to the output being written to the UART.
>>
>> I didn't think we were fatal when this occurs at runtime - we'd happily
>> continue executing if the failure was somewhere non-critical (which it
>> *should* be if we're doing memory allocation at runtime). So that gives
>> me pause on the quadratic part of it.
>>
>> Maybe we should only dump the list of allocations on the first failure?
>> After that it's probably not too interesting.
>
> Makes sense. We might want to do the same for assert() failures too.
>
>>> This patch also bumps the log level to PR_NOTICE. PR_INFO messages are
>>> suppressed at the default log level, which probably isn't something you
>>> want considering we only dump the allocations when we run out of skiboot
>>> heap space.
>>
>> Could the motivation of not spamming the console with all the data be
>> that we didn't want to overwhelm the console in the case of We Did
>> Something Terrible?
>>
>> At least in the past we've had bugs where we'd fail a memory allocation
>> at runtime and just print forever - which is a pretty nasty thing when
>> compared to the situations these bugs have been in
>
> Sure, but the whole point of this patch is to eliminate the "print-forever"
> situation. On the ZZ I tested with the 12k allocation reduced to ~150 lines
> which is fairly manageable. Even a Firestone with the world's crappiest SOL
> implementation it only takes ~2 seconds to dump the de-duplicated log.
>
> Anyway, currently we have zero instrumentation on allocation failures so every
> allocation failure is a silent failure. This isn't a great situation
> at the best of
> times, but it's particularly obnoxious at boot time where we need the allocation
> dump to have any chance of debugging the issue.
>
>>(where continuing to  operate would be Just Fine).
>
> Have we actually tested that it's Just Fine? I don't believe even for a second
> that runtime allocation failures is something we handle gracefully in all,
> or even most, cases.

Only tested in production by customers :)

-- 
Stewart Smith
OPAL Architect, IBM.