[Skiboot] [PATCH] mem_region: Merge similar allocations when dumping

Oliver oohall at gmail.com
Mon Jul 23 20:11:32 AEST 2018

On Mon, Jul 23, 2018 at 7:15 PM, Stewart Smith <stewart at linux.ibm.com> wrote:
> "Oliver O'Halloran" <oohall at gmail.com> writes:
>> Currently we print one line for each allocation done at runtime when
>> dumping the memory allocations. We do a few thousand allocations at
>> boot so this can result in a huge amount of text being printed which
>> is a) slow to print, and b) Can result in the log buffer overflowing
>> which destroys otherwise useful information.
>> This patch adds a de-duplication to this memory allocation dump by
>> merging "similar" allocations (same location, same size) into one.
> Seems like a good idea, it's pretty damn verbose currently.
>> Unfortunately, the algorithm used to do the de-duplication is quadratic,
>> but considering we only dump the allocations in the event of a fatal
>> error I think this is acceptable. I also did some benchmarking and found
>> that on a ZZ it takes ~3ms to do a dump with 12k allocations. On a Zaius
>> it's slightly longer at about ~10ms for 10k allocs. However, the
>> difference there was due to the output being written to the UART.
> I didn't think we were fatal when this occurs at runtime - we'd happily
> continue executing if the failure was somewhere non-critical (which it
> *should* be if we're doing memory allocation at runtime). So that gives
> me pause on the quadratic part of it.
> Maybe we should only dump the list of allocations on the first failure?
> After that it's probably not too interesting.

Makes sense. We might want to do the same for assert() failures too.

>> This patch also bumps the log level to PR_NOTICE. PR_INFO messages are
>> suppressed at the default log level, which probably isn't something you
>> want considering we only dump the allocations when we run out of skiboot
>> heap space.
> Could the motivation of not spamming the console with all the data be
> that we didn't want to overwhelm the console in the case of We Did
> Something Terrible?
> At least in the past we've had bugs where we'd fail a memory allocation
> at runtime and just print forever - which is a pretty nasty thing when
> compared to the situations these bugs have been in

Sure, but the whole point of this patch is to eliminate the "print-forever"
situation. On the ZZ I tested with the 12k allocation reduced to ~150 lines
which is fairly manageable. Even a Firestone with the world's crappiest SOL
implementation it only takes ~2 seconds to dump the de-duplicated log.

Anyway, currently we have zero instrumentation on allocation failures so every
allocation failure is a silent failure. This isn't a great situation
at the best of
times, but it's particularly obnoxious at boot time where we need the allocation
dump to have any chance of debugging the issue.

>(where continuing to  operate would be Just Fine).

Have we actually tested that it's Just Fine? I don't believe even for a second
that runtime allocation failures is something we handle gracefully in all,
or even most, cases.


More information about the Skiboot mailing list