[Skiboot] Memory allocations / free HEAP space

Tue Apr 7 10:00:07 AEST 2015

Benjamin Herrenschmidt <benh at kernel.crashing.org> writes:
> On Tue, 2015-04-07 at 07:59 +1000, Stewart Smith wrote:
>> Hi all,
>> 
>> I've been looking at HEAP usage of skiboot booting in various
>> environments.
>> 
>> We currently reserve 12MB for heap (or 11MB once there's my gcov
>> modifications). On Mambo, we only use about 600k -
>> but this is due to there being only one cpu and pretty much no
>> devices. On a dual socket P8, we use a lot more - on an FSP system we
>> currently only have 2.3MB free when we boot the kernel. This is possibly
>> getting a bit close.
>> 
>> We can take one (or both) of these actions:
>> 1) reduce memory usage
>> 2) add extra heap.
>> 
>> I added a small patch to dump out the allocations and free space,
>> getting a decent view as to both memory usage and free space
>> fragmentation.
>> 
>> The big allocations are:
>> [15649665335,5]     0x00100010 hw/fsp/fsp-console.c:543.
>> 
>> We preallocate memory for each possible console. We possibly don't need
>> to allocate all of these on startup, perhaps only when console is opened?
>
> Depends ... I was trying to avoid runtime allocations as much as
> possible to limit fragmentation in the original design. What we do know
> however is that we only ever need as many consoles as there are serial
> ports, plus one. So we know that at boot time. This is 3 on P7 and 2 on
> P8.

A worthwhile design. Looking at fragmentation at boot time, it is
something we're only okay at, and adding huge amounts of smarts to the
allocator to avoid it is probably not the best use of time.

Maybe I should just ensure we're doing the max for P7/P8 there and be
done with it.
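
Roughly what I have in mind is below. This is only a sketch: the names
(struct console_buf, CONSOLE_BUF_SIZE, alloc_console_bufs) and the buffer
size are made up for illustration and are not the existing
fsp-console.c code. The one thing taken from the discussion above is the
count: one console per serial port, plus one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-console buffer size; the real value may differ. */
#define CONSOLE_BUF_SIZE 4096

struct console_buf {
	char *data;
	size_t size;
};

/* One console per serial port, plus one. */
static size_t console_count(size_t serial_ports)
{
	return serial_ports + 1;
}

/*
 * Allocate every console buffer up front, at boot, so that nothing
 * needs to be allocated when a console is opened later.
 * Returns NULL on failure, having freed any partial allocations.
 */
static struct console_buf *alloc_console_bufs(size_t serial_ports,
					      size_t *count)
{
	size_t n = console_count(serial_ports);
	struct console_buf *bufs;
	size_t i;

	assert(count != NULL);

	bufs = calloc(n, sizeof(*bufs));
	if (!bufs)
		return NULL;

	for (i = 0; i < n; i++) {
		bufs[i].data = malloc(CONSOLE_BUF_SIZE);
		if (!bufs[i].data) {
			/* Undo the buffers allocated so far. */
			while (i--)
				free(bufs[i].data);
			free(bufs);
			return NULL;
		}
		bufs[i].size = CONSOLE_BUF_SIZE;
	}

	*count = n;
	return bufs;
}
```

With the numbers above, that would be console_count(2) on P7 and
console_count(1) on P8.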

>> [15649677058,5]     0x0000c010 hw/fsp/fsp-mem-err.c:386.
>> 
>> This should probably be converted to use core/pool.c rather than custom
>> pool.
>> 
>> [15649684021,5]     0x00040010 hw/fsp/fsp-elog-read.c:537.
>> [15649691303,5]     0x00001010 hw/fsp/fsp-elog-read.c:515.
>> 
>> Instinctively I think this should be core/pool.c rather than custom one,
>> but I haven't looked into details.
>> 
>> We also probably don't need to statically allocate the error log buffer
>> to read from FSP?
>
> Same deal, we will use it, and I'd rather not get into situations where
> we fail to allocate it.

We will get notified again if we can't retrieve an error log at that
very moment though, so it's not a *vital* bit of memory. Although I
guess the concern is memory getting increasingly fragmented.

It may be useful to get memory fragmentation information from long
running systems.
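
As a sketch of the sort of summary I mean (not the patch I mentioned, and
all names here are hypothetical): given the sizes of the free blocks, report
the total free space, the largest single free block, and how much of the
free space sits outside that largest block.

```c
#include <assert.h>
#include <stddef.h>

struct frag_stats {
	size_t total_free;	/* sum of all free block sizes */
	size_t largest_free;	/* biggest single free block */
	size_t nr_free_blocks;	/* number of free blocks */
};

/* Walk an array of free block sizes and summarise it. */
static struct frag_stats frag_summarise(const size_t *free_sizes, size_t n)
{
	struct frag_stats s = { 0, 0, 0 };
	size_t i;

	assert(free_sizes != NULL || n == 0);

	for (i = 0; i < n; i++) {
		s.total_free += free_sizes[i];
		if (free_sizes[i] > s.largest_free)
			s.largest_free = free_sizes[i];
		s.nr_free_blocks++;
	}
	return s;
}

/*
 * Percentage of free space that is NOT in the largest free block.
 * 0 means all free space is one contiguous block; higher is worse.
 */
static unsigned int frag_percent(const struct frag_stats *s)
{
	if (s->total_free == 0)
		return 0;
	return (unsigned int)(100 - (s->largest_free * 100) / s->total_free);
}
```

Collected at intervals on a long running system, a number like this
would show whether fragmentation actually grows over time.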

>> [15649698639,5]     0x00010010 hw/fsp/fsp-elog-write.c:398.
>> 
>> We do probably want to keep the panic buffer allocated at boot time,
>> although in the code path that uses it, we probably also want to avoid
>> allocations (which it doesn't look like we really succeed at).
>> 
>> [15649706019,5]     0x00040010 hw/fsp/fsp-elog-write.c:405.
>> [15649713362,5]     0x00010010 hw/fsp/fsp-elog-write.c:412.
>> 
>> probably also should be core/pool.c
>> 
>> [15649720676,5]     0x000e1010 core/pool.c:66.
>> 
>> This is actually from somewhere else, not sure where though.
>> 
>> [15649741434,5]     0x00100010 hw/fsp/fsp.c:1083.
>> 
>> This is fsp_inbound_buf which I'm not convinced needs to be always
>> allocated and I'm not convinced we really need (it looks like only
>> fsp-leds use it)
>
> That one is needed. The FSP asks us to allocate memory on its behalf and
> map it in the TCEs, that's where we get it from. If we want to make it
> dynamic, we'd have to also dynamically map in the TCEs. Not a huge deal,
> but the code was simpler that way.
>
> However I know at some point the FSP was "allocating" a lot more than it
> would ever need, some HMC related stuff that are never going to be used
> with OPAL etc... I don't know if that is still the case and we might be
> able to "adjust" the size of the inbound buf.

That was my guess... it is a fairly decent chunk of memory.

>> [15651055312,5]     0x00100010 core/nvram.c:228.
>> 
>> This is because we allocate a copy of NVRAM in OPAL and do all reads
>> from our cache but write-through to NVRAM.
>> 
>> What we *should* have done of course is have an API that was async, but
>> the Linux code now exists that does this:
>> 
>>         rc = opal_read_nvram(__pa(buf), count, off);
>>         if (rc != OPAL_SUCCESS)
>>                 return -EIO;
>> 
>> and this is the API: it doesn't block.
>
> Correct. This is based on the original OPAL v2 design. It also helps
> with platforms where the nvram isn't functional (such as BML). We should
> just eat that one. However we can make the nvram smaller.
>
>> We *could* still adapt, but we'd have to deprecate things over a long
>> period of time, so it looks like we're stuck (unless NVRAM suddenly
>> jumps in size and we *have* to fix this).
>> 
>> [15652018374,5]     0x00027a30 hw/phb3.c:4188.
>> [15652359587,5]     0x00020010 hw/phb3.c:2217.
>> [15652434447,5]     0x00027a30 hw/phb3.c:4188.
>> [15652574357,5]     0x00027a30 hw/phb3.c:4188.
>> [15652714381,5]     0x00027a30 hw/phb3.c:4188.
>> [15652847063,5]     0x00027a30 hw/phb3.c:4188.
>> [15652979102,5]     0x00027a30 hw/phb3.c:4188.
>> [15653119254,5]     0x00027a30 hw/phb3.c:4188.
>> [15653259644,5]     0x00027a30 hw/phb3.c:4188.
>> 
>> These look like the cost of doing PCI.
>
> Right, the PHBs need those tables.
>
> I think a good thing would be to try to limit the memory used for device
> nodes and properties. We have a lot of 4-bytes ones I suspect, maybe we
> could have a dedicated bitmap allocator for them ? Also we could
> definitely "factor" the ones containing 0. However it's quite a bit of
> work.

Quite likely worth doing, there's a lot of overhead with the current
allocator for these. But I'm not sure it's going to gain us
much... maybe another 64k?
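
For reference, a dedicated bitmap allocator for 4-byte values could be
as small as the sketch below. Everything here is hypothetical (struct
cell_pool, the pool size, the function names) and it is not tied to the
actual device tree code; it only shows the idea of a fixed array of
4-byte cells with one bit of bookkeeping per cell.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pool size; must be a multiple of 64. */
#define CELL_POOL_CELLS 1024

struct cell_pool {
	uint32_t cells[CELL_POOL_CELLS];	/* the 4-byte values */
	uint64_t bitmap[CELL_POOL_CELLS / 64];	/* 1 bit per cell, set = in use */
};

/* Return a pointer to a free cell, or NULL if the pool is full. */
static uint32_t *cell_alloc(struct cell_pool *p)
{
	size_t w, b;

	for (w = 0; w < CELL_POOL_CELLS / 64; w++) {
		if (p->bitmap[w] == UINT64_MAX)
			continue;	/* all 64 cells in this word are used */
		for (b = 0; b < 64; b++) {
			if (!(p->bitmap[w] & (1ULL << b))) {
				p->bitmap[w] |= 1ULL << b;
				return &p->cells[w * 64 + b];
			}
		}
	}
	return NULL;
}

/* Mark a cell previously returned by cell_alloc() as free again. */
static void cell_free(struct cell_pool *p, uint32_t *cell)
{
	size_t idx = (size_t)(cell - p->cells);

	assert(idx < CELL_POOL_CELLS);
	p->bitmap[idx / 64] &= ~(1ULL << (idx % 64));
}
```

The bookkeeping cost is one bit per 4-byte cell, which is where any
saving over a general purpose allocator would come from.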

Perhaps adding more heap really is what we have to do... 2.3MB seems
pretty low, and on a four socket system, I think we'll run out before
boot.
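
For a sense of scale, the big allocations quoted in this thread can
simply be added up. The sizes below are copied from the dump (the
hw/phb3.c:4188 entry appears eight times); the array and function names
are mine, just for the arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes of the big allocations listed in the dump, in bytes. */
static const uint32_t big_allocs[] = {
	0x00100010,	/* hw/fsp/fsp-console.c:543 */
	0x0000c010,	/* hw/fsp/fsp-mem-err.c:386 */
	0x00040010,	/* hw/fsp/fsp-elog-read.c:537 */
	0x00001010,	/* hw/fsp/fsp-elog-read.c:515 */
	0x00010010,	/* hw/fsp/fsp-elog-write.c:398 */
	0x00040010,	/* hw/fsp/fsp-elog-write.c:405 */
	0x00010010,	/* hw/fsp/fsp-elog-write.c:412 */
	0x000e1010,	/* core/pool.c:66 */
	0x00100010,	/* hw/fsp/fsp.c:1083 */
	0x00100010,	/* core/nvram.c:228 */
	0x00020010,	/* hw/phb3.c:2217 */
	/* hw/phb3.c:4188, eight entries */
	0x00027a30, 0x00027a30, 0x00027a30, 0x00027a30,
	0x00027a30, 0x00027a30, 0x00027a30, 0x00027a30,
};

/* Total of the listed allocations, in bytes. */
static uint64_t sum_big_allocs(void)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < sizeof(big_allocs) / sizeof(big_allocs[0]); i++)
		total += big_allocs[i];

	assert(total > 0);
	return total;
}
```

By my arithmetic that comes to a bit under 6MB, so these entries alone
account for roughly half of the 12MB heap.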