[Skiboot] Memory allocations / free HEAP space

Tue Apr 7 08:15:53 AEST 2015

On Tue, 2015-04-07 at 07:59 +1000, Stewart Smith wrote:
> Hi all,
> 
> I've been looking at HEAP usage of skiboot booting in various
> environments.
> 
> We currently reserve 12MB for heap (or 11MB once there's my gcov
> modifications). On Mambo, we only use about 600k, but this is due to
> there being only one cpu and pretty much no devices. On a dual socket
> P8, we use a lot more - on an FSP system we currently only have 2.3MB
> free when we boot the kernel. This is possibly getting a bit close.
> 
> We can take one (or both) of these actions:
> 1) reduce memory usage
> 2) add extra heap.
> 
> I added a small patch to dump out the allocations and free space,
> getting a decent view as to both memory usage and free space
> fragmentation.
> 
> The big allocations are:
> [15649665335,5]     0x00100010 hw/fsp/fsp-console.c:543.
> 
> We preallocate memory for each possible console. We possibly don't need
> to allocate all of these on startup, perhaps only when console is opened?

Depends ... I was trying to avoid runtime allocations as much as
possible to limit fragmentation in the original design. What we do know
however is that we only ever need as many consoles as there are serial
ports, plus one. So we know that at boot time. This is 3 on P7 and 2 on
P8.
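
To illustrate the sizing rule above (one console per serial port, plus
one), here is a minimal sketch that allocates every console buffer once
at boot instead of one per possible console. All names and the buffer
size are hypothetical and not taken from fsp-console.c; the serial port
count is assumed to be known by the caller at boot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-console buffer size; not the real fsp-console.c value. */
#define CONSOLE_BUF_SIZE 0x4000

struct console_buf {
	unsigned char *data;
	size_t size;
};

/* Number of consoles needed: one per serial port, plus one. */
static size_t consoles_needed(size_t nr_serial_ports)
{
	return nr_serial_ports + 1;
}

/*
 * Allocate all console buffers once, at boot. Returns the array, or NULL
 * on failure, in which case nothing is left allocated.
 */
static struct console_buf *alloc_consoles(size_t nr_serial_ports, size_t *count)
{
	size_t n = consoles_needed(nr_serial_ports);
	struct console_buf *bufs;
	size_t i;

	assert(count != NULL);
	bufs = calloc(n, sizeof(*bufs));
	if (!bufs)
		return NULL;
	for (i = 0; i < n; i++) {
		bufs[i].data = calloc(1, CONSOLE_BUF_SIZE);
		if (!bufs[i].data) {
			/* Undo the buffers allocated so far. */
			while (i--)
				free(bufs[i].data);
			free(bufs);
			return NULL;
		}
		bufs[i].size = CONSOLE_BUF_SIZE;
	}
	*count = n;
	return bufs;
}
```

This keeps the "no runtime allocation" property of the original design
while bounding the cost by the actual port count.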

> [15649677058,5]     0x0000c010 hw/fsp/fsp-mem-err.c:386.
> 
> This should probably be converted to use core/pool.c rather than custom
> pool.
> 
> [15649684021,5]     0x00040010 hw/fsp/fsp-elog-read.c:537.
> [15649691303,5]     0x00001010 hw/fsp/fsp-elog-read.c:515.
> 
> Instinctively I think this should be core/pool.c rather than custom one,
> but I haven't looked into details.
> 
> We also probably don't need to statically allocate the error log buffer
> to read from FSP?

Same deal, we will use it, and I'd rather not get into situations where
we fail to allocate it.
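
For what it's worth, a pool and "never fail at runtime" are compatible
as long as the pool is filled at boot. The sketch below is a generic
fixed-object pool, not the core/pool.c API; every name in it is
hypothetical. It only shows the idea that the one allocation that can
fail happens at init time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct fixed_pool {
	void *storage;		/* one allocation made at init time */
	void **free_list;	/* stack of pointers to free objects */
	size_t nr_free;
	size_t nr_objs;
};

/* Allocate all objects up front; return 0 on success, -1 on failure. */
static int fixed_pool_init(struct fixed_pool *p, size_t obj_size, size_t nr_objs)
{
	size_t i;

	assert(p && obj_size && nr_objs);
	p->storage = calloc(nr_objs, obj_size);
	p->free_list = calloc(nr_objs, sizeof(void *));
	if (!p->storage || !p->free_list) {
		free(p->storage);
		free(p->free_list);
		return -1;
	}
	for (i = 0; i < nr_objs; i++)
		p->free_list[i] = (char *)p->storage + i * obj_size;
	p->nr_free = p->nr_objs = nr_objs;
	return 0;
}

/* Take an object without touching the heap; NULL when exhausted. */
static void *fixed_pool_get(struct fixed_pool *p)
{
	return p->nr_free ? p->free_list[--p->nr_free] : NULL;
}

/* Return an object previously obtained from fixed_pool_get(). */
static void fixed_pool_put(struct fixed_pool *p, void *obj)
{
	assert(p->nr_free < p->nr_objs);
	p->free_list[p->nr_free++] = obj;
}
```

The caller still has to cope with fixed_pool_get() returning NULL when
every object is in use, but that is a sizing question decided at boot,
not a heap fragmentation question.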

> [15649698639,5]     0x00010010 hw/fsp/fsp-elog-write.c:398.
> 
> We do probably want to keep the panic buffer allocated at boot time,
> although in the code path that uses it, we probably also want to avoid
> allocations (which it doesn't look like we really succeed at).
> 
> [15649706019,5]     0x00040010 hw/fsp/fsp-elog-write.c:405.
> [15649713362,5]     0x00010010 hw/fsp/fsp-elog-write.c:412.
> 
> probably also should be core/pool.c
> 
> [15649720676,5]     0x000e1010 core/pool.c:66.
> 
> This is actually from somewhere else, not sure where though.
> 
> [15649741434,5]     0x00100010 hw/fsp/fsp.c:1083.
> 
> This is fsp_inbound_buf, which I'm not convinced needs to be always
> allocated, and I'm not convinced we really need it at all (it looks
> like only fsp-leds uses it)

That one is needed. The FSP asks us to allocate memory on its behalf and
map it in the TCEs, that's where we get it from. If we want to make it
dynamic, we'd have to also dynamically map in the TCEs. Not a huge deal,
but the code was simpler that way.

However I know at some point the FSP was "allocating" a lot more than it
would ever need, some HMC-related stuff that is never going to be used
with OPAL etc... I don't know if that is still the case, but we might be
able to "adjust" the size of the inbound buf.

> [15649753411,5]     0x0007fb90 core/hostservices.c:422.
> 
> This is hservice_lid_load, so it's HBRT.
> 
> [15650025347,5]     0x00100010 hw/fsp/fsp-sensor.c:724.
> 
> sensor_buffer. Without looking closely, cannot work out if there's an
> "easy" way to not use this much memory for duration of
> runtime... perhaps this is something we just have to eat?
> 
> 
> [15651004495,5]     0x0007fb90 core/hostservices.c:422.
> 
> Huh... didn't expect to see this again... I wonder if we are leaking
> memory across load requests or if this is a different LID?
> 
> [15651055312,5]     0x00100010 core/nvram.c:228.
> 
> This is because we allocate a copy of NVRAM in OPAL and do all reads
> from our cache but write-through to NVRAM.
> 
> What we *should* have done of course is have an API that was async, but
> the Linux code now exists that does this:
> 
>         rc = opal_read_nvram(__pa(buf), count, off);
>         if (rc != OPAL_SUCCESS)
>                 return -EIO;
> 
> and that is the API: it doesn't block.

Correct. This is based on the original OPAL v2 design. It also helps
with platforms where the nvram isn't functional (such as BML). We should
just eat that one. However we can make the nvram smaller.
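
As a rough illustration of the scheme discussed above (reads served from
an in-memory copy, writes going through to the device), here is a small
sketch. It is not the core/nvram.c code: the names, the size and the
second array standing in for the real NVRAM are all assumptions made for
the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NVRAM_SIZE 0x1000	/* hypothetical; the real size is platform-defined */

static uint8_t nvram_cache[NVRAM_SIZE];		/* in-memory copy */
static uint8_t nvram_backing[NVRAM_SIZE];	/* stand-in for the real device */

/* Return 1 if [off, off + len) lies within the NVRAM image. */
static int nvram_range_ok(uint64_t off, uint64_t len)
{
	return off <= NVRAM_SIZE && len <= NVRAM_SIZE - off;
}

/* Reads are served from the cache only, so they never wait on the device. */
static int nvram_cached_read(void *dst, uint64_t off, uint64_t len)
{
	assert(dst != NULL);
	if (!nvram_range_ok(off, len))
		return -1;
	memcpy(dst, nvram_cache + off, len);
	return 0;
}

/* Writes update the cache and are passed through to the backing store. */
static int nvram_cached_write(const void *src, uint64_t off, uint64_t len)
{
	assert(src != NULL);
	if (!nvram_range_ok(off, len))
		return -1;
	memcpy(nvram_cache + off, src, len);
	/* Real code would start a device write here instead of a memcpy. */
	memcpy(nvram_backing + off, src, len);
	return 0;
}
```

The cost is the full-size copy in the heap, which is why shrinking the
nvram shrinks the allocation by the same amount.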

> We *could* still adapt, but we'd have to deprecate things over a long
> period of time, so it looks like we're stuck (unless NVRAM suddenly
> jumps in size and we *have* to fix this).
> 
> [15652018374,5]     0x00027a30 hw/phb3.c:4188.
> [15652359587,5]     0x00020010 hw/phb3.c:2217.
> [15652434447,5]     0x00027a30 hw/phb3.c:4188.
> [15652574357,5]     0x00027a30 hw/phb3.c:4188.
> [15652714381,5]     0x00027a30 hw/phb3.c:4188.
> [15652847063,5]     0x00027a30 hw/phb3.c:4188.
> [15652979102,5]     0x00027a30 hw/phb3.c:4188.
> [15653119254,5]     0x00027a30 hw/phb3.c:4188.
> [15653259644,5]     0x00027a30 hw/phb3.c:4188.
> 
> These look like the cost of doing PCI.

Right, the PHBs need those tables.

I think a good thing would be to try to limit the memory used for device
nodes and properties. We have a lot of 4-byte ones, I suspect; maybe we
could have a dedicated bitmap allocator for them? Also we could
definitely "factor" the ones containing 0. However it's quite a bit of
work.
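
A minimal sketch of what such a bitmap allocator for 4-byte values could
look like follows. Nothing here exists in skiboot; the names and pool
size are made up, and "factoring" zeros is read here as pointing every
all-zero value at one shared cell, which is only one possible
interpretation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CELLS 1024		/* hypothetical pool size */
#define BITS_PER_WORD 64

static uint32_t cells[NR_CELLS];			/* storage for 4-byte values */
static uint64_t cell_map[NR_CELLS / BITS_PER_WORD];	/* 1 bit per cell, set = in use */
static const uint32_t zero_cell = 0;			/* shared by every all-zero value */

/* Return a pointer to storage holding 'val', or NULL if the pool is full. */
static const uint32_t *cell_alloc(uint32_t val)
{
	size_t w, b;

	if (val == 0)
		return &zero_cell;	/* zeros consume no per-value storage */

	for (w = 0; w < NR_CELLS / BITS_PER_WORD; w++) {
		if (cell_map[w] == UINT64_MAX)
			continue;	/* every cell in this word is in use */
		for (b = 0; b < BITS_PER_WORD; b++) {
			if (!(cell_map[w] & (1ULL << b))) {
				cell_map[w] |= 1ULL << b;
				cells[w * BITS_PER_WORD + b] = val;
				return &cells[w * BITS_PER_WORD + b];
			}
		}
	}
	return NULL;
}

/* Release a cell; the shared zero cell is never freed. */
static void cell_free(const uint32_t *p)
{
	size_t idx;

	if (p == &zero_cell)
		return;
	assert(p >= cells && p < cells + NR_CELLS);
	idx = (size_t)(p - cells);
	cell_map[idx / BITS_PER_WORD] &= ~(1ULL << (idx % BITS_PER_WORD));
}
```

The bookkeeping overhead is one bit per value, with no per-allocation
header.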

Cheers,
Ben.