[RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_online
Vitaly Kuznetsov
vkuznets at redhat.com
Sat Feb 25 01:10:29 AEDT 2017
Michal Hocko <mhocko at kernel.org> writes:
> On Thu 23-02-17 19:14:27, Vitaly Kuznetsov wrote:
>> Michal Hocko <mhocko at kernel.org> writes:
>>
>> > On Thu 23-02-17 17:36:38, Vitaly Kuznetsov wrote:
>> >> Michal Hocko <mhocko at kernel.org> writes:
>> > [...]
>> >> > Is growing from 256M -> 128GB really something that happens in real life?
>> >> > Don't get me wrong, but to me this sounds quite exaggerated. Memory hotadd,
>> >> > which is an operation that has to allocate memory, has to scale with the
>> >> > currently available memory IMHO.
>> >>
>> >> With virtual machines this is very real and not exaggerated at
>> >> all. E.g. a Hyper-V host can be tuned to automatically add new memory when
>> >> the guest is running out of it. Even 100 blocks can represent an issue.
>> >
>> > Do you have any reference to a bug report? I am really curious because
>> > something really smells wrong and it is not clear that the chosen
>> > solution is really the best one.
>>
>> Unfortunately I'm not aware of any publicly posted bug reports (CC:
>> K. Y. - he may have a reference) but I think I still remember everything
>> correctly. Not sure how deep you want me to go into details though...
>
> As much as possible to understand what was really going on...
>
>> Virtual guests under stress were getting into OOM easily and the OOM
>> killer was even killing the udev process trying to online the
>> memory.
>
> Do you happen to have any OOM report? I am really surprised that udev
> would be an oom victim because that process is really small. Who is
> consuming all the memory then?
It's been a while since I worked on this and unfortunately I don't have
a log. From what I remember, the kernel itself was consuming all the
memory, so *all* processes were victims.
>
> Have you measured how much memory we need to allocate to add one
> memblock?
No, but it's actually a good idea if we decide to do some sort of
pre-allocation. I just did a quick (and probably dirty) test: increasing
guest memory from 4G to 8G (32 x 128MB blocks) requires 68MB of memory,
so it's roughly 2MB per block. It's really easy to trigger an OOM on
small guests.
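
That ~2MB per 128MB block figure is consistent with the memmap alone:
every hot-added page needs a struct page, so the allocation scales with
the size of the added block rather than with the number of blocks. A
back-of-the-envelope check (my own arithmetic, assuming a 64-byte
struct page on x86_64 - not a number from the measurement above):

	#include <stdio.h>

	int main(void)
	{
		const unsigned long block = 128UL << 20; /* one 128MB memory block */
		const unsigned long page  = 4096;        /* base page size */
		const unsigned long desc  = 64;          /* sizeof(struct page), x86_64 */

		/* 32768 pages * 64 bytes = 2MB of memmap per hot-added block */
		printf("memmap per block: %lu KiB\n", (block / page) * desc / 1024);
		return 0;
	}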
>
>> There was a workaround for the issue added to the hyper-v driver
>> doing memory add:
>>
>> hv_mem_hot_add(...) {
>> 	...
>> 	add_memory(....);
>> 	wait_for_completion_timeout(..., 5*HZ);
>> 	...
>> }
>
> I can still see
> 	/*
> 	 * Wait for the memory block to be onlined when memory onlining
> 	 * is done outside of kernel (memhp_auto_online). Since the hot
> 	 * add has succeeded, it is ok to proceed even if the pages in
> 	 * the hot added region have not been "onlined" within the
> 	 * allowed time.
> 	 */
> 	if (dm_device.ha_waiting)
> 		wait_for_completion_timeout(&dm_device.ol_waitevent,
> 					    5*HZ);
>
See

	dm_device.ha_waiting = !memhp_auto_online;

30 lines above. The workaround is still there for the udev case, and it
is still equally bad.
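
Putting the two quoted fragments together, the driver-side flow looks
roughly like the sketch below (a paraphrase of that era's
drivers/hv/hv_balloon.c, not verbatim code; the MEM_ONLINE memory
notifier that completes ol_waitevent lives elsewhere in the driver):

	static void hv_mem_hot_add(...)
	{
		...
		/* only wait for udev if the kernel won't online the block itself */
		dm_device.ha_waiting = !memhp_auto_online;
		init_completion(&dm_device.ol_waitevent);

		add_memory(nid, PFN_PHYS(start_pfn), HA_CHUNK << PAGE_SHIFT);

		/*
		 * The MEM_ONLINE notifier calls complete(&dm_device.ol_waitevent)
		 * once udev onlines the block; give up and proceed anyway after
		 * 5 seconds so a stuck userspace cannot block the hot add forever.
		 */
		if (dm_device.ha_waiting)
			wait_for_completion_timeout(&dm_device.ol_waitevent, 5*HZ);
		...
	}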
>> the completion was done by watching for the MEM_ONLINE event. This, of
>> course, was slowing things down significantly, and waiting for a
>> userspace action in the kernel is not a nice thing to have (not to
>> mention all the other memory-adding methods which had the same issue).
>> Just removing this wait led us to the same OOM: the hypervisor kept
>> adding more and more memory until eventually even add_memory() was
>> failing, udev and other processes were killed, ...
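
For reference, the userspace action being waited on above is
conventionally a udev rule along these lines (the usual distribution
rule for onlining new memory blocks, shown for context rather than
taken from this thread):

	SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"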
>
> Yes, I agree that waiting on a user action from the kernel is very far
> from ideal.
>
>> With the feature in place we have new memory available right after we do
>> add_memory(); everything is serialized.
>
> What prevented you from onlining the memory explicitly from
> hv_mem_hot_add path? Why do you need a user-visible policy for that at
> all? You could also add a parameter to add_memory that would do the same
> thing. Or am I missing something?
We have different mechanisms for adding memory; I'm aware of at least
three: ACPI, Xen, Hyper-V. The issue I'm addressing is general enough
that I'm pretty sure I can reproduce it on Xen, for example - just boot
a small guest and try adding tons of memory. Why should we have
different defaults for different technologies?
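
For context, memhp_auto_online is handled in common code, so it covers
all of these mechanisms at once: when the knob is set, add_memory()
onlines the new blocks itself before returning. A rough sketch of the
4.x-era mm/memory_hotplug.c path (paraphrased from memory, not
verbatim; the patch in the subject line changes the onlining step to go
through device_online() so that the driver core's dev->offline state
and the sysfs "online" attribute stay in sync):

	static int online_memory_block(struct memory_block *mem, void *arg)
	{
		/* was: memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE); */
		return device_online(&mem->dev);
	}

	int add_memory_resource(int nid, struct resource *res, bool online)
	{
		...
		/* create the memmap, sections and sysfs memoryXX devices */
		...
		/* online the new blocks right away when memhp_auto_online is set */
		if (online)
			walk_memory_range(PFN_DOWN(res->start),
					  PFN_UP(res->start + resource_size(res) - 1),
					  NULL, online_memory_block);
		...
	}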
And, BTW, the link to the previous discussion:
https://groups.google.com/forum/#!msg/linux.kernel/AxvyuQjr4GY/TLC-K0sL_NEJ
--
Vitaly