[RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_online
Vitaly Kuznetsov
vkuznets at redhat.com
Sat Feb 25 01:10:29 AEDT 2017
Michal Hocko <mhocko at kernel.org> writes:
> On Thu 23-02-17 19:14:27, Vitaly Kuznetsov wrote:
>> Michal Hocko <mhocko at kernel.org> writes:
>>
>> > On Thu 23-02-17 17:36:38, Vitaly Kuznetsov wrote:
>> >> Michal Hocko <mhocko at kernel.org> writes:
>> > [...]
>> >> > Is growing from 256M -> 128GB really something that happens in real life?
>> >> > Don't get me wrong, but to me this sounds quite exaggerated. Memory hotadd,
>> >> > which is an operation that has to allocate memory, has to scale with the
>> >> > currently available memory IMHO.
>> >>
>> >> With virtual machines this is very real and not exaggerated at
>> >> all. E.g. a Hyper-V host can be tuned to automatically add new memory when
>> >> the guest is running out of it. Even 100 blocks can represent an issue.
>> >
>> > Do you have any reference to a bug report? I am really curious because
>> > something really smells wrong and it is not clear that the chosen
>> > solution is really the best one.
>>
>> Unfortunately I'm not aware of any publicly posted bug reports (CC:
>> K. Y. - he may have a reference) but I think I still remember everything
>> correctly. Not sure how deep you want me to go into details though...
>
> As much as possible to understand what was really going on...
>
>> Virtual guests under stress were getting into OOM easily and the OOM
>> killer was even killing the udev process trying to online the
>> memory.
>
> Do you happen to have any OOM report? I am really surprised that udev
> would be an oom victim because that process is really small. Who is
> consuming all the memory then?
It's been a while since I worked on this and unfortunately I don't have
a log. From what I remember, the kernel itself was consuming all the
memory, so *all* processes were victims.
>
> Have you measured how much memory we need to allocate to add one
> memblock?
No, but it's actually a good idea if we decide to do some sort of
pre-allocation. I just did a quick (and probably dirty) test: increasing
guest memory from 4G to 8G (32 x 128MB blocks) requires 68MB of memory,
so it's roughly 2MB per block. It's really easy to trigger an OOM on
small guests.
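
That ~2MB per 128MB block figure is consistent with the memmap alone:
every hot-added page needs a struct page, so the allocation scales with
the size of the added block rather than with the number of blocks. A
back-of-the-envelope check (my own arithmetic, assuming a 64-byte
struct page on x86_64 - not a number from the measurement above):

	#include <stdio.h>

	int main(void)
	{
		const unsigned long block = 128UL << 20; /* one 128MB memory block */
		const unsigned long page  = 4096;        /* base page size */
		const unsigned long desc  = 64;          /* sizeof(struct page), x86_64 */

		/* 32768 pages * 64 bytes = 2MB of memmap per hot-added block */
		printf("memmap per block: %lu KiB\n", (block / page) * desc / 1024);
		return 0;
	}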
>
>> There was a workaround for the issue added to the hyper-v driver
>> doing memory add:
>>
>> hv_mem_hot_add(...) {
>> 	...
>> 	add_memory(....);
>> 	wait_for_completion_timeout(..., 5*HZ);
>> 	...
>> }
>
> I can still see
> 	/*
> 	 * Wait for the memory block to be onlined when memory onlining
> 	 * is done outside of kernel (memhp_auto_online). Since the hot
> 	 * add has succeeded, it is ok to proceed even if the pages in
> 	 * the hot added region have not been "onlined" within the
> 	 * allowed time.
> 	 */
> 	if (dm_device.ha_waiting)
> 		wait_for_completion_timeout(&dm_device.ol_waitevent,
> 					    5*HZ);
>
See

	dm_device.ha_waiting = !memhp_auto_online;

30 lines above. The workaround is still there for the udev case, and it
is still equally bad.
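
Putting the two quoted fragments together, the driver-side flow looks
roughly like the sketch below (a paraphrase of that era's
drivers/hv/hv_balloon.c, not verbatim code; the MEM_ONLINE memory
notifier that completes ol_waitevent lives elsewhere in the driver):

	static void hv_mem_hot_add(...)
	{
		...
		/* only wait for udev if the kernel won't online the block itself */
		dm_device.ha_waiting = !memhp_auto_online;
		init_completion(&dm_device.ol_waitevent);

		add_memory(nid, PFN_PHYS(start_pfn), HA_CHUNK << PAGE_SHIFT);

		/*
		 * The MEM_ONLINE notifier calls complete(&dm_device.ol_waitevent)
		 * once udev onlines the block; give up and proceed anyway after
		 * 5 seconds so a stuck userspace cannot block the hot add forever.
		 */
		if (dm_device.ha_waiting)
			wait_for_completion_timeout(&dm_device.ol_waitevent, 5*HZ);
		...
	}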
>> the completion was done by watching for the MEM_ONLINE event. This, of
>> course, was slowing things down significantly, and waiting for a
>> userspace action in the kernel is not a nice thing to have (not to
>> mention all the other memory-adding methods which had the same issue).
>> Just removing this wait led us to the same OOM: the hypervisor kept
>> adding more and more memory until eventually even add_memory() was
>> failing, udev and other processes were killed, ...
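
For reference, the userspace action being waited on above is
conventionally a udev rule along these lines (the usual distribution
rule for onlining new memory blocks, shown for context rather than
taken from this thread):

	SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"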
>
> Yes, I agree that waiting on a user action from the kernel is very far
> from ideal.
>
>> With the feature in place we have new memory available right after we do
>> add_memory(); everything is serialized.
>
> What prevented you from onlining the memory explicitly from
> hv_mem_hot_add path? Why do you need a user-visible policy for that at
> all? You could also add a parameter to add_memory that would do the same
> thing. Or am I missing something?
We have different mechanisms for adding memory; I'm aware of at least
three: ACPI, Xen, Hyper-V. The issue I'm addressing is general enough
that I'm pretty sure I can reproduce it on Xen, for example - just boot
a small guest and try adding tons of memory. Why should we have
different defaults for different technologies?
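
For context, memhp_auto_online is handled in common code, so it covers
all of these mechanisms at once: when the knob is set, add_memory()
onlines the new blocks itself before returning. A rough sketch of the
4.x-era mm/memory_hotplug.c path (paraphrased from memory, not
verbatim; the patch in the subject line changes the onlining step to go
through device_online() so that the driver core's dev->offline state
and the sysfs "online" attribute stay in sync):

	static int online_memory_block(struct memory_block *mem, void *arg)
	{
		/* was: memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE); */
		return device_online(&mem->dev);
	}

	int add_memory_resource(int nid, struct resource *res, bool online)
	{
		...
		/* create the memmap, sections and sysfs memoryXX devices */
		...
		/* online the new blocks right away when memhp_auto_online is set */
		if (online)
			walk_memory_range(PFN_DOWN(res->start),
					  PFN_UP(res->start + resource_size(res) - 1),
					  NULL, online_memory_block);
		...
	}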
And, BTW, the link to the previous discussion:
https://groups.google.com/forum/#!msg/linux.kernel/AxvyuQjr4GY/TLC-K0sL_NEJ
--
Vitaly