[RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline

Fri Feb 24 03:36:38 AEDT 2017

Michal Hocko <mhocko at kernel.org> writes:

> On Thu 23-02-17 16:49:06, Vitaly Kuznetsov wrote:
>> Michal Hocko <mhocko at kernel.org> writes:
>> 
>> > On Thu 23-02-17 14:31:24, Vitaly Kuznetsov wrote:
>> >> Michal Hocko <mhocko at kernel.org> writes:
>> >> 
>> >> > On Wed 22-02-17 10:32:34, Vitaly Kuznetsov wrote:
>> >> > [...]
>> >> >> > There is a workaround in that a user could online the memory or have
>> >> >> > a udev rule to online the memory by using the sysfs interface. The
>> >> >> > sysfs interface to online memory goes through device_online() which
>> >> >> > should updated the dev->offline flag. I'm not sure that having kernel
>> >> >> > memory hotplug rely on userspace actions is the correct way to go.
>> >> >> 
>> >> >> Using udev rule for memory onlining is possible when you disable
>> >> >> memhp_auto_online but in some cases it doesn't work well, e.g. when we
>> >> >> use memory hotplug to address memory pressure the loop through userspace
>> >> >> is really slow and memory consuming, we may hit OOM before we manage to
>> >> >> online newly added memory.
>> >> >
>> >> > How does the in-kernel implementation prevents from that?
>> >> >
>> >> 
>> >> Onlining memory on hot-plug is much more reliable, e.g. if we were able
>> >> to add it in add_memory_resource() we'll also manage to online it.
>> >
>> > How does that differ from initiating online from the users?
>> >
>> >> With
>> >> udev rule we may end up adding many blocks and then (as udev is
>> >> asynchronous) failing to online any of them.
>> >
>> > Why would it fail?
>> >
>> >> In-kernel operation is synchronous.
>> >
>> > which doesn't mean anything as the context is preemptible AFAICS.
>> >
>> 
>> It actually does,
>> 
>> imagine the following example: you run a small guest (256M of memory)
>> and now there is a request to add 1000 128mb blocks to it. 
>
> Is a grow from 256M -> 128GB really something that happens in real life?
> Don't get me wrong but to me this sounds quite exaggerated. Hotmem add
> which is an operation which has to allocate memory has to scale with the
> currently available memory IMHO.

With virtual machines this is very real and not exaggerated at
all. E.g. Hyper-V host can be tuned to automatically add new memory when
guest is running out of it. Even 100 blocks can represent an issue.

>
>> In case you
>> do it the old way you're very likely to get OOM somewhere in the middle
>> as you keep adding blocks which requere kernel memory and nobody is
>> onlining it (or, at least you're racing with the onliner). With
>> in-kernel implementation we're going to online the first block when it's
>> added and only then go to the second.
>
> Yes, adding a memory will cost you some memory and that is why I am
> really skeptical when memory hotplug is used under a strong memory
> pressure. This can lead to OOMs even when you online one block at the
> time.

If you can't allocate anything then yes, of course it will fail. But if
you try to add many blocks without onlining at the same time the
probability of failure is orders of magniture higher.

(a bit unrelated) I was actually thinking about the possible failure and
had the following idea in my head: we always keep everything allocated
for one additional memory block so when hotplug happens we use this
reserved space to add the block, online it and immediately reserve space
for the next one. I didn't do any coding yet.

>
> [...]
>> > This was not my decision so I can only guess but to me it makes sense.
>> > Both memory and cpus can be physically present and offline which is a
>> > perfectly reasonable state. So having a two phase physicall hotadd is
>> > just built on top of physical vs. logical distinction. I completely
>> > understand that some usecases will really like to online the whole node
>> > as soon as it appears present. But an automatic in-kernel implementation
>> > has its down sites - e.g. if this operation fails in the middle you will
>> > not know about that unless you check all the memblocks in sysfs. This is
>> > really a poor interface.
>> 
>> And how do you know that some blocks failed to online with udev?
>
> Because the udev will run a code which can cope with that - retry if the
> error is recoverable or simply report with all the details. Compare that
> to crawling the system log to see that something has broken...

I don't know much about udev, but the most common rule to online memory
I've met is:

SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline",  ATTR{state}="online"

doesn't do anything smart.

In current RHEL7 it is even worse:

SUBSYSTEM=="memory", ACTION=="add", PROGRAM="/bin/uname -p", RESULT!="s390*", ATTR{state}=="offline", ATTR{state}="online"

so to online new memory block we actually need to run a process.

>
>> Who
>> handles these failures and how? And, the last but not least, why do
>> these failures happen?
>
> I haven't heard reports about the failures and from looking into the
> code those are possible but very unlikely.

My point is - failures are possible, yes, but in the most common
use-case if we hot-plugged some memory we most probably want to use it
and the feature does that. I'd be glad to hear about possible
improvemets to it of course.

-- 
  Vitaly