[RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline

Fri Feb 24 02:09:20 AEDT 2017

On Thu 23-02-17 14:31:24, Vitaly Kuznetsov wrote:
> Michal Hocko <mhocko at kernel.org> writes:
> 
> > On Wed 22-02-17 10:32:34, Vitaly Kuznetsov wrote:
> > [...]
> >> > There is a workaround in that a user could online the memory or have
> >> > a udev rule to online the memory by using the sysfs interface. The
> >> > sysfs interface to online memory goes through device_online() which
> >> > should update the dev->offline flag. I'm not sure that having kernel
> >> > memory hotplug rely on userspace actions is the correct way to go.
> >> 
> >> Using udev rule for memory onlining is possible when you disable
> >> memhp_auto_online but in some cases it doesn't work well, e.g. when we
> >> use memory hotplug to address memory pressure the loop through userspace
> >> is really slow and memory consuming, we may hit OOM before we manage to
> >> online newly added memory.
> >
> > How does the in-kernel implementation prevent that?
> >
> 
> Onlining memory on hot-plug is much more reliable, e.g. if we were able
> to add it in add_memory_resource() we'll also manage to online it.

How does that differ from initiating online from the users?

> With
> udev rule we may end up adding many blocks and then (as udev is
> asynchronous) failing to online any of them.

Why would it fail?

> In-kernel operation is synchronous.

which doesn't mean anything as the context is preemptible AFAICS.

> >> In addition to that, systemd/udev folks
> >> continuously refused to add this udev rule to udev calling it stupid as
> >> it actually is an unconditional and redundant ping-pong between kernel
> >> and udev.
> >
> > This is a policy and as such it doesn't belong to the kernel. The whole
> > auto-enable in the kernel is just plain wrong IMHO and we shouldn't have
> > merged it.
> 
> I disagree.
> 
> First of all it's not a policy, it is a default. We have many other
> defaults in kernel. When I add a network card or a storage, for example,
> I don't need to go anywhere and 'enable' it before I'm able to use
> it from userspace. And for memory (and CPUs) we, for some unknown reason,
> opted for something completely different. If someone is plugging new
> memory into a box he probably wants to use it, I don't see much value in
> waiting for a special confirmation from him. 

This was not my decision so I can only guess but to me it makes sense.
Both memory and cpus can be physically present and offline which is a
perfectly reasonable state. So having a two-phase physical hotadd is
just built on top of the physical vs. logical distinction. I completely
understand that some usecases will really like to online the whole node
as soon as it appears present. But an automatic in-kernel implementation
has its downsides - e.g. if this operation fails in the middle you will
not know about that unless you check all the memblocks in sysfs. This is
really a poor interface.
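
To illustrate what "check all the memblocks in sysfs" means in practice,
here is a minimal sketch. It assumes the usual layout where each block is
exposed as /sys/devices/system/memory/memoryN/state and reads "online"
once the block is in use. The function names are made up for this
example and are not part of any kernel or udev tooling.

```python
import glob
import os


def memory_block_states(sysfs_root="/sys/devices/system/memory"):
    """Map each memory block directory name (e.g. 'memory32') to its state string."""
    states = {}
    # Assumption: every block directory is named memoryN and has a 'state' file.
    for state_path in glob.glob(os.path.join(sysfs_root, "memory[0-9]*", "state")):
        block = os.path.basename(os.path.dirname(state_path))
        try:
            with open(state_path) as f:
                states[block] = f.read().strip()
        except OSError:
            # The block may have been removed between the glob and the read.
            continue
    return states


def not_online(states):
    """Return the sorted names of blocks whose state is anything but 'online'."""
    return sorted(name for name, state in states.items() if state != "online")


if __name__ == "__main__":
    left_behind = not_online(memory_block_states())
    print("blocks not online:", ", ".join(left_behind) or "none")
```

Nothing here tells you why a block was left offline, only that it was,
and somebody has to run the scan after every hot-add to find out.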

> Second, this feature is optional. If you want to keep old behavior just
> don't enable it.

It just adds unnecessary configuration noise as well.

> Third, this solves real world issues. With Hyper-V it is very easy to
> show udev failing on stress. 

What is the reason for these failures? Do you have any link handy?

> No other solution to the issue was ever suggested.

you mean like using ballooning for the memory overcommit like other more
reasonable virtualization solutions?

-- 
Michal Hocko
SUSE Labs