[mm v2 0/3] Support memory cgroup hotplug

Thu Nov 24 00:05:12 AEDT 2016

On 23/11/16 20:28, Michal Hocko wrote:
> On Wed 23-11-16 19:37:16, Balbir Singh wrote:
>>
>>
>> On 23/11/16 19:07, Michal Hocko wrote:
>>> On Wed 23-11-16 18:50:42, Balbir Singh wrote:
>>>>
>>>>
>>>> On 23/11/16 18:25, Michal Hocko wrote:
>>>>> On Wed 23-11-16 15:36:51, Balbir Singh wrote:
>>>>>> In the absence of hotplug we use extra memory proportional to
>>>>>> (possible_nodes - online_nodes) * number_of_cgroups. PPC64 has a patch
>>>>>> to disable large consumption with large number of cgroups. This patch
>>>>>> adds hotplug support to memory cgroups and reverts the commit that
>>>>>> limited possible nodes to online nodes.
>>>>>
>>>>> Balbir,
>>>>> I have asked this in the previous version but there still seems to be a
>>>>> lack of information of _why_ do we want this, _how_ much do we save on
>>>>> the memory overhead on most systems and _why_ the additional complexity
>>>>> is really worth it. Please make sure to add all this in the cover
>>>>> letter.
>>>>>
>>>>
>>>> The data is in the patch referred to in patch 3. The order of waste was
>>>> 200MB for 400 cgroup directories enough for us to restrict possible_map
>>>> to online_map. These patches allow us to have a larger possible map and
>>>> allow onlining nodes not in the online_map, which is currently a restriction
>>>> on ppc64.
>>>
>>> How common is to have possible_map >> online_map? If this is ppc64 then
>>> what is the downside of keeping the current restriction instead?
>>>
>>
>> On my system CONFIG_NODE_SHIFT is 8, 256 nodes and possible_nodes are 2
>> The downside is the ability to hotplug and online an offline node.
>> Please see http://www.spinics.net/lists/linux-mm/msg116724.html
> 
> OK, so we are slowly getting to what I've asked originally ;) So who
> cares? Depending on CONFIG_NODE_SHIFT (which tends to be quite large in
> distribution or other general purpose kernels) the overhead is 424B (as
> per pahole on the current kernel) for one numa node. Most machines are
> to be expected 1-4 numa nodes so the overhead might be somewhere around
> 100K per memcg (with 256 possible nodes). Not trivial amount for sure
> but I would rather encourage people to lower the possible node count for
> their hardware if it is artificially large.
> 

On my desktop NODES_SHIFT is 6, many distro kernels have it a 9. I've known
of solutions that use fake NUMA for partitioning and need as many nodes as
possible.

>>>> A typical system that I use has about 100-150 directories, depending on the
>>>> number of users/docker instances/configuration/virtual machines. These numbers
>>>> will only grow as we pack more of these instances on them.
>>>>
>>>> From a complexity view point, the patches are quite straight forward.
>>>
>>> Well, I would like to hear more about that. {get,put}_online_memory
>>> at random places doesn't sound all that straightforward to me.
>>>
>>
>> I thought those places were not random :) I tried to think them out as
>> discussed with Vladimir. I don't claim the code is bug free, we can fix
>> any bugs as we test this more.
> 
> I am more worried about synchronization with the hotplug which tends to
> be a PITA in places were we were simply safe by definition until now. We
> do not have all that many users of memcg->nodeinfo[nid] from what I can see
> but are all of them safe to never race with the hotplug. A lack of
> highlevel design description is less than encouraging.

As in explanation? The design is dictated by the notifier and the actions
to take when the node comes online/offline.

 So please try to
> spend some time describing how do we use nodeinfo currently and how is
> the synchronization with the hotplug supposed to work and what
> guarantees that no stale nodinfos can be ever used. This is just too
> easy to get wrong...
> 

OK.. I'll add that in the next cover letter

Balbir