[PATCH] slub: Don't throw away partial remote slabs if there is no local memory

Nishanth Aravamudan nacc at linux.vnet.ibm.com
Thu Feb 6 13:07:57 EST 2014


On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> 
> > Thank you for clarifying and providing a test patch. I ran with this on
> > the system showing the original problem, configured to have 15GB of
> > memory.
> > 
> > With your patch after boot:
> > 
> > MemTotal:       15604736 kB
> > MemFree:         8768192 kB
> > Slab:            3882560 kB
> > SReclaimable:     105408 kB
> > SUnreclaim:      3777152 kB
> > 
> > With Anton's patch after boot:
> > 
> > MemTotal:       15604736 kB
> > MemFree:        11195008 kB
> > Slab:            1427968 kB
> > SReclaimable:     109184 kB
> > SUnreclaim:      1318784 kB
> > 
> > 
> > I know that's fairly unscientific, but the numbers are reproducible. 
> > 
> 
> I don't think the goal of the discussion is to reduce the amount of slab 
> allocated, but rather get the most local slab memory possible by use of 
> kmalloc_node().  When a memoryless node is being passed to kmalloc_node(), 
> which is probably cpu_to_node() for a cpu bound to a node without memory, 
> my patch is allocating it on the most local node; Anton's patch is 
> allocating it on whatever node the cpu slab happened to be on.
> 
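
(To make sure I have the scenario right: the calling pattern we care
about is roughly the one below. This is just an illustration for the
discussion, not code from either patch.)

static void *alloc_near_cpu(size_t size, int cpu)
{
	/* On this LPAR, cpu_to_node() can return node 0, which has
	 * no memory at all. */
	int node = cpu_to_node(cpu);

	/*
	 * This is where the two patches differ: with the hunk quoted
	 * below, a memoryless node gets remapped to numa_mem_id() (the
	 * nearest node with memory), while Anton's patch allocates from
	 * whatever node the current cpu slab happens to be on.
	 */
	return kmalloc_node(size, GFP_KERNEL, node);
}
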
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -2278,10 +2278,14 @@ redo:
> > > 
> > >  	if (unlikely(!node_match(page, node))) {
> > >  		stat(s, ALLOC_NODE_MISMATCH);
> > > -		deactivate_slab(s, page, c->freelist);
> > > -		c->page = NULL;
> > > -		c->freelist = NULL;
> > > -		goto new_slab;
> > > +		if (unlikely(!node_present_pages(node)))
> > > +			node = numa_mem_id();
> > > +		if (!node_match(page, node)) {
> > > +			deactivate_slab(s, page, c->freelist);
> > > +			c->page = NULL;
> > > +			c->freelist = NULL;
> > > +			goto new_slab;
> > > +		}
> > 
> > Semantically, and please correct me if I'm wrong, this patch is saying
> > if we have a memoryless node, we expect the page's locality to be that
> > of numa_mem_id(), and we still deactivate the slab if that isn't true.
> > Just wanting to make sure I understand the intent.
> > 
> 
> Yeah, the default policy should be to fallback to local memory if the node 
> passed is memoryless.
> 
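
Got it, thanks. Just to make sure we are on the same page, here is my
reading of that fallback pulled out as a standalone check (a sketch
only; the helper name is mine, not from either patch):

static bool cpu_slab_acceptable(struct page *page, int node)
{
	if (node == NUMA_NO_NODE)
		return true;		/* caller had no node preference */

	/* Memoryless node requested: fall back to the nearest node
	 * with memory for the current CPU. */
	if (!node_present_pages(node))
		node = numa_mem_id();

	/* Keep the cpu slab only if it matches the (possibly remapped)
	 * node; otherwise the slow path deactivates it as before. */
	return page_to_nid(page) == node;
}
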
> > What I find odd is that there are only 2 nodes on this system, node 0
> > (empty) and node 1. So won't numa_mem_id() always be 1? And every page
> > should be coming from node 1 (thus node_match() should always be true?)
> > 
> 
> The nice thing about slub is its debugging ability; what is 
> /sys/kernel/slab/cache/objects showing in comparison between the two 
> patches?

Ok, I finally got around to writing a script that compares the objects
output from both kernels.

log1 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
and Joonsoo's patch.

log2 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
and Anton's patch.

slab                           objects    objects   percent change
                               log1       log2      (vs. log1)
-------------------------------------------------------------------
:t-0000104                     71190      85680      20.353982 %
UDP                            4352       3392       22.058824 %
inode_cache                    54302      41923      22.796582 %
fscache_cookie_jar             3276       2457       25.000000 %
:t-0000896                     438        292        33.333333 %
:t-0000080                     310401     195323     37.073978 %
ext4_inode_cache               335        201        40.000000 %
:t-0000192                     89408      128898     44.168307 %
:t-0000184                     151300     81880      45.882353 %
:t-0000512                     49698      73648      48.191074 %
:at-0000192                    242867     120948     50.199904 %
xfs_inode                      34350      15221      55.688501 %
:t-0016384                     11005      17257      56.810541 %
proc_inode_cache               103868     34717      66.575846 %
tw_sock_TCP                    768        256        66.666667 %
:t-0004096                     15240      25672      68.451444 %
nfs_inode_cache                1008       315        68.750000 %
:t-0001024                     14528      24720      70.154185 %
:t-0032768                     655        1312       100.305344 %
:t-0002048                     14242      30720      115.700042 %
:t-0000640                     1020       2550       150.000000 %
:t-0008192                     10005      27905      178.910545 %
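
(The percent column is the magnitude of the change relative to log1, so
it mixes shrinks and growths: UDP shrinks from 4352 to 3392 objects,
(4352 - 3392) / 4352 = 22.06%, while :t-0000104 grows from 71190 to
85680, (85680 - 71190) / 71190 = 20.35%.)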

FWIW, the configuration of this LPAR has slightly changed. It is now configured
for a maximum of 400 CPUs, of which 200 are present. The result is that even with
Joonsoo's patch (log1 above), we OOM pretty easily and Anton's slab usage
script reports:

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-512                        1182 MB    2.03%  100.00%
kmalloc-192                        1182 MB    1.38%  100.00%
kmalloc-16384                       966 MB   17.66%  100.00%
kmalloc-4096                        353 MB   15.92%  100.00%
kmalloc-8192                        259 MB   27.28%  100.00%
kmalloc-32768                       207 MB    9.86%  100.00%
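
(If I'm reading the script output right, that first line means only
about 24 MB of live kmalloc-512 objects (2.03% of 1182 MB) are pinning
over a gigabyte of slab memory, which would explain why this
configuration OOMs so easily.)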

In comparison (log2 above):

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                       273 MB   98.76%  100.00%
kmalloc-8192                        225 MB   98.67%  100.00%
pgtable-2^11                        114 MB  100.00%  100.00%
pgtable-2^12                        109 MB  100.00%  100.00%
kmalloc-4096                        104 MB   98.59%  100.00%

I appreciate all the help so far. If anyone has ideas on how best to
proceed, or on what they'd like debugged further, I'm happy to keep
digging until this is fixed. We're hitting this on a couple of different
systems and I'd like to find a good resolution to the problem.

Thanks,
Nish


