[PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages

Sat Feb 16 02:26:49 AEDT 2019

On Thu, 14 Feb 2019, Jason Gunthorpe wrote:

> On Thu, Feb 14, 2019 at 01:46:51PM -0800, Ira Weiny wrote:
>
> > > > > Really unclear how to fix this. The pinned/locked split with two
> > > > > buckets may be the right way.
> > > >
> > > > Are you suggesting that we have 2 user limits?
> > >
> > > This is what RDMA has done since CL's patch.
> >
> > I don't understand?  What is the other _user_ limit (other than
> > RLIMIT_MEMLOCK)?
>
> With todays implementation RLIMIT_MEMLOCK covers two user limits,
> total number of pinned pages and total number of mlocked pages. The
> two are different buckets and not summed.

Applications were failing at some point because they were effectively
summed up. If you mlocked/pinned a dataset of more than half the memory of
a system then things would get really weird.

Also there is the possibility of even more duplication because pages can
be pinned by multiple kernel subsystems. So you could get more than
doubling of the number.

The sane thing was to account them separately so that mlocking and
pinning worked without apps failing and then wait for another genius
to find out how to improve the situation by getting the pinned page mess
under control.

It is not even advisable to check pinned pages against any limit because
pages can be pinned by multiple subsystems.

The main problem here is that we only have a refcount to indicate pinning
and no way to clearly distinguish long term from short pins. In order to
really fix this issue we would need to have a list of subsystems that have
taken long term pins on a page. But doing so would waste a lot of memory
and cause a significant performance regression.

And the discussions here seem to be meandering around these issues.
Nothing really that convinces me that we have a clean solution at hand.