[PATCH 00/14] Present useful limits to user (v2)

Doug Ledford dledford at redhat.com
Tue Jul 19 08:05:31 AEST 2016

On 7/15/2016 12:35 PM, Topi Miettinen wrote:
> On 07/15/16 13:04, Balbir Singh wrote:
>> On Fri, Jul 15, 2016 at 01:35:47PM +0300, Topi Miettinen wrote:
>>> Hello,
>>> There are many basic ways to control processes, including capabilities,
>>> cgroups and resource limits. However, there are far fewer ways to find out
>>> useful values for the limits, except blind trial and error.
>>> This patch series attempts to fix that by giving at least a nice starting
>>> point from the highwater mark values of the resources in question.
>>> I looked where each limit is checked and added a call to update the mark
>>> nearby.
>>> Example run of program from Documentation/accounting/getdelays.c:
>>> ./getdelays -R -p `pidof smartd`
>>> printing resource accounting
>>> RLIMIT_DATA=18198528
>>> RLIMIT_STACK=135168
>>> RLIMIT_AS=130879488
>>> ./getdelays -R -C /sys/fs/cgroup/systemd/system.slice/smartd.service/
>>> printing resource accounting
>>> sleeping 1, blocked 0, running 0, stopped 0, uninterruptible 0
>>> RLIMIT_DATA=18198528
>>> RLIMIT_STACK=135168
>>> RLIMIT_AS=130879488
>> Does this mean that rlimit_data and rlimit_stack should be set to the
>> values as specified by the data above?
> My plan is that either system administrator, distro maintainer or even
> upstream developer can get reasonable values for the limits. They may
> still be wrong, but things would be better than without any help to
> configure the system.

This is not necessarily true.  It seems like there is a disconnect
between what these various values are for and what you are positioning
them as.  Most of these limits are meant to protect the system from
resource starvation crashes.  They aren't meant to be any sort of double
check on a specific application.  The vast majority of applications can
have bugs, leak resources, and do all sorts of other bad things and
still not hit these limits.  A program that leaks a file handle an hour
but only normally has 50 handles in use would take 950 hours of constant
leaking before these limits would kick in to bring the program under
control.  That's over a month.  What's more though, the kernel couldn't
really care less that a single application leaked files until it got to
1000 open.  The real point of the limit on file handles (since they are
cheap) is just not to let the system get brought down.  Someone could
maliciously fire up 1000 processes, and they could all attempt to open
up as many files as possible in order to drown the system in open
inodes.  The combination of the limit on maximum user processes and
maximum files per process is intended to prevent this.  They are not
intended to prevent a single, properly running application from
operating.  In fact, there are very few applications that are likely to
break the 1000 file per process limit.  It is outrageously high for most
applications.  They will leak files and do all sorts of bad things
without this ever stopping them.  But it does stop malicious programs.
And the process limit stops malicious users too.  The max locked memory
is used by almost no processes, and for the very few that use it, the
default is more than enough.  The major exception is the RDMA stack,
which uses it so much that we just disable the limit on large systems because
it's impossible to predict how much we'll need and we don't want a job
to get killed because it couldn't get the memory it needs for buffers.
The limit on POSIX message queues is another one where it's more than
enough for most applications which don't use this feature at all, and
the few systems that use this feature adjust the limit to something sane
on their system (we can't make the default sane for these special
systems or else it becomes an avenue for Denial of Service attack, so
the default must stay low and servers that make extensive use of this
feature must up their limit on a case by case basis).
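
To make the file handle arithmetic above concrete, here is a minimal
sketch.  The function name and parameters are made up for illustration;
the numbers are the ones from the example (50 handles in normal use, a
limit of 1000, one leaked handle per hour).

```python
def hours_until_limit(in_use: int, limit: int, leaked_per_hour: float) -> float:
    """Hours of steady leaking before `in_use` handles grow to `limit`."""
    if leaked_per_hour <= 0:
        raise ValueError("leak rate must be positive")
    # A process already at or over the limit has no headroom left.
    return max(limit - in_use, 0) / leaked_per_hour


# Numbers from the example: 50 handles in use, limit of 1000, 1 leak per hour.
hours = hours_until_limit(in_use=50, limit=1000, leaked_per_hour=1)
print(f"{hours:.0f} hours, about {hours / 24:.1f} days")
# prints "950 hours, about 39.6 days"
```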

>> Do we expect a smart user space daemon to then tweak the RLIMIT values?
> Someone could write an autotuning daemon that checks if the system has
> changed (for example due to upgrade) and then run some tests to
> reconfigure the system. But the limits are a bit too fragile, or rather,
> applications can't handle failure, so I don't know if that would really
> work.

This misses the point of most of these limits.  They aren't there to
keep normal processes and normal users in check.  They are there to stop
runaway use.  This runaway situation might be accidental, or it might be
a nefarious user.  The limits are generally set exceedingly high for those
things every application uses, and fairly low for those things that
almost no application uses but which could be abused by the nefarious
user crowd.

Moreover, for a large percentage of applications, the highwatermark is a
source of great trickery.  For instance, if you have a web server that
is hosting web pages written in Python, and is therefore using
mod_python in the httpd server (assuming Apache here), then your
highwatermark will never be a reliable, stable thing.  If you get 1000
web requests in a minute, all utilizing the mod_python resource in the
web server, and you don't have your httpd configured to restart after
every few hundred requests handled, then mod_python in your httpd
process will grow seemingly without limit.  It will consume tons of
memory.  And the only limit on how much memory it will consume is
determined by how many web requests it handles in between its garbage
collection intervals * how much memory it allocates per request.  If you
don't happen to catch the absolute highest amount while you are
gathering your watermarks, then when you actually switch the system to
enforcing the limits you learned from all your highwatermarks (you are
planning on doing that aren't you?....I didn't see a copy of the patch
1/14, so I don't know if this infrastructure ever goes back to enforcing
the limits or not, but I would assume so, what point is there in
learning what the limits should be if you then never turn around and
enforce them?), load spikes will cause random program failures.
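
A rough sketch of why a learned highwatermark is fragile.  This is a toy
model, not a measurement of mod_python: the baseline footprint, the
per-request allocation, and the observed request counts are all
hypothetical numbers picked for illustration.  The only thing taken from
the description above is the shape of the formula (requests handled
between collections * memory allocated per request).

```python
def peak_memory(requests_between_gc: int, bytes_per_request: int, baseline: int) -> int:
    """Toy model: memory held just before a collection runs."""
    return baseline + requests_between_gc * bytes_per_request


BASELINE = 50 * 2**20       # hypothetical idle footprint: 50 MiB
PER_REQUEST = 256 * 2**10   # hypothetical allocation per request: 256 KiB

# Hypothetical loads seen while the watermark was being gathered.
observed_loads = [40, 75, 120, 90]
learned_limit = max(peak_memory(n, PER_REQUEST, BASELINE) for n in observed_loads)

# A spike of 1000 requests between collections, never seen while measuring.
needed = peak_memory(1000, PER_REQUEST, BASELINE)

print(f"learned limit: {learned_limit // 2**20} MiB")
print(f"needed during spike: {needed // 2**20} MiB")
print("spike exceeds learned limit:", needed > learned_limit)
```

With these made-up numbers the learned limit comes out at 80 MiB and the
spike needs 300 MiB, so a limit enforced at the watermark would fail the
process exactly when it is busiest.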

Really, this looks like a solution in search of a problem.  Right now,
the limits are set where they are because they do two things:

1) Stay out of the way of the vast majority of applications.  Those
applications that get tripped up by the defaults (like RDMA applications
getting stopped by memlock settings) have setup guides that spell out
which limits need to be changed and hints on what to change them to.

2) Stop nefarious users or errant applications from a total runaway
situation on a machine.

If your applications run without fail unless they have already failed,
and the whole machine doesn't go down with your failed application, then
the limits are working as designed.  If your typical machine
configuration includes 256GB of RAM, then you could probably stand to
increase some of the limits safely if you wanted to.  But unless you
have applications getting killed because of these limits, why would you?
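
For anyone who wants to see where the limits discussed in this thread
currently sit for a process, a minimal sketch using Python's standard
resource module follows.  The helper names are made up for illustration,
and the limit names are looked up defensively because not every platform
defines all of them.

```python
import resource

# Limits discussed in this thread: open files, user processes,
# locked memory, and POSIX message queues.
NAMES = ["RLIMIT_NOFILE", "RLIMIT_NPROC", "RLIMIT_MEMLOCK", "RLIMIT_MSGQUEUE"]


def fmt(value: int) -> str:
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits() -> dict:
    """Return {name: (soft, hard)} for each limit this platform defines."""
    limits = {}
    for name in NAMES:
        which = getattr(resource, name, None)
        if which is None:
            continue  # this platform does not define the limit
        limits[name] = resource.getrlimit(which)
    return limits


for name, (soft, hard) in current_limits().items():
    print(f"{name}: soft={fmt(soft)} hard={fmt(hard)}")
```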

Right now, I'm inclined to NAK the patch set.  I've only seen patch 9/14
since you didn't Cc: everyone on the patch 1/14 that added the
infrastructure.  But, as I mentioned in another email, I think this can
be accomplished via a systemtap script instead so we keep the clutter
out of the kernel.  And more importantly, these patches seem to be
thinking about these limits as though they are supposed to be some sort
of tight fitting container around applications that catch an errant
application as soon as it steps out of bounds.  Nothing could be further
from the truth, and if we actually implemented something of that sort,
programs susceptible to high resource usage during load spikes would
suddenly start failing on a frequent basis.  The proof that these limits
are working is given by the fact that we rarely hear from users about
their programs being killed for resource consumption, and yet we also
don't hear from users about their systems going down due to runaway
applications.  From what I can tell from these patches, I would suspect
complaints from one of those two issues to increase once these patches
are in place and put in use, and that doesn't seem like a good thing.

Doug Ledford <dledford at redhat.com>
    GPG Key ID: 0E572FDD

More information about the Linuxppc-dev mailing list