[RFC 0/2] Reenable might_sleep() checks for might_fault() when atomic

Wed Nov 26 21:05:04 AEDT 2014

Hi Michael,

thanks for your reply! some general thought:

This change was introduced mid 2013 but we don't have a single user relying
on this code change yet, right?

Disabling might_sleep() for functions that clearly state that they may sleep to
get some special case running feels wrong to me.

> On Tue, Nov 25, 2014 at 12:43:24PM +0100, David Hildenbrand wrote:
> > I recently discovered that commit 662bbcb2747c2422cf98d3d97619509379eee466
> > removed/skipped all might_sleep checks for might_fault() calls when in atomic.
> 
> Yes.  You can add e.g. might_sleep in your code if, for some reason, it is
> illegal to call it in an atomic context.
> But faults are legal in atomic if you handle the possible
> errors, and faults do not necessary cause caller to sleep, so might_fault
> should not call might_sleep.

My point is that and (almost at) everywhere where we use pagefault_disable, we
are using the inatomic variants. Also the documentation of copy_to_user()
clearly states at various points that this function "may sleep":

-> git grep "This function may sleep" yields
   "Context: User context only.  This function may sleep." e.g. s390, x86, mips

Patching out the might_sleep() from these functions seems more to be a hack to
solve another problem - not using the inatomic variants or finding them not to
be sufficient for a task?

So calling might_sleep() in these functions seems very right to me.

> 
> > Reason was to allow calling copy_(to|from)_user while in pagefault_disabled(),
> > because otherwise, CONFIG_DEBUG_ATOMIC_SLEEP would find false positives.
> 
> That wasn't the only reason BTW.
> Andi Kleen also showed that compiler did a terrible job optimizing
> get/put user when might_sleep was called.
> See e.g. this thread:
> https://lkml.org/lkml/2013/8/14/652
> There was even an lwn.net article about it, that I don't seem to be
> able to locate.

Thanks, I'll try to look it up! but:

might_sleep() will only be called when lock debugging / sleep in atomic is in,
so this doesn't seem to be a problem for me in a production environment. or am
I missing something?

> So might_sleep is not appropriate for lightweight operations like __get_user,
> which people literally expect to be a single instruction.

I agree that __.* variants should not call might_fault() (like I mentioned
after the table below).

> 
> 
> I also have a project going which handles very short packets by copying
> them into guest memory directly without waking up a thread.
> I do it by calling recvmsg on a socket, then switching to
> a thread if I get back EFAULT.
> Not yet clean enough to upstream but it does seem to cut
> the latency down quite a bit, so I'd like to have the option.
> 
> 
> Generally, a caller that does something like this under a spinlock:
>         preempt_disable
>         pagefault_disable
>         error = copy_to_user

So why can't we use the inatomic variant? This seems to be
atomic environment to me. Calling a function that states that it may sleep
doesn't feel right to me.

>         pagefault_enable
>         preempt_enable_no_resched
> 
> is not doing anything wrong and should not get a warning,
> as long as error is handled correctly later.
> 
> You can also find the discussion around the patches
> educational:
> http://article.gmane.org/gmane.comp.emulators.kvm.devel/109928
> 
> > However
> > we have the inatomic variants of these function for this purpose.
> 
> Does inatomic install fixups now?

If not, I think this would rather be the way to go.

> 
> Last I checked, it didn't so it did not satisfy this purpose.
> See this comment from x86:
> 
>  * Copy data from kernel space to user space.  Caller must check
>  * the specified block with access_ok() before calling this function.
>  * The caller should also make sure he pins the user space address
>  * so that we don't result in page fault and sleep.
> 
> 
> Also - switching to inatomic would scatter if (atomic) all
> over code. It's much better to simply call the same
> function (e.g. recvmsg) and have it fail gracefully:
> after all, we have code to handle get/put user errors
> anyway.
> 
> > The result of this change was that all guest access (that correctly uses
> > might_fault()) doesn't perform atomic checks when CONFIG_DEBUG_ATOMIC_SLEEP is
> > enabled. We lost a mighty debugging feature for user access.
> 
> What's the path you are trying to debug?

Well, we had a problem where we held a spin_lock and called
copy_(from|to)_user(). We experienced very random deadlocks that took some guy
almost a week to debug. The simple might_sleep() check would have showed this
error immediately.

I would have said that in 99,9% of all copy_to_user() calls we want to check
might_sleep(). That pagefault_disable() is a special case that should be
handled differently - in my opinion.

> 
> If your code can faults, then it's safe to call
> from atomic context.
> If it can't, it must pin the page.  You can not do access_ok checks and
> then assume access won't fault.
> 
> > As I wasn't able to come up with any other reason why this should be
> > necessary, I suggest turning the might_sleep() checks on again in the
> > might_fault() code.
> 
> Faults triggered with pagefault_disabled do not cause
> caller to sleep, merely to fail and get an error,
> so might_sleep is simply wrong.
> 
> > 
> > pagefault_disable/pagefault_enable seems to be used mainly for futex, perf event
> > and kmap.
> > 
> > Going over all changes since that commit, it seems like most code already uses the
> > inatomic variants of copy_(to|from)_user. Code relying on (get|put)_user during
> > pagefault_disable() don't make use of any might_fault() in their (get|put)_user
> > implementation. Examples:
> > - arch/m68k/include/asm/futex.h
> > - arch/parisc/include/asm/futex.h
> > - arch/sh/include/asm/futex-irq.h
> > - arch/tile/include/asm/futex.h
> > So changing might_fault() back to trigger might_sleep() won't change a thing for
> > them. Hope I haven't missed anything.
> 
> Did you check the generated code?

Nope not yet. But as I said, if lock debugging is not enabled, this should
remain as is - without any might_sleep() checks.

> On x86 it seems to me this patchset is definitely going to
> slow things down, in fact putting back in a performance regression
> that Andi found.
> 
> 
> > I only identified one might_fault() that would get triggered by get_user() on
> > powerpc and fixed it by using the inatomic variant (not tested). I am not sure
> > if we need some non-sleeping access_ok() prior to __get_user_inatomic().
> > 
> > By looking at the code I was wondering where the correct place for might_fault()
> > calls is? Doesn't seem to be very consistent. Examples:
> > 
> >                    | asm-generic | arm | arm64 | frv | m32r | x86 and s390
> > ---------------------------------------------------------------------------
> > get_user()         |   Yes       | Yes | Yes   | No  | Yes  |     Yes
> > __get_user()       |   No        | Yes | No    | No  | Yes  |     No
> > put_user()         |   Yes       | Yes | Yes   | No  | Yes  |     Yes
> > __put_user()       |   No        | Yes | No    | No  | Yes  |     No
> > copy_to_user()     |   Yes       | No  | No    | Yes | Yes  |     Yes
> > __copy_to_user()   |   No        | No  | No    | Yes | No   |     No
> > copy_from_user()   |   Yes       | No  | No    | Yes | Yes  |     Yes
> > __copy_from_user() |   No        | No  | No    | Yes | No   |     No
> > 
> 
> I think it would be a mistake to make this change.
> 
> Most call-sites handle faults in atomic just fine by
> returning error to caller.
> 
> > So I would have assume that the way asm-generic, x86 and s390 (+ propably
> > others) do this is the right way?
> > So we can speed up multiple calls to e.g. copy_to_user() by doing the access
> > check manually (and also the might_fault() if relevant), then calling
> > __copy_to_user().
> > 
> > So in general, I conclude that the concept is:
> > 1. __.* variants perform no checking and don't call might_fault()
> > 2. [a-z].* variants perform access checking (access_ok() if implemented)
> > 3. [a-z].* variants call might_fault()
> > 4. .*_inatomic variants usually don't perform access checks
> > 5. .*_inatomic variants don't call might_fault()
> > 6. If common code uses the __.* variants, it has to trigger access_ok() and
> >    call might_fault()
> > 7. For pagefault_disable(), the inatomic variants are to be used
> 
> inatomic variants don't seem to handle faults, so you
> must pin any memory you pass to them.
> 

Would that be an option for your special case?

> 
> > Comments? Opinions?
> > 
> 
> If the same address is accessed multiple times, access_ok + __
> variant can be used to speed access up a bit.
> This is rarely the case, but this is the case for e.g. vhost.
> But access_ok does not guarantee that no fault will trigger:
> there's really no way to do that ATM except pinning the page.
> 
> 
> > David Hildenbrand (2):
> >   powerpc/fsl-pci: atomic get_user when pagefault_disabled
> >   mm, sched: trigger might_sleep() in might_fault() when atomic
> > 
> >  arch/powerpc/sysdev/fsl_pci.c |  2 +-
> >  include/linux/kernel.h        |  8 ++++++--
> >  mm/memory.c                   | 11 ++++-------
> >  3 files changed, 11 insertions(+), 10 deletions(-)
> > 
> > -- 
> > 1.8.5.5
>