[PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()
Linus Torvalds
torvalds at linux-foundation.org
Thu Oct 17 03:10:47 AEDT 2024
On Mon, 14 Oct 2024 at 09:55, Linus Torvalds
<torvalds at linux-foundation.org> wrote:
>
> On Mon, 14 Oct 2024 at 05:30, Kirill A. Shutemov <kirill at shutemov.name> wrote:
> >
> > Given that LAM enforces bit 47/56 to be equal to bit 63 I think we can do
> > this unconditionally instead of masking:
> >
> > diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
> > index d066aecf8aeb..86d4511520b1 100644
> > --- a/arch/x86/lib/getuser.S
> > +++ b/arch/x86/lib/getuser.S
> > @@ -37,9 +37,14 @@
> >
> > #define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC
> >
> > +#define SHIFT_LEFT_TO_MSB ALTERNATIVE \
> > + "shl $(64 - 48), %rdx", \
> > + "shl $(64 - 57), %rdx", X86_FEATURE_LA57
>
> That's certainly a lot smaller than the big constant, which is good.
Actually, having thought this over, I think we can simplify things
more, and not do any shifts AT ALL.
IOW, I think the whole "mask non-canonical bits" model works on its
own, even without the "shift sign bit and or" logic.
So it can replace all the shifting entirely (never mind the double
shift), and we can have something like this:
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -37,11 +37,14 @@
 
 #define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC
 
+#define X86_CANONICAL_MASK ALTERNATIVE \
+	"movq $0x80007fffffffffff,%rdx", \
+	"movq $0x80ffffffffffffff,%rdx", X86_FEATURE_LA57
+
 .macro check_range size:req
 .if IS_ENABLED(CONFIG_X86_64)
-	mov %rax, %rdx
-	sar $63, %rdx
-	or %rdx, %rax
+	X86_CANONICAL_MASK /* mask into %rdx */
+	and %rdx,%rax
 .else
 	cmp $TASK_SIZE_MAX-\size+1, %eax
 	jae .Lbad_get_user
Yes, it does end up having that big constant (not great), but it makes
up for it by avoiding the register move instruction (3 bytes) and the
two shift instructions (4 bytes each). So while the "load big constant"
instruction is a big 10-byte instruction, it replaces 11 bytes' worth
of instructions, and the end result is actually one byte shorter than
the old code was.
It also avoids a couple of data dependencies, so it's both smaller and simpler.
An invalid address with the high bit set will become non-canonical, so
it will cause an exception as expected.
But it avoids the AMD issue, because while it's non-canonical, it
always clears bit 47 (or 56), the bit that every kernel address has
set, so it will never hit a kernel TLB/cache tag.
Of course, if some other core then does the sane thing, and uses bit
#63 (instead of #47/56), and still has the pseudo-meltdown behavior,
then the non-canonical address generated by the plain 'and' will end
up having the same old issue.
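(Purely as an illustration, and not part of the patch: here's roughly
what that single 'and' does in the plain 4-level case, written as a
stand-alone user-space C sketch. The mask constant is the one from the
diff above; the helper names and the sample addresses are made up.)

#include <stdint.h>
#include <stdio.h>

/* Non-LA57 mask from the diff: keep bit 63 and bits 46:0, clear 62:47. */
#define CANONICAL_MASK_48	0x80007fffffffffffULL

/* What check_range now boils down to on x86-64: a single 'and'. */
static uint64_t mask_user_ptr(uint64_t addr)
{
	return addr & CANONICAL_MASK_48;
}

/* Canonical for a 48-bit VA: bits 63:47 are all zeros or all ones. */
static int is_canonical_48(uint64_t addr)
{
	uint64_t top = addr >> 47;
	return top == 0 || top == 0x1ffff;
}

int main(void)
{
	uint64_t user   = 0x00007f1234567000ULL;  /* typical user pointer */
	uint64_t kernel = 0xffff888012345000ULL;  /* typical kernel pointer */

	/* A valid user pointer passes through unchanged, still canonical. */
	printf("user:   %#018llx -> %#018llx canonical=%d\n",
	       (unsigned long long)user,
	       (unsigned long long)mask_user_ptr(user),
	       is_canonical_48(mask_user_ptr(user)));

	/*
	 * A kernel pointer keeps bit 63 but loses bits 62:47, so the
	 * result is non-canonical (dereferencing it faults), and its
	 * low 48 bits no longer look like a kernel address.
	 */
	printf("kernel: %#018llx -> %#018llx canonical=%d\n",
	       (unsigned long long)kernel,
	       (unsigned long long)mask_user_ptr(kernel),
	       is_canonical_48(mask_user_ptr(kernel)));
	return 0;
}

With LA57 the same thing happens with the 0x80ffffffffffffff mask, just
with bit 56 in place of bit 47.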
But in (related) news, the address masking trick really does matter:
https://lore.kernel.org/all/202410161557.5b87225e-oliver.sang@intel.com/
which reports that ridiculous 6.8% improvement just from avoiding the
barrier in user_access_begin().
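(For context, and not from the report above: the code pattern that
benefits looks roughly like the sketch below. On x86-64,
user_access_begin() used to imply a barrier_nospec()/lfence after the
access_ok() check, and the address masking is what lets that barrier
go away. The helper here is made up;
user_access_begin()/unsafe_get_user()/user_access_end() are the real
uaccess API.)

#include <linux/errno.h>
#include <linux/types.h>
#include <linux/uaccess.h>

/* Hypothetical kernel-style helper: read one u32 from user space. */
static int read_user_u32(u32 __user *uptr, u32 *out)
{
	u32 val;

	/* access_ok() + STAC; historically also an lfence on x86-64. */
	if (!user_access_begin(uptr, sizeof(*uptr)))
		return -EFAULT;

	/* No further checks or barriers inside the region. */
	unsafe_get_user(val, uptr, Efault);
	user_access_end();

	*out = val;
	return 0;

Efault:
	user_access_end();
	return -EFAULT;
}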
So that barrier really *is* very expensive. Surprisingly so.
Linus