[PATCH] x86/uaccess: Avoid barrier_nospec() in copy_from_user()

Linus Torvalds torvalds at linux-foundation.org
Mon Oct 14 17:50:55 AEDT 2024


On Sun, 13 Oct 2024 at 20:54, Josh Poimboeuf <jpoimboe at kernel.org> wrote:
>
> If I understand correctly, LAM bits are for the benefit of SW and are
> ignored by HW?  Can we just mask those bits off?

Yes. But then you waste time on the masking, particularly if it then
causes silly extra overhead just to get the mask in the first place.

That was why the original LAM patches were dismissed - not because an
"and" instruction to remove the LAM bits would be expensive, but
because once you do static branching on it and then load the mask from
memory, the code goes from nice, simple, straight-line efficient code
to a horrible mess.

The early LAM patches admittedly made it even worse because they did a
per-thread mask etc, so it really got quite nasty and wasn't just *one*
static jump to a constant. But still...
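
To make that concrete, the difference is roughly between these two
shapes of code (a hypothetical, simplified sketch - the names here are
made up, and this is not the actual LAM patch code):

        #include <linux/jump_label.h>

        /* Nice case: the untag mask is a compile-time constant, so stripping
         * the tag bits is a single AND on a register. */
        static inline unsigned long untag_const(unsigned long addr)
        {
                return addr & ~(0x3fUL << 57);  /* clear LAM_U57-style tag bits 62:57 */
        }

        /* Ugly case: a static branch, plus a per-thread mask that has to be
         * loaded from memory before the AND can even happen. */
        DECLARE_STATIC_KEY_FALSE(lam_key);              /* made-up key */
        extern unsigned long per_thread_untag_mask;     /* made-up variable */

        static inline unsigned long untag_dynamic(unsigned long addr)
        {
                if (static_branch_unlikely(&lam_key))
                        addr &= per_thread_untag_mask;
                return addr;
        }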

> > So maybe the "what is the actual cycle latency of detecting the
> > faulting instruction" really is the core question here.
>
> I think I remember reading that STAC/CLAC can't necessarily be relied on
> as a speculation barrier, depending on microarchitectural details.  It
> might be safest to assume we can't rely on that.  Masking is relatively
> cheap anyway.

The WHOLE BUG is all about "microarchitectural details".

So arguing that STAC is a microarchitectural detail and not
architecturally guaranteed isn't an argument.

Because architecturally, this bug simply does not exist, and so any
architectural definition of it simply doesn't matter.

So all that matters is whether the broken microarchitectures that used
the wrong bit for testing also have a STAC that basically hides the
bug.

Christ. The hardware should have used a bit that is *MEANINGFUL*,
namely bit #63, not some random meaningless bit that just happens to
be one of the bits that is then checked for canonicality.

And it's so annoying, because from a *hardware* perspective, bit #63
vs bit #48 is completely irrelevant. It's literally just a wire choice.
But from an architectural perspective, bit #63 is not only the
*actual* bit that marks the real difference ("kernel is at the top of
the address space"); bit #48 is also fundamentally harder for software
to test.

IOW, it's completely the wrong effing bit to test.
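
That, for the record, is what makes the sign bit so convenient: one
arithmetic shift turns it into a full all-zeroes/all-ones mask, which
is exactly what the generated code further down relies on. A trivial
illustration, not kernel code:

        /* Bit #63 is the sign bit: one arithmetic shift gives a usable
         * all-zeroes (user) / all-ones (kernel) mask. (Right-shifting a
         * negative value is technically implementation-defined in C, but
         * gcc/clang do the arithmetic shift here.) */
        static inline unsigned long sign_mask(unsigned long addr)
        {
                return (unsigned long)((long)addr >> 63);
        }

        /* Bit #48 has no such shortcut: you need a shift or a 64-bit constant
         * just to look at it, and it gives you nothing useful as a mask. */
        static inline int bit48_set(unsigned long addr)
        {
                return (addr >> 48) & 1;
        }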

Honestly, with the Intel Meltdown thing I at least understood *why* it happened.

This AMD "use meaningless bit #48" bug just strikes me as active malice.

The paper I found for this bug doesn't go into any detail about what
the cycle-count issues are until the full address is verified, and
doesn't really pin down which microarchitectures are affected - it
says "Zen+" and "Zen 2", from a quick read.

The AMD note doesn't even say that much. Is this *all* Zen cores? If not,
when did it get fixed?

> So far I have something like the below which is completely untested and
> probably actively wrong in some places.

This is worse than actively wrong. It's just adding insult to injury.

> +static inline bool __access_ok(const void __user *ptr, unsigned long size)
> +{
> +       unsigned long addr = (__force unsigned long)untagged_addr(ptr);
> +       unsigned long limit = TASK_SIZE_MAX;

We're not going to encourage this complete microarchitectural
incompetence by using something expensive like TASK_SIZE_MAX, which
actually expands to a conditional on pgtable_l5_enabled() and a shift
(maybe the compiler ends up moving the shift into both sides of the
conditional?).

Which isn't even the right thing, because presumably, as far as the
HARDWARE is concerned, what gets tested is the actual width of the
TLB, which doesn't depend on whether 5-level paging is actually
*enabled*, but on whether the hardware *supports* 5-level paging.

I guess as long as we aren't using the high bits, we can just clear
all of them (so if we just always clear bit #57 when we also clear
#48, we are ok), but it still seems wrong in addition to being
pointlessly expensive.

> +#define mask_user_address(x) \
> +       ((typeof(x))((__force unsigned long)(x) & ((1UL << __VIRTUAL_MASK_SHIFT) - 1)))

No. Again, that code makes it look like it's some nice simple
constant. It's not. __VIRTUAL_MASK_SHIFT is that conditional thing
based on pgtable_l5_enabled(). So now you have static branches etc
craziness instead of having a nice constant.
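
For reference, this is roughly what those expand to (the exact details
vary by kernel version, so treat this as a sketch):

        /* Approximately, from the arch/x86 headers: */
        #define __VIRTUAL_MASK_SHIFT    (pgtable_l5_enabled() ? 56 : 47)
        #define TASK_SIZE_MAX           ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)

        /* ... and pgtable_l5_enabled() is itself roughly
         * cpu_feature_enabled(X86_FEATURE_LA57), ie a patched branch. */

So the "nice simple constant" isn't a constant at all once
CONFIG_X86_5LEVEL is in play.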

We can fix this, but no, we're not going to go "the hardware people
did something incredibly stupid, so now, to balance things out, *we*
will do something equally stupid".

We are going to be better than that.

So the way to fix this properly - if it even needs fixing - is to just
have an actual assembler alternative that does something like this in
get_user.S, and to do the same thing for the C code in mask_user_address().

And it needs a *big* comment about the stupidity of the hardware
checking the wrong bit - a bit that has no semantic meaning at all,
except that it happens (too late!) to be checked for canonicality
against the actual bit that matters (ie bit #63).

And again, the whole "if it even needs fixing" is a big thing. Maybe
STAC is just taking long enough that the canonicality check *has* been
done. We know the STAC isn't a serializing instruction, but we also
*do* know that STAC sure as hell will synchronize at least to some
degree with memory access permission testing, because that's literally
the whole and only point of it.

(Of course, if the AC bit was just passed along from the front end and
tagged all the instructions, the actual CLAC/STAC instructions
wouldn't need to serialize with actual instruction execution, but if
that were the case they wouldn't be as expensive as I see them being
in profiles, so we know the uarch isn't doing anything that clever).

Christ. I thought I was over being annoyed by hardware bugs. But this
hardware bug is just annoying in how *stupid* it is.

Anyway, the attached patch

 (a) only fixes get_user() - I assume put_user() doesn't even need
this, because the store will go into the store buffer, and certainly
be killed before it gets anywhere else?

 (b) only does the asm side; the mask_user_address() side would need
to do the same thing using the C version, ie alternative()

 (c) is entirely untested, and I might have gotten the constants wrong
or some other logic wrong.

but at least the code it generates doesn't look actively like garbage.
It generates something like this:

        mov    %rax,%rdx
        sar    $0x3f,%rdx
        or     %rdx,%rax
        movabs $0x80007fffffffffff,%rdx
        and    %rdx,%rax

which looks about as good as it gets (assuming I didn't screw up the constant).

The "or" still sets all bits when it's one of the real kernel
addresses), but the "and" will now guarantee that the end result is
canonical for positive addresses, and guaranteed non-canonical - and
for this AMD bug, a zero bit 48/57 - for the invalid kernel range.
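
For the C side, mask_user_address() would presumably end up being the
same trick. A minimal (equally untested) sketch, assuming the same
constant with the same caveats, and with the mask presumably coming
from an alternative() like on the asm side:

        static inline void __user *mask_user_address(const void __user *ptr)
        {
                unsigned long addr = (__force unsigned long)ptr;

                addr |= (unsigned long)((long)addr >> 63);      /* mov / sar / or */
                addr &= 0x80007fffffffffffUL;                   /* movabs / and   */

                return (__force void __user *)addr;
        }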

Yes, it basically means that as far as the kernel is concerned,
"everything is LAM".

The kernel would actually accept all positive addresses, and only
fault (now with a non-canonical GP fault) when users try to use
negative addresses.

Which is arguably also quite horrendous, but it's effectively what
having LAM enabled would do in hardware anyway, so hey, it's "forward
looking". Bah.

And we could make the masking constants be 0xfeff7fffffffffff and
0xfeffffffffffff, and *only* mask off bit 48/57.

And we could even make the whole "and" be conditional on the AMD bug
with a new X86_FEATURE_CANONICAL_LEAK thing.
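
Something along these lines, say - purely a sketch, since
X86_FEATURE_CANONICAL_LEAK obviously doesn't exist (yet), and
cpu_feature_enabled() compiles down to a patched branch, so CPUs
without the bug would never execute the AND at all:

        #include <asm/cpufeature.h>

        static inline unsigned long clamp_if_canonical_leak(unsigned long addr)
        {
                /* X86_FEATURE_CANONICAL_LEAK is made up: a synthetic bug bit
                 * that would be set only on the affected AMD parts. */
                if (cpu_feature_enabled(X86_FEATURE_CANONICAL_LEAK))
                        addr &= 0x80007fffffffffffUL;   /* or a bit-48/57-only mask */
                return addr;
        }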

So there are certainly options in this area.

             Linus
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch.diff
Type: text/x-patch
Size: 673 bytes
Desc: not available
URL: <http://lists.ozlabs.org/pipermail/linuxppc-dev/attachments/20241013/f8806c62/attachment.bin>

