[PATCH] powerpc: Fix data corruption on IPI

Fri Nov 17 19:20:29 AEDT 2023

----- Original Message -----
> From: "npiggin" <npiggin at gmail.com>
> To: "Timothy Pearson" <tpearson at raptorengineering.com>, "Michael Ellerman" <mpe at ellerman.id.au>
> Cc: "linuxppc-dev" <linuxppc-dev at lists.ozlabs.org>
> Sent: Friday, November 17, 2023 2:01:12 AM
> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI

> On Fri Nov 17, 2023 at 5:39 PM AEST, Timothy Pearson wrote:
>>
>>
>> ----- Original Message -----
>> > From: "Michael Ellerman" <mpe at ellerman.id.au>
>> > To: "Timothy Pearson" <tpearson at raptorengineering.com>, "linuxppc-dev"
>> > <linuxppc-dev at lists.ozlabs.org>
>> > Cc: "Jens Axboe" <axboe at kernel.dk>
>> > Sent: Tuesday, November 14, 2023 6:14:37 AM
>> > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
>>
>> > Hi Timothy,
>> > 
>> > Thanks for debugging this, but I'm unclear why this is helping because
>> > we should already have a full barrier (hwsync) on both the sending and
>> > receiving side.
>> > 
>> > More below.
>>
>> I've spent another few days poking at this, and think I might finally have
>> something more solid in terms of what exactly is happening, but would like some
>> feedback on the concept / how best to fix the potential problem.
>>
>> As background, there are several worker threads both in userspace and in kernel
>> mode.  Crucially, the main MariaDB data processing thread (the one that handles
>> tasks like flushing dirty pages to disk) always runs on the same core as the
>> io_uring kernel thread that picks up I/O worker creation requests and handles
>> them via create_worker_cb().
>>
>> Changes in the ~5.12 era switched away from a delayed worker setup.  io_uring
>> currently sets up the new process with create_io_thread(), and immediately uses
>> an IPI to forcibly schedule the new process.  Because of the way the two
>> threads interact, the new process ends up grabbing the CPU from the running
>> MariaDB user thread; I've never seen it schedule on a different core.  If the
>> timing is right in this process, things get trampled on in userspace and the
>> database server either crashes or throws a corruption fault.
>>
>> Through extensive debugging, I've narrowed this down to invalid state in the VSX
>> registers on return to the MariaDB user thread from the new kernel thread.  For
>> some reason, it seems we don't restore FP state on return from the PF_IO_WORKER
>> thread, and something in the kernel was busy writing new data to them.
>>
>> A direct example I was able to observe is as follows:
>>
>> xxspltd vs0,vs0,0      <-- vs0 now zeroed out
>> xori    r9,r9,1        <-- Presumably we switch to the new kernel thread here
>> due to the IPI
>> slwi    r9,r9,7        <-- On userspace thread resume, vs0 now contains the
>> value 0x820040000000000082004000
>> xxswapd vs8,vs0        <-- vs8 now has the wrong value
>> stxvd2x vs8,r3,r12     <-- userspace is now getting stepped on
>> stw     r9,116(r3)
>> stxvd2x vs8,r3,r0
>> ...
>> CRASH
> 
> Nice find, that looks pretty conclusive.
> 
>> This is a very difficult race to hit, but MariaDB naturally does repetitive
>> operations with VSX registers so it does eventually fail.  I ended up with a
>> tight loop around glibc operations that use VSX to trigger the failure reliably
>> enough to even understand what was going on.
>>
>> As I am not as familiar with this part of the Linux kernel as with most other
>> areas, what is the expected save/restore path for the FP/VSX registers around
>> an IPI and associated forced thread switch?  If restore_math() is in the return
>> path, note that MSR_FP is set in regs->msr.
> 
> Context switching these FP/vec registers should be pretty robust in
> general because it's not just io-uring that uses them. io-uring could
> be using some uncommon kernel code that uses the registers incorrectly
> though I guess.
> 
>>
>> Second question: should we even be using the VSX registers at all in kernel
>> space?  Is this a side effect of io_uring interacting so closely with userspace
>> threads, or something else entirely?
>>
>> If I can get pointed somewhat in the right direction I'm ready to develop the
>> rest of the fix for this issue...just trying to avoid another several days of
>> slogging through the source to see what it's supposed to be doing in the first
>> place. :)
> 
> Kernel can use FP/VEC/VSX registers but it has to enable and disable
> explicitly. Such kernel code also should not be preemptible.
> 
> enable|disable_kernel_fp|altivec|vsx().
> 
> Maybe try run the test with ppc_strict_facility_enable boot option to
> start with.
> 
> If no luck with that, maybe put WARN_ON(preemptible()); checks also in
> the disable_kernel_* functions.
> 
> You could also add an enable/disable counter for each, and make sure it
> is balanced on context switch or kernel->userspace exit.
> 
> Thanks,
> Nick

Will do, thanks for the hints!

I had a debug idea just as I sent the earlier message, and was able to confirm the kernel is purposefully restoring the bad data in at least one spot in the thread's history, though curiously *not* right before everything goes off the rails.  I also dumped the entire kernel binary and confirmed it isn't touching the vs* registers, so overall this is leaning more toward a bad value being restored than kernel code inadvertently making use of the vector registers.

An I missing anything other than do_restore_fp() that could touch the vs* registers around an interrupt-driven task switch?