[PATCH] powerpc: Fix data corruption on IPI

Fri Nov 17 18:52:13 AEDT 2023

----- Original Message -----
> From: "Timothy Pearson" <tpearson at raptorengineeringinc.com>
> To: "Michael Ellerman" <mpe at ellerman.id.au>, "npiggin" <npiggin at gmail.com>
> Cc: "linuxppc-dev" <linuxppc-dev at lists.ozlabs.org>
> Sent: Friday, November 17, 2023 1:39:37 AM
> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI

> ----- Original Message -----
>> From: "Michael Ellerman" <mpe at ellerman.id.au>
>> To: "Timothy Pearson" <tpearson at raptorengineering.com>, "linuxppc-dev"
>> <linuxppc-dev at lists.ozlabs.org>
>> Cc: "Jens Axboe" <axboe at kernel.dk>
>> Sent: Tuesday, November 14, 2023 6:14:37 AM
>> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
> 
>> Hi Timothy,
>> 
>> Thanks for debugging this, but I'm unclear why this is helping because
>> we should already have a full barrier (hwsync) on both the sending and
>> receiving side.
>> 
>> More below.
> 
> I've spent another few days poking at this, and think I might finally have
> something more solid in terms of what exactly is happening, but would like some
> feedback on the concept / how best to fix the potential problem.
> 
> As background, there are several worker threads both in userspace and in kernel
> mode.  Crucially, the main MariaDB data processing thread (the one that handles
> tasks like flushing dirty pages to disk) always runs on the same core as the
> io_uring kernel thread that picks up I/O worker creation requests and handles
> them via create_worker_cb().
> 
> Changes in the ~5.12 era switched away from a delayed worker setup.  io_uring
> currently sets up the new process with create_io_thread(), and immediately uses
> an IPI to forcibly schedule the new process.  Because of the way the two
> threads interact, the new process ends up grabbing the CPU from the running
> MariaDB user thread; I've never seen it schedule on a different core.  If the
> timing is right in this process, things get trampled on in userspace and the
> database server either crashes or throws a corruption fault.
> 
> Through extensive debugging, I've narrowed this down to invalid state in the VSX
> registers on return to the MariaDB user thread from the new kernel thread.  For
> some reason, it seems we don't restore FP state on return from the PF_IO_WORKER
> thread, and something in the kernel was busy writing new data to them.
> 
> A direct example I was able to observe is as follows:
> 
> xxspltd vs0,vs0,0      <-- vs0 now zeroed out
> xori    r9,r9,1        <-- Presumably we switch to the new kernel thread here
> due to the IPI
> slwi    r9,r9,7        <-- On userspace thread resume, vs0 now contains the
> value 0x820040000000000082004000
> xxswapd vs8,vs0        <-- vs8 now has the wrong value
> stxvd2x vs8,r3,r12     <-- userspace is now getting stepped on
> stw     r9,116(r3)
> stxvd2x vs8,r3,r0
> ...
> CRASH
> 
> This is a very difficult race to hit, but MariaDB naturally does repetitive
> operations with VSX registers so it does eventually fail.  I ended up with a
> tight loop around glibc operations that use VSX to trigger the failure reliably
> enough to even understand what was going on.
> 
> As I am not as familiar with this part of the Linux kernel as with most other
> areas, what is the expected save/restore path for the FP/VSX registers around
> an IPI and associated forced thread switch?  If restore_math() is in the return
> path, note that MSR_FP is set in regs->msr.
> 
> Second question: should we even be using the VSX registers at all in kernel
> space?  Is this a side effect of io_uring interacting so closely with userspace
> threads, or something else entirely?

Thinking a bit more, a third option could be if we're restoring garbage into the registers by accident.  I know the I/O worker threads claim to use a "lightweight" version of kernel_clone(), and it'd be easy to have missed something important...