[PATCH] powerpc: Fix data corruption on IPI

Fri Nov 17 19:54:27 AEDT 2023

----- Original Message -----
> From: "Timothy Pearson" <tpearson at raptorengineering.com>
> To: "Timothy Pearson" <tpearson at raptorengineering.com>
> Cc: "npiggin" <npiggin at gmail.com>, "linuxppc-dev" <linuxppc-dev at lists.ozlabs.org>
> Sent: Friday, November 17, 2023 2:26:37 AM
> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI

> ----- Original Message -----
>> From: "Timothy Pearson" <tpearson at raptorengineering.com>
>> To: "npiggin" <npiggin at gmail.com>
>> Cc: "linuxppc-dev" <linuxppc-dev at lists.ozlabs.org>
>> Sent: Friday, November 17, 2023 2:20:29 AM
>> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
> 
>> ----- Original Message -----
>>> From: "npiggin" <npiggin at gmail.com>
>>> To: "Timothy Pearson" <tpearson at raptorengineering.com>, "Michael Ellerman"
>>> <mpe at ellerman.id.au>
>>> Cc: "linuxppc-dev" <linuxppc-dev at lists.ozlabs.org>
>>> Sent: Friday, November 17, 2023 2:01:12 AM
>>> Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
>> 
>>> On Fri Nov 17, 2023 at 5:39 PM AEST, Timothy Pearson wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>> > From: "Michael Ellerman" <mpe at ellerman.id.au>
>>>> > To: "Timothy Pearson" <tpearson at raptorengineering.com>, "linuxppc-dev"
>>>> > <linuxppc-dev at lists.ozlabs.org>
>>>> > Cc: "Jens Axboe" <axboe at kernel.dk>
>>>> > Sent: Tuesday, November 14, 2023 6:14:37 AM
>>>> > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
>>>>
>>>> > Hi Timothy,
>>>> > 
>>>> > Thanks for debugging this, but I'm unclear why this is helping because
>>>> > we should already have a full barrier (hwsync) on both the sending and
>>>> > receiving side.
>>>> > 
>>>> > More below.
>>>>
>>>> I've spent another few days poking at this, and think I might finally have
>>>> something more solid in terms of what exactly is happening, but would like some
>>>> feedback on the concept / how best to fix the potential problem.
>>>>
>>>> As background, there are several worker threads both in userspace and in kernel
>>>> mode.  Crucially, the main MariaDB data processing thread (the one that handles
>>>> tasks like flushing dirty pages to disk) always runs on the same core as the
>>>> io_uring kernel thread that picks up I/O worker creation requests and handles
>>>> them via create_worker_cb().
>>>>
>>>> Changes in the ~5.12 era switched away from a delayed worker setup.  io_uring
>>>> currently sets up the new process with create_io_thread(), and immediately uses
>>>> an IPI to forcibly schedule the new process.  Because of the way the two
>>>> threads interact, the new process ends up grabbing the CPU from the running
>>>> MariaDB user thread; I've never seen it schedule on a different core.  If the
>>>> timing is right in this process, things get trampled on in userspace and the
>>>> database server either crashes or throws a corruption fault.
>>>>
>>>> Through extensive debugging, I've narrowed this down to invalid state in the VSX
>>>> registers on return to the MariaDB user thread from the new kernel thread.  For
>>>> some reason, it seems we don't restore FP state on return from the PF_IO_WORKER
>>>> thread, and something in the kernel was busy writing new data to them.
>>>>
>>>> A direct example I was able to observe is as follows:
>>>>
>>>> xxspltd vs0,vs0,0      <-- vs0 now zeroed out
>>>> xori    r9,r9,1        <-- Presumably we switch to the new kernel thread here
>>>> due to the IPI
>>>> slwi    r9,r9,7        <-- On userspace thread resume, vs0 now contains the
>>>> value 0x820040000000000082004000
>>>> xxswapd vs8,vs0        <-- vs8 now has the wrong value
>>>> stxvd2x vs8,r3,r12     <-- userspace is now getting stepped on
>>>> stw     r9,116(r3)
>>>> stxvd2x vs8,r3,r0
>>>> ...
>>>> CRASH
>>> 
>>> Nice find, that looks pretty conclusive.
>>> 
>>>> This is a very difficult race to hit, but MariaDB naturally does repetitive
>>>> operations with VSX registers so it does eventually fail.  I ended up with a
>>>> tight loop around glibc operations that use VSX to trigger the failure reliably
>>>> enough to even understand what was going on.
>>>>
>>>> As I am not as familiar with this part of the Linux kernel as with most other
>>>> areas, what is the expected save/restore path for the FP/VSX registers around
>>>> an IPI and associated forced thread switch?  If restore_math() is in the return
>>>> path, note that MSR_FP is set in regs->msr.
>>> 
>>> Context switching these FP/vec registers should be pretty robust in
>>> general because it's not just io-uring that uses them. io-uring could
>>> be using some uncommon kernel code that uses the registers incorrectly
>>> though I guess.
>>> 
>>>>
>>>> Second question: should we even be using the VSX registers at all in kernel
>>>> space?  Is this a side effect of io_uring interacting so closely with userspace
>>>> threads, or something else entirely?
>>>>
>>>> If I can get pointed somewhat in the right direction I'm ready to develop the
>>>> rest of the fix for this issue...just trying to avoid another several days of
>>>> slogging through the source to see what it's supposed to be doing in the first
>>>> place. :)
>>> 
>>> Kernel can use FP/VEC/VSX registers but it has to enable and disable
>>> explicitly. Such kernel code also should not be preemptible.
>>> 
>>> enable|disable_kernel_fp|altivec|vsx().
>>> 
>>> Maybe try run the test with ppc_strict_facility_enable boot option to
>>> start with.
>>> 
>>> If no luck with that, maybe put WARN_ON(preemptible()); checks also in
>>> the disable_kernel_* functions.
>>> 
>>> You could also add an enable/disable counter for each, and make sure it
>>> is balanced on context switch or kernel->userspace exit.
>>> 
>>> Thanks,
>>> Nick
>> 
>> Will do, thanks for the hints!
>> 
>> I had a debug idea just as I sent the earlier message, and was able to confirm
>> the kernel is purposefully restoring the bad data in at least one spot in the
>> thread's history, though curiously *not* right before everything goes off the
>> rails.  I also dumped the entire kernel binary and confirmed it isn't touching
>> the vs* registers, so overall this is leaning more toward a bad value being
>> restored than kernel code inadvertently making use of the vector registers.
>> 
>> An I missing anything other than do_restore_fp() that could touch the vs*
>> registers around an interrupt-driven task switch?
> 
> One other piece of this puzzle -- I'm running this via qemu in kvm mode.  I
> noticed there is some code that touches FP state in the kvm tree, any way we
> could be having a problem lower down the stack (i.e. at hypervisor level) that
> would manifest this way in the guest?
> 
> I can try to test on bare metal tomorrow to rule that out one way or the other.

Just tried on bare metal, same corruption issues remain so I'm going to assume qemu / kvm is not the issue here.