[PATCH] powerpc: Fix data corruption on IPI

Nicholas Piggin npiggin at gmail.com
Wed Nov 15 14:11:35 AEDT 2023

(Sorry I didn't see that Michael already made the same comment,
ignore my previous.)

On Wed Nov 15, 2023 at 7:32 AM AEST, Timothy Pearson wrote:
> ----- Original Message -----
> > From: "Michael Ellerman" <mpe at ellerman.id.au>
> > To: "Timothy Pearson" <tpearson at raptorengineering.com>, "linuxppc-dev" <linuxppc-dev at lists.ozlabs.org>
> > Cc: "Jens Axboe" <axboe at kernel.dk>
> > Sent: Tuesday, November 14, 2023 6:14:37 AM
> > Subject: Re: [PATCH] powerpc: Fix data corruption on IPI
> > Hi Timothy,
> > 
> > Thanks for debugging this, but I'm unclear why this is helping because
> > we should already have a full barrier (hwsync) on both the sending and
> > receiving side.
> > 
> > More below.
> So it looks like we might be dealing with a couple of potentially separate issues here, not sure.  This is probably the most infuriating bug I've run across in my career, so please bear with me -- with the amount of active test kernels I have installed at any one time I'm occassionally infecting one with the tests from another. ;)
> First, I'm not 100% convinced the code in try_to_wake_up() is race-free on ppc64, since adding a memory barrier between it and the call to kick_process() significantly reduces the frequency of the on-disk corruption.  Can you take a look and see if anything stands out as to why that would be important?

We have had memory ordering bugs in the scheduler before, but normally
they would result in a lost wakeup (hang) or crash in the scheduler.
AFAIK things that sleep are not supposed to assume they won't get an
interrupt before the event they are waiting on.

Where are you adding the barrier?

If it only reduces corruption then it seems like the race or ordering
bug is elsewhere (but maybe nearby).

> The second part of this though is that the barrier only reduces the frequency of the corruption, it does not eliminate the corruption.  After some consolidation, what completely eliminates the corruption is a combination of:
>  * Switching to TWA_SIGNAL_NO_IPI in task_work_add() * 
>  * Adding udelay(1000) in io_uring/rw.c:io_write(), right before the call to io_rw_init_file()

> Adding a memory barrier instead of the udelay() doesn't work, nor does adding the delay without switching to NO_IPI.

And just NO_IPI alone doesn't eliminate it?

> [1] Keeping TWA_SIGNAL and adding the barrier instruction also works, but this is conceptually simpler to understand as to why it would have an effect at all

Which barrier, the ttwu one? And by work you mean fixes completely?

Still smells like a bug in io uring code, possibly MySQL. Is it only
MySQL it's been seen with do you know? I don't suppose there are any
clues about the corruption pattern or when or where it appears?


More information about the Linuxppc-dev mailing list