[PATCH] powerpc: Fix data corruption on IPI

Michael Ellerman mpe at ellerman.id.au
Tue Nov 14 23:14:37 AEDT 2023


Hi Timothy,

Thanks for debugging this, but I'm unclear why this is helping because
we should already have a full barrier (hwsync) on both the sending and
receiving side.

More below.

Timothy Pearson <tpearson at raptorengineering.com> writes:
> From 0b2678b7cdada1a3d9aec8626f31a988d81373fa Mon Sep 17 00:00:00 2001
> From: Timothy Pearson <tpearson at raptorengineering.com>
> Date: Mon, 13 Nov 2023 22:42:58 -0600
> Subject: [PATCH] powerpc: Fix data corruption on IPI
>
> On multithreaded SMP workloads such as those using io_uring, it is possible for
> multiple threads to hold an inconsistent view of system memory when an IPI is
> issued.  This in turn leads to userspace memory corruption with varying degrees
> of probability based on workload and inter-thread timing.
>
> io_uring provokes this bug by its use of TWA_SIGNAL during thread creation,
> which is especially noticeable as significant userspace data corruption with
> certain workloads such as MariaDB (bug MDEV-30728).  While using
> TWA_SIGNAL_NO_IPI works around the corruption, no other architecture requires
> this workaround.
>
> Issue an lwsync barrier instruction prior to sending the IPI.  This ensures
> the receiving CPU has a consistent view of system memory, in line with other
> architectures.
>
> Tested under QEMU in kvm mode, running on a Talos II workstation with dual
> POWER9 DD2.2 CPUs.
>
> Tested-by: Timothy Pearson <tpearson at raptorengineering.com>
> Signed-off-by: Timothy Pearson <tpearson at raptorengineering.com>
> ---
>  arch/powerpc/kernel/smp.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index ab691c89d787..ba42238de518 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -369,8 +369,10 @@ static inline void do_message_pass(int cpu, int msg)
>  
>  void arch_smp_send_reschedule(int cpu)
>  {
> -	if (likely(smp_ops))
> +	if (likely(smp_ops)) {
> +		__smp_lwsync();
>  		do_message_pass(cpu, PPC_MSG_RESCHEDULE);
> +	}
>  }


Expanding the code:

static inline void do_message_pass(int cpu, int msg)
{
	if (smp_ops->message_pass)
		smp_ops->message_pass(cpu, msg);
#ifdef CONFIG_PPC_SMP_MUXED_IPI
	else
		smp_muxed_ipi_message_pass(cpu, msg);
#endif
}

The kernel should be using smp_muxed_ipi_message_pass() on your machine, so:

void smp_muxed_ipi_message_pass(int cpu, int msg)
{
	smp_muxed_ipi_set_message(cpu, msg);

	/*
	 * cause_ipi functions are required to include a full barrier
	 * before doing whatever causes the IPI.
	 */
	smp_ops->cause_ipi(cpu);
}

void smp_muxed_ipi_set_message(int cpu, int msg)
{
	struct cpu_messages *info = &per_cpu(ipi_message, cpu);
	char *message = (char *)&info->messages;

	/*
	 * Order previous accesses before accesses in the IPI handler.
	 */
	smp_mb();
	WRITE_ONCE(message[msg], 1);
}

Where that smp_mb() expands to hwsync (aka sync).

The receiving side is:

irqreturn_t smp_ipi_demux(void)
{
	mb();	/* order any irq clear */

	return smp_ipi_demux_relaxed();
}

irqreturn_t smp_ipi_demux_relaxed(void)
{
	struct cpu_messages *info;
	unsigned long all;

	info = this_cpu_ptr(&ipi_message);
	do {
		all = xchg(&info->messages, 0);
                ...
		if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE))
			scheduler_ipi();


The hwsync is not in *exactly* the same place as the lwsync you added.
But the only accesses between the lwsync and the hwsync should be the
load of ipi_messages, and I don't see why that should matter.

There must be something going wrong in this path though for your patch
to fix the issue for you, but I'm not sure how the barrier is fixing it.

You said you tested in a VM, I guess it's not easy for you to test on
the host?

I see now there's a long thread on this which I haven't read so I'll go
and read that. (Can you cc linuxppc-dev next time please? :D)

One thing you could try is just adding xnops in that location and see if
some number of them also makes the bug go away, eg:

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5826f5108a12..6c84bf55557e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -369,8 +369,10 @@ static inline void do_message_pass(int cpu, int msg)

 void arch_smp_send_reschedule(int cpu)
 {
-       if (likely(smp_ops))
+       if (likely(smp_ops)) {
+               asm volatile ("xnop");
                do_message_pass(cpu, PPC_MSG_RESCHEDULE);
+       }
 }
 EXPORT_SYMBOL_GPL(arch_smp_send_reschedule);


cheers


More information about the Linuxppc-dev mailing list