[Qemu-ppc] pseries on qemu-system-ppc64le crashes in doorbell_core_ipi()

Nicholas Piggin npiggin at gmail.com
Sat Mar 30 14:10:28 AEDT 2019


Steven Rostedt wrote on March 30, 2019 1:31 am:
> [ Added Peter ]
> 
> On Fri, 29 Mar 2019 19:13:55 +1000
> Nicholas Piggin <npiggin at gmail.com> wrote:
> 
>> Suraj Jitindar Singh wrote on March 29, 2019 3:20 pm:
>> > On Wed, 2019-03-27 at 17:51 +0100, Cédric Le Goater wrote:  
>> >> On 3/27/19 5:37 PM, Cédric Le Goater wrote:  
>> >> > On 3/27/19 1:36 PM, Sebastian Andrzej Siewior wrote:  
>> >> > > With qemu-system-ppc64le -machine pseries -smp 4 I get:
>> >> > >   
>> >> > > > #  chrt 1 hackbench
>> >> > > > Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
>> >> > > > Each sender will pass 100 messages of 100 bytes
>> >> > > > Oops: Exception in kernel mode, sig: 4 [#1]
>> >> > > > LE PAGE_SIZE=64K MMU=Hash PREEMPT SMP NR_CPUS=2048 NUMA pSeries
>> >> > > > Modules linked in:
>> >> > > > CPU: 0 PID: 629 Comm: hackbench Not tainted 5.1.0-rc2 #71
>> >> > > > NIP:  c000000000046978 LR: c000000000046a38 CTR: c0000000000b0150
>> >> > > > REGS: c0000001fffeb8e0 TRAP: 0700   Not tainted  (5.1.0-rc2)
>> >> > > > MSR:  8000000000089033 <SF,EE,ME,IR,DR,RI,LE>  CR: 42000874  XER: 00000000
>> >> > > > CFAR: c000000000046a34 IRQMASK: 1
>> >> > > > GPR00: c0000000000b0170 c0000001fffebb70 c000000000a6ba00 0000000028000000  
>> >> > > 
>> >> > > …  
>> >> > > > NIP [c000000000046978] doorbell_core_ipi+0x28/0x30
>> >> > > > LR [c000000000046a38] doorbell_try_core_ipi+0xb8/0xf0
>> >> > > > Call Trace:
>> >> > > > [c0000001fffebb70] [c0000001fffebba0] 0xc0000001fffebba0 (unreliable)
>> >> > > > [c0000001fffebba0] [c0000000000b0170] smp_pseries_cause_ipi+0x20/0x70
>> >> > > > [c0000001fffebbd0] [c00000000004b02c] arch_send_call_function_single_ipi+0x8c/0xa0
>> >> > > > [c0000001fffebbf0] [c0000000001de600] irq_work_queue_on+0xe0/0x130
>> >> > > > [c0000001fffebc30] [c0000000001340c8] rto_push_irq_work_func+0xc8/0x120  
>> >> > > 
>> >> > > …  
>> >> > > > Instruction dump:
>> >> > > > 60000000 60000000 3c4c00a2 384250b0 3d220009 392949c8 81290000 3929ffff
>> >> > > > 7d231838 7c0004ac 5463017e 64632800 <7c00191c> 4e800020 3c4c00a2 38425080
>> >> > > > ---[ end trace eb842b544538cbdf ]---  
>> 
>> This is unusual, and it is causing powerpc code to crash, because the
>> rt scheduler is telling irq_work_queue_on() to queue work on the
>> current CPU. Is that allowed? There are no warnings in there, but it
>> must be a rarely tested path; would it be better to ban it?
>> 
>> Steven, is this irq_work_queue_on() to self by design?
> 
> I just figured that irq_work_queue_on() would default to just
> irq_work_queue() if cpu == smp_processor_id().

Yeah, it actually doesn't; it raises a CALL_FUNCTION IPI on the
current CPU. Maybe it should do that.
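
For reference, the relevant pre-patch logic looks roughly like this (a
simplified sketch based on the lines the patch below removes, not the
exact source):

	/*
	 * Simplified sketch of the current (pre-patch) irq_work_queue_on();
	 * see the removed lines in the patch below for the real code.
	 */
	bool irq_work_queue_on(struct irq_work *work, int cpu)
	{
		/* Only queue if not already pending */
		if (!irq_work_claim(work))
			return false;

		/*
		 * There is no cpu == smp_processor_id() special case, so the
		 * IPI is raised even when the target is the local CPU. On
		 * pseries that lands in smp_pseries_cause_ipi() ->
		 * doorbell_try_core_ipi() -> doorbell_core_ipi(), i.e. the
		 * msgsndp in the oops above.
		 */
		if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
			arch_send_call_function_single_ipi(cpu);

		return true;
	}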

> 
> I'm sure this case has been tested (on x86), as I stressed this quite a
> bit. All it takes is for an RT task to be queued on the CPU that is
> processing the current "push" logic while it's not holding the rq
> locks, and for it to then read that the current CPU now has an
> overloaded RT task.
> 
> If calling irq_work_queue_on() for the current CPU is really a bad
> thing to do, then we can work on changing the logic to do something
> different when the CPU returned from rto_next_cpu() is the current CPU.

I don't actually know that it is bad. It does seem to violate the
principle of least surprise for arch code, though.

From this bug report, I'm guessing it's a pretty rare case to hit, so
I would worry about subtle bugs. And I just cc'ed you in case your
code was actually not intending to do this.
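
For anyone following along, the caller in the backtrace above is the RT
"push" IPI logic. Paraphrasing from kernel/sched/rt.c of this era,
simplified and from memory, so treat it as illustrative only:

	/*
	 * Rough paraphrase of the irq_work callback from the backtrace;
	 * the real code differs in detail.
	 */
	static void rto_push_irq_work_func(struct irq_work *work)
	{
		struct root_domain *rd =
			container_of(work, struct root_domain, rto_push_work);
		int cpu;

		/* ... push any overloaded RT task off this runqueue ... */

		cpu = rto_next_cpu(rd);	/* may return the current CPU */
		if (cpu < 0)
			return;

		/* Try the next RT overloaded CPU, possibly ourselves. */
		irq_work_queue_on(&rd->rto_push_work, cpu);
	}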

> Or would it be possible to just change irq_work_queue_on() to call
> irq_work_queue() if cpu == smp_processor_id()?

Something like this?

kernel/irq_work: Do not raise an IPI when queueing work on the local CPU

The QEMU powerpc/pseries machine model was not expecting a self-IPI,
and it may be a surprising thing to do in any case, so have
irq_work_queue_on() do local queueing when the target is the current CPU.

Suggested-by: Steven Rostedt <rostedt at goodmis.org>
Signed-off-by: Nicholas Piggin <npiggin at gmail.com>
---
 kernel/irq_work.c | 78 ++++++++++++++++++++++++++---------------------
 1 file changed, 43 insertions(+), 35 deletions(-)

diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 6b7cdf17ccf8..f0e539d0f879 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -56,61 +56,69 @@ void __weak arch_irq_work_raise(void)
 	 */
 }
 
-/*
- * Enqueue the irq_work @work on @cpu unless it's already pending
- * somewhere.
- *
- * Can be re-enqueued while the callback is still in progress.
- */
-bool irq_work_queue_on(struct irq_work *work, int cpu)
+/* Enqueue on current CPU, work must already be claimed and preempt disabled */
+static void __irq_work_queue(struct irq_work *work)
 {
-	/* All work should have been flushed before going offline */
-	WARN_ON_ONCE(cpu_is_offline(cpu));
-
-#ifdef CONFIG_SMP
-
-	/* Arch remote IPI send/receive backend aren't NMI safe */
-	WARN_ON_ONCE(in_nmi());
+	/* If the work is "lazy", handle it from next tick if any */
+	if (work->flags & IRQ_WORK_LAZY) {
+		if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) &&
+		    tick_nohz_tick_stopped())
+			arch_irq_work_raise();
+	} else {
+		if (llist_add(&work->llnode, this_cpu_ptr(&raised_list)))
+			arch_irq_work_raise();
+	}
+}
 
+/* Enqueue the irq work @work on the current CPU */
+bool irq_work_queue(struct irq_work *work)
+{
 	/* Only queue if not already pending */
 	if (!irq_work_claim(work))
 		return false;
 
-	if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
-		arch_send_call_function_single_ipi(cpu);
-
-#else /* #ifdef CONFIG_SMP */
-	irq_work_queue(work);
-#endif /* #else #ifdef CONFIG_SMP */
+	/* Queue the entry and raise the IPI if needed. */
+	preempt_disable();
+	__irq_work_queue(work);
+	preempt_enable();
 
 	return true;
 }
+EXPORT_SYMBOL_GPL(irq_work_queue);
 
-/* Enqueue the irq work @work on the current CPU */
-bool irq_work_queue(struct irq_work *work)
+/*
+ * Enqueue the irq_work @work on @cpu unless it's already pending
+ * somewhere.
+ *
+ * Can be re-enqueued while the callback is still in progress.
+ */
+bool irq_work_queue_on(struct irq_work *work, int cpu)
 {
+#ifndef CONFIG_SMP
+	return irq_work_queue(work);
+
+#else /* #ifndef CONFIG_SMP */
+	/* All work should have been flushed before going offline */
+	WARN_ON_ONCE(cpu_is_offline(cpu));
+
 	/* Only queue if not already pending */
 	if (!irq_work_claim(work))
 		return false;
 
-	/* Queue the entry and raise the IPI if needed. */
 	preempt_disable();
-
-	/* If the work is "lazy", handle it from next tick if any */
-	if (work->flags & IRQ_WORK_LAZY) {
-		if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) &&
-		    tick_nohz_tick_stopped())
-			arch_irq_work_raise();
-	} else {
-		if (llist_add(&work->llnode, this_cpu_ptr(&raised_list)))
-			arch_irq_work_raise();
-	}
-
+	if (cpu != smp_processor_id()) {
+		/* Arch remote IPI send/receive backend aren't NMI safe */
+		WARN_ON_ONCE(in_nmi());
+		if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+			arch_send_call_function_single_ipi(cpu);
+	} else
+		__irq_work_queue(work);
 	preempt_enable();
 
 	return true;
+#endif /* #else #ifndef CONFIG_SMP */
 }
-EXPORT_SYMBOL_GPL(irq_work_queue);
+
 
 bool irq_work_needs_cpu(void)
 {
-- 
2.20.1


