[PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

Fri Apr 3 11:37:40 AEDT 2020

On Thu, 2020-04-02 at 22:28 +1100, Michael Ellerman wrote:
> Leonardo Bras <leonardo at linux.ibm.com> writes:
> > During a crash, there is chance that the cpus that handle the NMI IPI
> > are holding a spin_lock. If this spin_lock is needed by crashing_cpu it
> > will cause a deadlock. (rtas.lock and printk logbuf_lock as of today)
> > 
> > This is a problem if the system has kdump set up, given if it crashes
> > for any reason kdump may not be saved for crash analysis.
> > 
> > After NMI IPI is sent to all other cpus, force unlock all spinlocks
> > needed for finishing crash routine.
> 
> I'm not convinced this is the right approach.

Me neither. I think it's a very hacky solution, but I couldn't think of
anything better at the time.

> Busting locks is risky, it could easily cause a crash if data structures
> are left in some inconsistent state.
> 
> I think we need to make this code more careful about what it's doing.
> There's a clue at the top of default_machine_crash_shutdown(), which
> calls crash_kexec_prepare_cpus():
> 
> 	 * This function is only called after the system
> 	 * has panicked or is otherwise in a critical state.
> 	 * The minimum amount of code to allow a kexec'd kernel
> 	 * to run successfully needs to happen here.
> 
> 
> You said the "IPI complete" message was the cause of one lockup:
> 
>   #0  arch_spin_lock 
>   #1  do_raw_spin_lock 
>   #2  __raw_spin_lock 
>   #3  _raw_spin_lock 
>   #4  vprintk_emit 
>   #5  vprintk_func
>   #7  crash_kexec_prepare_cpus 
>   #8  default_machine_crash_shutdown
>   #9  machine_crash_shutdown 
>   #10 __crash_kexec
>   #11 crash_kexec
>   #12 oops_end
> 
> TBH I think we could just drop that printk() entirely.
> 
> Or we could tell printk() that we're in NMI context so that it uses the
> percpu buffers.
> 
> We should probably do the latter anyway, in case there's any other code
> we call that inadvertently calls printk().
> 

I was not aware of using per-cpu buffers in printk. It may be a nice
solution.

There is another printk call there:
printk("kexec: Starting switchover sequence.\n");
in default_machine_kexec().

Two printk and one rtas call: it's all I could see using a spinlock
after IPI Complete.

> 
> The RTAS trace you sent was:
> 
>   #0 arch_spin_lock
>   #1  lock_rtas () 
>   #2  rtas_call (token=8204, nargs=1, nret=1, outputs=0x0)
>   #3  ics_rtas_mask_real_irq (hw_irq=4100) 
>   #4  machine_kexec_mask_interrupts
>   #5  default_machine_crash_shutdown
>   #6  machine_crash_shutdown 
>   #7  __crash_kexec
>   #8  crash_kexec
>   #9  oops_end
> 
> 
> Which doesn't make it clear who holds the RTAS lock. We really shouldn't
> be crashing while holding the RTAS lock, but I guess it could happen.
> Can you get a full backtrace?
> 

Oh, all traces are from the thread that called the crash, by writing
'c' to sysrq. That is what I am using to reproduce.

#10 bad_page_fault
#11 handle_page_fault
#12 __handle_sysrq (key=99, check_mask=false) 
#13 write_sysrq_trigger 
#14 proc_reg_write
#15 __vfs_write
#16 vfs_write
#17 SYSC_write
#18 SyS_write
#19 system_call

> 
> PAPR says we are not allowed to have multiple CPUs calling RTAS at once,
> except for a very small list of RTAS calls. So if we bust the RTAS lock
> there's a risk we violate that part of PAPR and crash even harder.

Interesting, I was not aware.

> 
> Also it's not specific to kdump, we can't even get through a normal
> reboot if we crash with the RTAS lock held.
> 
> Anyway here's a patch with some ideas. That allows me to get from a
> crash with the RTAS lock held through kdump into the 2nd kernel. But it
> only works if it's the crashing CPU that holds the RTAS lock.
> 

Nice idea. 
But my test environment is just triggering a crash from sysrq, so I
think it would not improve the result, given that this thread is
probably not holding the lock by the time.

I noticed that when rtas is locked, irqs and preemption are also
disabled.

Should the IPI send by crash be able to interrupt a thread with
disabled irqs?

Best regards,
Leonardo Bras

> cheers
> 
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index c5fa251b8950..44ce74966d60 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -25,6 +25,7 @@
>  #include <linux/reboot.h>
>  #include <linux/syscalls.h>
>  
> +#include <asm/debugfs.h>
>  #include <asm/prom.h>
>  #include <asm/rtas.h>
>  #include <asm/hvcall.h>
> @@ -65,6 +66,8 @@ unsigned long rtas_rmo_buf;
>  void (*rtas_flash_term_hook)(int);
>  EXPORT_SYMBOL(rtas_flash_term_hook);
>  
> +static int rtas_lock_holder = -1;
> +
>  /* RTAS use home made raw locking instead of spin_lock_irqsave
>   * because those can be called from within really nasty contexts
>   * such as having the timebase stopped which would lockup with
> @@ -76,7 +79,20 @@ static unsigned long lock_rtas(void)
>  
>  	local_irq_save(flags);
>  	preempt_disable();
> -	arch_spin_lock(&rtas.lock);
> +
> +	if (!arch_spin_trylock(&rtas.lock)) {
> +		// Couldn't get the lock, do we already hold it?
> +		if (rtas_lock_holder == smp_processor_id())
> +			// Yes, so we would have deadlocked on ourself. Assume
> +			// we're crashing and continue on hopefully ...
> +			return flags;
> +
> +		// No, wait on the lock
> +		arch_spin_lock(&rtas.lock);
> +	}
> +
> +	rtas_lock_holder = smp_processor_id();
> +
>  	return flags;
>  }
>  
> @@ -85,6 +101,8 @@ static void unlock_rtas(unsigned long flags)
>  	arch_spin_unlock(&rtas.lock);
>  	local_irq_restore(flags);
>  	preempt_enable();
> +
> +	rtas_lock_holder = -1;
>  }
>  
>  /*
> @@ -1263,3 +1281,24 @@ void rtas_take_timebase(void)
>  	timebase = 0;
>  	arch_spin_unlock(&timebase_lock);
>  }
> +
> +static int rtas_crash_set(void *data, u64 val)
> +{
> +	printk("%s: Taking RTAS lock and then crashing ...\n", __func__);
> +	lock_rtas();
> +
> +	*((volatile int *) 0) = 0;
> +
> +	return 0;
> +}
> +
> +DEFINE_DEBUGFS_ATTRIBUTE(fops_rtas_crash, NULL, rtas_crash_set, "%llu\n");
> +
> +static __init int rtas_crash_debugfs_init(void)
> +{
> +	debugfs_create_file_unsafe("crash_in_rtas", 0200,
> +				   powerpc_debugfs_root, NULL,
> +				   &fops_rtas_crash);
> +	return 0;
> +}
> +device_initcall(rtas_crash_debugfs_init);
> diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c
> index d488311efab1..4c52cb58e889 100644
> --- a/arch/powerpc/kexec/crash.c
> +++ b/arch/powerpc/kexec/crash.c
> @@ -15,6 +15,7 @@
>  #include <linux/crash_dump.h>
>  #include <linux/delay.h>
>  #include <linux/irq.h>
> +#include <linux/printk.h>
>  #include <linux/types.h>
>  
>  #include <asm/processor.h>
> @@ -311,6 +312,8 @@ void default_machine_crash_shutdown(struct pt_regs *regs)
>  	unsigned int i;
>  	int (*old_handler)(struct pt_regs *regs);
>  
> +	printk_nmi_enter();
> +
>  	/*
>  	 * This function is only called after the system
>  	 * has panicked or is otherwise in a critical state.
> 
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part
URL: <http://lists.ozlabs.org/pipermail/linuxppc-dev/attachments/20200402/64bc1ce3/attachment.sig>