[OpenPower-Firmware] Generate a dump of the Linux kernel on host OS (P8)

Tue Feb 26 17:36:18 AEDT 2019

Artem Senichev's on February 21, 2019 6:17 pm:
> On Wed, Feb 20, 2019 at 10:19:18PM +1000, Nicholas Piggin wrote:
>> Artem Senichev's on February 20, 2019 9:02 pm:
>> > On Tue, Feb 19, 2019 at 11:47:43PM +1000, Nicholas Piggin wrote:
>> >> Artem Senichev's on February 19, 2019 9:22 pm:
>> >> > On Fri, Apr 13, 2018 at 01:56:17PM +1000, Nicholas Piggin wrote:
>> >> >> > Artem Senichev <artemsen at gmail.com> writes:
>> >> >> > > I need the ability to generate a dump of the Linux kernel on host OS
>> >> >> > > using a command from BMC.
>> >> >> 
>> >> >> The dump will be initiated when we get a crash or sreset. We can kick
>> >> >> off a dump without using sreset. The benefit of sreset is that it can
>> >> >> be generated from the BMC, and that the host CPUs can't block it if they
>> >> >> have crashed with interrupts off.
>> >> >> 
>> >> >> My thought is that we could use libpdbg to send the sreset to the host.
>> >> >> If we could get ipmi wired up to use that for the nmi command, it should
>> >> >> work.
>> >> >> 
>> >> >> We have just been talking about this a bit more. Ramming is a bit
>> >> >> complex and has some restrictions. On P8 we can actually send a sreset,
>> >> >> but the SRR1 register may end up being incorrect. This means we can not
>> >> >> return from the interrupt and continue, but we should be able to go on
>> >> >> to take a crash dump and restart the machine.
>> >> >> 
>> >> >> Most of the P8 code is already there in skiboot to do this for fast
>> >> >> reboot as an IPI with OPAL_SIGNAL_SYSTEM_RESET (core/direct-controls.c),
>> >> >> and pdbg on the BMC has the sreset command.
>> >> > 
>> >> > Yes, in fact we don't need any patches for skiboot to get the NMI/SRESET
>> >> > functionality. Existing code works fine in most cases and handles
>> >> > SRESET signal correctly.
>> >> > 
>> >> > The entire solution includes only one patch for PDBG, that allows us to
>> >> > send SRESET signal from OpenBMC console:
>> >> > http://patchwork.ozlabs.org/patch/1038525/
>> >> > 
>> >> > The only problem I have is the case when I load the CPU's thread that should
>> >> > handle SRESET signal. If I understand right, we should send SRESET to only
>> >> > one thread on the host's CPU.
>> >> 
>> >> Linux can deal with one or more threads taking sreset. You should sreset
>> >> all, because if Linux does not see all threads getting sreset, it will 
>> >> use IPIs to bring the remaining threads in. If you are going to use P8
>> >> with no skiboot patch, then Linux will have no NMI IPI.
>> >> 
>> > 
>> > I tried to send SRESET to all threads (with '-a' option of pdbg),
>> > in this case I get a lot of kernel messages about system reset, one message
>> > per logical CPU:
>> > 
>> > cpu 0x47: Vector: 100 (System Reset) at [c000003fcac4fbd8]
>> > ...
>> > 
>> > but it stops working after that, kernel just hangs. Also, the last
>> > message says that the last CPU that received sreset is 71 (0x47),
>> > but I have 256 logical CPUs in the system.
>> 
>> Okay. It's not supposed to of course, and guest kernels under hypervisor 
>> (PowerVM or KVM) get a 0x100 interrupt on every CPU when the HV gives a 
>> crash or NMI signal.
>> 
>> Is this happening with an upstream kernel? Not running KVM?
> 
> It's a PowerNV machine, without KVM. I test the solution with vanilla
> linux kernel 5.0-rc7.
> 
>> > 
>> >> > signal.
>> >> > Steps to reproduce:
>> >> > 1. On the host's side: call `stress` for the first thread of CPU0:
>> >> >    # taskset 01 stress -c 1
>> >> > 2. From OpenBMC: send SRESET signal for the first host's thread:
>> >> >    # pdbg --backend=i2c --device=/dev/i2c-4 -p 0 -c 1 -t 0 sreset
>> >> > In this scenario, the SRESET signal is ignored and there are no messages
>> >> > in the OPAL or kernel logs. I can just stop `stress` execution with
>> >> > Ctrl-C and the system continues to work as usual. After that, I can resend
>> >> > SRESET and everything works as expected: the kernel starts the 'System Reset'
>> >> > handler and initiates a kernel reload to create the memory dump.
>> >> 
>> >> You may need to stop the thread first with pdbg. P9 requires that I 
>> >> think. Some documentation indicates it works without stopping first,
>> >> but I don't think that's the case. P8 may be similar.
>> >> 
>> >> The stop sequence in pdbg for P8 does not exactly match the workbook 
>> >> either, by the looks. It doesn't check for maint mode, it does some
>> >> funny thing for RAM mode at the end, etc. If it does not work
>> >> properly for sreset then it would be worth experimenting with that
>> >> (I would try taking out the last bit of code from p8_thread_stop() that
>> >> sets the thread active).
>> >> 
>> > 
>> > Nick, what do you mean, "stop the thread"?
>> > Is it something like Alister suggested to do in the patch
>> > "core/fast-reboot.c: Add sreset opal call":
>> > https://patchwork.ozlabs.org/patch/694794/
>> > By ramming an instruction sequence into an active thread?
>> 
>> No, I meant stop with pdbg.
> 
> That trick doesn't work, if I send sreset to the stopped thread, the signal
> is not handled.

Maybe try removing this from p8_thread_stop():

        /* Make the threads RAM thread active */
        CHECK_ERR(pib_read(&chip->target, THREAD_ACTIVE_REG, &val));
        val |= PPC_BIT(8) >> thread->id;
        CHECK_ERR(pib_write(&chip->target, THREAD_ACTIVE_REG, val));

Also try setting the thread to "prenap", see 
skiboot/core/direct-controls.c. The P8 sreset code there for fast
reboots is relatively well tested.
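
To make it clearer what the removed lines do, here is a minimal sketch of
the bit arithmetic only. PPC_BIT is written out in IBM bit numbering (bit 0
is the most significant bit); the helper names are hypothetical and are not
part of pdbg, and the 8-thread limit assumes a P8 SMT8 core.

```c
#include <assert.h>
#include <stdint.h>

/* IBM (big-endian) bit numbering: bit 0 is the most significant bit. */
#define PPC_BIT(bit) (0x8000000000000000ULL >> (bit))

/*
 * Mask for one thread's "RAM thread active" bit. For thread ids 0..7
 * the expression from p8_thread_stop() lands on IBM bits 8..15.
 */
static uint64_t ram_thread_active_mask(unsigned int thread_id)
{
    assert(thread_id < 8); /* assumption: 8 threads per P8 core */
    return PPC_BIT(8) >> thread_id;
}

/* What the quoted snippet does to the value it read back. */
static uint64_t set_ram_thread_active(uint64_t val, unsigned int thread_id)
{
    return val | ram_thread_active_mask(thread_id);
}

/* Hypothetical inverse, for experiments that want the bit left clear. */
static uint64_t clear_ram_thread_active(uint64_t val, unsigned int thread_id)
{
    return val & ~ram_thread_active_mask(thread_id);
}
```

So for thread 0 the snippet ORs in 0x0080000000000000, and for thread 7 it
ORs in 0x0001000000000000. Removing the snippet simply leaves whatever value
THREAD_ACTIVE_REG already held.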

> 
>> > Because if I stop the thread from pdbg (with the 'stop' command), the SRESET
>> > signal is not handled by the host; it has the same effect as using `stress`.
>> 
>> I'm not sure what stress is. Does nothing appear to happen? It could be
>> due to the ramming thing in the p8 stop sequence in pdbg.
> 
> `stress` is a small utility to perform stress tests (load CPUs in
> my case):
> http://manpages.ubuntu.com/manpages/cosmic/man1/stress-ng.1.html
> 
> If I load the first core up to 80%, everything works fine. At 90% it
> works from time to time. At 100% load it doesn't work at all.
> Anyway, how is it possible that NMI is ignored by kernel?

NMI can be "ignored" by the kernel if it hits when in a nap state and 
wakes up without the correct SRR1 reason bit set. Kernel can then wake 
up but not do anything with the sreset.

I don't think that sounds like the case here, more like the scom
sequences are not correct. It can be quite strange behaviour if you 
don't have everything exactly right, especially when you add power
saving to the mix.
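
For reference, the SRR1 wake-reason check mentioned above can be sketched
roughly as follows. The two constants are written from memory of
arch/powerpc/include/asm/reg.h and should be treated as assumptions to
verify against your kernel tree; the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, as recalled from asm/reg.h; verify before relying on them. */
#define SRR1_WAKEMASK_P8 0x003c0000ULL /* wake reason field on POWER8 */
#define SRR1_WAKERESET   0x00100000ULL /* wake reason: system reset */

/*
 * Returns true only if the wake reason field says "system reset".
 * If the thread wakes from nap with any other value in that field,
 * the kernel has no indication that an sreset was intended.
 */
static bool woke_for_system_reset(uint64_t srr1)
{
    return (srr1 & SRR1_WAKEMASK_P8) == SRR1_WAKERESET;
}
```

This is only meant to show why an incorrect SRR1 can make the sreset
invisible to the kernel; it is not the kernel's actual wakeup path.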

> Is it possible to get an acknowledgment on BMC side, that NMI has been sent
> or handled?

I'm not aware of anything like a synchronous notification, but if you 
STOP a thread, confirm it to be stopped, and then SRESET it and see that 
it is no longer stopped, then you can be quite sure the sreset has been 
delivered.
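
The check described above can be sketched as follows. Everything here is
hypothetical: the thread_ctl interface and the fake_thread backend are made
up for illustration and are not pdbg's real types or API. The fake backend
exists only so the logic can be exercised off-target.

```c
#include <stdbool.h>

/* Hypothetical thread-control interface; NOT pdbg's real API. */
struct thread_ctl {
    int  (*stop)(void *ctx);
    int  (*sreset)(void *ctx);
    bool (*is_stopped)(void *ctx);
    void *ctx;
};

enum sreset_result {
    SRESET_DELIVERED,     /* thread left the stopped state after sreset */
    SRESET_STOP_FAILED,   /* thread never confirmed stopped; check is meaningless */
    SRESET_SEND_FAILED,   /* the sreset request itself reported an error */
    SRESET_NOT_DELIVERED, /* thread is still stopped after sreset */
};

/* STOP, confirm stopped, SRESET, then confirm it is no longer stopped. */
static enum sreset_result sreset_with_check(const struct thread_ctl *t)
{
    if (t->stop(t->ctx) != 0 || !t->is_stopped(t->ctx))
        return SRESET_STOP_FAILED;
    if (t->sreset(t->ctx) != 0)
        return SRESET_SEND_FAILED;
    return t->is_stopped(t->ctx) ? SRESET_NOT_DELIVERED : SRESET_DELIVERED;
}

/* Simulated thread, only so the sketch can be run without hardware. */
struct fake_thread {
    bool stopped;
    bool sreset_works; /* false models an sreset that has no effect */
};

static int fake_stop(void *ctx)
{
    ((struct fake_thread *)ctx)->stopped = true;
    return 0;
}

static int fake_sreset(void *ctx)
{
    struct fake_thread *f = ctx;

    if (f->sreset_works)
        f->stopped = false;
    return 0;
}

static bool fake_is_stopped(void *ctx)
{
    return ((struct fake_thread *)ctx)->stopped;
}
```

On real hardware the final status read would probably need a short delay or
a bounded poll, since the thread may not leave the stopped state instantly;
that is left out of the sketch.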

Thanks,
Nick