[OpenPower-Firmware] Generate a dump of the Linux kernel on host OS (P8)

Artem Senichev artemsen at gmail.com
Wed Mar 6 18:51:30 AEDT 2019


On Wed, Mar 06, 2019 at 05:47:47PM +1100, Stewart Smith wrote:
> Artem Senichev <artemsen at gmail.com> writes:
> > On Tue, Feb 26, 2019 at 04:36:18PM +1000, Nicholas Piggin wrote:
> >> Artem Senichev's on February 21, 2019 6:17 pm:
> >> > On Wed, Feb 20, 2019 at 10:19:18PM +1000, Nicholas Piggin wrote:
> >> >> Artem Senichev's on February 20, 2019 9:02 pm:
> >> >> > On Tue, Feb 19, 2019 at 11:47:43PM +1000, Nicholas Piggin wrote:
> >> >> >> Artem Senichev's on February 19, 2019 9:22 pm:
> >> >> >> > On Fri, Apr 13, 2018 at 01:56:17PM +1000, Nicholas Piggin wrote:
> >> >> >> >> > Artem Senichev <artemsen at gmail.com> writes:
> >> >> >> >> > > I need the ability to generate a dump of the Linux kernel on host OS
> >> >> >> >> > > using a command from BMC.
> >> >> >> >> 
> >> >> >> >> The dump will be initiated when we get a crash or sreset. We can kick
> >> >> >> >> off a dump without using sreset. The benefit of sreset is that it can
> >> >> >> >> be generated from the BMC, and that the host CPUs can't block it if they
> >> >> >> >> have crashed with interrupts off.
> >> >> >> >> 
> >> >> >> >> My thought is that we could use libpdbg to send the sreset to the host.
> >> >> >> >> If we could get ipmi wired up to use that for the nmi command, it should
> >> >> >> >> work.
> >> >> >> >> 
> >> >> >> >> We have just been talking about this a bit more. Ramming is a bit
> >> >> >> >> complex and has some restrictions. On P8 we can actually send a sreset,
> >> >> >> >> but the SRR1 register may end up being incorrect. This means we can not
> >> >> >> >> return from the interrupt and continue, but we should be able to go on
> >> >> >> >> to take a crash dump and restart the machine.
> >> >> >> >> 
> >> >> >> >> Most of the P8 code is already there in skiboot to do this for fast
> >> >> >> >> reboot as an IPI with OPAL_SIGNAL_SYSTEM_RESET (core/direct-controls.c),
> >> >> >> >> and pdbg on the BMC has the sreset command.
> >> >> >> > 
> >> >> >> > Yes, in fact we don't need any patches for skiboot to get the NMI/SRESET
> >> >> >> > functionality. Existing code works fine in most cases and handles
> >> >> >> > SRESET signal correctly.
> >> >> >> > 
> >> >> >> > The entire solution includes only one patch for PDBG, that allows us to
> >> >> >> > send SRESET signal from OpenBMC console:
> >> >> >> > http://patchwork.ozlabs.org/patch/1038525/
> >> >> >> > 
> >> >> >> > The only problem I have is the case when I load the CPU thread that should
> >> >> >> > handle the SRESET signal. If I understand correctly, we should send SRESET to
> >> >> >> > only one thread on the host's CPU.
> >> >> >> 
> >> >> >> Linux can deal with one or more threads taking sreset. You should sreset
> >> >> >> all, because if Linux does not see all threads getting sreset, it will 
> >> >> >> use IPIs to bring the remaining threads in. If you are going to use P8
> >> >> >> with no skiboot patch, then Linux will have no NMI IPI.
> >> >> >> 
> >> >> > 
> >> >> > I tried to send SRESET to all threads (with '-a' option of pdbg),
> >> >> > in this case I get a lot of kernel messages about system reset, one message
> >> >> > per logical CPU:
> >> >> > 
> >> >> > cpu 0x47: Vector: 100 (System Reset) at [c000003fcac4fbd8]
> >> >> > ...
> >> >> > 
> >> >> > but it stops working after that, kernel just hangs. Also, the last
> >> >> > message says that the last CPU that received sreset is 71 (0x47),
> >> >> > but I have 256 logical CPUs in the system.
> >> >> 
> >> >> Okay. It's not supposed to of course, and guest kernels under hypervisor 
> >> >> (PowerVM or KVM) get a 0x100 interrupt on every CPU when the HV gives a 
> >> >> crash or NMI signal.
> >> >> 
> >> >> Is this happening with an upstream kernel? Not running KVM?
> >> > 
> >> > It's a PowerNV machine, without KVM. I test the solution with vanilla
> >> > linux kernel 5.0-rc7.
> >> > 
> >> >> > 
> >> >> >> > Steps to reproduce:
> >> >> >> > 1. On the host's side: call `stress` for the first thread of CPU0:
> >> >> >> >    # taskset 01 stress -c 1
> >> >> >> > 2. From OpenBMC: send SRESET signal for the first host's thread:
> >> >> >> >    # pdbg --backend=i2c --device=/dev/i2c-4 -p 0 -c 1 -t 0 sreset
> >> >> >> > In this scenario, the SRESET signal is ignored and there are no
> >> >> >> > messages in the OPAL or kernel logs. I can just stop `stress` with
> >> >> >> > Ctrl-C and the system continues to work as usual. After that, I can resend
> >> >> >> > SRESET and everything works as expected: the kernel starts the 'System Reset'
> >> >> >> > handler and initiates a kernel reload to create the memory dump.
> >> >> >> 
> >> >> >> You may need to stop the thread first with pdbg. P9 requires that I 
> >> >> >> think. Some documentation indicates it works without stopping first,
> >> >> >> but I don't think that's the case. P8 may be similar.
> >> >> >> 
> >> >> >> The stop sequence in pdbg for P8 does not exactly match the workbook 
> >> >> >> either, by the looks. It doesn't check for maint mode, it does some
> >> >> >> funny thing for RAM mode at the end, etc. If it does not work
> >> >> >> properly for sreset then it would be worth experimenting with that
> >> >> >> (I would try taking out the last bit of code from p8_thread_stop() that
> >> >> >> sets the thread active).
> >> >> >> 
> >> >> > 
> >> >> > Nick, what do you mean, "stop the thread"?
> >> >> > Is it something like Alister suggested to do in the patch
> >> >> > "core/fast-reboot.c: Add sreset opal call":
> >> >> > https://patchwork.ozlabs.org/patch/694794/
> >> >> > By ramming an instruction sequence into an active thread?
> >> >> 
> >> >> No, I meant stop with pdbg.
> >> > 
> >> > That trick doesn't work: if I send sreset to the stopped thread, the signal
> >> > is not handled.
> >> 
> >> Maybe try removing this from p8_thread_stop
> >> 
> >>         /* Make the threads RAM thread active */
> >>         CHECK_ERR(pib_read(&chip->target, THREAD_ACTIVE_REG, &val));
> >>         val |= PPC_BIT(8) >> thread->id;
> >>         CHECK_ERR(pib_write(&chip->target, THREAD_ACTIVE_REG, val));
> >> 
> >> Also try setting the thread to "prenap", see 
> >> skiboot/core/direct-controls.c.
> >
> > I've tried all combinations; the only one that worked is simply sending SRESET,
> > even without stopping the thread.
> >
> > Here are the results (with 'prenap' set and the mentioned code removed from
> > p8_thread_stop):
> >
> > 1. 0% CPU load, send sreset without stopping the thread: everything works as
> >    expected, I get "cpu 0x0: Vector: 100 (System Reset)";
> >
> > 2. 0% CPU load, stop the thread and send sreset: signal is ignored by the
> >    kernel;
> >
> > 3. 100% CPU load, send sreset without stopping the thread: the SRESET signal is
> >    ignored, and the user-mode application (stress, which loads the CPU) fails:
> >    "cpu 0x0: Vector: 400 (Instruction Access)";
> >
> > 4. 100% CPU load, stop the thread and send sreset: same as the previous case, but
> >    with another exception: "cpu 0x0: Vector: e40 (Emulation Assist)".
> >
> > If I stop all threads and send sreset, the system (kernel) hangs.
> >
> >> The P8 sreset code there for fast reboots is relatively well tested.
> >
> > Sorry, I can't agree with that. Fast reboot on P8 has been broken since Linux
> > kernel 4.17:
> > https://github.com/open-power/skiboot/issues/185
> 
> So, in trying to delve into this, I went and modified op-test to do
> something that I suspected could be similar (or at least I'd like to
> rule out): https://github.com/open-power/op-test-framework/pull/437
> 
> So, the good news is I've managed to reproduce the problem in a
> relatively simple setting.
> 
> The bad news is that I've also made it go away without changing anything
> that should affect it... So... umm... fun times :)

That's strange, because the fast reboot problem reproduces consistently
with any Linux kernel 4.17+, even without running 'stress'.
The main problem is different: while 'stress' is loading the target thread,
an SRESET sent from the BMC is not handled.
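
For reference, here is a minimal sketch of which bit the quoted
p8_thread_stop() snippet sets in THREAD_ACTIVE_REG for a given thread.
It assumes PPC_BIT() follows the usual IBM MSB-0 numbering (bit 0 is the
most significant bit of a 64-bit word) and that a P8 core has 8 threads;
thread_active_mask() is a hypothetical helper written for illustration,
not a pdbg function.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed definition: IBM (MSB-0) bit numbering for a 64-bit word. */
#define PPC_BIT(bit) (0x8000000000000000ULL >> (bit))

/* Assumption: 8 hardware threads per P8 core (SMT8). */
#define P8_THREADS_PER_CORE 8

/*
 * Hypothetical helper: returns the mask that the quoted code ORs into
 * THREAD_ACTIVE_REG, i.e. PPC_BIT(8) >> thread_id, which is the same
 * as PPC_BIT(8 + thread_id).
 */
static uint64_t thread_active_mask(unsigned int thread_id)
{
	assert(thread_id < P8_THREADS_PER_CORE);
	return PPC_BIT(8) >> thread_id;
}

/* Hypothetical helper: clears that bit in a previously read value. */
static uint64_t clear_thread_active(uint64_t val, unsigned int thread_id)
{
	return val & ~thread_active_mask(thread_id);
}
```

With these assumptions, thread 0 maps to 0x0080000000000000 and thread 7
to 0x0001000000000000, so removing the quoted lines simply leaves those
bits as they were read.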

> I'll keep digging and keep places updated

If I can help, just let me know :)
Thanks!

-- 
Regards,
Artem Senichev
Software Engineer, YADRO.
