[Skiboot] [PATCH v7 18/22] fadump: Add documentation

Tue Jun 4 20:51:32 AEST 2019

On Thu, May 30, 2019 at 3:54 PM Vasant Hegde
<hegdevasant at linux.vnet.ibm.com> wrote:
>
> On 05/29/2019 12:09 PM, Oliver wrote:
> > On Tue, May 28, 2019 at 10:46 PM Vasant Hegde
> > <hegdevasant at linux.vnet.ibm.com> wrote:
> >>
> >> On 05/20/2019 11:50 AM, Oliver wrote:
> >>> *snip*
> >
>
> .../...
>
> >
> >>> I'd say we gain nothing from doing one OPAL call.
> >>
> >> Well, I don't think performance is the concern here...as this new API is called
> >> during FADUMP registration only (once during boot).
> >
> > Right, so as Mahesh said the real reason for the single call + giant
> > structure was to make configuring the dump tables an atomic operation.
> > As far as I can tell the current implementation parses each
> > (src, dst, size) entry from the structure and updates the MDST and
> > MDDT entry-by-entry so it's not atomic either. The race window is
> > pretty small so I'm not sure how much I care about it, but if you
> > really want it to be atomic you would need to follow the suggestions
> > in the HDAT spec about how to update the table pointers since they
> > need to be valid at all times.
>
> Yes. The race window is close to NIL.  I think it's safe to update entry by entry.
> Also I'm making sure HDAT pointers are valid all the time.

Mahesh seemed to think the table should be updated atomically. I don't
really mind though.
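
For reference, one way to make the update atomic is to build the new
table off to the side and then publish it with a single pointer store,
so a reader always sees either the old table or the new one. This is
only a rough sketch, not skiboot code: the entry layout, the names
(dump_table, publish_table, live_table) and the fixed table size are
all made up for illustration, and the real MDST/MDDT formats and the
rules for updating the pointers come from the HDAT spec.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical entry layout, for illustration only. */
struct dump_entry {
    uint64_t src;
    uint64_t dst;
    uint64_t size;
};

#define MAX_ENTRIES 8

struct dump_table {
    size_t count;
    struct dump_entry entries[MAX_ENTRIES];
};

/* Two buffers: one is live, the other is scratch space for the next update. */
static struct dump_table tables[2];
static _Atomic(struct dump_table *) live_table = &tables[0];

/*
 * Fill the inactive buffer, then switch the live pointer in one store.
 * The live pointer always refers to a fully built table.
 */
static int publish_table(const struct dump_entry *new_entries, size_t count)
{
    if (count > MAX_ENTRIES)
        return -1;

    struct dump_table *cur = atomic_load(&live_table);
    struct dump_table *next = (cur == &tables[0]) ? &tables[1] : &tables[0];

    next->count = count;
    memcpy(next->entries, new_entries, count * sizeof(*new_entries));

    /* Single store: readers see the old table or the new one, never a mix. */
    atomic_store(&live_table, next);
    return 0;
}
```

Updating entry-by-entry in place is simpler, which is why I don't feel
strongly about it given how small the window is.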

> >>>> I'm still pretty set on no structure and no metadata (except one tag
> >>>> that has no predefined semantics to the MPIPL layer).
> >>>>
> >>>> That's the minimum necessary and sufficient for a "preserve memory
> >>>> across reboot" facility to support Linux crash dumps, right?
> >>>
> >>> I think we can (and probably need to) do better than the minimum. For
> >>> the current design the "good" path is something like:
> >>>
> >>> Old kernel does a bad -> MPIPL request -> *magic occurs* -> hostboot
> >>> -> skiboot -> petitboot -> ???
> >>
> >> Register during Linux boot -> MPIPL request-> .....-> petitboot -> Host Linux
> >>
> >> For now we are adding support for Host Linux. In future we will add support
> >> for petitboot kernel as well. So during petitboot kernel load it will register for
> >> FADUMP and before doing kexec, it will call unregister .. So that host linux can
> >> register again.
> >
> > What's the difference between supporting this in the petitboot and host kernels?
>
> Technically there is no difference. It's the same code. But we have to see how to
> offload petitboot crashes.

We have the pb-sos tool to capture debug information from the
petitboot environment. You might be able to extend that to support
creating a dump file if a vmcore exists. You should talk to Sam and
see what he thinks.

That said, if we're going to have petitboot initiate MPIPLs then we
need to differentiate between a dump that's intended to be collected
by petitboot and a dump that's intended to be collected further down
the line. How are we going to do that?

> >>> I'm wondering what we can safely do once we hit the final step. As far
> >>> as I can tell the intention is to boot into the same kernel that we
> >>> crashed from so that it can run makedumpsterfire to produce a
> >>> crashdump, invalidate the dump, and continue to boot into a
> >>> functioning OS. However I don't see how we'd guarantee that
> >>> actually happens. I realise that it's *probably* going to work most of
> >>> the time since we'll probably be running the same kernel that's the
> >>> default boot option, but surely we can come up with something that's
> >>> less jank.
> >>
> >> Yes. Idea is to boot back the same kernel. Since we are initializing everything
> >> again most of the time it will work fine (much better than kdump situation).
> >
> > I'm not really convinced that MPIPL is drastically better than kdump.
> > The main reason kdump doesn't work well is GPUs with NVlink. As far as
> > I can tell MPIPL doesn't do anything to help there.
>
> As Hari mentioned kdump has many other issues. GPU with NVLink is one such
> issue. MPIPL is much better here.. as firmware will take care of preserving
> memory. And we do fresh boot of same kernel.
>
> >>> For contrast the kdump approach allows the crashing kernel to specify
> >>> what the crash environment is going to look like. If I were an OS
> >>> vendor I'd say that's a pretty compelling reason to use kdump instead
> >>> of this. If the main benefit of fadump is that we can reliably reset
> >>> and reinitialise hardware devices then maybe we should look at trying
> >>> to use MPIPL as an alternative kdump entry path. Rather than having
> >>> skiboot load petitboot from flash, we could have skiboot enter the
> >>> preloaded crash kernel and go from there.
> >>
> >> preloaded crash kernel is similar to kdump right? I don't think we gain
> >> much from this approach.
> >
> > Can you actually respond to what I'm saying rather than dismissing it
> > out of hand with a non-argument?
>
> Sure. preloaded crash kernel will have same limitation as kdump..
> because we are not re-initializing hardware. Also with this approach
> we can't collect skiboot dump.. as skiboot itself may have crashed.
>
> OTOH with MPIPL once system crashes freshly loaded hostboot will take care
> of preserving memory. So we can collect skiboot dump as well.
> And we are doing fresh boot (with memory preserved). So most of the time
> system will boot back properly.

See my response to Hari.

> > If we can use MPIPL to make kdump more robust then I think we should
> > do that rather than having a completely separate mechanism to capture
> > a crash dump. One of the goals of OpenPower is to have tooling and
> > processes that are consistent with what is used by the rest of the
> > industry rather than inventing IBM specific ways of doing everything.
> > So why are we doing this instead? I'm not saying that there's no good
> > reasons to take the approach you have, but you, Hari and Mahesh need
> > to do a better job of spelling them out. Have you spoken with anyone
> > from SuSE or RH about what they would prefer?
>
> I think I failed to elaborate on the tooling part in the cover page.
>
> All these changes are internal to kernel/firmware.
>
> We are retaining existing dump format (vmcore format for kernel and ELF core
> format for opalcore). There is no change in user space tools. Existing crash/gdb
> will work fine. That means there is no impact for end user. They can continue
> to use same tools.
>
> FADUMP is not new in kernel. We already have similar feature in PowerVM world.

Yeah and this is my main problem with the series. It seems to me that
the proposed design (originally, the updates are better!) is mostly
copied from fadump under PowerVM without considering whether it makes
sense to do so. RTAS calls are done by shoving the parameters into a
buffer so passing around large structures sort of makes sense there.
However, it makes a lot less sense for a register based interface like
the OPAL API. There are cases where we pass around structures to and
from OPAL, but they're the exceptions. Similarly, in an LPAR the only
thing that runs before the kernel is OF and grub, both of which are
fairly self contained. On PowerNV the normal boot process requires
booting a full linux system so it's not really the same scenario.
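
To make the interface point concrete, here is a rough sketch of the two
call styles. Everything in it is hypothetical (add_region,
add_region_list, struct region_list and the fixed limits are invented
for illustration and are not the OPAL or RTAS API): the first function
takes its arguments by value, the way a register based call would, and
the second takes a pointer to a structure that the caller has filled in
memory.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_REGIONS 4

struct region {
    uint64_t src;
    uint64_t dst;
    uint64_t size;
};

/* Stand-in for whatever state the firmware would keep. */
static struct region registered[MAX_REGIONS];
static size_t nr_registered;

/* Register-style: one region per call, every argument passed by value. */
static int64_t add_region(uint64_t src, uint64_t dst, uint64_t size)
{
    if (size == 0 || nr_registered >= MAX_REGIONS)
        return -1;
    registered[nr_registered++] = (struct region){ src, dst, size };
    return 0;
}

/* Buffer-style: the caller marshals everything into one structure. */
struct region_list {
    uint32_t count;
    struct region regions[MAX_REGIONS];
};

static int64_t add_region_list(const struct region_list *list)
{
    if (list->count > MAX_REGIONS)
        return -1;
    /* The callee still ends up walking the list entry by entry. */
    for (uint32_t i = 0; i < list->count; i++) {
        const struct region *r = &list->regions[i];
        if (add_region(r->src, r->dst, r->size) != 0)
            return -1;
    }
    return 0;
}
```

With the by-value form there is no shared structure layout to version
or validate, which is most of why I prefer it for OPAL.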

I'm not saying that we *need* to do things the way I'm suggesting. I
just think it would be good if we spent more time looking at options
beyond whatever we do on pseries today.

Oliver