[Skiboot] [PATCH v7 18/22] fadump: Add documentation

Oliver oohall at gmail.com
Wed May 29 16:39:32 AEST 2019


On Tue, May 28, 2019 at 10:46 PM Vasant Hegde
<hegdevasant at linux.vnet.ibm.com> wrote:
>
> On 05/20/2019 11:50 AM, Oliver wrote:
> > *snip*

> Are you sure you are counting it properly? I suspect some of these numbers.
> Either the numbers are wrong -OR- we messed up badly.
>
> > +---------------------------------+-------+
> > |            OPAL Call            | Count |
> > +---------------------------------+-------+
> > | OPAL_READ_NVRAM                 |  4612 |
>
> Why are we doing so many NVRAM calls?

Apparently the kernel reads the whole of NVRAM in 200 byte chunks. The
default NVRAM size seems to be 576KB on OpenPower systems and 1MB on
FSP systems, so you end up doing a lot of reads.
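
Back-of-the-envelope, that chunking alone explains a count in the
thousands. A minimal sketch of the arithmetic (the 200 byte chunk size
and the NVRAM sizes are just the figures above; this isn't the kernel's
actual read path):

#include <stdint.h>
#include <stdio.h>

#define CHUNK   200     /* bytes per OPAL_READ_NVRAM call, per the figure above */

/* one OPAL call per chunk, rounding up */
static unsigned int nvram_read_calls(uint64_t nvram_size)
{
        return (nvram_size + CHUNK - 1) / CHUNK;
}

int main(void)
{
        printf("576KB NVRAM: ~%u calls\n", nvram_read_calls(576 * 1024));   /* ~2950 */
        printf("1MB NVRAM:   ~%u calls\n", nvram_read_calls(1024 * 1024));  /* ~5243 */
        return 0;
}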

> > | OPAL_FLASH_READ                 |    31 |
> > | OPAL_PCI_SET_PELTV              |    23 |
> > | OPAL_PCI_SET_PE                 |    16 |
> > | OPAL_REGISTER_DUMP_REGION       |    15 |
>
> It should be 1. We just make one call from Linux for registration.

The script I used to map the counts to the OPAL call names was a bit
dumb. OPAL call tokens 46, 47 and 48 are missing from opal-api.h, which
threw off the counts for all the call numbers above 45. All the
high-frequency calls (nvram, pci cfg access) have tokens below #45 so
their counts are accurate. Either way, the point is that we do a lot
of OPAL calls at boot already so there's no reason to prefer an API
that minimises the number of calls.
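
(For reference, the fix for the script is just to key the map on the
token value rather than on the position of each definition in the
header. A tiny illustration, with made-up token names and values, not
ones copied from opal-api.h:)

#include <stddef.h>
#include <stdio.h>

struct opal_call_name {
        unsigned int token;
        const char *name;
};

/* token values and most names here are placeholders; the real ones live in opal-api.h */
static const struct opal_call_name call_names[] = {
        { 7,  "OPAL_READ_NVRAM" },
        { 45, "OPAL_CALL_BELOW_THE_GAP" },
        { 49, "OPAL_CALL_ABOVE_THE_GAP" },      /* 46-48 deliberately absent */
};

/* look up by token value, so gaps in the numbering can't shift the names */
static const char *token_to_name(unsigned int token)
{
        size_t i;

        for (i = 0; i < sizeof(call_names) / sizeof(call_names[0]); i++)
                if (call_names[i].token == token)
                        return call_names[i].name;
        return "(unknown token)";
}

int main(void)
{
        printf("token 49 -> %s\n", token_to_name(49));
        return 0;
}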

> > I'd say we gain nothing from doing one OPAL call.
>
> Well, I don't think performance is the concern here...as this new API is called
> during FADUMP registration only (once during boot).

Right, so as Mahesh said, the real reason for the single call + giant
structure was to make configuring the dump tables an atomic operation.
As far as I can tell the current implementation parses each
(src, dst, size) entry from the structure and updates the MDST and
MDDT entry-by-entry, so it's not atomic either. The race window is
pretty small so I'm not sure how much I care about it, but if you
really want it to be atomic you would need to follow the suggestions
in the HDAT spec about how to update the table pointers, since they
need to be valid at all times.
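
For the record, the pointer-swap approach would look something like the
sketch below. The structure layout and names are made up for
illustration; the real MDST/MDDT formats are whatever the HDAT spec
says they are.

#include <stdint.h>
#include <string.h>

struct dump_entry {
        uint64_t src;
        uint64_t dst;
        uint64_t size;
};

#define MAX_DUMP_ENTRIES 64

struct dump_table {
        uint32_t num_entries;
        struct dump_entry entries[MAX_DUMP_ENTRIES];
};

/*
 * The consumer always follows this pointer, so whatever it points at has
 * to be a complete, valid table at every instant.
 */
static struct dump_table *active_table;

static int update_dump_table(const struct dump_entry *new_entries,
                             uint32_t count, struct dump_table *scratch)
{
        if (count > MAX_DUMP_ENTRIES)
                return -1;

        /* 1. Build the complete replacement table off to the side. */
        memcpy(scratch->entries, new_entries, count * sizeof(*new_entries));
        scratch->num_entries = count;

        /*
         * 2. Publish it with a single pointer store. A real implementation
         * needs a barrier before this store and can only recycle the old
         * table once nothing can still be reading it.
         */
        active_table = scratch;

        return 0;
}

int main(void)
{
        static struct dump_table table_a, table_b;
        struct dump_entry entry = { 0x1000, 0x2000, 0x100 };

        active_table = &table_a;                 /* start with an empty table */
        update_dump_table(&entry, 1, &table_b);  /* publish the replacement */

        return active_table->num_entries == 1 ? 0 : 1;
}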

> >> I'm still pretty set on no structure and no metadata (except one tag
> >> that has no predefined semantics to the MPIPL layer).
> >>
> >> That's the minimum necessary and sufficient for a "preserve memory
> >> across reboot" facility to support Linux crash dumps, right?
> >
> > I think we can (and probably need to) do better than the minimum. For
> > the current design the "good" path is something like:
> >
> > Old kernel does a bad -> MPIPL request -> *magic occurs* -> hostboot
> > -> skiboot -> petitboot -> ???
>
> Register during Linux boot -> MPIPL request -> ..... -> petitboot -> Host Linux
>
> For now we are adding support for Host Linux. In future we will add support
> for the petitboot kernel as well. So during petitboot kernel load it will register for
> FADUMP and, before doing kexec, it will call unregister, so that host Linux can
> register again.

What's the difference between supporting this in the petitboot and host kernels?

> > I'm wondering what we can safely do once we hit the final step. As far
> > as I can tell the intention is to boot into the same kernel that we
> > crashed from so that it can run makedumpsterfire to produce a
> > crashdump, invalidate the dump, and continue to boot into a
> > functioning OS. However I don't see how we'd actually guarantee that
> > happens. I realise that it's *probably* going to work most of
> > the time since we'll probably be running the same kernel that's the
> > default boot option, but surely we can come up with something that's
> > less jank.
>
> Yes. The idea is to boot back into the same kernel. Since we are initializing everything
> again, most of the time it will work fine (much better than the kdump situation).

I'm not really convinced that MPIPL is drastically better than kdump.
The main reason kdump doesn't work well is GPUs with NVLink. As far as
I can tell MPIPL doesn't do anything to help there.

> > For contrast the kdump approach allows the crashing kernel to specify
> > what the crash environment is going to look like. If I were an OS
> > vendor I'd say that's a pretty compelling reason to use kdump instead
> > of this. If the main benefit of fadump is that we can reliably reset
> > and reinitialise hardware devices then maybe we should look at trying
> > to use MPIPL as an alternative kdump entry path. Rather than having
> > skiboot load petitboot from flash, we could have skiboot enter the
> > preloaded crash kernel and go from there.
>
> preloaded crash kernel is similar to kdump, right? I don't think we gain
> much from this approach.

Can you actually respond to what I'm saying rather than dismissing it
out of hand with a non-argument?

If we can use MPIPL to make kdump more robust then I think we should
do that rather than having a completely separate mechanism to capture
a crash dump. One of the goals of OpenPower is to have tooling and
processes that are consistent with what is used by the rest of the
industry rather than inventing IBM-specific ways of doing everything.
So why are we doing this instead? I'm not saying that there are no good
reasons to take the approach you have, but you, Hari and Mahesh need
to do a better job of spelling them out. Have you spoken with anyone
from SuSE or RH about what they would prefer?

> Also we will not be able to capture skiboot dumps.

Why not? Seems entirely possible to me.
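
e.g. if skiboot added its own memory ranges to the preserved-memory
list at registration time, they'd come back across the MPIPL along with
the kernel's regions and the dump tooling could extract them. A rough
sketch of the idea; the helper and the addresses below are stand-ins,
not skiboot's actual interface or memory layout:

#include <stdint.h>
#include <stdio.h>

struct preserve_range {
        uint64_t src;
        uint64_t dst;
        uint64_t size;
};

static struct preserve_range ranges[16];
static unsigned int nr_ranges;

/* stand-in for however the MDST/MDDT update actually ends up being done */
static int mpipl_add_range(uint64_t src, uint64_t dst, uint64_t size)
{
        if (nr_ranges >= 16)
                return -1;
        ranges[nr_ranges].src = src;
        ranges[nr_ranges].dst = dst;
        ranges[nr_ranges].size = size;
        nr_ranges++;
        return 0;
}

int main(void)
{
        /* placeholder addresses for skiboot's image and a destination buffer */
        uint64_t skiboot_base = 0x30000000ULL;
        uint64_t skiboot_size = 2ULL << 20;
        uint64_t dump_dest    = 0x80000000ULL;

        /* preserve skiboot's text/data/heap so it shows up in the dump */
        mpipl_add_range(skiboot_base, dump_dest, skiboot_size);
        printf("preserving %u range(s)\n", nr_ranges);
        return 0;
}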

> @Mahesh, Hari can explain the kdump issue better.
>
> -Vasant
>

