[Skiboot] [PATCH v7 18/22] fadump: Add documentation

Thu May 30 15:54:42 AEST 2019

On 05/29/2019 12:09 PM, Oliver wrote:
> On Tue, May 28, 2019 at 10:46 PM Vasant Hegde
> <hegdevasant at linux.vnet.ibm.com> wrote:
>>
>> On 05/20/2019 11:50 AM, Oliver wrote:
>>> *snip*
> 

.../...

> 
>>> I'd say we gain nothing from doing one OPAL call.
>>
>> Well, I don't think performance is the concern here...as this new API is called
>> during FADUMP registration only (once during boot).
> 
> Right, so as Mahesh said the real reason for the single call + giant
> structure was to make configuring the dump tables an atomic operation.
> As far as I can tell the current implementation parses each
> (src, dst, size) entry from the structure and updates the MDST and
> MDDT entry-by-entry so it's not atomic either. The race window is
> pretty small so I'm not sure how much I care about it, but if you
> really want it to be atomic you would need to follow the suggestions
> in the HDAT spec about how to update the table pointers since they
> need to be valid at all times.

Yes. The race window is close to nil. I think it's safe to update entry by entry.
Also, I'm making sure the HDAT pointers are valid at all times.
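
To illustrate the entry-by-entry update, here is a minimal sketch. It is not
the actual skiboot code. The structure layout, the names (dump_entry,
dump_table, dump_table_add) and the capacity are all hypothetical, and a
single writer is assumed. Each (src, dst, size) entry is written in full
before the count is published, so a reader never sees a half-written entry.
The real MDST/MDDT update must additionally follow whatever the HDAT spec
requires for the table pointers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DUMP_MAX_ENTRIES 8	/* hypothetical capacity */

/* Hypothetical (src, dst, size) triple; not the real MDST/MDDT layout. */
struct dump_entry {
	uint64_t src;
	uint64_t dst;
	uint64_t size;
};

struct dump_table {
	struct dump_entry entries[DUMP_MAX_ENTRIES];
	_Atomic uint32_t count;	/* entries [0, count) are valid */
};

/* Append one entry. Returns 0 on success, -1 if full or size is zero. */
static int dump_table_add(struct dump_table *t, uint64_t src,
			  uint64_t dst, uint64_t size)
{
	uint32_t n;

	assert(t != NULL);
	n = atomic_load_explicit(&t->count, memory_order_relaxed);
	if (n >= DUMP_MAX_ENTRIES || size == 0)
		return -1;

	/* Fill in the whole entry first ... */
	t->entries[n].src = src;
	t->entries[n].dst = dst;
	t->entries[n].size = size;

	/* ... then publish it by bumping the count (release ordering). */
	atomic_store_explicit(&t->count, n + 1, memory_order_release);
	return 0;
}
```

With this ordering the table is not updated atomically as a whole, but it is
consistent at every intermediate step: at worst a reader sees fewer entries
than were requested.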

> 
>>>> I'm still pretty set on no structure and no metadata (except one tag
>>>> that has no predefined semantics to the MPIPL layer).
>>>>
>>>> That's the minimum necessary and sufficient for a "preserve memory
>>>> across reboot" facility to support Linux crash dumps, right?
>>>
>>> I think we can (and probably need to) do better than the minimum. For
>>> the current design the "good" path is something like:
>>>
>>> Old kernel does a bad -> MPIPL request -> *magic occurs* -> hostboot
>>> -> skiboot -> petitboot -> ???
>>
>> Register during Linux boot -> MPIPL request -> ..... -> petitboot -> Host Linux
>>
>> For now we are adding support for Host Linux. In future we will add support
>> for petitboot kernel as well. So during petitboot kernel load it will register for
>> FADUMP and before doing kexec, it will call unregister .. So that host linux can
>> register again.
> 
> What's the difference between supporting this in the petitboot and host kernels?

Technically there is no difference. It's the same code. But we have to see how
to offload petitboot crashes.

> 
>>> I'm wondering what we can safely do once we hit the final step. As far
>>> as I can tell the intention is to boot into the same kernel that we
>>> crashed from so that it can run makedumpsterfire to produce a
>>> crashdump, invalidate the dump, and continue to boot into a
>>> functioning OS. However I don't see how we'd actually guarantee that
>>> actually happens. I realise that it's *probably* going to work most of
>>> the time since we'll probably be running the same kernel that's the
>>> default boot option, but surely we can come up with something that's
>>> less jank.
>>
>> Yes. Idea is to boot back the same kernel. Since we are initializing everything
>> again most of the time it will work fine (much better than kdump situation).
> 
> I'm not really convinced that MPIPL is drastically better than kdump.
> The main reason kdump doesn't work well is GPUs with NVlink. As far as
> I can tell MPIPL doesn't do anything to help there.

As Hari mentioned, kdump has many other issues. GPUs with NVLink are one such
issue. MPIPL is much better here, as firmware will take care of preserving
memory. And we do a fresh boot of the same kernel. So most of the time the
system will boot back properly and we can collect the dump.

> 
>>> For contrast the kdump approach allows the crashing kernel to specify
>>> what the crash environment is going to look like. If I were an OS
>>> vendor I'd say that's a pretty compelling reason to use kdump instead
>>> of this. If the main benefit of fadump is that we can reliably reset
>>> and reinitialise hardware devices then maybe we should look at trying
>>> to use MPIPL as an alternative kdump entry path. Rather than having
>>> skiboot load petitboot from flash, we could have skiboot enter the
>>> preloaded crash kernel and go from there.
>>
>> A preloaded crash kernel is similar to kdump, right? I don't think we gain
>> much from this approach.
> 
> Can you actually respond to what I'm saying rather than dismissing it
> out of hand with a non-argument?

Sure. A preloaded crash kernel will have the same limitation as kdump,
because we are not re-initializing the hardware. Also, with this approach
we can't collect a skiboot dump, as skiboot itself may have crashed.

OTOH, with MPIPL, once the system crashes, a freshly loaded hostboot will take
care of preserving memory. So we can collect a skiboot dump as well.
And we are doing a fresh boot (with memory preserved). So most of the time the
system will boot back properly.

> 
> If we can use MPIPL to make kdump more robust then I think we should
> do that rather than having a completely separate mechanism to capture
> a crash dump. One of the goals of OpenPower is to have tooling and
> processes that are consistent with what is used by the rest of the
> industry rather than inventing IBM specific ways of doing everything.
> So why are we doing this instead? I'm not saying that there's no good
> reasons to take the approach you have, but you, Hari and Mahesh need
> to do a better job of spelling them out. Have you spoken with anyone
> from SuSE or RH about what they would prefer?

I think I failed to elaborate on the tooling part in the cover letter.

All these changes are internal to the kernel and firmware. We are retaining the
existing dump formats (vmcore format for the kernel and ELF core format for
opalcore). There is no change in user space tools. Existing crash/gdb will work
fine. That means there is no impact on the end user. They can continue to use
the same tools.

FADUMP is not new in the kernel. We already have a similar feature in the
PowerVM world. So from the kernel's point of view it's just extending the
FADUMP feature to the PowerNV platform with MPIPL as the backend.

So from the end user's point of view, all we are trying to do is "get a
reliable kernel dump" plus an "OPAL dump" without impacting the existing
workflow (tooling, debugging method).

-Vasant