[Skiboot] [PATCH v7 18/22] fadump: Add documentation

Tue May 28 22:46:01 AEST 2019

On 05/20/2019 11:50 AM, Oliver wrote:
> On Mon, May 20, 2019 at 12:30 PM Nicholas Piggin <npiggin at gmail.com> wrote:
>>
>> Vasant Hegde's on May 18, 2019 9:12 pm:
>>> On 05/16/2019 11:05 AM, Nicholas Piggin wrote:
>>>> Vasant Hegde's on May 14, 2019 9:23 pm:
>>>>> On 05/09/2019 10:28 AM, Nicholas Piggin wrote:
>>>>>> Vasant Hegde's on April 13, 2019 7:15 pm:
>>>>>>> diff --git a/doc/opal-api/opal-fadump-manage-173.rst b/doc/opal-api/opal-fadump-manage-173.rst
>>>>>>> new file mode 100644
>>>>>>> index 000000000..916167503
>>>>>>> --- /dev/null
>>>>>>> +++ b/doc/opal-api/opal-fadump-manage-173.rst
>>>>>>> @@ -0,0 +1,73 @@
>>>>>>> +.. _opal-api-fadump-manage:
>>>>>>> +
>>>>>>> +OPAL fadump manage call
>>>>>>> +=======================
>>>>>>> +::
>>>>>>> +
>>>>>>> +   #define OPAL_FADUMP_MANAGE                      173
>>>>>>> +
>>>>>>> +This call is used to manage FADUMP (aka MPIPL) on OPAL platform.
>>>>>>> +Linux kernel will use this call to register/unregister FADUMP.
>>>>>>> +
>>>>>>> +Parameters
>>>>>>> +----------
>>>>>>> +::
>>>>>>> +
>>>>>>> +   uint64_t     command
>>>>>>> +   void         *data
>>>>>>> +   uint64_t     dsize
>>>>>>> +
>>>>>>> +``command``
>>>>>>> +   ``command`` parameter supports below values:
>>>>>>> +
>>>>>>> +::
>>>>>>> +
>>>>>>> +      0x01 - Register for fadump
>>>>>>> +      0x02 - Unregister fadump
>>>>>>> +      0x03 - Invalidate existing fadump
>>>>>>> +
>>>>>>> +``data``
>>>>>>> +   ``data`` is valid when ``command`` is 0x01 (registration).
>>>>>>> +   We use fadump structure (see below) to pass Linux kernel
>>>>>>> +   memory reservation details.
>>>>>>> +
>>>>>>> +::
>>>>>>> +
>>>>>>> +
>>>>>>> +   struct fadump_section {
>>>>>>> +           u8      source_type;
>>>>>>> +           u8      reserved[7];
>>>>>>> +           u64     source_addr;
>>>>>>> +           u64     source_size;
>>>>>>> +           u64     dest_addr;
>>>>>>> +           u64     dest_size;
>>>>>>> +   } __packed;
>>>>>>> +
>>>>>>> +   struct fadump {
>>>>>>> +           u16     fadump_section_size;
>>>>>>> +           u16     section_count;
>>>>>>> +           u32     crashing_cpu;
>>>>>>> +           u64     reserved;
>>>>>>> +           struct  fadump_section section[];
>>>>>>> +   };
>>>>>>
>>>>>> This API seems quite complicated. The kernel wants to tell firmware to
>>>>>> preserve some ranges of memory in case of reboot, and to have those
>>>>>> ranges advertised to the reboot kernel.
>>>>>
>>>>> Kernel informs OPAL about range of memory to be preserved during MPIPL
>>>>> (source, destination, size).
>>>>
>>>> Well it also contains crashing_cpu, type, and comes in this clunky
>>>> structure.
>>>
>>> crashing_cpu : This information is passed by OPAL to kernel during MPIPL boot,
>>> so that kernel can generate proper backtrace for OPAL dump. This is not needed
>>> for registration. This is *OPAL* generated information. Kernel won't pass this
>>> information. (For kernel initiated crash, kernel will keep track of crashing
>>> CPU pt_regs data and it will use that to generate vmcore).
>>>
>>>
>>> Type : Identifies memory content type (like OPAL, kernel, etc). During MPIPL
>>> registration we pass this data to HDAT. Hostboot will just copy this back to
>>> the Result table inside HDAT. During MPIPL boot, OPAL passes this information
>>> to kernel, so that kernel can generate proper dumps.
>>
>> Right. But it's all metadata that "MPIPL" does not need to know. We want
>> a service that preserves memory over reboot. Then Linux can create its
>> own metadata to use that for fadump crashes, for example.
>>
>>>>
>>>>> After reboot, we will get the result range from hostboot. We pass that to
>>>>> kernel via device tree.
>>>>>
>>>>>>
>>>>>> Why not just an API which can add a range, and delete a range, and
>>>>>> that's it? Range would just be physical start, end, plus an arbitrary
>>>>>> tag (which caller can use to retrieve metadata that is used to
>>>>>> decipher the dump).
>>>>>
>>>>> We want one to one mapping between source and destination.
>>>>
>>>> Ah yes, sure that too. So two calls, one which adds or removes
>>>> (source, dest, length) entries, and another which sets a tag.
>>>
>>> Sorry. I'm still not getting what we gain by multiple calls here.
>>
>> No ugly structure that's tied to some internal dump metadata.
>>
>>>
>>> - With structure we can pass all information in one call. So kernel can make
>>> single call for registration.
>>
>> We don't gain much there AFAIKS.
> 
> I added some code to count what OPAL calls we do to get into petitboot
> on ozrom2 and got this:

Are you sure you are counting it properly? Some of these numbers look suspicious.
Either the numbers are wrong -OR- we messed up badly.

> 
> +---------------------------------+-------+
> |            OPAL Call            | Count |
> +---------------------------------+-------+
> | OPAL_READ_NVRAM                 |  4612 |

Why are we doing so many NVRAM calls?

> | OPAL_FLASH_READ                 |    31 |
> | OPAL_PCI_SET_PELTV              |    23 |
> | OPAL_PCI_SET_PE                 |    16 |
> | OPAL_REGISTER_DUMP_REGION       |    15 |

It should be 1. We just make one call from Linux for registration.

.../...

> 
> I'd say we gain nothing from doing one OPAL call.

Well, I don't think performance is the concern here, as this new API is called
only during FADUMP registration (once during boot).

.../...

>>
>> I'm still pretty set on no structure and no metadata (except one tag
>> that has no predefined semantics to the MPIPL layer).
>>
>> That's the minimum necessary and sufficient for a "preserve memory
>> across reboot" facility to support Linux crash dumps, right?
> 
> I think we can (and probably need to) do better than the minimum. For
> the current design the "good" path is something like:
> 
> Old kernel does a bad -> MPIPL request -> *magic occurs* -> hostboot
> -> skiboot -> petitboot -> ???

Register during Linux boot -> MPIPL request -> ..... -> petitboot -> Host Linux

For now we are adding support for Host Linux. In future we will add support
for the petitboot kernel as well. During petitboot kernel load it will register
for FADUMP, and before doing kexec it will call unregister, so that host Linux
can register again.

> 
> I'm wondering what we can safely do once we hit the final step. As far
> as I can tell the intention is to boot into the same kernel that we
> crashed from so that it can run makedumpsterfire to produce a
> crashdump, invalidate the dump, and continue to boot into a
> functioning OS. However I don't see how we'd actually guarantee that
> actually happens. I realise that it's *probably* going to work most of
> the time since we'll probably be running the same kernel that's the
> default boot option, but surely we can come up with something that's
> less jank.

Yes. The idea is to boot back into the same kernel. Since we are initializing
everything again, it will work fine most of the time (much better than the
kdump situation).

> 
> For contrast the kdump approach allows the crashing kernel to specify
> what the crash environment is going to look like. If I were an OS
> vendor I'd say that's a pretty compelling reason to use kdump instead
> of this. If the main benefit of fadump is that we can reliably reset
> and reinitialise hardware devices then maybe we should look at trying
> to use MPIPL as an alternative kdump entry path. Rather than having
> skiboot load petitboot from flash, we could have skiboot enter the
> preloaded crash kernel and go from there.

A preloaded crash kernel is similar to kdump, right? I don't think we gain
much from this approach. Also, we will not be able to capture skiboot
dumps.

@Mahesh, Hari can explain the kdump issue better.

-Vasant