[Skiboot] [PATCH v7 18/22] fadump: Add documentation

Fri May 31 13:40:28 AEST 2019

Vasant Hegde's on May 29, 2019 4:01 am:
> On 05/20/2019 07:59 AM, Nicholas Piggin wrote:
>> Vasant Hegde's on May 18, 2019 9:12 pm:
>>> On 05/16/2019 11:05 AM, Nicholas Piggin wrote:
>>>> Vasant Hegde's on May 14, 2019 9:23 pm:
>>>>> On 05/09/2019 10:28 AM, Nicholas Piggin wrote:
>>>>>> Vasant Hegde's on April 13, 2019 7:15 pm:
>>>>>>> diff --git a/doc/opal-api/opal-fadump-manage-173.rst b/doc/opal-api/opal-fadump-manage-173.rst
>>>>>>> new file mode 100644
>>>>>>> index 000000000..916167503
>>>>>>> --- /dev/null
>>>>>>> +++ b/doc/opal-api/opal-fadump-manage-173.rst
>>>>>>> @@ -0,0 +1,73 @@
>>>>>>> +.. _opal-api-fadump-manage:
>>>>>>> +
>>>>>>> +OPAL fadump manage call
>>>>>>> +=======================
>>>>>>> +::
>>>>>>> +
>>>>>>> +   #define OPAL_FADUMP_MANAGE                      173
>>>>>>> +
>>>>>>> +This call is used to manage FADUMP (aka MPIPL) on OPAL platform.
>>>>>>> +Linux kernel will use this call to register/unregister FADUMP.
>>>>>>> +
>>>>>>> +Parameters
>>>>>>> +----------
>>>>>>> +::
>>>>>>> +
>>>>>>> +   uint64_t     command
>>>>>>> +   void         *data
>>>>>>> +   uint64_t     dsize
>>>>>>> +
>>>>>>> +``command``
>>>>>>> +   ``command`` parameter supports below values:
>>>>>>> +
>>>>>>> +::
>>>>>>> +
>>>>>>> +      0x01 - Register for fadump
>>>>>>> +      0x02 - Unregister fadump
>>>>>>> +      0x03 - Invalidate existing fadump
>>>>>>> +
>>>>>>> +``data``
>>>>>>> +   ``data`` is valid when ``command`` is 0x01 (registration).
>>>>>>> +   We use fadump structure (see below) to pass Linux kernel
>>>>>>> +   memory reservation details.
>>>>>>> +
>>>>>>> +::
>>>>>>> +
>>>>>>> +
>>>>>>> +   struct fadump_section {
>>>>>>> +	u8	source_type;
>>>>>>> +	u8	reserved[7];
>>>>>>> +	u64	source_addr;
>>>>>>> +	u64	source_size;
>>>>>>> +	u64	dest_addr;
>>>>>>> +	u64	dest_size;
>>>>>>> +   } __packed;
>>>>>>> +
>>>>>>> +   struct fadump {
>>>>>>> +	u16	fadump_section_size;
>>>>>>> +	u16	section_count;
>>>>>>> +	u32	crashing_cpu;
>>>>>>> +	u64	reserved;
>>>>>>> +	struct	fadump_section section[];
>>>>>>> +   };
>>>>>>
>>>>>> This API seems quite complicated. The kernel wants to tell firmware to
>>>>>> preserve some ranges of memory in case of reboot, and to have those
>>>>>> ranges advertised to the reboot kernel.
>>>>>
>>>>> Kernel informs OPAL about range of memory to be preserved during MPIPL
>>>>> (source, destination, size).
>>>>
>>>> Well it also contains crashing_cpu, type, and comes in this clunky
>>>> structure.
>>>
>>> crashing_cpu : This information is passed by OPAL to kernel during MPIPL boot.
>>> So that
>>> kernel can generate proper backtrace for OPAL dump.  This is not needed for
>>> registration.
>>> This is *OPAL* generated information. Kernel won't pass this information.
>>> (For kernel initiated crash, kernel will keep track of crashing CPU pt_regs data
>>> and it will use
>>>     that to generate vmcore).
>>>
>>>
>>> Type : Identifies memory content type (like OPAL, kernel, etc). During MPIPL
>>> registration
>>> we pass this data to HDAT.. Hostboot will just copy this back to Result table
>>> inside HDAT.
>>> During MPIPL boot, OPAL passes this information to kernel.. so that kernel can
>>> generate
>>> proper dumps.
>> 
>> Right. But it's all metadata that "MPIPL" does not need to know. We want
>> a service that preserves memory over reboot. Then Linux can create its
>> own metadata to use that for fadump crashes, for example.
> 
> OPAL/firmware is not using this metadata. It just passes it back.

That's a good hint to just leave it out and let Linux define its own
metadata.

> But I got your point. We can skip passing "type" information to OPAL.
> 
>> 
>>>>
>>>>> After reboot, we will result range from hostboot . We pass that to kernel via
>>>>> device tree.
>>>>>
>>>>>>
>>>>>> Why not just an API which can add a range, and delete a range, and
>>>>>> that's it? Range would just be physical start, end, plus an arbitrary
>>>>>> tag (which caller can use to retrieve metadata that is used to
>>>>>> decipher the dump).
>>>>>
>>>>> We want one to one mapping between source and destination.
>>>>
>>>> Ah yes, sure that too. So two calls, one which adds or removes
>>>> (source, dest, length) entries, and another which sets a tag.
>>>
>>> Sorry. I'm still not getting what we gain by multiple calls here.
>> 
>> No ugly structure that's tied to some internal dump metadata.
> 
> I understand the metadata part. But structure is not tied to anything else.
> The data format used to pass information from OPAL to hostboot is completely 
> different.
> This structure is used to pass data between OPAL and kernel.

It's just ugly. Start with an API which is as simple and minimal as
possible, and then explain why you need to complicate it.

Performance and race windows are not good reasons AFAIKS. We can do
about 1-2 million OPAL calls per second last I checked, and the
race/atomicity issue is better solved in the kernel, where you can
set a flag after registration is done, and only issue the MPIPL reboot
if that flag is set.

>>> - With structure we can pass all information in one call. So kernel can make
>>> single call for registration.
>> 
>> We don't gain much there AFAIKS.
>> 
>>> - Its controlled by version field (we realized need of version field during
>>> review and I will
>>> add that in v8) . So makes it easy to handle comparability issues.
>> 
>> But you only have backward compatibility concerns because you're
>> exposing the structure and putting crash metadata into it in the first
>> place.
> 
> Otherwise for any change in interface we will endup creating new API
> (like we did OPAL_DUMP_INFO2).

If you take all the unnecessary stuff out of your API, and you have a
better chance you get it right and won't need to change.

"Preserve these memory ranges across boot" is the be all and end all of 
preserving memory ranges across boot :) But if we do need to change
things, a new OPAL call is not worse than versioning a structure.

> I don't think we gain much with multiple APIs (for registration, un-registration,
> invalidate, complete). Anyway let me come up with new API which passes
> src, dest, size.

Thanks for considering it. Not sure if you need all those. One OPAL call 
number that takes a range and flag to preserve/unpreserve, and another 
to initiate the MPIPL. Possibly another to save a tag/pointer.

>> No, just make it an arbitrary tag. Then the caller can use that as a
>> pointer and use that to find its own metadata.
> 
> Sorry. I didn't get this. What you mean by "arbitrary tag"?

Well it lets the kernel preserve a pointer to its crash metadata within
a preserved memory range. Although maybe that still has a chicken and egg
problem when you retrieve the data if the memory has been moved. I guess
rather than arbitrary tag, it could be a pointer offset that will get
updated to point to the destination location when you boot up.

> What I was telling is, post crash kernel needs below information so that it can 
> create proper
> dump.
>    - Crashing CPU --> In case of OPAL crash
>    - source_addr, dest_addr, size, type
>      Kernel uses type field to separate out the entry and to create dump.
> 
> Now either I have to pass all these data via device tree (then I can skip type 
> field)
> 1 . something like below:
>       `opal-result-entry` ->  src, dest, size for OPAL dump (and it can have 
> multiple entries)
>       `kernel-result-entry` -> src, dest, size for kernel DUMP (and it can have 
> multiple entries)
> 
>        But then as Oliver pointed, this is not static data. Once kernel consumes 
> it, data becomes
>        invalid. So device tree may not be the right place.
> 
> -OR-
> 
> 2. Pass entry size via device tree (like kernel-result-table-size, 
> opal-result-table-size).
>      Kernel will use API to get the data. But here kernel has to pass the type 
> information
>      to get the data.
> 
> -OR-
> 
> 3. Use structure to pass entire data. Device tree will contain total 
> result-table size.
>      Kernel will make API call to get data.

Does it need any device tree? How about OPAL calls to retrieve the
preserved ranges?

So you retrieve you kernel metadata pointer that's in dest memory,
and that contains all your saved ranges and other metadata.

Thanks,
Nick