[Skiboot] [PATCH v7 18/22] fadump: Add documentation

Vasant Hegde hegdevasant at linux.vnet.ibm.com
Mon Jun 3 18:28:22 AEST 2019


On 05/31/2019 09:10 AM, Nicholas Piggin wrote:
> Vasant Hegde's on May 29, 2019 4:01 am:
>> On 05/20/2019 07:59 AM, Nicholas Piggin wrote:
>>> Vasant Hegde's on May 18, 2019 9:12 pm:
>>>> On 05/16/2019 11:05 AM, Nicholas Piggin wrote:
>>>>> Vasant Hegde's on May 14, 2019 9:23 pm:
>>>>>> On 05/09/2019 10:28 AM, Nicholas Piggin wrote:
>>>>>>> Vasant Hegde's on April 13, 2019 7:15 pm:
>>>>>>>> diff --git a/doc/opal-api/opal-fadump-manage-173.rst b/doc/opal-api/opal-fadump-manage-173.rst
>>>>>>>> new file mode 100644
>>>>>>>> index 000000000..916167503
>>>>>>>> --- /dev/null
>>>>>>>> +++ b/doc/opal-api/opal-fadump-manage-173.rst
>>>>>>>> @@ -0,0 +1,73 @@
>>>>>>>> +.. _opal-api-fadump-manage:
>>>>>>>> +

.../...

>>>>
>>>> Type : Identifies the memory content type (OPAL, kernel, etc.). During MPIPL
>>>> registration we pass this data to HDAT. Hostboot will just copy this back to
>>>> the result table inside HDAT. During MPIPL boot, OPAL passes this information
>>>> to the kernel so that the kernel can generate proper dumps.
>>>
>>> Right. But it's all metadata that "MPIPL" does not need to know. We want
>>> a service that preserves memory over reboot. Then Linux can create its
>>> own metadata to use that for fadump crashes, for example.
>>
>> OPAL/firmware is not using this metadata. It just passes it back.
> 
> That's a good hint to just leave it out and let Linux define its own
> metadata.
> 
>> But I got your point. We can skip passing "type" information to OPAL.
>>
>>>
>>>>>
>>>>>> After reboot, we will get the result range from hostboot. We pass that to
>>>>>> the kernel via the device tree.
>>>>>>
>>>>>>>
>>>>>>> Why not just an API which can add a range, and delete a range, and
>>>>>>> that's it? Range would just be physical start, end, plus an arbitrary
>>>>>>> tag (which caller can use to retrieve metadata that is used to
>>>>>>> decipher the dump).
>>>>>>
>>>>>> We want a one-to-one mapping between source and destination.
>>>>>
>>>>> Ah yes, sure that too. So two calls, one which adds or removes
>>>>> (source, dest, length) entries, and another which sets a tag.
>>>>
>>>> Sorry. I'm still not getting what we gain by multiple calls here.
>>>
>>> No ugly structure that's tied to some internal dump metadata.
>>
>> I understand the metadata part. But the structure is not tied to anything else.
>> The data format used to pass information from OPAL to hostboot is completely
>> different. This structure is used to pass data between OPAL and the kernel.
> 
> It's just ugly. Start with an API which is as simple and minimal as
> possible, and then explain why you need to complicate it.
> 
> Performance and race windows are not good reasons AFAIKS. We can do

This API is mostly used during system boot, so we were not considering
performance.


> about 1-2 million OPAL calls per second last I checked, and the
> race/atomicity issue is better solved in the kernel, where you can
> set a flag after registration is done, and only issue the MPIPL reboot
> if that flag is set.
> 
>>>> - With a structure we can pass all the information in one call, so the kernel
>>>> can make a single call for registration.
>>>
>>> We don't gain much there AFAIKS.
>>>
>>>> - It's controlled by a version field (we realized the need for a version
>>>> field during review and I will add it in v8), which makes it easy to handle
>>>> compatibility issues.
>>>
>>> But you only have backward compatibility concerns because you're
>>> exposing the structure and putting crash metadata into it in the first
>>> place.
>>
>> Otherwise, for any change in the interface we will end up creating a new API
>> (like we did with OPAL_DUMP_INFO2).
> 
> If you take all the unnecessary stuff out of your API, you have a better
> chance of getting it right and won't need to change it.
> 
> "Preserve these memory ranges across boot" is the be all and end all of
> preserving memory ranges across boot :) But if we do need to change
> things, a new OPAL call is not worse than versioning a structure.

Both are equally bad anyway.

> 
>> I don't think we gain much with multiple APIs (for registration,
>> unregistration, invalidate, complete). Anyway, let me come up with a new API
>> that passes src, dest and size.
> 
> Thanks for considering it. Not sure if you need all those. One OPAL call
> number that takes a range and flag to preserve/unpreserve, and another
> to initiate the MPIPL. Possibly another to save a tag/pointer.
> 
>>> No, just make it an arbitrary tag. Then the caller can use that as a
>>> pointer and use that to find its own metadata.
>>
>> Sorry. I didn't get this. What do you mean by "arbitrary tag"?
> 
> Well it lets the kernel preserve a pointer to its crash metadata within
> a preserved memory range. Although maybe that still has a chicken and egg
> problem when you retrieve the data if the memory has been moved. I guess
> rather than arbitrary tag, it could be a pointer offset that will get
> updated to point to the destination location when you boot up.

We don't need such tags. Based on the destination memory address, the kernel
can retrieve its metadata.

> 
>> What I was saying is, post crash the kernel needs the below information so
>> that it can create a proper dump:
>>     - Crashing CPU --> in case of an OPAL crash
>>     - source_addr, dest_addr, size, type
>>       The kernel uses the type field to separate out the entries and to
>>       create the dump.
>>
>> Now either I have to pass all this data via the device tree (then I can skip
>> the type field):
>> 1. Something like below:
>>        `opal-result-entry` -> src, dest, size for the OPAL dump (it can have
>>        multiple entries)
>>        `kernel-result-entry` -> src, dest, size for the kernel dump (it can
>>        have multiple entries)
>>
>>    But then, as Oliver pointed out, this is not static data. Once the kernel
>>    consumes it, the data becomes invalid. So the device tree may not be the
>>    right place.
>>
>> -OR-
>>
>> 2. Pass the entry size via the device tree (like kernel-result-table-size,
>>    opal-result-table-size).
>>    The kernel will use an API to get the data, but here the kernel has to
>>    pass the type information to get the data.
>>
>> -OR-
>>
>> 3. Use a structure to pass the entire data. The device tree will contain the
>>    total result-table size.
>>    The kernel will make an API call to get the data.
> 
> Does it need any device tree?

Yes. At a minimum we should indicate that it is an MPIPL boot and the size of
the result table.


>  How about OPAL calls to retrieve the
> preserved ranges?

Yeah. We can consider an OPAL API.

> 
> So you retrieve your kernel metadata pointer that's in dest memory,
> and that contains all your saved ranges and other metadata.

During MPIPL, hostboot takes care of moving data from source to destination
memory. From hostboot's point of view it is just preserving memory (copying
memory from source to destination); it doesn't differentiate between kernel
memory and OPAL memory. Once it completes, it boots OPAL and passes the result
table to OPAL. This result table contains both OPAL and kernel entries. We have
to pass this information to the kernel, and the kernel needs a way to
differentiate between OPAL and kernel entries.


So here is my new interface proposal. Does this look ok?

Registration :
   opal_fadump_manage(u8 cmd, u64 src, u64 dest, u64 size)
       cmd : 0x01 -> OPAL_FADUMP_REGISTER, kernel should pass all 3 parameters
             0x02 -> OPAL_FADUMP_UNREGISTER, kernel should pass the src and dest
                     parameters; if src = dest = 0, then UNREGISTER all kernel
                     entries
             0x03 -> OPAL_FADUMP_INVALIDATE, no parameter is required

     - OPAL will add a new entry to the MDST/MDDT tables
     - If the system crashes before the kernel adds all entries, we will have a
       partial kernel dump
     - The kernel should handle partial dumps

     - The kernel has to maintain its own metadata. If, in future, multiple
       drivers want different dumps, they have to go through the same interface
       (CONFIG_FADUMP). A rough usage sketch follows below.
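
To make this concrete, here is a minimal kernel-side sketch of the proposed
registration calls. The command values are taken from the proposal above; the
opal_fadump_manage() prototype and the helper names are hypothetical, since
none of this exists yet:

    #include <stdint.h>

    /* Proposed command values (from the proposal above) */
    #define OPAL_FADUMP_REGISTER    0x01  /* needs src, dest and size */
    #define OPAL_FADUMP_UNREGISTER  0x02  /* needs src and dest; 0/0 clears all */
    #define OPAL_FADUMP_INVALIDATE  0x03  /* no parameters */

    /* Assumed OPAL call wrapper; the real prototype would live in the
     * OPAL API headers. */
    int64_t opal_fadump_manage(uint8_t cmd, uint64_t src, uint64_t dest,
                               uint64_t size);

    /* Preserve 'size' bytes: hostboot copies them from 'src' to 'dest'
     * across an MPIPL reboot. Returns the OPAL return code. */
    static int64_t fadump_register_range(uint64_t src, uint64_t dest,
                                         uint64_t size)
    {
        return opal_fadump_manage(OPAL_FADUMP_REGISTER, src, dest, size);
    }

    /* Drop all previously registered kernel entries (src = dest = 0). */
    static int64_t fadump_unregister_all(void)
    {
        return opal_fadump_manage(OPAL_FADUMP_UNREGISTER, 0, 0, 0);
    }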

Post MPIPL, pass the result table to the kernel :
     - Device tree indicates it is an MPIPL boot (/ibm,opal/dump/fadump)
     - For an OPAL dump, the device tree contains the crashing CPU PIR
       (/ibm,opal/dump/crashing-pir)
     - Device tree contains the size (in bytes) of the result table
       (/ibm,opal/dump/result-table-size)
       data format : u8 type, u64 source, u64 destination, u64 size
           --> The type field is required to differentiate between OPAL and
               kernel entries. In future, if we add new dump support, we will
               add a new type.

     - OPAL API : opal_fadump_result_table(u64 data, u64 size)
         The kernel allocates memory for the data and makes the OPAL call.
         OPAL copies the result table into the kernel-supplied memory.
         The kernel parses the entries based on the type field (see the sketch
         below).
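
To illustrate, here is a rough sketch of the entry layout and the kernel-side
parsing, assuming the data format above (u8 type, u64 source, u64 destination,
u64 size). The struct name, the type values and the packed attribute are
assumptions for illustration only:

    #include <stdint.h>

    /* Hypothetical type values; the real ones would be defined by OPAL */
    #define FADUMP_TYPE_OPAL    0x01
    #define FADUMP_TYPE_KERNEL  0x02

    /* One result table entry, matching the proposed data format above.
     * Packed, since OPAL copies the table as a raw byte stream. */
    struct fadump_result_entry {
        uint8_t  type;          /* differentiates OPAL vs kernel entries */
        uint64_t source;        /* pre-crash physical address */
        uint64_t destination;   /* where hostboot preserved the data */
        uint64_t size;          /* length of the preserved range */
    } __attribute__((packed));

    /* Assumed OPAL call: copies the result table into caller memory */
    int64_t opal_fadump_result_table(uint64_t data, uint64_t size);

    /* Count kernel entries in a table whose size (in bytes) came from the
     * /ibm,opal/dump/result-table-size device tree property. */
    static uint64_t fadump_count_kernel_entries(const void *buf,
                                                uint64_t table_size)
    {
        const struct fadump_result_entry *entry = buf;
        uint64_t n = table_size / sizeof(*entry);
        uint64_t kernel_entries = 0;

        for (uint64_t i = 0; i < n; i++)
            if (entry[i].type == FADUMP_TYPE_KERNEL)
                kernel_entries++;

        return kernel_entries;
    }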

-Vasant


