[Skiboot] [PATCH v8 20/24] MPIPL: Add OPAL API to query saved tags

Vasant Hegde hegdevasant at linux.vnet.ibm.com
Wed Jul 10 01:23:04 AEST 2019

On 07/09/2019 03:33 PM, Oliver O'Halloran wrote:
> On Sun, 2019-06-16 at 22:40 +0530, Vasant Hegde wrote:
>> Pre-MPIPL kernel saves various information required to create vmcore in
>> metadata area and passes metadata area pointer to OPAL. OPAL will preserve
>> this pointer across MPIPL. Post MPIPL kernel will request for saved tags
>> via this API. Kernel also needs below tags:
>>    - Saved CPU registers data to access CPU registers
>>    - OPAL metadata area to create opalcore
>> Format:
>>    opal_mpipl_query_tag(uint32_t idx, uint64_t *tag)
>>    idx : tag index (0..n)
>>    tag : OPAL will pass saved tag
>>    Kernel will make this call with increased `index` until OPAL returns
>> Return values:
>>    OPAL_SUCCESS   : Operation success
>>    OPAL_PARAMETER : Invalid parameter
>>    OPAL_EMPTY	 : OPAL completed sending all tags to kernel
> After spending a while picking through the linux patches I'll say that the
> revised API is mostly fine, BUT:
> a) The documentation is *woefully* inadequate, and

I have documented API under `doc/opal-api/opal-mpipl-173-174.rst` with expected 
under various scenarios (pretty much similar to how we documented other APIs).

> b) We need to think a bit more about how tags are used.

We (Myself, Ananth, Nick, Hari, Mahesh) had a detailed *offline* discussion on 
tag. We did
thought about passing `tag` as part of opal_mpipl_query_tag() API. But Nick 
don't wanted
that and wanted to keep API simple. So now our API doesn't care about `tag` 
type. Its just
a interface to retrieve various tags. So we had to push the tag field to structure.

> My interpretation is that the kernel tag is there to allow the crashing kernel
> to define it's own dump metadata for the recieving kernel. As a result firmware,

Yes. Kernel will have its metadata area and tag points to the metadata area. 
(well, actually tag
can be anything using which post kernel can retrieve its metadata area).

> which includes the petitboot kernel, can't make assumptions about the contents
> of the tagged data since it could be populated by a version of Linux with a
> different metadata ABI or by another OS entirely.

OPAL will not care about tag content. Its job is to preserve tag across MPIPL and
pass it back to kernel.

> So as far as I can tell the overall process is something like:
> 1. We boot an OS. This may or may not be based on a Linux kernel.
> 2. OS configures MPIPL and sets the kernel tag to some structure it defines.
> 3. OS triggers a MPIPL. We re-run through the SBE and hostboot. Most of
>     memory is left as-is. Some memory used by hostboot, the ranges in the MDDT,
>     and the skiboot region at 0x3000_0000 are overwritten.
> 4. Skiboot parses the MDRT and sets a few DT flags to indicate an MPIPL
>     happened and populates the tags which are defined by OPAL.
> 5. Petitkernel boots and detects that an MPIPL happened and tries to preserve
>     the crashed kernel's memory.
> 6. We kexec into a capture kernel that knows how to parse the kernel tag's
>     metadata and generates a dump from that.


> This raises a few questions:
> 1) How much memory does firmware have to work with? This doesn't seem to be

So Post MPIPL,
  -  hostboot needs to be loaded and executed. Its done at predefined address and
     that memory is part of reserved-memory range. So we don't need to worry
     about this memory.
  - OPAL memory
    By the time OPAL loads, hostboot already preserved memory based on MDST, 
MDDT table.
    These table contains OPAL runtime memory as well. So by the time we load 
OPAL its already
     preserved. We are good to reload OPAL on same memory.
- Memory used by OPAL to load petitboot kernel
   First kernel advertises these range to kernel via `fw-load-area` so that 
kernel care of
   preserving these memory.

> actually documented anywhere. The closest thing I can find to an answer is that
> in the Linux patches will preserve at least 0x0...0x3000_0000 since that's what
> the fadump.ops->get_bootmem_min() returns for powernv. 

Kernel will preserve kernel memory ranges. It won't care about firmware memory area.

> However, I think the
> minimum memory that firmware needs is something that it should tell the OS via
> the DT rather than having the OS guess.

OS won't guess anything about firmware memory usage. In fact it won't care about
firmware memory except `fw-load-area`.

> We already communicate some of that
> information via the fw-load-area DT property, but that only contains the
> ranges where skiboot loads the kernel and initrd (0x2000_000...0x2fff_ffff).

OS needs to know about these memory ranges *only*...as first kernel may use these
memory. If it uses it should take care of reserving destination memory to 
preserve these
memory and pass it during MPIPL registration.

> Odds are hostboot doesn't use *that* much memory, but the petitboot kernel can

As mentioned above kernel need not worry about hostboot memory usage...as its
part of reserved-memory ranges.

> use quite a bit. I checked on a random lab system and it was using ~400MB
> after a reboot, just sitting at the petitboot shell prompt.

We are introducing new kernel config (FADUMP_PRESERVE) and that will take care of
petitboot kernel part. Hari can elaborate this.

> 2) What happens if the tag points into a range that is moved by firmware? If
> this is considered a kernel bug then that needs to be documented in the API.

Kernel tag : During registration kernel passes tag (which points to metadata area)
and post MPIPL OPAL has to send back this tag. It won't care whether its part of
moved memory or not. From kernel point of view it should get back what it passed.

OPAL tag : Points to kernel meta data area. Using this pointer kernel can 
retrieve OPAL dump
area and generate opalcore. Kernel won't care whether its part of original 
memory or moved

> 3) How does a kernel know if it can parse the data pointed to be the kernel tag?

OPAL guarantees to return tag sent by kernel during registration without 
Additionally it can validate the header.

> Right now we assume that each tag (kernel tag included) points to a structure
> with the following header:
> struct tag_header {
> 	u8 type;
> 	u8 version;
> 	u8 reserved[6]
> 	/* stuff */
> }
> The fact this is a common header is not documented anywhere or described in any
> of the relevant structures (mpipl_fadump in OPAL, and opal_fadump_mem_struct
> in Linux), but the code in Linux which scans for the kernel tag assumes it's
> there:

I think we missed to sync opal-api.h before sending patches. It will be fixed in 
next version.

>  From arch/powerpc/platforms/powernv/opal-fadump.c, line 597:
> 	do {
> 		ret = opal_mpipl_query_tag(idx, &addr);
> 		if (ret != OPAL_SUCCESS)
> 			break;
> 		addr = be64_to_cpu(addr);
> 		type = *((u8 *)addr);
> 		switch (type) {
> 			opal_cpu_metadata = __va(addr);
> 			break;
> 			opal_fdm_active = __va(addr);
> 			break;
> 		}
> 		pr_debug("idx: %d, addr: %llx, type: %x\n",
> 			 idx, addr, type);
> 		idx++;
> 	} while (ret == OPAL_SUCCESS);
> So if the crashing OS doesn't use that header and the first byte happens to be
> the same as MPIPL_FADUMP_TYPE_KERNEL, we'll probably blow up horrible later on

First OS registers for MPIPL only after setting up metadata area. If it crashes 
before that means it is not registered for MPIPL. And OPAL promises to return 
same tag after MPIPL.
So I don't see why things will go wrong here.

> while parsing the kernel-defined fadump structure. While we're here, why is
> MPIPL_FADUMP_TYPE_KERNEL = 0x80? It seems like a relic of the old API.

Well, we need a way to differentiate tag type. Kernel, OPAL. etc.
(in future we may have new tag type as well). And we have byte. So I just
used 0x01 -> 0x7F for OPAL range and 0x80 - 0xff for kernel.

> That all said, I think most of these problems can be solved by making the
> "type" field an output parameter of opal_mpipl_query_tag() so the common
> header can be dropped.

We don't really gain much by moving "type" field from structure to API. Whether
its structure -OR- API, OPAL will just return whatever kernel passed tag. (and 
of course
OPAL tag for parsing OPAL core).

We cannot drop common structure anyway. Kernel needs to know how to parse
OPAL metadata. So we need that.

> Without the common header you could then add an address-only tag that's defined
> by the crashing kernel that indicates the upper limit of preserved memory. That way
> the crashing kernel can configure the dump table to make a few GB available for
> petitboot and all petitboot needs to do is query that tag to determine how much
> memory is availble to it.
> Adding a uint64_t structure magic field to the structures defined by OPAL and
> Linux is also a good idea since it gives you a reliable way to determine if
> it's a structure you know how deal with or not.
> Thoughts?

I don't think we gain anything with that.

> p.s.
> Can we please settle on some kind of sane naming convention, like:
> 	opal_mpipl_  - only used for stuff defined by OPAL and
> 	fadump_      - generic fadump in linux
> 	pnv_fadump_  - powernv specific fadump stuff.

I think I had mixed up fadump and mpipl in earlier version. Now I have settled with
opal_mpipl  for OPAL side. Kernel side I think we are using above convention. If we
missed it we will fix it in next version.

> One of the things that has made reviewing the mpipl patches painful is that
> there's no clear seperation between what's defined in OPAL vs the kernel. It's
> not a showstopper, but it makes an already hard job even harder. A particularly
> eregious example is this bastard structure:

As mentioned above we missed to sync structures before sending out patches.
Sorry for that. It will be fixed in next version.


 >  From skiboot's opal-api.h:
 > struct mpipl_fadump {
 >          u8      type;           /* MPIPL_FADUMP_TYPE_* */
 >          u8      version;
 >          u8      reserved[6];
 >          u32     crashing_pir;   /* OPAL crashing CPU PIR */
 >          u32     cpu_data_version;
 >          u32     cpu_data_size;
 >          u32     fadump_region_cnt;
 >          struct  fadump_region region[];
 > } __packed;
 >  From the kernel's opal-api.h:
 > struct opal_mpipl_fadump {
 >          u8      type;
 >          u8      version;
 >          u8      reserved[6];
 >          __be32  crashing_pir;
 >          __be32  cpu_data_version;
 >          __be32  cpu_data_size;
 >          __be32  region_cnt;
 >          struct opal_fadump_region       region[OPAL_FADUMP_NR_SECTIONS];
 > } __attribute__((packed));

More information about the Skiboot mailing list