[Skiboot] [PATCH v8 20/24] MPIPL: Add OPAL API to query saved tags
Vasant Hegde
hegdevasant at linux.vnet.ibm.com
Wed Jul 10 01:23:04 AEST 2019
On 07/09/2019 03:33 PM, Oliver O'Halloran wrote:
> On Sun, 2019-06-16 at 22:40 +0530, Vasant Hegde wrote:
>> Pre-MPIPL kernel saves various information required to create vmcore in
>> metadata area and passes metadata area pointer to OPAL. OPAL will preserve
>> this pointer across MPIPL. Post MPIPL kernel will request for saved tags
>> via this API. Kernel also needs below tags:
>> - Saved CPU registers data to access CPU registers
>> - OPAL metadata area to create opalcore
>>
>> Format:
>> opal_mpipl_query_tag(uint32_t idx, uint64_t *tag)
>>
>> idx : tag index (0..n)
>> tag : OPAL will pass saved tag
>>
>> Kernel will make this call with increased `index` until OPAL returns
>> OPAL_EMPTY.
>>
>> Return values:
>> OPAL_SUCCESS : Operation success
>> OPAL_PARAMETER : Invalid parameter
>> OPAL_EMPTY : OPAL completed sending all tags to kernel
>
> After spending a while picking through the linux patches I'll say that the
> revised API is mostly fine, BUT:
>
> a) The documentation is *woefully* inadequate, and
I have documented API under `doc/opal-api/opal-mpipl-173-174.rst` with expected
behavior
under various scenarios (pretty much similar to how we documented other APIs).
> b) We need to think a bit more about how tags are used.
We (Myself, Ananth, Nick, Hari, Mahesh) had a detailed *offline* discussion on
tag. We did
thought about passing `tag` as part of opal_mpipl_query_tag() API. But Nick
don't wanted
that and wanted to keep API simple. So now our API doesn't care about `tag`
type. Its just
a interface to retrieve various tags. So we had to push the tag field to structure.
>
> My interpretation is that the kernel tag is there to allow the crashing kernel
> to define it's own dump metadata for the recieving kernel. As a result firmware,
Yes. Kernel will have its metadata area and tag points to the metadata area.
(well, actually tag
can be anything using which post kernel can retrieve its metadata area).
> which includes the petitboot kernel, can't make assumptions about the contents
> of the tagged data since it could be populated by a version of Linux with a
> different metadata ABI or by another OS entirely.
OPAL will not care about tag content. Its job is to preserve tag across MPIPL and
pass it back to kernel.
>
> So as far as I can tell the overall process is something like:
>
> 1. We boot an OS. This may or may not be based on a Linux kernel.
> 2. OS configures MPIPL and sets the kernel tag to some structure it defines.
> 3. OS triggers a MPIPL. We re-run through the SBE and hostboot. Most of
> memory is left as-is. Some memory used by hostboot, the ranges in the MDDT,
> and the skiboot region at 0x3000_0000 are overwritten.
> 4. Skiboot parses the MDRT and sets a few DT flags to indicate an MPIPL
> happened and populates the tags which are defined by OPAL.
> 5. Petitkernel boots and detects that an MPIPL happened and tries to preserve
> the crashed kernel's memory.
> 6. We kexec into a capture kernel that knows how to parse the kernel tag's
> metadata and generates a dump from that.
Correct.
>
>
> This raises a few questions:
>
> 1) How much memory does firmware have to work with? This doesn't seem to be
So Post MPIPL,
- hostboot needs to be loaded and executed. Its done at predefined address and
that memory is part of reserved-memory range. So we don't need to worry
about this memory.
- OPAL memory
By the time OPAL loads, hostboot already preserved memory based on MDST,
MDDT table.
These table contains OPAL runtime memory as well. So by the time we load
OPAL its already
preserved. We are good to reload OPAL on same memory.
- Memory used by OPAL to load petitboot kernel
First kernel advertises these range to kernel via `fw-load-area` so that
kernel care of
preserving these memory.
> actually documented anywhere. The closest thing I can find to an answer is that
> in the Linux patches will preserve at least 0x0...0x3000_0000 since that's what
> the fadump.ops->get_bootmem_min() returns for powernv.
Kernel will preserve kernel memory ranges. It won't care about firmware memory area.
> However, I think the
> minimum memory that firmware needs is something that it should tell the OS via
> the DT rather than having the OS guess.
OS won't guess anything about firmware memory usage. In fact it won't care about
firmware memory except `fw-load-area`.
> We already communicate some of that
> information via the fw-load-area DT property, but that only contains the
> ranges where skiboot loads the kernel and initrd (0x2000_000...0x2fff_ffff).
OS needs to know about these memory ranges *only*...as first kernel may use these
memory. If it uses it should take care of reserving destination memory to
preserve these
memory and pass it during MPIPL registration.
>
> Odds are hostboot doesn't use *that* much memory, but the petitboot kernel can
As mentioned above kernel need not worry about hostboot memory usage...as its
part of reserved-memory ranges.
> use quite a bit. I checked on a random lab system and it was using ~400MB
> after a reboot, just sitting at the petitboot shell prompt.
We are introducing new kernel config (FADUMP_PRESERVE) and that will take care of
petitboot kernel part. Hari can elaborate this.
>
>
> 2) What happens if the tag points into a range that is moved by firmware? If
> this is considered a kernel bug then that needs to be documented in the API.
Kernel tag : During registration kernel passes tag (which points to metadata area)
and post MPIPL OPAL has to send back this tag. It won't care whether its part of
moved memory or not. From kernel point of view it should get back what it passed.
OPAL tag : Points to kernel meta data area. Using this pointer kernel can
retrieve OPAL dump
area and generate opalcore. Kernel won't care whether its part of original
memory or moved
memory.
>
>
> 3) How does a kernel know if it can parse the data pointed to be the kernel tag?
OPAL guarantees to return tag sent by kernel during registration without
modification.
Additionally it can validate the header.
> Right now we assume that each tag (kernel tag included) points to a structure
> with the following header:
>
> struct tag_header {
> u8 type;
> u8 version;
> u8 reserved[6]
> /* stuff */
> }
>
> The fact this is a common header is not documented anywhere or described in any
> of the relevant structures (mpipl_fadump in OPAL, and opal_fadump_mem_struct
> in Linux), but the code in Linux which scans for the kernel tag assumes it's
> there:
I think we missed to sync opal-api.h before sending patches. It will be fixed in
next version.
>
> From arch/powerpc/platforms/powernv/opal-fadump.c, line 597:
>
> do {
> ret = opal_mpipl_query_tag(idx, &addr);
> if (ret != OPAL_SUCCESS)
> break;
>
> addr = be64_to_cpu(addr);
> type = *((u8 *)addr);
> switch (type) {
> case MPIPL_FADUMP_TYPE_CPU:
> opal_cpu_metadata = __va(addr);
> break;
> case MPIPL_FADUMP_TYPE_KERNEL:
> opal_fdm_active = __va(addr);
> break;
> }
>
> pr_debug("idx: %d, addr: %llx, type: %x\n",
> idx, addr, type);
> idx++;
> } while (ret == OPAL_SUCCESS);
>
> So if the crashing OS doesn't use that header and the first byte happens to be
> the same as MPIPL_FADUMP_TYPE_KERNEL, we'll probably blow up horrible later on
First OS registers for MPIPL only after setting up metadata area. If it crashes
before that means it is not registered for MPIPL. And OPAL promises to return
same tag after MPIPL.
So I don't see why things will go wrong here.
> while parsing the kernel-defined fadump structure. While we're here, why is
> MPIPL_FADUMP_TYPE_KERNEL = 0x80? It seems like a relic of the old API.
Well, we need a way to differentiate tag type. Kernel, OPAL. etc.
(in future we may have new tag type as well). And we have byte. So I just
used 0x01 -> 0x7F for OPAL range and 0x80 - 0xff for kernel.
>
>
> That all said, I think most of these problems can be solved by making the
> "type" field an output parameter of opal_mpipl_query_tag() so the common
> header can be dropped.
We don't really gain much by moving "type" field from structure to API. Whether
its structure -OR- API, OPAL will just return whatever kernel passed tag. (and
of course
OPAL tag for parsing OPAL core).
We cannot drop common structure anyway. Kernel needs to know how to parse
OPAL metadata. So we need that.
>
> Without the common header you could then add an address-only tag that's defined
> by the crashing kernel that indicates the upper limit of preserved memory. That way
> the crashing kernel can configure the dump table to make a few GB available for
> petitboot and all petitboot needs to do is query that tag to determine how much
> memory is availble to it.
>
> Adding a uint64_t structure magic field to the structures defined by OPAL and
> Linux is also a good idea since it gives you a reliable way to determine if
> it's a structure you know how deal with or not.
>
> Thoughts?
I don't think we gain anything with that.
>
>
>
> p.s.
>
> Can we please settle on some kind of sane naming convention, like:
>
> opal_mpipl_ - only used for stuff defined by OPAL and
> fadump_ - generic fadump in linux
> pnv_fadump_ - powernv specific fadump stuff.
I think I had mixed up fadump and mpipl in earlier version. Now I have settled with
opal_mpipl for OPAL side. Kernel side I think we are using above convention. If we
missed it we will fix it in next version.
>
> One of the things that has made reviewing the mpipl patches painful is that
> there's no clear seperation between what's defined in OPAL vs the kernel. It's
> not a showstopper, but it makes an already hard job even harder. A particularly
> eregious example is this bastard structure:
As mentioned above we missed to sync structures before sending out patches.
Sorry for that. It will be fixed in next version.
-Vasant
> From skiboot's opal-api.h:
>
> struct mpipl_fadump {
> u8 type; /* MPIPL_FADUMP_TYPE_* */
> u8 version;
> u8 reserved[6];
> u32 crashing_pir; /* OPAL crashing CPU PIR */
> u32 cpu_data_version;
> u32 cpu_data_size;
> u32 fadump_region_cnt;
> struct fadump_region region[];
> } __packed;
>
> From the kernel's opal-api.h:
>
> struct opal_mpipl_fadump {
> u8 type;
> u8 version;
> u8 reserved[6];
> __be32 crashing_pir;
> __be32 cpu_data_version;
> __be32 cpu_data_size;
> __be32 region_cnt;
> struct opal_fadump_region region[OPAL_FADUMP_NR_SECTIONS];
> } __attribute__((packed));
>
> NOT EVEN THE SAME NAME! OR DEFINITION!
>
More information about the Skiboot
mailing list