[Skiboot] [PATCH v8 20/24] MPIPL: Add OPAL API to query saved tags
Oliver O'Halloran
oohall at gmail.com
Tue Jul 9 20:03:22 AEST 2019
On Sun, 2019-06-16 at 22:40 +0530, Vasant Hegde wrote:
> Pre-MPIPL kernel saves various information required to create vmcore in
> metadata area and passes metadata area pointer to OPAL. OPAL will preserve
> this pointer across MPIPL. Post MPIPL kernel will request for saved tags
> via this API. Kernel also needs below tags:
> - Saved CPU registers data to access CPU registers
> - OPAL metadata area to create opalcore
>
> Format:
> opal_mpipl_query_tag(uint32_t idx, uint64_t *tag)
>
> idx : tag index (0..n)
> tag : OPAL will pass saved tag
>
> Kernel will make this call with increased `index` until OPAL returns
> OPAL_EMPTY.
>
> Return values:
> OPAL_SUCCESS : Operation success
> OPAL_PARAMETER : Invalid parameter
> OPAL_EMPTY : OPAL completed sending all tags to kernel
After spending a while picking through the linux patches I'll say that the
revised API is mostly fine, BUT:
a) The documentation is *woefully* inadequate, and
b) We need to think a bit more about how tags are used.
My interpretation is that the kernel tag is there to allow the crashing kernel
to define it's own dump metadata for the recieving kernel. As a result firmware,
which includes the petitboot kernel, can't make assumptions about the contents
of the tagged data since it could be populated by a version of Linux with a
different metadata ABI or by another OS entirely.
So as far as I can tell the overall process is something like:
1. We boot an OS. This may or may not be based on a Linux kernel.
2. OS configures MPIPL and sets the kernel tag to some structure it defines.
3. OS triggers a MPIPL. We re-run through the SBE and hostboot. Most of
memory is left as-is. Some memory used by hostboot, the ranges in the MDDT,
and the skiboot region at 0x3000_0000 are overwritten.
4. Skiboot parses the MDRT and sets a few DT flags to indicate an MPIPL
happened and populates the tags which are defined by OPAL.
5. Petitkernel boots and detects that an MPIPL happened and tries to preserve
the crashed kernel's memory.
6. We kexec into a capture kernel that knows how to parse the kernel tag's
metadata and generates a dump from that.
This raises a few questions:
1) How much memory does firmware have to work with? This doesn't seem to be
actually documented anywhere. The closest thing I can find to an answer is that
in the Linux patches will preserve at least 0x0...0x3000_0000 since that's what
the fadump.ops->get_bootmem_min() returns for powernv. However, I think the
minimum memory that firmware needs is something that it should tell the OS via
the DT rather than having the OS guess. We already communicate some of that
information via the fw-load-area DT property, but that only contains the
ranges where skiboot loads the kernel and initrd (0x2000_000...0x2fff_ffff).
Odds are hostboot doesn't use *that* much memory, but the petitboot kernel can
use quite a bit. I checked on a random lab system and it was using ~400MB
after a reboot, just sitting at the petitboot shell prompt.
2) What happens if the tag points into a range that is moved by firmware? If
this is considered a kernel bug then that needs to be documented in the API.
3) How does a kernel know if it can parse the data pointed to be the kernel tag?
Right now we assume that each tag (kernel tag included) points to a structure
with the following header:
struct tag_header {
u8 type;
u8 version;
u8 reserved[6]
/* stuff */
}
The fact this is a common header is not documented anywhere or described in any
of the relevant structures (mpipl_fadump in OPAL, and opal_fadump_mem_struct
in Linux), but the code in Linux which scans for the kernel tag assumes it's
there:
>From arch/powerpc/platforms/powernv/opal-fadump.c, line 597:
do {
ret = opal_mpipl_query_tag(idx, &addr);
if (ret != OPAL_SUCCESS)
break;
addr = be64_to_cpu(addr);
type = *((u8 *)addr);
switch (type) {
case MPIPL_FADUMP_TYPE_CPU:
opal_cpu_metadata = __va(addr);
break;
case MPIPL_FADUMP_TYPE_KERNEL:
opal_fdm_active = __va(addr);
break;
}
pr_debug("idx: %d, addr: %llx, type: %x\n",
idx, addr, type);
idx++;
} while (ret == OPAL_SUCCESS);
So if the crashing OS doesn't use that header and the first byte happens to be
the same as MPIPL_FADUMP_TYPE_KERNEL, we'll probably blow up horrible later on
while parsing the kernel-defined fadump structure. While we're here, why is
MPIPL_FADUMP_TYPE_KERNEL = 0x80? It seems like a relic of the old API.
That all said, I think most of these problems can be solved by making the
"type" field an output parameter of opal_mpipl_query_tag() so the common
header can be dropped.
Without the common header you could then add an address-only tag that's defined
by the crashing kernel that indicates the upper limit of preserved memory. That way
the crashing kernel can configure the dump table to make a few GB available for
petitboot and all petitboot needs to do is query that tag to determine how much
memory is availble to it.
Adding a uint64_t structure magic field to the structures defined by OPAL and
Linux is also a good idea since it gives you a reliable way to determine if
it's a structure you know how deal with or not.
Thoughts?
p.s.
Can we please settle on some kind of sane naming convention, like:
opal_mpipl_ - only used for stuff defined by OPAL and
fadump_ - generic fadump in linux
pnv_fadump_ - powernv specific fadump stuff.
One of the things that has made reviewing the mpipl patches painful is that
there's no clear seperation between what's defined in OPAL vs the kernel. It's
not a showstopper, but it makes an already hard job even harder. A particularly
eregious example is this bastard structure:
>From skiboot's opal-api.h:
struct mpipl_fadump {
u8 type; /* MPIPL_FADUMP_TYPE_* */
u8 version;
u8 reserved[6];
u32 crashing_pir; /* OPAL crashing CPU PIR */
u32 cpu_data_version;
u32 cpu_data_size;
u32 fadump_region_cnt;
struct fadump_region region[];
} __packed;
>From the kernel's opal-api.h:
struct opal_mpipl_fadump {
u8 type;
u8 version;
u8 reserved[6];
__be32 crashing_pir;
__be32 cpu_data_version;
__be32 cpu_data_size;
__be32 region_cnt;
struct opal_fadump_region region[OPAL_FADUMP_NR_SECTIONS];
} __attribute__((packed));
NOT EVEN THE SAME NAME! OR DEFINITION!
More information about the Skiboot
mailing list