[PATCH v5 1/3] powerpc: make fadump resilient with memory add/remove events
Hari Bathini
hbathini at linux.ibm.com
Fri Nov 17 17:14:01 AEDT 2023
On 17/11/23 11:01 am, Aneesh Kumar K V wrote:
> On 11/17/23 10:03 AM, Sourabh Jain wrote:
>> Hi Aneesh,
>>
>> Thanks for reviewing the patch.
>>
>> On 15/11/23 10:14, Aneesh Kumar K.V wrote:
>>> Sourabh Jain<sourabhjain at linux.ibm.com> writes:
>>>
>>> ....
>>>
>>>> diff --git a/arch/powerpc/include/asm/fadump-internal.h b/arch/powerpc/include/asm/fadump-internal.h
>>>> index 27f9e11eda28..7be3d8894520 100644
>>>> --- a/arch/powerpc/include/asm/fadump-internal.h
>>>> +++ b/arch/powerpc/include/asm/fadump-internal.h
>>>> @@ -42,7 +42,25 @@ static inline u64 fadump_str_to_u64(const char *str)
>>>> #define FADUMP_CPU_UNKNOWN (~((u32)0))
>>>> -#define FADUMP_CRASH_INFO_MAGIC fadump_str_to_u64("FADMPINF")
>>>> +/*
>>>> + * The introduction of new fields in the fadump crash info header has
>>>> + * led to a change in the magic key, from `FADMPINF` to `FADMPSIG`.
>>>> + * This alteration ensures backward compatibility, enabling the kernel
>>>> + * with the updated fadump crash info to handle kernel dumps from older
>>>> + * kernels.
>>>> + *
>>>> + * To prevent the need for further changes to the magic number in the
>>>> + * event of future modifications to the fadump header, a version field
>>>> + * has been introduced to track the fadump crash info header version.
>>>> + *
>>>> + * Historically, there was no connection between the magic number and
>>>> + * the fadump crash info header version. However, moving forward, the
>>>> + * `FADMPINF` magic number in header will be treated as version 0, while
>>>> + * the `FADMPSIG` magic number in header will include a version field to
>>>> + * determine its version.
>>>> + */
>>>> +#define FADUMP_CRASH_INFO_MAGIC fadump_str_to_u64("FADMPSIG")
>>>> +#define FADUMP_VERSION 1
>>>>
>>> Can we keep the old magic details as
>>>
>>> #define FADUMP_CRASH_INFO_MAGIC_OLD fadump_str_to_u64("FADMPINF")
>>> #define FADUMP_CRASH_INFO_MAGIC fadump_str_to_u64("FADMPSIG")
>>
>> Sure.
>>
>>> Also considering the struct need not be backward compatible, can we just
>>> do
>>>
>>> struct fadump_crash_info_header {
>>> u64 magic_number;
>>> u32 crashing_cpu;
>>> u64 elfcorehdr_addr;
>>> u64 elfcorehdr_size;
>>> u64 vmcoreinfo_raddr;
>>> u64 vmcoreinfo_size;
>>> struct pt_regs regs;
>>> struct cpumask cpu_mask;
>>> };
>>> static inline bool fadump_compatible(struct fadump_crash_info_header
>>> *fdh)
>>> {
>>> return (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC)
>>> }
>>>
>>> and fail fadump if we find it not compatible?
>>
>> Agree that it is unsafe to collect a dump with an incompatible fadump crash info header.
>>
>> Given that I am updating the fadump crash info header, we can make a few arrangements
>> like adding a size filed for the dynamic size attribute like pt_regs and cpumask to ensure
>> better compatibility in the future.
>>
>> Additionally, let's introduce a version field to the fadump crash info header to avoid changing
>> the magic number in the future.
>>
>
> I am not sure whether we need to add all the complexity to enable supporting different fadump kernel
> version. Is that even a possible use case with fadump? Can't we always assume that with fadump the
> crash kernel and fadump kernel will be same version? if yes we can simply fail with a magic number
> mismatch because that indicates an user config error?
If we decide not to support different kernel versions for production
kernel and capture kernel, We can make that implicit by adding kernel
version info of production kernel in the header and bailing out if
there is kernel version mismatch as magic could still match for two
different kernel versions.
I would personally prefer something like the below though:
struct fadump_crash_info_header {
u64 magic_number;
u32 version
u32 crashing_cpu;
u64 elfcorehdr_addr;
u64 elfcorehdr_size;
u64 vmcoreinfo_raddr;
u64 vmcoreinfo_size;
u8 kernel_version[];
u32 pt_regs_sz;
struct pt_regs regs;
u32 cpu_mask_sz;
struct cpumask cpu_mask;
};
if (magic_number != new_magic)
goto err; /* Error out */
if (kernel_version != capture_kernel_version)
{
if (cpu_mask_sz == sizeof(struct pt_regs) && cpu_mask_sz ==
sizeof(struct cpumask))
/*
* Warn about the kernel version mismatch and how data can be different
* across kernel versions and proceed anyway!
*/
else
goto err; /* Error out */
}
This ensures we warn and proceed in cases where it is less likely to
have issues capturing kernel dump. This helps in dev environment where
we are trying to debug an early boot crash - in which case capture
kernel can't be the same kernel as it would likely hit the same problem
while booting..
Thanks
Hari
More information about the Linuxppc-dev
mailing list