[PATCH v5 1/3] powerpc: make fadump resilient with memory add/remove events

Hari Bathini hbathini at linux.ibm.com
Fri Nov 17 17:14:01 AEDT 2023



On 17/11/23 11:01 am, Aneesh Kumar K V wrote:
> On 11/17/23 10:03 AM, Sourabh Jain wrote:
>> Hi Aneesh,
>>
>> Thanks for reviewing the patch.
>>
>> On 15/11/23 10:14, Aneesh Kumar K.V wrote:
>>> Sourabh Jain<sourabhjain at linux.ibm.com>  writes:
>>>
>>> ....
>>>
>>>> diff --git a/arch/powerpc/include/asm/fadump-internal.h b/arch/powerpc/include/asm/fadump-internal.h
>>>> index 27f9e11eda28..7be3d8894520 100644
>>>> --- a/arch/powerpc/include/asm/fadump-internal.h
>>>> +++ b/arch/powerpc/include/asm/fadump-internal.h
>>>> @@ -42,7 +42,25 @@ static inline u64 fadump_str_to_u64(const char *str)
>>>>      #define FADUMP_CPU_UNKNOWN        (~((u32)0))
>>>>    -#define FADUMP_CRASH_INFO_MAGIC        fadump_str_to_u64("FADMPINF")
>>>> +/*
>>>> + * The introduction of new fields in the fadump crash info header has
>>>> + * led to a change in the magic key, from `FADMPINF` to `FADMPSIG`.
>>>> + * This alteration ensures backward compatibility, enabling the kernel
>>>> + * with the updated fadump crash info to handle kernel dumps from older
>>>> + * kernels.
>>>> + *
>>>> + * To prevent the need for further changes to the magic number in the
>>>> + * event of future modifications to the fadump header, a version field
>>>> + * has been introduced to track the fadump crash info header version.
>>>> + *
>>>> + * Historically, there was no connection between the magic number and
>>>> + * the fadump crash info header version. However, moving forward, the
>>>> + * `FADMPINF` magic number in header will be treated as version 0, while
>>>> + * the `FADMPSIG` magic number in header will include a version field to
>>>> + * determine its version.
>>>> + */
>>>> +#define FADUMP_CRASH_INFO_MAGIC        fadump_str_to_u64("FADMPSIG")
>>>> +#define FADUMP_VERSION            1
>>>>
>>> Can we keep the old magic details as
>>>
>>> #define FADUMP_CRASH_INFO_MAGIC_OLD        fadump_str_to_u64("FADMPINF")
>>> #define FADUMP_CRASH_INFO_MAGIC        fadump_str_to_u64("FADMPSIG")
>>
>> Sure.
>>
>>> Also considering the struct need not be backward compatible, can we just
>>> do
>>>
>>> struct fadump_crash_info_header {
>>>      u64        magic_number;
>>>      u32        crashing_cpu;
>>>      u64        elfcorehdr_addr;
>>>      u64        elfcorehdr_size;
>>>      u64        vmcoreinfo_raddr;
>>>      u64        vmcoreinfo_size;
>>>      struct pt_regs    regs;
>>>      struct cpumask    cpu_mask;
>>> };
>>> static inline bool fadump_compatible(struct fadump_crash_info_header
>>> *fdh)
>>> {
>>>      return (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC)
>>> }
>>>
>>> and fail fadump if we find it not compatible?
>>
>> Agree that it is unsafe to collect a dump with an incompatible fadump crash info header.
>>
>> Given that I am updating the fadump crash info header, we can make a few arrangements
>> like adding a size filed for the dynamic size attribute like pt_regs and cpumask to ensure
>> better compatibility in the future.
>>
>> Additionally, let's introduce a version field to the fadump crash info header to avoid changing
>> the magic number in the future.
>>
> 
> I am not sure whether we need to add all the complexity to enable supporting different fadump kernel
> version. Is that even a possible use case with fadump? Can't we always assume that with fadump the
> crash kernel and fadump kernel will be same version? if yes we can simply fail with a magic number
> mismatch because that indicates an user config error?

If we decide not to support different kernel versions for production
kernel and capture kernel, We can make that implicit by adding kernel
version info of production kernel in the header and bailing out if
there is kernel version mismatch as magic could still match for two
different kernel versions.

I would personally prefer something like the below though:

	struct fadump_crash_info_header {
		u64		magic_number;
		u32		version
		u32		crashing_cpu;
		u64		elfcorehdr_addr;
		u64		elfcorehdr_size;
		u64		vmcoreinfo_raddr;
		u64		vmcoreinfo_size;
		u8		kernel_version[];
		u32		pt_regs_sz;
		struct pt_regs	regs;
		u32		cpu_mask_sz;
		struct cpumask	cpu_mask;
	};

	if (magic_number != new_magic)
		goto err;	/* Error out */

	if (kernel_version != capture_kernel_version)
	{
		if (cpu_mask_sz == sizeof(struct pt_regs) && cpu_mask_sz == 
sizeof(struct cpumask))
			/*
			 * Warn about the kernel version mismatch and how data can be different
			 * across kernel versions and proceed anyway!
			 */
		else
			goto err;	/* Error out */
	}

This ensures we warn and proceed in cases where it is less likely to
have issues capturing kernel dump. This helps in dev environment where
we are trying to debug an early boot crash - in which case capture
kernel can't be the same kernel as it would likely hit the same problem
while booting..

Thanks
Hari


More information about the Linuxppc-dev mailing list