[PATCH 2/2] misc: smpro-errmon: Add dimm training failure syndrome

Paul Menzel pmenzel at molgen.mpg.de
Wed Feb 15 18:33:23 AEDT 2023


Dear Quan,


Thank you for your patch.

Am 14.02.23 um 07:45 schrieb Quan Nguyen:
> Adds event_dimm[0-15]_syndrome sysfs to report the failure syndrome
> to BMC when DIMM training failed.

Where you able to verify that it works? Out of curiosity, how?

> Signed-off-by: Quan Nguyen <quan at os.amperecomputing.com>
> ---
>   .../sysfs-bus-platform-devices-ampere-smpro   | 10 +++
>   drivers/misc/smpro-errmon.c                   | 77 +++++++++++++++++++
>   2 files changed, 87 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-platform-devices-ampere-smpro b/Documentation/ABI/testing/sysfs-bus-platform-devices-ampere-smpro
> index d4e3f308c451..c35f1d45e656 100644
> --- a/Documentation/ABI/testing/sysfs-bus-platform-devices-ampere-smpro
> +++ b/Documentation/ABI/testing/sysfs-bus-platform-devices-ampere-smpro
> @@ -265,6 +265,16 @@ Description:
>   		For more details, see section `5.7 GPI Status Registers and 5.9 Memory Error Register Definitions,
>   		Altra Family Soc BMC Interface Specification`.
>   
> +What:		/sys/bus/platform/devices/smpro-errmon.*/event_dimm[0-15]_syndrome
> +KernelVersion:	6.1

Should it be 6.2, as it probably won’t make it into 6.1?

> +Contact:	Quan Nguyen <quan at os.amperecomputing.com>
> +Description:
> +		(RO) The sysfs returns the 2-byte DIMM failure syndrome data for slot
> +		0-15 if it failed to initialized.

to initialize

> +
> +		For more details, see section `5.11 Boot Stage Register Definitions,
> +		Altra Family Soc BMC Interface Specification`.
> +
>   What:		/sys/bus/platform/devices/smpro-misc.*/boot_progress
>   KernelVersion:	6.1
>   Contact:	Quan Nguyen <quan at os.amperecomputing.com>
> diff --git a/drivers/misc/smpro-errmon.c b/drivers/misc/smpro-errmon.c
> index 1635e881aefb..3e8570cbb740 100644
> --- a/drivers/misc/smpro-errmon.c
> +++ b/drivers/misc/smpro-errmon.c
> @@ -47,6 +47,12 @@
>   #define WARN_PMPRO_INFO_LO	0xAC
>   #define WARN_PMPRO_INFO_HI	0xAD
>   
> +/* Boot Stage Register */
> +#define BOOTSTAGE		0xB0
> +#define DIMM_SYNDROME_SEL	0xB4
> +#define DIMM_SYNDROME_ERR	0xB5
> +#define DIMM_SYNDROME_STAGE	4
> +
>   /* PCIE Error Registers */
>   #define PCIE_CE_ERR_CNT		0xC0
>   #define PCIE_CE_ERR_LEN		0xC1
> @@ -468,6 +474,61 @@ EVENT_RO(vrd_hot, VRD_HOT_EVENT);
>   EVENT_RO(dimm_hot, DIMM_HOT_EVENT);
>   EVENT_RO(dimm_2x_refresh, DIMM_2X_REFRESH_EVENT);
>   
> +static ssize_t smpro_dimm_syndrome_read(struct device *dev, struct device_attribute *da,
> +					char *buf, int slot)

Could `slot` be passed as `unsigned int`?

> +{
> +	struct smpro_errmon *errmon = dev_get_drvdata(dev);
> +	s32 data;
> +	int ret;
> +
> +	ret = regmap_read(errmon->regmap, BOOTSTAGE, &data);

The function signature is:

     int regmap_read(struct regmap *map, unsigned int reg, unsigned int 
*val)

So why not use unsigned int as data type for `data`?

> +	if (ret)
> +		return ret;
> +
> +	/* check for valid stage */
> +	data = (data >> 8) & 0xff;
> +	if (data != DIMM_SYNDROME_STAGE)
> +		return ret;

Isn’t now success returned? Should a debug message be printed?

> +
> +	/* Write the slot ID to retrieve Error Syndrome */
> +	ret = regmap_write(errmon->regmap, DIMM_SYNDROME_SEL, slot);
> +	if (ret)
> +		return ret;
> +
> +	/* Read the Syndrome error */
> +	ret = regmap_read(errmon->regmap, DIMM_SYNDROME_ERR, &data);
> +	if (ret || !data)
> +		return ret;
> +
> +	return sysfs_emit(buf, "%04x\n", data);
> +}
> +
> +#define EVENT_DIMM_SYNDROME(_slot) \
> +	static ssize_t event_dimm##_slot##_syndrome_show(struct device *dev,          \
> +							 struct device_attribute *da, \
> +							 char *buf)                   \
> +	{                                                                             \
> +		return smpro_dimm_syndrome_read(dev, da, buf, _slot);                 \
> +	}                                                                             \
> +	static DEVICE_ATTR_RO(event_dimm##_slot##_syndrome)
> +
> +EVENT_DIMM_SYNDROME(0);
> +EVENT_DIMM_SYNDROME(1);
> +EVENT_DIMM_SYNDROME(2);
> +EVENT_DIMM_SYNDROME(3);
> +EVENT_DIMM_SYNDROME(4);
> +EVENT_DIMM_SYNDROME(5);
> +EVENT_DIMM_SYNDROME(6);
> +EVENT_DIMM_SYNDROME(7);
> +EVENT_DIMM_SYNDROME(8);
> +EVENT_DIMM_SYNDROME(9);
> +EVENT_DIMM_SYNDROME(10);
> +EVENT_DIMM_SYNDROME(11);
> +EVENT_DIMM_SYNDROME(12);
> +EVENT_DIMM_SYNDROME(13);
> +EVENT_DIMM_SYNDROME(14);
> +EVENT_DIMM_SYNDROME(15);
> +
>   static struct attribute *smpro_errmon_attrs[] = {
>   	&dev_attr_overflow_core_ce.attr,
>   	&dev_attr_overflow_core_ue.attr,
> @@ -493,6 +554,22 @@ static struct attribute *smpro_errmon_attrs[] = {
>   	&dev_attr_event_vrd_hot.attr,
>   	&dev_attr_event_dimm_hot.attr,
>   	&dev_attr_event_dimm_2x_refresh.attr,
> +	&dev_attr_event_dimm0_syndrome.attr,
> +	&dev_attr_event_dimm1_syndrome.attr,
> +	&dev_attr_event_dimm2_syndrome.attr,
> +	&dev_attr_event_dimm3_syndrome.attr,
> +	&dev_attr_event_dimm4_syndrome.attr,
> +	&dev_attr_event_dimm5_syndrome.attr,
> +	&dev_attr_event_dimm6_syndrome.attr,
> +	&dev_attr_event_dimm7_syndrome.attr,
> +	&dev_attr_event_dimm8_syndrome.attr,
> +	&dev_attr_event_dimm9_syndrome.attr,
> +	&dev_attr_event_dimm10_syndrome.attr,
> +	&dev_attr_event_dimm11_syndrome.attr,
> +	&dev_attr_event_dimm12_syndrome.attr,
> +	&dev_attr_event_dimm13_syndrome.attr,
> +	&dev_attr_event_dimm14_syndrome.attr,
> +	&dev_attr_event_dimm15_syndrome.attr,
>   	NULL
>   };


Kind regards,

Paul


More information about the openbmc mailing list