[Skiboot] [PATCH v2 12/12] opal: Recover from TOD register parity errors.

Mahesh Jagannath Salgaonkar mahesh at linux.vnet.ibm.com
Fri Jul 10 03:41:16 AEST 2015


On 07/09/2015 12:38 PM, Stewart Smith wrote:
> Mahesh J Salgaonkar <mahesh at linux.vnet.ibm.com> writes:
> 
>> From: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
>>
>> This patch implements recovery from parity errors on the TOD
>> control registers listed below:
>>
>> - Master Path control register (0x00040000)
>> - Primary Port-0 control register (0x00040001)
>> - Primary Port-1 control register (0x00040002)
>> - Secondary Port-0 control register (0x00040003)
>> - Secondary Port-1 control register (0x00040004)
>> - Slave Path control register (0x00040005)
>> - Internal Path control register (0x00040006)
>> - Primary/secondary master/slave control register (0x00040007)
>> - Chip control register (0x00040010)
>>
>> To inject a TOD register parity error, issue:
>> 	putscom pu 40031 8000000000000000 -pall  # (00040000)
>> 	putscom pu 40031 1000000000000000 -pall  # (00040001)
>> 	putscom pu 40031 0800000000000000 -pall  # (00040002)
>> 	putscom pu 40031 0400000000000000 -pall  # (00040003)
>> 	putscom pu 40031 0200000000000000 -pall  # (00040004)
>> 	putscom pu 40031 0100000000000000 -pall  # (00040005)
>> 	putscom pu 40031 0080000000000000 -pall  # (00040006)
>> 	putscom pu 40031 0040000000000000 -pall  # (00040007)
>> 	putscom pu 40031 0000000080000000 -pall  # (00040010)
> 
> I managed to get a bunch of these recovering okay, so that's great!
> 
> I also managed to get a whole bunch of oopses with the last one, so there
> may be some bugs somewhere.
> 
> Trying this on a palmetto though:

Strange. I see recovery working fine on firestone. Let me try it on
palmetto tomorrow.
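
For anyone retesting, a rough helper sketch (not part of the patch) that maps each inject mask from the putscom list above to its IBM bit number (bit 0 is the most significant bit) and to the control register noted in the comment next to it. The table is copied from the patch description; the names INJECT_MASKS, ibm_bit and describe are made up for illustration.

```python
# Sketch: map TOD error-inject masks (the values written to SCOM 40031
# in the patch description) to IBM bit numbers and target registers.
# The table is copied from the putscom list; all names are illustrative.

INJECT_MASKS = {
    0x8000000000000000: 0x00040000,  # Master Path control register
    0x1000000000000000: 0x00040001,  # Primary Port-0 control register
    0x0800000000000000: 0x00040002,  # Primary Port-1 control register
    0x0400000000000000: 0x00040003,  # Secondary Port-0 control register
    0x0200000000000000: 0x00040004,  # Secondary Port-1 control register
    0x0100000000000000: 0x00040005,  # Slave Path control register
    0x0080000000000000: 0x00040006,  # Internal Path control register
    0x0040000000000000: 0x00040007,  # Pri/sec master/slave control register
    0x0000000080000000: 0x00040010,  # Chip control register
}


def ibm_bit(mask):
    """Return the IBM (MSB = bit 0) position of a single-bit 64-bit mask."""
    if mask <= 0 or mask & (mask - 1) or mask >> 64:
        raise ValueError("expected exactly one bit set in a 64-bit value")
    return 64 - mask.bit_length()


def describe(mask):
    """Format one inject command with its bit number and target register."""
    return "putscom pu 40031 %016x -> bit %d, register 0x%08x" % (
        mask, ibm_bit(mask), INJECT_MASKS[mask])


if __name__ == "__main__":
    for mask in INJECT_MASKS:
        print(describe(mask))
```

Running it prints one line per inject command, which makes it easier to tell which register a given failing case was aimed at.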

> [130796482015,5] HMI: Received HMI interrupt: HMER = 0x0840000000000000
> [130796482015,3] CHIPTOD: Running check fail timeout
> [130796482015,5] HMI: Received HMI interrupt: HMER = 0x0840000000000000
> [130796482015,3] CHIPTOD: Running check fail timeout
> [130796482015,5] HMI: Received HMI interrupt: HMER = 0x0840000000000000
> [130796482015,3] CHIPTOD: Running check fail timeout
> [130796482015,5] HMI: Received HMI interrupt: HMER = 0x0840000000000000
> [130796482015,3] CHIPTOD: Running check fail timeout
> 
> which makes sense when you look at the skiboot log:
> [41797933478,6] CHIPTOD: Calculated MCBS is 0x4b (Cfreq=4024000000 Tfreq=32000000).
> [41797936890,7] CHIPTOD: Base TFMR=0x4b12000000000000.
> [41796320335,7] CHIPTOD: Master sync on CPU PIR 0x0048....
> [41801786711,7] CHIPTOD: Slave sync on CPU PIR 0x0058....
> [41809523634,7] CHIPTOD: Slave sync on CPU PIR 0x0068....
> [7309025,7] CHIPTOD: PIR 0x0048 TB=6f86cc.
> [7313179,7] CHIPTOD: PIR 0x0058 TB=6f9707.
> [7680789,7] CHIPTOD: PIR 0x0068 TB=753300.
> [7689063,7] CHIPTOD: TOD Topology in Use: Primary.
> [7691343,7] CHIPTOD:   Primary configuration:.
> [7693122,7] CHIPTOD:    chip id: 0, Role: MDMT, Status: Active Master.
> [7696585,7] CHIPTOD:   Secondary configuration:.
> [7698434,7] CHIPTOD:    chip id: 0, Role: MDMT, Status: Active Master.

Palmetto is a single-chip system. It looks like it is wrongly reporting
the same chip id for the secondary topology to OPAL through the device tree.
Ideally we should have seen chip id -1 for the secondary configuration.
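
A minimal sketch of that expectation, assuming -1 marks an unconfigured topology as described above. This is not skiboot code; CHIP_ID_NONE and secondary_is_suspect are hypothetical names, and the check only covers the single-chip case.

```python
# Sketch of the sanity check described above. Not skiboot code: the
# constant and function names are hypothetical.

CHIP_ID_NONE = -1  # chip id expected for a topology that is not configured


def secondary_is_suspect(secondary_chip_id, num_chips):
    """Flag a secondary topology reported on a single-chip system.

    On a single-chip system the secondary configuration is expected to
    carry CHIP_ID_NONE; any other value suggests bad topology data.
    Multi-chip systems are not judged here.
    """
    return num_chips == 1 and secondary_chip_id != CHIP_ID_NONE


if __name__ == "__main__":
    # Values taken from the log above: one chip, secondary chip id 0.
    print(secondary_is_suspect(0, 1))
```

With the values from the log (one chip, secondary chip id 0) the check flags the configuration, which matches the suspicion that the topology data is wrong.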

Let me grab a Palmetto if I can and verify all this.

Thanks,
-Mahesh.

> 
> 
> So is this a limitation of current OpenPower? What do we need to do to
> enable it?
> 


