[Skiboot] [PATCH v2 12/12] opal: Recover from TOD register parity errors.

Stewart Smith stewart at linux.vnet.ibm.com
Thu Jul 9 17:08:50 AEST 2015


Mahesh J Salgaonkar <mahesh at linux.vnet.ibm.com> writes:

> From: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
>
> This patch implements recovery from parity errors on below listed TOD
> control registers:
>
> - Master Path control register (0x00040000)
> - Primary Port-0 control register (0x00040001)
> - Primary Port-1 control register (0x00040002)
> - Secondary Port-0 control register (0x00040003)
> - Secondary Port-1 control register (0x00040004)
> - Slave Path control register (0x00040005)
> - Internal Path control register (0x00040006)
> - Primary/secondary master/slave control register (0x00040007)
> - Chip control register (0x00040010)
>
> To inject TOD register parity error issue:
> 	putscom pu 40031 8000000000000000 -pall  # (00040000)
> 	putscom pu 40031 1000000000000000 -pall  # (00040001)
> 	putscom pu 40031 0800000000000000 -pall  # (00040002)
> 	putscom pu 40031 0400000000000000 -pall  # (00040003)
> 	putscom pu 40031 0200000000000000 -pall  # (00040004)
> 	putscom pu 40031 0100000000000000 -pall  # (00040005)
> 	putscom pu 40031 0080000000000000 -pall  # (00040006)
> 	putscom pu 40031 0040000000000000 -pall  # (00040007)
> 	putscom pu 40031 0000000080000000 -pall  # (00040010)
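
For anyone wanting to script the injections, here is a minimal sketch that keeps the mask-to-register pairs from the list above in one table and rebuilds the same putscom lines from it. The names INJECT_MASKS, register_for_mask and inject_commands are made up for illustration; the masks, register addresses and command format are copied from the patch description, nothing else is assumed.

```python
# Inject mask -> TOD control register, copied from the patch description.
INJECT_MASKS = {
    0x8000000000000000: 0x00040000,  # Master Path control
    0x1000000000000000: 0x00040001,  # Primary Port-0 control
    0x0800000000000000: 0x00040002,  # Primary Port-1 control
    0x0400000000000000: 0x00040003,  # Secondary Port-0 control
    0x0200000000000000: 0x00040004,  # Secondary Port-1 control
    0x0100000000000000: 0x00040005,  # Slave Path control
    0x0080000000000000: 0x00040006,  # Internal Path control
    0x0040000000000000: 0x00040007,  # Primary/secondary master/slave control
    0x0000000080000000: 0x00040010,  # Chip control
}


def register_for_mask(mask):
    """Return the TOD register a given inject mask targets (hypothetical helper)."""
    return INJECT_MASKS[mask]


def inject_commands():
    """Rebuild the putscom lines in the same format as the patch description."""
    return [
        f"putscom pu 40031 {mask:016x} -pall  # ({reg:08x})"
        for mask, reg in INJECT_MASKS.items()
    ]


if __name__ == "__main__":
    for line in inject_commands():
        print(line)
```

This only prints the commands; it does not run putscom itself.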

I managed to get a bunch of these recovering okay, so that's great!

I also managed to get a whole bunch of oopses with the last one, so there
may be some bugs somewhere.

Trying this on a palmetto though:
[130796482015,5] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[130796482015,3] CHIPTOD: Running check fail timeout
[130796482015,5] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[130796482015,3] CHIPTOD: Running check fail timeout
[130796482015,5] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[130796482015,3] CHIPTOD: Running check fail timeout
[130796482015,5] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[130796482015,3] CHIPTOD: Running check fail timeout
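
The repeated HMER value is easier to read as bit numbers. A minimal sketch, assuming the usual Power convention of numbering bits from the most significant end (bit 0 is the MSB of the 64-bit register); set_bits_ibm is a made-up helper name. What each bit means still has to be looked up in the HMER bit definitions in the skiboot source; that mapping is deliberately not encoded here.

```python
def set_bits_ibm(value, width=64):
    """List the set bits of value using IBM numbering (bit 0 = most significant)."""
    return [i for i in range(width) if value & (1 << (width - 1 - i))]


if __name__ == "__main__":
    hmer = 0x0840000000000000  # value from the log lines above
    print(set_bits_ibm(hmer))  # [4, 9]
```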

which makes sense when you look at the skiboot log:
[41797933478,6] CHIPTOD: Calculated MCBS is 0x4b (Cfreq=4024000000 Tfreq=32000000).
[41797936890,7] CHIPTOD: Base TFMR=0x4b12000000000000.
[41796320335,7] CHIPTOD: Master sync on CPU PIR 0x0048....
[41801786711,7] CHIPTOD: Slave sync on CPU PIR 0x0058....
[41809523634,7] CHIPTOD: Slave sync on CPU PIR 0x0068....
[7309025,7] CHIPTOD: PIR 0x0048 TB=6f86cc.
[7313179,7] CHIPTOD: PIR 0x0058 TB=6f9707.
[7680789,7] CHIPTOD: PIR 0x0068 TB=753300.
[7689063,7] CHIPTOD: TOD Topology in Use: Primary.
[7691343,7] CHIPTOD:   Primary configuration:.
[7693122,7] CHIPTOD:    chip id: 0, Role: MDMT, Status: Active Master.
[7696585,7] CHIPTOD:   Secondary configuration:.
[7698434,7] CHIPTOD:    chip id: 0, Role: MDMT, Status: Active Master.


So is this a limitation of current OpenPower? What do we need to do to
enable it?


