[Skiboot] [PATCH] opal/hmi: Handle early HMIs on thread0 when secondaries are still in OPAL.

Stewart Smith stewart at linux.ibm.com
Thu Sep 27 17:15:25 AEST 2018


Mahesh J Salgaonkar <mahesh at linux.vnet.ibm.com> writes:
> From: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
>
> When the primary thread receives a CORE-level HMI for a timer facility
> error while the secondaries are still in OPAL, thread 0 ends up in
> rendez-vous waiting for the secondaries to enter HMI handling. This is
> because OPAL runs with MSR(EE=0), so HMIs are delayed on secondary threads
> until they are handed over to the Linux OS. Fix this by checking the
> secondaries' state and forcing them into HMI handling by queuing a job on
> each secondary thread.
>
> I have tested this by injecting an HDEC parity error very early during
> Linux kernel boot. Recovery works fine for non-TB errors. But if the TB is
> bad at this very early stage we are already doomed.
>
> Without this patch we see:
>
> [  285.046347408,7] OPAL: Start CPU 0x0843 (PIR 0x0843) -> 0x000000000000a83c
> [  285.051160609,7] OPAL: Start CPU 0x0844 (PIR 0x0844) -> 0x000000000000a83c
> [  285.055359021,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
> [  285.055361439,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:0: TFMR(2e12002870e14000) Timer Facility Error
> [  286.232183823,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc1)
> [  287.409002056,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc1)
> [  289.073820164,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc1)
> [  290.250638683,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc2)
> [  291.427456821,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc2)
> [  293.092274807,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc2)
> [  294.269092904,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc3)
> [  295.445910944,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc3)
> [  297.110728970,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc3)
>
> After this patch:
>
> [  259.401719351,7] OPAL: Start CPU 0x0841 (PIR 0x0841) -> 0x000000000000a83c
> [  259.406259572,7] OPAL: Start CPU 0x0842 (PIR 0x0842) -> 0x000000000000a83c
> [  259.410615534,7] OPAL: Start CPU 0x0843 (PIR 0x0843) -> 0x000000000000a83c
> [  259.415444519,7] OPAL: Start CPU 0x0844 (PIR 0x0844) -> 0x000000000000a83c
> [  259.419641401,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
> [  259.419644124,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:0: TFMR(2e12002870e04000) Timer Facility Error
> [  259.419650678,7] HMI: Sending hmi job to thread 1
> [  259.419652744,7] HMI: Sending hmi job to thread 2
> [  259.419653051,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
> [  259.419654725,7] HMI: Sending hmi job to thread 3
> [  259.419654916,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
> [  259.419658025,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
> [  259.419658406,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:2: TFMR(2e12002870e04000) Timer Facility Error
> [  259.419663095,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:3: TFMR(2e12002870e04000) Timer Facility Error
> [  259.419655234,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:1: TFMR(2e12002870e04000) Timer Facility Error
> [  259.425109779,7] OPAL: Start CPU 0x0845 (PIR 0x0845) -> 0x000000000000a83c
> [  259.429870681,7] OPAL: Start CPU 0x0846 (PIR 0x0846) -> 0x000000000000a83c
> [  259.434549250,7] OPAL: Start CPU 0x0847 (PIR 0x0847) -> 0x000000000000a83c
>
> Signed-off-by: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
> ---
>  core/hmi.c |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 49 insertions(+)

Thanks, merged to master as of c884f2d0cb921131737df99ed3aad9f5a2d2945f
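
For anyone reading this in the archive without the diff to hand, the
mechanism described in the commit message can be modelled roughly as below.
This is only a single-threaded sketch and not the code in core/hmi.c: every
type and function name here (model_thread, hmi_kick_secondaries,
secondary_poll, rendezvous_complete) is hypothetical, and the real patch
works on skiboot's own per-CPU structures and job queue. The one thing it
takes from the patch is the idea that thread 0 checks whether each secondary
is still in OPAL and, if so, queues a job so that it joins the rendez-vous
instead of waiting for an HMI that EE=0 keeps pending.

```c
#include <assert.h>
#include <stdbool.h>

#define THREADS_PER_CORE 4

/* Hypothetical model types; none of these names come from the patch. */
enum thread_state { THREAD_IN_OPAL, THREAD_IN_OS };

struct model_thread {
	enum thread_state state;
	bool hmi_job_queued;	/* job pushed to this thread by thread 0 */
	bool in_hmi_handler;	/* thread has joined the rendez-vous */
};

struct model_core {
	struct model_thread t[THREADS_PER_CORE];
};

/*
 * Thread 0 side: it is already in the HMI handler. For every secondary
 * that is still in OPAL (where EE=0 keeps the HMI pending), queue a job
 * so the secondary enters HMI handling itself.
 * Returns the number of jobs queued.
 */
static int hmi_kick_secondaries(struct model_core *c)
{
	int queued = 0;

	c->t[0].in_hmi_handler = true;
	for (int i = 1; i < THREADS_PER_CORE; i++) {
		struct model_thread *s = &c->t[i];

		if (s->state == THREAD_IN_OPAL &&
		    !s->in_hmi_handler && !s->hmi_job_queued) {
			s->hmi_job_queued = true;
			queued++;
		}
	}
	return queued;
}

/*
 * Secondary side: a thread already running the OS takes the HMI
 * directly; a thread in OPAL only gets there via a queued job.
 */
static void secondary_poll(struct model_core *c, int i)
{
	struct model_thread *s = &c->t[i];

	assert(i > 0 && i < THREADS_PER_CORE);
	if (s->state == THREAD_IN_OS || s->hmi_job_queued) {
		s->hmi_job_queued = false;
		s->in_hmi_handler = true;
	}
}

/* The rendez-vous completes once every thread on the core has joined. */
static bool rendezvous_complete(const struct model_core *c)
{
	for (int i = 0; i < THREADS_PER_CORE; i++)
		if (!c->t[i].in_hmi_handler)
			return false;
	return true;
}
```

In this model, polling the secondaries without calling
hmi_kick_secondaries() first never completes the rendez-vous, which
corresponds to the stage 1 timeouts in the "without this patch" log. With
the call, three jobs are queued, matching the three "Sending hmi job to
thread N" lines in the "after" log.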

-- 
Stewart Smith
OPAL Architect, IBM.


