[Skiboot] [PATCH] Run pollers in time_wait() when not booting

Stewart Smith stewart at linux.ibm.com
Wed Oct 31 15:25:51 AEDT 2018


Andrew Jeffery <andrew at aj.id.au> writes:
> On Tue, 30 Oct 2018, at 17:43, Stewart Smith wrote:
>> This only bit us hard with hioma in one scenario.
>
> Missing a 'p' in 'hiomap'

Cheers, fixed.

>> Our OPAL API has been OPAL_POLL_EVENTS may be needed to make forward
>> progress on ongoing operations, and the internal to skiboot API has been
>> that time_wait() of a suitable time will run pollers (on at least one
>> CPU) to help ensure forward progress can be made.
>> 
>> In a perfect world, interrupts are used but they may a) be disabled, or
>> b) the thing we're doing can't use interrupts because computers are
>> generally terrible.
>> 
>> Back in 3db397ea5892a (circa 2015), we changed skiboot so that we'd run
>> pollers only on the boot CPU, and not if we held any locks. This was to
>> reduce the chance of programming code that could deadlock, as well as to
>> ensure that we didn't just thrash all the cachelines for running pollers
>> all over a large system during boot, or hard spin on the same locks on
>> all secondary CPUs.
>> 
>> The problem arises if the OS we're booting makes an OPAL call early on,
>> with interrupts disabled, that requires a poller to run to make forward
>> progress. An example of this would be OPAL_WRITE_NVRAM early in Linux
>> boot (where Linux sets up the partitions it wants) - something that
>> occurs iff we've had to reformat NVRAM this boot (i.e. first boot or
>> corrupted NVRAM).
>> 
>> The hiomap implementation should arguably *not* rely on synchronous IPMI
>> messages, but this is a future improvement (as was for mbox before it).
>> The mbox-flash code solved this problem by spinning on check_timers().
>> 
>> More generically though, the approach of running the pollers when no
>> longer booting means we behave more in line with what the API is meant
>> to be, rather than have this odd case of "time_wait() for a condition
>> that could also be tripped by an interrupt works fine unless the OS is
>> up and running but hasn't set interrupts up yet".
>
> Nice description.
>
>> 
>> Fixes: 529bdca0bc546a7ae3ecbd2c3134b7260072d8b0
>> Fixes: 3db397ea5892a8b348cf412739996731884561b3
>> Signed-off-by: Stewart Smith <stewart at linux.ibm.com>
>> ---
>>  core/timebase.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/core/timebase.c b/core/timebase.c
>> index 24e72959f542..f9f1a55b8dbb 100644
>> --- a/core/timebase.c
>> +++ b/core/timebase.c
>> @@ -57,7 +57,7 @@ void time_wait(unsigned long duration)
>>  		return;
>>  	}
>>  
>> -	if (c != boot_cpu)
>> +	if (c != boot_cpu && opal_booting())
>>  		time_wait_nopoll(duration);
>>  	else
>>  		time_wait_poll(duration);
>> -- 
>> 2.17.2
>> 
>
> Applying this to master and building gave me:
>
> core/timebase.c: In function ‘time_wait’:
> core/timebase.c:60:23: error: implicit declaration of function ‘opal_booting’ [-Werror=implicit-function-declaration]
>   if (c != boot_cpu && opal_booting())
>                        ^~~~~~~~~~~~
>
> But it passes snowpatch according to patchwork :thinking_face:

That'll be because of b1dee4a43dc3b10e89355964c8cd1f6e26447837 which
split the debug_descriptor routines into their own include file. The
patch will work on 6.0.x okay, but not master.

Teaches me for just building on 6.0.x :)

> diff --git a/core/timebase.c b/core/timebase.c
> index 24e72959f542..830ca53d2442 100644
> --- a/core/timebase.c
> +++ b/core/timebase.c
> @@ -19,6 +19,7 @@
>  #include <opal.h>
>  #include <cpu.h>
>  #include <chip.h>
> +#include <debug_descriptor.h>
>  
>  unsigned long tb_hz = 512000000;
>  
> @@ -57,7 +58,7 @@ void time_wait(unsigned long duration)
>                 return;
>         }
>  
> -       if (c != boot_cpu)
> +       if (c != boot_cpu && opal_booting())
>                 time_wait_nopoll(duration);
>         else
>                 time_wait_poll(duration);
>
> Anyway, with the additional include the build looks like it rids ozrom2 of the stuck thread problem, which it was hitting consistently prior to applying it.

Excellent.

-- 
Stewart Smith
OPAL Architect, IBM.



More information about the Skiboot mailing list