[Skiboot] [PATCH] Run pollers in time_wait() when not booting
Andrew Jeffery
andrew at aj.id.au
Tue Oct 30 22:15:54 AEDT 2018
On Tue, 30 Oct 2018, at 17:43, Stewart Smith wrote:
> This only bit us hard with hioma in one scenario.
Missing a 'p' in 'hiomap'
>
> Our OPAL API has been OPAL_POLL_EVENTS may be needed to make forward
> progress on ongoing operations, and the internal to skiboot API has been
> that time_wait() of a suitable time will run pollers (on at least one
> CPU) to help ensure forward progress can be made.
>
> In a perfect world, interrupts are used but they may a) be disabled, or
> b) the thing we're doing can't use interrupts because computers are
> generally terrible.
>
> Back in 3db397ea5892a (circa 2015), we changed skiboot so that we'd run
> pollers only on the boot CPU, and not if we held any locks. This was to
> reduce the chance of programming code that could deadlock, as well as to
> ensure that we didn't just thrash all the cachelines for running pollers
> all over a large system during boot, or hard spin on the same locks on
> all secondary CPUs.
>
> The problem arises if the OS we're booting makes an OPAL call early on,
> with interrupts disabled, that requires a poller to run to make forward
> progress. An example of this would be OPAL_WRITE_NVRAM early in Linux
> boot (where Linux sets up the partitions it wants) - something that
> occurs iff we've had to reformat NVRAM this boot (i.e. first boot or
> corrupted NVRAM).
>
> The hiomap implementation should arguably *not* rely on synchronous IPMI
> messages, but this is a future improvement (as was for mbox before it).
> The mbox-flash code solved this problem by spinning on check_timers().
>
> More generically though, the approach of running the pollers when no
> longer booting means we behave more in line with what the API is meant
> to be, rather than have this odd case of "time_wait() for a condition
> that could also be tripped by an interrupt works fine unless the OS is
> up and running but hasn't set interrupts up yet".
Nice description.
>
> Fixes: 529bdca0bc546a7ae3ecbd2c3134b7260072d8b0
> Fixes: 3db397ea5892a8b348cf412739996731884561b3
> Signed-off-by: Stewart Smith <stewart at linux.ibm.com>
> ---
> core/timebase.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/core/timebase.c b/core/timebase.c
> index 24e72959f542..f9f1a55b8dbb 100644
> --- a/core/timebase.c
> +++ b/core/timebase.c
> @@ -57,7 +57,7 @@ void time_wait(unsigned long duration)
> return;
> }
>
> - if (c != boot_cpu)
> + if (c != boot_cpu && opal_booting())
> time_wait_nopoll(duration);
> else
> time_wait_poll(duration);
> --
> 2.17.2
>
Applying this to master and building gave me:
core/timebase.c: In function ‘time_wait’:
core/timebase.c:60:23: error: implicit declaration of function ‘opal_booting’ [-Werror=implicit-function-declaration]
if (c != boot_cpu && opal_booting())
^~~~~~~~~~~~
But it passes snowpatch according to patchwork :thinking_face:
I had to do this:
diff --git a/core/timebase.c b/core/timebase.c
index 24e72959f542..830ca53d2442 100644
--- a/core/timebase.c
+++ b/core/timebase.c
@@ -19,6 +19,7 @@
#include <opal.h>
#include <cpu.h>
#include <chip.h>
+#include <debug_descriptor.h>
unsigned long tb_hz = 512000000;
@@ -57,7 +58,7 @@ void time_wait(unsigned long duration)
return;
}
- if (c != boot_cpu)
+ if (c != boot_cpu && opal_booting())
time_wait_nopoll(duration);
else
time_wait_poll(duration);
Anyway, with the additional include the build looks like it rids ozrom2 of the stuck thread problem, which it was hitting consistently prior to applying it.
Thanks for chasing that down.
Andrew
More information about the Skiboot
mailing list