[PATCH] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards

Sat Apr 24 01:59:39 AEST 2021

* Michal Suchánek <msuchanek at suse.de> [2021-04-23 09:35:51]:

> On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> > From: "Gautham R. Shenoy" <ego at linux.vnet.ibm.com>
> > 
> > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > of the Extended CEDE states advertised by the platform
> > 
> > On some of the POWER9 LPARs, the older firmwares advertise a very low
> > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> Can you be more specific about 'older firmwares'?

Hi Michal,

This is a POWER9 vs POWER10 difference, not really an obsolete FW.  The
key idea behind the original patch was to make the H_CEDE latency, and
hence the target residency, come from firmware instead of being decided
by the kernel.  The advantage is that different types of systems in the
POWER10 generation can adjust this value and have optimal H_CEDE entry
criteria that balance good single-thread performance and wakeup
latency.  Further, we can have additional H_CEDE states to feed into
cpuidle.

> Also while this is a performance regression on such firmwares it
> should be fixed by updating the firmware to current version.
> 
> Having sub-optimal performance on obsolete firmware should not require a
> kernel workaround, should it?

When we designed and tested this change on POWER9 and POWER10 systems
the values that were set in F/w were working out fine with positive
results in all our micro benchmarks and no regression in context
switch tests.  These repeatable results gave us the confidence that we
can go ahead and set the values from F/w and remove the kernel's value
for all future Linux versions.

But where we slipped is that real-world workloads show variations in
performance, and regressions in specific cases, because we are
favouring the H_CEDE state more often than the snooze loop.  The root
cause is that we now have to send more IPIs to wake up CPUs, because
more CPUs will be in the H_CEDE state than before.
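
A rough illustration of the effect, not the actual governor code: the
sketch below assumes a simplified selection rule (pick the deepest state
whose target residency fits the predicted idle time).  The state tuples,
the latency and residency numbers, and the idle samples are all made up
for illustration.

```python
# Illustrative model only: (name, exit_latency_us, target_residency_us).
# All numbers are hypothetical, chosen to show the trend.
SNOOZE = ("snooze", 0, 0)


def pick_state(states, predicted_idle_us):
    """Pick the deepest state whose target residency fits the predicted idle time."""
    chosen = states[0]
    for state in states[1:]:
        if state[2] <= predicted_idle_us:
            chosen = state
    return chosen


def count_cede(cede, idle_samples_us):
    """Count how many idle periods end up in CEDE rather than snooze."""
    return sum(
        1 for idle in idle_samples_us
        if pick_state([SNOOZE, cede], idle)[0] == "CEDE"
    )


# Hypothetical predicted idle durations, in microseconds.
IDLE_SAMPLES_US = [5, 15, 30, 60, 150, 400]

# A larger target residency keeps short idle periods in the snooze loop.
high = count_cede(("CEDE", 10, 100), IDLE_SAMPLES_US)
# A smaller target residency sends more idle periods into CEDE,
# and each of those CPUs then needs an IPI to be woken up.
low = count_cede(("CEDE", 2, 20), IDLE_SAMPLES_US)
```

With these made-up samples, lowering the target residency moves more of
the short idle periods from snooze into CEDE, which is the pattern
described above.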

This is a performance problem on POWER9 systems, where we actually
expected a good benefit and also demonstrated it with micro benchmarks,
but later it turned out to have an impact on some workloads.  Further,
the challenge is not that the regressions are severe; it is that on the
exact same hardware and firmware, end users expect similar or better
performance for everything, and no regressions, when updating to a
newer kernel.

We have these settings adjusted for POWER10 in F/w, and hence the
behaviour will be similar when moving from an old kernel on P9 to a new
kernel on P10.  We did also test the reverse, expecting that a new
kernel on P9 would show a benefit.  But as explained, the benefit came
at the cost of regressions in a few cases, which were discovered
later.

Hence this fix keeps the exact same behaviour for POWER9 and uses the
F/w-driven heuristics only from POWER10 onwards.
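
As a minimal sketch of that gating, not the patch itself: the names
below (IdleState, DEFAULT_CEDE0, cede0_state) are hypothetical, the
default values are illustrative, and deriving the latency as the minimum
advertised value with a 10x target residency is an assumption made only
for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class IdleState:
    name: str
    exit_latency_us: int
    target_residency_us: int


# Hypothetical kernel-chosen values for CEDE(0); illustrative only.
DEFAULT_CEDE0 = IdleState("CEDE", exit_latency_us=10, target_residency_us=100)


def cede0_state(is_power10_or_later: bool,
                fw_xcede_latencies_us: Sequence[int]) -> IdleState:
    """Return the CEDE(0) parameters to use on this system.

    Before POWER10, or when firmware advertises nothing, keep the
    kernel's own values so behaviour does not change across kernels.
    """
    if not is_power10_or_later or not fw_xcede_latencies_us:
        return DEFAULT_CEDE0
    # Assumption for this sketch: take the smallest advertised latency
    # and scale the target residency from it.
    latency = min(fw_xcede_latencies_us)
    return IdleState("CEDE", latency, 10 * latency)
```

For example, cede0_state(False, [2]) keeps the kernel values even though
firmware advertises 2us, while cede0_state(True, [2]) takes its values
from firmware.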

> It's not like the kernel would crash on the affected firmware.

Correct. We do not have a functional issue, but only a performance
regression observable on certain real workloads.

This is a minor change in cpuidle's H_CEDE usage which will show up
only in certain workload patterns where we need idle CPU threads to
wake up faster to get the job done, as compared to keeping busy CPU
threads in single-thread mode to get more execution slices.

This fix is primarily to ensure that a kernel update does not change
H_CEDE behaviour on the same hardware generation, thereby causing
performance variation and also regressions in some cases.

Thanks for the questions and comments, I hope this gives additional
context for this fix.

--Vaidy