[Skiboot] [PATCH] OCC: Increase max pstate check on P9 to 255
Vaidyanathan Srinivasan
svaidy at linux.vnet.ibm.com
Wed Nov 22 01:00:03 AEDT 2017
* Michael Ellerman <mpe at ellerman.id.au> [2017-11-21 22:40:22]:
> Stewart Smith <stewart at linux.vnet.ibm.com> writes:
>
> > Vaidyanathan Srinivasan <svaidy at linux.vnet.ibm.com> writes:
> >> * Stewart Smith <stewart at linux.vnet.ibm.com> [2017-11-17 09:34:03]:
> >>
> >>> This has changed from P8, we can now have > 127 pstates.
> >>>
> >>> This was observed on Boston during WoF bringup.
> >>>
> >>> Reported-by: Minda Wei <minda.wei at ibm.com>
> >>
> >> Reported-by: Francesco A Campisano <campisan at us.ibm.com>
> >>>
> >>>
> >>> Signed-off-by: Stewart Smith <stewart at linux.vnet.ibm.com>
> >>
> >> Reviewed-by: Vaidyanathan Srinivasan <svaidy at linux.vnet.ibm.com>
> >>
> >>
> >>> ---
> >>> hw/occ.c | 10 ++++++----
> >>> 1 file changed, 6 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/hw/occ.c b/hw/occ.c
> >>> index 78c6a6a356e3..8ad0dfe421d7 100644
> >>> --- a/hw/occ.c
> >>> +++ b/hw/occ.c
> >>> @@ -612,13 +612,15 @@ static bool add_cpu_pstate_properties(int *pstate_nom)
> >>> nr_pstates = labs(pmax - pmin) + 1;
> >>> prlog(PR_DEBUG, "OCC: Version %x Min %d Nom %d Max %d Nr States %d\n",
> >>> occ_data->version, pmin, pnom, pmax, nr_pstates);
> >>> - if (nr_pstates <= 1 || nr_pstates > 128) {
> >>> + if ((occ_data->version == 0x90 && (nr_pstates <= 1)) ||
> >>> + (occ_data->version <= 0x02 &&
> >>> + (nr_pstates <= 1 || nr_pstates > 128))) {
> >>> /**
> >>> * @fwts-label OCCInvalidPStateRange
> >>> * @fwts-advice The number of pstates is outside the valid
> >>> - * range (currently <=1 or > 128), so OPAL has not added
> >>> - * pstates to the device tree. This means that OCC (On Chip
> >>> - * Controller) will be non-functional. This means
> >>> + * range (currently <=1 or > 128 on p8, >255 on P9), so OPAL
> >>> + * has not added pstates to the device tree. This means that
> >>> + * OCC (On Chip Controller) will be non-functional. This means
> >>> * that CPU idle states and CPU frequency scaling
> >>> * will not be functional.
> >>> */
> >>
> >> Thanks for the fix to increase the limit.
> >>
> >> Checking for occ_data->version is the right check to decide permitted
> >> range between P8 vs P9.
> >>
> >> In case occ_data->version is bumped up, then we do not know the range
> >> and we will fail in an earlier check and not export any pstates.
> >>
> >> We need to update the right range for future versions here.
> >>
> >> However Linux could have reporting problems with PState IDs more than
> >> 128 and also with large number of PStates. This change does not as
> >> such crash Linux, but more work is needed in Linux driver to support
> >> large number of PStates.
> >
> > I'll pull the patch in and let's see how it goes for a bit... see if
> > people see this as a real problem or not. We can always pull it out and
> > do a dance with filtering pstates to limit to 128 and put the full list
> > somewhere else that the kernel doesn't poke at by default.
> >
> > mpe, what are your thoughts here?
>
> Hard to say.
>
> The trade off seems to be unspecified "problems" in Linux if the large
> number of states is exposed, vs having no CPU idle or CPU frequency
> scaling.
>
> The latter is clearly very bad, but it would be good to have more detail
> from Vaidy on the issues we might see in Linux.
The issue in Linux is only in reporting current frequency and not
really a major issue with the frequency request or the driver. When
we set the PState request, we pick the PState ID corresponding to the
frequency and write to PMCR SPR register. There is no impact here.
When user reads current frequency from sysfs cpuinfo_cur_freq, we read
PMSR SPR and map the PState ID back to a frequency. At this stage the
PState ID read is treated as signed 8-bit to maintain compatibility
with P8 systems.
So when frequency is very low and the PState IDs go more than 128,
then the PState ID interpreted will become negative and *not* found in
Linux. In this situation we fall back to reporting nominal frequency.
The end user will observe that the reported frequency will suddenly
jump to a higher value of nominal frequency when we go down to the
last few PStates at the lowest frequency.
This problem is limited to few chips in few systems where the
calibrated frequency range is wide and really the number of PStates
exceeds 128. This scenario is architecturally correct and permitted,
but just that I did not anticipate :)
I would suggest we could take this patch in skiboot and the work out
additional device tree nodes and later change Linux to use new format.
My idea would be to use different device tree format and make new
Linux driver use different elements for requests and reporting. So in
future we can expose smaller number of PStates to Linux, but still
Linux can lookup and convert any PState back to a frequency value for
reporting.
--Vaidy
More information about the Skiboot
mailing list