[Skiboot] [PATCH] fast-boot: occ: Re-parse the pstate table during fast-boot

Nicholas Piggin npiggin at gmail.com
Sat Feb 3 18:59:14 AEDT 2018


On Sat, 3 Feb 2018 12:27:02 +0530
Vaidyanathan Srinivasan <svaidy at linux.vnet.ibm.com> wrote:

> * Nicholas Piggin <npiggin at gmail.com> [2018-02-03 16:24:25]:
> 
> > 
> > On Fri,  2 Feb 2018 12:32:32 +0530
> > Shilpasri G Bhat <shilpa.bhat at linux.vnet.ibm.com> wrote:
> >   
> > > OCC shares the frequency list to host by copying the pstate table to
> > > main memory in HOMER. This table is parsed during boot to create
> > > device-tree properties for frequency and pstate IDs. OCC can update
> > > the pstate table to present a new set of frequencies to the host. But
> > > host will remain oblivious to these changes unless it is re-inited
> > > with the updated device-tree CPU frequency properties. So this patch
> > > allows re-parsing the pstate table and updating the device-tree
> > > properties during fast-reboot.
> > > 
> > > OCC updates the pstate table when asked to do so using pstate-table
> > > bias command. And this is mainly used by WOF team for
> > > characterization purposes.  
> > 
> > Would this ever be used in production, I'm guessing not? I don't
> > think that's a bad thing as such -- designing for test is always
> > good. Perhaps a comment though to explain why you're re-parsing
> > it.  
> 
> Never say never :)
> 
> At this time this facility is to enable tooling to set OCC parameters
> at runtime and test the system without encoding all parameters and
> building a PNOR and dependent components.
> 
> This enables a very efficient workflow with just an OCC reboot and
> fast-reboot on OPAL+Linux and we are back in about 2 minutes.  This
> can be leveraged in automation/CI also to test various parameters.
> 
> > Without knowing much about OCC, I'll guess you're doing this so
> > you can update the OCC at runtime without having to update firmware
> > before each IPL?  
> 
> Yep, you got the use case right.  These OCC and PState parameters and
> tunings can be tested and later rolled up into the firmware.
>  
> > I guess we should always keep in mind fast reboot should match IPL
> > as closely as possible and any undetected deviations are a pretty
> > serious flaw. (e.g., you mess up your OCC state and want to return
> > to normal, you would reboot).  
> 
> We expect PState table to change for these tests.  If something goes
> wrong, OCC will crash or pull system to safe mode.

Is there a way to mark this and disable fast reboot if that happens?

I don't mean to single out this one patch. I just think it's a good
time to think about exactly how we're going to deal with fast reboot,
and how it would be used effectively if we enable it for end users.

We need to be very conservative and be sure to mark any possible errors
or inconsistencies in hardware or firmware for full IPL.

Other modes for debugging and testing like you're doing here are fine
too. No problem if we gate them with an nvram parameter.

>  No major change in
> system configuration like cpu, memory, IO.  If something really bad
> happens, we will hang/checkstop and we will have to re-ipl to recover.
> 
> If there were no OCC PState changes, then parsing it again should get
> us the exact same state as at Power-ON and hence no risk.
> 
> > I'm just wondering, should this be under an nvram option?  
> 
> The risk actually depends on what we ask OCC to do and hence not
> a major config change/risk for OPAL.  I would like to leave it as
> default for fast-reboot.  We add slight time factor to rebuild the
> relevant device tree.  Given that fast-reboot itself is experimental,
> this is an acceptable risk and overhead.
> 
> If we hit new error/fail scenarios in future, we can add settings.
> I would like to roll this into a single knob for fast-reboot like
> "safe", "risky" or something that can help us choose what we want to
> do in reinit path.  We need not call out OCC reinit as explicit nvram
> option at this time.

Thanks for the background. The patch is fine in principle from me. I'm
not a fast reboot or OCC expert, so my ack would not mean much :) But we
should be thinking about a coherent way to manage fast reboot behaviour.
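
To make that concrete, here is a rough, standalone sketch of the policy
discussed in this thread: any recorded error forces a full IPL, and
test-only reinit paths run only when an nvram knob opts in. Everything
in it is hypothetical and for illustration only: the nv_entry table,
block_fast_reboot(), choose_reboot_path() and the "fast-reboot-mode" key
are invented names, not existing skiboot interfaces. Only the "risky"
value comes from the suggestion quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical key/value view of nvram; NOT skiboot's real nvram API. */
struct nv_entry {
	const char *key;
	const char *value;
};

enum reboot_path { REBOOT_FULL_IPL, REBOOT_FAST };

/* Sticky marker: once set, it stays set until the next full IPL. */
static const char *fast_reboot_blocked_reason;

/* Record the first reason fast reboot became unsafe (hypothetical helper). */
static void block_fast_reboot(const char *reason)
{
	assert(reason != NULL);
	if (!fast_reboot_blocked_reason)
		fast_reboot_blocked_reason = reason;
}

/* Linear lookup; returns NULL when the key is absent. */
static const char *nv_lookup(const struct nv_entry *tbl, size_t n,
			     const char *key)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(tbl[i].key, key) == 0)
			return tbl[i].value;
	return NULL;
}

/*
 * test_reinit is true when the reboot would take a debug/test-only
 * reinit path. Such paths are allowed only if the (assumed) nvram key
 * "fast-reboot-mode" is set to "risky"; a recorded error always wins.
 */
static enum reboot_path choose_reboot_path(const struct nv_entry *tbl,
					   size_t n, bool test_reinit)
{
	const char *mode = nv_lookup(tbl, n, "fast-reboot-mode");

	if (fast_reboot_blocked_reason)
		return REBOOT_FULL_IPL;
	if (test_reinit && !(mode && strcmp(mode, "risky") == 0))
		return REBOOT_FULL_IPL;
	return REBOOT_FAST;
}
```

In this model, an error handler would call something like
block_fast_reboot("OCC entered safe mode"), and the reboot path would
then fall back to a full IPL regardless of the nvram setting.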

Thanks,
Nick

