[Skiboot] [PATCH V4 2/3] occ: Add support for OPAL-OCC command/response interface

Mon Jul 17 14:46:37 AEST 2017

Hi Stewart,

On 07/17/2017 04:56 AM, Stewart Smith wrote:
> Shilpasri G Bhat <shilpa.bhat at linux.vnet.ibm.com> writes:
>> On 07/07/2017 02:23 PM, Stewart Smith wrote:
>>> Shilpasri G Bhat <shilpa.bhat at linux.vnet.ibm.com> writes:
>>
>>
>>>> +static inline bool occ_in_progress(struct cmd_interface *chip)
>>>> +{
>>>> +	return (chip->rsp->flag == OCC_IN_PROGRESS);
>>>> +}
>>>> +
>>>> +static int write_occ_cmd(struct cmd_interface *chip, bool retry)
>>>> +{
>>>> +	struct opal_command_buffer *cmd = chip->cmd;
>>>> +	struct opal_occ_cmd_rsp_msg *msg = chip->msg;
>>>> +
>>>> +	if (!retry && occ_in_progress(chip)) {
>>>> +		chip->cmd_in_progress = false;
>>>> +		return OPAL_OCC_BUSY;
>>>> +	}
>>>> +
>>>> +	cmd->flag = chip->rsp->flag = 0;
>>>> +	cmd->cmd = occ_cmds[msg->cmd].value;
>>>> +	cmd->request_id = msg->request_id;
>>>> +	cmd->data_size = msg->cdata_size;
>>>> +	memcpy(&cmd->data, (u8 *)msg->cdata, msg->cdata_size);
>>>> +	cmd->flag = OPAL_CMD_READY;
>>>> +
>>>> +	schedule_timer(&chip->timeout,
>>>> +		       msecs_to_tb(occ_cmds[msg->cmd].timeout_ms));
>>>> +	chip->prev_cmd = msg->cmd;
>>>> +
>>>> +	return OPAL_ASYNC_COMPLETION;
>>>> +}
>>>> +
>>>> +static int64_t opal_occ_command(int chip_id, struct opal_occ_cmd_rsp_msg *msg,
>>>> +				int token, bool retry)
>>>
>>> It looks like retry is not actually retry, but rather 'return if busy,
>>> don't queue' as in write_occ_cmd that's the logic.
>>
>> If kernel is retrying the same command due to a command-timeout in the previous
>> try, then don't check the OCC status flag as OCC might not have been able to
>> update the status flag. 'retry' flag is used to indicate that kernel is retrying
>> the command.
>>
>> Do you want 'retry' state to be stored in OPAL ?
> 
> I think that would be better. If part of the protocol in talking to the
> OCC is that if at first you don't succeed, retry without looking at
> status, then we should do the retry within OPAL transparently to Linux,
> so that the kernel just sees success/fail.

Sure will do it this way.

> 
>>>> @@ -910,6 +1314,9 @@ void occ_pstates_init(void)
>>>>  		chip->throttle = 0;
>>>>  	opal_add_poller(occ_throttle_poll, NULL);
>>>>  	occ_pstates_initialized = true;
>>>> +
>>>> +	/* Init OPAL-OCC command-response interface */
>>>> +	occ_cmd_interface_init();
>>>>  }
>>>>  
>>>>  struct occ_load_req {
>>>> @@ -1409,8 +1816,10 @@ void occ_p9_interrupt(uint32_t chip_id)
>>>>  	if (ireg & OCB_OCI_OCIMISC_IRQ_TMGT)
>>>>  		prd_tmgt_interrupt(chip_id);
>>>>  
>>>> -	if (ireg & OCB_OCI_OCIMISC_IRQ_SHMEM)
>>>> +	if (ireg & OCB_OCI_OCIMISC_IRQ_SHMEM) {
>>>>  		occ_throttle_poll(NULL);
>>>> +		handle_occ_rsp(chip_id);
>>>> +	}
>>>>  
>>>>  	if (ireg & OCB_OCI_OCIMISC_IRQ_I2C)
>>>>  		p9_i2c_bus_owner_change(chip_id);
>>>> @@ -1435,5 +1844,3 @@ void occ_fsp_init(void)
>>>>  	if (fsp_present())
>>>>  		fsp_register_client(&fsp_occ_client, FSP_MCLASS_OCC);
>>>>  }
>>>> -
>>>> -
>>>> diff --git a/include/opal-api.h b/include/opal-api.h
>>>> index edae069..dd92cf9 100644
>>>> --- a/include/opal-api.h
>>>> +++ b/include/opal-api.h
>>>> @@ -55,6 +55,10 @@
>>>>  #define OPAL_XSCOM_CTR_OFFLINED	-30
>>>>  #define OPAL_XIVE_PROVISIONING	-31
>>>>  #define OPAL_XIVE_FREE_ACTIVE	-32
>>>> +#define OPAL_OCC_INVALID_STATE	-33
>>>
>>> Instead of adding new return codes, I think existing ones suffice, or is
>>> there a really big need to have a way to distinguish these errors from
>>> other ones?  e.g. insetad of OPAL_OCC_INVALID_STATE, it could be
>>> OPAL_WRONG_STATE - and this is consistent with other OPAL calls.
>>>
>>>> +#define OPAL_OCC_BUSY		-34
>>>
>>> OPAL_BUSY
>>
>> Yeah OPAL_BUSY works.
>>
>>>
>>>> +#define OPAL_OCC_CMD_TIMEOUT	-35
>>>
>>> We probably want an OPAL_TIMEOUT actually... we have OPAL_I2C_TIMEOUT
>>> and OPAL_XSCOM_TIMEOUT
>>>
>>>> +#define OPAL_OCC_CMD_MISMATCH	-36
>>>
>>> OPAL_PARAMETER
>>>
>>> Does that make sense? Any reason not to do that?
>>>
>>
>> OPAL_OCC_CMD_TIMEOUT and OPAL_OCC_CMD_MISMATCH is used by kerenl to retry the
>> OCC command. As OPAL_PARAMETER is also used for the sanity checks of the
>> parameters passed to opal_call, kernel wont be able to distinguish
>> when to retry.
> 
> If the retry logic is in OPAL though, that problem should go away.

Yup agree...

> 
>>>> @@ -1033,6 +1038,81 @@ typedef struct oppanel_line {
>>>>  	__be64 line_len;
>>>>  } oppanel_line_t;
>>>>  
>>>> +enum occ_cmd {
>>>> +	OCC_CMD_AMESTER_PASS_THRU = 0,
>>>
>>> What is amester passthru for? Do we need this as part of our ABI?
>>
>> This command will be used by Amester for tracing the sensor values at 250us/2ms
>> sampling rate. In P8, Amester used IPMI for tracing. In P9, Amester will be able
>> to communicate to OCC inband using this cmd/rsp interface.
>>
>> Yeah we need this command to support inband Amester tracing.
> 
> So, the P8 design is pretty awful, random undocumented out of band IPMI
> commands aren't great.
> 
> So, what exactly is this call accomplishing though? Is it setting a
> sampling frequency? I really dislike these kind of passthrough APIs, as
> the care to maintain backwards compatibility expands to many places when
> we could have just kept it between OPAL and Linux.

Amester tracing is one of the passthrough commands that I am aware of.
Yes it will select one of the sampling rates and collect all the samples at
oneshot till the trace buffer is full or till the user specified time. This is
used for sub-millisecond profiling.

> 
>>>> +	OCC_CMD_CLEAR_SENSOR_DATA,
>>>
>>> Clear what sensors?
>>
>> Min-max values of sensors stored for different users like CSM, Profiler and
>> Job-scheduler.
>>
>> How about OCC_CMD_CLEAR_SENSOR_MIN_MAX instead?
> 
> I don't think it needs to be OCC specific, why not add an API to clear
> any OPAL_SENSOR? For any sensor that isn't min/max and clearable (could
> be property in device tree, it'd return OPAL_UNSUPPORTED. That way,
> these sensors could be read by existing kernels, although a small patch
> would be required to clear them.

The current implementation does not allow clearing of individual sensors. OCC
maintains min/max for each sensor for three different owners: CSM agent,
Profiler and Job-Scheduler (all three are CORAL subsystems)

This command will clear the min/max of all the sensors for the given owner.
So you are suggesting we add an API that clears individual sensors, but in OCC
implementation  clearing one sensor will clear all the min/max history of all
sensors belonging to the sensor ownwer.

> 
>>>> +	OCC_CMD_SET_POWER_CAP,
>>>
>>> Power cap for what? System? Socket?
>>
>> System.
>>
>>>> +	OCC_CMD_SET_POWER_SHIFTING_RATIO,
>>>> +	OCC_CMD_SELECT_SENSOR_GROUPS,
>>>
>>> Basically, this all needs to be documented somewhere, notably in
>>> doc/opal-api/
>>
>> Okay.
>>
>>>
>>> I also wonder if this approach is the best for the OPAL API. While we do
>>> *currently* have a thing called an OCC that does these functions, is
>>> that *always* going to be the case? Elsewhere in firmware we've been
>>> going down the route of calling things "pm_complex" and sending commands
>>> to/from the pm_complex to do tasks for us.
>>>
>>> would it be better if our OPAL API instead described the intended effect
>>> rather than how to achieve said effect?
>>>
>>> I wouldn't be opposed to an OPAL_SET_POWER_CAP call, that's an async
>>> call, and accepts parameters in a range that's described in the device
>>> tree. That call could be implemented on P8 too (where you currently set
>>> power caps by doing crazy IPMI incantations to the BMC), and that would
>>> prove that the OPAL API is implementation independent, and we should be
>>> future proof with however we do the intent of capping
>>> system/socket/other-thing power in the future.
>>
>> I agree with defining an abstract API for setting powercap as a better
>> approach.
> 
> Great! I think it'd end up being a much better approach.
> 
>>> How do the sensors work? Do they fit into the existing OPAL Sensors
>>> infrastructure? are they more in-memory sensors? I ask as perhaps the
>>> select/clear calls should be part of either the regular opal sensors API
>>> or the IMC api rather than their own API.
>>
>> Currently, only in-memory environment sensors are exported in the OPAL sensor
>> infrastructure. Yes there are more perf/util/freq in-memory sensors that are not
>> exported.
>>
>> The current sensor framework provides API to only read the sensor values. We
>> will need new API additions to the sensor framework then to clear min/max of
>> in-memory sensors and select the sensor groups to be copied to main memory.
>> So should we have new opal-calls defined for these sensor operations instead of
>> the OCC cmd/rsp interface?
> 
> Okay, so there's groups of sensors that need enabling? What is the
> impact of always enabling them? What specific sensors are they?
> 

Sensors are grouped based on their type, like temperature sensors, power
sensors, utilization sensors and so on. By default all the sensor groups are
enabled and thus all of them will be copied to main memory (at the rate of
100ms). OCC allows to select sensor groups to copy which will allow OCC to
increase the copying frequency with fewer selected sensor groups.

There are totally 5 commands:

1) Amester passthrough
2) Sensor clear min/max
3) Set system powercap
4) Set power-shifting ratio
5) Select sensor groups to copy

2) and 3) will have separate APIs and the remaining will be part of a single
cmd/rsp API. (?)

Thanks and Regards,
Shilpa