[PATCH] cxl: Add a kernel thread to check the coherent platform function's state
Andrew Donnellan
andrew.donnellan at au1.ibm.com
Tue Apr 19 12:40:07 AEST 2016
On 18/04/16 23:05, Christophe Lombard wrote:
> In the POWERVM environement, the PHYP CoherentAccel component manages
environment
> the state of the Coherant Accelerator Processor Interface adapter and
Coherent
> virtualizes CAPI resources, handles CAPP, PSL, PSL Slice errors - and
> interrupts - and provides a new set of HCALLs for the OS APIs to utilize
> AFUs.
>
> During the course of operation, a coherent platform function can
> encounter errors. Some possible reason for errors are:
> • Hardware recoverable and unrecoverable errors
> • Transient and over-threshold correctable errors
>
> PHYP implements its own state model for the coherent platform function.
> The current state of this Acclerator Fonction Unit (AFU) is available
Accelerator Function Unit
> through a hcall.
>
> In case of low-level troubles (or error injection), The PHYP component
the
> may reset the card and change the AFU state. The PHYP interface doesn't
> provide any way to be notified when that happens.
>
> The current implementation of the cxl driver, for the POWERVM
> environment, follows the general error recovery procedures required to
> reset operation of the coherent platform function. The platform firmware
> resets and reconfigures hardware when an external action is required -
> attach/detach a process, link ok, ....
>
> The purpose of this patch is to interact with the external driver
> (where the AFU is shown) even if no action is required. A kernel thread
> is needed to check every x seconds the current state of the AFU to see
> if we need to enter an error recovery path.
>
> Signed-off-by: Christophe Lombard <clombard at linux.vnet.ibm.com>
A few minor issues below.
> diff --git a/drivers/misc/cxl/guest.c b/drivers/misc/cxl/guest.c
> index 8213372..06dfe7f 100644
> --- a/drivers/misc/cxl/guest.c
> +++ b/drivers/misc/cxl/guest.c
> @@ -19,6 +19,10 @@
> #define CXL_SLOT_RESET_EVENT 2
> #define CXL_RESUME_EVENT 3
>
> +#define CXL_KTHREAD "cxl_kthread"
> +
> +void stop_state_thread(struct cxl_afu *afu);
static?
[...]
> -static int afu_do_recovery(struct cxl_afu *afu)
> +static int handle_state_thread(void *data)
> {
> - int rc;
> + struct cxl_afu *afu;
> + int rc = 0;
It looks like we don't use rc (see also comment below).
>
> - /* many threads can arrive here, in case of detach_all for example.
> - * Only one needs to drive the recovery
> - */
> - if (mutex_trylock(&afu->guest->recovery_lock)) {
> - rc = afu_update_state(afu);
> - mutex_unlock(&afu->guest->recovery_lock);
> - return rc;
> + pr_devel("in %s\n", __func__);
> +
> + afu = (struct cxl_afu*)data;
CodingStyle: space between cxl_afu and *
> + do {
> + set_current_state(TASK_INTERRUPTIBLE);
> +
> + if (afu) {
> + afu_update_state(afu);
Should we be checking the retval here?
> + if (afu->guest->previous_state == H_STATE_PERM_UNAVAILABLE)
> + goto out;
> + } else
> + return -ENODEV;
> + schedule_timeout(msecs_to_jiffies(3000));
> + } while(!kthread_should_stop());
CodingStyle: space between while and (
> +
> +out:
> + afu->guest->kthread_tsk = NULL;
> + return rc;
> +}
> +
> +void start_state_thread(struct cxl_afu *afu)
static?
> +{
> + if (afu->guest->kthread_tsk)
> + return;
> +
> + /* start kernel thread to handle the state of the afu */
> + afu->guest->kthread_tsk = kthread_run(&handle_state_thread,
> + (void *)afu, CXL_KTHREAD);
> + if (IS_ERR(afu->guest->kthread_tsk)) {
> + pr_devel("cannot start state kthread\n");
> + afu->guest->kthread_tsk = NULL;
> }
> - return 0;
> +}
> +
> +void stop_state_thread(struct cxl_afu *afu)
static?
--
Andrew Donnellan OzLabs, ADL Canberra
andrew.donnellan at au1.ibm.com IBM Australia Limited
More information about the Linuxppc-dev
mailing list