[PATCH] cxl: Add a kernel thread to check the coherent platform function's state

christophe lombard clombard at linux.vnet.ibm.com
Tue Apr 19 19:15:55 AEST 2016


On 19/04/2016 04:40, Andrew Donnellan wrote:
> On 18/04/16 23:05, Christophe Lombard wrote:
>> In the POWERVM environement, the PHYP CoherentAccel component manages
>
> environment
>
>> the state of the Coherant Accelerator Processor Interface adapter and
>
> Coherent
>
>> virtualizes CAPI resources, handles CAPP, PSL, PSL Slice errors - and
>> interrupts - and provides a new set of HCALLs for the OS APIs to utilize
>> AFUs.
>>
>> During the course of operation, a coherent platform function can
>> encounter errors. Some possible reason for errors are:
>> • Hardware recoverable and unrecoverable errors
>> • Transient and over-threshold correctable errors
>>
>> PHYP implements its own state model for the coherent platform function.
>> The current state of this Acclerator Fonction Unit (AFU) is available
>
> Accelerator Function Unit
>
>> through a hcall.
>>
>> In case of low-level troubles (or error injection), The PHYP component
>
> the
>
>> may reset the card and change the AFU state. The PHYP interface doesn't
>> provide any way to be notified when that happens.
>>
>> The current implementation of the cxl driver, for the POWERVM
>> environment, follows the general error recovery procedures required to
>> reset operation of the coherent platform function. The platform firmware
>> resets and reconfigures hardware when an external action is required -
>> attach/detach a process, link ok, ....
>>
>> The purpose of this patch is to interact with the external driver
>> (where the AFU is shown) even if no action is required. A kernel thread
>> is needed to check every x seconds the current state of the AFU to see
>> if we need to enter an error recovery path.
>>
>> Signed-off-by: Christophe Lombard <clombard at linux.vnet.ibm.com>
>
> A few minor issues below.
>
>> diff --git a/drivers/misc/cxl/guest.c b/drivers/misc/cxl/guest.c
>> index 8213372..06dfe7f 100644
>> --- a/drivers/misc/cxl/guest.c
>> +++ b/drivers/misc/cxl/guest.c
>> @@ -19,6 +19,10 @@
>>   #define CXL_SLOT_RESET_EVENT        2
>>   #define CXL_RESUME_EVENT        3
>>
>> +#define CXL_KTHREAD             "cxl_kthread"
>> +
>> +void stop_state_thread(struct cxl_afu *afu);
>
> static?
>
> [...]
>
>> -static int afu_do_recovery(struct cxl_afu *afu)
>> +static int handle_state_thread(void *data)
>>   {
>> -    int rc;
>> +    struct cxl_afu *afu;
>> +    int rc = 0;
>
> It looks like we don't use rc (see also comment below).
>
>>
>> -    /* many threads can arrive here, in case of detach_all for example.
>> -     * Only one needs to drive the recovery
>> -     */
>> -    if (mutex_trylock(&afu->guest->recovery_lock)) {
>> -        rc = afu_update_state(afu);
>> -        mutex_unlock(&afu->guest->recovery_lock);
>> -        return rc;
>> +    pr_devel("in %s\n", __func__);
>> +
>> +    afu = (struct cxl_afu*)data;
>
> CodingStyle: space between cxl_afu and *
>
>> +    do {
>> +        set_current_state(TASK_INTERRUPTIBLE);
>> +
>> +        if (afu) {
>> +            afu_update_state(afu);
>
> Should we be checking the retval here?

Right, We have to check the retval here. Thanks


>
>> +            if (afu->guest->previous_state == H_STATE_PERM_UNAVAILABLE)
>> +                goto out;
>> +        } else
>> +            return -ENODEV;
>> +        schedule_timeout(msecs_to_jiffies(3000));
>> +    } while(!kthread_should_stop());
>
> CodingStyle: space between while and (
>
>> +
>> +out:
>> +    afu->guest->kthread_tsk = NULL;
>> +    return rc;
>> +}
>> +
>> +void start_state_thread(struct cxl_afu *afu)
>
> static?
>
>> +{
>> +    if (afu->guest->kthread_tsk)
>> +        return;
>> +
>> +    /* start kernel thread to handle the state of the afu */
>> +    afu->guest->kthread_tsk = kthread_run(&handle_state_thread,
>> +                  (void *)afu, CXL_KTHREAD);
>> +    if (IS_ERR(afu->guest->kthread_tsk)) {
>> +        pr_devel("cannot start state kthread\n");
>> +        afu->guest->kthread_tsk = NULL;
>>       }
>> -    return 0;
>> +}
>> +
>> +void stop_state_thread(struct cxl_afu *afu)
>
> static?
>

Thanks for the review. I will send a patch update.



More information about the Linuxppc-dev mailing list