[PATCH V2] cxl: Export library to support IBM XSL

christophe lombard clombard at linux.vnet.ibm.com
Tue Jun 20 00:10:51 AEST 2017


On 15/06/2017 at 14:36, Frederic Barrat wrote:
> Hi Christophe,
>
> A few comments below, nothing major...
>
> On 14/06/2017 at 15:29, Christophe Lombard wrote:
>> This patch exports an in-kernel 'library' API which can be called by
>> other drivers to help interact with an IBM XSL on a POWER9 system.
>>
>> The XSL (Translation Service Layer) is a stripped down version of the
>> PSL (Power Service Layer) used in some cards such as the Mellanox CX5.
>> Like the PSL, it implements the CAIA architecture, but has a number
>> of differences, mostly in its implementation-dependent registers.
>>
>> The XSL also uses a special DMA cxl mode, which uses a slightly
>> different init sequence for the CAPP and PHB.
>>
>> Changelog[v2]
>>   - Rebase to latest upstream.
>>   - Return -EFAULT in case of NULL pointer in cxllib_handle_fault().
>>   - Reverse parameters when copro_handle_mm_fault() is called.
>>
>
>
> The change log shouldn't be part of the commit message, but below the 
> next "---"
>
>

sure.

>> +++ b/drivers/misc/cxl/cxllib.c
>> @@ -0,0 +1,241 @@
>> +/*
>> + * Copyright 2016 IBM Corp.
>
>
> Year need to be updated
>
>
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU General Public License
>> + * as published by the Free Software Foundation; either version
>> + * 2 of the License, or (at your option) any later version.
>> + */
>> +
>> +#include <linux/hugetlb.h>
>> +#include <linux/sched/mm.h>
>> +#include "cxl.h"
>> +#include <misc/cxllib.h>
>> +#include <asm/pnv-pci.h>
>
> Ordering of the #include is messy:
> #include <linux/hugetlb.h>
> #include <linux/sched/mm.h>
> #include <asm/pnv-pci.h>
> #include <misc/cxllib.h>
> #include "cxl.h"
>
>
>
>> +int cxllib_set_device_dma(struct pci_dev *dev, unsigned long flags)
>> +{
>> +    int rc;
>> +
>> +    if (flags)
>> +        return -EINVAL;
>> +
>> +    rc = dma_set_mask(&dev->dev, DMA_BIT_MASK(64));
>> +    return rc;
>> +}
>> +EXPORT_SYMBOL_GPL(cxllib_set_device_dma);
>
>
> A comment in cxllib_set_device_dma() would help:
>     /*
>      * When switching the PHB to capi mode, the TVT#1 entry for
>      * the Partitionable Endpoint is set in bypass mode, like
>      * in PCI mode.
>      * Configure the device dma to use TVT#1, which is done
>      * by calling dma_set_mask() with a mask large enough.
>      */
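Folding the suggested comment into the function would give something like the sketch below. This is purely illustrative: `struct pci_dev`, `struct device` and `dma_set_mask()` are user-space stand-ins for the kernel APIs, just to show the control flow.

```c
#include <stdint.h>

#define EINVAL 22
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* user-space stand-ins for the kernel structures and dma_set_mask() */
struct device { uint64_t dma_mask; };
struct pci_dev { struct device dev; };

static int dma_set_mask(struct device *dev, uint64_t mask)
{
	dev->dma_mask = mask;
	return 0;
}

int cxllib_set_device_dma(struct pci_dev *dev, unsigned long flags)
{
	/*
	 * When switching the PHB to capi mode, the TVT#1 entry for
	 * the Partitionable Endpoint is set in bypass mode, like
	 * in PCI mode.
	 * Configure the device dma to use TVT#1, which is done
	 * by calling dma_set_mask() with a mask large enough.
	 */
	if (flags)
		return -EINVAL;

	return dma_set_mask(&dev->dev, DMA_BIT_MASK(64));
}
```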
>
>
>
>> +
>> +int cxllib_get_PE_attributes(struct task_struct *task,
>> +         unsigned long translation_mode, struct cxllib_pe_attributes 
>> *attr)
>> +{
>> +    struct mm_struct *mm = NULL;
>> +
>> +    if (translation_mode != CXL_TRANSLATED_MODE &&
>> +        translation_mode != CXL_REAL_MODE)
>> +        return -EINVAL;
>> +
>> +    attr->sr = cxl_calculate_sr(false /* master */,
>> +                task == NULL /* kernel ctx */,
>> +                translation_mode == CXL_REAL_MODE,
>> +                true /* p9 */);
>> +    attr->lpid = mfspr(SPRN_LPID);
>> +    if (task) {
>> +        mm = get_task_mm(task);
>> +        if (mm == NULL)
>> +            return -EINVAL;
>> +        /*
>> +         * Caller is keeping a reference on mm_users for as long
>> +         * as XSL uses the memory context
>> +         */
>> +        attr->pid = mm->context.id;
>> +        mmput(mm);
>> +    } else {
>> +        attr->pid = 0;
>> +    }
>> +    attr->tid = 0;
>
>
> We'll need to remember to patch that function as well (attr->tid) 
> when we add support for as_notify (even though Mellanox is not 
> expected to use as_notify in capi mode, just pci I believe)
>
>

yep.


>
>> +int cxllib_handle_fault(struct mm_struct *mm, u64 addr, u64 size, u64 flags)
>> +{
>> +    int rc;
>> +    u64 dar;
>> +    struct vm_area_struct *vma = NULL;
>> +    unsigned long page_size;
>> +
>> +    if (mm == NULL)
>> +        return -EFAULT;
>> +
>> +    down_read(&mm->mmap_sem);
>> +
>> +    for (dar = addr; dar < addr + size; dar += page_size) {
>> +        if (!vma || dar < vma->vm_start || dar >= vma->vm_end) {
>> +            vma = find_vma(mm, dar);
>> +            if (!vma) {
>> +                pr_err("Can't find vma for addr %016llx\n", addr);
>> +                rc = -EFAULT;
>> +                goto out;
>> +            }
>> +            /* get the size of the pages allocated */
>> +            page_size = vma_kernel_pagesize(vma);
>> +        }
>> +
>> +        rc = cxl_handle_page_fault(true, mm, flags, dar);
>
>
> Why do we pass "true" for kernel parameter?
> Actually do we even need a kernel input parameter for 
> cxl_handle_page_fault() ? It seems that we can infer it based on mm. 
> If NULL, then we are in kernel space.
>
>

You are right. Previously, the test was based on the field ctx->kernel, 
which explains the kernel parameter.
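The per-page loop in cxllib_handle_fault() can be sketched in user space with toy types, to show the intent: walk the range in page-size steps and re-look-up the region whenever the current address leaves the cached one. All names here (`toy_vma`, `toy_find_vma`, `count_faults`) are stand-ins, not the kernel API; note the lookup uses the current address `dar` and treats the region end as exclusive.

```c
#include <stdint.h>
#include <stddef.h>

/* toy "vma": a half-open range [start, end) with its page size */
struct toy_vma { uint64_t start, end, page_size; };

/* stand-in for find_vma(): return the region containing addr, or NULL */
static const struct toy_vma *toy_find_vma(const struct toy_vma *vmas,
					  size_t n, uint64_t addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr >= vmas[i].start && addr < vmas[i].end)
			return &vmas[i];
	return NULL;
}

/* count how many per-page fault operations the loop would issue */
static int count_faults(const struct toy_vma *vmas, size_t n,
			uint64_t addr, uint64_t size)
{
	const struct toy_vma *vma = NULL;
	uint64_t page_size = 0;
	int faults = 0;

	for (uint64_t dar = addr; dar < addr + size; dar += page_size) {
		/* re-look-up the region when dar leaves the cached one */
		if (!vma || dar < vma->start || dar >= vma->end) {
			vma = toy_find_vma(vmas, n, dar);
			if (!vma)
				return -1; /* no mapping: kernel path returns -EFAULT */
			page_size = vma->page_size;
		}
		faults++; /* the kernel code handles one page fault here */
	}
	return faults;
}
```

With two adjacent regions of different page sizes, the step width changes mid-walk, which is why the page size must be refreshed on each region change.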

>> diff --git a/drivers/misc/cxl/fault.c b/drivers/misc/cxl/fault.c
>> index c79e39b..9db63f3 100644
>> --- a/drivers/misc/cxl/fault.c
>> +++ b/drivers/misc/cxl/fault.c
>> @@ -132,18 +132,16 @@ static int cxl_handle_segment_miss(struct cxl_context *ctx,
>>       return IRQ_HANDLED;
>>   }
>>
>> -static void cxl_handle_page_fault(struct cxl_context *ctx,
>> -                  struct mm_struct *mm, u64 dsisr, u64 dar)
>> +int cxl_handle_page_fault(bool kernel_context,
>> +              struct mm_struct *mm, u64 dsisr, u64 dar)
>>   {
>>       unsigned flt = 0;
>>       int result;
>>       unsigned long access, flags, inv_flags = 0;
>>
>> -    trace_cxl_pte_miss(ctx, dsisr, dar);
>> -
>>       if ((result = copro_handle_mm_fault(mm, dar, dsisr, &flt))) {
>>           pr_devel("copro_handle_mm_fault failed: %#x\n", result);
>> -        return cxl_ack_ae(ctx);
>> +        return result;
>>       }
>>
>>       if (!radix_enabled()) {
>> @@ -156,7 +154,7 @@ static void cxl_handle_page_fault(struct cxl_context *ctx,
>>               access |= _PAGE_WRITE;
>>
>>           access |= _PAGE_PRIVILEGED;
>> -        if ((!ctx->kernel) || (REGION_ID(dar) == USER_REGION_ID))
>> +        if (!kernel_context || (REGION_ID(dar) == USER_REGION_ID))
>>               access &= ~_PAGE_PRIVILEGED;
>>
>>           if (dsisr & DSISR_NOHPTE)
>> @@ -166,8 +164,7 @@ static void cxl_handle_page_fault(struct cxl_context *ctx,
>>           hash_page_mm(mm, dar, access, 0x300, inv_flags);
>>           local_irq_restore(flags);
>>       }
>> -    pr_devel("Page fault successfully handled for pe: %i!\n", ctx->pe);
>> -    cxl_ops->ack_irq(ctx, CXL_PSL_TFC_An_R, 0);
>> +    return 0;
>>   }
>>
>>   /*
>> @@ -261,9 +258,15 @@ void cxl_handle_fault(struct work_struct *fault_work)
>>
>>       if (cxl_is_segment_miss(ctx, dsisr))
>>           cxl_handle_segment_miss(ctx, mm, dar);
>> -    else if (cxl_is_page_fault(ctx, dsisr))
>> -        cxl_handle_page_fault(ctx, mm, dsisr, dar);
>> -    else
>> +    else if (cxl_is_page_fault(ctx, dsisr)) {
>> +        trace_cxl_pte_miss(ctx, dsisr, dar);
>> +        if (cxl_handle_page_fault(ctx->kernel, mm, dsisr, dar)) {
>> +            cxl_ack_ae(ctx);
>> +        } else {
>> +            pr_devel("Page fault successfully handled for pe: %i!\n", ctx->pe);
>> +            cxl_ops->ack_irq(ctx, CXL_PSL_TFC_An_R, 0);
>> +        }
>
>
> Could we have that code in a wrapper before calling 
> cxl_handle_page_fault()? It would keep the code cleaner and in line 
> with what we do for cxl_handle_segment_miss().
>

We can try to do that.
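One possible shape for such a wrapper, sketched in user space with stubbed types and hypothetical `toy_` names (the real code would keep the trace call and the cxl_ops callbacks):

```c
#include <stdint.h>
#include <stddef.h>

/* user-space stand-ins for the kernel context and callbacks */
struct toy_ctx { int pe; int kernel; };

static int toy_handle_page_fault(int kernel_context, void *mm,
				 uint64_t dsisr, uint64_t dar)
{
	(void)dsisr; (void)dar;
	/* pretend a user-space fault without an mm cannot be resolved */
	return (!kernel_context && mm == NULL) ? -14 /* -EFAULT */ : 0;
}

static int acked_ae, acked_ok;
static void toy_ack_ae(struct toy_ctx *ctx)  { (void)ctx; acked_ae++; }
static void toy_ack_irq(struct toy_ctx *ctx) { (void)ctx; acked_ok++; }

/*
 * Wrapper keeping the trace/ack logic out of the caller, in line
 * with what cxl_handle_segment_miss() does.
 */
static void toy_handle_page_fault_wrapped(struct toy_ctx *ctx, void *mm,
					  uint64_t dsisr, uint64_t dar)
{
	/* trace_cxl_pte_miss(ctx, dsisr, dar) would go here */
	if (toy_handle_page_fault(ctx->kernel, mm, dsisr, dar))
		toy_ack_ae(ctx);	/* error: acknowledge with AE */
	else
		toy_ack_irq(ctx);	/* success: restart translation */
}
```

The caller in cxl_handle_fault() would then just pick between the segment-miss and page-fault wrappers, each handling its own acknowledgement.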


>
>
>> +++ b/include/misc/cxllib.h
>> @@ -0,0 +1,132 @@
>> +/*
>> + * Copyright 2016 IBM Corp.
>
>
> Year update.
>
>
>> +/*
>> + * Get the Process Element structure for the given thread
>> + *
>> + * Input:
>> + *    pid: points the struct pid for the given thread (i.e. linux pid)
>> + *    translation_mode: whether addresses should be translated
>> + */
>> +struct cxllib_pe_attributes {
>> +    u64 sr;
>> +    u32 lpid;
>> +    u32 tid;
>> +    u32 pid;
>> +};
>> +#define CXL_TRANSLATED_MODE 0
>> +#define CXL_REAL_MODE 1
>> +
>> +int cxllib_get_PE_attributes(struct task_struct *task,
>> +         unsigned long translation_mode, struct cxllib_pe_attributes 
>> *attr);
>
>
> Description in comment no longer matches reality.
> /*
>  * Get the Process Element structure for the given thread
>  *
>  * Input:
>  *    task: task_struct for the context of the translation
>  *    translation_mode: whether addresses should be translated
>  * Output:
>  *    attr: attributes to fill up the Process Element structure from CAIA
>  */
>
>
> Thanks,
>
>   Fred



