[PATCH 04/14] cxlflash: Avoid command room violation

Fri Nov 18 09:30:55 AEDT 2016

Thanks for catching this Matt. Looking into this. Will send out a V2.

On 11/17/2016 1:36 PM, Matthew R. Ochs wrote:
> Hi Uma,
>
> I do see a potential hang issue with this patch. See my comments below.
>
>
> -matt
>
>> On Nov 15, 2016, at 5:14 PM, Uma Krishnan <ukrishn at linux.vnet.ibm.com> wrote:
>>
>> During test, a command room violation interrupt is occasionally seen
>> for the master context when the CXL flash devices are stressed.
>>
>> After studying the code, there could be gaps in the way command room
>> value is being cached in cxlflash. When the cached command room is zero
>> the thread attempting to send becomes burdened with updating the cached
>> value with the actual value from the AFU. Today, this is handled with
>> an atomic set operation of the raw value read. Following the atomic
>> update, the thread proceeds to send.
>>
>> This behavior is incorrect on two counts:
>>
>>   - The update fails to take into account the current thread and its
>>     consumption of one of the hardware commands.
>>
>>   - The update does not take into account other threads also atomically
>>     updating. Per design, a worker thread updates the cached value when
>>     a send thread times out. By not performing an atomic compare/exchange,
>>     the cached value can be incorrectly clobbered.
>>
>> To correct these issues, the runtime updates of the cached command room
>> are updated to use atomic64_cmpxchg() and the send routine is updated to
>> take into account the current thread consuming a hardware command.
>>
>> Signed-off-by: Uma Krishnan <ukrishn at linux.vnet.ibm.com>
>> ---
>> drivers/scsi/cxlflash/main.c | 16 ++++++++++------
>> 1 file changed, 10 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
>> index 6d33d8c..1a32e8b 100644
>> --- a/drivers/scsi/cxlflash/main.c
>> +++ b/drivers/scsi/cxlflash/main.c
>> @@ -322,9 +322,10 @@ static int send_cmd(struct afu *afu, struct afu_cmd *cmd)
>> 	if (!newval) {
>
> When this path is invoked, the current thread is consuming the last entry
> available entry before the room must be read again. While the change
> below is fine for circumstances where the hardware queue has room for
> more than one command, consider a scenario where the queue has room
> for only 1 command (the command that you just consumed via the atomic
> but are not really consuming with a MMIO due to the revised goto).
>
> In such a scenario this code would loop endlessly, bypassing the timeout
> logic completely, until the read room reflected a value greater than 1.
>
>> 		do {
>> 			room = readq_be(&afu->host_map->cmd_room);
>> -			atomic64_set(&afu->room, room);
>> -			if (room)
>> -				goto write_ioarrin;
>> +			if (room) {
>> +				atomic64_cmpxchg(&afu->room, 0, room);
>> +				goto retry;
>> +			}
>
> If you instead fully consume the entry (goto write_ioarrin - similar as it was
> before) and take into account the consumption when you update the cached
> value (i.e.: cmpxchg(..., 0, room - 1) the scenario described above will not occur.
>
>