[PATCH] spi: fsi: Implement a timeout for polling status

Sat Mar 19 01:10:47 AEDT 2022

On 3/17/22 23:19, Joel Stanley wrote:
> On Thu, 17 Mar 2022 at 21:14, Eddie James <eajames at linux.ibm.com> wrote:
>> The data transfer routines must poll the status register to
>> determine when more data can be shifted in or out. If the hardware
>> gets into a bad state, these polling loops may never exit. Prevent
>> this by returning an error if a timeout is exceeded.
> This makes sense. We may even want to put this code in regardless.
>
> However, I'm wondering why the code in fsi_spi_status didn't catch this.

Same, which is why for a long time I thought the problem couldn't be
happening here. See below for what I think is going on.

>
>> static int fsi_spi_status(struct fsi_spi *ctx, u64 *status, const char *dir)
>> {
>>         int rc = fsi_spi_read_reg(ctx, SPI_FSI_STATUS, status);
> You mentioned the error condition is we get back 0xff. That means that
> status will be 0xffff_ffff_ffff_ffff ?
>
> Did you observe status being this value?

No, I think my observation of 0xff is not universal. I suspect that
0xff is returned while the CFAM is in reset, but once the reset has
completed, valid (though possibly uninitialized) data is returned. I
observed a status of 0x0001100000000000, which means the controller is
idle; that makes sense, since it has just been reset. So the issue
occurs if we start an operation before a CFAM reset and are still
waiting for it to complete during the reset, yet we don't get any
failed or invalid FSI operations during or after the reset. This is
very timing dependent: the FSI master does lock during the reset but
doesn't wait afterwards for the hardware to initialize.
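
To make that concrete, here is a rough userspace sketch (not driver
code) of why the two status values behave differently. The error mask
is the 0xffe0f000 value mentioned further down in this thread; the
driver's full ANY_ERROR mask may include more bits. The RDR_FULL bit
position (bit 63) and the helper names are assumptions for
illustration and should be checked against spi-fsi.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Error bits as quoted in this thread (0xffe0f000). */
#define STATUS_ERROR_MASK	UINT64_C(0xffe0f000)

/* ASSUMPTION: RDR_FULL is bit 63; verify against spi-fsi.c. */
#define STATUS_RDR_FULL		(UINT64_C(1) << 63)

/* Hypothetical helper: would the error check in fsi_spi_status() fire? */
static bool status_has_error(uint64_t status)
{
	return (status & STATUS_ERROR_MASK) != 0;
}

/* Hypothetical helper: would the RX polling loop stop on this status? */
static bool rx_poll_would_exit(uint64_t status)
{
	return (status & STATUS_RDR_FULL) != 0;
}
```

Under those assumptions, an all-ones status trips the error check, but
0x0001100000000000 neither reports an error nor sets RDR_FULL, so a
loop waiting on RDR_FULL has no reason to stop.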

>
>>         if (rc)
>>                 return rc;
>>
>>         if (*status & SPI_FSI_STATUS_ANY_ERROR) {
> I think that we're checking against 0xffe0f000.
>
>>                 dev_err(ctx->dev, "%s error: %016llx\n", dir, *status);
>>
>>                 rc = fsi_spi_reset(ctx);
>>                 if (rc)
>>                         return rc;
> Is the problem here? fsi_spi_reset writes to the clock config
> registers, but doesn't read the status.
>
> Obviously doing the writes causes a call to fsi_spi_check_status, but
> that checks the FSI2SPI bridge, not the SPI master.
>
> ...but it doesn't matter, because we're either going to return an
> error from the reset, or return EREMOTEIO, so there's no masking of
> the error.

Not sure I follow. I don't think we were hitting this path in this error 
scenario. Do you think we need to check the status after a reset? It 
should always be the same.

>
>>                 return -EREMOTEIO;
>>         }
>>
>>         return 0;
>> }
>
>> Signed-off-by: Eddie James <eajames at linux.ibm.com>
>> ---
>>   drivers/spi/spi-fsi.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/drivers/spi/spi-fsi.c b/drivers/spi/spi-fsi.c
>> index b6c7467f0b59..d403a7a3021d 100644
>> --- a/drivers/spi/spi-fsi.c
>> +++ b/drivers/spi/spi-fsi.c
>> @@ -25,6 +25,7 @@
>>
>>   #define SPI_FSI_BASE                   0x70000
>>   #define SPI_FSI_INIT_TIMEOUT_MS                1000
>> +#define SPI_FSI_STATUS_TIMEOUT_MS      100
> Can you add a comment (or put something in the commit message) about
> why you chose 100ms.

Hm, sure, but I chose it pretty arbitrarily. I'm not sure how to choose 
something like this.
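
For experimenting with different timeout values outside the kernel,
the deadline pattern can be sketched in userspace roughly as below.
This is only an illustration, not the patch itself: clock_gettime()
stands in for jiffies, and poll_until_clear(), status_reader and the
two fake readers are hypothetical names. One deliberate difference
from the patch is that the sketch tests the busy bit before the
deadline, so a status that clears on the final read is not reported as
a timeout.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical callback type standing in for fsi_spi_status(). */
typedef int (*status_reader)(void *ctx, uint64_t *status);

/* Monotonic time in milliseconds (userspace stand-in for jiffies). */
static int64_t now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Poll until none of the bits in busy_mask are set.
 * Returns 0 on success, the reader's error code if a read fails,
 * or -ETIMEDOUT if the bits are still set after timeout_ms.
 */
static int poll_until_clear(status_reader read_status, void *ctx,
			    uint64_t busy_mask, int64_t timeout_ms)
{
	int64_t end = now_ms() + timeout_ms;
	uint64_t status = 0;

	for (;;) {
		int rc = read_status(ctx, &status);

		if (rc)
			return rc;

		/* Check the condition first so a late success still counts. */
		if (!(status & busy_mask))
			return 0;

		if (now_ms() > end)
			return -ETIMEDOUT;
	}
}

/* Fake reader: the busy bits never clear (hardware stuck). */
static int stuck_reader(void *ctx, uint64_t *status)
{
	(void)ctx;
	*status = UINT64_MAX;
	return 0;
}

/* Fake reader: reports busy for *ctx reads, then idle. */
static int counting_reader(void *ctx, uint64_t *status)
{
	int *remaining = ctx;

	*status = (*remaining)-- > 0 ? 1 : 0;
	return 0;
}
```

With stuck_reader the call returns -ETIMEDOUT once the deadline has
passed; with counting_reader it returns 0 as soon as the busy bit
clears.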

>
>>   #define SPI_FSI_MAX_RX_SIZE            8
>>   #define SPI_FSI_MAX_TX_SIZE            40
>>
>> @@ -299,6 +300,7 @@ static int fsi_spi_transfer_data(struct fsi_spi *ctx,
>>                                   struct spi_transfer *transfer)
>>   {
>>          int rc = 0;
>> +       unsigned long end;
>>          u64 status = 0ULL;
>>
>>          if (transfer->tx_buf) {
>> @@ -315,10 +317,14 @@ static int fsi_spi_transfer_data(struct fsi_spi *ctx,
>>                          if (rc)
>>                                  return rc;
>>
>> +                       end = jiffies + msecs_to_jiffies(SPI_FSI_STATUS_TIMEOUT_MS);
>>                          do {
>>                                  rc = fsi_spi_status(ctx, &status, "TX");
>>                                  if (rc)
>>                                          return rc;
>> +
>> +                               if (time_after(jiffies, end))
>> +                                       return -ETIMEDOUT;
>>                          } while (status & SPI_FSI_STATUS_TDR_FULL);
>>
>>                          sent += nb;
>> @@ -329,10 +335,14 @@ static int fsi_spi_transfer_data(struct fsi_spi *ctx,
>>                  u8 *rx = transfer->rx_buf;
>>
>>                  while (transfer->len > recv) {
>> +                       end = jiffies + msecs_to_jiffies(SPI_FSI_STATUS_TIMEOUT_MS);
>>                          do {
>>                                  rc = fsi_spi_status(ctx, &status, "RX");
>>                                  if (rc)
>>                                          return rc;
>> +
>> +                               if (time_after(jiffies, end))
>> +                                       return -ETIMEDOUT;
>>                          } while (!(status & SPI_FSI_STATUS_RDR_FULL));
>>
>>                          rc = fsi_spi_read_reg(ctx, SPI_FSI_DATA_RX, &in);
>> --
>> 2.27.0
>>