[PATCH] net: ftgmac100: Fix missing TX-poll issue

Sat Oct 24 00:08:30 AEDT 2020

> -----Original Message-----
> From: Andrew Jeffery [mailto:andrew at aj.id.au]
> Sent: Wednesday, October 21, 2020 6:26 AM
> To: Benjamin Herrenschmidt <benh at kernel.crashing.org>; Arnd Bergmann
> <arnd at arndb.de>; Dylan Hung <dylan_hung at aspeedtech.com>
> Cc: BMC-SW <BMC-SW at aspeedtech.com>; linux-aspeed
> <linux-aspeed at lists.ozlabs.org>; Po-Yu Chuang <ratbert at faraday-tech.com>;
> netdev <netdev at vger.kernel.org>; OpenBMC Maillist
> <openbmc at lists.ozlabs.org>; Linux Kernel Mailing List
> <linux-kernel at vger.kernel.org>; Jakub Kicinski <kuba at kernel.org>; David
> Miller <davem at davemloft.net>
> Subject: Re: [PATCH] net: ftgmac100: Fix missing TX-poll issue
> 
> 
> 
> On Wed, 21 Oct 2020, at 08:40, Benjamin Herrenschmidt wrote:
> > On Tue, 2020-10-20 at 21:49 +0200, Arnd Bergmann wrote:
> > > On Tue, Oct 20, 2020 at 11:37 AM Dylan Hung <dylan_hung at aspeedtech.com> wrote:
> > > > > +1 @first is system memory from dma_alloc_coherent(), right?
> > > > >
> > > > > You shouldn't have to do this. Is coherent DMA memory broken on
> > > > > your platform?
> > > >
> > > > It is about the arbitration on the DRAM controller.  There are two
> > > > queues in the dram controller, one is for the CPU access and the other
> > > > is for the HW engines.
> > > > When CPU issues a store command, the dram controller just
> > > > acknowledges cpu's request and pushes the request into the queue.  Then
> > > > CPU triggers the HW MAC engine, the HW engine starts to fetch the DMA
> > > > memory.
> > > > But since the cpu's request may still stay in the queue, the HW engine
> > > > may fetch the wrong data.
> >
> > Actually, I take back what I said earlier, the above seems to imply
> > this is more generic.
> >
> > Dylan, please confirm, does this affect *all* DMA capable devices ? If
> > yes, then it's a really really bad design bug in your chips
> > unfortunately and the proper fix is indeed to make dma_wmb() do a
> > dummy read of some sort (what address though ? would any dummy
> > non-cachable page do ?) to force the data out as *all* drivers will
> > potentially be affected.
> >

The issue was found on our test chip (ast2600 version A0), which is for testing only and won't be mass-produced.  This HW bug has been fixed in ast2600 A1 and later versions.

To verify the HW fix, I ran overnight iperf and kvm tests on ast2600 A1 without this patch and got stable results without any hangs.
So I think we can discard this patch.
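
For reference, the dummy-read idea discussed in the quoted text can be sketched roughly as below. This is a userspace model only, not the driver code: the descriptor layout, DESC_OWNED_BY_HW, fake_txpoll_reg, wmb_stub() and publish_and_kick() are all hypothetical names made up for illustration, and the fence merely stands in for dma_wmb(). It assumes the workaround is a read-back of the descriptor before the TX-poll write.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical TX descriptor; field names are illustrative only. */
struct tx_desc {
	volatile uint32_t ctl;   /* ownership bit + length, written last */
	volatile uint32_t addr;  /* buffer address */
};

#define DESC_OWNED_BY_HW 0x80000000u

/* Stand-in for the MAC's TX-poll doorbell register (assumption). */
static volatile uint32_t fake_txpoll_reg;

/* Stand-in for dma_wmb(): orders earlier stores before later ones. */
static inline void wmb_stub(void)
{
	atomic_thread_fence(memory_order_release);
}

/*
 * Fill a descriptor, read it back, then kick the engine.
 * Returns the value read back from the descriptor.
 */
static uint32_t publish_and_kick(struct tx_desc *d, uint32_t buf, uint32_t len)
{
	uint32_t readback;

	assert(d != NULL);

	d->addr = buf;
	wmb_stub();                       /* address visible before ownership */
	d->ctl = DESC_OWNED_BY_HW | len;
	wmb_stub();

	/*
	 * Dummy read of the word just written. On the affected A0 part the
	 * idea is that the read cannot complete until the queued store has
	 * reached DRAM, so the engine sees fresh data once it is triggered.
	 */
	readback = d->ctl;

	fake_txpoll_reg = 1;              /* trigger the (simulated) TX poll */
	return readback;
}
```

On A1 and later the read-back should not be needed, per the test results above.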

> > I was under the impression that it was a specific timing issue in the
> > vhub and ethernet parts, but if it's more generic then it needs to be
> > fixed globally.
> >
> 
> We see a similar issue in the XDMA engine where it can transfer stale data to
> the host. I think the driver ended up using memcpy_toio() to work around that
> despite using a DMA reserved memory region.