MPC831x (and others?) NAND erase performance improvements

Mark Mason mason at postdiluvian.org
Tue Dec 7 14:15:54 EST 2010


A few months ago I ran into some performance problems involving
UBI/NAND erases holding other devices off the LBC on an MPC8315.  I
found a solution for this, which worked well, at least with the
hardware I was working with.  I suspect the same problem affects other
PPCs, probably including multicore devices, and maybe other
architectures as well.

I don't have experience with similar NAND controllers on other
devices, so I'd like to explain what I found and see if someone who's
more familiar with the family and/or driver can tell if this is
useful.

The problem cropped up when there was a lot of traffic to the NAND
(Samsung K9WAGU08U1B-PIB0), with the NAND being on the LBC along with
a video chip that needed constant and prompt attention.

What I would see is that, as the writes happened, the erases would
wind up batched and issued all at once - frequently 400-700 erases in
rapid succession, each holding LBC BUSY for about 1ms.  BUSY was
shared by all of the devices on the LBC, so the PPC could not talk to
the video chip as long as the NAND had BUSY asserted.  This gave us
windows of up to 700ms in which the PPC could manage very little
communication with the other devices on the LBC - in our case the
video chip, for which a delay that long was essentially fatal.  I
suspect that some multicore chips with a similar NAND controller
might have one core effectively halt if that core attempts to access
the LBC while the other core (or itself, for that matter) is
executing an erase.

What I found, though, was that the NAND did not inherently assert BUSY
as part of the erase - BUSY was asserted because the driver polled for
the status (NAND_CMD_STATUS).  If the status poll was delayed for the
duration of the erase then the MPC could talk to the video chip while
the erase was in progress.  At the end of the 1ms delay I would then
poll for status, which would complete effectively immediately.

Here's a code snippet from 2.6.37, with some comments I added.
drivers/mtd/nand/fsl_elbc_nand.c - fsl_elbc_cmdfunc():

  /* ERASE2 uses the block and page address from ERASE1 */
  case NAND_CMD_ERASE2:
    dev_vdbg(priv->dev, "fsl_elbc_cmdfunc: NAND_CMD_ERASE2.\n");

    out_be32(&lbc->fir,
       (FIR_OP_CM0 << FIR_OP0_SHIFT) |  /* Execute CMD0 (ERASE1).           */
       (FIR_OP_PA  << FIR_OP1_SHIFT) |  /* Issue block and page address.    */
       (FIR_OP_CM2 << FIR_OP2_SHIFT) |  /* Execute CMD2 (ERASE2).           */
           /* (delay needed here - this is where the erase happens) */
       (FIR_OP_CW1 << FIR_OP3_SHIFT) |  /* Wait for LFRB (BUSY) to deassert */
                                        /* then issue CMD1 (read status).   */
       (FIR_OP_RS  << FIR_OP4_SHIFT));  /* Read one byte.                   */

    out_be32(&lbc->fcr,
       (NAND_CMD_ERASE1 << FCR_CMD0_SHIFT) |  /* 0x60 */
       (NAND_CMD_STATUS << FCR_CMD1_SHIFT) |  /* 0x70 */
       (NAND_CMD_ERASE2 << FCR_CMD2_SHIFT));  /* 0xD0 */

    out_be32(&lbc->fbcr, 0);
    elbc_fcm_ctrl->read_bytes = 0;
    elbc_fcm_ctrl->use_mdr = 1;

    fsl_elbc_run_command(mtd);
    return;

What I did was to issue two commands with fsl_elbc_run_command(), with
a 1ms sleep in between (a tight-loop delay worked almost as well; the
important part was having 1ms between the erase and the status poll).
The first command did FIR_OP_CM0 (NAND_CMD_ERASE1), FIR_OP_PA, and
FIR_OP_CM2 (NAND_CMD_ERASE2).  The second did FIR_OP_CW1
(NAND_CMD_STATUS) and FIR_OP_RS.
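
Roughly, the ERASE2 case would then look something like the sketch
below.  This is a sketch rather than the exact patch - in particular
the usleep_range() call (which needs <linux/delay.h>) and the way the
second command is programmed are illustrative:

  /* ERASE2 uses the block and page address from ERASE1 */
  case NAND_CMD_ERASE2:
    dev_vdbg(priv->dev, "fsl_elbc_cmdfunc: NAND_CMD_ERASE2.\n");

    /* First command: issue ERASE1, the block and page address, and
       ERASE2, but do not wait on LFRB or poll status yet, so the FCM
       releases the LBC while the erase runs inside the NAND.          */
    out_be32(&lbc->fir,
       (FIR_OP_CM0 << FIR_OP0_SHIFT) |  /* Execute CMD0 (ERASE1).           */
       (FIR_OP_PA  << FIR_OP1_SHIFT) |  /* Issue block and page address.    */
       (FIR_OP_CM2 << FIR_OP2_SHIFT));  /* Execute CMD2 (ERASE2).           */

    out_be32(&lbc->fcr,
       (NAND_CMD_ERASE1 << FCR_CMD0_SHIFT) |  /* 0x60 */
       (NAND_CMD_STATUS << FCR_CMD1_SHIFT) |  /* 0x70 */
       (NAND_CMD_ERASE2 << FCR_CMD2_SHIFT));  /* 0xD0 */

    out_be32(&lbc->fbcr, 0);
    elbc_fcm_ctrl->read_bytes = 0;
    elbc_fcm_ctrl->use_mdr = 1;

    fsl_elbc_run_command(mtd);

    /* Sleep for roughly the block erase time (about 1ms for this
       part).  Other devices on the LBC can be serviced during this
       window; a tight-loop delay (mdelay(1)) also worked.             */
    usleep_range(1000, 1500);

    /* Second command: wait for LFRB (BUSY) to deassert - it should
       already be high by now - then issue CMD1 (read status) and
       read one byte, which fsl_elbc_wait() picks up from the MDR.     */
    out_be32(&lbc->fir,
       (FIR_OP_CW1 << FIR_OP0_SHIFT) |  /* Wait for LFRB, then CMD1.        */
       (FIR_OP_RS  << FIR_OP1_SHIFT));  /* Read one byte.                   */

    out_be32(&lbc->fcr, NAND_CMD_STATUS << FCR_CMD1_SHIFT);  /* 0x70 */

    out_be32(&lbc->fbcr, 0);
    elbc_fcm_ctrl->read_bytes = 0;
    elbc_fcm_ctrl->use_mdr = 1;

    fsl_elbc_run_command(mtd);
    return;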

For a bit more detail...  fsl_elbc_run_command() would put the thread
issuing the erase to sleep so other threads could run.  That worked as
planned, except that I was dealing with a fairly pathological case -
there was a very high volume of writes to the NAND, and the video chip
required very frequent and prompt attention.  This meant that the
thread most likely to run while a NAND erase was in progress was the
thread that serviced the video chip.

A logic analyzer backed this up.  It showed an erase being issued,
BUSY (R/B# or LFRB) asserted for 1ms, one or two 16-bit transactions
to the video chip, then another erase, with this pattern repeating
hundreds of times in a row.  The UBI BGT would run just long enough to
issue an erase (probably on the order of 20us) and then go to sleep.
The video thread would then run and issue a transaction to the video
chip.  That transaction would block until BUSY deasserted, at which
point the thread would appear to have run for 1ms, even though it had
only executed a single bus transaction.

I know almost nothing about the scheduler, but I'm pretty sure this
behavior would make the scheduler think the video thread was a CPU
hog: from the scheduler's point of view the video thread ran for 1ms
for every 20us the UBI BGT ran, so the scheduler would unfairly prefer
the UBI BGT.  I initially tried to address the problem with thread
priorities, but the unfortunate reality was that either the NAND
writes fell behind or the video chip fell behind, and there wasn't
enough spare bandwidth to tolerate either.

I tried the same trick for the writes.  It didn't work.

I really hope that someone cares enough after all this typing.
Unfortunately I no longer have access to the hardware in question, so
I'm limited in what I can test, but I am willing to do whatever I can.

