[BUG] mtd: cfi_cmdset_0002: write regression since v4.17-rc1

Mon Feb 14 03:47:57 AEDT 2022

Hi Ahmad-san,

Thanks for your confirmation. Sorry for the late reply.

Could you please try the attached patch, which disables the chip_good()
change and restores the previous behavior?
I think this should work for the S29GL064N, since chip_ready() is used
again and it worked as you mentioned.

On 2022/02/07 23:28, Ahmad Fatoum wrote:
> Hello Tokunori-san,
>
> On 29.01.22 19:01, Tokunori Ikegami wrote:
>> Hi Ahmad-san,
>>
>> Thanks for your investigation.
>>
>>> The issue is still there with #define FORCE_WORD_WRITE 1:
>>>
>>>     jffs2: Write clean marker to block at 0x000a0000 failed: -5
>>>     MTD do_write_oneword_once(): software timeout
>> Which kernel version was this tested on?
> I last tested with v5.10.30, but I had briefly tried v5.16-rc as well
> when first debugging this issue.
>
> I have rebased onto v5.17-rc2 now and will use that for further tests.
> The same issue with word write forcing is reproducible there as well.
Noted, thanks.
>
>> I ask because buffered writes were disabled by 7e4404113686 for S29GL256N, and that was tested on kernel 5.10.16.
>> So I would like to confirm whether the issue depends on the CPU, the kernel version, etc.
>> Note: The S29GL064N and S29GL256N chips basically seem to differ only in flash size (Mbit).
> I see. To be extra sure, I have replaced 0x2201 with 0x0c01 to hit
> the same code paths, but no improvement.
I see. I have checked the data sheet as described.
>
>>> Doesn't seem to be a buffered write issue here though as the writes
>>> did work fine before dfeae1073583. Any other ideas?
>> At first I thought the issue could be resolved by using word writes instead of buffered writes.
>> Now I am thinking of partially disabling the changes in dfeae1073583 under some condition, if possible.
> What seems to work for me is checking if chip_good or chip_ready
> and map_word is equal to 0xFF. I can't justify why this is ok though.
> (Worst case bus is floating at this point of time and Hi-Z is read
> as 0xff on CPU data lines...)

Sorry, I am not sure about this.
I had thought chip_ready() itself was correct, since it was implemented
according to the data sheet.
But it did not work correctly, so I changed the code to use chip_good()
instead, which is also correct.
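
To make the difference between the two checks concrete, here is a minimal
userspace sketch (not kernel code). struct fake_flash, fake_read() and the
model_*() helpers are hypothetical names made up for illustration; they
only mimic the comparisons done by chip_ready() and chip_good() on a
scripted sequence of reads. model_good_or_ff() mirrors the workaround
described in the quoted text above and is not a proposed fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for one memory-mapped flash location: each read
 * returns the next value of a scripted sequence, then repeats the last.
 */
struct fake_flash {
	const uint8_t *reads;
	unsigned int len;
	unsigned int pos;
};

static uint8_t fake_read(struct fake_flash *f)
{
	uint8_t v;

	assert(f->len > 0 && f->pos < f->len);
	v = f->reads[f->pos];
	if (f->pos + 1 < f->len)
		f->pos++;
	return v;
}

/* Models chip_ready(): two consecutive reads agree, so nothing toggles. */
static bool model_ready(struct fake_flash *f)
{
	uint8_t oldd = fake_read(f);
	uint8_t curd = fake_read(f);

	return oldd == curd;
}

/* Models chip_good(): nothing toggles and the expected data is read. */
static bool model_good(struct fake_flash *f, uint8_t expected)
{
	uint8_t oldd = fake_read(f);
	uint8_t curd = fake_read(f);

	return oldd == curd && curd == expected;
}

/* Models the quoted workaround: good, or stable with all bits set. */
static bool model_good_or_ff(struct fake_flash *f, uint8_t expected)
{
	uint8_t oldd = fake_read(f);
	uint8_t curd = fake_read(f);

	if (oldd != curd)
		return false;
	return curd == expected || curd == 0xFF;
}
```

With a read sequence that settles at 0xFF, model_ready() and
model_good_or_ff() return true, while model_good() returns false for any
other expected value. That matches the behavior reported in this thread.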

>
>> By the way, could you please let me know the chip information in more detail? (For example model number, cycle and device ID, etc.)
> I can't read it off the chip, but vendor uses S29GL064N90FFI02 or S29GL964N11FFI02.
> Kernel reports it with:
> ff800000.flash: Found 1 x16 devices at 0x0 in 8-bit bank. Manufacturer ID 0x000001 Chip ID 0x000c01
The attached change checks for device ID 0x0c01 and uses chip_ready()
instead of chip_good().
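
As a rough standalone sketch of that selection (the MODEL_* macros and
model_use_chip_good_for_write() are hypothetical names for illustration
only; the IDs are the ones from the kernel log quoted above):

```c
#include <stdbool.h>
#include <stdint.h>

/* IDs as reported in the quoted kernel log; the macro names are made up. */
#define MODEL_MFR_AMD		0x0001
#define MODEL_ID_S29GL064N	0x0c01

/*
 * Returns false for the one chip that should keep the old chip_ready()
 * style check on writes, and true for every other chip.
 */
static bool model_use_chip_good_for_write(uint16_t mfr, uint16_t id)
{
	if (mfr == MODEL_MFR_AMD && id == MODEL_ID_S29GL064N)
		return false;

	return true;
}
```

Only the exact manufacturer and device ID pair is exempted; every other
chip keeps the stricter chip_good() check.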
>
> I am not sure what you mean with cycle. If you tell me what
> command to run, I can paste the output.
Sorry, my understanding of the data sheet's description of the device ID
and cycle was not correct.

Regards,
Ikegami

>
> Thanks,
> Ahmad
>
>
>
>> Regards,
>> Ikegami
>>
>>
>> On 2021/12/14 16:23, Thorsten Leemhuis wrote:
>>
>>>>> [TLDR: adding this regression to regzbot; most of this mail is compiled
>>>>> from a few templates paragraphs some of you might have seen already.]
>>>>>
>>>>> Hi, this is your Linux kernel regression tracker speaking.
>>>>>
>>>>> Top-posting for once, to make this easily accessible to everyone.
>>>>>
>>>>> Thanks for the report.
>>>>>
>>>>> Adding the regression mailing list to the list of recipients, as it
>>>>> should be in the loop for all regressions, as explained here:
>>>>> https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html
>>>>>
>>>>> To be sure this issue doesn't fall through the cracks unnoticed, I'm
>>>>> adding it to regzbot, my Linux kernel regression tracking bot:
>>>>>
>>>>> #regzbot ^introduced dfeae1073583
>>>>> #regzbot title mtd: cfi_cmdset_0002: flash write accesses on the
>>>>> hardware fail on a PowerPC MPC8313 to a 8-bit-parallel S29GL064N flash
>>>>> #regzbot ignore-activity
>>>>>
>>>>> Reminder: when fixing the issue, please add a 'Link:' tag with the URL
>>>>> to the report (the parent of this mail), then regzbot will automatically
>>>>> mark the regression as resolved once the fix lands in the appropriate
>>>>> tree. For more details about regzbot see footer.
>>>>>
>>>>> Sending this to everyone that got the initial report, to make all aware
>>>>> of the tracking. I also hope that messages like this motivate people to
>>>>> directly get at least the regression mailing list and ideally even
>>>>> regzbot involved when dealing with regressions, as messages like this
>>>>> wouldn't be needed then.
>>>>>
>>>>> Don't worry, I'll send further messages wrt to this regression just to
>>>>> the lists (with a tag in the subject so people can filter them away), as
>>>>> long as they are intended just for regzbot. With a bit of luck no such
>>>>> messages will be needed anyway.
>>>>>
>>>>> Ciao, Thorsten (wearing his 'Linux kernel regression tracker' hat).
>>>>>
>>>>> P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
>>>>> on my table. I can only look briefly into most of them. Unfortunately
>>>>> therefore I sometimes will get things wrong or miss something important.
>>>>> I hope that's not the case here; if you think it is, don't hesitate to
>>>>> tell me about it in a public reply. That's in everyone's interest, as
>>>>> what I wrote above might be misleading to everyone reading this; any
>>>>> suggestion I gave thus might send someone reading this down the wrong
>>>>> rabbit hole, which none of us wants.
>>>>>
>>>>> BTW, I have no personal interest in this issue, which is tracked using
>>>>> regzbot, my Linux kernel regression tracking bot
>>>>> (https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
>>>>> this mail to get things rolling again and hence don't need to be CC on
>>>>> all further activities wrt to this regression.
>>>>>
>>>>> On 13.12.21 14:24, Ahmad Fatoum wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I've been investigating a breakage on a PowerPC MPC8313: The SoC is connected
>>>>>> via the "Enhanced Local Bus Controller" to an 8-bit-parallel S29GL064N flash,
>>>>>> which is represented as a memory-mapped cfi-flash.
>>>>>>
>>>>>> The regression began in v4.17-rc1 with
>>>>>>
>>>>>>      dfeae1073583 ("mtd: cfi_cmdset_0002: Change write buffer to check correct value")
>>>>>>
>>>>>> and causes all flash write accesses on the hardware to fail. Example output
>>>>>> after v5.1-rc2[1]:
>>>>>>
>>>>>>      root@host:~# mount -t jffs2 /dev/mtdblock0 /mnt
>>>>>>      MTD do_write_buffer_wait(): software timeout, address:0x000c000b.
>>>>>>      jffs2: Write clean marker to block at 0x000c0000 failed: -5
>>>>>>
>>>>>> This issue still persists with v5.16-rc. Reverting aforementioned patch fixes
>>>>>> it, but I am still looking for a change that keeps both Tokunori's and my
>>>>>> hardware happy.
>>>>>>
>>>>>> What Tokunori's patch did is that it strengthened the success condition
>>>>>> for flash writes:
>>>>>>
>>>>>>     - Prior to the patch, DQ polling was done until bits
>>>>>>       stopped toggling. This was taken as an indicator that the write succeeded
>>>>>>       and was reported up the stack. i.e. success condition is chip_ready()
>>>>>>
>>>>>>     - After the patch, polling continues until the just written data is
>>>>>>       actually read back, i.e. success condition is chip_good()
>>>>>>
>>>>>> This new condition never holds for me, when DQ stabilizes, it reads 0xFF,
>>>>>> never the just written data. The data is still written and can be read back
>>>>>> on subsequent reads, just not at that point of time in the poll loop.
>>>>>>
>>>>>> We haven't had write issues for the years predating that patch. As the
>>>>>> regression has been mainline for a while, I am wondering what about my setup
>>>>>> that makes it pop up here, but not elsewhere?
>>>>>>
>>>>>> I consulted the data sheet[2] and found Figure 27, which describes DQ polling
>>>>>> during embedded algorithms. DQ switches from status output to "True" (I assume
>>>>>> True == all bits set == 0xFF) until CS# is reasserted.
>>>>>>
>>>>>> I compared with another chip's datasheet[3], and it (Figure 8.4) doesn't describe
>>>>>> such an intermittent "True" state. In any case, the driver polls a few hundred
>>>>>> times, however, before giving up, so there should be enough CS# toggles.
>>>>>>
>>>>>>
>>>>>> Locally, I'll revert this patch for now. I think accepting 0xFF as a success
>>>>>> condition may be appropriate, but I don't yet have the rationale to back it up.
>>>>>>
>>>>>> I am investigating this some more, probably with a logic trace, but I wanted
>>>>>> to report this in case someone has pointers and in case other people run into
>>>>>> the same issue.
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Ahmad
>>>>>>
>>>>>> [1]: Prior to d9b8a67b3b95 ("mtd: cfi: fix deadloop in cfi_cmdset_0002.c do_write_buffer")
>>>>>>        first included with v5.1-rc2, failing writes just hung indefinitely in kernel space.
>>>>>>        That's fixed, but the writes still fail.
>>>>>>
>>>>>> [2]: 001-98525 Rev. *B, https://www.infineon.com/dgdl/Infineon-S29GL064N_S29GL032N_64_Mbit_32_Mbit_3_V_Page_Mode_MirrorBit_Flash-DataSheet-v03_00-EN.pdf?fileId=8ac78c8c7d0d8da4017d0ed556fd548b
>>>>>>
>>>>>> [3]: https://www.mouser.com/datasheet/2/268/SST39VF1601C-SST39VF1602C-16-Mbit-x16-Multi-Purpos-709008.pdf
>>>>>>         Note that "true data" means valid data here, not all bits one.
>>>>>>
>
-------------- next part --------------
From 59b1e946931202d7058eec12c2bcda7fc65acbba Mon Sep 17 00:00:00 2001
From: Tokunori Ikegami <ikegami.t at gmail.com>
Date: Mon, 14 Feb 2022 01:08:02 +0900
Subject: [PATCH] mtd: cfi_cmdset_0002: Use chip_ready() for write on S29GL064N

A regression has been reported on S29GL064N.
The commit named in the Fixes tag changed buffered writes to use chip_good().
So disable that change on S29GL064N and use chip_ready() as before.

Fixes: dfeae1073583 ("mtd: cfi_cmdset_0002: Change write buffer to check correct value")
Signed-off-by: Tokunori Ikegami <ikegami.t at gmail.com>
Cc: linux-mtd at lists.infradead.org
---
 drivers/mtd/chips/cfi_cmdset_0002.c | 105 ++++++++++++++++------------
 1 file changed, 59 insertions(+), 46 deletions(-)

diff --git a/drivers/mtd/chips/cfi_cmdset_0002.c b/drivers/mtd/chips/cfi_cmdset_0002.c
index a761134fd3be..a0dfc8ace899 100644
--- a/drivers/mtd/chips/cfi_cmdset_0002.c
+++ b/drivers/mtd/chips/cfi_cmdset_0002.c
@@ -48,6 +48,7 @@
 #define SST49LF040B		0x0050
 #define SST49LF008A		0x005a
 #define AT49BV6416		0x00d6
+#define S29GL064N_MN12		0x0c01
 
 /*
  * Status Register bit description. Used by flash devices that don't
@@ -109,6 +110,8 @@ static struct mtd_chip_driver cfi_amdstd_chipdrv = {
 	.module		= THIS_MODULE
 };
 
+static bool use_chip_good_for_write;
+
 /*
  * Use status register to poll for Erase/write completion when DQ is not
  * supported. This is indicated by Bit[1:0] of SoftwareFeatures field in
@@ -283,6 +286,17 @@ static void fixup_use_write_buffers(struct mtd_info *mtd)
 }
 #endif /* !FORCE_WORD_WRITE */
 
+static void fixup_use_chip_good_for_write(struct mtd_info *mtd)
+{
+	struct map_info *map = mtd->priv;
+	struct cfi_private *cfi = map->fldrv_priv;
+
+	if (cfi->mfr == CFI_MFR_AMD && cfi->id == S29GL064N_MN12)
+		return;
+
+	use_chip_good_for_write = true;
+}
+
 /* Atmel chips don't use the same PRI format as AMD chips */
 static void fixup_convert_atmel_pri(struct mtd_info *mtd)
 {
@@ -462,7 +476,7 @@ static struct cfi_fixup cfi_fixup_table[] = {
 	{ CFI_MFR_AMD, 0x0056, fixup_use_secsi },
 	{ CFI_MFR_AMD, 0x005C, fixup_use_secsi },
 	{ CFI_MFR_AMD, 0x005F, fixup_use_secsi },
-	{ CFI_MFR_AMD, 0x0c01, fixup_s29gl064n_sectors },
+	{ CFI_MFR_AMD, S29GL064N_MN12, fixup_s29gl064n_sectors },
 	{ CFI_MFR_AMD, 0x1301, fixup_s29gl064n_sectors },
 	{ CFI_MFR_AMD, 0x1a00, fixup_s29gl032n_sectors },
 	{ CFI_MFR_AMD, 0x1a01, fixup_s29gl032n_sectors },
@@ -474,6 +488,7 @@ static struct cfi_fixup cfi_fixup_table[] = {
 #if !FORCE_WORD_WRITE
 	{ CFI_MFR_ANY, CFI_ID_ANY, fixup_use_write_buffers },
 #endif
+	{ CFI_MFR_ANY, CFI_ID_ANY, fixup_use_chip_good_for_write },
 	{ 0, 0, NULL }
 };
 static struct cfi_fixup jedec_fixup_table[] = {
@@ -801,42 +816,61 @@ static struct mtd_info *cfi_amdstd_setup(struct mtd_info *mtd)
 	return NULL;
 }
 
-/*
- * Return true if the chip is ready.
- *
- * Ready is one of: read mode, query mode, erase-suspend-read mode (in any
- * non-suspended sector) and is indicated by no toggle bits toggling.
- *
- * Note that anything more complicated than checking if no bits are toggling
- * (including checking DQ5 for an error status) is tricky to get working
- * correctly and is therefore not done	(particularly with interleaved chips
- * as each chip must be checked independently of the others).
- */
-static int __xipram chip_ready(struct map_info *map, struct flchip *chip,
-			       unsigned long addr)
+static int __xipram chip_check(struct map_info *map, struct flchip *chip,
+			       unsigned long addr, map_word *expected)
 {
 	struct cfi_private *cfi = map->fldrv_priv;
-	map_word d, t;
+	map_word oldd, curd;
+	int ret;
 
 	if (cfi_use_status_reg(cfi)) {
 		map_word ready = CMD(CFI_SR_DRB);
+
 		/*
 		 * For chips that support status register, check device
 		 * ready bit
 		 */
 		cfi_send_gen_cmd(0x70, cfi->addr_unlock1, chip->start, map, cfi,
 				 cfi->device_type, NULL);
-		d = map_read(map, addr);
+		curd = map_read(map, addr);
 
-		return map_word_andequal(map, d, ready, ready);
+		return map_word_andequal(map, curd, ready, ready);
 	}
 
-	d = map_read(map, addr);
-	t = map_read(map, addr);
+	oldd = map_read(map, addr);
+	curd = map_read(map, addr);
+
+	ret = map_word_equal(map, oldd, curd);
+
+	if (!ret || !expected)
+		return ret;
+
+	return map_word_equal(map, curd, *expected);
+}
+
+static int __xipram chip_good_for_write(struct map_info *map,
+					struct flchip *chip, unsigned long addr,
+					map_word expected)
+{
+	if (use_chip_good_for_write)
+		return chip_check(map, chip, addr, &expected);
 
-	return map_word_equal(map, d, t);
+	return chip_check(map, chip, addr, NULL);
 }
 
+/*
+ * Return true if the chip is ready.
+ *
+ * Ready is one of: read mode, query mode, erase-suspend-read mode (in any
+ * non-suspended sector) and is indicated by no toggle bits toggling.
+ *
+ * Note that anything more complicated than checking if no bits are toggling
+ * (including checking DQ5 for an error status) is tricky to get working
+ * correctly and is therefore not done	(particularly with interleaved chips
+ * as each chip must be checked independently of the others).
+ */
+#define chip_ready(map, chip, addr) chip_check(map, chip, addr, NULL)
+
 /*
  * Return true if the chip is ready and has the correct value.
  *
@@ -855,28 +889,7 @@ static int __xipram chip_ready(struct map_info *map, struct flchip *chip,
 static int __xipram chip_good(struct map_info *map, struct flchip *chip,
 			      unsigned long addr, map_word expected)
 {
-	struct cfi_private *cfi = map->fldrv_priv;
-	map_word oldd, curd;
-
-	if (cfi_use_status_reg(cfi)) {
-		map_word ready = CMD(CFI_SR_DRB);
-
-		/*
-		 * For chips that support status register, check device
-		 * ready bit
-		 */
-		cfi_send_gen_cmd(0x70, cfi->addr_unlock1, chip->start, map, cfi,
-				 cfi->device_type, NULL);
-		curd = map_read(map, addr);
-
-		return map_word_andequal(map, curd, ready, ready);
-	}
-
-	oldd = map_read(map, addr);
-	curd = map_read(map, addr);
-
-	return	map_word_equal(map, oldd, curd) &&
-		map_word_equal(map, curd, expected);
+	return chip_check(map, chip, addr, &expected);
 }
 
 static int get_chip(struct map_info *map, struct flchip *chip, unsigned long adr, int mode)
@@ -1699,7 +1712,7 @@ static int __xipram do_write_oneword_once(struct map_info *map,
 		 * "chip_good" to avoid the failure due to scheduling.
 		 */
 		if (time_after(jiffies, timeo) &&
-		    !chip_good(map, chip, adr, datum)) {
+		    !chip_good_for_write(map, chip, adr, datum)) {
 			xip_enable(map, chip, adr);
 			printk(KERN_WARNING "MTD %s(): software timeout\n", __func__);
 			xip_disable(map, chip, adr);
@@ -1707,7 +1720,7 @@ static int __xipram do_write_oneword_once(struct map_info *map,
 			break;
 		}
 
-		if (chip_good(map, chip, adr, datum)) {
+		if (chip_good_for_write(map, chip, adr, datum)) {
 			if (cfi_check_err_status(map, chip, adr))
 				ret = -EIO;
 			break;
@@ -1979,14 +1992,14 @@ static int __xipram do_write_buffer_wait(struct map_info *map,
 		 * "chip_good" to avoid the failure due to scheduling.
 		 */
 		if (time_after(jiffies, timeo) &&
-		    !chip_good(map, chip, adr, datum)) {
+		    !chip_good_for_write(map, chip, adr, datum)) {
 			pr_err("MTD %s(): software timeout, address:0x%.8lx.\n",
 			       __func__, adr);
 			ret = -EIO;
 			break;
 		}
 
-		if (chip_good(map, chip, adr, datum)) {
+		if (chip_good_for_write(map, chip, adr, datum)) {
 			if (cfi_check_err_status(map, chip, adr))
 				ret = -EIO;
 			break;
-- 
2.32.0