[jffs2] handling flash corruption

Rohit Pai ropai at nvidia.com
Wed Dec 7 21:04:29 AEDT 2022


Hello Milton, 

Thanks for the reply. Yes, we use the Aspeed spi-nor controller driver.
So how did you manage to solve the issue? Are there any patches to the driver code?

Thanks 
Rohit PAI 

-----Original Message-----
From: Milton Miller II <miltonm at us.ibm.com> 
Sent: Wednesday, December 7, 2022 4:52 AM
To: Rohit Pai <ropai at nvidia.com>; openbmc at lists.ozlabs.org
Subject: Re: [jffs2] handling flash corruption


Hi Rohit

You didn't say which spi controller you are using, but we did see similar errors when developing the Aspeed spi-nor controller driver.

On ARM, the I/O memcpy (memcpy_toio/memcpy_fromio) is aliased to the memcpy that is optimized for ordinary memory. It is not suitable for use with fifo io windows that send data to the flash, as it will stutter and perform overlapping reads or writes depending on the source and destination alignment. The jffs2 file system definitely triggers such unaligned writes.

The comment in the older driver explains this (here's a link into the v5.0 kernel):

https://github.com/torvalds/linux/blob/1c163f4c7b3f621efff9b28a47abb36f7378d783/drivers/mtd/spi-nor/aspeed-smc.c#L204
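
A minimal userspace sketch of the idea follows; it is not the driver's code. The point is that a fifo window needs a copy loop that touches each source byte exactly once, using word-sized accesses only when the buffer is word aligned and byte accesses otherwise. The names `mock_fifo`, `fifo_put8`, `fifo_put32` and `fifo_write` are hypothetical stand-ins; in a real driver the two `fifo_put*` helpers would be I/O writes to one fixed address rather than appends to an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the controller's fifo window: every access,
 * whatever its width, is appended in order to `data`. */
struct mock_fifo {
    uint8_t data[256];
    size_t  used;
};

static void fifo_put8(struct mock_fifo *f, uint8_t v)
{
    assert(f->used + 1 <= sizeof(f->data));
    f->data[f->used++] = v;
}

static void fifo_put32(struct mock_fifo *f, uint32_t v)
{
    assert(f->used + 4 <= sizeof(f->data));
    memcpy(&f->data[f->used], &v, 4);   /* same byte order as in memory */
    f->used += 4;
}

/* Send exactly `len` bytes; no source byte is read or sent twice. */
static void fifo_write(struct mock_fifo *f, const void *buf, size_t len)
{
    const uint8_t *p = buf;

    /* Word accesses only when the source is 4-byte aligned. */
    if (((uintptr_t)p & 3) == 0) {
        for (; len >= 4; p += 4, len -= 4) {
            uint32_t w;
            memcpy(&w, p, 4);
            fifo_put32(f, w);
        }
    }
    /* Unaligned source or trailing bytes: one byte at a time. */
    for (; len > 0; p++, len--)
        fifo_put8(f, *p);
}
```

With this shape, an unaligned 9-byte buffer results in exactly 9 bytes reaching the fifo, which is the property an optimized memory memcpy does not guarantee for such a window.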

milton

PS: I'm not aware of an fsck for jffs2. Another symptom was that fsck would show names
with 4 garbage characters for the old files.

-----------  Apologies for top posting and not quoting the reply to: ------------

Dear Team,

We have 256MB of SPI NOR flash on our platform.
It's split into multiple partitions as listed in the table below.

Block / size              File system   Usage
/dev/mtdblock5 (0.5MB)    None          Stores copy of uboot env
/dev/mtdblock6 (16MB)     Jffs2         Read write file system; overlayed with the
                                        read-only file system from the image and
                                        mounted at '/'
/dev/mtdblock7 (200MB)    Jffs2         Log partition; used for storing logs and
                                        bmc dumps

We are seeing flash corruption in a few of our shipped products that undergo a repeated power cycle test.
The continuous power cycle test seems to somehow corrupt the data flash, and on the next boot we either end up in a kernel panic during init, or the file system attempts a recovery that never seems to finish successfully and the applications don't start properly.

When the flash is corrupt, we repeatedly see jffs2 errors as shown below.

[  279.805305] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x06f90020: 0x8504 instead
[  279.805319] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x06f90024: 0x75a3 instead
[  279.805327] jffs2: Further such events for this erase block will not be printed
[  279.817370] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x06fa0000: 0x0b14 instead
[ 279.848078] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x06fa0004: 0x1baa instead
[  279.860240] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x06fa0008: 0xb9c1 instead
[  279.872368] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x06fa000c: 0x4d18 instead
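
To make the messages above concrete, here is a rough userspace sketch of the kind of check they report: walking an erase block in 4-byte steps and counting positions that are neither erased flash (all 0xFF) nor the start of a node carrying the 0x1985 magic. It is only an approximation of what the kernel scan does, and rests on assumptions: little-endian on-flash layout, a 12-byte common node header (magic, node type, total length, header CRC), and no CRC verification. The function name `count_bad_magic` and the helpers are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define JFFS2_MAGIC 0x1985u  /* value the log lines say was expected */
#define NODE_HDR_LEN 12u     /* assumed: magic, nodetype, totlen, hdr_crc */

static uint16_t rd16le(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Count 4-byte-aligned offsets in one erase block image that hold
 * neither erased flash nor a plausible node header. */
static size_t count_bad_magic(const uint8_t *blk, size_t len)
{
    size_t ofs = 0, bad = 0;

    while (ofs + 4 <= len) {
        if (rd32le(blk + ofs) == 0xFFFFFFFFu) {   /* erased word */
            ofs += 4;
            continue;
        }
        if (rd16le(blk + ofs) == JFFS2_MAGIC && ofs + NODE_HDR_LEN <= len) {
            uint32_t totlen = rd32le(blk + ofs + 4);
            if (totlen >= NODE_HDR_LEN && totlen <= len - ofs) {
                /* Skip the whole node, padded to 4 bytes. */
                ofs += ((size_t)totlen + 3) & ~(size_t)3;
                continue;
            }
        }
        bad++;       /* would be reported as "Magic bitmask ... not found" */
        ofs += 4;
    }
    return bad;
}
```

A nonzero count on a dump of the partition would only be a hint, not proof of an unusable file system, since this sketch skips the CRC checks and obsolete-node handling that the real scan performs.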

These errors start to appear when any file-system-related command is executed from the obmc-init.sh file, for example:

mount -t overlay -o lowerdir=$rodir,upperdir=$upper,workdir=$work cow /root

So basically, it appears that jffs2 tries to recover the file system but cannot because of the type of corruption, so we start the overlay in a bad state, which can subsequently trigger a kernel panic. The flash corruption appears to happen mainly because of the repeated power cycle test combined with some applications (logging, debug collector, etc.) trying to write to flash at that time. The issue is very difficult to reproduce; we saw one failure after 5K loops of the test. We also tried to corrupt the flash manually by writing junk data to it, but that did not recreate the same issue.

Has anyone seen a similar type of issue?
Do you have any recommendations to solve the issue?
Is there a way to recreate such corruption manually for testing purposes?

One solution we have in mind is to check the return status of all commands in the obmc-init script around rwfs and, when these commands fail, try to boot with a read-only file system.
But we are not sure it would work in all cases: if the commands succeed and jffs2 still causes a kernel panic during background sync, we will have the same problem.
Any thoughts on how to detect the corrupted flash in obmc-init and avoid using it?

I see some references to fsck. This is not working on our platform because we don't have fsck.jffs2. It could be a packaging issue, which can be solved.
Can we trust fsck to catch all possible flash corruption? Could it be that fsck does not detect anything but jffs2 starts to fail once mounted?

Thanks
Rohit PAI




