[jffs2] handling flash corruption

Wed Dec 7 07:08:26 AEDT 2022

Dear Team,

We have 256MB of SPI NOR flash on our platform.
It is split into multiple partitions, as listed in the table below.

Block / size             File system   Usage
/dev/mtdblock5 (0.5MB)   None          Stores a copy of the U-Boot environment
/dev/mtdblock6 (16MB)    jffs2         Read-write file system; overlaid on the read-only file system from the image and mounted at '/'
/dev/mtdblock7 (200MB)   jffs2         Log partition; used for storing logs and BMC dumps

We are seeing flash corruption in a few of our shipped products that undergo a repeated power cycle test.
The continuous power cycling seems to corrupt the data flash, and on the next boot we either hit a kernel panic during init or the file system attempts a recovery that never seems to complete successfully, so the applications do not start properly.

When the flash is corrupt, we repeatedly see jffs2 errors as shown below.

[  279.805305] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x06f90020: 0x8504 instead
[  279.805319] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x06f90024: 0x75a3 instead
[  279.805327] jffs2: Further such events for this erase block will not be printed
[  279.817370] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x06fa0000: 0x0b14 instead
[  279.848078] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x06fa0004: 0x1baa instead
[  279.860240] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x06fa0008: 0xb9c1 instead
[  279.872368] jffs2: jffs2_scan_eraseblock(): Magic bitmask 0x1985 not found at 0x06fa000c: 0x4d18 instead
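
To see how many erase blocks are affected and where they sit, the offsets can be pulled out of the kernel log. The following is only a rough sketch: the function names are made up for illustration, and the 64 KiB erase block size is an assumption that should be checked against /proc/mtd (or /sys/class/mtd/mtdN/erasesize) on the target.

```sh
#!/bin/sh
# Assumption: 64 KiB erase blocks. Override EBSIZE with the value
# reported in /proc/mtd for the real device.
EBSIZE=${EBSIZE:-65536}

# Hypothetical helper: read kernel log text on stdin and print the base
# offset of every erase block that logged a missing-magic message.
jffs2_bad_eraseblocks() {
    sed -n 's/.*Magic bitmask 0x1985 not found at \(0x[0-9a-fA-F]*\):.*/\1/p' |
    while read -r off; do
        # Round the offset down to the start of its erase block.
        printf '0x%08x\n' $(( $off / $EBSIZE * $EBSIZE ))
    done | sort -u
}

# Hypothetical helper: convert a byte offset to whole MiB.
offset_mib() {
    echo $(( $1 / 1048576 ))
}
```

It could be run as `dmesg | jffs2_bad_eraseblocks`. Assuming the offsets are relative to the start of the partition being scanned (worth verifying), `offset_mib 0x06f90020` gives 111, which is beyond the size of the 16MB partition, so these particular messages would belong to the 200MB log partition rather than the rwfs.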

These errors start to appear when any file-system-related command is executed from the obmc-init.sh<https://github.com/openbmc/meta-phosphor/blob/master/recipes-phosphor/initrdscripts/files/obmc-init.sh#L417> file.

mount -t overlay -o lowerdir=$rodir,upperdir=$upper,workdir=$work cow /root

So it appears that jffs2 is trying to recover the file system but cannot because of the type of corruption, and we then start the overlay in a bad state, which can subsequently trigger a kernel panic. We believe the corruption happens during the repeated power cycle test, when power is cut while some of the applications (logging, debug collector, etc.) are writing to flash. The issue is very difficult to reproduce: we saw one failure after 5K loops of the test. We also tried to corrupt the flash manually by writing junk data to it, but that did not recreate the same issue.

Has anyone seen a similar issue?
Do you have any recommendations for solving it?
Is there a way to recreate such corruption manually for testing purposes?

One solution we have in mind is to check the return status of all commands around rwfs in the obmc-init script and, when any of them fails, to boot with the read-only file system.
But we are not sure this will work in all cases: if the commands succeed and jffs2 still causes a kernel panic later during the background sync, we will have the same problem.
Any thoughts on how to detect the corrupted flash in obmc-init and avoid using it?
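
The idea of checking return statuses could look roughly like the sketch below. It is not taken from obmc-init.sh: the function name and the MOUNT and DMESG variables are hypothetical, and the kernel log check is only a heuristic that does not distinguish between partitions and would not catch a panic that happens later.

```sh
#!/bin/sh
# Hypothetical sketch; none of these names come from obmc-init.sh.
# MOUNT and DMESG can be overridden so the logic can be dry-run off target.
MOUNT=${MOUNT:-mount}
DMESG=${DMESG:-dmesg}

# Print "rw" if the jffs2 mount succeeds and the kernel log shows no
# missing-magic scan errors; print "ro" otherwise, so the caller can
# skip the overlay and boot from the read-only file system.
choose_rootfs_mode() {
    rwdev=$1
    rwdir=$2
    if ! "$MOUNT" -t jffs2 "$rwdev" "$rwdir"; then
        echo ro
        return
    fi
    # Heuristic: any such message in the log counts as a failure,
    # regardless of which jffs2 partition produced it.
    if "$DMESG" | grep -q 'jffs2.*Magic bitmask 0x1985 not found'; then
        echo ro
        return
    fi
    echo rw
}
```

A caller would do something like `mode=$(choose_rootfs_mode /dev/mtdblock6 /run/initramfs/rw)` (the mount point here is only an example) and, when the result is `ro` after a successful mount, unmount the partition again before continuing.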

I see a reference to fsck<https://github.com/openbmc/meta-phosphor/blob/master/recipes-phosphor/initrdscripts/files/obmc-init.sh#L378> in obmc-init.sh. This does not work on our platform because we don't have fsck.jffs2. That could be a packaging issue, which can be solved.
Can we trust fsck to catch all possible flash corruption? Could fsck report nothing and jffs2 still start to fail once it is mounted?

Thanks
Rohit PAI