New design for file system layout and code update

Wed Aug 23 10:59:47 AEST 2017

[Some trimming]

On 08/22/2017 at 12:26AM in some time zone, Andrew Jeffery <andrew at aj.id.au> wrote:
>On Tue, 2017-08-22 at 02:25 +0000, Milton Miller II wrote:
>> On 08/21/2017 around 06:22AM in some timezone, Joel Stanley wrote:
>> > 
>> > Firstly, a clarification about the kernel: your doc mentions that
>> > it'll be present (as the raw kernel binary) in a UBI volume, but
>> > the discussion today you mentioned that it will be a raw flash area.
>> > Can you clear that up?
>> 
>> I don't remember my exact words on the call last night, but I  
>> remember talking about putting the FIT image in a bare UBI static 
>> partition instead of a file in an ubifs instance..  I probably 
>> used the terminology raw ubi partition.
>> 
>> As stated, the current plan does not require any UBI volume to
>> be written, only read.  The only raw flash space in the proposed plan
>> is for u-boot itself and two 64k sectors for its environment space.
>> 

I should point out I left out "in u-boot" when talking about write
support.  The kernel will need to write to the read/write file systems,
currently the rwfs, which contains either the whole overlay (in the
existing layout) or /var with the /etc overlay (in the new one).

>> > 
>> > > All distribution images (except Das U-boot and its environment)
>> > > will be stored in separate ubi volumes named by their image type.
>> > > A hash of version identifiers will be generated during deployment
>> > > to make unique image names.
>> > > 
>> > > The kernel and its device tree will be stored in a fit image
>> > > in a ubi volume.
>> > 
>> > Am I correct in assuming that this allows us to be flexible in the
>> > partition sizing? ie, we can perform single-volume updates at runtime
>> > that may be larger than the space originally allocated for that
>> > volume?
>> 
>> Yes.  To me, that is the primary motivation for enabling UBI volumes.
>
>I haven't played much with UBI, so this is a naive question: is this the case
>for all UBI volumes? It appears there are both 'static' and 'dynamic' volumes.

Yes.  The maximum number of LEBs (logical erase blocks) in a volume 
is stored in the volume header.  This is like a quota for the volume,
and ubi forces all volumes combined plus overhead to be less than 
the device size.  
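
The quota arithmetic above can be sketched roughly as follows.  This
is only an illustration of the bookkeeping, not the UBI on-flash
format; the device size, the overhead figure, and the volume names
are made-up values.

```python
# Sketch only: models the per-volume LEB "quota" check described above.
# All numbers and volume names here are hypothetical.

def free_lebs(total_lebs, overhead_lebs, volumes):
    """Return the LEBs left once overhead and every volume quota are reserved."""
    return total_lebs - overhead_lebs - sum(volumes.values())


def can_resize(total_lebs, overhead_lebs, volumes, name, new_size):
    """True if volume `name` could be given a quota of `new_size` LEBs."""
    others = {k: v for k, v in volumes.items() if k != name}
    return free_lebs(total_lebs, overhead_lebs, others) >= new_size


# Hypothetical device: 512 erase blocks, 20 of them held back as overhead.
volumes = {"kernel-abc123": 64, "rofs-abc123": 300}
print(can_resize(512, 20, volumes, "kernel-abc123", 96))
print(can_resize(512, 20, volumes, "rofs-abc123", 450))
```

The first request fits because only the other volume's 300 LEBs and
the overhead are already spoken for; the second does not.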

>From [1]:
>
>	UBI volume size is specified when the volume is created and may later
>	be changed (volumes are dynamically re-sizable). There are user-space
>	tools which may be used to manipulate UBI volumes.
>
>	There are 2 types of UBI volumes - dynamic volumes and static volumes.
>	Static volumes are read-only and their contents are protected by CRC-32
>	checksums, while dynamic volumes are read-write and the upper layers
>	(e.g., a file-system) are responsible for ensuring data integrity.
>
>[1] http://www.linux-mtd.infradead.org/doc/ubi.html
>
>
>The order it's written in does seem to suggest both static and dynamic volumes
>can be resized, but clarity there would be good.

I have resized static volumes.  No volume can be resized below its
current max LEB.  The content of a static volume can be replaced with
a new image.  Both the old and new content must fit in the flash; a
truncate option on the command line updates to a zero-length image,
allowing a new image up to the maximum size.

On a dynamic volume, individual LEBs can be replaced.  When moving
a LEB that is part of a static volume, UBI will write the crc32 to
the header in the new block, and on recovery it checks this crc to
determine whether the new block was fully written or whether the new
block should be reclaimed instead of the old.
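
As a rough illustration of that recovery decision, here is a minimal
sketch.  The (crc, data) pair stands in for a block header plus
payload and is an assumption made for the example, not the real UBI
header layout; an interrupted write is simulated by leaving the last
byte in the erased state.

```python
import zlib

# Sketch only: a block is modelled as a (crc_from_header, data) tuple.
# This is a hypothetical stand-in, not the real UBI header format.

def write_block(payload, complete=True):
    """Return a block; an interrupted write leaves the last byte erased (0xFF)."""
    crc = zlib.crc32(payload)
    data = payload if complete else payload[:-1] + b"\xff"
    return crc, data


def pick_copy(old, new):
    """Keep the new block only if its data matches the CRC in its header."""
    crc, data = new
    return new if zlib.crc32(data) == crc else old


payload = b"static volume LEB contents"
old = write_block(payload)
print(pick_copy(old, write_block(payload)) is old)                  # False: new copy kept
print(pick_copy(old, write_block(payload, complete=False)) is old)  # True: old copy kept
```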

>> > > To support a total flash chip failure, each flash chip will
>> > > contain an independent ubi device.  The mtd_concat driver will not
>> > > be used to form an ubi device that spans flash chip boundaries.
>> > > 
>> > > Both the primary chip (mtd label bmc) and alternate chip (mtd
>> > > label alt) will contain a complete copy of u-boot, and a redundant
>> > > environment, and kernel images.  For space reasons root squashfs
>> > > volumes may be on a different ubi device than the kernel.
>> > 

I just noticed the above is somewhat inconsistent.  We still plan
not to use the concat driver and instead have two volumes.  But since
we can only store one root today, we need more code to fall back
to the image fully contained within the good chip.

>> > Do we really need this though? It seems that the functionality that
>> > the hardware provides here (flipping to a completely independent
>> > backup flash) gives us a simple, fail-safe mechanism for disaster
>> > recovery.
>> > 
>> 
>> I think it's important to separate out failure cases and recovery actions
>> for different users and machine classes.  Of the 11 machine dts in the
>> current openbmc/linux dev-4.10 branch, 2 use other layouts, and of the
>> remaining 9, only the three IBM HPC nodes (firestone, garrison, and
>> witherspoon) have populated dual bmc flash devices.  Unless there are
>> other machines that populate dual flash without showing them in the
>> current dts files (firestone and garrison don't show this in the present
>> device trees).
>> 
>> That said, using a totally independent flash gives the best chance
>> of recovery, at the largest expense (double the storage cost plus the
>> board space).
>> 
>> In the current hardware, there is not space for two complete code 
>> stacks on each chip.  There is an issue open suggesting the file 
>> system shrink to a target that may allow this goal.
>> 
>> (thoughts continue after more quotes)
>> 
>> 
>> > Using some components from "primary" and some from "secondary" seems
>> > like an invitation for incompatibility issues.
>> > 
>> 
>> I wasn't aware until I asked in last Friday's code review/merge
>> party^Wmeeting what was planned in detail.
>> 
>> The current plans will always use a kernel and rofs from the same
>> yocto build.  The two images will be stored with the same image hash
>> name.  The question came up how to transition from one image to the
>> next, and for the time being, all kernels will be placed on both the
>> primary and alternate ubi devices, allowing any u-boot to start
>> any kernel.
>> 
>> Further discussion below.

>> 
>> > If we do continue with this approach, how will it interact with the
>> > auto-fallback via chip-select?
>> > 
>> 
>> Yes.  I haven't fully developed my plans, and so I have not written
>> a full proposal yet.  But since (1) users will be confused if the
>> contents of the primary and alternate change because the secondary
>> boot code was triggered on the 2500 and (2) users of the 2400 will
>> not be able to access the primary if the alternate chip is booted,
>> I am making the following proposal:
>> 
>> Before linux will probe the FMC, it will clear all boot watchdog 
>> secondary source selects, ensuring the labels in the device tree 
>> correspond to the physical chip selects and hence placement on the 
>> board.  The kernel command line will assign ubi device numbers 
>> to all bmc ubi storage, and these will not change based on which 
>> chip was selected to load u-boot and its environment.
>> 
>> [We can decide to clear the secondary source in u-boot but that 
>> would require dual flash support to be implemented and tested in 
>> u-boot.  In addition the environment code would need to be updated 
>> to recognise, remember, and react to the secondary source select 
>> at least to prevent saveenv on the alt u-boot writing the primary 
>> environment.]
>> 
>> The code update activation will be responsible to update both 
>> u-boot environments.
>
>By 'both u-boot environments' do you mean on the primary and alternative chips?

Yes

>Or both flash erase blocks reserved for the environment on one chip?

No, the fw_env command and u-boot both erase and program one 
copy before starting to update the secondary copy, and use a 
crc32 to check if a copy was fully written.
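
A simplified model of that ordering is sketched below.  It is not the
real u-boot environment layout: the dict fields, the serial counter,
and the variable names are assumptions made for illustration.  The
point is only that rewriting one copy at a time leaves a copy with a
good crc32 at every step.

```python
import zlib

# Sketch only: each environment copy is a dict with hypothetical fields.

def make_copy(variables, serial):
    """Serialise name=value pairs (NUL separated) and record their CRC32."""
    pairs = sorted(variables.items())
    data = b"".join(f"{k}={v}".encode() + b"\0" for k, v in pairs) + b"\0"
    return {"crc": zlib.crc32(data), "serial": serial, "data": data}


def is_valid(copy):
    """A copy counts as fully written only if its data matches its CRC."""
    return zlib.crc32(copy["data"]) == copy["crc"]


def select_env(copies):
    """Pick the valid copy with the highest serial, or None if none is valid."""
    valid = [c for c in copies if is_valid(c)]
    return max(valid, key=lambda c: c["serial"]) if valid else None


def update_env(copies, variables, fail_at=None):
    """Rewrite copy 0, then copy 1; `fail_at` simulates power loss in that copy."""
    serial = max((c["serial"] for c in copies if is_valid(c)), default=0) + 1
    for index in range(len(copies)):
        fresh = make_copy(variables, serial)
        if fail_at == index:
            # Torn write: the last byte was never programmed (erased = 0xFF).
            fresh["data"] = fresh["data"][:-1] + b"\xff"
            copies[index] = fresh
            return
        copies[index] = fresh


old = make_copy({"kernelname": "kernel-aaa"}, 1)   # hypothetical variable name
copies = [dict(old), dict(old)]
update_env(copies, {"kernelname": "kernel-bbb"}, fail_at=0)
print(select_env(copies)["data"])   # still the old environment
```

If the interruption hits the second copy instead, the first copy
already holds the new environment with a valid crc and is selected.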

>
>If the former, this seems like dangerous territory for a fallback/recovery
>mechanism. What if code update activation corrupts the environment? We would
>corrupt the primary and alternative boot configuration. I guess we fall-back to
>the backup environment on the primary somehow? The backup environment would
>have to be modified at some point as well, as otherwise the images it points to
>might not correspond to what was intended?
>

Simple answer: code update will fully write the well-defined variables
that it "owns", and the default env bootcmd will take these variables
and boot.

Original answer:

First of all, there is no hardware fallback to switch from the
secondary boot source back to the first boot on any failure.  This
will require new code, which means we at least had to start u-boot.

Second of all, today we do not have space for two copies of the root
file system in one chip.  That is still a goal.  Until it is reached,
we have a window where one chip will not be bootable while the file
system image is being replaced on that chip.

So it comes down to trade-offs.  How do we make use of the second chip?

- We could disallow all updates.
- We could declare the alternate chip golden.
- We could create a smaller fall-back that only supports code update.
- We can not update the primary while running directly from it
- We can enforce an update mode precondition, where the update
- We could force boot to alternate flash before writing the primary
- We could copy the primary fs into ram before writing new content

The proposal / current and pending code design says: Either the
primary or the secondary u-boot will be able to load a functional
bmc.  There is never a window where a volume should be missing, only
a period of time where both will rely on the same ubi device and
volumes.  Factory will ship with two copies.  Field updates will
maintain the last booted image and a new image.

Having the second chip gives us additional space, but we are close.
The current proposal handles windows during code update but not
all cases where code writes to errant flash sectors.  But I don't
know how to protect the alternate chip without hiding it from the
kernel.  I am not sure I know how to test that it works and won't
be subject to the same issue that caused the primary to fail.

Enhanced u-boot boot scripts could fall back to the root and 
kernel on the remaining chip.
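
The selection such a script would make might look like the sketch
below, written in Python only to show the logic (a real version would
be u-boot script).  The chip labels bmc and alt come from the
proposal; the "kernel-<hash>" and "rofs-<hash>" volume names and the
chip_ok health flags are assumptions for the example.

```python
# Sketch only: picks a kernel and root file system that share an image hash,
# using only chips believed to be good.  Volume naming is hypothetical.

def choose_boot(volumes, preferred_hash, chip_ok):
    """Return (kernel_chip, rofs_chip, image_hash), or None if nothing boots."""

    def chips_with(name):
        return [c for c in ("bmc", "alt")
                if chip_ok.get(c) and name in volumes.get(c, set())]

    known = {v.split("-", 1)[1] for vols in volumes.values() for v in vols}
    candidates = [preferred_hash] + sorted(known - {preferred_hash})
    for image_hash in candidates:
        kernels = chips_with(f"kernel-{image_hash}")
        roots = chips_with(f"rofs-{image_hash}")
        if kernels and roots:
            return kernels[0], roots[0], image_hash
    return None


# All kernels on both chips; each chip has room for only one root.
volumes = {
    "bmc": {"kernel-aaa", "kernel-bbb", "rofs-bbb"},
    "alt": {"kernel-aaa", "kernel-bbb", "rofs-aaa"},
}
print(choose_boot(volumes, "bbb", {"bmc": True, "alt": True}))
print(choose_boot(volumes, "bbb", {"bmc": False, "alt": True}))
```

With both chips good the preferred image boots from the primary; with
the primary marked bad, the only complete pair left is the older image
on the alternate chip.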

Remember there are a limited number of systems that have alternate
chips.  This design allows single large flash chip systems to
have multiple images.  (It also allows systems to spill the root
file system into the host chip if its host access policies allow.)

I need to send before I write another tome.

milton