New design for file system layout and code update

Milton Miller II miltonm at us.ibm.com
Tue Aug 22 12:25:21 AEST 2017


[Sorry for the ever shorter lines, but I don't see a nice !fmt
in this mailer's UI.  I tried to add trailing spaces like quoted-
printable to the end of lines for flow, if they survive ...]

On 08/21/2017 around 06:22AM in some timezone, Joel Stanley wrote:
>
>Hi Milton,
>
>Thanks for the write-up. We weren't yet looking for something as
>comprehensive, but thanks for the attention to detail here. Before
>we get too far ahead though, there are a few questions about the
>higher-level design (quotes are from your doc):

Part of the write-up was to provide background and discussion of 
what I had researched.  Part of the write-up was documentation 
that should have been written last April before openbmc-1.0 and 
should be added to the openbmc/docs with some further editing.

And I wanted to provide background and a glossary at the end to 
try and invite all members of the list to participate in any 
discussion.



>
>Firstly, a clarification about the kernel: your doc mentions that
>it'll be present (as the raw kernel binary) in a UBI volume, but
>the discussion today you mentioned that it will be a raw flash area.
>Can you clear that up?

I don't remember my exact words on the call last night, but I  
remember talking about putting the FIT image in a bare UBI static 
partition instead of as a file in a ubifs instance.  I probably 
used the terminology "raw ubi partition".

As stated, the current plan does not require any UBI volume to 
be written, only read.  The only raw flash space in the proposed plan 
is for u-boot itself and two 64k sectors for its environment space.
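To make that concrete, here is a sketch of what the raw-flash
portion of the layout could look like as a device-tree fragment.
The partition names, offsets, and sizes are illustrative
assumptions on my part, not the final layout:

```dts
/* Illustrative only: raw flash is limited to u-boot plus two 64k
 * environment sectors; everything else is one large region handed
 * to UBI.  Offsets and sizes are assumptions for a 32MB chip. */
&fmc {
	flash@0 {
		partitions {
			compatible = "fixed-partitions";
			#address-cells = <1>;
			#size-cells = <1>;

			u-boot@0 {
				reg = <0x0 0x60000>;       /* assumed u-boot size */
				label = "u-boot";
			};
			u-boot-env@60000 {
				reg = <0x60000 0x20000>;   /* two 64k env sectors */
				label = "u-boot-env";
			};
			ubi@80000 {
				reg = <0x80000 0x1f80000>; /* remainder to UBI */
				label = "ubi";
			};
		};
	};
};
```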

>
>> All distribution images (except Das U-boot and its environment)
>> will be stored in separate ubi volumes named by their image type.
>> A hash of version identifiers will be generated during deployment
>> to make unique image names.
>>
>> The kernel and its device tree will be stored in a fit image
>> in a ubi volume.
>
>Am I correct in assuming that this allows us to be flexible in the
>partition sizing? ie, we can perform single-volume updates at runtime
>that may be larger than the space originally allocated for that
>volume?

Yes.  To me, that is the primary motivation for enabling UBI volumes.

>
>[ie, with the partitioned mtd scheme that we have now, increasing
>a partition beyond its original size isn't possible without
>having to shuffle the data around]
>
>If so, this would be a fairly significant reason for adopting the
>UBI-volume-based approach.
>

I'm glad you agree.

Without this, we could shift space between the kernel FIT and the 
read-only file system, but there is no provision to grow or shrink 
a jffs2 volume that is used for read-write.  And moving the boundary 
is difficult because the partition start addresses are in the dts, 
and the separate images either won't fit or would need to be shifted 
across the old boundary as we distribute the image in pieces 
alongside the old binaries.

>> To support a total flash chip failure, each flash chip will
>> contain an independent ubi device.  The mtd_concat driver will not
>> be used to form an ubi device that spans flash chip boundaries.
>>
>> Both the primary chip (mtd label bmc) and alternate chip (mtd
>> label alt) will contain a complete copy of u-boot, and a redundant
>> environment, and kernel images.  For space reasons root squashfs
>> volumes may be on a different ubi device than the kernel.
>
>Do we really need this though? It seems that the functionality that
>the
>hardware provides here (flipping to a completely independent backup
>flash)
>gives us a simple, fail-safe mechanism for disaster recovery.
>

I think it's important to separate out failure cases and recovery 
actions for different users and machine classes.  Of the 11 machine 
dts files in the current openbmc/linux dev-4.10 branch, 2 use other 
layouts, and of the remaining 9, only the three IBM HPC nodes 
(firestone, garrison, and witherspoon) have populated dual bmc 
flash devices, unless there are other machines that populate dual 
flash without showing it in the current dts files (firestone and 
garrison do not show it in the present device trees either).

That said, using a totally independent flash gives the best chance 
of recovery, at the largest expense (double the storage cost plus the 
board space).

In the current hardware, there is not space for two complete code 
stacks on each chip.  There is an issue open suggesting the file 
system be shrunk to a target size that may allow this goal.

(thoughts continue after more quotes)


>Using some components from "primary" and some from "secondary" seems
>like an invitation for incompatibility issues.
>

I wasn't aware of what was planned in detail until I asked in 
last Friday's code review/merge party^Wmeeting.

The current plans will always use a kernel and rofs from the same 
yocto build.  The two images will be stored with the same image hash 
name.  The question came up how to transition from one image to the 
next, and for the time being, all kernels will be placed on both the 
primary and alternate ubi devices, allowing any u-boot to start 
any kernel.

Further discussion below.

>If we do continue with this approach, how will it interact with the
>auto-fallback via chip-select?
>

Yes.  I haven't fully developed my plans, and so I have not written 
a full proposal yet.  But since (1) users will be confused if the 
contents of the primary and alternate change because the secondary 
boot code was triggered on the 2500, and (2) users of the 2400 will 
not be able to access the primary if the alternate chip is booted, 
I am making the following proposal:

Before Linux probes the FMC, it will clear all boot watchdog 
secondary source selects, ensuring the labels in the device tree 
correspond to the physical chip selects and hence placement on the 
board.  The kernel command line will assign ubi device numbers 
to all bmc ubi storage, and these will not change based on which 
chip was selected to load u-boot and its environment.
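For illustration, such a command line might look like the
following.  The volume name and numbers, and the use of ubiblock
to expose the squashfs volume as a block device, are assumptions
on my part rather than the agreed layout:

```
console=ttyS4,115200 ubi.mtd=bmc ubi.mtd=alt \
    ubi.block=0,rofs-abc123 root=/dev/ubiblock0_3 \
    rootfstype=squashfs ro
```

Here `ubi.mtd=` attaches each named mtd partition as its own ubi
device in a fixed order, so the device numbers do not depend on
which chip supplied u-boot.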

[We can decide to clear the secondary source in u-boot but that 
would require dual flash support to be implemented and tested in 
u-boot.  In addition the environment code would need to be updated 
to recognise, remember, and react to the secondary source select 
at least to prevent saveenv on the alt u-boot writing the primary 
environment.]

The code update activation will be responsible for updating both 
u-boot environments.  Each environment will contain a set of 
variables that will cause u-boot to load a named kernel by 
volume and supply that kernel with a command line to find the 
squashfs by ubi device and volume number (because that is what 
can be supported with the current u-boot and kernel command 
line).
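As a sketch, that set of environment variables might look
something like this.  The variable names, volume names, and
console device are hypothetical; only the `ubi` and `bootm`
commands themselves are standard u-boot:

```
kernelname=kernel-abc123
set_bootargs=setenv bootargs console=ttyS4,115200 ubi.mtd=bmc ubi.block=0,rofs-abc123 root=/dev/ubiblock0_3 rootfstype=squashfs ro
bootcmd=ubi part bmc; ubi read ${loadaddr} ${kernelname}; run set_bootargs; bootm ${loadaddr}
```

Code update would rewrite `kernelname` and the rootfs references
in both environments as part of activation.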

The limitation here is that if the primary chip becomes unreadable 
and contains the root volume for the alternate chip, the alternate 
side will also fail to boot.  This could be improved with a TBD 
mechanism where u-boot would try the kernel matching the rootfs 
volume that is stored on its flash.

There is a similar window whenever the alternate flash rootfs 
image is being replaced until there is space to hold two complete 
images (current and pending) on one flash chip (or a full and 
recovery image, eg that only includes code-update and very limited 
system management).

>> Since all binaries will exist in the root file system, systemd can
>> be started directly from the squashfs without an intermediate
>> initramfs.  Eliminating the initramfs will remove the requirement
>> to build and store it; however, it also requires the bootloader
>> to specify how to locate the root filesystem.
>
>So we'd also be moving from the current initramfs-based flashing
>mechanism to something that requires the rootfs to be working, right?

I think discussion of update and recovery scenarios is a worthwhile 
topic.  I included the glossary to have defined terms to facilitate 
such discussion.  [That said, a review of the glossary was skipped 
in the rush to publish the tome].

As I mentioned in the alternatives section, the xz-squashfs image 
can be loaded into ram, either as an initrd to be mounted on /dev/ram0, 
or with an initramfs script (similar to the current obmc-phosphor-initfs 
obmc-phosphor-init.sh) that would use an image in rootfs via /dev/loop.

There are some technical details to work through to get yocto to 
build such a recovery image, but it is quite reasonable.  And unlike 
the cpio.xz ramdisk, the squashfs will be decompressed on demand 
instead of taking 30-90 seconds with no console output.

>That is a bit concerning, as it means that more infrastructure
>needs to be working & correct to be able to boot *anything*,
>including recovery. Or have I read the design incorrectly here?
>

The current code proposal bypasses the initramfs and requires 
finding the root filesystem.

For human based recovery, from a u-boot prompt:

(1) u-boot provides a ubi command with an info subcommand 
that can be used to show ubi volume name and volume numbers.

(2) The process to extract the information and build a 
boot command and bootargs can be documented, including the 
intermediate variables in the proposed u-boot environment.  
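A hand recovery session along those lines might look roughly like
the following.  The volume names are examples, and the exact
`ubi info layout` output format varies by u-boot version:

```
=> ubi part bmc
=> ubi info layout        # list volume names and numbers
=> ubi read ${loadaddr} kernel-abc123
=> setenv bootargs console=ttyS4,115200 ubi.mtd=bmc
     ubi.block=0,rofs-abc123 root=/dev/ubiblock0_3
     rootfstype=squashfs ro
=> bootm ${loadaddr}
```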

Alternatively,

(3) The existing phosphor-initfs init.sh script can be modified 
to look for ubi volumes by name in addition to mtd partitions 
by name.  The images could be located from its rich recovery 
environment (which I hint at in the documentation part of my 
mega-post, but have never fully documented).
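The ubi-volume-by-name lookup that (3) would add to init.sh can be
sketched as below.  The sysfs layout (/sys/class/ubi/ubiX_Y/name)
is real; the helper name find_ubi_vol and the demonstration tree
are mine:

```shell
#!/bin/sh
# Sketch: map a UBI volume name to its ubiX_Y device/volume pair by
# scanning a sysfs-style tree, as an init.sh fragment might.
find_ubi_vol() {
    # $1 = ubi class directory, $2 = volume name to find
    for dir in "$1"/ubi[0-9]*_[0-9]*; do
        [ -f "$dir/name" ] || continue
        if [ "$(cat "$dir/name")" = "$2" ]; then
            basename "$dir"      # e.g. ubi0_3
            return 0
        fi
    done
    return 1
}

# Demonstrate against a temporary fake sysfs tree (real use would
# pass /sys/class/ubi).
tmp=$(mktemp -d)
mkdir -p "$tmp/ubi0_0" "$tmp/ubi0_3"
printf 'kernel-abc123' > "$tmp/ubi0_0/name"
printf 'rofs-abc123'   > "$tmp/ubi0_3/name"
find_ubi_vol "$tmp" rofs-abc123
rm -rf "$tmp"
```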

(4) Alternatively, a recovery FIT can be built with an initrd 
or initramfs including the squashfs root image.  This image 
could be set up as a netboot fallback, or even fetched by 
u-boot from a deployment server (eg attempt dhcp similar to PXE).

In the interest of full disclosure...

Not fully documented but observed in testing:

If var fails to mount or the /etc overlay fails to mount for any 
reason, the network is not configured and the console getty 
does not start.  

This implies the alternate boot watchdog should not be stopped
in this case.



>How about something like:
>
> - completely independent primary & backup images, and use the
>   BMC's watchdog to revert to the backup on severe boot problems
>
> - kernel and initramfs as (raw) UBI volumes, with u-boot having
>   support to boot from those
>
> - rootfs (ro) + /var (rw) [plus whatever else is required] as
>   UBI volumes, likely with squashfs for ro and jffs2 for rw.
>
>That means:
>
> - we still have a fairly simple path to boot to initramfs, which
>   allows for system recovery
>
> - kernel, initramfs and filesystem sizes are not fixed, and can be
>   modified during firmware update
>
>How does that sound?


Let me highlight what I see as the differences.  I'm going to go
in reverse order for a moment:

(1) jffs2 vs ubifs on the ubirw volume
(2) kernel and initramfs are in separate ubi volumes.
(3) there is an initramfs
(4) flashes are independent
(5) further discussion on needed extension for (4)

Regarding (1):

I don't have a strong feeling between ubifs and jffs2.  I will 
note that gluebi, which provides mtd emulation over ubi, 
presents an odd-sized erase block to its customer (64k-2*64 in 
our case).  As I mentioned, jffs2 has two non-standard compression 
options that may or may not reduce size, including storing 
compressed data.  A concern would be that the nonstandard block 
size, which few users test, could expose some hole or boundary 
condition, while ubifs on ubi would have a larger user base.

(2)
I am not aware of any advantage to keeping separate kernel 
and initramfs images vs the current layout combining them into 
a FIT image.  (Ok, it might be easier to add a kernel built 
outside yocto, but u-boot can load the dts and/or initramfs 
images independently from a FIT image.  And this would only 
apply to kernel hackers vs consumers of the build system.)

As long as we plan to distribute kernel and rofs as a 
package I see no reason not to include an initramfs, should 
one be needed, into a FIT image.

[FIT is a flattened device tree binary (blob) containing 
pointers to data and metadata, including signatures.]

(3) I haven't talked to Patrick as to why, but I know he 
has requested several times that the initramfs be removed.

One motivation can be that our build currently requires 
two complete copies of busybox and its linked libraries 
be stored, using MB of space even with XZ compression.  

Another may be the presumption of boot time savings, but 
I never saw that as an issue and have not heard it expressed.

Both of these could be addressed with a custom binary 
(I would write in C) that implements the search logic 
currently in the init script.

That said, the only reason I have not proposed making 
the new var a writeable file system and etc an overlay 
is the time to implement the proposal.  I would start by 
making the necessary changes to obmc-phosphor-init.sh 
and then look at what it would take to choose the 
current overlay vs the new via a machine feature (and 
therefore init option via the base options file).

One thing not mentioned is the ability of ubi to rename 
volumes atomically, including swapping the names of two 
volumes.  The volume number is stored throughout the 
flash, but the name is only in the (doubly stored) 
volume table.  But until the kernel can mount 
a squashfs by volume name, it's not useful for 
implementing a constant "name is primary, alt-name is 
secondary" policy.
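For reference, mtd-utils provides ubirename, which performs the
(multi-)rename as one atomic operation; swapping primary and
backup would look like the following (the volume names here are
assumed):

```
# Swap two volume names atomically in a single operation:
ubirename /dev/ubi0 rofs-primary rofs-backup rofs-backup rofs-primary
```

Because both renames commit together, a power loss leaves either
the old or the new naming, never a half-swapped state.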

I do have some concerns with mounting etc after systemd 
starts, but I think the init script can be invoked 
before systemd from within the squashfs image (either 
init= or linuxrc in initrd for recovery).

I think managing the rwfs is an area that deserves 
attention and enhancement.  The existing initramfs 
code is underutilized according to its author :-).

(4)
I think most of the difference is in how the rootfs is 
located;  either it is programmed by code update into 
the environment (and if it doesn't work we fix it), or 
the alternate boot doesn't rely on the alternate chip 
at all.

If we don't reset the secondary boot flag then all 
the names are backwards with respect to the device 
tree, and we would need two dts images on the 2500.  
It also means we can't support any access to the 
alternate chip from the 2400, which probably means 
we can't even check its version, as access would 
conflict with access to the primary chip.  I 
strongly believe that we need to clear the boot 
flag by the time the Linux kernel flash driver 
probes.

If this is given, then it comes down to failure 
recovery scenarios.

The current strategy is each u-boot can read either 
the previous or new env because of the redundant 
copy in each flash.

The current and previous or current and new kernel 
volumes will be available to u-boot.  It comes 
down to finding a matching root file system.

The currently running code will be protected from 
deletion, so a valid image will exist; it may just 
be stored across two flash chips.  The current 
discussion was to update the alternate chip env to 
boot the current image, and then move the newly 
selected image to be selected on the primary boot 
chip.


(5)
The other scenario is that the primary ubi device 
becomes unusable, either due to corruption or 
flash chip failure.  With further enhancement 
to the ubi boot script, I anticipate that a 
secondary kernel and rootfs could be stored,  
enabling either the primary or alternate u-boot 
to boot either the highest priority or 
the alternate, second-priority image.

I think this is needed for supporting recovery 
for systems without a second bmc flash chip.

I think this could be automated by exposing 
the watchdog activation count(s) to the u-boot 
environment and allowing the boot script to 
take action based on the counts.  An earlier 
idea was to have a countdown variable that 
selects an alternate boot until it reaches 
zero that is stored in the environment.
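That countdown idea could be sketched in the u-boot environment
roughly as below.  All the variable and script names are
hypothetical, and u-boot's existing bootcount/bootlimit feature
could serve much the same purpose:

```
boot_tries=3
bootcmd=if test ${boot_tries} -gt 0; then setexpr boot_tries ${boot_tries} - 1; saveenv; run boot_primary; else run boot_alternate; fi
```

A successful boot would reset boot_tries from Linux (or via the
watchdog handling), while repeated failures drain it to zero and 
select the alternate image.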

If this is implemented, then the current 
proposal of storing the new image in the 
alternate ubi device can be argued to increase 
robustness, as either the current or secondary 
root will be on the alternate chip, while during 
update the alternate chip would have no valid 
file system with your proposal (an alt that 
is golden would be more robust).

That said, there is a lot more to talk about 
regarding full alternate boot image support 
and what that means.  The in-process code aims 
to not do anything bad if the secondary boot 
source is selected on an ast2500, but this has not 
been tested and may be missing pieces.

The goal for the current sprint is to allow the 
alternate chip to boot and be able to repair 
the primary image to a booting state in the 
absence of hardware failures, in a deployed 
environment (without serial console attached).

There are several details that are needed 
including keeping a watchdog that will select 
the alternate boot running, and only pinging 
it if the user is able to interact via the 
network with the current image.  This may 
require updates to the kernel and u-boot to 
leave a watchdog running in addition to 
clearing the boot source select.  I am not 
aware of any test of booting from the alternate 
chip intentionally; I know several boots 
have worked with the old layout :-).


======

The initramfs today provides several hooks to 
dump to the serial port a sulogin (root login) 
request if anything goes wrong.

So, how much of this is making sure we find and mount 
the rootfs, how much is a developer recovering a system 
with the help of the serial console, and how much is 
something else?

>
>Cheers,
>
>Joel
>
>

milton


