New design for file system layout and code update
Milton Miller II
miltonm at us.ibm.com
Tue Aug 22 12:25:21 AEST 2017
[Sorry for the ever shorter lines, but I don't see a nice !fmt
in this mailer's UI. I tried to add trailing spaces like quoted-
printable to the end of lines for flow, if they survive ...]
On 08/21/2017 around 06:22AM in some timezone, Joel Stanley wrote:
>
>Hi Milton,
>
>Thanks for the write-up. We weren't yet looking for something as
>comprehensive, but thanks for the attention to detail here. Before
>we get too far ahead though, there are a few questions about the
>higher-level design (quotes are from your doc):
Part of the write-up was to provide background and discussion of
what I had researched. Part of it was documentation that should
have been written last April, before openbmc-1.0, and that should
be added to openbmc/docs with some further editing. And I wanted
to provide background and a glossary at the end to try to invite
all members of the list to participate in any discussion.
>
>Firstly, a clarification about the kernel: your doc mentions that
>it'll be present (as the raw kernel binary) in a UBI volume, but
>in the discussion today you mentioned that it will be a raw flash
>area. Can you clear that up?
I don't remember my exact words on the call last night, but I
remember talking about putting the FIT image in a bare, static
UBI volume instead of a file in a ubifs instance. I probably
used the terminology raw ubi partition.
As stated, the current plan does not require any UBI volume to
be written, only read. The only raw flash space in the proposed plan
is for u-boot itself and two 64k sectors for its environment space.
>
>> All distribution images (except Das U-boot and its environment)
>> will be stored in separate ubi volumes named by their image type.
>> A hash of version identifiers will be generated during deployment
>> to make unique image names.
>>
>> The kernel and its device tree will be stored in a fit image
>> in a ubi volume.
>
>Am I correct in assuming that this allows us to be flexible in the
>partition sizing? ie, we can perform single-volume updates at runtime
>that may be larger than the space originally allocated for that
>volume?
Yes. To me, that is the primary motivation for enabling UBI volumes.
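For illustration, resizing volumes at runtime is already exposed
by the standard mtd-utils; a sketch, with assumed volume names
and sizes:

  ubirsvol /dev/ubi0 -N rofs -s 24MiB      # grow or shrink a volume in place
  ubimkvol /dev/ubi0 -N rofs-new -s 28MiB  # or create a new, larger volume

The only hard limit is the total space available on the ubi
device, not any per-volume boundary.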
>
>[ie, with the partitioned mtd scheme that we have now, increasing
>a partition beyond its original size isn't possible without
>having to shuffle the data around]
>
>If so, this would be a fairly significant reason for adopting the
>UBI-volume-based approach.
>
I'm glad you agree.
Without this, we could shift space between the kernel FIT and
the read-only file system, but there is no provision to grow or
shrink a jffs2 volume that is used for read-write. And moving
the boundary is difficult because the partition starts are in
the dts, and because we distribute the image in pieces, the
separate images either won't fit or would need to be shifted
across the old boundary while the old binaries are in place.
>> To support a total flash chip failure, each flash chip will
>> contain an independent ubi device. The mtd_concat driver will
>> not be used to form a ubi device that spans flash chip
>> boundaries.
>>
>> Both the primary chip (mtd label bmc) and alternate chip (mtd
>> label alt) will contain a complete copy of u-boot, a redundant
>> environment, and kernel images. For space reasons root squashfs
>> volumes may be on a different ubi device than the kernel.
>
>Do we really need this though? It seems that the functionality
>that the hardware provides here (flipping to a completely
>independent backup flash) gives us a simple, fail-safe mechanism
>for disaster recovery.
>
I think it's important to separate out failure cases and recovery
actions for different users and machine classes. Of the 11 machine
dts files in the current openbmc/linux dev-4.10 branch, 2 use other
layouts, and of the remaining 9, only the three IBM HPC nodes
(firestone, garrison, and witherspoon) have populated dual bmc
flash devices, unless there are other machines that populate dual
flash without showing it in the current dts files (firestone and
garrison don't show it in their present device trees).
That said, using a totally independent flash gives the best chance
of recovery, at the largest expense (double the storage cost plus the
board space).
In the current hardware, there is not space for two complete code
stacks on each chip. There is an open issue suggesting the file
system shrink to a target size that may allow this goal.
(thoughts continue after more quotes)
>Using some components from "primary" and some from "secondary" seems
>like an invitation for incompatibility issues.
>
I wasn't aware of what was planned, in detail, until I asked in
last Friday's code review/merge party^Wmeeting.
The current plans will always use a kernel and rofs from the same
yocto build. The two images will be stored with the same image
hash name. The question came up of how to transition from one
image to the next, and for the time being, all kernels will be
placed on both the primary and alternate ubi devices, allowing
any u-boot to start any kernel.
Further discussion below.
>If we do continue with this approach, how will it interact with the
>auto-fallback via chip-select?
>
Yes. I haven't fully developed my plans, and so I have not
written a full proposal yet. But since (1) users will be confused
if the contents of the primary and alternate change because the
secondary boot source was triggered on the 2500, and (2) users of
the 2400 will not be able to access the primary if the alternate
chip is booted, I am making the following proposal:
Before Linux probes the FMC, it will clear all boot watchdog
secondary source selects, ensuring the labels in the device tree
correspond to the physical chip selects and hence placement on
the board. The kernel command line will assign ubi device numbers
to all bmc ubi storage, and these will not change based on which
chip was selected to load u-boot and its environment.
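As a sketch (the exact values are illustrative, not a final
command line), pinning the device numbers might look like:

  ubi.mtd=bmc,0,0,0 ubi.mtd=alt,0,0,1

The fourth field of ubi.mtd requests a fixed ubi device number,
so the partition labeled bmc is always ubi0 and alt is always
ubi1, no matter which chip u-boot itself came from.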
[We can decide to clear the secondary source in u-boot, but that
would require dual flash support to be implemented and tested in
u-boot. In addition, the environment code would need to be updated
to recognise, remember, and react to the secondary source select,
at least to prevent a saveenv on the alt u-boot from writing the
primary environment.]
The code update activation will be responsible for updating both
u-boot environments. Each environment will contain a set of
variables that will cause u-boot to load a named kernel by volume
and supply that kernel with a command line to find the squashfs
by ubi device and volume number (because that is what can be
supported with the current u-boot and kernel command line).
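A minimal sketch of such an environment (every variable and
volume name here is hypothetical, not the final layout):

  kernelvol=kernel-4f2a
  rootdev=0
  rootvol=2
  loadkernel=ubi part bmc && ubi read ${loadaddr} ${kernelvol}
  setroot=setenv bootargs ${bootargs} ubi.block=${rootdev},${rootvol} root=/dev/ubiblock${rootdev}_${rootvol} rootfstype=squashfs
  bootcmd=run loadkernel && run setroot && bootm ${loadaddr}

Code update would rewrite kernelvol, rootdev, and rootvol on each
activation; u-boot never needs to write the ubi volumes, only its
environment sectors.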
The limitation here is that if the primary chip becomes unreadable
and contains the root volume for the alternate chip, the alternate
side will also fail to boot. This could be improved with a TBD
mechanism where u-boot would try the kernel matching the rootfs
volume that is stored on its flash.
There is a similar window whenever the alternate flash rootfs
image is being replaced, until there is space to hold two complete
images (current and pending) on one flash chip (or a full image
plus a recovery image, eg one that only includes code-update and
very limited system management).
>> Since all binaries will exist in the root file system, systemd
>> can be started directly from the squashfs without an
>> intermediate initramfs. Eliminating the initramfs will remove
>> the requirement to build and store it; however, it also
>> requires the bootloader to specify how to locate the root
>> filesystem.
>
>So we'd also be moving from the current initramfs-based flashing
>mechanism to something that requires the rootfs to be working, right?
I think discussion of update and recovery scenarios is a worthwhile
topic. I included the glossary to have defined terms to facilitate
such discussion. [That said, a review of the glossary was skipped
in the rush to publish the tome].
As I mentioned in the alternatives section, the xz-squashfs image
can be loaded into ram, either as an initrd to be mounted on
/dev/ram0, or with an initramfs script (similar to the current
obmc-phosphor-initfs obmc-phosphor-init.sh) that would use an
image in rootfs via /dev/loop. There are some technical details
to work through to get yocto to build such a recovery image, but
it is quite reasonable. And unlike the cpio.xz ramdisk, the
squashfs will be decompressed on demand instead of taking 30-90
seconds with no console output.
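For the loop-mount variant, the heart of the script could be as
small as something like this (paths assumed for illustration):

  # after the recovery image is available in RAM or tmpfs:
  mount -t squashfs -o loop /run/image-rofs /mnt/rofs

busybox mount will allocate the loop device itself when given
-o loop.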
>That is a bit concerning, as it means that more infrastructure
>needs to be working & correct to be able to boot *anything*,
>including recovery. Or have I read the design incorrectly here?
>
The current code proposal bypasses the initramfs and requires
finding the root filesystem.
For human-based recovery, from a u-boot prompt:
(1) u-boot provides a ubi command with an info subcommand
that can be used to show ubi volume names and numbers.
(2) The process to extract the information and build a
boot command and bootargs can be documented, including the
intermediate variables in the proposed u-boot environment
(a sketch follows this list).
Alternatively,
(3) The existing phosphor-initfs init.sh script can be modified
to look for ubi volumes by name in addition to mtd partitions
by name. The images could be located from its rich recovery
environment (which I hint at in the documentation part of my
mega-post, but have never fully documented).
(4) Alternatively, a recovery FIT can be built with an initrd
or initramfs including the squashfs root image. This image
could be set up as a netboot fallback or even be fetched by
u-boot from a deployment server (eg attempt dhcp similar to PXE).
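To make (1) and (2) concrete, a hypothetical recovery session at
the u-boot prompt might look like this (the volume name and
numbers are illustrative):

  => ubi part bmc
  => ubi info layout       # lists volume names, numbers, and sizes
  => ubi read ${loadaddr} kernel-4f2a
  => setenv bootargs console=ttyS4,115200 ubi.mtd=bmc,0,0,0 ubi.block=0,2 root=/dev/ubiblock0_2 rootfstype=squashfs
  => bootm ${loadaddr}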
In the interest of full disclosure...
Not fully documented, but observed in testing:
If var fails to mount, or the /etc overlay fails to mount for any
reason, the network is not configured and the console getty
does not start.
This implies the alternate boot watchdog should not be stopped
in this case.
>How about something like:
>
> - completely independent primary & backup images, and use the
> BMC's watchdog to revert to the backup on severe boot problems
>
> - kernel and initramfs as (raw) UBI volumes, with u-boot having
> support to boot from those
>
> - rootfs (ro) + /var (rw) [plus whatever else is required] as
> UBI volumes, likely with squashfs for ro and jffs2 for rw.
>
>That means:
>
> - we still have a fairly simple path to boot to initramfs, which
> allows for system recovery
>
> - kernel, initramfs and filesystem sizes are not fixed, and can be
> modified during firmware update
>
>How does that sound?
Let me highlight what I see as the differences. I'm going to go
in reverse order for a moment:
(1) jffs2 vs ubifs on the ubirw volume
(2) kernel and initramfs are in separate ubi volumes
(3) there is an initramfs
(4) flashes are independent
(5) further discussion on needed extensions for (4)
Regarding (1):
I don't have a strong feeling between ubifs and jffs2. I will
note that gluebi, which provides mtd emulation over ubi, presents
an odd-sized erase block to its customer (64k minus 2*64 bytes in
our case). As I mentioned, jffs2 has two non-standard compression
options that may or may not reduce the size of stored, already-
compressed data. A concern would be that, with few users testing
the nonstandard block size, some hole or boundary condition could
go unnoticed, while ubifs on ubi would have a larger user base.
(2)
I am not aware of any advantage to keeping separate kernel
and initramfs images vs the current layout combining them into
a FIT image. (Ok, it might be easier to add a kernel built
outside yocto, but u-boot can load the device tree and/or
initramfs images independently from a FIT or from separate
FIT images. And this would only apply to kernel hackers vs
consumers of the build system.)
As long as we plan to distribute kernel and rofs as a
package, I see no reason not to include an initramfs, should
one be needed, in a FIT image.
[FIT is a flat device tree binary (blob) containing
pointers to data and metadata, including signatures.]
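For reference, combining the pieces is a single mkimage step; a
sketch with assumed file names:

  mkimage -f fit.its fitImage  # build kernel+dtb(+initramfs) FIT
  mkimage -l fitImage          # list the subimages and signatures

The .its source names each subimage and the configurations that
tie them together, so adding an initramfs is one more node.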
(3) I haven't talked to Patrick as to why, but I know he
has requested several times that the initramfs be removed.
One motivation may be that our build currently requires
two complete copies of busybox and its linked libraries
be stored, using MBs of space even with XZ compression.
Another may be the presumption of boot-time savings, but
I never saw that as an issue and have not heard it expressed.
Both of these could be addressed with a custom binary
(I would write it in C) that implements the search logic
currently in the init script.
That said, the only reason I have not proposed making
the new var a writeable file system and etc an overlay
is the time to implement the proposal. I would start by
making the necessary changes to obmc-phosphor-init.sh
and then look at what it would take to choose between the
current overlay and the new scheme via a machine feature
(and therefore an init option via the base options file).
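A sketch of the mounts such an init change might perform (all
paths and volume names here are assumed, not decided):

  mount -t ubifs ubi0:rwfs /var
  mkdir -p /var/etc-upper /var/etc-work
  mount -t overlay -o lowerdir=/etc,upperdir=/var/etc-upper,workdir=/var/etc-work overlay /etc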
One thing not mentioned is the ability of ubi to
rename volumes atomically, including swapping the names
of two volumes. The volume number is stored throughout
the flash, but the name is only in the (doubly stored)
table of contents. But until the kernel can mount
a squashfs by name alone, this is not useful for
implementing a constant "name is primary, alt-name is
secondary" policy.
I do have some concerns with mounting /etc after systemd
starts, but I think the init script can be invoked
before systemd from within the squashfs image (either
via init= or as linuxrc in an initrd for recovery).
I think managing the rwfs is an area that deserves
attention and enhancement. The existing initramfs
code is underutilized, according to its author :-).
(4)
I think most of the difference is in how the rootfs is
located; either it is programmed by code update into
the environment (and if it doesn't work, we fix it), or
the alternate boot relies on nothing but the alternate
chip.
If we don't reset the secondary boot flag, then all
the names are backwards with respect to the device tree,
and we would either need two dts images on the 2500 or
live with the swapped labels. It also means we can't
support any access to the alternate chip from the 2400,
which probably means we can't even check its version, as
access would conflict with access to the primary chip.
I strongly believe that we need to clear the boot flag
before the Linux kernel flash driver probe.
If this is given, then it comes down to failure
recovery scenarios.
The current strategy is that each u-boot can read
either the previous or the new env because of the
redundant copy in each flash.
The current and previous, or current and new, kernel
volumes will be available to u-boot. It comes
down to finding a matching root file system.
The currently running code will be protected from
deletion, so a valid image will exist; it may just
be stored across two flash chips. The current
discussion was to update the alternate chip env to
boot the current image, and then move the newly
selected image to be selected on the primary boot
chip.
(5)
The other scenario is that the primary ubi device
becomes unusable, either due to corruption or
flash chip failure. With further enhancement
to the ubi boot script, I anticipate that a
secondary kernel and rootfs could be stored,
enabling either the primary or alternate u-boot
to boot either the highest-priority or the
alternate, second-priority image.
I think this is needed to support recovery
for systems without a second bmc flash chip.
I think this could be automated by exposing
the watchdog activation count(s) to the u-boot
environment and allowing the boot script to
take action based on the counts. An earlier
idea was to store a countdown variable in the
environment that selects an alternate boot
until it reaches zero.
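One way to sketch the countdown idea in the environment (the
variable names are hypothetical; u-boot's existing bootcount
feature is a close cousin):

  altcount=3
  bootcmd=if test ${altcount} -gt 0; then setexpr altcount ${altcount} - 1; saveenv; run altboot; else run priboot; fi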
If this is implemented, then the current
proposal of storing the new image in alternate
ubi devices can be argued to increase robustness,
as either the current or the secondary root
will be on the alternate chip, while during
update the alternate chip would have no valid
file system under your proposal (an alt that
is golden would be more robust).
That said, there is a lot more to talk about
regarding full alternate boot image support
and what that means. The in-process code aims
to not do anything bad if the secondary boot
source is selected on an ast2500, but this has
not been tested and may be missing pieces.
The goal for the current sprint is to allow the
alternate chip to boot and be able to repair
the primary image to a booting state in the
absence of hardware failures, in a deployed
environment (without serial console attached).
There are several details that are needed,
including keeping a watchdog running that will
select the alternate boot, and only pinging
it if the user is able to interact with the
current image via the network. This may
require updates to the kernel and u-boot to
leave a watchdog running in addition to
clearing the boot source select. I am not
aware of any intentional test of booting from
the alternate chip; I know several boots
have worked with the old layout :-).
======
The initramfs today provides several hooks to
dump a sulogin (root login) request to the
serial port if anything goes wrong.
So, how much of this is making sure we find and
mount the rootfs, how much is a developer
recovering a system with the help of the serial
console, and how much is something else?
>
>Cheers,
>
>Joel
>
>
milton