New design for file system layout and code update

Andrew Jeffery andrew at aj.id.au
Tue Aug 22 15:26:29 AEST 2017


On Tue, 2017-08-22 at 02:25 +0000, Milton Miller II wrote:
> [Sorry for the ever shorter lines, but I don't see a nice !fmt
> in this mailer's UI.  I tried to add trailing spaces like quoted-
> printable to the end of lines for flow, if they survive ...]
> 
> On 08/21/2017 around 06:22AM in some timezone, Joel Stanley wrote:
> > Hi Milton,
> > Thanks for the write-up. We weren't yet looking for something as
> > comprehensive,
> > but thanks for the attention to detail here. Before we get too far
> > ahead though,
> > there are a few questions about the higher-level design (quotes are
> > from your
> > doc):
> 
> Part of the write-up was to provide background and discussion of 
> what I had researched.  Part of the write-up was documentation 
> that should have been written last April before openbmc-1.0 and 
> should be added to the openbmc/docs with some further editing.
> 
> And I wanted to provide background and a glossary at the end to 
> try and invite all members of the list to participate in any 
> discussion.
> 
> 
> 
> > Firstly, a clarification about the kernel: your doc mentions that
> > it'll be present (as the raw kernel binary) in a UBI volume, but
> > the discussion today you mentioned that it will be a raw flash area.
> > Can you clear that up?
> 
> I don't remember my exact words on the call last night, but I 
> remember talking about putting the FIT image in a bare UBI static 
> partition instead of a file in an ubifs instance.  I probably 
> used the terminology "raw ubi partition".
> 
> As stated, the current plan does not require any UBI volume to 
> be written, only read.  The only raw flash space in the proposed plan 
> is for u-boot itself and two 64k sectors for its environment space.
> 
> > > All distribution images (except Das U-Boot and its environment)
> > > will be stored in separate ubi volumes named by their image type.
> > > A hash of version identifiers will be generated during deployment
> > > to make unique image names.
> > > 
> > > The kernel and its device tree will be stored in a fit image
> > > in a ubi volume.
> > Am I correct in assuming that this allows us to be flexible in the
> > partition sizing? ie, we can perform single-volume updates at runtime
> > that may be larger than the space originally allocated for that
> > volume?
> 
> Yes.  To me, that is the primary motivation for enabling UBI volumes.

I haven't played much with UBI, so this is a naive question: is this the case
for all UBI volumes? It appears there are both 'static' and 'dynamic' volumes.
From [1]:

	UBI volume size is specified when the volume is created and may later
	be changed (volumes are dynamically re-sizable). There are user-space
	tools which may be used to manipulate UBI volumes.

	There are 2 types of UBI volumes - dynamic volumes and static volumes.
	Static volumes are read-only and their contents are protected by CRC-32
	checksums, while dynamic volumes are read-write and the upper layers
	(e.g., a file-system) are responsible for ensuring data integrity.

The order it's written in does seem to suggest that both static and dynamic
volumes can be resized, but clarity there would be good.

[1] http://www.linux-mtd.infradead.org/doc/ubi.html
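For reference, the mtd-utils side of this looks roughly like the sketch below
(device paths, names, and sizes are all hypothetical; I haven't verified
whether ubirsvol accepts static volumes, which is really the question above):

```shell
# Create a dynamic (read-write) volume, then grow it later without
# shuffling neighbouring volumes
ubimkvol /dev/ubi0 -N rwfs -s 16MiB
ubirsvol /dev/ubi0 -N rwfs -s 24MiB

# Create a static (read-only, CRC-32 protected) volume and fill it;
# static volumes can only be written whole via ubiupdatevol
ubimkvol /dev/ubi0 -N kernel-a -t static -s 8MiB
ubiupdatevol /dev/ubi0_1 fit-image.bin
```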

> 
> > [ie, with the partitioned mtd scheme that we have now, increasing
> > a partition beyond its original size isn't possible without
> > having to shuffle the data around]
> > If so, this would be a fairly significant reason for adopting the
> > UBI-volume-based approach.
> 
> I'm glad you agree.
> 
> Without this, we could shift space between the kernel FIT and the 
> read-only file system, but there is no provision to grow or shrink 
> a jffs2 volume that is used for read-write.  Moving the boundary 
> is also difficult: the partition starts are in the dts, and because 
> we distribute the image in pieces alongside the old binaries, the 
> separate images either won't fit or would need to be shifted across 
> the old boundary.
> 
> > > To support a total flash chip failure, each flash chip will
> > > contain an independent ubi device.  The mtd_concat driver will not
> > > be used to form an ubi device that spans flash chip boundaries.
> > > 
> > > Both the primary chip (mtd label bmc) and alternate chip (mtd
> > > label alt) will contain a complete copy of u-boot, and a redundant
> > > environment, and kernel images.  For space reasons root squashfs
> > > volumes may be on a different ubi device than the kernel.
> > Do we really need this though? It seems that the functionality that
> > the hardware provides here (flipping to a completely independent
> > backup flash) gives us a simple, fail-safe mechanism for disaster
> > recovery.
> 
> I think it's important to separate out failure cases and recovery actions 
> for different users and machine classes.  Of the 11 machine dts in the 
> current openbmc/linux dev-4.10 branch, 2 use other layouts, and of the 
> remaining 9, only the three IBM HPC nodes (firestone, garrison, and 
> witherspoon) have populated dual bmc flash devices, unless there are 
> other machines that populate dual flash without showing them in the 
> current dts files (firestone and garrison don't show this in the present 
> device trees).
> 
> That said, using a totally independent flash gives the best chance 
> of recovery, at the largest expense (double the storage cost plus the 
> board space).
> 
> In the current hardware, there is not space for two complete code 
> stacks on each chip.  There is an issue open suggesting the file 
> system shrink to a target that may allow this goal.
> 
> (thoughts continue after more quotes)
> 
> 
> > Using some components from "primary" and some from "secondary" seems
> > like an invitation for incompatibility issues.
> 
> I wasn't aware of what was planned in detail until I asked in last 
> Friday's code review/merge party^Wmeeting.
> 
> The current plans will always use a kernel and rofs from the same 
> yocto build.  The two images will be stored with the same image hash 
> name.  The question came up how to transition from one image to the 
> next, and for the time being, all kernels will be placed on both the 
> primary and alternate ubi devices, allowing any u-boot to start 
> any kernel.
> 
> Further discussion below.
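As an aside on the shared image-hash name: a deployment-side sketch of how
such a name could be derived (the inputs and hash width here are my guesses,
not the actual build tooling):

```shell
# Hypothetical: derive a short unique volume name from version identifiers;
# the real deployment tooling may hash different inputs or use another width.
VERSION="openbmc-v1.99.0-140-g2abc"   # made-up version identifier
IMAGE_ID=$(printf '%s' "$VERSION" | sha256sum | cut -c1-8)
echo "kernel-${IMAGE_ID}"             # kernel FIT volume name
echo "rofs-${IMAGE_ID}"               # matching squashfs volume name
```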
> 
> > If we do continue with this approach, how will it interact with the
> > auto-fallback via chip-select?
> 
> Yes.  I haven't fully developed my plans, and so I have not written 
> a full proposal yet.  But since (1) users will be confused if the 
> contents of the primary and alternate change because the secondary 
> boot code was triggered on the 2500 and (2) users of the 2400 will 
> not be able to access the primary if the alternate chip is booted,  
> I am making the following proposal:
> 
> Before Linux probes the FMC, it will clear all boot watchdog 
> secondary source selects, ensuring the labels in the device tree 
> correspond to the physical chip selects and hence placement on the 
> board.  The kernel command line will assign ubi device numbers 
> to all bmc ubi storage, and these will not change based on which 
> chip was selected to load u-boot and its environment.
> 
> [We can decide to clear the secondary source in u-boot but that 
> would require dual flash support to be implemented and tested in 
> u-boot.  In addition the environment code would need to be updated 
> to recognise, remember, and react to the secondary source select 
> at least to prevent saveenv on the alt u-boot writing the primary 
> environment.]
> 
> The code update activation will be responsible to update both 
> u-boot environments.

By 'both u-boot environments' do you mean on the primary and alternative chips?
Or both flash erase blocks reserved for the environment on one chip?

If the former, this seems like dangerous territory for a fallback/recovery
mechanism. What if code update activation corrupts the environment? We would
corrupt both the primary and alternative boot configuration. I guess we fall
back to the backup environment on the primary somehow? The backup environment
would have to be modified at some point as well, as otherwise the images it
points to might not correspond to what was intended.

> Each environment will contain a set of 
> variables that will cause u-boot to load a named kernel by 
> volume and supply that kernel with a command line to find the 
> squashfs by ubi device and volume number (because that is what 
> can be supported with the current u-boot and kernel command 
> line).
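To make that concrete, here is a sketch of what such an environment could look
like; every variable and volume name below is hypothetical, and the ubiblock
bits assume the kernel's UBI block-device support (ubi.block / /dev/ubiblockX_Y):

```
kernelvol=kernel-a
set_bootargs=setenv bootargs console=ttyS4,115200 ubi.mtd=bmc ubi.block=0,2 root=/dev/ubiblock0_2 rootfstype=squashfs ro
bootcmd=ubi part bmc; run set_bootargs; ubi read ${loadaddr} ${kernelvol}; bootm ${loadaddr}
```

The kernel is found by volume name, but the root filesystem has to be named by
ubi device and volume number, matching the limitation Milton describes.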
> 
> The limitation here is that if the primary chip becomes unreadable 
> and contains the root volume for the alternate chip, the alternate 
> side will also fail to boot.  This could be improved with a TBD 
> mechanism where u-boot would try the kernel matching the rootfs 
> volume that is stored on its flash.
> 
> There is a similar window whenever the alternate flash rootfs 
> image is being replaced until there is space to hold two complete 
> images (current and pending) on one flash chip (or a full and 
> recovery image, eg that only includes code-update and very limited 
> system management).
> 
> > > Since all binaries will exist in the root file system, systemd can
> > > be started directly from the squashfs without an intermediate
> > > initramfs.  Eliminating the initramfs will remove the requirement
> > > to build and store it; however, it also requires the bootloader
> > > to specify how to locate the root filesystem.
> > So we'd also be moving from the current initramfs-based flashing
> > mechanism to something that requires the rootfs to be working, right?
> 
> I think discussion of update and recovery scenarios is a worthwhile 
> topic.  I included the glossary to have defined terms to facilitate 
> such discussion.  [That said, a review of the glossary was skipped 
> in the rush to publish the tome].
> 
> As I mentioned in the alternatives section, the xz-squashfs image 
> can be loaded into ram, either as an initrd to be mounted on /dev/ram0, 
> or with an initramfs script (similar to the current obmc-phosphor-initfs 
> obmc-phosphor-init.sh) that would use an image in rootfs via /dev/loop.
> 
> There are some technical details to work through to get yocto to 
> build such a recovery image, but it is quite reasonable.  And unlike 
> the cpio.xz ramdisk, the squashfs will be decompressed on demand 
> instead of taking 30-90 seconds of time with no console output.
> 
> > That is a bit concerning, as it means that more infrastructure
> > needs to be working & correct to be able to boot *anything*,
> > including recovery. Or have I read the design incorrectly here?
> 
> The current code proposal bypasses the initramfs and requires 
> finding the root filesystem.
> 
> For human-based recovery, from a u-boot prompt:
> 
> (1) u-boot provides a ubi command with an info subcommand 
> that can be used to show ubi volume name and volume numbers.
> 
> (2) The process to extract the information and build a 
> boot command and bootargs can be documented, including the 
> intermediate variables in the proposed u-boot environment.  
> 
> Alternatively,
> 
> (3) The existing phosphor-initfs init.sh script can be modified 
> to look for ubi volumes by name in addition to mtd partitions 
> by name.  The images could be located from its rich recovery 
> environment (which I hint at in the documentation part of my 
> mega-post, but have never fully documented).
> 
> (4) Alternatively a recovery FIT can be built with an initrd 
> or initramfs including the squashfs root image.  This image 
> could be set up for a netboot fallback, or even probed by 
> u-boot from a deployment server (eg attempt dhcp similar to PXE).
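For (1) and (2), a manual recovery session at the u-boot prompt might look
roughly like the following; the volume name and numbers are placeholders that
would come from the `ubi info layout` output:

```
=> ubi part bmc
=> ubi info layout
=> ubi read ${loadaddr} kernel-a
=> setenv bootargs console=ttyS4,115200 ubi.mtd=bmc ubi.block=0,2 root=/dev/ubiblock0_2 rootfstype=squashfs ro
=> bootm ${loadaddr}
```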
> 
> In the interest of full disclosure...
> 
> Not fully documented but observed in testing:
> 
> If var fails to mount or the /etc overlay fails to mount for any 
> reason, the network is not configured and the console getty 
> does not start.  
> 
> This implies the alternate boot watchdog should not be stopped
> in this case.
> 
> 
> 
> > How about something like:
> > - completely independent primary & backup images, and use the
> >   BMC's watchdog to revert to the backup on severe boot problems
> > - kernel and initramfs as (raw) UBI volumes, with u-boot having
> >   support to boot from those
> > - rootfs (ro) + /var (rw) [plus whatever else is required] as
> >   UBI volumes, likely with squashfs for ro and jffs2 for rw.
> > That means:
> > - we still have a fairly simple path to boot to initramfs, which
> >   allows for system recovery
> > - kernel, initramfs and filesystem sizes are not fixed, and can be
> >   modified during firmware update
> > How does that sound?
> 
> 
> Let me highlight what I see as the differences.  I'm going to go
> in reverse order for a moment:
> 
> (1) jffs2 vs ubifs on the ubirw volume
> (2) kernel and initramfs are in separate ubi volumes.
> (3) there is an initramfs
> (4) flashes are independent
> (5) further discussion on needed extension for (4)
> 
> Regarding (1):
> 
> I don't have a strong feeling between ubifs and jffs2.  I will 
> note that gluebi, which provides for mtd emulation over ubi, 
> presents an odd-sized erase block to its customer (64k-2*64 in 
> our case).  As I mentioned, jffs2 has two non-standard compression 
> options that may or may not reduce size, including storing 
> compressed data.  A concern would be that, with few users testing 
> it, the nonstandard block size might expose some hole or boundary 
> condition, while ubifs on ubi would have a larger user base.
> 
> (2)
> I am not aware of any advantage to keeping separate kernel 
> and initramfs images vs the current layout combining them into 
> a FIT image.  (Ok it might be easier to add a kernel built 
> outside yocto, but u-boot can load the dts and/or initramfs 
> images independently from the FIT image.  And this 
> would only apply to kernel hackers vs consumers of the build 
> system.)
> 
> As long as we plan to distribute kernel and rofs as a 
> package I see no reason not to include an initramfs, should 
> one be needed, into a FIT image.
> 
> [FIT is a flat device tree binary (blob) containing 
> pointers to data and metadata including signatures].
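For readers following along, a minimal image-tree source (.its) for such a
FIT, as consumed by mkimage -f, looks roughly like this; the addresses and
file names are purely illustrative:

```
/dts-v1/;
/ {
        description = "BMC kernel + dtb FIT";
        images {
                kernel@1 {
                        data = /incbin/("zImage");
                        type = "kernel";
                        arch = "arm";
                        os = "linux";
                        compression = "none";
                        load = <0x80001000>;
                        entry = <0x80001000>;
                };
                fdt@1 {
                        data = /incbin/("aspeed-bmc.dtb");
                        type = "flat_dt";
                        arch = "arm";
                        compression = "none";
                };
        };
        configurations {
                default = "conf@1";
                conf@1 {
                        kernel = "kernel@1";
                        fdt = "fdt@1";
                };
        };
};
```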
> 
> (3) I haven't talked to Patrick as to why, but I know he 
> had requested several times the initramfs be removed.
> 
> One motivation can be that our build currently requires 
> two complete copies of busybox and its linked libraries 
> be stored, using MB of space even with XZ compression.  
> 
> Another may be the presumption of boot time savings, but 
> I never saw that as an issue and have not heard it expressed.
> 
> Both of these could be addressed with a custom binary 
> (I would write in C) that implements the search logic 
> currently in the init script.
> 
> That said, the only reason I have not proposed making 
> the new var a writeable file system and etc an overlay 
> is the time to implement the proposal.  I would start by 
> making the necessary changes to obmc-phosphor-init.sh 
> and then look at what it would take to choose the 
> current overlay vs the new via a machine feature (and 
> therefore init option via the base options file).
> 
> One thing not mentioned is the ability of ubi to 
> rename volumes atomically, including swapping the names of 
> volumes.  The volume number is stored throughout the 
> flash but the name is only in the (double stored) 
> table of contents.  But until the kernel can mount 
> a squashfs by name alone, it's not useful for 
> implementing a constant "name is primary, alt-name is 
> secondary" policy.
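The atomic rename is exposed through mtd-utils; a name swap would look like
the sketch below (device path and volume names hypothetical):

```shell
# ubirename takes old/new name pairs and applies all of them in one
# atomic operation, so a power loss leaves either all or none renamed.
ubirename /dev/ubi0 rofs rofs-alt rofs-alt rofs
```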
> 
> I do have some concerns with mounting etc after systemd 
> starts, but I think the init script can be invoked 
> before systemd from within the squashfs image (either 
> init= or linuxrc in initrd for recovery).
> 
> I think managing the rwfs is an area that deserves 
> attention and enhancement.  The existing initramfs 
> code is underutilized according to its author :-).
> 
> (4)
> I think most of the difference is in how the rootfs is 
> located;  either it is programmed by code update into 
> the environment (and if it doesn't work we fix it), or 
> the alternate boot doesn't rely on the alternate chip 
> at all.
> 
> If we don't reset the secondary boot flag then all 
> the names are backwards wrt the device tree, and we 
> would need two dts images on the 2500.  It also 
> means we can't support any access to the alternate 
> chip from the 2400.  Which probably means we can't 
> even check its version as access would conflict with 
> access to the primary chip.  I strongly believe that 
> we need to clear the boot flag by the Linux kernel 
> flash driver probe.
> 
> If this is given, then it comes down to failure 
> recovery scenarios.
> 
> The current strategy is each u-boot can read either 
> the previous or new env because of the redundant 
> copy in each flash.
> 
> The current and previous or current and new kernel 
> volumes will be available to u-boot.  It comes 
> down to finding a matching root file system.
> 
> The currently running code will be protected from 
> deletion so a valid image will exist; it may just 
> be stored over two flash chips.  The current 
> discussion was to update the alternate chip env to 
> boot the current image, and then move the newly 
> selected image to be selected on the primary boot 
> chip.
> 
> 
> (5)
> The other scenario is the primary ubi device 
> becomes unusable, either due to corruption or 
> flash chip failure.  With further enhancement 
> to the ubi boot script, I anticipate that a 
> secondary kernel and rootfs could be stored,  
> enabling either the primary or alternate u-boot 
> to boot either the highest-priority or 
> the alternate, second-priority image.
> 
> I think this is needed for supporting recovery 
> for systems without a second bmc flash chip.
> 
> I think this could be automated by exposing 
> the watchdog activation count(s) to the u-boot 
> environment and allowing the boot script to 
> take action based on the counts.  An earlier 
> idea was to have a countdown variable that 
> selects an alternate boot until it reaches 
> zero that is stored in the environment.
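U-Boot's generic bootcount mechanism (CONFIG_BOOTCOUNT_LIMIT) is close to this
idea; a sketch of the environment side, with a hypothetical fallback command:

```
bootlimit=3
altbootcmd=echo boot count exceeded, trying alternate image; setenv kernelvol kernel-b; run bootcmd
```

The bootcount variable is incremented on each boot, userspace clears it (eg
via fw_setenv) once the image proves healthy, and u-boot runs altbootcmd
instead of bootcmd when the count exceeds bootlimit.  Whether this fits
alongside the watchdog-based source select is TBD.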
> 
> If this is implemented, then the current 
> proposal of storing the new image in alternate 
> ubi devices can be argued to increase robustness, 
> as either the current or secondary root 
> will be in the alternate chip, while during 
> update the alternate chip would have no valid 
> file system with your proposal (an alt that 
> is golden would be more robust).
> 
> That said, there is a lot more to talk about 
> regarding full alternate boot image support 
> and what that means.  The in-process code aims 
> to not do anything bad if the secondary boot 
> source is selected on an ast2500, but this has not 
> been tested and may be missing pieces.
> 
> The goal for the current sprint is to allow the 
> alternate chip to boot and be able to repair 
> the primary image to a booting state in the 
> absence of hardware failures, in a deployed 
> environment (without serial console attached).
> 
> There are several details that are needed 
> including keeping a watchdog that will select 
> the alternate boot running, and only pinging 
> it if the user is able to interact via the 
> network with the current image.  This may 
> require updates to the kernel and u-boot to 
> leave a watchdog running in addition to 
> clearing the boot source select.  I am not 
> aware of any test of booting from the alternate 
> chip intentionally; I know several boots 
> have worked with the old layout :-).
> 
> 
> ======
> 
> The initramfs today provides several hooks to 
> dump to the serial port a sulogin (root login) 
> request if anything goes wrong.
> 
> So, how much is making sure we find and mount the  
> rootfs, how much is a developer recovering in 
> system with help of the serial console, and how much
> is something else?
> 
> > Cheers,
> > Joel
> 
> milton
> 