RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO

Benjamin Herrenschmidt benh at kernel.crashing.org
Wed May 18 16:27:41 EST 2005


Hi !

Here's the very first draft of my HOWTO about booting the linux/ppc64
kernel without open firmware. It's still incomplete, the main chapter
describing which nodes & properties are required and their format is
still missing (though it will basically be a subset of the Open Firmware
specification & bindings). The format of the flattened device-tree is
documented.

It's a first draft, so please, don't be too harsh :) Comments are
welcome.

           Booting the Linux/ppc64 kernel without Open Firmware
           ----------------------------------------------------


(c) 2005 Benjamin Herrenschmidt <benh at kernel.crashing.org>, IBM Corp.

   May 18, 2005: Rev 0.1 - Initial draft, no chapter III yet.


I- Introduction
===============


During the recent development of the Linux/ppc64 kernel, and more
specifically the addition of new platform types beyond the old
IBM pSeries/iSeries pair, it was decided to enforce some strict rules
regarding the kernel entry and bootloader <-> kernel interfaces, in
order to avoid the degeneration that the ppc32 kernel entry point,
and the way a new platform gets added to that kernel, have become. The
legacy iSeries platform breaks those rules as it predates this scheme,
but no new board support will be accepted in the main tree that
doesn't follow them properly.

1) Entry point
--------------

   There is one, and only one, entry point to the kernel, at the start
   of the kernel image. That entry point supports two calling
   conventions:

	a) Boot from Open Firmware. If your firmware is compatible
	with Open Firmware (IEEE 1275) or provides an OF compatible
	client interface API (support for "interpret" callback of
	forth words isn't required), you can enter the kernel with:

	      r5 : OF callback pointer as defined by IEEE 1275
	      bindings to powerpc. Only the 32 bits client interface
	      is currently supported

	      r3, r4 : address & length of an initrd if any or 0

	      MMU is either on or off. The kernel will run the
	      trampoline located in arch/ppc64/kernel/prom_init.c to
	      extract the device-tree and other information from Open
	      firmware and build a flattened device-tree as described
	      in b). prom_init() will then re-enter the kernel using
	      the second method. This trampoline code runs in the
	      context of the firmware, which is supposed to handle all
	      exceptions during that time.

	b) Direct entry with a flattened device-tree block. This entry
	point is called by a) after the OF trampoline and can also be
	called directly by a bootloader that does not support the Open
	Firmware client interface. It is also used by "kexec" to
	implement "hot" booting of a new kernel from a previous
	running one. This method is what I will describe in more
	detail in this document, as method a) is simply standard Open
	Firmware, and thus should be implemented according to the
	various standard documents defining it and its binding to the
	PowerPC platform. The entry point definition then becomes:

		r3 : physical pointer to the device-tree block
		(defined in chapter II)

		r4 : physical pointer to the kernel itself. This is
		used by the assembly code to properly disable the MMU
		in case you are entering the kernel with MMU enabled
		and a non-1:1 mapping.

		r5 : NULL (to differentiate it from method a)
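
   As a rough illustration of how the two calling conventions can be
   told apart, the sketch below models the three entry registers as a
   plain C structure. The names entry_regs, direct_entry_regs() and
   is_direct_entry() are hypothetical and do not exist in the kernel,
   where this is handled in assembly. The sketch only captures the
   rule that r5 is NULL for method b).

```c
#include <stdint.h>

/* Hypothetical model of the registers examined at the kernel entry point. */
struct entry_regs {
	uint64_t r3;
	uint64_t r4;
	uint64_t r5;
};

/* Register values a bootloader would set up for method b). */
static struct entry_regs direct_entry_regs(uint64_t dt_phys, uint64_t kernel_phys)
{
	struct entry_regs regs;

	regs.r3 = dt_phys;	/* physical pointer to the device-tree block */
	regs.r4 = kernel_phys;	/* physical pointer to the kernel itself */
	regs.r5 = 0;		/* NULL: no Open Firmware callback */
	return regs;
}

/* Method a) passes a non-NULL OF callback in r5, method b) passes NULL. */
static int is_direct_entry(const struct entry_regs *regs)
{
	return regs->r5 == 0;
}
```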

2) Board support
----------------

   Board supports (platforms) are not exclusive config options. An
   arbitrary set of board supports can be built in a single kernel
   image. The kernel will "know" what set of functions to use for a
   given platform based on the content of the device-tree. Thus, you
   should:

	a) add your platform support as a _boolean_ option in
	arch/ppc64/Kconfig, following the example of PPC_PSERIES,
	PPC_PMAC and PPC_MAPLE. The latter is probably a good
	example of a board support to start from.

	b) create your main platform file as
	"arch/ppc64/kernel/myboard_setup.c" and add it to the Makefile
	under the condition of your CONFIG_ option. This file will
	define a structure of type "ppc_md" containing the various
	callbacks that the generic code will use to get to your
	platform specific code.

	c) Add a reference to your "ppc_md" structure in the
	"machines" table in arch/ppc64/kernel/setup.c

	d) request and get assigned a platform number (see PLATFORM_*
	constants in include/asm-ppc64/processor.h).

   I will describe later the boot process and various callbacks that
   your platform should implement.
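
   To give a feel for how several board supports can coexist in one
   image, here is a minimal sketch of a table of platform structures
   selected by platform number. Everything in it is hypothetical:
   struct board_md is a heavily simplified stand-in for a "ppc_md"
   structure, the probe() callback and select_machine() are
   illustrative names, and the platform numbers are made up.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for a "ppc_md" structure. */
struct board_md {
	const char *name;
	int (*probe)(unsigned int platform);	/* non-zero if this board matches */
};

/* Made-up platform numbers; real ones are the assigned PLATFORM_* constants. */
#define EXAMPLE_PLATFORM_BOARD_A	0x1000
#define EXAMPLE_PLATFORM_BOARD_B	0x2000

static int board_a_probe(unsigned int platform)
{
	return platform == EXAMPLE_PLATFORM_BOARD_A;
}

static int board_b_probe(unsigned int platform)
{
	return platform == EXAMPLE_PLATFORM_BOARD_B;
}

static const struct board_md board_a_md = { "board-a", board_a_probe };
static const struct board_md board_b_md = { "board-b", board_b_probe };

/* All board supports built into this image, NULL terminated. */
static const struct board_md *const machines[] = {
	&board_a_md,
	&board_b_md,
	NULL,
};

/* Return the first board whose probe() accepts the platform number. */
static const struct board_md *select_machine(unsigned int platform)
{
	size_t i;

	for (i = 0; machines[i] != NULL; i++)
		if (machines[i]->probe(platform))
			return machines[i];
	return NULL;
}
```

   The platform number passed to select_machine() would come from the
   device-tree, which is why no board needs to be an exclusive config
   option.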


II - The DT block format
===========================


This chapter defines the actual format of the flattened device-tree
passed to the kernel. The actual content of it and kernel requirements
are described later. You can find examples of code manipulating that
format in various places, including arch/ppc64/kernel/prom_init.c
which will generate a flattened device-tree from the Open Firmware
representation, or the fs2dt utility which is part of the kexec tools
which will generate one from a filesystem representation. It is
expected that a bootloader like uboot provides a bit more support,
that will be discussed later as well.

1) Header
---------

   The kernel is entered with r3 pointing to an area of memory that is
   roughly described in include/asm-ppc64/prom.h by the structure
   boot_param_header:

struct boot_param_header
{
	u32	magic;			/* magic word OF_DT_HEADER */
	u32	totalsize;		/* total size of DT block */
	u32	off_dt_struct;		/* offset to structure */
	u32	off_dt_strings;		/* offset to strings */
	u32	off_mem_rsvmap;		/* offset to memory reserve map */
	u32	version;		/* format version */
	u32	last_comp_version;	/* last compatible version */
	/* version 2 fields below */
	u32	boot_cpuid_phys;	/* Which physical CPU id we're
					   booting on */
};

   Along with the constants:

/* Definitions used by the flattened device tree */
#define OF_DT_HEADER		0xd00dfeed	/* 4: version, 4: total size */
#define OF_DT_BEGIN_NODE	0x1		/* Start node: full name */
#define OF_DT_END_NODE		0x2		/* End node */
#define OF_DT_PROP		0x3		/* Property: name off,
						   size, content */
#define OF_DT_END		0x9

   All values in this header are in big endian format. The various
   fields in this header are defined more precisely below. All
   "offset" values are in bytes from the start of the header, that is
   from the value of r3.

   - magic

     This is a magic value that "marks" the beginning of the
     device-tree block header. It contains the value 0xd00dfeed and is
     defined by the constant OF_DT_HEADER

   - totalsize

     This is the total size of the DT block including the header. The
     "DT" block should enclose all data structures defined in this
     chapter (which are pointed to by offsets in this header). That is,
     the device-tree structure, strings, and the memory reserve map.

   - off_dt_struct

     This is an offset from the beginning of the header to the start
     of the "structure" part of the device tree. (see 2) device tree)

   - off_dt_strings

     This is an offset from the beginning of the header to the start
     of the "strings" part of the device-tree

   - off_mem_rsvmap

     This is an offset from the beginning of the header to the start
     of the reserved memory map. This map is a list of pairs of 64
     bits integers. Each pair is a physical address and a size. The
     list is terminated by an entry of size 0. This map provides the
     kernel with a list of physical memory areas that are "reserved"
     and thus not to be used for memory allocations, especially during
     early initialisation. The kernel needs to allocate memory during
     boot for things like un-flattening the device-tree, allocating an
     MMU hash table, etc... Those allocations must be done in such a
     way as to avoid overwriting critical things like, on Open Firmware
     capable machines, the RTAS instance, or on some pSeries, the TCE
     tables used for the iommu. Typically, the reserve map should
     contain _at least_ this DT block itself (header,total_size). If
     you are passing an initrd to the kernel, you should reserve it as
     well. You do not need to reserve the kernel image itself. The map
     should be 64 bits aligned. 

   - version

     This is the version of this structure. Version 1 stops
     here. Version 2 adds an additional field boot_cpuid_phys. You
     should always generate a structure of the highest version defined
     at the time of your implementation. That is version 2.

   - last_comp_version

     Last compatible version. This indicates the oldest version of
     the DT block with which you are backward compatible. For example,
     version 2 is backward compatible with version 1 (that is, a
     kernel built for version 1 will be able to boot with a version 2
     format). You should put a 1 in this field unless a new
     incompatible version of the DT block is defined.

   - boot_cpuid_phys

     This field only exists on version 2 headers. It indicates which
     physical CPU ID is calling the kernel entry point. This is used,
     among others, by kexec. If you are on an SMP system, this value
     should match the content of the "reg" property of the CPU node in
     the device-tree corresponding to the CPU calling the kernel entry
     point (see further chapters for more information on the required
     device-tree contents).
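
The field descriptions above can be turned into a small reader. What
follows is only a sketch under stated assumptions, not the kernel's
code: dt_parse_header(), dt_count_reserved() and struct dt_header are
hypothetical names, the block is assumed to be available as a byte
array, and the sanity checks are just the ones that follow directly
from the text (magic value, offsets inside totalsize, version 2 adding
boot_cpuid_phys, reserve map terminated by an entry of size 0).

```c
#include <stddef.h>
#include <stdint.h>

#define OF_DT_HEADER	0xd00dfeed

/* Host-endian copy of the big endian header fields. */
struct dt_header {
	uint32_t magic, totalsize, off_dt_struct, off_dt_strings;
	uint32_t off_mem_rsvmap, version, last_comp_version, boot_cpuid_phys;
};

static uint32_t read_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t read_be64(const uint8_t *p)
{
	return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
}

/* Returns 0 if a reader that understands version 2 can use the block. */
static int dt_parse_header(const uint8_t *blob, size_t len, struct dt_header *h)
{
	if (len < 28)				/* version 1 header: 7 words */
		return -1;
	h->magic = read_be32(blob);
	h->totalsize = read_be32(blob + 4);
	h->off_dt_struct = read_be32(blob + 8);
	h->off_dt_strings = read_be32(blob + 12);
	h->off_mem_rsvmap = read_be32(blob + 16);
	h->version = read_be32(blob + 20);
	h->last_comp_version = read_be32(blob + 24);
	h->boot_cpuid_phys = 0;

	if (h->magic != OF_DT_HEADER || h->totalsize > len)
		return -1;
	if (h->last_comp_version > 2)		/* too new for this reader */
		return -1;
	if (h->off_dt_struct >= h->totalsize ||
	    h->off_dt_strings >= h->totalsize ||
	    h->off_mem_rsvmap >= h->totalsize)
		return -1;
	if (h->version >= 2) {			/* version 2 adds one word */
		if (len < 32)
			return -1;
		h->boot_cpuid_phys = read_be32(blob + 28);
	}
	return 0;
}

/* Counts reserve map entries; -1 if no terminating entry of size 0. */
static int dt_count_reserved(const uint8_t *blob, const struct dt_header *h)
{
	uint64_t off = h->off_mem_rsvmap;
	int count = 0;

	while (off + 16 <= h->totalsize) {
		if (read_be64(blob + off + 8) == 0)	/* size field */
			return count;
		count++;
		off += 16;
	}
	return -1;
}
```

Reading the fields byte by byte keeps the sketch independent of the
endianness of the machine it runs on.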


   So the typical layout of a DT block (though the various parts don't
   need to be in that order) looks like (addresses go from top to bottom):


             ------------------------------    
       r3 -> |  struct boot_param_header  | 
             ------------------------------
             |      (alignment gap) (*)   |
	     ------------------------------
	     |      memory reserve map    |
	     ------------------------------
	     |      (alignment gap)       |
             ------------------------------
             |                            |
             |    device-tree structure   |
             |                            |
             ------------------------------
	     |      (alignment gap)       |
             ------------------------------
             |                            |
             |     device-tree strings    |
             |                            |
      -----> ------------------------------
      |    
      |
      --- (r3 + totalsize)

  (*) The alignment gaps are not necessarily present, their presence
      and size are dependent on the various alignment requirements of
      the individual data blocks.
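
When building a DT block, the layout pictured above amounts to a few
rounding operations. The sketch below computes the header offsets for
the order shown in the diagram. The names dt_layout, align_up() and
dt_compute_layout() are hypothetical, and the alignment of the
structure and strings blocks is left as a parameter rather than
hard-coded. Offsets are relative to the start of the header, so the
block itself is assumed to be loaded at a suitably aligned address.

```c
#include <stdint.h>

/* Offsets to store in the header, all relative to the start of the block. */
struct dt_layout {
	uint32_t off_mem_rsvmap;
	uint32_t off_dt_struct;
	uint32_t off_dt_strings;
	uint32_t totalsize;
};

/* Round off up to the next multiple of align (align must be a power of two). */
static uint32_t align_up(uint32_t off, uint32_t align)
{
	return (off + align - 1) & ~(align - 1);
}

/* Lay the parts out in the order header, reserve map, structure, strings. */
static struct dt_layout dt_compute_layout(uint32_t header_size,
					  uint32_t rsvmap_size,
					  uint32_t struct_size,
					  uint32_t strings_size,
					  uint32_t block_align)
{
	struct dt_layout l;

	/* The reserve map holds 64 bits integers and is 64 bits aligned. */
	l.off_mem_rsvmap = align_up(header_size, 8);
	l.off_dt_struct = align_up(l.off_mem_rsvmap + rsvmap_size, block_align);
	l.off_dt_strings = align_up(l.off_dt_struct + struct_size, block_align);
	l.totalsize = l.off_dt_strings + strings_size;
	return l;
}
```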


2) Device tree generalities
---------------------------

The device-tree itself is split into two different blocks, a
structure block and a strings block. Both need to be page
aligned.

First, let's quickly describe the device-tree concept before detailing
the storage format. This chapter does _not_ describe the detail of the
required types of nodes & properties for the kernel, this is done
later in chapter III.

The device-tree layout is strongly inherited from the definition of
the Open Firmware IEEE 1275 device-tree. It's basically a tree of
nodes, each node having two or more named properties. A property can
have a value or not.

It is a tree, so each node has one and only one parent except for the
root node which has no parent.

A node has 2 names. The actual node name is contained in a property of
type "name" in the node property list whose value is a zero terminated
string and is mandatory. There is also a "unit name" that is used to
differentiate nodes with the same name at the same level. It is
usually made of the node's name, the "@" sign, and a "unit address",
whose definition is specific to the bus type the node sits on. The
unit name doesn't exist as a property per se but is included in the
device-tree structure. It is typically used to represent a "path" in
the device-tree. More details about these will be provided later. The
ppc64 kernel generic code does not make any formal use of the unit
address (though some board support code may do), so the only
real requirement here for the unit address is to ensure uniqueness of
the node unit name at a given level. Nodes with no notion of address
and no possible sibling of the same name (like /memory or /cpus) may
omit the unit address in the context of this specification, or use
the "@0" default unit address. The unit name is used to define a node
"full path", which is the concatenation of all parent nodes' unit names
separated with "/".
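
As a small illustration of the naming rules above, the following
sketch builds a unit name from a node name and an optional unit
address, then appends it to a full path. The helpers dt_unit_name()
and dt_path_append() are hypothetical names introduced for this
example only.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build "name@unit_addr", or just "name" when unit_addr is NULL.
 * Returns 0 on success, -1 if buf is too small.
 */
static int dt_unit_name(char *buf, size_t size, const char *name,
			const char *unit_addr)
{
	int n;

	if (unit_addr != NULL)
		n = snprintf(buf, size, "%s@%s", name, unit_addr);
	else
		n = snprintf(buf, size, "%s", name);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/*
 * Append "/unit_name" to the zero terminated path held in a buffer of
 * the given size. Start from an empty string for children of the root.
 * Returns 0 on success, -1 if the result does not fit.
 */
static int dt_path_append(char *path, size_t size, const char *unit_name)
{
	size_t len = strlen(path);
	int n = snprintf(path + len, size - len, "/%s", unit_name);

	return (n < 0 || (size_t)n >= size - len) ? -1 : 0;
}
```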

The root node is defined as being named "device-tree" and has no unit
address (no @ symbol followed by a unit address). When manipulating
device-tree "path", the root of the tree is generally represented by a
simple slash sign "/".

Every node which represents an actual device (that is, which isn't
only a virtual "container" for more nodes, like "/cpus" is) is also
required to have a "device_type" property indicating the type of node.

Finally, every node is required to have a "linux,phandle"
property. Real open firmware implementations don't provide it as it's
generated on the fly by the prom_init.c trampoline from the Open
Firmware "phandle". Implementations providing a flattened device-tree
directly should provide this property. This property is a 32 bits value
that uniquely identifies a node. You are free to use whatever values or
system of values, internal pointers, or whatever to generate these, the
only requirement is that every single node of the tree you are passing
to the kernel has a unique value in this property.

This can be used in some cases for nodes to reference other nodes.
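
Since uniqueness is the only requirement on "linux,phandle" values, a
tool generating a flattened tree could verify it with something as
simple as the sketch below. The function name phandles_unique() is
hypothetical, and the quadratic scan is only meant for trees with a
modest number of nodes.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if every value in ph[0..count-1] is distinct, 0 otherwise. */
static int phandles_unique(const uint32_t *ph, size_t count)
{
	size_t i, j;

	for (i = 0; i < count; i++)
		for (j = i + 1; j < count; j++)
			if (ph[i] == ph[j])
				return 0;
	return 1;
}
```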

Here is an example of a simple device-tree. In this example, an "o"
designates a node followed by the node unit name. Properties are
presented with their name followed by their content. "content"
represents an ASCII string (zero terminated) value, while <content>
represents a 32 bits hexadecimal value. The various nodes in this
example will be discussed in a later chapter. At this point, it is
only meant to give you an idea of what a device-tree looks like.

  / o device-tree
      |- name = "device-tree"
      |- model = "MyBoardName"
      |- compatible = "MyBoardFamilyName"
      |- #address-cells = <2>
      |- #size-cells = <2>
      |- linux,phandle = <0>
      |
      o cpus
      | |- name = "cpus"
      | |- linux,phandle = <1>
      | |
      | o PowerPC,970@0
      |   |- name = "PowerPC,970"
      |   |- device_type = "cpu"
      |   |- reg = <0>
      |   |- clock-frequency = <5f5e1000>
      |   |- linux,boot-cpu
      |   |- linux,phandle = <2>
      |
      o memory@0
      | |- name = "memory"
      | |- device_type = "memory"
      | |- reg = <00000000 00000000 00000000 20000000>
      | |- linux,phandle = <3>
      |
      o chosen
        |- name = "chosen"
        |- bootargs = "root=/dev/sda2"
        |- linux,platform = <00000600>
        |- linux,phandle = <4>

This tree is an example of a minimal tree. It pretty much contains the
minimal set of required nodes and properties to boot a linux kernel,
that is some basic model information at the root, the CPUs, the
physical memory layout, and misc information passed through /chosen
like, in this example, the platform type (mandatory) and the kernel
command line arguments (optional).

The /cpus/PowerPC,970@0/linux,boot-cpu property is an example of a
property without a value. All other properties have a value. The
meaning of the #address-cells and #size-cells properties will be
explained in chapter III which defines precisely the required nodes and
properties and their content.


3) Device tree "structure" block
--------------------------------

The structure of the device tree is a linearized tree structure. The
"OF_DT_BEGIN_NODE" token starts a new node, and the "OF_DT_END_NODE"
token ends that node definition. Child nodes are simply defined before
"OF_DT_END_NODE" (that is, nodes within the node). A 'token' is a 32
bits value. The whole structure block is terminated by an "OF_DT_END"
token.

Here's the basic structure of a single node:

     * token OF_DT_BEGIN_NODE (that is 0x00000001)
     * node full path as a zero terminated string
     * [align gap to next 4 bytes boundary]
     * for each property:
        * token OF_DT_PROP (that is 0x00000003)
        * 32 bits value of property value size in bytes (or 0 if no value)
        * 32 bits value of offset in string block of property name
        * [align gap to either next 4 bytes boundary if the property value
          size is less or equal to 4 bytes, or to next 8 bytes
          boundary if the property value size is larger than 4 bytes]
        * property value data if any
        * [align gap to next 4 bytes boundary]
     * [child nodes if any]
     * token OF_DT_END_NODE (that is 0x00000002)

So the node content can be summarised as a start token, a full path, a
list of properties, a list of child nodes and an end token. Every child
node is a full node structure itself as defined above.
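
To make the format more concrete, here is a sketch of a walker that
steps over a structure block and counts what it finds. It is not the
kernel's scan_flat_dt(): the names dt_stats and dt_walk_struct() are
hypothetical. The token values are the constants listed with the
header, with 0x2 closing a node and 0x9 closing the whole block.
Alignment gaps are computed relative to the start of the structure
block, which assumes the block itself starts on at least an 8 bytes
boundary.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OF_DT_BEGIN_NODE	0x1
#define OF_DT_END_NODE		0x2
#define OF_DT_PROP		0x3
#define OF_DT_END		0x9

struct dt_stats {
	int nodes;
	int props;
	int max_depth;
};

static uint32_t be32_at(const uint8_t *base, size_t off)
{
	const uint8_t *p = base + off;

	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static size_t align_to(size_t off, size_t align)
{
	return (off + align - 1) & ~(align - 1);
}

/* Returns 0 if the block is well formed and ends with OF_DT_END. */
static int dt_walk_struct(const uint8_t *base, size_t size, struct dt_stats *st)
{
	size_t off = 0;
	int depth = 0;

	st->nodes = st->props = st->max_depth = 0;
	while (off + 4 <= size) {
		uint32_t token = be32_at(base, off);

		off += 4;
		switch (token) {
		case OF_DT_BEGIN_NODE: {
			/* zero terminated full path, then pad to 4 bytes */
			const uint8_t *nul = memchr(base + off, 0, size - off);

			if (nul == NULL)
				return -1;
			off = align_to((size_t)(nul - base) + 1, 4);
			st->nodes++;
			if (++depth > st->max_depth)
				st->max_depth = depth;
			break;
		}
		case OF_DT_PROP: {
			uint32_t len;

			if (depth == 0 || off + 8 > size)
				return -1;
			len = be32_at(base, off);	/* name offset follows */
			off = align_to(off + 8, len > 4 ? 8 : 4);
			if (len > size || off > size - len)
				return -1;
			off = align_to(off + len, 4);	/* skip value and pad */
			st->props++;
			break;
		}
		case OF_DT_END_NODE:
			if (depth-- == 0)
				return -1;
			break;
		case OF_DT_END:
			return depth == 0 ? 0 : -1;
		default:
			return -1;
		}
	}
	return -1;
}
```

A real reader would also look up each property name in the strings
block using the name offset, which this sketch skips over.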

4) Device tree "strings" block
------------------------------

In order to save space, property names, which are generally redundant,
are stored separately in the "strings" block. This block is simply the
whole bunch of zero terminated strings for all property names
concatenated together. The device-tree property definitions in the
structure block will contain offset values from the beginning of the
strings block.
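
A tool producing the strings block can share one copy of each property
name, which is where the space saving comes from. The sketch below
looks a name up in the block and appends it only if it is not there
yet. The names dt_strtab and dt_strtab_offset() are hypothetical, and
the caller is assumed to provide the backing buffer.

```c
#include <stddef.h>
#include <string.h>

/* A strings block under construction in a caller-provided buffer. */
struct dt_strtab {
	char *data;
	size_t used;
	size_t cap;
};

/*
 * Returns the offset of name from the beginning of the strings block,
 * appending it first if it is not present yet. Returns -1 if the
 * buffer is full.
 */
static long dt_strtab_offset(struct dt_strtab *t, const char *name)
{
	size_t off = 0;
	size_t len = strlen(name) + 1;	/* include the terminating zero */

	while (off < t->used) {
		if (strcmp(t->data + off, name) == 0)
			return (long)off;
		off += strlen(t->data + off) + 1;
	}
	if (len > t->cap - t->used)
		return -1;
	memcpy(t->data + t->used, name, len);
	t->used += len;
	return (long)(t->used - len);
}
```

The returned value is what goes into the name offset word of an
OF_DT_PROP record in the structure block.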


III - Required content of the device tree
=========================================


  < to be written >



IV - Recommendation for a bootloader
====================================


Here are various ideas/recommendations that have been proposed
while all this has been defined and implemented.


  - It should be possible to write a parser that turns an ASCII
    representation of a device-tree (or even XML though I find that
    less readable) into a device-tree block. This would basically
    allow building the device-tree structure and strings "blobs" at
    bootloader build time, and having the bootloader just pass them
    as-is to the kernel. In fact, the device-tree blob could then be
    separate from the bootloader itself, and be placed in a separate
    portion of the flash that can be "personalized" for different
    board types by flashing a different device-tree.

  - The bootloader may want to be able to use the device-tree
    itself and may want to manipulate it (to add/edit some properties,
    like physical memory size or kernel arguments). At this point, 2
    choices can be made. Either the bootloader works directly on the
    flattened format, or the bootloader has its own internal tree
    representation with pointers (similar to the kernel one) and
    re-flattens the tree when booting the kernel. The former is a bit
    more difficult to edit/modify, the latter probably requires a bit
    more code to handle the tree structure. Note that the structure
    format has been designed so it's relatively easy to "insert"
    properties or nodes or delete them by just memmove'ing things
    around. It contains no internal offsets or pointers for this purpose.

  - An example of code for iterating nodes & retrieving properties
    directly from the flattened tree format can be found in the kernel
    file arch/ppc64/kernel/prom.c, look at the scan_flat_dt() function,
    its usage in early_init_devtree(), and the corresponding various
    early_init_dt_scan_*() callbacks. That code can be re-used in a
    GPL bootloader, and as the author of that code, I would be happy
    to discuss possible free licensing to any vendor who wishes to
    integrate all or part of this code into a non-GPL bootloader.
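
The memmove-based editing mentioned in the second recommendation above
can be sketched as follows. The function name blob_insert() is
hypothetical, the structure block is assumed to live in a buffer with
spare room at the end, and the inserted record is assumed to be
already encoded and padded by the caller.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Insert len bytes of data at byte offset pos in a buffer that
 * currently holds *used bytes out of cap. Everything after pos is
 * shifted up. Returns 0 on success, -1 if pos is out of range or
 * there is not enough room.
 */
static int blob_insert(uint8_t *buf, size_t *used, size_t cap, size_t pos,
		       const void *data, size_t len)
{
	if (pos > *used || len > cap - *used)
		return -1;
	memmove(buf + pos + len, buf + pos, *used - pos);
	memcpy(buf + pos, data, len);
	*used += len;
	return 0;
}
```

After such an insertion only the header needs fixing up (totalsize,
and any offset to a part located after the edited one). Records that
follow the insertion point keep their alignment as long as the
inserted length preserves it.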





