Fwd: Options for kdump kernel location

Thu Aug 18 16:58:17 EST 2005

Some of you have already seen this, but in case anyone else has any thoughts
here it is.

I'll post a follow-up with a summary of people's comments so far.

------------------

For kdump, we have two options WRT where we run the second kernel.
Either we swap the old and new kernels, and the second kernel runs at address
zero, or we run the second kernel somewhere else. More below.

Let me know what you think.

Terminology:

K1: The first kernel that's booted - the original kernel. The kernel that's
    not kexec'ed.

K2: The second kernel is the kernel that the first kernel boots, aka. the
    capture kernel, crash kernel, kexec kernel.

Capture Kernel at Zero
----------------------

1. K1 boots.
2. K1 reserves memory (the "reserved region") for K2 (via "crashkernel=x@y").
   A user or script then runs kexec and loads K2 into the reserved region.

  Layout at this point:
                         /------------- reserved --------------\
  ---------------------------------------------------------------------------
  | K1 Image | K1 memory | K2 Image |                          | K1 memory  |
  ---------------------------------------------------------------------------
      ^
     NIP

3. K1 panics, calls machine_kexec() which:
4. Allocates a temporary stack in the reserved region.
5. Copies some shim code into the reserved region.
6. Enters real mode. (or after step 7?)
7. Switches to new stack, jumps to shim.

  Layout at this point:
                         /------------- reserved --------------\
  ---------------------------------------------------------------------------
  | K1 Image | K1 memory | K2 Image | Stack | Shim |           | K1 memory  |
  ---------------------------------------------------------------------------
                                               ^
                                              NIP

8. The shim exchanges the two kernels, stopping as soon as K2 is completely
   copied to address zero.

  Layout at this point:
                         /------------- reserved --------------\
  ---------------------------------------------------------------------------
  | K2 Image | K1 memory | K1 Image | Stack | Shim |           | K1 memory  |
  ---------------------------------------------------------------------------
                                               ^
                                              NIP

9.  The shim finishes, we jump to zero and start running K2.
10. K1 and its memory are reserved from the POV of K2. The memory we used for
    the temp stack and shim is reused as K2's memory.

  Layout at this point:
             /------ reserved ------\                          /- reserved -\
  ---------------------------------------------------------------------------
  | K2 Image | K1 memory | K1 Image | K2 memory                | K1 memory  |
  ---------------------------------------------------------------------------
    ^
   NIP
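
To make the exchange in step 8 concrete, here's a toy model in Python. It's
only a sketch: memory is a list with one entry per page, the region sizes and
the exchange_kernels() name are made up, both images are assumed to be the
same size, and the temp stack and shim are left out.

```python
def exchange_kernels(mem, k2_start, k2_pages):
    """Swap pages [0, k2_pages) with [k2_start, k2_start + k2_pages), in place."""
    for i in range(k2_pages):
        mem[i], mem[k2_start + i] = mem[k2_start + i], mem[i]


# Hypothetical layout, one list entry per page (all sizes are made up):
# K1 image, K1 memory, reserved region (K2 image + free), more K1 memory.
mem = ["K1"] * 4 + ["k1mem"] * 4 + ["K2"] * 4 + ["free"] * 4 + ["k1mem"] * 4

# The shim stops once all of K2 has been moved down to page zero.
exchange_kernels(mem, k2_start=8, k2_pages=4)
```

After the call K2 occupies the pages starting at zero and the K1 image sits
where K2 was loaded, which matches the third layout above.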

PROBLEMS:
 - If "K2 Image" is larger than "K1 Image" we'll overwrite some of K1's
   memory with the K2 image. This could be bad if we're DMA'ing to that
   memory.
 - We could fix that by always making the reserved region start at klimit,
   eg:

             /------------- reserved --------------\
  ---------------------------------------------------------------------------
  | K1 Image | K2 Image |                          | K1 memory              |
  ---------------------------------------------------------------------------

  But that eats up low memory for K1. Do we care? (RTAS does.)

 - Come to think of it, do we ever DMA to static data? (ie. in the kernel
   image) That would really screw us up.
 - We need to run the shim in real mode, otherwise it'll need page tables,
   fault handlers etc. (right ??)
 - And that forces the reserved region to be in the RMO == 256 MB (??)
 - We might be saved from DMA troubles if we clear the TCE tables before
   booting K2, or not - perhaps the DMA continues regardless of the TCE
   mapping going away.
 - Other stuff?
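
The first problem above is just interval arithmetic, so here's a rough sketch
of it. The function name and the example sizes are hypothetical; it assumes
the K1 image starts at address zero and K1's memory follows it directly, as
in the first layout.

```python
def clobbered_k1_memory(k1_image_size, k2_image_size):
    """Return the (start, end) byte range of K1 memory that ends up holding
    K2 image data once K2 is copied to zero, or None if K2 fits inside the
    space K1's image occupied. `end` is exclusive."""
    if k2_image_size <= k1_image_size:
        return None
    return (k1_image_size, k2_image_size)


# Made-up sizes: an 8 MB K1 image and a 10 MB K2 image.
overlap = clobbered_k1_memory(8 << 20, 10 << 20)
```

Any device still DMA'ing into that range would be writing over K2. Starting
the reserved region at klimit makes the range empty by construction.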

Capture Kernel at non-Zero
--------------------------

1. K1 boots.
2. K1 reserves memory (the "reserved region") for K2 (via "crashkernel=x@y").
   A user or script then runs kexec and loads K2 into the reserved region.
   NB. K2 must be linked to run at a non-zero address, except for some/all
   of head.S (????) A PIC kernel might help, but might be impossible (??)

  Layout at this point:
                         /------------- reserved --------------\
  ---------------------------------------------------------------------------
  | K1 Image | K1 memory | K2 Image |                          | K1 memory  |
  ---------------------------------------------------------------------------
      ^
     NIP

3. K1 panics, calls machine_kexec() which:
4. Allocates a temporary stack in the reserved region.
5. Copies some shim code into the reserved region.
6. Enters real mode. (??)
7. Switches to new stack, jumps to shim.

  Layout at this point:
                         /------------- reserved --------------\
  ---------------------------------------------------------------------------
  | K1 Image | K1 memory | K2 Image | Stack | Shim |           | K1 memory  |
  ---------------------------------------------------------------------------
                                               ^
                                              NIP

8. The shim swaps the low few (? ~10) pages of K2 with the same few pages
   from K1 (this is essentially head.S, ie. exception vectors etc.)
9. These pages are modified (somehow) so that they jump to the right places
   in the K2 image (or can we do this at link time?)

  Layout at this point:
                              /----------- reserved -------------\
  ---------------------------------------------------------------------------
  |K2 .. | .. end K1 | K1 mem | .. end K2 | Stack | Shim | K1 .. | K1 memory|
  ---------------------------------------------------------------------------
    |                              ^                ^
    \------------------------------|               NIP
       points into here

10. The shim finishes, we jump to zero and start running K2.
11. K1 and its memory are reserved from the POV of K2. The memory we used for
    the temp stack and shim is reused as K2's memory.

  Layout at this point:

         /--- reserved -------\                         /----- reserved ----\
  ---------------------------------------------------------------------------
  |K2 .. | .. end K1 | K1 mem | K2 Image |              | K1 .. | K1 memory |
  ---------------------------------------------------------------------------
   ^  |                              ^
  NIP \------------------------------|
       points into here
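
The "points into here" arrows boil down to a bit of address arithmetic, which
might look something like the sketch below. This is not a proposal for how to
do the fix-up, just an illustration of where things end up. The 4 KB page
size, the load address and the function name are all assumptions.

```python
PAGE_SIZE = 4096  # assumed page size


def k2_physical_address(offset, low_pages, k2_base):
    """Map an offset within the K2 image to where it lives after the swap.

    The first `low_pages` pages of K2 have been swapped down to address
    zero. The rest of the image stays where kexec loaded it, at k2_base.
    """
    if offset < low_pages * PAGE_SIZE:
        return offset
    return k2_base + offset


# Hypothetical numbers: 10 low pages swapped, K2 loaded at 32 MB.
vector_addr = k2_physical_address(0x300, low_pages=10, k2_base=32 << 20)
handler_addr = k2_physical_address(0x20000, low_pages=10, k2_base=32 << 20)
```

Code in the low pages runs at its offset unchanged, but anything it branches
to beyond those pages is k2_base higher than it would be for a kernel running
at zero. That's the adjustment that has to happen somewhere, whether at link
time or by patching.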

PROBLEMS:
 - We need to run the shim in real mode, otherwise it'll need page tables,
   fault handlers etc. (right ??)
 - And that forces the reserved region to be in the RMO == 256 MB (??)
 - Need to audit KERNELBASE/PAGE_OFFSET usage, link kernel at different
   address.
 - Have to have a specially built K2 (ie. linked at !0)
 - Can we even build PIC?
 - Might hit other gotchas, ie. code that assumes start == 0.
 - Not sure how we make the exception handlers jump into K2 correctly.
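
If the RMO constraint turns out to be real, it's a simple bounds check on the
"crashkernel=x@y" value. A minimal sketch, assuming x is the size and y the
start address, that only K/M/G suffixes matter, and that the RMO really is
256 MB (which is still an open question above). The helper names are made up
and this is not the kernel's own parser.

```python
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Parse a number with an optional K/M/G suffix, e.g. '128M' or '0x2000000'."""
    text = text.strip()
    if text and text[-1].upper() in UNITS:
        return int(text[:-1], 0) * UNITS[text[-1].upper()]
    return int(text, 0)


def parse_crashkernel(arg):
    """Split 'size@base' into (base, size), both in bytes."""
    size, _, base = arg.partition("@")
    return parse_size(base), parse_size(size)


def fits_in_rmo(base, size, rmo_size=256 << 20):
    """True if the whole reserved region lies below the (assumed) RMO limit."""
    return base + size <= rmo_size


base, size = parse_crashkernel("128M@32M")
ok = fits_in_rmo(base, size)                             # ends at 160 MB
too_high = fits_in_rmo(*parse_crashkernel("128M@192M"))  # ends at 320 MB
```

The same check applies to the capture-kernel-at-zero case, since the shim
runs from the reserved region there too.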

--
Michael Ellerman
IBM OzLabs

email: michael:ellerman.id.au
inmsg: mpe:jabber.org
wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person