Fwd: Options for kdump kernel location

Thu Aug 18 17:02:53 EST 2005

Benh pointed out that the kernel actually looks like this:

  --------------------------------------------------------------
  |  text segment |  __init data & code |  data (data/got/bss) |  
  --------------------------------------------------------------

And so once the kernel is booted, ie. all __init stuff is freed, we have this:

  --------------------------------------------------------------
  |  text segment |     free memory     |  data (data/got/bss) |  
  --------------------------------------------------------------

And that free memory may be used for DMA. This basically makes option one a 
non-option, as we'd quite likely have ongoing DMAs into the second kernel's 
text and/or data.
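
To make the overlap concrete, here is a rough Python sketch. It assumes both kernels use the three-section layout drawn above and that option one copies K2 to address zero. The section sizes and the helper names (layout, overlap, dma_exposed) are made up for illustration and are not kernel code.

```python
from __future__ import annotations
from typing import NamedTuple

class Region(NamedTuple):
    name: str
    start: int  # inclusive physical address
    end: int    # exclusive physical address

def layout(text_size: int, init_size: int, data_size: int) -> list[Region]:
    """Lay text, __init and data out back to back, starting at address zero."""
    regions, addr = [], 0
    for name, size in (("text", text_size), ("init", init_size), ("data", data_size)):
        regions.append(Region(name, addr, addr + size))
        addr += size
    return regions

def overlap(a: Region, b: Region) -> int:
    """Number of bytes shared by two half-open regions."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))

def dma_exposed(k1: list[Region], k2: list[Region]) -> dict[str, int]:
    """Bytes of each K2 section that land on K1's freed __init range."""
    freed = next(r for r in k1 if r.name == "init")
    return {r.name: overlap(freed, r) for r in k2 if overlap(freed, r)}

MB = 1 << 20
k1 = layout(4 * MB, 1 * MB, 3 * MB)  # hypothetical sizes for the crashed kernel
k2 = layout(3 * MB, 1 * MB, 3 * MB)  # hypothetical, slightly smaller capture kernel
exposed = dma_exposed(k1, k2)        # which K2 sections sit on possible DMA targets
```

With these invented sizes, part of K2's data section ends up on memory that K1 freed and may have handed out for DMA. Not freeing __init in K1 removes that freed range altogether.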

Paulus pointed out that we could just not free the __init stuff, which would 
cost us a few hundred KB, but that's not much these days. That would make 
option one a possibility again.

Haren and Vivek both pointed out that distros may be reluctant to implement 
option two, because it requires a specially built capture kernel. A PIC 
kernel would help there, but only if we always build PIC.

Vivek pointed out that there is existing code for i386, "purgatory", which is 
equivalent to my shim. Purgatory is part of kexec-tools though, not the 
kernel, which seems odd to me.

Vivek mentioned that x86_64 uses dedicated page tables so the copying step can 
be done in virtual mode. Not sure if we can do that on PPC?

Maneesh pointed out that the i386 kdump used to behave similarly to option 
one, but that it was redesigned to work more or less like option two. Perhaps 
we should learn from their mistakes? :D

---

Generally it's looking like option 2 is safer, assuming we can get the kernel 
booting at !0.
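
For reference, the low-page swap that option two depends on can be modelled in a few lines of Python. This is only a toy model of the swap step described in the quoted mail below: memory is a bytearray, the 4 KB page size, the load address and the helper name swap_low_pages are assumptions for illustration, and nothing here addresses how the swapped exception vectors get patched to jump into K2.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only

def swap_low_pages(mem: bytearray, k2_base: int, npages: int) -> None:
    """Exchange the first npages pages at address 0 with those at k2_base."""
    length = npages * PAGE_SIZE
    if k2_base < length or k2_base + length > len(mem):
        raise ValueError("K2 pages must lie above the low pages and inside memory")
    for off in range(0, length, PAGE_SIZE):
        lo = slice(off, off + PAGE_SIZE)
        hi = slice(k2_base + off, k2_base + off + PAGE_SIZE)
        # Right-hand side is evaluated first, so this is a true exchange.
        mem[lo], mem[hi] = bytes(mem[hi]), bytes(mem[lo])

NPAGES = 10                   # "the low few (~10) pages"
K2_BASE = 32 * PAGE_SIZE      # toy load address standing in for the real one
mem = bytearray(64 * PAGE_SIZE)
mem[0:NPAGES * PAGE_SIZE] = b"\x11" * (NPAGES * PAGE_SIZE)                  # K1 head
mem[K2_BASE:K2_BASE + NPAGES * PAGE_SIZE] = b"\x22" * (NPAGES * PAGE_SIZE)  # K2 head
swap_low_pages(mem, K2_BASE, NPAGES)
```

After the call, K2's first pages sit at address zero and K1's original low pages are preserved inside the reserved region, so they can still be dumped later.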

I've got patches which go some way to doing that. At the moment my kernel 
linked at 64MB explodes as soon as we try to lmb_alloc() - not sure why yet.

cheers

On Thu, 18 Aug 2005 16:58, Michael Ellerman wrote:
> Some of you have already seen this, but in case anyone else has any
> thoughts here it is.
>
> I'll post a follow up with a summary of people's comments so far.
>
> ------------------
>
> For kdump, we have two options WRT where we run the second kernel.
> Either we swap the old and new kernels, and the second kernel runs at
> address zero, or we run the second kernel somewhere else. More below.
>
> Let me know what you think.
>
> Terminology:
>
> K1: The first kernel that's booted - the original kernel. The kernel that's
>     not kexec'ed.
>
> K2: The second kernel is the kernel that the first kernel boots, aka. the
>     capture kernel, crash kernel, kexec kernel.
>
> Capture Kernel at Zero
> ----------------------
>
> 1. K1 boots.
> 2. K1 reserves memory (the "reserved region") for K2. (via
>    "crashkernel=x@y")
> 2. User/script runs kexec and loads K2 into reserved region.
>
>   Layout at this point:
>                          /------------- reserved --------------\
>   ---------------------------------------------------------------------------
>   | K1 Image | K1 memory | K2 Image |                          | K1 memory  |
>   ---------------------------------------------------------------------------
>      ^
>      NIP
>
> 3. K1 panics, calls machine_kexec() which:
> 4. Allocates a temporary stack in the reserved region.
> 5. Copies some shim code into the reserved region.
> 6. Enters real mode. (or after step 7?)
> 7. Switches to new stack, jumps to shim.
>
>   Layout at this point:
>                          /------------- reserved --------------\
>   ---------------------------------------------------------------------------
>   | K1 Image | K1 memory | K2 Image | Stack | Shim |           | K1 memory  |
>   ---------------------------------------------------------------------------
>                                               ^
>                                               NIP
>
> 8. The shim exchanges the two kernels, stops as soon as K2 is completely
>    copied to zero.
>
>   Layout at this point:
>                          /------------- reserved --------------\
>   ---------------------------------------------------------------------------
>   | K2 Image | K1 memory | K1 Image | Stack | Shim |           | K1 memory  |
>   ---------------------------------------------------------------------------
>                                               ^
>                                               NIP
>
> 9.  The shim finishes, we jump to zero and start running K2.
> 10. K1 and its memory are reserved from the POV of K2. The memory we used
>     for the temp stack and shim is used as K2's memory.
>
>   Layout at this point:
>              /------ reserved ------\                          /- reserved -\
>   ---------------------------------------------------------------------------
>   | K2 Image | K1 memory | K1 Image | K2 memory                | K1 memory  |
>   ---------------------------------------------------------------------------
>    ^
>    NIP
>
> PROBLEMS:
>  - If "K2 Image" is larger than "K1 Image" we'll overwrite some of K1's
>    memory with the K2 image, this could be bad if we're DMA'ing to that
>    memory.
>  - We could fix that by always making the reserved region start at
>    klimit, eg:
>
>              /------------- reserved --------------\
>   ---------------------------------------------------------------------------
>   | K1 Image | K2 Image |                          | K1 memory              |
>   ---------------------------------------------------------------------------
>
>   But that eats up low memory for K1, do we care? (RTAS does)
>
>  - Come to think of it, do we ever DMA to static data? (ie. in the kernel
>    image) That would really screw us up.
>  - We need to run the shim in real mode, otherwise it'll need page tables,
>    fault handlers etc. (right ??)
>  - And that forces the reserved region to be in the RMO == 256 MB (??)
>  - We might be saved from DMA troubles if we clear the TCE tables before
>    booting K2, or not - perhaps the DMA continues regardless of the TCE
>    mapping going away.
>  - Other stuff?
>
>
> Capture Kernel at non-Zero
> --------------------------
>
> 1. K1 boots.
> 2. K1 reserves memory (the "reserved region") for K2. (via
>    "crashkernel=x@y")
> 2. User/script runs kexec and loads K2 into reserved region.
>    NB. K2 must be linked to run at a non-zero address, except for some/all
>    of head.S (????) A PIC kernel might help, but might be impossible (??)
>
>   Layout at this point:
>                          /------------- reserved --------------\
>   ---------------------------------------------------------------------------
>   | K1 Image | K1 memory | K2 Image |                          | K1 memory  |
>   ---------------------------------------------------------------------------
>      ^
>      NIP
>
> 3. K1 panics, calls machine_kexec() which:
> 4. Allocates a temporary stack in the reserved region.
> 5. Copies some shim code into the reserved region.
> 6. Enters real mode. (??)
> 7. Switches to new stack, jumps to shim.
>
>   Layout at this point:
>                          /------------- reserved --------------\
>   ---------------------------------------------------------------------------
>   | K1 Image | K1 memory | K2 Image | Stack | Shim |           | K1 memory  |
>   ---------------------------------------------------------------------------
>                                               ^
>                                               NIP
>
> 8. The shim swaps the low few (? ~10) pages of K2 with the same few pages
>  from K1 (this is essentially head.S, ie. exception vectors etc.)
> 9. These pages are modified (somehow) so that they jump to the right places
>  in the K2 image (or can we do this at link time?)
>
>   Layout at this point:
>                               /----------- reserved -------------\
>   ---------------------------------------------------------------------------
>   |K2 .. | .. end K1 | K1 mem | .. end K2 | Stack | Shim | K1 .. | K1 memory|
>   ---------------------------------------------------------------------------
>     |                              ^                ^
>     \------------------------------|               NIP
>        points into here
>
> 10. The shim finishes, we jump to zero and start running K2.
> 11. K1 and its memory are reserved from the POV of K2. The memory we used
>     for the temp stack and shim is used as K2's memory.
>
>   Layout at this point:
>
>          /--- reserved -------\                         /----- reserved ----\
>   ---------------------------------------------------------------------------
>   |K2 .. | .. end K1 | K1 mem | K2 Image |              | K1 .. | K1 memory |
>   ---------------------------------------------------------------------------
>     ^  |                              ^
>   NIP \------------------------------|
>        points into here
>
> PROBLEMS:
>  - We need to run the shim in real mode, otherwise it'll need page tables,
>    fault handlers etc. (right ??)
>  - And that forces the reserved region to be in the RMO == 256 MB (??)
>  - Need to audit KERNELBASE/PAGE_OFFSET usage, link kernel at different
>    address.
>  - Have to have a specially built K2 (ie. linked at !0)
>  - Can we even build PIC?
>  - Might hit other gotchas, ie. code that assumes start == 0.
>  - Not sure how we make the exception handlers jump into K2 correctly.
>
> --
> Michael Ellerman
> IBM OzLabs
>
> email: michael:ellerman.id.au
> inmsg: mpe:jabber.org
> wwweb: http://michael.ellerman.id.au
> phone: +61 2 6212 1183 (tie line 70 21183)
>
> We do not inherit the earth from our ancestors,
> we borrow it from our children. - S.M.A.R.T Person
