Fwd: Options for kdump kernel location
Michael Ellerman
michaele at au.ibm.com
Thu Aug 18 16:58:17 EST 2005
Some of you have already seen this, but in case anyone else has any thoughts,
here it is.
I'll post a follow-up with a summary of people's comments so far.
------------------
For kdump, we have two options WRT where we run the second kernel.
Either we swap the old and new kernels, and the second kernel runs at address
zero, or we run the second kernel somewhere else. More below.
Let me know what you think.
Terminology:
K1: The first kernel that's booted - the original kernel. The kernel that's
not kexec'ed.
K2: The second kernel, i.e. the kernel that K1 boots; a.k.a. the
capture kernel, crash kernel or kexec kernel.
Capture Kernel at Zero
----------------------
1. K1 boots.
2. K1 reserves memory (the "reserved region") for K2 (via "crashkernel=X@Y"),
and later a user or script runs kexec to load K2 into the reserved region.
Layout at this point:
                       /---------- reserved -----------\
--------------------------------------------------------------------
| K1 Image | K1 memory | K2 Image |                    | K1 memory |
--------------------------------------------------------------------
     ^
    NIP
3. K1 panics, calls machine_kexec() which:
4. Allocates a temporary stack in the reserved region.
5. Copies some shim code into the reserved region.
6. Enters real mode. (or after step 7?)
7. Switches to new stack, jumps to shim.
Layout at this point:
                       /---------- reserved -----------\
--------------------------------------------------------------------
| K1 Image | K1 memory | K2 Image | Stack | Shim |     | K1 memory |
--------------------------------------------------------------------
                                             ^
                                            NIP
8. The shim exchanges the two kernels, stopping as soon as K2 has been
completely copied to zero.
Layout at this point:
                       /---------- reserved -----------\
--------------------------------------------------------------------
| K2 Image | K1 memory | K1 Image | Stack | Shim |     | K1 memory |
--------------------------------------------------------------------
                                             ^
                                            NIP
9. The shim finishes; we jump to zero and start running K2.
10. K1 and its memory are reserved from the POV of K2. The memory we used for
the temporary stack and shim is used as K2's memory.
Layout at this point:
           /------ reserved ------\                    /--reserved-\
--------------------------------------------------------------------
| K2 Image | K1 memory | K1 Image |     K2 memory      | K1 memory |
--------------------------------------------------------------------
     ^
    NIP
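A toy C model of steps 8 to 10 may help pin down what "stops as soon as K2 is completely copied to zero" means. It is only a sketch under stated assumptions: physical memory is a flat byte array, the page size is 4 KB, K1's image starts at offset 0 and K2 was loaded at the start of the reserved region (res_base). TOY_PAGE_SIZE, swap_bytes() and shim_swap_to_zero() are hypothetical names; this is not how machine_kexec() or any real shim is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed page size for the toy model. */
#define TOY_PAGE_SIZE 4096u

/* Exchange up to one page of bytes between a and b via a bounce buffer. */
static void swap_bytes(unsigned char *a, unsigned char *b, size_t len)
{
    unsigned char tmp[TOY_PAGE_SIZE];

    assert(len <= sizeof(tmp));
    memcpy(tmp, a, len);
    memcpy(a, b, len);
    memcpy(b, tmp, len);
}

/*
 * Hypothetical shim: swap [0, k2_size) with [res_base, res_base + k2_size)
 * a page at a time and stop once all of K2 sits at zero. Whatever K1 had in
 * [0, k2_size) (its image, plus K1 memory past klimit if K2 is the larger
 * image) ends up at res_base.
 */
static void shim_swap_to_zero(unsigned char *mem, size_t res_base,
                              size_t k2_size)
{
    /* The two ranges must not overlap, or the page-wise swap breaks. */
    assert(k2_size <= res_base);

    for (size_t off = 0; off < k2_size; off += TOY_PAGE_SIZE) {
        size_t len = k2_size - off;

        if (len > TOY_PAGE_SIZE)
            len = TOY_PAGE_SIZE;
        swap_bytes(mem + off, mem + res_base + off, len);
    }
}
```

If k2_size is larger than K1's image, the loop also carries the K1 memory just past klimit into the reserved region and puts K2 bytes in its place, which is the first of the problems listed below.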
PROBLEMS:
- If "K2 Image" is larger than "K1 Image" we'll overwrite some of K1's
memory with the K2 image; this could be bad if we're DMA'ing to that memory.
- We could fix that by always making the reserved region start at klimit,
e.g.:
           /---------------- reserved -----------------\
--------------------------------------------------------------------
| K1 Image | K2 Image |                                | K1 memory |
--------------------------------------------------------------------
But that eats up low memory for K1; do we care? (RTAS does.)
- Come to think of it, do we ever DMA to static data (i.e. into the kernel
image)? That would really screw us up.
- We need to run the shim in real mode, otherwise it'll need page tables,
fault handlers, etc. (right??)
- And that forces the reserved region to lie within the RMO, which is
256 MB (??)
- We might be saved from DMA troubles if we clear the TCE tables before
booting K2. Or not: perhaps the DMA continues regardless of the TCE mapping
going away.
- Other stuff?
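Two of these problems, a K2 image larger than K1's and the RMO constraint, reduce to simple checks that could be made when the reserved region is set up. The sketch below is hedged on several assumptions: the RMO is taken to be 256 MB (the "(??)" guess above, which may be wrong), the argument is parsed in the crashkernel=SIZE@BASE form with optional K/M/G suffixes, and parse_size(), parse_crashkernel(), k2_clobbers_k1_memory() and region_in_rmo() are hypothetical helpers, not the kernel's own parser.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: RMO size of 256 MB. */
#define ASSUMED_RMO_SIZE (256ULL << 20)

struct region {
    uint64_t base;
    uint64_t size;
};

/* Parse a number with an optional K/M/G suffix, e.g. "64M". */
static uint64_t parse_size(const char *s, char **end)
{
    uint64_t v = strtoull(s, end, 0);

    switch (**end) {
    case 'G': case 'g': v <<= 10; /* fall through */
    case 'M': case 'm': v <<= 10; /* fall through */
    case 'K': case 'k': v <<= 10; (*end)++; break;
    default: break;
    }
    return v;
}

/* Parse "crashkernel=SIZE@BASE" into r; returns false if malformed. */
static bool parse_crashkernel(const char *arg, struct region *r)
{
    const char *prefix = "crashkernel=";
    char *p;

    assert(arg != NULL && r != NULL);
    if (strncmp(arg, prefix, strlen(prefix)) != 0)
        return false;
    r->size = parse_size(arg + strlen(prefix), &p);
    if (*p != '@')
        return false;
    r->base = parse_size(p + 1, &p);
    return *p == '\0' && r->size != 0;
}

/* Swapping at zero: a bigger K2 overwrites K1 memory just past klimit. */
static bool k2_clobbers_k1_memory(uint64_t k1_image_size,
                                  uint64_t k2_image_size)
{
    return k2_image_size > k1_image_size;
}

/* A real-mode shim can only reach memory inside the (assumed) RMO. */
static bool region_in_rmo(const struct region *r)
{
    return r->base + r->size <= ASSUMED_RMO_SIZE;
}
```

For example, "crashkernel=64M@32M" ends at 96 MB and passes the RMO check, while "crashkernel=128M@192M" ends at 320 MB and fails it.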
Capture Kernel at non-Zero
--------------------------
1. K1 boots.
2. K1 reserves memory (the "reserved region") for K2 (via "crashkernel=X@Y"),
and later a user or script runs kexec to load K2 into the reserved region.
NB. K2 must be linked to run at a non-zero address, except for some or all
of head.S (????). A position-independent (PIC) kernel might help, but might
be impossible (??).
Layout at this point:
                       /---------- reserved -----------\
--------------------------------------------------------------------
| K1 Image | K1 memory | K2 Image |                    | K1 memory |
--------------------------------------------------------------------
     ^
    NIP
3. K1 panics, calls machine_kexec() which:
4. Allocates a temporary stack in the reserved region.
5. Copies some shim code into the reserved region.
6. Enters real mode. (??)
7. Switches to new stack, jumps to shim.
Layout at this point:
                       /---------- reserved -----------\
--------------------------------------------------------------------
| K1 Image | K1 memory | K2 Image | Stack | Shim |     | K1 memory |
--------------------------------------------------------------------
                                             ^
                                            NIP
8. The shim swaps the low few (? ~10) pages of K2 with the same few pages
from K1 (this is essentially head.S, i.e. the exception vectors, etc.)
9. These pages are modified (somehow) so that they jump to the right places
in the K2 image (or can we do this at link time?)
Layout at this point:
                             /------------ reserved ------------\
-----------------------------------------------------------------------------
| K2 .. | .. end K1 | K1 mem | .. end K2 | Stack | Shim | K1 .. | K1 memory |
-----------------------------------------------------------------------------
   |                               ^                ^
   \-------------------------------|               NIP
          points into here
10. The shim finishes; we jump to zero and start running K2.
11. K1 and its memory are reserved from the POV of K2. The memory we used for
the temporary stack and shim is used as K2's memory.
Layout at this point:
/--- reserved -------\ /----- reserved ----\
---------------------------------------------------------------------------
|K2 .. | .. end K1 | K1 mem | K2 Image | | K1 .. | K1 memory |
---------------------------------------------------------------------------
^ | ^
NIP \------------------------------|
points into here
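Steps 8 and 9 can be modelled roughly as follows, purely as a sketch. Real exception vectors are instructions, not a table of addresses, so this assumes a toy head page holding absolute branch targets, a head of TOY_HEAD_PAGES = 10 pages (the "~10" guess above), and a K2 head whose targets were built relative to zero. swap_head_pages() and redirect_vectors() are hypothetical; redirect_vectors() is one possible runtime answer to "modified (somehow)", while the link-time alternative stays open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HEAD_PAGES 10   /* assumption: "? ~10" head pages */
#define TOY_VECTORS     4   /* assumption: a few vectors per page */

/* Toy head page: absolute addresses its vectors branch to. */
struct toy_head_page {
    uint64_t target[TOY_VECTORS];
};

/* Step 8: exchange K1's head pages (at zero) with K2's head pages. */
static void swap_head_pages(struct toy_head_page *low,
                            struct toy_head_page *k2_head, size_t n)
{
    assert(n <= TOY_HEAD_PAGES);
    for (size_t i = 0; i < n; i++) {
        struct toy_head_page tmp = low[i];

        low[i] = k2_head[i];
        k2_head[i] = tmp;
    }
}

/*
 * Step 9 (hypothetical runtime fixup): the K2 head pages now at zero were
 * built with targets relative to zero, so add K2's load address to each
 * target to make the vectors jump into the K2 image in the reserved region.
 */
static void redirect_vectors(struct toy_head_page *low, size_t n,
                             uint64_t k2_base)
{
    for (size_t i = 0; i < n; i++)
        for (size_t v = 0; v < TOY_VECTORS; v++)
            low[i].target[v] += k2_base;
}
```

Doing the fixup at link time instead would mean building K2's head.S with targets that already include its load address, which ties K2 to one reserved-region location.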
PROBLEMS:
- We need to run the shim in real mode, otherwise it'll need page tables,
fault handlers, etc. (right??)
- And that forces the reserved region to lie within the RMO, which is
256 MB (??)
- Need to audit KERNELBASE/PAGE_OFFSET usage so that the kernel can be linked
at a different address.
- Have to have a specially built K2 (i.e. linked at a non-zero address).
- Can we even build PIC?
- Might hit other gotchas, e.g. code that assumes start == 0.
- Not sure how we make the exception handlers jump into K2 correctly.
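The KERNELBASE/PAGE_OFFSET audit can be illustrated with a small model. Here the two are assumed to share the value 0xC000000000000000 when the kernel runs at zero, which makes it easy for code to use them interchangeably. Once K2 is linked at a non-zero physical address, the start of the kernel image and the start of the linear mapping diverge, and code that mixes them up silently assumes start == 0. kernelbase(), toy_pa(), toy_va() and buggy_pa() are hypothetical stand-ins, not the real __pa()/__va() macros.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed start of the kernel's linear mapping on ppc64. */
#define TOY_PAGE_OFFSET 0xC000000000000000ULL

/*
 * Hypothetical: virtual address of the start of the kernel image when it
 * is loaded at physical address load_phys. Equal to TOY_PAGE_OFFSET only
 * when load_phys == 0, which is why the two are easy to confuse.
 */
static uint64_t kernelbase(uint64_t load_phys)
{
    assert(load_phys < TOY_PAGE_OFFSET);
    return TOY_PAGE_OFFSET + load_phys;
}

/* Linear-map conversions: these depend only on PAGE_OFFSET. */
static uint64_t toy_pa(uint64_t va) { return va - TOY_PAGE_OFFSET; }
static uint64_t toy_va(uint64_t pa) { return pa + TOY_PAGE_OFFSET; }

/*
 * The kind of usage an audit would flag: subtracting the image base where
 * the linear-map base is meant. Correct only when the kernel runs at zero.
 */
static uint64_t buggy_pa(uint64_t va, uint64_t load_phys)
{
    return va - kernelbase(load_phys);
}
```

With load_phys == 0 toy_pa() and buggy_pa() agree, which is how such mix-ups could go unnoticed until K2 is linked elsewhere; with K2 at 32 MB, buggy_pa() returns an address 32 MB too low.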
--
Michael Ellerman
IBM OzLabs
email: michael:ellerman.id.au
inmsg: mpe:jabber.org
wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)
We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
More information about the Linuxppc64-dev mailing list