[Skiboot] [PATCH v8 00/24] MPIPL support
hegdevasant at linux.vnet.ibm.com
Fri Jun 28 19:37:59 AEST 2019
On 06/28/2019 06:17 AM, Nicholas Piggin wrote:
> Vasant Hegde's on June 17, 2019 3:10 am:
>> Memory Preserving Initial Program Load (MPIPL) is a Power feature where
>> the contents of memory are preserved while the system reboots after a
>> failure. This is accomplished by the firmware/OS publishing ranges of
>> memory to be preserved across boots.
>> In the OPAL context, OPAL and host Linux communicate the memory ranges
>> to be preserved via source descriptor tables in the HDAT. OPAL and Linux
>> can update these tables during runtime. OPAL sends relocated OPAL base
>> address to SBE. When OPAL or Linux crashes, SBE gets to know of the
>> event via a special interrupt which causes it to trigger the MPIPL.
>> SBE then collects architected register data and loads Hostboot. Hostboot
>> then re-IPLs the machine taking care to copy over contents of the source
>> descriptor tables to alternate memory locations and publishes this
>> information in the destination descriptor tables. The success/failure
>> of the copy is indicated by a results table. Hostboot also copies
>> architected register states to OPAL passed memory.
>> On an MPIPL boot, OPAL creates a new device tree property to indicate that
>> it is an MPIPL boot (/ibm,opal/dump/mpipl-boot). Linux makes an MPIPL API call to
>> get metadata pointers. Kernel uses metadata information to create
>> vmcore and opalcore.
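
As a rough illustration of the copy step described above (source ranges copied to alternate locations, with success or failure recorded per range), here is a minimal Python sketch. It is a toy model only: memory is a `bytearray`, and the names `Range` and `preserve` and the layout of the results entries are assumptions made for this example, not the HDAT table formats or anything Hostboot actually implements.

```python
from dataclasses import dataclass


@dataclass
class Range:
    addr: int  # start offset in the simulated memory
    size: int  # length in bytes


def preserve(memory, sources, destinations):
    """Copy each source range to its destination and return a results table.

    Hypothetical helper: models only the idea of source table ->
    destination table -> results table, not any real firmware interface.
    """
    results = []
    for src, dst in zip(sources, destinations):
        # The copy succeeds only if the destination is large enough and
        # both ranges lie entirely inside the simulated memory.
        ok = (src.size <= dst.size
              and src.addr + src.size <= len(memory)
              and dst.addr + src.size <= len(memory))
        if ok:
            memory[dst.addr:dst.addr + src.size] = \
                memory[src.addr:src.addr + src.size]
        results.append({"src": src.addr, "dest": dst.addr,
                        "size": src.size, "ok": ok})
    return results


memory = bytearray(64)
memory[0:4] = b"oops"                      # pretend crash-time data
sources = [Range(0, 4), Range(60, 8)]      # second range runs past the end
destinations = [Range(32, 4), Range(40, 8)]
results = preserve(memory, sources, destinations)
```

In this toy run the first range is copied to offset 32 and marked successful, while the second is rejected because it extends beyond the simulated memory, so a consumer of the results table knows which preserved ranges it can trust.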
>> - Hostboot relies on MDST, MDDT, MDRT ntuple in HDAT for MPIPL.
>> - During boot/runtime, OPAL will update the MDST and MDDT tables.
>> - Kernel will create a metadata area which contains source, destination
>> address, size etc.
>> - Kernel will use MPIPL API for registration
>> - It will pass src, dest, size to OPAL
>> - Pass metadata tag to OPAL. OPAL will preserve this tag across
>> MPIPL and pass it back to kernel during MPIPL boot.
>> - Kernel -OR- OPAL will request MPIPL.
>> - On FSP system OPAL will trigger attn instruction
>> - On BMC system OPAL will trigger SBE S0 interrupt
>> - SBE quiesces the system and collects CPU register state of running
>> - SBE -> hostboot -> memory preserved + CPU data copied to OPAL reserved
>> memory -> load OPAL
>> - OPAL validates DUMP result table and adds `mpipl-boot` device tree property
>> - Kernel detects that it is an MPIPL boot.
>> - Kernel will use MPIPL query tag API to retrieve metadata tags.
>> - Kernel will create `vmcore` and `opalcore`
>> - Use existing crash tool to debug `vmcore` and gdb to debug `opalcore`
>> - We need Linux kernel changes to generate opal core.
>> Hari will post Linux side patches.
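
The registration and tag flow in the list above can be sketched as a small Python model. This is a sketch under stated assumptions, not the actual OPAL API: the class `ToyFirmware` and its methods `register_range`, `register_tag`, `trigger_mpipl` and `query_tag` are hypothetical names, and returning `None` for a tag query on a normal boot is an assumption of this example. Only the trigger mechanisms (attn instruction on FSP, SBE S0 interrupt on BMC) are taken from the text.

```python
# Trigger mechanism per platform, as listed in the cover letter.
TRIGGERS = {"FSP": "attn instruction", "BMC": "SBE S0 interrupt"}


class ToyFirmware:
    """Hypothetical stand-in for the firmware side of the MPIPL flow."""

    def __init__(self):
        self.ranges = []         # (src, dest, size) tuples to preserve
        self.tags = {}           # tag name -> value, kept across MPIPL
        self.mpipl_boot = False  # models the `mpipl-boot` property

    def register_range(self, src, dest, size):
        # Kernel passes src, dest, size for each range to preserve.
        self.ranges.append((src, dest, size))

    def register_tag(self, name, value):
        # Kernel passes a metadata tag that must survive the MPIPL.
        self.tags[name] = value

    def trigger_mpipl(self, platform):
        # Model the reboot: tags survive and the next boot is flagged.
        mechanism = TRIGGERS[platform]
        self.mpipl_boot = True
        return mechanism

    def query_tag(self, name):
        # Assumption: tags are only handed back on an MPIPL boot.
        if not self.mpipl_boot:
            return None
        return self.tags.get(name)


fw = ToyFirmware()
fw.register_range(src=0x1000, dest=0x8000, size=0x100)
fw.register_tag("kernel", 0x2000)
before = fw.query_tag("kernel")      # normal boot: nothing to hand back
mechanism = fw.trigger_mpipl("BMC")  # crash on a BMC system
after = fw.query_tag("kernel")       # MPIPL boot: tag is returned
```

The point of the tag is that the kernel does not need to rediscover its metadata area after the MPIPL boot; it asks the firmware for the value it registered before the crash and builds `vmcore` and `opalcore` from there.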
>> Impact on kernel:
>> Upstream kernel has `fadump` (Firmware Assisted Dump) feature on PowerVM
>> LPAR. This works on top of kdump and uses the same vmcore format. From the
>> kernel's point of view, this extends the fadump feature to OPAL-based systems.
>> User space:
>> We are reusing existing kernel/user space infrastructure. Hence this
>> feature is transparent to end user. User can use existing crash tool
>> to debug `vmcore` and gdb to debug `opalcore`.
>> CPU register data collection:
>> Before initiating the crash, the kernel saves the running thread's register
>> content. Control then goes to SBE, which quiesces the system and
>> collects CPU register content for all applicable threads.
>> The kernel uses this data to create vmcore.
>> We had an offline discussion with Nick. One of his suggestions was to use
>> kernel SRESET IPI to collect secondary CPU register data. Technically
>> it is possible to use SRESET, but that is still not completely
>> watertight. We can switch to that down the line when SRESET works
>> reliably and we find a way to collect secondary CPU data for OPAL
> I would prefer a Linux initiated crash dump to follow the normal Linux
> crash process which is SRESET. I think it's much better for Linux to
> manage its own registers and the SRESET facility has to be #1 priority
> to support reasonable crash handling (e.g., xmon, kdump, etc).
I think SRESET is a separate topic. I don't want to mix that here.
> I would have completely nacked the SBE register collection as
> unnecessary and over complication of interfaces except that it could
> be used for a BMC initiated dump with Linux completely out of the
> picture. That seems like an interesting feature; although I would have
> preferred to make the Linux/sreset approach work first, I won't quibble
> about it.
Yeah. Once the Linux approach of collecting registers becomes as reliable as the
firmware one, we can consider switching to that. It's just a matter of changing some
>> Hostboot and SBE side changes are merged and available in upstream
>> op-build. We can use upstream op-build to build PNOR with MPIPL
>> - Capture OPAL crashing CPU information
>> Current patchset relies on SBE to capture OPAL crashing CPU
>> information. We may miss some of the important register
>> information. In future we will enhance OPAL to collect crashing
>> CPU details.
> What do you mean by this? I thought you didn't have the OPAL crash
> part of the series in here?
As we agreed I have dropped early OPAL crash. We still have late OPAL crash support.
In OPAL, today we do not have a facility to capture crashing CPU information. So we
use SBE provided register state data for crashing CPU as well.