[Skiboot] [PATCH v8 00/24] MPIPL support

Fri Jun 28 19:37:59 AEST 2019

On 06/28/2019 06:17 AM, Nicholas Piggin wrote:
> Vasant Hegde's on June 17, 2019 3:10 am:
>> Memory Preserving Initial Program Load (MPIPL) is a Power feature where
>> the contents of memory are preserved while the system reboots after a
>> failure. This is accomplished by the firmware/OS publishing ranges of
>> memory to be preserved across boots.
>>
>> In the OPAL context, OPAL and host Linux communicate the memory ranges
>> to be preserved via source descriptor tables in the HDAT. OPAL and Linux
>> can update these tables during runtime. OPAL sends relocated OPAL base
>> address to SBE. When OPAL or Linux crashes, SBE gets to know of the
>> event via a special interrupt which causes it to trigger the MPIPL.
>>
>> SBE then collects architected register data and loads Hostboot. Hostboot
>> then re-IPLs the machine taking care to copy over contents of the source
>> descriptor tables to alternate memory locations and publishes this
>> information in the destination descriptor tables. The success/failure
>> of the copy is indicated by a results table. Hostboot also copies
>> architected register states to OPAL passed memory.
>>
>> On an MPIPL boot, OPAL creates a new device tree property to indicate an
>> MPIPL boot (/ibm,opal/dump/mpipl-boot). Linux makes an MPIPL API call to
>> get metadata pointers. The kernel uses the metadata information to create
>> vmcore and opalcore.
>>
>> Flow:
>>    - Hostboot relies on MDST, MDDT, MDRT ntuple in HDAT for MPIPL.
>>    - During boot/runtime, OPAL will update MDST and MDDT table.
>>    - Kernel will create metadata area which contains source, destination
>>      address, size etc.
>>    - Kernel will use MPIPL API for registration
>>      - It will pass src, dest, size to OPAL
>>      - Pass metadata tag to OPAL. OPAL will preserve this tag across
>>        MPIPL and pass it back to kernel during MPIPL boot.
>>    - Kernel -OR- OPAL will request MPIPL.
>>       - On FSP system OPAL will trigger attn instruction
>>       - On BMC system OPAL will trigger SBE S0 interrupt
>>    - SBE quiesces the system and collects CPU register state of running
>>      threads.
>>    - SBE -> hostboot -> memory preserved + CPU data copied to OPAL reserved
>>      memory -> load OPAL
>>    - OPAL validates DUMP result table and adds `mpipl-boot` device tree property
>>    - Kernel detects that it is an MPIPL boot.
>>    - Kernel will use MPIPL query tag API to retrieve metadata tags.
>>    - Kernel will create `vmcore` and `opalcore`
>>    - Use existing crash tool to debug `vmcore` and gdb to debug `opalcore`
>>
>> Dependency:
>>    - We need Linux kernel changes to generate opal core.
>>      Hari will post Linux side patches.
>>
>> Impact on kernel:
>>    Upstream kernel has `fadump` (Firmware Assisted Dump) feature on PowerVM
>>    LPAR. This works on top of kdump and uses same vmcore format. From kernel
>>    point of view, this is extending fadump feature for OPAL based system.
>>
>> User space:
>>    We are reusing existing kernel/user space infrastructure. Hence this
>>    feature is transparent to end user. User can use existing crash tool
>>    to debug `vmcore` and gdb to debug `opalcore`.
>>
>> CPU register data collection:
>>    Before initiating the crash, the kernel will save the running thread's
>>    register content. Then control goes to SBE. SBE will quiesce
>>    the system and collect CPU register content for all applicable threads.
>>    Kernel will use these data to create vmcore.
>>
>>    We had an offline discussion with Nick. One of his suggestions was to use
>>    kernel SRESET IPI to collect secondary CPU register data. Technically
>>    it is possible to use SRESET, but that is still not completely
>>    watertight. We can switch to that down the line when SRESET works
>>    reliably and we find a way to collect secondary CPU data for OPAL
>>    dump.
> 
> I would prefer a Linux initiated crash dump to follow the normal Linux
> crash process which is SRESET. I think it's much better for Linux to
> manage its own registers and the SRESET facility has to be #1 priority
> to support reasonable crash handling (e.g., xmon, kdump, etc).

I think SRESET is a separate topic. I don't want to mix that in here.

> 
> I would have completely nacked the SBE register collection as
> unnecessary and over complication of interfaces except that it could
> be used for a BMC initiated dump with Linux completely out of the
> picture. That seems like an interesting feature, although I would have
> preferred to make the Linux/sreset approach work first I won't quibble
> about it.

Yeah. Once the Linux approach of collecting registers becomes as reliable as the
firmware one, we can consider switching to that. It's just a matter of changing
some kernel code.

> 
>>
>> Testing:
>>    The Hostboot and SBE side changes are merged and available in upstream
>>    op-build. We can use upstream op-build to build PNOR with MPIPL
>>    support.
>>
>> TODO:
>>    - Capture OPAL crashing CPU information
>>      Current patchset relies on SBE to capture OPAL crashing CPU
>>      information. We may miss some of the important register
>>      information. In future we will enhance OPAL to collect crashing
>>      CPU details.
> 
> What do you mean by this? I thought you didn't have the OPAL crash
> part of the series in here?

As we agreed, I have dropped early OPAL crash support. We still have late OPAL
crash support. Today OPAL has no facility to capture crashing CPU information,
so we use the SBE-provided register state data for the crashing CPU as well.

-Vasant
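
For readers following the Flow list in the cover letter, the registration, copy, validation and tag-query steps can be sketched roughly as below. This is only a toy model under stated assumptions: every name in it (`struct mpipl_state`, `register_range()`, `register_tag()`, `query_tag()`, `preserve_memory()`, `validate_results()`, `mpipl_demo()`) is hypothetical, the "memory" is a small byte array, and none of it reflects the real HDAT table layouts or the OPAL call signatures in the patches.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model for illustration; not skiboot's real structures or API. */
#define MAX_RANGES 4
#define MEM_SIZE   64

struct range {                  /* one source/destination pair */
    uint64_t src, dest, size;
};

struct mpipl_state {
    struct range ranges[MAX_RANGES];   /* what was registered */
    struct range results[MAX_RANGES];  /* what the copy step reported */
    int nr_ranges, nr_results;
    uint64_t kernel_tag;               /* opaque metadata tag from the kernel */
    int tag_valid;
};

/* Registration: the kernel passes src, dest and size. */
static int register_range(struct mpipl_state *st, uint64_t src,
                          uint64_t dest, uint64_t size)
{
    if (size == 0 || size > MEM_SIZE || st->nr_ranges >= MAX_RANGES)
        return -1;
    if (src > MEM_SIZE - size || dest > MEM_SIZE - size)
        return -1;
    st->ranges[st->nr_ranges++] = (struct range){ src, dest, size };
    return 0;
}

/* The tag is stored so it can be handed back after the MPIPL boot. */
static void register_tag(struct mpipl_state *st, uint64_t tag)
{
    st->kernel_tag = tag;
    st->tag_valid = 1;
}

static int query_tag(const struct mpipl_state *st, uint64_t *tag)
{
    if (!st->tag_valid)
        return -1;
    *tag = st->kernel_tag;
    return 0;
}

/* Stand-in for the re-IPL copy: copy each range and record a result entry. */
static void preserve_memory(struct mpipl_state *st, uint8_t *mem)
{
    assert(st->nr_ranges <= MAX_RANGES);
    st->nr_results = 0;
    for (int i = 0; i < st->nr_ranges; i++) {
        const struct range *r = &st->ranges[i];
        memmove(mem + r->dest, mem + r->src, r->size);
        st->results[st->nr_results++] = *r;
    }
}

/* On the MPIPL boot, every registered range needs a matching result. */
static int validate_results(const struct mpipl_state *st)
{
    if (st->nr_results != st->nr_ranges)
        return -1;
    for (int i = 0; i < st->nr_ranges; i++) {
        if (st->results[i].src != st->ranges[i].src ||
            st->results[i].dest != st->ranges[i].dest ||
            st->results[i].size != st->ranges[i].size)
            return -1;
    }
    return 0;
}

/* Walks the whole flow once; returns 0 if the data and tag survive. */
int mpipl_demo(void)
{
    static uint8_t mem[MEM_SIZE];
    struct mpipl_state st = { 0 };
    uint64_t tag = 0;

    memcpy(mem, "crashdata", 9);
    if (register_range(&st, 0, 32, 9) != 0)
        return -1;
    register_tag(&st, 0x1000);
    preserve_memory(&st, mem);
    if (validate_results(&st) != 0)
        return -1;
    if (query_tag(&st, &tag) != 0 || tag != 0x1000)
        return -1;
    return memcmp(mem + 32, "crashdata", 9) == 0 ? 0 : -1;
}
```

Calling `mpipl_demo()` from a small `main()` exercises the whole sequence. In the real flow the copy is done by Hostboot and the validation by OPAL, so the single in-process state used here is purely a simplification.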