[Skiboot] [PATCH 4/4] IPMI: Introduce attention call

Mon Aug 31 15:03:18 AEST 2015

Patrick Williams <patrick at stwcx.xyz> writes:
> On Thu, Aug 20, 2015 at 12:33:34PM +0530, Vasant Hegde wrote:
>>   - This patch calls cec_reboot to reboot machine after logging eSEL event.
>>     It queues IPMI message and bt_poll() should be working until we pass
>>     reboot IPMI message to BMC. Hence we have while loop with time_wait_ms().
>>     Alternatively we can use xscom_trigger_xstop().. but it will stop
>>     immediately and eSEL logging fails.
>
> I understand the desire for more data in many cases of abort but it
> seems very dangerous to me to always do the IPMI path.  There are some
> abort calls that are obvious more critical than others and should be
> interpreted as "shutdown as quickly as possible because I'm in a real bad
> state."

I agree.

The current FSP code initially calls op_display() which ends up using
preallocated (well, static) fsp_msg structs to do a sync call to set the
operator panel *before* even printing a backtrace to the console.

Precisely, it's:
	op_display(OP_FATAL, OP_MOD_CORE, 0x6666);
 	prlog(PR_EMERG, "Aborting!\n");
 	backtrace();

If you delve into what fsp_sync_msg() does (that op_display() calls), we
end up doing:

        while(fsp_msg_busy(msg)) {
                cpu_relax();
                opal_run_pollers();
        }

which is likely going to end up with two things:
1) a warning of "pollers called with lock held" if you managed to crash
   while holding a lock (quite likely, a lot of interesting things
   happen with locks held).
2) something being sent to the FSP and hopefully returning back okay.
   (or wec ould crash somewhere because there's a problem with
   talking to the FSP, but I don't think I've ever seen this)

But this is before we've even printed "Aborting" to the in memory log
buffer, which is the most likely thing to succeed.

The only risk in sending the FSP message is if writing to the registers
is going to crash us harder or if we have some really serious memory
corruption going on in exactly the wrong place. I don't think we've ever
seen this be a problem though... or at least things fail gracefully
enough that it's not really a problem.

I think we could do an async message for op_display() here rather than a
sync one though, I think we'd be less likely to get into trouble (and
sync messages are evil anyway)

As for sending a message out to the BMC... at least at this point, it
seems that the bt interface is a bit more flakey than the fsp one (very
unscientific opinion) and perhaps it is something we want to do at the
end, after dumping out as much as possible to a memory buffer. If we're
going to have one path and not also have a crit_assert() path, then
*this* path for sending IPMI message really *must not* do memory
allocation.

I think we may just have to audit the crash code paths to not do any
allocation and fail fairly gracefully, certainly doing memory allocation
in abort() is a really bad idea and unlikely to end well.

> Two that come to mind to me are:
>   * core/mem_region.c : bad_header(...)
>        - If we trigger the abort here, it indicates that the heap is
>          corrupted.  If you try to queue up an IPMI message, which
>          involves a malloc, you're going to end up right back in this
>          code.  Thus, you'll keep overflowing the stack until maybe
>          sometime we end up with a CRESP checkstop?
>   * core/exceptions.c : exception_entry(...)
>        - The specific exception that worries me most is the
>          machine-check.  This could be due to a memory or cache UE, so
>          we want to abort as quickly as possible.  Hopefully, with all
>          your extra code we just end up with an MSR[ME]=0 checkstop
>          instead of a slightly more graceful xscom_trigger_xstop().
>
> In Hostboot we have a 'assert' and 'crit_assert'.  Assert is "I'm in a
> bad state, but not so bad that we cannot collect data about it."
> Crit-Assert is "I'm in a really bad state, so just shutdown."
>
> Do you need something similar here in skiboot?  I would suggest
> reviewing all of the 'abort' calls and confirm that they do not lead
> into bad situations if you try to do "extra" stuff like send an IPMI
> message.

I think those would be the two I'd really care about, but I'd prefer to
be really really safe in abort code paths anyway and not assume that
some were too much safer than others.