[Skiboot] [PATCH 4/4] IPMI: Introduce attention call

Fri Sep 4 03:17:24 AEST 2015

On 08/31/2015 10:33 AM, Stewart Smith wrote:
> Patrick Williams <patrick at stwcx.xyz> writes:
>> On Thu, Aug 20, 2015 at 12:33:34PM +0530, Vasant Hegde wrote:
>>>   - This patch calls cec_reboot to reboot machine after logging eSEL event.
>>>     It queues IPMI message and bt_poll() should be working until we pass
>>>     reboot IPMI message to BMC. Hence we have while loop with time_wait_ms().
>>>     Alternatively we can use xscom_trigger_xstop().. but it will stop
>>>     immediately and eSEL logging fails.
>>
>> I understand the desire for more data in many cases of abort but it
>> seems very dangerous to me to always do the IPMI path.  There are some
>> abort calls that are obvious more critical than others and should be
>> interpreted as "shutdown as quickly as possible because I'm in a real bad
>> state."
> 
> I agree.
> 
> The current FSP code initially calls op_display() which ends up using
> preallocated (well, static) fsp_msg structs to do a sync call to set the
> operator panel *before* even printing a backtrace to the console.
> 
> Precisely, it's:
> 	op_display(OP_FATAL, OP_MOD_CORE, 0x6666);
>  	prlog(PR_EMERG, "Aborting!\n");
>  	backtrace();
> 
> If you delve into what fsp_sync_msg() does (that op_display() calls), we
> end up doing:
> 
>         while(fsp_msg_busy(msg)) {
>                 cpu_relax();
>                 opal_run_pollers();
>         }
> 
> which is likely going to end up with two things:
> 1) a warning of "pollers called with lock held" if you managed to crash
>    while holding a lock (quite likely, a lot of interesting things
>    happen with locks held).
> 2) something being sent to the FSP and hopefully returning back okay.
>    (or wec ould crash somewhere because there's a problem with
>    talking to the FSP, but I don't think I've ever seen this)
> 
> But this is before we've even printed "Aborting" to the in memory log
> buffer, which is the most likely thing to succeed.
> 
> The only risk in sending the FSP message is if writing to the registers
> is going to crash us harder or if we have some really serious memory
> corruption going on in exactly the wrong place. I don't think we've ever
> seen this be a problem though... or at least things fail gracefully
> enough that it's not really a problem.
> 
> I think we could do an async message for op_display() here rather than a
> sync one though, I think we'd be less likely to get into trouble (and
> sync messages are evil anyway)

Currently we call fsp_sync_msg from op_display for PANIC event.. Rest goes
through async path (You know it better anyway :-) ).

We can switch this to async assuming MBOX message reaches FSP before trigger
attn call...which will initiate SYSDUMP.. Anyway I don't think we should care so
much about whether message reached FSP or not (OP panel guys may care though)..

> 
> As for sending a message out to the BMC... at least at this point, it
> seems that the bt interface is a bit more flakey than the fsp one (very
> unscientific opinion) and perhaps it is something we want to do at the
> end, after dumping out as much as possible to a memory buffer. If we're
> going to have one path and not also have a crit_assert() path, then
> *this* path for sending IPMI message really *must not* do memory
> allocation.

Agreed.. I was thinking of touching IPMI allocation along with my RFC patch for
synchronous error logging for BMC based machine.. I'll include this in next version.

> 
> I think we may just have to audit the crash code paths to not do any
> allocation and fail fairly gracefully, certainly doing memory allocation
> in abort() is a really bad idea and unlikely to end well.

Correct.. AFAIK only IPMI path allocates memory.. Anyway I will audit entire
path again.

-Vasant