[Skiboot] [PATCH 4/4] IPMI: Introduce attention call
Vasant Hegde
hegdevasant at linux.vnet.ibm.com
Fri Sep 4 03:17:24 AEST 2015
On 08/31/2015 10:33 AM, Stewart Smith wrote:
> Patrick Williams <patrick at stwcx.xyz> writes:
>> On Thu, Aug 20, 2015 at 12:33:34PM +0530, Vasant Hegde wrote:
>>> - This patch calls cec_reboot to reboot machine after logging eSEL event.
>>> It queues IPMI message and bt_poll() should be working until we pass
>>> reboot IPMI message to BMC. Hence we have while loop with time_wait_ms().
>>> Alternatively we can use xscom_trigger_xstop().. but it will stop
>>> immediately and eSEL logging fails.
>>
>> I understand the desire for more data in many cases of abort but it
>> seems very dangerous to me to always do the IPMI path. There are some
>> abort calls that are obvious more critical than others and should be
>> interpreted as "shutdown as quickly as possible because I'm in a real bad
>> state."
>
> I agree.
>
> The current FSP code initially calls op_display() which ends up using
> preallocated (well, static) fsp_msg structs to do a sync call to set the
> operator panel *before* even printing a backtrace to the console.
>
> Precisely, it's:
> op_display(OP_FATAL, OP_MOD_CORE, 0x6666);
> prlog(PR_EMERG, "Aborting!\n");
> backtrace();
>
> If you delve into what fsp_sync_msg() does (that op_display() calls), we
> end up doing:
>
> while(fsp_msg_busy(msg)) {
> cpu_relax();
> opal_run_pollers();
> }
>
> which is likely going to end up with two things:
> 1) a warning of "pollers called with lock held" if you managed to crash
> while holding a lock (quite likely, a lot of interesting things
> happen with locks held).
> 2) something being sent to the FSP and hopefully returning back okay.
> (or wec ould crash somewhere because there's a problem with
> talking to the FSP, but I don't think I've ever seen this)
>
> But this is before we've even printed "Aborting" to the in memory log
> buffer, which is the most likely thing to succeed.
>
> The only risk in sending the FSP message is if writing to the registers
> is going to crash us harder or if we have some really serious memory
> corruption going on in exactly the wrong place. I don't think we've ever
> seen this be a problem though... or at least things fail gracefully
> enough that it's not really a problem.
>
> I think we could do an async message for op_display() here rather than a
> sync one though, I think we'd be less likely to get into trouble (and
> sync messages are evil anyway)
Currently we call fsp_sync_msg from op_display for PANIC event.. Rest goes
through async path (You know it better anyway :-) ).
We can switch this to async assuming MBOX message reaches FSP before trigger
attn call...which will initiate SYSDUMP.. Anyway I don't think we should care so
much about whether message reached FSP or not (OP panel guys may care though)..
>
> As for sending a message out to the BMC... at least at this point, it
> seems that the bt interface is a bit more flakey than the fsp one (very
> unscientific opinion) and perhaps it is something we want to do at the
> end, after dumping out as much as possible to a memory buffer. If we're
> going to have one path and not also have a crit_assert() path, then
> *this* path for sending IPMI message really *must not* do memory
> allocation.
Agreed.. I was thinking of touching IPMI allocation along with my RFC patch for
synchronous error logging for BMC based machine.. I'll include this in next version.
>
> I think we may just have to audit the crash code paths to not do any
> allocation and fail fairly gracefully, certainly doing memory allocation
> in abort() is a really bad idea and unlikely to end well.
Correct.. AFAIK only IPMI path allocates memory.. Anyway I will audit entire
path again.
-Vasant
More information about the Skiboot
mailing list