[Skiboot] [PATCH 4/4] IPMI: Introduce attention call

Fri Sep 4 03:07:29 AEST 2015

On 08/20/2015 05:57 PM, Patrick Williams wrote:
> On Thu, Aug 20, 2015 at 12:33:34PM +0530, Vasant Hegde wrote:
>>   - This patch calls cec_reboot to reboot machine after logging eSEL event.
>>     It queues IPMI message and bt_poll() should be working until we pass
>>     reboot IPMI message to BMC. Hence we have while loop with time_wait_ms().
>>     Alternatively we can use xscom_trigger_xstop().. but it will stop
>>     immediately and eSEL logging fails.
> 

Patrick,

Thanks for the review.

> I understand the desire for more data in many cases of abort but it
> seems very dangerous to me to always do the IPMI path.  There are some
> abort calls that are obvious more critical than others and should be
> interpreted as "shutdown as quickly as possible because I'm in a real bad
> state."
> 
> Two that come to mind to me are:
>   * core/mem_region.c : bad_header(...)
>        - If we trigger the abort here, it indicates that the heap is
>          corrupted.  If you try to queue up an IPMI message, which
>          involves a malloc, you're going to end up right back in this
>          code.  Thus, you'll keep overflowing the stack until maybe
>          sometime we end up with a CRESP checkstop?
>   * core/exceptions.c : exception_entry(...)
>        - The specific exception that worries me most is the
>          machine-check.  This could be due to a memory or cache UE, so
>          we want to abort as quickly as possible.  Hopefully, with all
>          your extra code we just end up with an MSR[ME]=0 checkstop
>          instead of a slightly more graceful xscom_trigger_xstop().

Correct. Current code allocates memory in IPMI SEL path.
I'll fix that path.. so that we allocate all required memory in advance..
Similar to what we do for FSP based platform.

On FSP based machine, FSP is capable of copying data from host memory..
So we just need to setup attn area, copy data (backtrace and other data)
and trigger attn...

But in IPMI case, logging eSEL (using BT protocol) is the only way to capture
terminate data
(whatever is possible). So I think lets go with this method for now..

> 
> In Hostboot we have a 'assert' and 'crit_assert'.  Assert is "I'm in a
> bad state, but not so bad that we cannot collect data about it."
> Crit-Assert is "I'm in a really bad state, so just shutdown."
> 
> Do you need something similar here in skiboot?  

Right now we have single abort() call.. *worst* case we endup rebooting with
logging.

>I would suggest
> reviewing all of the 'abort' calls and confirm that they do not lead
> into bad situations if you try to do "extra" stuff like send an IPMI
> message.

Sure.. I'll go through all abort calls.

-Vasant