anti-pattern: unexpected errors and exceptions

Fri Feb 14 03:59:34 AEDT 2020

On 2/12/20 3:47 PM, Justin Thaler wrote:
> At a high level, I think this is a pretty good start. If there is 
> anything I'd add, it is an audience section. Is the log going to a 
> human, or is it intended to be run through an analyzer? What's the 
> human's skill set?
>
> Adding in some comments below

Justin,

Thanks for your input; I've incorporated your ideas (and copied your 
words).  The anti-pattern is ready for review:
https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/29367

I think the anti-pattern is a good start toward better OpenBMC 
serviceability, but it can only go so far.  For example, log analyzers 
and core dump handlers are different topics entirely.  We can discuss 
those when we are ready.

- Joseph

>
> On 2/10/20 10:36 PM, Joseph Reynolds wrote:
>> We're adding a new anti-pattern, "handling unexpected error codes 
>> and exceptions", to address code that just logs errors and continues.  
>> The idea is to think about what diagnostic data to capture, when the 
>> application can and cannot recover, using core dumps to debug and 
>> improve BMC code, how the BMC recovers from failed services, 
>> cascading service failures, and having to reboot the BMC.  It's about 
>> balancing the benefits of a core dump against keeping the BMC running.
>>
>> The anti-pattern (draft below) is only addressing the first part of 
>> that: capturing data and recovering vs crashing.  I plan to push it 
>> for review to our [anti-pattern doc][].  Please take a look ... and 
>> pass these along to your service strategist.  :)
>>
>> [anti-pattern doc]: 
>> https://github.com/openbmc/docs/blob/master/anti-patterns.md
>>
>> - Joseph
>>
>> __________
>>
>> Here is the draft anti-pattern: Handling unexpected error codes and 
>> exceptions
>>
>> Identification:
>> The anti-pattern is to continue processing after unexpected error 
>> codes and exceptions.
>>
>> Description:
>> Suppressing unexpected errors can lead an application to incorrect or 
>> erratic behavior, which can affect a systemd service and the overall 
>> system.  Further, merely logging errors may clutter the log and not 
>> give "real" problems the attention they deserve, so developers 
>> don't get a chance to investigate problems and the system's 
>> reliability does not improve over time.
>>
>> Background:
>> Programmers are unsure how to handle unexpected conditions, don't 
>> know if it is acceptable for a service to terminate, and may not 
>> fully understand the BMC's service strategy.  So they write code to 
>> log errors and continue processing when it may be better to terminate 
>> an application, restart a service, or handle a situation in ways 
>> outside the scope of an application.
>>
>> Resolution:
>> Several items are needed:
>> 1. Check all return codes and account for all possible values.
>> 2. Have a good reason to handle specific errors and consider using a 
>> default handler to throw an exception.
>> 3. Have a good reason to handle specific exceptions and allow other 
>> exceptions to propagate.
> 4. What downstream services are impacted by this service being 
> restarted or lost?
> 5. Should the BMC restarting be left up to the system administrator?
>>
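
A minimal Python sketch of resolution items 1 to 3 above, written for
illustration and not taken from OpenBMC code. The names
UnexpectedErrorCode, handle_return_code and run_step are hypothetical,
and the choice of EAGAIN and TimeoutError as the "known" cases is an
assumption made only for the example.

```python
import errno


class UnexpectedErrorCode(Exception):
    """Hypothetical: a call returned a code the caller has no plan for."""


def handle_return_code(rc):
    """Account for every return code; handle only the ones we understand."""
    if rc == 0:
        return "ok"
    if rc == errno.EAGAIN:
        # Known and recoverable: the caller may retry later.
        return "retry"
    # Default handler (item 2): raise instead of logging and continuing.
    raise UnexpectedErrorCode(f"unexpected return code {rc}")


def run_step(step):
    """Call step() and interpret its return code.

    Only TimeoutError is caught, because a retry is a known, safe
    recovery for it (item 3). Every other exception propagates.
    """
    try:
        return handle_return_code(step())
    except TimeoutError:
        return "retry"
```

With this shape, run_step(lambda: errno.EIO) raises UnexpectedErrorCode
rather than returning a value the caller might silently ignore.
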
>> For error handlers:
>> - Consider what data (if any) should be logged.  How will the log 
>> entry be used?  For example, log real hardware errors. Don't log 
>> recoverable errors.  For other situations, what data would you need 
>> to debug the problem (first failure data capture)?  Would a core dump 
>> be useful?
> In relation to Andrew's note: should the log be created for debug only?
>> - Determine if the application can fully recover from the condition.  
>> If not, don't continue. 
>
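
The two error-handler bullets above could be sketched as follows. This
is an illustration under assumptions, not OpenBMC code:
RecoverableError, process and the "example-service" logger name are
hypothetical, and the sketch assumes that something outside the
application (for example a service manager) decides what happens after
the exception escapes.

```python
import logging

log = logging.getLogger("example-service")  # hypothetical service name


class RecoverableError(Exception):
    """Hypothetical: a condition the application can fully recover from."""


def process(request, handler):
    """Run handler(request), continuing only when recovery is complete."""
    try:
        return handler(request)
    except RecoverableError:
        # Fully recoverable: nothing is logged, a safe default is returned.
        return None
    except Exception:
        # Not recoverable here: capture first-failure data (message plus
        # traceback), then re-raise so the application does not continue
        # in an unknown state.
        log.exception("unrecoverable failure while handling %r", request)
        raise
```

The bare raise keeps the original exception and traceback, so whatever
handles the failure next sees the first failure and not a wrapper.
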
> Audience:
> You've decided to create the log entry. Who are you targeting to 
> review the log? A BMC developer, an administrator, or some analysis 
> program? Usually the answer is more than one of these.
> For example:
> Suppose an IPMI request tries to enable network access, but the 
> user input is invalid.
>
> BMC Developer: Reference internal applications, services, PIDs, etc. 
> that the developer would be familiar with.
>     - Example: ipmid service successfully processed a network setting 
> packet, however the user input of USB0 is not a valid network 
> interface to configure.
>
> Administrator: They'll be familiar with the external interfaces of the 
> BMC such as the REST API. They can respond to feedback about invalid 
> input, or a need to restart the BMC.
>     - Example: The network interface of USB0 is not a valid option. 
> Retry the IPMI command with a valid interface.
>
> Analyzer: Consider breaking the log down and including several 
> properties which an analyzer can leverage. For instance, tagging the 
> log with just 'Internal' is not helpful and invites a defect report. 
> However, breaking that down into something like 
> [UserInput][IPMI][Network] tells at a quick glance that the input 
> received for configuring the network via an IPMI command was invalid. 
> Categorization and system impact are key things to focus on when 
> creating logs for an analysis application.
>     - Example: [UserInput][IPMI][Network][Config][Warning] Interface 
> USB0 not valid.
>
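
One way the bracketed tags in the analyzer example could be produced
and read back, as a small Python sketch. The functions format_log and
parse_log are hypothetical helpers invented for this illustration; the
tag syntax is simply the one shown in the example above, not an
existing OpenBMC log format.

```python
import re

_TAG = re.compile(r"\[([^\[\]]+)\]")


def format_log(tags, message):
    """Prefix message with one bracketed tag per category."""
    if not tags:
        raise ValueError("at least one tag is required")
    return "".join(f"[{tag}]" for tag in tags) + " " + message


def parse_log(entry):
    """Split an entry into its leading tags and the remaining message."""
    tags = []
    pos = 0
    while True:
        match = _TAG.match(entry, pos)
        if match is None:
            break
        tags.append(match.group(1))
        pos = match.end()
    return tags, entry[pos:].lstrip()


entry = format_log(
    ["UserInput", "IPMI", "Network", "Config", "Warning"],
    "Interface USB0 not valid.",
)
tags, message = parse_log(entry)
```

Here entry is the string from the example above, and an analyzer can
filter on tags (say, everything tagged "UserInput") without having to
interpret the free-text message.
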
>>
>> Logging and continuing may be appropriate for some errors, but its 
>> use must be carefully considered.
>>
>
> Thanks,
> Justin