anti-pattern: unexpected errors and exceptions

Thu Feb 13 08:47:44 AEDT 2020

On a high level, I think this is a pretty good start. If there was 
anything I'd add to it, it would be an audience section. Is the log going 
to a human, or is it intended to be run through an analyzer? What's the 
human's skill set?

Adding in some comments below

On 2/10/20 10:36 PM, Joseph Reynolds wrote:
> We're addressing a new anti-pattern "handling unexpected error codes and 
> exceptions" to address code that just logs errors and continues.  The 
> idea is to think about what diagnostic data to capture, when the 
> application can and cannot recover, using core dumps to debug and 
> improve BMC code, how the BMC recovers from failed services, cascading 
> service failures, and having to reboot the BMC.  It's about balancing 
> the benefits of a core dump against keeping the BMC running.
> 
> The anti-pattern (draft below) is only addressing the first part of 
> that: capturing data and recovering vs crashing.  I plan to push it for 
> review to our [anti-pattern doc][].  Please take a look ... and pass 
> these along to your service strategist.  :)
> 
> [anti-pattern doc]: 
> https://github.com/openbmc/docs/blob/master/anti-patterns.md
> 
> - Joseph
> 
> __________
> 
> Here is the draft anti-pattern: Handling unexpected error codes and 
> exceptions
> 
> Identification:
> The anti-pattern is to continue processing after unexpected error codes 
> and exceptions.
> 
> Description:
> Suppressing unexpected errors can lead an application to incorrect or 
> erratic behavior which can affect a systemd service and the overall 
> system.  Further, merely logging errors may clutter the log and not give 
> "real" problems the attention they deserve, so developers don't get a 
> chance to investigate problems and the system's reliability does not 
> improve over time.
> 
> Background:
> Programmers are unsure how to handle unexpected conditions, don't know 
> if it is acceptable for a service to terminate, and may not fully 
> understand the BMC's service strategy.  So they write code to log errors 
> and continue processing when it may be better to terminate an 
> application, restart a service, or handle a situation in ways outside 
> the scope of an application.
> 
> Resolution:
> Several items are needed:
> 1. Check all return codes and account for all possible values.
> 2. Have a good reason to handle specific errors and consider using a 
> default handler to throw an exception.
> 3. Have a good reason to handle specific exceptions and allow other 
> exceptions to propagate.
4. What downstream services are impacted by this service being restarted 
or lost?
5. Should restarting the BMC be left up to the system administrator?
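
To make items 1-3 concrete, here is a rough sketch in Python (purely for 
illustration; the Status codes, UnexpectedStatusError, handle_status and 
read_setting are names I made up, not existing OpenBMC code). The idea is 
to handle only the cases there is a plan for, and let a default handler 
raise instead of logging and continuing.

```python
import enum


class Status(enum.IntEnum):
    """Hypothetical return codes from some lower-level call."""
    OK = 0
    BUSY = 1
    INVALID_ARGUMENT = 2
    IO_ERROR = 3


class UnexpectedStatusError(RuntimeError):
    """Raised for any return code the caller has no plan for."""


def handle_status(status):
    """Return True to proceed, False to retry later; raise otherwise."""
    if status == Status.OK:
        return True
    if status == Status.BUSY:
        # Handled on purpose: the caller knows how to retry.
        return False
    # Default handler: anything else is unexpected, so do not continue.
    raise UnexpectedStatusError(f"unhandled status {status!r}")


def read_setting(settings, key):
    """Handle only the one exception we know how to recover from."""
    try:
        return settings[key]
    except KeyError:
        # A missing key has a well-defined fallback.
        return None
    # Any other exception (TypeError, etc.) propagates to the caller.
```

Whether the raised exception should end the process (and leave the restart 
to the service manager) is exactly the question items 4 and 5 are asking.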
> 
> For error handlers:
> - Consider what data (if any) should be logged.  How will the log entry 
> be used?  For example, log real hardware errors.  Don't log recoverable 
> errors.  For other situations, what data would you need to debug the 
> problem (first failure data capture)?  Would a core dump be useful?
In relation to Andrew's note: should the log be created for debug only?
> - Determine if the application can fully recover from the condition.  If 
> not, don't continue. 

Audience:
You've decided to create the log entry. Who are you targeting to review 
the log? A BMC developer, an administrator, or some analysis program? 
Usually the answer is more than one of these.
For example:
Take an IPMI request to turn network access on, where the user input is 
invalid.

BMC Developer: Reference internal applications, services, PIDs, etc. that 
the developer would be familiar with.
     - Example: ipmid service successfully processed a network setting 
packet, however the user input of USB0 is not a valid network interface 
to configure.

Administrator: They'll be familiar with the external interfaces of the 
BMC such as the REST API. They can respond to feedback about invalid 
input, or a need to restart the BMC.
     - Example: The network interface of USB0 is not a valid option. 
Retry the IPMI command with a valid interface.

Analyzer: Consider breaking the log down and including several 
properties which an analyzer can leverage. For instance, tagging the 
log with 'Internal' is asking to get a defect written, as it's not 
helpful. However, breaking that down into something like 
[UserInput][IPMI][Network] tells at a quick glance that the input 
received for configuring the network via an IPMI command was invalid. 
Categorization and system impact are key things to focus on when 
creating logs for an analysis application.
     - Example: [UserInput][IPMI][Network][Config][Warning] Interface 
USB0 not valid.
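
One way to serve all three audiences from a single event is to keep the 
pieces separate and render them per reader. A minimal sketch in Python, 
for illustration only: LogEvent and its fields are hypothetical names, 
not an existing OpenBMC logging API, and the messages are shortened 
versions of the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    """Hypothetical structured log event; all field names are assumptions."""
    tags: tuple      # categories, e.g. ("UserInput", "IPMI", "Network")
    severity: str    # system impact, e.g. "Warning"
    service: str     # internal service name, meaningful to a developer
    detail: str      # what went wrong
    action: str      # what an administrator can do about it

    def for_developer(self):
        # Developers get the internal service name.
        return f"{self.service} service: {self.detail}"

    def for_administrator(self):
        # Administrators get the problem plus a corrective action.
        return f"{self.detail} {self.action}"

    def for_analyzer(self):
        # Analyzers get bracketed tags, with severity as the last tag.
        prefix = "".join(f"[{t}]" for t in (*self.tags, self.severity))
        return f"{prefix} {self.detail}"


event = LogEvent(
    tags=("UserInput", "IPMI", "Network", "Config"),
    severity="Warning",
    service="ipmid",
    detail="Interface USB0 not valid.",
    action="Retry the IPMI command with a valid interface.",
)
print(event.for_analyzer())
```

Keeping the tags as data rather than baking them into the message string 
means an analyzer does not have to parse free text to categorize the entry.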

> 
> Logging and continuing may be appropriate for some errors, but its use 
> must be carefully considered.
> 

Thanks,
Justin