anti-pattern: unexpected errors and exceptions

Tue Feb 11 15:36:05 AEDT 2020

We're proposing a new anti-pattern, "handling unexpected error codes and 
exceptions", aimed at code that just logs errors and continues.  The 
idea is to think about what diagnostic data to capture, when the 
application can and cannot recover, the use of core dumps to debug and 
improve BMC code, how the BMC recovers from failed services, cascading 
service failures, and having to reboot the BMC.  It's about balancing 
the benefits of a core dump against keeping the BMC running.

The anti-pattern (draft below) addresses only the first part of 
that: capturing data and recovering vs crashing.  I plan to push it for 
review to our [anti-pattern doc][].  Please take a look ... and pass 
it along to your service strategist.  :)

[anti-pattern doc]: 
https://github.com/openbmc/docs/blob/master/anti-patterns.md

- Joseph

__________

Here is the draft anti-pattern: Handling unexpected error codes and 
exceptions

Identification:
The anti-pattern is to continue processing after unexpected error codes 
and exceptions.

Description:
Suppressing unexpected errors can lead an application to incorrect or 
erratic behavior, which can affect a systemd service and the overall 
system.  Further, merely logging errors may clutter the log and not give 
"real" problems the attention they deserve, so developers don't get a 
chance to investigate problems and the system's reliability does not 
improve over time.

Background:
Programmers are unsure how to handle unexpected conditions, don't know 
if it is acceptable for a service to terminate, and may not fully 
understand the BMC's service strategy.  So they write code to log errors 
and continue processing when it may be better to terminate an 
application, restart a service, or handle a situation in ways outside 
the scope of an application.

Resolution:
Several items are needed:
1. Check all return codes and account for all possible values.
2. Have a good reason to handle specific errors and consider using a 
default handler to throw an exception.
3. Have a good reason to handle specific exceptions and allow other 
exceptions to propagate.
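
To illustrate the three items above, here is a minimal C++ sketch.  It 
is not taken from any OpenBMC code: `ReadStatus`, `classifyReadResult`, 
and `parseOrDefault` are hypothetical names, and the choice of which 
codes count as "expected" is an assumption made for the example.

```cpp
#include <cerrno>
#include <stdexcept>
#include <string>
#include <system_error>

// Hypothetical result type for this sketch; not an OpenBMC API.
enum class ReadStatus
{
    ok,
    retry
};

// Items 1 and 2: account for every return code. Only codes we have a
// reason to handle get their own case; the default handler throws.
ReadStatus classifyReadResult(int rc)
{
    switch (rc)
    {
        case 0:
            return ReadStatus::ok;
        case EAGAIN:
        case EINTR:
            // Assumed transient for this example: the caller may retry.
            return ReadStatus::retry;
        default:
            // Unexpected code: do not guess, raise it to the caller.
            throw std::system_error(rc, std::generic_category(),
                                    "unexpected read result");
    }
}

// Item 3: catch only the exception we have a reason to handle and let
// everything else propagate.
int parseOrDefault(const std::string& text, int fallback)
{
    try
    {
        return std::stoi(text);
    }
    catch (const std::invalid_argument&)
    {
        // Not a number at all: the fallback is a deliberate choice here.
        return fallback;
    }
    // std::out_of_range is deliberately not caught and reaches the caller.
}
```

The point of the default case and the narrow catch is that an 
unanticipated condition surfaces instead of being silently absorbed.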

For error handlers:
- Consider what data (if any) should be logged.  How will the log entry 
be used?  For example, log real hardware errors.  Don't log recoverable 
errors.  For other situations, what data would you need to debug the 
problem (first failure data capture)?  Would a core dump be useful?
- Determine if the application can fully recover from the condition.  If 
not, don't continue.
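
As a rough sketch of those two handler questions in C++ (again with 
hypothetical names: `SensorBusy`, `runStep`, and `failFast` are made up 
for this example), a handler can treat one known recoverable condition 
quietly and refuse to continue on anything else.  Whether `std::abort` 
actually yields a core dump depends on how the system is configured, so 
treat that part as an assumption.

```cpp
#include <cstdlib>
#include <functional>
#include <iostream>
#include <stdexcept>

// Hypothetical exception for a condition this service can recover from.
struct SensorBusy : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

// Run one unit of work. Returns true on success and false when the one
// known recoverable condition occurred. Any other exception escapes.
bool runStep(const std::function<void()>& step)
{
    try
    {
        step();
        return true;
    }
    catch (const SensorBusy&)
    {
        // Recoverable and expected: no log entry, the caller retries later.
        return false;
    }
}

// For conditions the application cannot fully recover from: record the
// data needed to debug the failure, then stop instead of continuing.
[[noreturn]] void failFast(const char* what)
{
    std::cerr << "unrecoverable error: " << what << '\n';
    std::abort(); // raises SIGABRT; a core dump may follow if enabled
}
```

A caller would retry when `runStep` returns false, and would either let 
other exceptions propagate or call `failFast` once it has captured the 
first-failure data it needs.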

Logging and continuing may be appropriate for some errors, but its use 
must be carefully considered.