anti-pattern: unexpected errors and exceptions
Joseph Reynolds
jrey at linux.ibm.com
Tue Feb 11 15:36:05 AEDT 2020
We're drafting a new anti-pattern, "handling unexpected error codes and
exceptions", to cover code that just logs errors and continues. The
idea is to think about what diagnostic data to capture, when the
application can and cannot recover, using core dumps to debug and
improve BMC code, how the BMC recovers from failed services, cascading
service failures, and having to reboot the BMC. It's about balancing
the benefits of a core dump against keeping the BMC running.
The anti-pattern (draft below) addresses only the first part of
that: capturing data and recovering vs. crashing. I plan to push it for
review to our [anti-pattern doc][]. Please take a look ... and pass
these along to your service strategist. :)
[anti-pattern doc]:
https://github.com/openbmc/docs/blob/master/anti-patterns.md
- Joseph
__________
Here is the draft anti-pattern: Handling unexpected error codes and
exceptions
Identification:
The anti-pattern is to continue processing after unexpected error codes
and exceptions.
Description:
Suppressing unexpected errors can lead an application to incorrect or
erratic behavior, which can affect a systemd service and the overall
system. Further, merely logging errors may clutter the log and not give
"real" problems the attention they deserve, so developers don't get a
chance to investigate problems and the system's reliability does not
improve over time.
Background:
Programmers are unsure how to handle unexpected conditions, don't know
if it is acceptable for a service to terminate, and may not fully
understand the BMC's service strategy. So they write code to log errors
and continue processing when it may be better to terminate an
application, restart a service, or handle a situation in ways outside
the scope of an application.
Resolution:
Several items are needed:
1. Check all return codes and account for all possible values.
2. Have a good reason to handle specific errors and consider using a
default handler to throw an exception.
3. Have a good reason to handle specific exceptions and allow other
exceptions to propagate.
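As a rough illustration of the three items above (a sketch only, not
code from any OpenBMC repository): `ReadStatus`, `readRaw`,
`readSensor`, and `parseThreshold` are hypothetical names invented for
this example, and treating a non-numeric setting as "use the fallback"
is an assumed policy.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical status codes for an illustrative low-level read call.
enum class ReadStatus { Ok, NotPresent, IoError };

// Hypothetical stand-in for a driver or library call that returns a code.
ReadStatus readRaw(int sensorId, int& value)
{
    if (sensorId == 1)
    {
        value = 42;
        return ReadStatus::Ok;
    }
    if (sensorId == 2)
    {
        return ReadStatus::NotPresent;
    }
    return ReadStatus::IoError;
}

// Items 1 and 2: the return code is always checked. Only codes with a
// known, intended meaning are handled; everything else reaches the
// default handler, which throws instead of logging and continuing.
std::optional<int> readSensor(int sensorId)
{
    int value = 0;
    const ReadStatus status = readRaw(sensorId, value);
    switch (status)
    {
        case ReadStatus::Ok:
            return value;
        case ReadStatus::NotPresent:
            // Expected condition (assumed): the sensor slot is empty.
            return std::nullopt;
        default:
            throw std::runtime_error(
                "readSensor: unexpected status " +
                std::to_string(static_cast<int>(status)) + " for sensor " +
                std::to_string(sensorId));
    }
}

// Item 3: catch only the exception there is a good reason to handle.
int parseThreshold(const std::string& text, int fallback)
{
    try
    {
        return std::stoi(text);
    }
    catch (const std::invalid_argument&)
    {
        // Assumed policy: a non-numeric setting means "use the fallback".
        return fallback;
    }
    // std::out_of_range and any other exception propagate to the caller.
}
```

Here `readSensor(3)` throws rather than returning a made-up value, and
`parseThreshold` does not swallow a value that is numeric but out of
range.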
For error handlers:
- Consider what data (if any) should be logged. How will the log entry
be used? For example, log real hardware errors. Don't log recoverable
errors. For other situations, what data would you need to debug the
problem (first failure data capture)? Would a core dump be useful?
- Determine if the application can fully recover from the condition. If
not, don't continue.
Logging and continuing may be appropriate for some errors, but its use
must be carefully considered.
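A minimal sketch of an error handler along these lines, under stated
assumptions: `TransientBusError`, `Outcome`, and `runStep` are
hypothetical names, and the retry-once policy is invented for
illustration. The recoverable case is handled without a log entry; the
unexpected case records what is known and then refuses to continue.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>

// Hypothetical exception for a condition the application is assumed to
// be able to fully recover from.
struct TransientBusError : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

enum class Outcome { Done, Retried };

Outcome runStep(const std::function<void()>& step)
{
    try
    {
        step();
        return Outcome::Done;
    }
    catch (const TransientBusError&)
    {
        // Recoverable and expected: retry once and do not log.
        // If the retry also fails, its exception propagates.
        step();
        return Outcome::Retried;
    }
    catch (const std::exception& e)
    {
        // Unexpected: capture first-failure data, then rethrow because
        // the application cannot claim to have recovered.
        std::cerr << "runStep: unexpected failure: " << e.what() << '\n';
        throw;
    }
}
```

If the rethrown exception is not caught anywhere, the C++ runtime calls
`std::terminate`, which by default aborts the process. Whether that
produces a core dump and a service restart depends on how the system
and the systemd unit are configured.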