During the design phase of a system or subsystem, one issue that comes up frequently is how to handle errors and abnormal situations. Many programmers see error handling as never-ending, tedious work that can never catch every error or abnormal scenario.
Jeff Atwood, in Coding Horror, offers an interesting way of looking at error handling. In short, he gives reasoning for prioritizing bugs according to how often users encounter them, along with some other technical aspects of bug handling. Most importantly, he divides bugs and unexpected events into two kinds:
- Events the application handles gracefully: it prevents information loss or inconsistency and stays stable enough to continue working.
- Events so serious that the application fails ungracefully.
In our current development phase we ran into a problem while testing the database. The partitioning scheme could cause the database to hang for up to 45 minutes because of a malfunctioning periodic job. This kind of hang can cause errors of the second type, where there is a complete system failure (imagine 45 minutes without access to the information). More problematic is the fact that the problem has no obvious explanation or reason. From this point, several steps must be taken, each one aimed at solving part of the problem. First, the DB should be stress-tested to understand why it hangs. Second, there should be an external process that monitors the partitioning job.
In parallel, we need to analyze the partitioning scheme once more and try to distinguish between the problems that originate from the scheme itself and the problems that originate from the implementation of the functions.
In the end it comes down to the error scenarios: how likely each situation is to happen and how large the impact of the abnormal scenario would be. As with many other issues, managing resources and prioritizing is the key.
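The external monitor mentioned above could look something like the following minimal Python sketch. It is only an illustration, not what we actually run: the class name `JobWatchdog`, the `max_silence` threshold and the in-memory heartbeat are all assumptions. In a real setup the partitioning job would more likely write its heartbeat timestamp to a table or a file that a separate process reads.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class JobWatchdog:
    """Hypothetical watchdog: flags a periodic job that has gone silent."""

    max_silence: float                          # seconds allowed between heartbeats
    clock: Callable[[], float] = time.monotonic  # injectable so it can be tested
    last_heartbeat: Optional[float] = None

    def heartbeat(self) -> None:
        # Called by (or on behalf of) the job each time it makes progress.
        self.last_heartbeat = self.clock()

    def is_stalled(self) -> bool:
        # A job that has never reported is treated as "not started", not stalled.
        if self.last_heartbeat is None:
            return False
        return self.clock() - self.last_heartbeat > self.max_silence
```

The monitoring process would call `is_stalled()` on a schedule and raise an alert (or kill and restart the job) when it returns `True`, so a hang is noticed within minutes rather than after the whole system has been blocked.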
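One simple way to make that prioritization concrete is to score each scenario as likelihood times impact and sort by the result. The sketch below assumes exactly that scoring rule; the names `ErrorScenario` and `prioritize` are hypothetical, and the numbers in the usage example are made up for illustration, not measurements from our system.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ErrorScenario:
    name: str
    likelihood: float  # estimated probability of occurring in a period, 0..1
    impact: float      # estimated cost if it occurs, in arbitrary units

    @property
    def risk(self) -> float:
        # Assumed scoring rule: expected cost = likelihood * impact.
        return self.likelihood * self.impact


def prioritize(scenarios: Iterable[ErrorScenario]) -> List[ErrorScenario]:
    """Return scenarios ordered from highest to lowest risk score."""
    return sorted(scenarios, key=lambda s: s.risk, reverse=True)


if __name__ == "__main__":
    # Illustrative values only.
    backlog = [
        ErrorScenario("cosmetic glitch in a report", likelihood=0.9, impact=0.1),
        ErrorScenario("partitioning job hangs the DB", likelihood=0.05, impact=45.0),
    ]
    for scenario in prioritize(backlog):
        print(f"{scenario.name}: {scenario.risk:.2f}")
```

A rare but severe failure can outrank a frequent but harmless one under this rule, which is the point of weighing both factors instead of frequency alone.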