Sunday, 25 May 2014

Alternative Log Level Structure

Over the years I have become increasingly dissatisfied with the standard log levels used in Java: trace, debug, info, warn, error, fatal.  My main problem with them is that it can be unclear which level a specific log message belongs to.

At first glance it appears very straightforward; however, after working with many talented teams of developers I have observed that the log levels take effort to standardise across a team, and that each team has had its own subtly different guidelines.  Essentially it takes effort to disambiguate edge cases: is a message info or warn? trace or debug? warn or error?  It becomes grey because the level changes based on who the audience for the message is, and typically developers will answer that question from their own viewpoint.

Previously I posted some guidelines on how we used those log levels; a few years on, I am older and have less hair.  I have found it easier to just rename each of the log levels and be done with it.  Previously the desire to be a good log4j citizen stopped me; however, being somewhat older, I am less tolerant of conventions that cause me to lose my hair.  I just don’t have much hair left, and so I need to be more protective of it.   ;)

Here are the log levels that I currently use; I find them to be much more intuitive.

Audit (record of what happened, useful when investigating what the system did and why)

    dev   - diagnostic information useful for developers 
    ops   - diagnostic information useful for people who maintain the system 
    user  - diagnostic information about a user that would be meaningful to a savvy user if they ever saw it, e.g. ‘CK created account’ and ‘CK logged in’

Alerts (requires action)

    warn  - a problem has been detected and mitigated, or a potential problem has been detected that does not yet affect users (action: notify ops/review during office hours, to help pre-empt fires and raise awareness of the state of the system)
    fatal - the proverbial has hit the fan; users cannot make progress (action: wake everybody up)
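
A minimal sketch of how the levels above could be modelled in Java.  The names LogLevel, Kind, requiresAction and audience are hypothetical, chosen for illustration; they are not part of log4j or java.util.logging, and wiring them into a real logging framework is left out.

```java
/**
 * Hypothetical model of the five levels listed above.
 * Each level records who the message is for and whether it calls for action.
 */
enum LogLevel {
    // Audit levels: a record of what happened, no action required.
    DEV(Kind.AUDIT, "developers"),
    OPS(Kind.AUDIT, "people who maintain the system"),
    USER(Kind.AUDIT, "users"),
    // Alert levels: somebody needs to act.
    WARN(Kind.ALERT, "ops, during office hours"),
    FATAL(Kind.ALERT, "everybody, immediately");

    /** The two groups from the post: audits and alerts. */
    enum Kind { AUDIT, ALERT }

    private final Kind kind;
    private final String audience;

    LogLevel(Kind kind, String audience) {
        this.kind = kind;
        this.audience = audience;
    }

    /** True only for alerts (warn and fatal). */
    boolean requiresAction() {
        return kind == Kind.ALERT;
    }

    /** Who the message is written for. */
    String audience() {
        return audience;
    }
}
```

With this sketch, LogLevel.WARN.requiresAction() is true and LogLevel.USER.requiresAction() is false, so the question "does anybody need to act on this?" is answered by the level itself rather than by a relative severity.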

Alerts are special in two ways.  Firstly, they raise awareness of an ongoing problem that has a start and hopefully (eventually) an end.  Thus they need to make it clear that yes, we still have a problem, or yay, the problem has ended.  Secondly, they are a call to action.  Somebody needs to pay attention to alerts, either just by being aware that the system is at reduced capacity (warn), or by getting out of bed, sounding the alarms and rounding up the troops because the entire system is down (fatal).
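
The start/ongoing/end life cycle of an alert can be sketched as a small tracker.  This is only an illustration under assumptions: the class AlertTracker, its method names and the message prefixes are invented here, and a real system would also need thread safety and delivery of the messages to whoever is on call.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/**
 * Hypothetical tracker that remembers which alerts are currently active,
 * so that each message can say whether a problem started, continues, or ended.
 */
final class AlertTracker {
    private final Set<String> active = new LinkedHashSet<>();

    /** Reports a problem; the returned message says whether it is new or ongoing. */
    String raise(String alertId) {
        boolean isNew = active.add(alertId);
        return (isNew ? "ALERT STARTED: " : "ALERT ONGOING: ") + alertId;
    }

    /** Reports that a problem has ended, if it was active. */
    String clear(String alertId) {
        boolean wasActive = active.remove(alertId);
        return wasActive ? "ALERT ENDED: " + alertId : "NO ACTIVE ALERT: " + alertId;
    }

    /** True while the problem is still considered ongoing. */
    boolean isActive(String alertId) {
        return active.contains(alertId);
    }
}
```

Calling raise("db-down") twice yields "ALERT STARTED: db-down" and then "ALERT ONGOING: db-down"; a later clear("db-down") yields "ALERT ENDED: db-down".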

Clearly we do not want many alerts.  If there are too many false alarms then people will ignore them, so alerts need to be very clear, complete, and as rare and accurate as possible.  Most messages will therefore be informative audits and not alerts at all, in which case it must be very clear who the target audience is.  If that audience turns around and says "I do not understand or care", then the message comes out entirely.

I have found that agreeing within a team of developers on who a message is for, and whether they need to take action or not, has far fewer grey areas than agreeing on the severity of that message in relation to another message.  The relative comparison becomes a lot less important.  Thus once a scheme like this is in place, enforcing the rules becomes a lot easier.  The usage is intuitive, so there is little need to keep reminding people (myself included).
