How do you find that one of your friends doesn’t feel well? Normally, they let you know by sending you an email, giving you a call, or walking up and telling you (spreading lots of germs in the process). If they don’t tell you directly, you pick up on other signals. Your friend may be sniffling a lot or sound very congested or be passed out at their desk, drooling all over their keyboard. Conversely, you can see the signs of a relatively healthy person. They sound clear, they have lots of energy or they are telling the guy with the cart, “I feel better. I think I’ll go for a walk!”
Applications need to do the same thing. Communication is the only way you can tell if an application is feeling well or is about to fall apart. If your application can’t talk to you, you have no way of helping it to feel better. The L in LUCID is for logging. Logs are how applications speak. They capture information that allows us to pick up on those little signs and act early enough to make sure the application doesn’t come down with the flu. If you don’t have good logging throughout your system, the best you can do is react to your application falling over. A silent application is an application destined to cause midnight surprises for you and your operations team.
Of course, blabbering constantly isn’t really all that helpful. Logging is not just the act of slathering your code in print statements. Proper logging requires thoughtful consideration and clear communication. Spitting out the string, “ERROR” every time someone tries to log in with the wrong password is not at all useful. The constant noise will cloud the signs of what is really going on. Good logging not only lets you know what is going on, but does it in such a way that you know when to freak out and when to wait it out. Useful logging doesn’t just tell you what happened; it also tells you how much you should worry. Along with information about the event, a log message needs a log level. The log4j family of libraries (log4perl, log4PHP, log4Fortran*, etc.) provide five levels of severity. Selecting the proper level for each message is key to helping you stay asleep when you can afford to stay asleep and waking you up when you need to be woken up. There may be official definitions for different levels somewhere, but here’s a general set of guidelines for what each level means:
Trace – Every method should have a trace call so that you can follow the flow of data through your application.
Debug – Debug statements should exist with enough frequency and information to allow anyone to identify and fix errors at some point in the future.
Info – Info statements are great for metrics. When a new user is registered, you should have an info statement. When a registration fails due to name collisions, you should have an info statement. When a promotion is completed, you should have an info statement.
Warn – Unexpected actions which do not result in data corruption and which the system can reasonably recover from should trigger a warning. For example, a failed attempt to connect to a RabbitMQ server should trigger a warning. This isn’t an error because the system can and should try to connect to a secondary system.
Error – Unexpected conditions which result in lost or corrupted data, or conditions which prevent the system from operating must fire an error. Failure to establish a connection to a database after all possible servers are tried is an error.
Using these conventions you might end up with logs that look something like this if assuming you are only reporting events at INFO level and higher:
Sept 17, 2012 14:25:25 – [INFO] New user account created: smattocks
Sept 17, 2012 14:25:26 – [WARN] Could not connect to database server at host 192.168.1.2. Error message: server not found
Sept 17, 2012 14:25:26 – [WARN] Could not connect to database server at host 192.168.1.3. Error message: server in read only mode
Sept 17, 2012 14:25:26 – [ERROR] Could not connect any databases while trying to process request for /user/changename for user ID 90210. Attempting to change name to “dylan”
Sept 17, 2012 14:25:27 – [WARN] Could not connect to database server at host 192.168.1.2. Error message: server not found
Sept 17, 2012 14:25:27 – [INFO] Connected to secondary database server at host 192.168.1.3. Most likely primary failed and secondary became master.
Sept 17, 2012 14:25:27 – [INFO] Username changed. Old name: brandon; New name: dylan
Sept 17, 2012 14:25:27 – [WARN] User creation failed due to unique name conflicts. Username: dylan, email: dylan@example.com, password (obfuscated for security reasons): xxxxxxxxx
Note: The example of not being able to connect to a database results in at least one warning and an error.
Hopefully, you have noticed a bit of consistency in the log messages above. The anatomy of a good log message can be described in four parts: a timestamp, an appropriate level of severity, a meaningful message, and sufficient contextual information. A timestamp is required to allow one message to be assessed in relation to other log messages from the current and other applications. The level of severity allows people or applications reading the logs to apply varying levels of urgency to their responses and behaviors. Hopefully, the meaningful message is self explanatory. The contextual information is the part most often omitted.
Letting you know that something went wrong is really only half of a log messages job. The other responsibilities of a log message are to allow you to accurately reproduce the conditions under which the event occurred, and to allow you to fix any data inconsistencies. Knowing that a user’s registration got bungled mid-process is good. Knowing which data caused the error and being able to fix the busted account is even better. A good log message needs to be part diagnostics and part remedy. If you don’t have the data you need to fix issues in your database, you are only fixing half of the problem.
Without logging, your application is a black box. You cannot see what is going on inside and you cannot fix bugs with reasonable speed. Having your application talk to you in an intelligent manner through proper logging makes your life much easier down the road. Heck, it even makes your life easier right now. You can’t consider your application part of your team unless it can provide feedback and help to inform your decisions. The first step in making your code LUCID is to make it “speak”. After that, you can start to make your code psychic by adding unit tests (the subject of my next post).
* I don’t think log4Fortran actually exists, but if someone creates one, I would vote for calling it log4tran.