L is for Logging

How do you find that one of your friends doesn’t feel well? Normally, they let you know by sending you an email, giving you a call, or walking up and telling you (spreading lots of germs in the process). If they don’t tell you directly, you pick up on other signals. Your friend may be sniffling a lot or sound very congested or be passed out at their desk, drooling all over their keyboard. Conversely, you can see the signs of a relatively healthy person. They sound clear, they have lots of energy or they are telling the guy with the cart, “I feel better. I think I’ll go for a walk!”

Applications need to do the same thing. Communication is the only way you can tell if an application is feeling well or is about to fall apart. If your application can’t talk to you, you have no way of helping it to feel better. The L in LUCID is for logging. Logs are how applications speak. They capture information that allows us to pick up on those little signs and act early enough to make sure the application doesn’t come down with the flu. If you don’t have good logging throughout your system, the best you can do is react to your application falling over. A silent application is an application destined to cause midnight surprises for you and your operations team.

Of course, blabbering constantly isn’t really all that helpful. Logging is not just the act of slathering your code in print statements. Proper logging requires thoughtful consideration and clear communication. Spitting out the string, “ERROR” every time someone tries to log in with the wrong password is not at all useful. The constant noise will cloud the signs of what is really going on. Good logging not only lets you know what is going on, but does it in such a way that you know when to freak out and when to wait it out. Useful logging doesn’t just tell you what happened; it also tells you how much you should worry. Along with information about the event, a log message needs a log level.┬áThe log4j family of libraries (log4perl, log4PHP, log4Fortran*, etc.) provide five levels of severity. Selecting the proper level for each message is key to helping you stay asleep when you can afford to stay asleep and waking you up when you need to be woken up. There may be official definitions for different levels somewhere, but here’s a general set of guidelines for what each level means:

Trace – Every method should have a trace call so that you can follow the flow of data through your application.

Debug – Debug statements should exist with enough frequency and information to allow anyone to identify and fix errors at some point in the future.

Info – Info statements are great for metrics. When a new user is registered, you should have an info statement. When a registration fails due to name collisions, you should have an info statement. When a promotion is completed, you should have an info statement.

Warn – Unexpected actions which do not result in data corruption and which the system can reasonably recover from should trigger a warning. For example, a failed attempt to connect to a RabbitMQ server should trigger a warning. This isn’t an error because the system can and should try to connect to a secondary system.

Error – Unexpected conditions which result in lost or corrupted data, or conditions which prevent the system from operating must fire an error. Failure to establish a connection to a database after all possible servers are tried is an error.

Using these conventions you might end up with logs that look something like this if assuming you are only reporting events at INFO level and higher:

Sept 17, 2012 14:25:25 – [INFO] New user account created: smattocks
Sept 17, 2012 14:25:26 – [WARN] Could not connect to database server at host 192.168.1.2. Error message: server not found
Sept 17, 2012 14:25:26 – [WARN] Could not connect to database server at host 192.168.1.3. Error message: server in read only mode
Sept 17, 2012 14:25:26 – [ERROR] Could not connect any databases while trying to process request for /user/changename for user ID 90210. Attempting to change name to “dylan”
Sept 17, 2012 14:25:27 – [WARN] Could not connect to database server at host 192.168.1.2. Error message: server not found
Sept 17, 2012 14:25:27 – [INFO] Connected to secondary database server at host 192.168.1.3. Most likely primary failed and secondary became master.
Sept 17, 2012 14:25:27 – [INFO] Username changed. Old name: brandon; New name: dylan
Sept 17, 2012 14:25:27 – [WARN] User creation failed due to unique name conflicts. Username: dylan, email: dylan@example.com, password (obfuscated for security reasons): xxxxxxxxx

Note: The example of not being able to connect to a database results in at least one warning and an error.

Hopefully, you have noticed a bit of consistency in the log messages above. The anatomy of a good log message can be described in four parts: a timestamp, an appropriate level of severity, a meaningful message, and sufficient contextual information. A timestamp is required to allow one message to be assessed in relation to other log messages from the current and other applications. The level of severity allows people or applications reading the logs to apply varying levels of urgency to their responses and behaviors. Hopefully, the meaningful message is self explanatory. The contextual information is the part most often omitted.

Letting you know that something went wrong is really only half of a log messages job. The other responsibilities of a log message are to allow you to accurately reproduce the conditions under which the event occurred, and to allow you to fix any data inconsistencies. Knowing that a user’s registration got bungled mid-process is good. Knowing which data caused the error and being able to fix the busted account is even better. A good log message needs to be part diagnostics and part remedy. If you don’t have the data you need to fix issues in your database, you are only fixing half of the problem.

Without logging, your application is a black box. You cannot see what is going on inside and you cannot fix bugs with reasonable speed. Having your application talk to you in an intelligent manner through proper logging makes your life much easier down the road. Heck, it even makes your life easier right now. You can’t consider your application part of your team unless it can provide feedback and help to inform your decisions. The first step in making your code LUCID is to make it “speak”. After that, you can start to make your code psychic by adding unit tests (the subject of my next post).

* I don’t think log4Fortran actually exists, but if someone creates one, I would vote for calling it log4tran.

11 Thoughts.

  1. I use log levels in a similar way, with one distinction: I use error levels that map to the health of the application, instead of the services it uses. So failing to connect to the primary DB raises a WARNING. If it can’t connect to the secondary DB and thus does not work with full functionality, *that* would be an ERROR condition.

    • The example log I have does the same thing. First, the application tried to connect to the primary database. That failed so a WARNING was logged. Next it tried to connect to the secondary database. That one was in read only mode. The application knows that the current request requires a database write, so it logged another WARNING about the failure to connect to a useable database. At this point, it ran out of options and threw an ERROR.

      In general, if an application can continue to work (albeit under less desirable circumstances) it should try to do so. If the application can’t recover, and therefore can’t fulfill the requested operation, it should throw an ERROR. ERRORs should be addressed ASAP. WARNINGs may be able to be shelved until the morning. They key to maintaining sanity is making sure that what you consider an ERROR is something that really requires immediate attention.

  2. The problem is that not everyone reads the logs – in fact, rare are those (even the system administrators) who read actual logs.

    Logs say everything about the application – and when it comes to proactive anti-hacking measures, logs are the best as they reveal some information that tell you if your website is about to be hacked.

    Maybe there should be a dedicated role, like “Log Reader”, but my guess that the person doing this will end up not reading logs anyway.

    • Right. If no one reads your logs, it doesn’t matter how good they are. However, there are applications out there that do the reading for you. If you provide consistency and proper messaging, those applications can alert you via text message or email when things begin to go south. Additionally, other applications can read the logs and let you know when things are going well. For instance, if you begin to see an uptick in registrations, you may want to beef up your servers. Being able to see a trend based on your log output helps you make the right decision about how many more servers you will need.

      Logging stuff doesn’t actually solve any problems. Logging simply opens the gates for other people/applications to solve them for you.

  3. Pingback: U is for Unit Tests | Crisscott.com

  4. Pingback: D is for Documentation | Crisscott.com

  5. Pingback: Centralized Logging with Monolog, Logstash, and Elasticsearch

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>