L is for Logging

How do you find that one of your friends doesn’t feel well? Normally, they let you know by sending you an email, giving you a call, or walking up and telling you (spreading lots of germs in the process). If they don’t tell you directly, you pick up on other signals. Your friend may be sniffling a lot or sound very congested or be passed out at their desk, drooling all over their keyboard. Conversely, you can see the signs of a relatively healthy person. They sound clear, they have lots of energy or they are telling the guy with the cart, “I feel better. I think I’ll go for a walk!”

Applications need to do the same thing. Communication is the only way you can tell if an application is feeling well or is about to fall apart. If your application can’t talk to you, you have no way of helping it to feel better. The L in LUCID is for logging. Logs are how applications speak. They capture information that allows us to pick up on those little signs and act early enough to make sure the application doesn’t come down with the flu. If you don’t have good logging throughout your system, the best you can do is react to your application falling over. A silent application is an application destined to cause midnight surprises for you and your operations team.

Of course, blabbering constantly isn’t really all that helpful. Logging is not just the act of slathering your code in print statements. Proper logging requires thoughtful consideration and clear communication. Spitting out the string, “ERROR” every time someone tries to log in with the wrong password is not at all useful. The constant noise will cloud the signs of what is really going on. Good logging not only lets you know what is going on, but does it in such a way that you know when to freak out and when to wait it out. Useful logging doesn’t just tell you what happened; it also tells you how much you should worry. Along with information about the event, a log message needs a log level.┬áThe log4j family of libraries (log4perl, log4PHP, log4Fortran*, etc.) provide five levels of severity. Selecting the proper level for each message is key to helping you stay asleep when you can afford to stay asleep and waking you up when you need to be woken up. There may be official definitions for different levels somewhere, but here’s a general set of guidelines for what each level means:

Trace – Every method should have a trace call so that you can follow the flow of data through your application.

Debug – Debug statements should exist with enough frequency and information to allow anyone to identify and fix errors at some point in the future.

Info – Info statements are great for metrics. When a new user is registered, you should have an info statement. When a registration fails due to name collisions, you should have an info statement. When a promotion is completed, you should have an info statement.

Warn – Unexpected actions which do not result in data corruption and which the system can reasonably recover from should trigger a warning. For example, a failed attempt to connect to a RabbitMQ server should trigger a warning. This isn’t an error because the system can and should try to connect to a secondary system.

Error – Unexpected conditions which result in lost or corrupted data, or conditions which prevent the system from operating must fire an error. Failure to establish a connection to a database after all possible servers are tried is an error.

Using these conventions you might end up with logs that look something like this if assuming you are only reporting events at INFO level and higher:

Sept 17, 2012 14:25:25 – [INFO] New user account created: smattocks
Sept 17, 2012 14:25:26 – [WARN] Could not connect to database server at host 192.168.1.2. Error message: server not found
Sept 17, 2012 14:25:26 – [WARN] Could not connect to database server at host 192.168.1.3. Error message: server in read only mode
Sept 17, 2012 14:25:26 – [ERROR] Could not connect any databases while trying to process request for /user/changename for user ID 90210. Attempting to change name to “dylan”
Sept 17, 2012 14:25:27 – [WARN] Could not connect to database server at host 192.168.1.2. Error message: server not found
Sept 17, 2012 14:25:27 – [INFO] Connected to secondary database server at host 192.168.1.3. Most likely primary failed and secondary became master.
Sept 17, 2012 14:25:27 – [INFO] Username changed. Old name: brandon; New name: dylan
Sept 17, 2012 14:25:27 – [WARN] User creation failed due to unique name conflicts. Username: dylan, email: dylan@example.com, password (obfuscated for security reasons): xxxxxxxxx

Note: The example of not being able to connect to a database results in at least one warning and an error.

Hopefully, you have noticed a bit of consistency in the log messages above. The anatomy of a good log message can be described in four parts: a timestamp, an appropriate level of severity, a meaningful message, and sufficient contextual information. A timestamp is required to allow one message to be assessed in relation to other log messages from the current and other applications. The level of severity allows people or applications reading the logs to apply varying levels of urgency to their responses and behaviors. Hopefully, the meaningful message is self explanatory. The contextual information is the part most often omitted.

Letting you know that something went wrong is really only half of a log messages job. The other responsibilities of a log message are to allow you to accurately reproduce the conditions under which the event occurred, and to allow you to fix any data inconsistencies. Knowing that a user’s registration got bungled mid-process is good. Knowing which data caused the error and being able to fix the busted account is even better. A good log message needs to be part diagnostics and part remedy. If you don’t have the data you need to fix issues in your database, you are only fixing half of the problem.

Without logging, your application is a black box. You cannot see what is going on inside and you cannot fix bugs with reasonable speed. Having your application talk to you in an intelligent manner through proper logging makes your life much easier down the road. Heck, it even makes your life easier right now. You can’t consider your application part of your team unless it can provide feedback and help to inform your decisions. The first step in making your code LUCID is to make it “speak”. After that, you can start to make your code psychic by adding unit tests (the subject of my next post).

* I don’t think log4Fortran actually exists, but if someone creates one, I would vote for calling it log4tran.

LUCID Development

There are lots of rules, guidelines and frameworks out there designed to make software easier to build and upgrade. Concepts like SOLID look at how to write object oriented code in such a way as to allow pieces to be easily pulled apart and reused. Design patterns a geared toward solving common problems in tried and true ways. These ideas are great and all developers should not only understand them, but strive to fully utilize them. However, these principles are centered around a Utopian vision of software development. They assume that building the features the first time is the hard part. In reality, that isn’t so. The difficulty and frustration often associated with developing software comes from the maintenance phase, the long running part of the software development lifecycle in which bugs are identified and fixed.

Building software with a clear set of use cases and requirements is a relatively straight forward process. Various design patterns exist to help you solve problems which others have come across already. You can use principles like SOLID to help separate your classes and simplify your code. With modern RAD frameworks like Ruby on Rails or CakePHP, you can have something up and running in less than a day. But, as most software engineers know, that is just the start. Next comes the long process of maintenance and bug fixing. What makes this part of the software lifecycle so difficult is the inherent unpredictability of users. Software behaves predictably. If, under a certain set of conditions, you give it A, you will get B every time. Unfortunately, conditions change constantly and users often give C when you expect them to give you A. Expecting to have all of the use cases well defined at any point is unreasonable. Predictable software is used in unpredictable ways. That is a simple fact.

Any reasonably complex system will have a mind boggling set of conditions and permutations which make it impossible to fully predict all of the paths through the code. The best you can do is expect failure and try to capture enough information to be able to account for the situation next time. Using a set of guidelines like SOLID may make it easier to swap out a broken class for a new class, but they don’t really help you identify the problem or fix any data corruption which may have occurred. Knowing you have a problem and being able to isolate and fix the problem is just as important, if not more so, than being able to rely on consistent interfaces from your factory method. That is why I am proposing an additional set of software development guidelines called “LUCID code.” LUCID code is that which Logs information, uses Unit tests to predict behavior, is Configurable, Isolates features and is fully Documented.

LUCID code is code which understands that bugs are unavoidable; it expects to be updated somewhere down the road. Code which can “speak” to developers through logs, unit tests and documentation is not an artifact of the software development lifecycle, but a part of the team. It provides feedback which enables the human team members to more easily identify and change the unpredicted behaviors. It also allows the team to understand and correct corrupted data, an often overlooked part of the bug fixing process. Code which is isolated from other code and allows itself to be controlled through configurations, is code that can be made to behave in more predictable ways. If the team can throttle throughput, change flows, or increase logging levels without having to redeploy, then they are more likely to be able to keep misbehaving code in check while they work on a fix. In a production environment, code which is based on LUCID principles is less disruptive and easier to control than code which is not.

The principles of LUCID coding do not aim to resolve all issues in the early phases of the software development lifecycle; they simply make it easier to work with code in the real world. As predictable as your software may be, users will always find a way to do something unexpected. The real world is an unpredictable place for code. Having the code you write support you in your efforts may not allow you to predict all of the ways in which your systems will be abused, but it makes your hindsight much more clear.

I’ll follow up this blog post with a series of posts each one focusing on a different aspect of LUCID development. The elements seem simple on the surface, but doing it right takes skill and discipline.