D is for Documentation

Code is the way in which humans tell computers what to do. Lots of effort has gone into making code easier for humans to read, in the form of high-level languages like Java or C++ and scripting languages like PHP, Ruby, and Python. Despite mankind’s best efforts, writing code is still clearly an exercise in talking to computers. It has not evolved to the point where talking to a computer is as easy and natural as talking to other people.

That’s why documentation is so important. Programming languages are just a translation of a developer’s intent into something a computer can execute. The code may show the computer what you intended for it to do, but the context is lost when another developer comes back to it later. Computers don’t know what to do with context. If they did, the days of Skynet would already be upon us. Humans can process context, and it makes dissecting and understanding a computer program much easier.

I find it both sad and hilarious when I see a speaker evangelizing code without comments. Invariably, the speaker shows a slide with five lines of code and spends ten minutes explaining its form and function. Even the simplest and most contrived examples from some of the foremost experts in the field require context and explanation.

When a bug decides to show itself at three in the morning, in code that someone else wrote, context and intent are two very powerful tools. When bugs are found, the question, “What was this supposed to do?” is more common than “What is this thing doing?” Figuring out what it is doing is easier when you have good log data to go on. Knowing what it was supposed to do is something only the original developer can tell you.

If you aren’t aware of the concept of Test Driven Development, I strongly recommend you dig into it. In summary, tests are written before the code to ensure that the code matches the business requirements. I would like to propose a complementary development driver: Documentation Driven Development. By writing out the code as comments first, you can ensure that the context of the development process will be captured. For example, I start writing code with a docblock like this:

/**
 * Returns the array of AMQP arguments for the given queue.
 *
 * Depending on the configuration available, we may have one or more arguments which
 * need to be sent to RabbitMQ when the queue is declared. These arguments could be
 * things like high availability configurations.
 *
 * If something in getInstance() is failing, check here first. Trying to declare a
 * queue with a set of arguments that does not match the arguments which were used
 * the first time the queue was declared most likely will not work. Check the config
 * for AMQP and make sure that the arguments have not been changed since the queue
 * was originally created. The easiest way to reset them is to kill off the queue
 * and try to recreate it based on the new config.
 *
 * @param string $name The name of the queue which will be used as a key in configs.
 *
 * @return array The array of arguments from the config.
 */

Next I dive into the method body itself:

private static function _getQueueArgs($name)
{
    // Start with nothing.

    // We may need to set some configuration arguments.

    // Check for queue specific args first and then try defaults. We will log where we 
    // found the data.

    // Return the args we found.
}

After that, I layer in the actual code:

/**
 * Returns the array of AMQP arguments for the given queue.
 *
 * Depending on the configuration available, we may have one or more arguments which
 * need to be sent to RabbitMQ when the queue is declared. These arguments could be
 * things like high availability configurations.
 *
 * If something in getInstance() is failing, check here first. Trying to declare a
 * queue with a set of arguments that does not match the arguments which were used
 * the first time the queue was declared most likely will not work. Check the config
 * for AMQP and make sure that the arguments have not been changed since the queue
 * was originally created. The easiest way to reset them is to kill off the queue
 * and try to recreate it based on the new config.
 *
 * @param string $name The name of the queue which will be used as a key in configs.
 *
 * @return array The array of arguments from the config.
 */
private static function _getQueueArgs($name)
{
    static::$logger->trace('Entering ' . __FUNCTION__);

    // Start with nothing.
    $args = array();

    // We may need to set some configuration arguments.
    $cfg = Settings\AMQP::getInstance();

    // Check for queue specific args first and then try defaults. We will log where we 
    // found the data.
    if (array_key_exists($name, $cfg['queue_arguments'])) {
        $args = $cfg['queue_arguments'][$name];
        static::$logger->info('Queue specific args found for ' . $name);
    } elseif (array_key_exists('default', $cfg['queue_arguments'])) {
        $args = $cfg['queue_arguments']['default'];
        static::$logger->info('Default args used for ' . $name);
    }

    // Return the args we found.
    static::$logger->trace('Exiting ' . __FUNCTION__ . ' on success.');
    return $args;
}

The final result is a small method which is well documented and took little, if any, extra time to write.

Armed with data from logs, unit tests which ensure functionality, configurations to control execution, isolation switches to lock down features, and contextual information in the form of inline documentation, the process of finding bugs becomes easier. LUCID code communicates as if it were a member of the development team. It does all the things you expect from a coworker. It talks, it makes commitments, it works around problems, and it keeps a record of both what it is doing and why it is doing it.

I is for Isolation

At some point in its life, every team will have a bad egg. If it isn’t a hiring mistake that brings someone in, it can be a business choice or financial change that sours (even if only temporarily) a previously great player. What separates poor and mediocre teams from the great ones is their ability to effectively quarantine the disgruntled teammate until such time as the problem can be corrected. If the group can’t isolate the individual and continue to function at a high level, they run the risk of completely imploding. An ineffective dynamic can spread and negatively impact the entire organization.

The same can be said of an application. Applications are a collection of individual components; they pull together the contributions of individuals to create something which is greater than the sum of its parts. If one piece breaks down, and the system is not prepared to put it in quarantine, the entire application can come crashing down. There are no effective means for isolating the application itself if it is the team member that turns sour. Once this happens, the disease spreads and the entire company is at risk.

The best way to stave off this infection is to prevent it at the source. Keep the application happy by making it more resistant to implosion. Take the team approach down one level, inside your code, and imagine features as team members. If one piece of the application, such as user authentication, goes a little crazy, will that bring down your entire system? Hopefully, users that are already logged in can still make purchases or update their user information or share links with their friends. Don’t let users that are having trouble logging in ruin the experience for everyone else. That isn’t to say issues with authentication aren’t a major concern, but they should be isolated in such a way that people who are already in can still do what they came to do.

Code jerks (the features, classes and libraries that bring everything else down when they are upset) come in all sorts of shapes and sizes. There are code jerks that hold resources too long. These jerks normally don’t do much on their own, but when you have lots of users executing the same code at the same time, you essentially end up DoS’ing yourself. There are code jerks that assume they know how the rest of the system works. These jerks tend to halt execution and prevent the rest of the application from working around the error. There are code jerks that give up too easily. These jerks try something once and assume things can’t get any better so there is no point in retrying. The worst code jerks are a combination of the three.

Resource hogs are tough to find, because they typically need lots of friends before they are a noticeable problem. For example, setting a connection timeout to five seconds for a third-party service may seem reasonable. However, when the service is slow and it takes three seconds to connect, you could end up keeping your own connections open much longer than expected. Can your server hold each connection open for three seconds and still handle all of the incoming requests? Even if only some of your user requests need this service, you may run out of connections for the rest of your users. Being able to turn this particular feature off without impacting the rest of your site can help keep your application running and making money.
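
As a rough sketch of what that kind of kill switch might look like (the config file path, key names, and recommendations service here are all hypothetical, not part of any particular library):

// A hypothetical feature switch: if the third-party recommendations call is
// disabled in the config, skip the remote request entirely and degrade gracefully.
function getRecommendations($userId)
{
    $cfg = parse_ini_file('/etc/myapp/features.ini');

    // Operations can flip this flag without a code deploy.
    if (empty($cfg['recommendations_enabled'])) {
        return array();
    }

    $ch = curl_init('https://recommendations.example.com/user/' . urlencode($userId));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, (int) $cfg['recommendations_connect_timeout']);
    $response = curl_exec($ch);
    curl_close($ch);

    // A failed call returns an empty list instead of taking the page down.
    return ($response === false) ? array() : json_decode($response, true);
}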

It isn’t always third parties that cause trouble. Sometimes it could be your own database that slows down your application. Let’s say you forgot to add an index, or your data suddenly changed dramatically, your old indexes no longer work as well, and your database has become unresponsive. Many applications (including most of the ones I have built in the past) assume that not being able to talk to a database is a fatal error. The database wrapper library may halt execution of the app or make it difficult for the application to properly handle the situation. I have since learned that there are lots of things you can do without database access. For instance, depending on your data and your business rules, you may be able to use slightly old cache data in place of fresh database data. If you isolate your application from certain types of failures, you can let your code decide what is best. Your code should do everything it can to satisfy the user’s request to the best of its ability and be as resilient to failure as your business allows.
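
A minimal sketch of that cache fallback, assuming hypothetical $db and $cache wrappers (the class and method names are placeholders for whatever your own data layer provides):

// If the database throws, fall back to slightly stale cache data rather than
// failing the whole request. All names here are illustrative.
function getUserProfile($db, $cache, $userId)
{
    try {
        $profile = $db->fetchProfile($userId);
        $cache->set('profile_' . $userId, $profile, 300); // keep the cache warm
        return $profile;
    } catch (RuntimeException $e) {
        $stale = $cache->get('profile_' . $userId);
        if ($stale !== null) {
            return $stale; // old data beats no data, if the business allows it
        }
        throw $e; // nothing cached either; let the caller decide what to do next
    }
}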

Sticking with our database example, let’s look at what you might do when the primary database isn’t available. Assuming you can’t use the cache as a backup database, what can you do? Can you read from a secondary? Can you log data to a temporary location to be written to the database when it comes back up? Is the primary database really 100% critical to satisfying the user’s request? With a bit of creativity you can probably find a solution that allows you to keep the site up while you work to fix the database problem. But even before you get to that point, how do you know it is really necessary? This may come as a shock to some, but the Internet is unreliable. It tells you something is there one minute, and then that it is gone the next. The inverse is equally true. Just because you tried to connect to the database and couldn’t, doesn’t mean you need to switch to crisis mode. Maybe you were just a victim of the Internet being the Internet. Code that tries multiple (configurable) times to connect to a database isolates the rest of the application from the reliability issues inherent in communicating between systems.
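
A configurable retry loop along those lines might look something like this (connectToDatabase() and the config keys are placeholders for your own connection code and settings):

// Try the connection a configurable number of times before deciding the
// failure is real. connectToDatabase() stands in for your own connection logic.
function connectWithRetries(array $cfg)
{
    $attempts = isset($cfg['db_connect_attempts']) ? (int) $cfg['db_connect_attempts'] : 3;

    for ($i = 1; $i <= $attempts; $i++) {
        try {
            return connectToDatabase($cfg['db_host'], $cfg['db_user'], $cfg['db_pass']);
        } catch (RuntimeException $e) {
            if ($i === $attempts) {
                throw $e; // out of retries; now it is worth escalating
            }
            usleep((int) $cfg['db_retry_delay_ms'] * 1000); // brief pause, then try again
        }
    }
}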

Approaching software as a set of interconnected and unreliable services helps to create applications which stand up better in the face of less than ideal situations. Unfortunately, even when you have an application that communicates well with logs, has tests to verify functionality, and uses configurations to help isolate features from one another, there will still be bugs. Armed with your logs and tests, you will still have to dive into code and make changes. The level of documentation in your code can make this either easier or a nightmare. The next post will look at how documentation contributes to the maintenance of code.

C is for Configurable

The goal of LUCID development is to create code that is easier to maintain and extend. Maintainable is often confused with debuggable. Being able to easily debug code is important, but there is more to maintaining an application than fixing bugs. The environment in which an application is deployed is seldom fixed. As code moves through the software development process, it moves through different environments. In many cases, the full details of the environment may not even be known at the time development starts. Getting your code to change its behavior on the fly is only possible if you have carefully and thoughtfully interleaved configuration parameters into your code.

Most applications of reasonable complexity are really just parts of a larger system. Applications of this nature interact constantly with other applications such as databases, logging systems and third-party services. Connecting to these external systems usually requires knowing an endpoint, a username and a password. The benefits of using configurations for these values include both security and personal sanity. If something goes wrong with one of these services, you can quickly change the connection credentials. While that alone doesn’t secure a system, it does help to mitigate risks in the event of a system compromise.

In the past, I often felt that the more systems I had access to the more other respected and trusted me. I have learned, however, that trust and respect are not measured by system access. In fact, as I have matured, I have done my best to avoid gaining access to systems like production databases. When a system is compromised, the first group of suspects are those who were granted access. By storing these values in proper configuration files, I can immediately exclude myself from the group of people who may have slipped up. I have no idea what our production databases’ usernames are, let alone the passwords. I sleep easier at night knowing that I can’t possibly let that information slip out to an untrusted source.

In order to fully absolve myself of that responsibility, our code needs to be able to pull in configuration values from a source outside the code repository. If the production configuration lives in the same version control repo as the code I work with, I have access, and I cannot claim ignorance. Configuration files need to be able to live external to the main code. The location of that file can live in a repository bootstrap file or can be known by convention, but the actual contents are something I never want to see. I will leave that responsibility to our Operations team.
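
One way that might look in practice (the path and INI format here are just one possible convention; the real location and format are whatever you and your Operations team agree on):

// The repository only knows where to look for the config, never what is in it.
define('APP_CONFIG_PATH', '/etc/myapp/production.ini');

function loadConfig()
{
    if (!is_readable(APP_CONFIG_PATH)) {
        throw new RuntimeException('Missing or unreadable config file: ' . APP_CONFIG_PATH);
    }

    // Operations own this file and its contents; developers never need to see it.
    return parse_ini_file(APP_CONFIG_PATH, true);
}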

Once you have separated out your configuration files, you can go about lacing that information into your code. Adding configurations typically isn’t the hard part; knowing what to configure is often more difficult. A good place to start is with code that involves external systems. Adding configurable values for the access credentials, timeouts and retry logic will help to isolate failures (more on that later). Being able to change these values without needing to redeploy code allows those supporting the application to change behaviors in response to a changing environment. If a third-party service is having trouble delivering, turning down the timeouts prevents that service from negatively impacting the rest of the system.

Additionally, being able to configure the application to use different drivers or services makes the system easier to test and develop. There is no need to set up a redundant queuing system while unit testing. Injecting a mock system by adjusting runtime configurations helps to isolate the code under test and provide better testing results. The same is true for logging configurations. Different environments require different levels of logging and may use different logging backends. Easily manageable external configurations make adjusting the behavior of a system trivial.
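
A sketch of what driver selection from config might look like, with hypothetical queue classes standing in for a real broker driver and a test stub:

interface Queue
{
    public function publish($message);
}

// A do-nothing stub used during unit tests; no broker required.
class NullQueue implements Queue
{
    public function publish($message)
    {
        // Intentionally discard the message.
    }
}

// The real driver; connection details come from configuration.
class AmqpQueue implements Queue
{
    private $endpoint;

    public function __construct($host, $port)
    {
        $this->endpoint = $host . ':' . $port;
    }

    public function publish($message)
    {
        // The actual broker call would go here.
    }
}

// The 'queue_driver' config value decides which implementation the app gets,
// so unit tests can run without a real RabbitMQ server.
function createQueue(array $cfg)
{
    return ($cfg['queue_driver'] === 'null')
        ? new NullQueue()
        : new AmqpQueue($cfg['amqp_host'], $cfg['amqp_port']);
}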

Configurations make runtime behavior changes possible. These behavior modifications allow the code to adapt to an ever changing environment. When combined with runtime data from logs, configurations make it easier to identify and isolate sections of an application. The ability to isolate problem areas with configurations is a key component of LUCID development. The next post will explain in greater detail why it is important to use configurations to isolate sections of code.

U is for Unit Tests

Unit tests are like Google Maps for your code. When you zoom in and look at small pieces, you can only go so far. However, when you zoom out and take a look from a greater distance, you can see the full path from A to B. The idea of unit testing, as it relates to LUCID development, is that it maps out all the different ways to get from one end of your application to the other. Unit tests take the question of “how does my data get from my user’s request, through my code, and back to the user again?” They answer that question just like any well written application: by breaking it down into smaller, more manageable chunks.

Each unit test answers a single question. One test may ask, “what happens if I call findTournament() with a user that has played this game before?” Another test may ask, “what happens if I call findTournament() with a user that has never played a game before?” The more questions you ask about your code, the more you know about how your code is likely to function in production. If you have a question that your unit tests can’t answer, you have a blind spot that may come back to haunt you later. A lack of code coverage is like a missing road on a map. If you can’t see how data is moving through a section of your code, you can’t be sure how it is going to get from one end to the other.
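
For example, two of those questions written as PHPUnit tests (TournamentFinder and findTournament() are placeholders for your own code, and the assertions are only a guess at what the right answers should be):

class TournamentFinderTest extends PHPUnit_Framework_TestCase
{
    public function testReturningPlayerIsGivenATournament()
    {
        $finder  = new TournamentFinder();
        $veteran = array('id' => 42, 'games_played' => 17);

        $tournament = $finder->findTournament($veteran);

        $this->assertNotNull($tournament, 'A returning player should always be matched.');
    }

    public function testBrandNewPlayerLandsInABeginnerTournament()
    {
        $finder = new TournamentFinder();
        $rookie = array('id' => 43, 'games_played' => 0);

        $tournament = $finder->findTournament($rookie);

        $this->assertEquals('beginner', $tournament['level']);
    }
}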

Unit tests are great because they isolate problems. If a test is failing, you know exactly where the fix needs to go. However, it is easy to lose the big picture when you are so narrowly focused on individual units of code. The goal of having unit tests is not to ensure that 300 separate pieces of code work well by themselves. The goal is to ensure that those 300 pieces can come together to form one well functioning application. Your tests are the zoomed in pieces of the map that need to be strung together to get you from A to B. That isn’t to say that your tests should do more than one thing. You just need to keep the entire application in mind when creating your unit tests, and more importantly, when creating your test data.

In the development phase, your testing is really a best guess at what you think your code will see. You pass in data that you expect to be passed in in the real world. Your unit tests for your email validation method probably verify the functionality when given a valid email and one or two invalid emails. But, when your application gets out into the wild world of the Internet, will it be able to handle all of the crazy data that your users throw at it? There isn’t a great way to know early in the game. However, once you deploy your application and you start gathering user information, you can begin to fill in some of those missing gaps.

Unit testing isn’t just a development phase activity. You need to continue to write new tests and update old tests. You need to make sure that your unit tests and your test coverage are evolving in the same way that your users and your data are evolving. The data you are capturing in your logs is a great source for real world test cases. When your logs tip you off that something isn’t working quite like you expected, extract the data and use it to build a new unit test. Because your application has great logging, finding the data that is going into and coming out of any method should be pretty straightforward.
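
As a sketch, a regression test seeded from a log entry might look something like this (EmailValidator and the captured address are hypothetical stand-ins for your own code and data):

class EmailValidatorRegressionTest extends PHPUnit_Framework_TestCase
{
    public function testAddressRejectedInProductionIsNowAccepted()
    {
        // Pulled from a WARN log entry where a legitimate address was rejected.
        $emailFromLogs = 'first.last+newsletter@mail.example.co.uk';

        $validator = new EmailValidator();

        $this->assertTrue($validator->isValid($emailFromLogs));
    }
}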

A LUCID application feeds into itself, helping to answer questions about how it works. It uses the data it captures through logging to help solidify behaviors and allows the team to better predict real world behaviors through unit tests. Of course, you can’t test everything and there will always be bugs. That is why it is important for an application to expect failures, and do everything it can to continue working in the face of adversity. My next post will be about building a more fault tolerant and controllable application.

L is for Logging

How do you find out that one of your friends doesn’t feel well? Normally, they let you know by sending you an email, giving you a call, or walking up and telling you (spreading lots of germs in the process). If they don’t tell you directly, you pick up on other signals. Your friend may be sniffling a lot or sound very congested or be passed out at their desk, drooling all over their keyboard. Conversely, you can see the signs of a relatively healthy person. They sound clear, they have lots of energy or they are telling the guy with the cart, “I feel better. I think I’ll go for a walk!”

Applications need to do the same thing. Communication is the only way you can tell if an application is feeling well or is about to fall apart. If your application can’t talk to you, you have no way of helping it to feel better. The L in LUCID is for logging. Logs are how applications speak. They capture information that allows us to pick up on those little signs and act early enough to make sure the application doesn’t come down with the flu. If you don’t have good logging throughout your system, the best you can do is react to your application falling over. A silent application is an application destined to cause midnight surprises for you and your operations team.

Of course, blabbering constantly isn’t really all that helpful. Logging is not just the act of slathering your code in print statements. Proper logging requires thoughtful consideration and clear communication. Spitting out the string, “ERROR” every time someone tries to log in with the wrong password is not at all useful. The constant noise will cloud the signs of what is really going on. Good logging not only lets you know what is going on, but does it in such a way that you know when to freak out and when to wait it out. Along with information about the event, a log message needs a log level. The log4j family of libraries (log4perl, log4PHP, log4Fortran*, etc.) provides five levels of severity. Selecting the proper level for each message is key to helping you stay asleep when you can afford to stay asleep and waking you up when you need to be woken up. There may be official definitions for different levels somewhere, but here’s a general set of guidelines for what each level means (with a rough code sketch after the list):

Trace – Every method should have a trace call so that you can follow the flow of data through your application.

Debug – Debug statements should exist with enough frequency and information to allow anyone to identify and fix errors at some point in the future.

Info – Info statements are great for metrics. When a new user is registered, you should have an info statement. When a registration fails due to name collisions, you should have an info statement. When a promotion is completed, you should have an info statement.

Warn – Unexpected actions which do not result in data corruption and which the system can reasonably recover from should trigger a warning. For example, a failed attempt to connect to a RabbitMQ server should trigger a warning. This isn’t an error because the system can and should try to connect to a secondary system.

Error – Unexpected conditions which result in lost or corrupted data, or conditions which prevent the system from operating must fire an error. Failure to establish a connection to a database after all possible servers are tried is an error.
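
Here is a rough sketch of those levels in context, assuming the log4php Logger API; the createAccount() wrapper and its $db dependency are hypothetical, just to show each level doing its job:

// Each level answers a different question: trace follows the flow, debug captures
// the data, info records the business event, warn flags a recoverable hiccup,
// and error reports a failure the system could not work around.
function createAccount($username, array $input, $db)
{
    $logger = Logger::getLogger('registration');

    $logger->trace('Entering ' . __FUNCTION__);
    $logger->debug('Incoming registration payload: ' . json_encode($input));

    if (!$db->primaryAvailable()) {
        // Recoverable: we can and should try the secondary server.
        $logger->warn('Could not connect to primary database; trying secondary.');
    }

    if (!$db->anyAvailable()) {
        // Not recoverable: the request cannot be satisfied at all.
        $logger->error('Could not connect to any database while creating account for ' . $username);
        return false;
    }

    $created = $db->insertUser($username, $input);

    if ($created) {
        $logger->info('New user account created: ' . $username);
    } else {
        $logger->info('Registration failed due to name collision: ' . $username);
    }

    return $created;
}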

Using these conventions, you might end up with logs that look something like this, assuming you are only reporting events at INFO level and higher:

Sept 17, 2012 14:25:25 – [INFO] New user account created: smattocks
Sept 17, 2012 14:25:26 – [WARN] Could not connect to database server at host 192.168.1.2. Error message: server not found
Sept 17, 2012 14:25:26 – [WARN] Could not connect to database server at host 192.168.1.3. Error message: server in read only mode
Sept 17, 2012 14:25:26 – [ERROR] Could not connect any databases while trying to process request for /user/changename for user ID 90210. Attempting to change name to “dylan”
Sept 17, 2012 14:25:27 – [WARN] Could not connect to database server at host 192.168.1.2. Error message: server not found
Sept 17, 2012 14:25:27 – [INFO] Connected to secondary database server at host 192.168.1.3. Most likely primary failed and secondary became master.
Sept 17, 2012 14:25:27 – [INFO] Username changed. Old name: brandon; New name: dylan
Sept 17, 2012 14:25:27 – [WARN] User creation failed due to unique name conflicts. Username: dylan, email: dylan@example.com, password (obfuscated for security reasons): xxxxxxxxx

Note: The example of not being able to connect to a database results in at least one warning and an error.

Hopefully, you have noticed a bit of consistency in the log messages above. The anatomy of a good log message can be described in four parts: a timestamp, an appropriate level of severity, a meaningful message, and sufficient contextual information. A timestamp is required to allow one message to be assessed in relation to other log messages from the current and other applications. The level of severity allows people or applications reading the logs to apply varying levels of urgency to their responses and behaviors. Hopefully, the meaningful message is self-explanatory. The contextual information is the part most often omitted.

Letting you know that something went wrong is really only half of a log message’s job. The other responsibilities of a log message are to allow you to accurately reproduce the conditions under which the event occurred, and to allow you to fix any data inconsistencies. Knowing that a user’s registration got bungled mid-process is good. Knowing which data caused the error and being able to fix the busted account is even better. A good log message needs to be part diagnostics and part remedy. If you don’t have the data you need to fix issues in your database, you are only fixing half of the problem.
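
For instance, the difference between a bare error and an actionable one might look like this; the $logger and the other variables are placeholders for whatever your request actually contains:

// Not enough to act on:
$logger->error('Registration failed.');

// Enough to reproduce the failure and repair the account afterwards:
$logger->error(sprintf(
    'Registration failed at the billing step. userId=%d, plan=%s, request=%s',
    $userId,
    $planName,
    json_encode($requestData)
));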

Without logging, your application is a black box. You cannot see what is going on inside and you cannot fix bugs with reasonable speed. Having your application talk to you in an intelligent manner through proper logging makes your life much easier down the road. Heck, it even makes your life easier right now. You can’t consider your application part of your team unless it can provide feedback and help to inform your decisions. The first step in making your code LUCID is to make it “speak”. After that, you can start to make your code psychic by adding unit tests (the subject of my next post).

* I don’t think log4Fortran actually exists, but if someone creates one, I would vote for calling it log4tran.

LUCID Development

There are lots of rules, guidelines and frameworks out there designed to make software easier to build and upgrade. Concepts like SOLID look at how to write object-oriented code in such a way as to allow pieces to be easily pulled apart and reused. Design patterns are geared toward solving common problems in tried and true ways. These ideas are great and all developers should not only understand them, but strive to fully utilize them. However, these principles are centered around a Utopian vision of software development. They assume that building the features the first time is the hard part. In reality, that isn’t so. The difficulty and frustration often associated with developing software comes from the maintenance phase, the long running part of the software development lifecycle in which bugs are identified and fixed.

Building software with a clear set of use cases and requirements is a relatively straightforward process. Various design patterns exist to help you solve problems which others have come across already. You can use principles like SOLID to help separate your classes and simplify your code. With modern RAD frameworks like Ruby on Rails or CakePHP, you can have something up and running in less than a day. But, as most software engineers know, that is just the start. Next comes the long process of maintenance and bug fixing. What makes this part of the software lifecycle so difficult is the inherent unpredictability of users. Software behaves predictably. If, under a certain set of conditions, you give it A, you will get B every time. Unfortunately, conditions change constantly and users often give C when you expect them to give you A. Expecting to have all of the use cases well defined at any point is unreasonable. Predictable software is used in unpredictable ways. That is a simple fact.

Any reasonably complex system will have a mind-boggling set of conditions and permutations which make it impossible to fully predict all of the paths through the code. The best you can do is expect failure and try to capture enough information to be able to account for the situation next time. Using a set of guidelines like SOLID may make it easier to swap out a broken class for a new class, but they don’t really help you identify the problem or fix any data corruption which may have occurred. Knowing you have a problem and being able to isolate and fix it is just as important as being able to rely on consistent interfaces from your factory method, if not more so. That is why I am proposing an additional set of software development guidelines called “LUCID code.” LUCID code is that which Logs information, uses Unit tests to predict behavior, is Configurable, Isolates features and is fully Documented.

LUCID code is code which understands that bugs are unavoidable; it expects to be updated somewhere down the road. Code which can “speak” to developers through logs, unit tests and documentation is not an artifact of the software development lifecycle, but a part of the team. It provides feedback which enables the human team members to more easily identify and change the unpredicted behaviors. It also allows the team to understand and correct corrupted data, an often overlooked part of the bug fixing process. Code which is isolated from other code and allows itself to be controlled through configurations is code that can be made to behave in more predictable ways. If the team can throttle throughput, change flows, or increase logging levels without having to redeploy, then they are more likely to be able to keep misbehaving code in check while they work on a fix. In a production environment, code which is based on LUCID principles is less disruptive and easier to control than code which is not.

The principles of LUCID coding do not aim to resolve all issues in the early phases of the software development lifecycle; they simply make it easier to work with code in the real world. As predictable as your software may be, users will always find a way to do something unexpected. The real world is an unpredictable place for code. Having the code you write support you in your efforts may not allow you to predict all of the ways in which your systems will be abused, but it makes your hindsight much more clear.

I’ll follow up this blog post with a series of posts each one focusing on a different aspect of LUCID development. The elements seem simple on the surface, but doing it right takes skill and discipline.