At some point in its life, every team will have a bad egg. If it isn’t a hiring mistake that brings someone in, it can be a business choice or financial change that sours (even if only temporarily) a previously great player. What separates poor and mediocre teams from the great ones is their ability to effectively quarantine the disgruntled teammate until the problem can be corrected. If the group can’t isolate the individual and continue to function at a high level, they run the risk of completely imploding. An ineffective dynamic can spread and negatively impact the entire organization.
The same can be said of an application. Applications are a collection of individual components; they pull together the contributions of individuals to create something which is greater than the sum of its parts. If one piece breaks down, and the system is not prepared to put it in quarantine, the entire application can come crashing down. And if the application itself is the team member that turns sour, there is no effective way to isolate it. Once this happens, the disease spreads and the entire company is at risk.
The best way to stave off this infection is to prevent it at the source. Keep the application happy by making it more resistant to implosion. Take the team approach down one level, inside your code, and imagine features as team members. If one piece of the application, such as user authentication, goes a little crazy, will that bring down your entire system? Hopefully, users who are already logged in can still make purchases or update their user information or share links with their friends. Don’t let users who are having trouble logging in ruin the experience for everyone else. That isn’t to say issues with authentication aren’t a major concern, but they should be isolated in such a way that people who are already in can still do what they came to do.
Code jerks (the features, classes and libraries that bring everything else down when they are upset) come in all sorts of shapes and sizes. There are code jerks that hold resources too long. These jerks normally don’t do much on their own, but when you have lots of users executing the same code at the same time, you essentially end up DoS’ing yourself. There are code jerks that assume they know how the rest of the system works. These jerks tend to halt execution and prevent the rest of the application from working around the error. There are code jerks that give up too easily. These jerks try something once and assume things can’t get any better so there is no point in retrying. The worst code jerks are a combination of the three.
Resource hogs are tough to find, because they typically need lots of friends before they are a noticeable problem. For example, setting a connection timeout to five seconds for a third-party service may seem reasonable. However, when the service is slow and it takes three seconds to connect, you could end up keeping your own connections open much longer than expected. Can your server hold each connection open for three seconds and still handle all of the incoming requests? Even if only some of your user requests need this service, you may run out of connections for the rest of your users. Being able to turn this particular feature off without impacting the rest of your site can help keep your application running and making money.
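To make the "turn it off" idea concrete, here is a minimal sketch of a feature switch in Python. Everything in it is hypothetical: the `FEATURE_FLAGS` dictionary, the `recommendations` feature and the `fetch_recommendations` stand-in are names I made up for illustration, and a real application would probably load its flags from configuration that can change without a redeploy.

```python
# Hypothetical feature flags. In a real application these would likely come
# from a config file or service that can be changed without a redeploy.
FEATURE_FLAGS = {"recommendations": True}


def fetch_recommendations(user_id):
    """Stand-in for a slow third-party call (hypothetical)."""
    return ["item-1", "item-2"]


def recommendations_for(user_id, flags=FEATURE_FLAGS):
    # When the flag is off (or missing), skip the third-party call entirely
    # so no connection is held open, and return an empty, harmless result.
    if not flags.get("recommendations", False):
        return []
    return fetch_recommendations(user_id)
```

With the flag off, the page simply renders without recommendations, and the connections the slow service would have tied up stay available for everyone else.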
It isn’t always third parties that cause trouble. Sometimes it could be your own database that slows down your application. Let’s say you forgot to add an index, or your data suddenly changed so dramatically that your old indexes don’t work as well, and your database has become unresponsive. Many applications (including most of the ones I have built in the past) assume that not being able to talk to a database is a fatal error. The database wrapper library may halt execution of the app or make it difficult for the application to properly handle the situation. I have since learned that there are lots of things you can do without database access. For instance, depending on your data and your business rules, you may be able to use slightly old cache data in place of fresh database data. If you isolate your application from certain types of failures, you can let your code decide what is best. Your code should do everything it can to satisfy the user’s request to the best of its ability and be as resilient to failure as your business allows.
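As a rough sketch of the stale cache idea, assuming your business rules allow data up to a few minutes old, the data layer can raise an ordinary exception instead of halting, and the caller can decide what to do with it. The `DatabaseUnavailable` exception, the in-process `_cache` dictionary and the `load_from_db` callable are all hypothetical names for illustration, not any particular library’s API.

```python
import time


class DatabaseUnavailable(Exception):
    """Raised by the (hypothetical) data layer instead of halting the app."""


# Hypothetical in-process cache: key -> (value, time it was stored).
_cache = {}


def get_with_stale_fallback(key, load_from_db, max_stale_seconds=300,
                            now=time.time):
    try:
        value = load_from_db(key)
    except DatabaseUnavailable:
        cached = _cache.get(key)
        # Serve slightly old data only if the business rules allow it.
        if cached is not None and now() - cached[1] <= max_stale_seconds:
            return cached[0]
        # Nothing usable in the cache, so let the caller handle the failure.
        raise
    # Fresh read succeeded: remember it for the next outage.
    _cache[key] = (value, now())
    return value
```

The important part is that the failure stays a decision for your code to make. If the cached copy is too old, the exception still surfaces, but the application gets a chance to respond rather than being stopped in its tracks.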
Sticking with our database example, let’s look at what you might do when the primary database isn’t available. Assuming you can’t use the cache as a backup database, what can you do? Can you read from a secondary? Can you log data to a temporary location to be written to the database when it comes back up? Is the primary database really 100% critical to satisfying the user’s request? With a bit of creativity you can probably find a solution that allows you to keep the site up while you work to fix the database problem. But even before you get to that point, how do you know it is really necessary? This may come as a shock to some, but the Internet is unreliable. It tells you something is there one minute, and then that it is gone the next. The inverse is equally true. Just because you tried to connect to the database and couldn’t, doesn’t mean you need to switch to crisis mode. Maybe you were just a victim of the Internet being the Internet. Code that tries multiple (configurable) times to connect to a database isolates the rest of the application from the reliability issues inherent in communicating between systems.
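A configurable retry can be very small. The sketch below assumes a hypothetical `connect` callable that raises Python’s built-in `ConnectionError` when it fails; the `attempts` and `delay_seconds` defaults are arbitrary placeholders that you would tune for your own systems.

```python
import time


def connect_with_retries(connect, attempts=3, delay_seconds=0.5,
                         sleep=time.sleep):
    """Call connect() up to `attempts` times before giving up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError as error:
            last_error = error
            # Pause between tries, but not after the final failure.
            if attempt < attempts:
                sleep(delay_seconds)
    # Every attempt failed: now it is fair to switch to crisis mode.
    raise last_error
```

Only after the last attempt fails does the rest of the application need to know anything went wrong, which is exactly the point where the fallbacks above (a secondary, a temporary log, a cache) come into play.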
Approaching software as a set of interconnected and unreliable services helps to create applications which stand up better in the face of less-than-ideal situations. Unfortunately, even when you have an application that communicates well with logs, has tests to verify functionality and uses configurations to help isolate features from one another, there will still be bugs. Armed with your logs and tests, you will still have to dive into code and make changes. The level of documentation in your code can make this either easier or a nightmare. The next post will look at how documentation contributes to the maintenance of code.