Microsoft's only factory asset is the human imagination. --Bill Gates

Rule #2 - Fail Fast and Hard

Rule #2 of The Programmer's Code

The Opposite of Speed.  Copyright © Steve Lautenschlager 2012.

I've been snow skiing once in my life. Even though I'm from the plains of the United States, my first time snow skiing was in a place that would have been the envy of many accomplished skiers. The northern Italian town of Courmayeur sits in the shadow of the famous Mont Blanc peak in the Italian Alps. Two friends, both accomplished skiers, agreed to let me tag along.

Although I had not been snow skiing before, I had been water skiing. Missouri has a few sizeable lakes which are perfect for water skiing. I thought some of that experience might translate to snow. I was wrong.

As we rode up the telecabine (cable car) from the parking lot to the main ski chalet, the wind blowing off the slopes was filled with white powder. As we rose higher, I could see the chalet and the individual skiers careening down the slopes.

The closer we got, the bigger and steeper the slopes looked. In fact, the slope nearest the chalet must have dropped a hundred feet almost straight down. To my amazement, skiers were flying down this smooth slope like sky-divers, then turning sideways and kicking up forty-foot rooster tails as they slid eighty feet along the flat part before finally coming to an easy stop. To say I was intimidated would be an understatement.

I said to my friends, "Ha, I'm not going on that today!"

In response, they laughed and said, "That's the easy run."

I was certain they were teasing me. There was no way on earth that slope was easy. No way!

But they weren't kidding. It was a blue slope. In Europe the only thing easier than a blue slope is green, which is usually described as a "baby" slope.

Alright, now I was scared.

My friends agreed to spend a few minutes with me on the baby slope. They started by telling me there were only two things I needed to know. Of course, they said this right after pointing out the cliffs and hundred-foot drops along the courses. Add to that the images in my mind of the various co-workers who had already shown up to work that winter with crutches and knee braces, and I was shaking in my boots. Two things? Surely I needed to know more than two things.

The first thing they taught me was the snow plow. This is where you point the fronts of your skis together forming a "V" shape and push outward with your feet to dig into the snow. This is the baby way of stopping. I was not comfortable on the snow, but I managed to learn the basics of the snow plow on the baby slope. I had little confidence the snow plow would help me on a real slope.

The second thing they told me was this: "If you lose control, fall." Okay, armed with these two pieces of information they led me to the ski-lift for a blue slope (holy !@#$) and promptly abandoned me for more challenging opportunities. I really wasn't ready for this, but I remembered what they taught me. Snow plow. Fall. Snow plow. Fall.

I found that I could stay upright on the flat surface at the top of the hill, but the second I started downhill I began gaining speed. I had no idea how to control it. I could see so many people ahead of me effortlessly shifting their weight as they sashayed down the slope, smoothly leaning back and forth as easy as rocking in a rocking chair. Meanwhile, I was zipping across the ski run about to take a flying leap into uncharted territory. Then I remembered... If you lose control, fall. Ok. I can do this. I'm not sure how, but with the grace of a rhinoceros on ice skates, I managed to hit the ground while one ski came off my foot and headed downhill. But I was alive!! Thank God, I was alive.

The joy of being alive quickly gave way to stark embarrassment for being such a klutz. As six-year-old girls sailed around me, I wondered how I would get my ski back and recover my dignity--if such a thing were possible.

Fortunately, within a few seconds a young man swooped down the slope ahead of me and grabbed my ski. His momentum carried him back up to me and he graciously handed me my ski. I managed an embarrassed, "Merci." To which he replied "Prego." In seconds he was down the hill and out of sight.

I spent the rest of the day falling. I even took a wrong turn and ended up on a black slope with moguls. The only way out was down so I spent an hour falling and getting back up as I hit one mogul after the next. Luckily there weren't too many people around. Eventually, to my relief, I made it through the day without injury and the blue slopes, while still scary, were a lot less scary.

But the real beauty of this story is that my friends gave me exactly what I needed that day to avoid injury and remain safe. They taught me how to do a basic stop and taught me to fail fast before I ran into a tree at sixty miles per hour. This allowed me the opportunity to learn over time with less risk.

On the slopes, failing fast and hard was preferable to flying over a cliff and dying. Most of the time, software is not a life and death issue, although sometimes it certainly is.

What Do You Mean, Fast and Hard?

Before I go further, let me explain what I mean by failing fast and hard in software. In short, if your program experiences an error or unexpected state it should immediately fail completely. Period.
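Here is a minimal sketch of that idea in Python. The function name apply_discount and its rules are hypothetical, invented only to illustrate the point: the function checks its inputs up front and raises immediately rather than guessing at a value and carrying on.

```python
def apply_discount(price: float, percent: float) -> float:
    """Return price reduced by percent, failing immediately on bad input.

    Hypothetical example: the name and the validation rules are assumptions
    made for illustration, not part of any real library.
    """
    # Fail fast: reject an unexpected state at the point where it is detected,
    # instead of clamping the value or substituting a default.
    if price < 0:
        raise ValueError(f"price must be non-negative, got {price!r}")
    if not 0 <= percent <= 100:
        raise ValueError(f"percent must be between 0 and 100, got {percent!r}")
    return price * (1 - percent / 100)
```

A call such as apply_discount(200.0, 250) stops with a ValueError at the exact line where the bad value first shows up, which is the software version of falling the moment you lose control.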

On the slopes, when I began to lose control, I fell. I did that all day long. Eventually, the average time between falls got longer, although certain conditions--the moguls--dramatically shortened it. So I made a correction by paying better attention to the ski lifts so I wouldn't end up on a mogul run.

By failing quickly I preserved more energy and my falls were all controlled. I didn't once crash into a tree or fly over a cliff.

Creating a software application is a learning process in the same way as learning to ski. It requires patience and controlled, planned failure. To use a common cliche, we "fail forward."

No Such Thing as Failing Gracefully

My experience in the Italian Alps is a perfect metaphor for software error handling. It simulates many of the issues software teams face.

When people talk about "failing gracefully," they are expressing a desire to protect the user from lost data or a full-scale program crash. Obviously, these are undesirable conditions. But too often this idea is carried too far. Developers often swallow exceptions or overlook unexpected values and continue to run in a bad state. When the program does eventually crash, it's hard to debug. Or worse, the program appears to function properly, but its calculations are wrong.
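To make the difference concrete, here is a small Python sketch. All of the function names are hypothetical and exist only for this illustration. The first parser swallows the exception and substitutes a value; the second lets the error stop the program at the point of failure.

```python
def parse_quantity_swallowing(text: str) -> int:
    """Anti-pattern: hide the error and keep running with a made-up value."""
    try:
        return int(text)
    except ValueError:
        return 0  # the bad input silently becomes 0; later totals are wrong


def parse_quantity_failing(text: str) -> int:
    """Fail fast: let the ValueError raised by int() stop the program here."""
    return int(text)


def order_total(unit_price: float, quantity: int) -> float:
    """Compute the price of an order line."""
    return unit_price * quantity


# "1O" contains the letter O, not a zero. The swallowing version appears to
# work, but the total is quietly computed from a quantity of 0.
wrong_total = order_total(9.5, parse_quantity_swallowing("1O"))
```

With parse_quantity_failing, the same typo raises a ValueError that names the offending text, so the bug is found where it happened instead of showing up later as a mysterious total.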

Let's compare this to my skiing experience. If my concern for what others thought of me had been greater than my fear of crashing into a tree, I would have continued trying to stay upright on my skis. I would have shifted this way and that, making corrections. In the end, my efforts to stay in control might have allowed me to stay upright. Meanwhile, I would have been gaining speed with no way to control my direction or to slow down. So my eventual crash would have been that much worse.

If my code generates lots of terrible exceptions, assertions and other error conditions, that's music to my ears as a developer. That's one more bug we can fix that would have otherwise been hidden and possibly made its way into production.

But Shouldn't We Try To Handle Some Errors?

Throughout the development process you want to fail frequently, but there are times when small corrections or error handling are desirable. This is where a little judgment is required. If an experienced skier happens to lose his balance slightly, it probably makes sense for him to attempt a correction. After all, he has many skills that can help him recover. On the other hand, if he's operating toward the limit of his ability--a newbie like me on a blue slope or a competitive skier in the Giant Slalom--then a tiny bobble can spell disaster.

A controlled fall is better than an uncontrolled one. Deciding between attempting a correction and taking a controlled failure is a matter of judgment and experience. However, I would nearly always advocate failing fast and hard. This is always a controlled failure. It is not a graceful failure, which to my mind is nothing more than an unhealthy concern with appearances. But it exposes the problem plainly and clearly, prevents the program from continuing in an unstable state, and makes debugging much easier.

We Don't Want Errors In Production, Right?

I understand the desire to prevent customers from experiencing program crashes or major errors. But no more is needed than a global error handler to capture all failures and deliver an error message. You could even explain your failure philosophy in a couple of sentences and ask the user to submit the error details. Will this annoy customers? Yes. Nobody wants to experience bugs or failures. But what's the alternative? The program continues in an unstable state. The behavior is flaky. The user can't understand why weird things are happening. Eventually out of frustration, the user abandons your software. This happens every day. The worst thing is, software teams often don't know it. At least if the user sends you the error you have the opportunity to fix it and win them over again. You can turn a potential non-customer into a raving fan when they see how quickly you're able to respond to their problem. Or at the least your next customer won't have the same problem.
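As a rough sketch of such a global error handler in Python, the wrapper below is the single place where failures are caught. The name run_with_global_handler and the wording of the message are assumptions made for illustration; the code inside main is still expected to fail fast and hard.

```python
import traceback
from typing import Callable


def run_with_global_handler(main: Callable[[], None]) -> int:
    """Run main(); on any unhandled exception, report it and return 1.

    Hypothetical helper: only this outermost layer catches errors.
    Everything below it is allowed to fail immediately.
    """
    try:
        main()
        return 0
    except Exception as exc:
        # Collect the full error details so the user can send them in.
        report = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
        print("Sorry, something went wrong and the program had to stop.")
        print("Please send the details below to us so we can fix the problem:")
        print(report)
        return 1
```

A program's entry point could then end with sys.exit(run_with_global_handler(main)), so the user sees a plain explanation and the team gets the traceback, while the program itself never continues in an unstable state.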

To most developers and business leaders, this philosophy doesn't make sense. They say, "Why would I ever want to allow my users to see that the program experienced an error?" To them I want to say, "You don't have a choice and it's time to face reality. Software is always buggy. It will always experience failures no matter how good you think you are. Do you want to know about the bugs and bad experiences or do you want to bury your head in the sand and pretend they don't exist? If you want to be successful, then you want to know. Do you want your users to think you are oblivious or do you want them to develop trust in you as you experience controlled failures and continue to grow and get better? One way you lose a customer and don't even know it. The other way, you have a fighting chance."

Mature Error Handling

Instead of thinking that we should remove all errors from our software, software developers need to think about minimizing errors. When you think this way you understand that there will always be errors and that you need appropriate ways to deal with them. The only way to minimize errors is to be fully aware of as many error states as possible. The best way to do that is to fail fast and hard. And in order to minimize the effect on your production users, have a global error handling mechanism that can save, reset or recover the state of the program while giving the user a meaningful and friendly error message.

One approach to error handling is to try to anticipate all possible error conditions in your code. This is ok, but you'll never succeed. My approach of failing fast and hard means that you worry less about anticipating errors and deal with them rapidly as they come up. You still need to test thoroughly and consider all of the possibilities. But my approach is a reality-based, experimental approach from the beginning. It acknowledges the fact that perfection is a fantasy.

Since perfection is off the table, I think I'm going to have some ice cream and go watch television. Now that's a sport I can get into!

Additional Reading

While researching this article I found two other writers who have produced some excellent articles on the subject of failing fast and hard.

So what do you think? Agree? Disagree? Does your team believe in failing forward? Please let me know below in the comments section.