Some of the most disastrous software fails
If there is one word that sums up software testing the best, it’s learning.
Learning how software works: its capabilities, strengths, and weaknesses. It teaches us more about our own perceptions and tests our understanding of the product.
I’ve blogged before about learning from the mistakes we have made. But the opportunity to learn from the mistakes of others is just as important for our growth, if not more so. It helps us think better, and it’s fascinating as well.
If you receive my weekly newsletter (you can sign up here if you’d like), you’ll know that I’m reading Black Box Thinking by Matthew Syed.
In it, Matthew describes how failure can be redefined to cultivate new ideas and enable creative thinking, and how, if we change our perception of failure, we can use the experiences of others to lead us to success.
“Learn from the mistakes of others. You can’t live long enough to make them all yourself.”
From surgeons and pilots to the head of Dyson, Matthew uses the examples of others to describe how failure is the catalyst for learning.
So, taking that lesson and applying it to software development: how can we learn from the mistakes of others to improve the testing discipline?
What are some of the software failures that have had the biggest impact?
Mariner 1
The year is 1962.
A new spacecraft is being launched from Cape Canaveral in Florida. The moment promises to usher in a new age of space discovery.
Everyone is on the edge of their seat, waiting for lift-off and for the extent of human knowledge to be put on display.
Lift-off. For a few moments, everything appears to be going to plan.
But then it explodes before everyone’s eyes. And while there were no casualties, it still left a 135 million dollar invoice in the hands of NASA.
But how did this happen?
The decision to self-destruct was a human one. But what ultimately made it necessary was a software error.
The error caused the spacecraft to manoeuvre in a way that it shouldn’t have, over-correcting a normal movement and putting it on a dangerous path. So, for safety reasons, the craft was destroyed.
In those days, software programs came on punched cards, and one of them carried a flaw in a mathematical equation.
As reported a few days later,
The hyphen symbol, called a “bar,” if officially fed into the computer on its punched card instructions, in effect tells the machine not to worry about this normal veering movement.
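To get a feel for what that missing bar meant, here is a rough, hypothetical sketch (this is not the actual guidance code, and the numbers are invented). The equations were supposed to use a smoothed, time-averaged velocity; feeding in the raw, noisy readings instead makes the steering chase perfectly normal fluctuations:

```python
def correction(velocity_error, gain=0.5):
    """Steering correction proportional to the apparent velocity error."""
    return -gain * velocity_error

def smoothed(readings, window=5):
    """Simple moving average over the most recent readings."""
    recent = readings[-window:]
    return sum(recent) / len(recent)

# Normal, noisy velocity errors around zero: the rocket is actually on course.
readings = [0.2, -0.3, 0.25, -0.1, 0.15, -0.25, 0.3]

raw = readings[-1]
print("correction from raw reading:     ", correction(raw))                 # chases the latest noise spike
print("correction from smoothed reading:", correction(smoothed(readings)))  # stays close to zero
```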
Read more here.
Pentium FDIV bug
In 1994, Thomas Nicely, a professor of mathematics at Lynchburg College, discovered a bug when using Intel processors to perform floating-point division. Certain divisions returned results that were wrong beyond the third decimal place. The bug didn’t affect all of Intel’s processors, but it wasn’t only a few either.
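The most widely publicised demonstration of the bug was a single division. Here is a small sketch of it in Python; on a modern machine (and in Python) it prints the correct results, with comments noting what a flawed Pentium returned:

```python
# The widely publicised test case for the FDIV bug. This is only an
# illustration: any modern CPU prints the correct results.
x, y = 4195835.0, 3145727.0

print(x / y)               # correct value: 1.33382044...
                           # a flawed Pentium returned roughly 1.33373906...

# Residual check: mathematically this is exactly zero.
# A flawed Pentium famously returned 256.
print(x - (x / y) * y)
```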
Many people were not happy about this and demanded replacements. But Intel would only replace a chip if its owner could prove it had the error, which it turned out only a fraction of people ended up doing.
In all, this ended up costing Intel 475 million dollars.
Read more here.
Therac-25
One of the deadliest programming-related accidents in history.
The Therac-25 was a radiation therapy device developed by Atomic Energy of Canada Limited in 1982. Unlike modern medical units, which are controlled by separate computers, the Therac-25 had its own controller and operating system.
Patients treated using the Therac-25 between 1985 and 1987 were exposed to far higher levels of radiation than they were supposed to receive, and some of those overdoses proved fatal.
Researchers investigated the accidents and found that many issues contributed to them. Chief among them: the programmer who designed the system did not run any unit tests on his code, nor was the software independently tested or reviewed in combination with the hardware it controlled.
Nowadays, such machines are run by separate computers with separate operating systems, which makes these kinds of errors less likely.
Read more here.
Patriot Missile Defence System
Another example of how software errors can cost lives. In 1991, an Iraqi Scud missile hit U.S. Army barracks in Dhahran, Saudi Arabia, killing 28 soldiers.
The missile defence system whose purpose was to keep them safe failed to intercept the missile, and the incident was the starting gun for a government investigation into what went wrong.
The investigation found a programming error in the way the system handled timestamps, which caused a tracking calculation error that got worse the longer the system was in use.
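Here is a rough back-of-the-envelope sketch of that drift, based on the widely cited figures from the post-incident analysis (the code itself is purely illustrative, not the system’s actual software):

```python
# Illustrative sketch of the Patriot clock drift. Per the post-incident
# analysis, uptime was kept as an integer count of tenths of a second and
# converted to seconds by multiplying by 0.1, with 0.1 chopped to a 24-bit
# fixed-point constant. The chopped constant is slightly too small, so the
# computed time fell further behind real time the longer the system ran.

FRACTION_BITS = 23                       # fractional bits effectively kept for 0.1
CHOPPED_TENTH = int(0.1 * 2**FRACTION_BITS) / 2**FRACTION_BITS
ERROR_PER_TICK = 0.1 - CHOPPED_TENTH     # ~0.000000095 s lost every tenth of a second

uptime_hours = 100                       # roughly the battery's uptime at Dhahran
ticks = uptime_hours * 3600 * 10         # elapsed tenths of a second
drift = ticks * ERROR_PER_TICK           # ~0.34 s of accumulated clock error

scud_speed_m_per_s = 1676                # approximate Scud closing speed
print(f"clock drift after {uptime_hours} hours: {drift:.2f} s")
print(f"tracking error at Scud speed: {drift * scud_speed_m_per_s:.0f} m")
```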
The fix? Reboot the system.
But this wasn’t a practical long-term solution. After being notified of the bug a few weeks earlier by the Israelis, the manufacturer of the system released an updated version of the software.
Yet it didn’t reach the barracks until the day after the incident.
Read more here.
Conclusion
The key message here is: don’t skimp on your software testing efforts. You have a responsibility to develop software that is safe and isn’t a potential hazard for your clients.
I enjoyed researching some of these “epic bugs”, so look out for another blog post in the future with some more.