When a critical system's software badly malfunctions, the bug might, if we're unlucky, cause substantial property damage, lost stock value, or the administrative cost of managing an unnecessary disaster, not to mention cost the programmer their job. Sometimes, though, the consequences can be unimaginable. Today, we're looking back at some famous cases that demonstrate just how easy it is for bad programming to turn deadly.
Therac-25: It's Negligence All The Way Down
Anyone who has taken computer science at the university level has heard of the Therac-25 report. It is a major case study given to students by professors as a dire warning of what can happen when programmers get careless; specifically, that people can and will die.
The Therac-25 was a radiation therapy machine built in 1982 for treating aggressive forms of cancer. Built off earlier models, the Therac-6 and Therac-20, the Therac-25 used a dual-mode configuration: one mode delivered a milder dose of electron radiation, and the second used a significantly more powerful beam, hundreds of times stronger, to produce X-rays.
The designers of the earlier models recognized the danger of using the wrong setting, and built hardware safety mechanisms that would prevent an accidental overdose of radiation. As such, the programming that ran the machine could rely on the hardware interlocks to protect against incorrectly configuring the machine.
So, when the programmers for the Therac-25 copied the old code from the earlier machines and reused it in the new one, which no longer had those hardware interlocks, they had no idea that the software fail-safes in the old code weren't nearly as rigorous as they thought they were. And because the programmer for the original machines, a self-taught programmer with no formal training, left no comments in his code as he should have, the Therac-25 programmers had no way of knowing that the code had been written for a fundamentally different machine. They didn't even bother to adequately test the old code on the new machine, which would have revealed the discrepancy.
Most importantly, the original programmer hadn’t taken into account a “race-condition” between the two different dosage settings. If a technician missed a keystroke, they could accidentally instruct the machine to use one setting followed immediately by an instruction to use the other. Whichever instruction reached the machine first was how the machine was set, regardless of what the technician intended to do, and they wouldn't know the difference.
So rather than shooting a massive blast of radiation at a metal plate that would spread the X-rays over a wider area as it was supposed to, the Therac-25 would occasionally hit a patient directly with a narrow beam of radiation hundreds of times more powerful than intended. The beam was powerful enough to leave radiation burns on many patients and, in at least five known cases, to kill them.
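The race described above can be sketched as a small, deterministic simulation. This is only an illustration under stated assumptions, not the Therac-25's actual code: the names `Console`, `run_setup`, `SETUP_TICKS`, and the `recheck` flag are hypothetical, and the real timing was far more subtle. The idea is that the setup routine reads the operator's chosen mode twice, once before and once after a slow hardware step, and an edit typed in between leaves the machine half-configured for each mode.

```python
from dataclasses import dataclass

XRAY, ELECTRON = "xray", "electron"
SETUP_TICKS = 8  # hypothetical: how long the slow hardware setup takes


@dataclass
class Console:
    """Shared state: the mode currently shown on the operator's screen."""
    mode: str


def run_setup(console, edits, recheck):
    """Simulate one treatment setup, tick by tick.

    `edits` maps a tick number to the mode the operator types at that tick.
    Returns (beam_current, target_in_place) as the hardware ends up.
    """
    while True:
        snapshot = console.mode                 # read 1: choose the beam current
        beam_current = "high" if snapshot == XRAY else "low"
        for tick in range(SETUP_TICKS):         # slow step; the operator may type
            if tick in edits:
                console.mode = edits.pop(tick)
        target_in_place = console.mode == XRAY  # read 2: position the metal target
        if not recheck or console.mode == snapshot:
            return beam_current, target_in_place
        # The mode changed mid-setup: discard the work and start over.


def is_hazardous(beam_current, target_in_place):
    """A high-current beam with no target in its path hits the patient directly."""
    return beam_current == "high" and not target_in_place


# The operator selects X-ray, then corrects it to electron three ticks later.
racy = run_setup(Console(XRAY), {3: ELECTRON}, recheck=False)
safe = run_setup(Console(XRAY), {3: ELECTRON}, recheck=True)
print(racy, is_hazardous(*racy))
print(safe, is_hazardous(*safe))
```

Without the re-check, the two reads disagree and the simulated machine ends up with a high-current beam and no target. Re-checking the mode before returning is one simple software guard; the earlier machines' hardware interlocks served as an independent second one.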
The Patriot Missile That Failed To Fire
In life, simple, seemingly inconsequential mistakes are rarely dangerous. In programming, the simplest mistakes are often the cause of major disasters, and one of the most common mistakes of this kind is misusing floating-point arithmetic.
The problem with floating-point numbers is that there is only so much space available to represent a stored value in memory, and if the value needs more digits than that space can hold, the machine cuts the number off to fit. Cutting off a few digits far to the right of the decimal point might seem like no big deal, but to a computer performing millions of operations a second, these kinds of mistakes add up.
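A minimal sketch of that effect in Python, whose `float` is a 64-bit binary floating-point number: the value 0.1 cannot be stored exactly, so adding it ten times does not land exactly on 1.0. The variable names are only for illustration.

```python
import math

TICKS = 10

# Add the stored approximation of 0.1 ten times; each addition is rounded.
naive_total = 0.0
for _ in range(TICKS):
    naive_total += 0.1

# math.fsum tracks the lost low-order bits and rounds only once at the end.
compensated_total = math.fsum([0.1] * TICKS)

print(naive_total == 1.0)        # False
print(compensated_total == 1.0)  # True
print(abs(naive_total - 1.0))    # tiny, but not zero
```

The error here is minuscule, but it is the same kind of error that grows when a slightly wrong constant is applied millions of times.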
On February 25th, 1991, in the middle of the First Gulf War, a US military Patriot missile battery in Dhahran, Saudi Arabia, had been operating for 100 hours without a reboot. The Patriot battery was part of the US Army's defense against the mobile SCUD missiles being used by the Iraqi Army. The program responsible for tracking incoming SCUD missiles relied on a series of calculations that predicted where the tracked missile would be at the next interval as a function of its velocity and time.
Since the program controlling the Patriot battery had to keep track of time elapsed to determine the SCUD’s trajectory, every tenth of a second the program would query the time since boot-up from the system clock and the clock returned the tenths of a second since start-up as an integer. To convert this into seconds to make their calculations, the programmers multiplied this integer by 1/10, committing a freshman-level programming error in the middle of a war.
The value was stored in a 24-bit register, so any binary number longer than 24 bits had to be truncated to fit. 1/10 has a non-terminating binary representation, so cutting it off at 24 bits introduces an error of about 0.000000095 seconds on every tenth-of-a-second tick of the clock.
Add all of these errors together over the 100 hours the program was running and the time the program was using was 0.34 seconds off from where the other systems were.
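The arithmetic can be checked with a short sketch. It assumes the chopped value of 1/10 keeps 23 bits after the binary point; that layout is an assumption chosen because it reproduces the per-tick error quoted above, not a detail taken from the Patriot's source code. Exact fractions are used so that Python's own floating-point rounding does not muddy the result.

```python
from fractions import Fraction

FRACTION_BITS = 23     # assumption: bits kept after the binary point
TICKS_PER_SECOND = 10  # the clock counts tenths of a second
HOURS_UP = 100         # time since the last reboot

scale = 2 ** FRACTION_BITS
exact_tenth = Fraction(1, 10)
# Chop (do not round) 1/10 to the available bits, as a truncating register would.
chopped_tenth = Fraction(scale // 10, scale)

error_per_tick = exact_tenth - chopped_tenth
ticks = HOURS_UP * 3600 * TICKS_PER_SECOND
drift_seconds = error_per_tick * ticks

print(f"error per tick: {float(error_per_tick):.3e} s")
print(f"ticks in {HOURS_UP} hours: {ticks}")
print(f"accumulated drift: {float(drift_seconds):.4f} s")
```

Under these assumptions the per-tick error comes out at roughly 0.000000095 seconds, and 3.6 million ticks later the clock is about a third of a second off, in line with the 0.34 seconds cited above.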
So when the Patriot battery picked up and tracked an incoming SCUD missile on February 25th, it predicted where the missile would appear at the next interval by triangulating against two radar signals, one reflecting the correct time and one that had the 0.34-second error. It crunched the numbers and looked at the coordinates where it expected to find the SCUD, but there was only empty sky.
Assuming the SCUD had passed out of range, the Patriot did not fire as the SCUD sailed past and soon hit the barracks that the Patriot battery was supposed to protect, killing 28 people and injuring around 100 more.
Multidata's Radiation Therapy Software in Panama
Mistakes in coding can be deadly, but the failure to properly design and thoroughly test software before it is used can be just as deadly as the accidental misuse of floating-point arithmetic.
Such was the case in Panama City, Panama, when doctors at the National Cancer Institute were using medical software built by a US firm, Multidata Systems International. The software was an after-market attempt to keep the institute's outdated Cobalt-60 radiological equipment functioning.
On top of the poor-quality equipment, the doctors were overworked, stressed, and heavily reliant on Multidata's software to determine the proper dose of radiation to give patients battling severe cancers.
Part of the process was to look at the model of the patient on the screen and block off the healthy tissue with metal plates to protect it from the radiation, drawing the rectangular “blocks” directly onto the model on the screen. To the frustration of doctors at the Institute, the software only allowed them to draw four blocks or fewer.
The doctors insisted they needed a fifth block, so they looked through the documentation for a workaround. One doctor discovered that they could “fool” the software if they drew all five blocks as one large block with a hole cut out at the top where the large rectangle and the smaller cut-out shared a side.
What they didn’t know was that the software's programming was designed to read in the blocks a certain way and calculate the appropriate dosage based on what was drawn. Amazingly, their programming did not account for how someone might draw shapes with precisely overlapping sides, drawn on the screen in the same direction, clockwise or counterclockwise.
When drawn together like this, the programming read the shapes as a single continuous line and sort of lost its mind trying to figure out what the shape actually was, as if it had been fed an optical illusion. Somehow, that didn't stop it from calculating a dosage anyway, based on this shape it didn't understand.
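To see how drawing direction alone can change a computed result, here is a minimal sketch using the standard shoelace formula for polygon area. It is not Multidata's algorithm, which is not described here; the functions `naive_open_area` and `robust_open_area` and the coordinates are hypothetical. The outline is an outer rectangle plus a cut-out that shares its top side, and the naive version simply adds up signed areas, so the answer depends on which way the cut-out was traced.

```python
def signed_area(points):
    """Shoelace formula: positive for counterclockwise loops, negative for clockwise."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return total / 2.0


def naive_open_area(loops):
    """Trusts the drawing direction: assumes cut-outs run opposite to the outline."""
    return abs(sum(signed_area(loop) for loop in loops))


def robust_open_area(outer, holes):
    """Ignores drawing direction: always subtracts each cut-out's absolute area."""
    return abs(signed_area(outer)) - sum(abs(signed_area(h)) for h in holes)


outer = [(0, 0), (10, 0), (10, 10), (0, 10)]      # counterclockwise, area 100
hole_ccw = [(3, 6), (7, 6), (7, 10), (3, 10)]     # same direction as the outline
hole_cw = list(reversed(hole_ccw))                # opposite direction

print(naive_open_area([outer, hole_cw]))    # cut-out subtracted
print(naive_open_area([outer, hole_ccw]))   # cut-out added instead
print(robust_open_area(outer, [hole_ccw]))  # same answer either way
```

In the naive version, tracing the cut-out the "wrong" way silently adds its area instead of subtracting it, and nothing warns the user. Normalizing the direction, or rejecting shapes the program cannot interpret, avoids producing a confident number from an outline it did not understand.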
The doctors took the dosage figures at face value and kept using the same technique for seven months, getting dosages double what they should have been and overdosing an untold number of people.
Of the 28 people we do know received overdoses based on the sloppy programming in Multidata’s software, 8 of them died and the rest were horribly injured by the treatments they received. Considering that all of the patients at the institute were fighting an intractable and deadly disease, there's no way of knowing how many more people were sickened or killed by the erroneous dosage figures output by Multidata's software.
The Importance of Best Practices and Procedures in Programming
No one would ever accuse programmers of having an easy job. Learning to program sophisticated and highly-technical equipment takes a lifetime of work to get right, and even the best programmers still write buggy code.
That is why procedures and best practices have been developed in response to disasters like these. What the best programmers never lose sight of is how quickly a bug can turn into a disaster, and how easy it is for bad, negligent, and sloppy programming to kill the very people it was intended to help.