The Tragic Race Condition
Last week we talked about the future 2038 problem. But today let's go back in time.
In the early 1980s, a groundbreaking piece of medical technology emerged, promising to revolutionize cancer treatment.
The Therac-25, a computer-controlled radiation therapy machine, was a leap forward, promising to deliver precise radiation doses to patients battling cancer. It was one of the first radiation therapy machines controlled primarily by software rather than hardwired logic.
Hospitals eagerly adopted this cutting-edge device, trusting its software to accurately calibrate and deliver calculated dosages of high-energy radiation to destroy cancerous cells without harming surrounding healthy tissue. Little did they know, the Therac-25's software carried lethal bugs that would turn its precision medicine promise into a real-life nightmare.
Fatal Flaws
Between 1985 and 1987, a series of horrifying incidents occurred in which the Therac-25 massively overdosed patients with radiation levels hundreds of times higher than safe limits. The impacts were catastrophic, causing serious radiation injuries, permanent disabilities, and multiple deaths.
Malfunction 54
It started at a cancer center in East Texas in 1986. An experienced technician was setting up a radiation treatment, a task that, with some training and practice on the machine, takes only seconds to breeze through. After initially entering the wrong radiation mode code, she quickly course-corrected using the cursor keys, as was common practice. The machine then showed an error message, "Malfunction 54". The technician had seen these unclear error messages many times before during normal use. She didn't think it was a serious issue, so she went ahead with the treatment anyway.
But this bypassed an important safety step. It meant the machine's powerful scanning magnets were out of sync with the new radiation settings the technician had entered. As a result, the patient ended up getting blasted with a massive radiation dose - over 16,000 times more than he was supposed to receive. It was radiation exposure on the same level as the Chernobyl nuclear disaster. The patient said it felt like a "burning, shocking" sensation, as if hot coffee had been poured on him. After a few days the patient unfortunately died from the complications of this exposure.
Root Cause Analysis
The fatally flawed software behavior was eventually recreated during another patient treatment by the same technician who had dealt with the initial "Malfunction 54" error. Tragically, this second patient also died from a massive radiation overdose. The hospital physicist became convinced there was an underlying issue with how the machine's powerful magnets were controlling and shaping the radiation beam.
After trying different scenarios, the physicist finally managed to trigger the glitch by rapidly entering treatment settings in a very specific sequence at maximum speed. To his alarm, the amount of radiation being delivered was so high that it went beyond the maximum his testing equipment could measure. Recalibrating the detectors revealed a dose in the range of 10,000 to 20,000 rads - over 100 times the intended therapeutic level and lethally excessive.
Here's what was happening: multiple routines were running concurrently in the software, including ones for data entry and keyboard input handling. These routines shared a single variable that recorded when the technician finished entering commands.
Once data entry completed, the beam calibration and magnet setting phase would begin. However, if the technician made a specific sequence of rapid edits during that 8-second magnet setting window, the new settings wouldn't actually apply to the hardware: the shared variable already marked data entry as complete, so the magnet setting routine never picked up the edited values.
The user interface would then display the wrong treatment mode to the technician, who would proceed to confirm and start the potentially lethal treatment, unaware of the inconsistent machine state. This race condition defect was present in the previous Therac-20 model as well. But hardware interlocks on that older system prevented the flaw from actually causing radiation overdoses.
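To make the interleaving concrete, here is a small deterministic model of the two tasks described above: the shared "data entry complete" flag, an edit that lands inside the magnet setting window, and an independent consistency check standing in for the Therac-20's hardware interlocks. It is an added illustration, not Therac-25 code; the Machine class, the task names and the interlock check are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Machine:
    """Constructed model of the two cooperating tasks, not the real firmware."""
    prescribed_mode: str = ""            # what the operator's screen shows
    hardware_mode: Optional[str] = None  # what the magnets are actually set for
    data_entry_complete: bool = False    # the one shared flag both tasks use
    log: List[str] = field(default_factory=list)

def data_entry(m: Machine, mode: str) -> None:
    """Data-entry task: records the prescription and raises the shared flag."""
    m.prescribed_mode = mode
    m.data_entry_complete = True   # from here on, the treatment task owns the run

def set_magnets(m: Machine, edits_during_window: List[str]) -> None:
    """Magnet-setting phase (about 8 s on the real device). The values are
    latched once at the start; keyboard edits arriving mid-window update the
    display, but the shared flag already says complete, so nothing re-latches."""
    latched = m.prescribed_mode
    for new_mode in edits_during_window:
        m.prescribed_mode = new_mode           # screen now shows the edit
        m.log.append(f"edit to {new_mode} ignored by hardware (flag already set)")
    m.hardware_mode = latched

def fire_original(m: Machine) -> str:
    """Original behaviour: trusts the flag and whatever the display says."""
    return f"beam fired: hardware={m.hardware_mode} display={m.prescribed_mode}"

def fire_with_interlock(m: Machine) -> str:
    """Independent consistency check, the role the Therac-20 hardware
    interlocks played: refuse to fire on a hardware/prescription mismatch."""
    if m.hardware_mode != m.prescribed_mode:
        return "INTERLOCK: hardware state does not match prescription, beam inhibited"
    return fire_original(m)

def run_treatment(edits_during_window: List[str], interlock: bool) -> Machine:
    """Operator enters the wrong mode, then corrects it inside the window."""
    m = Machine()
    data_entry(m, "x-ray")                      # typo: should have been electron
    set_magnets(m, edits_during_window)         # correction lands mid-window
    m.log.append((fire_with_interlock if interlock else fire_original)(m))
    return m

if __name__ == "__main__":
    for interlock in (False, True):
        print("\n".join(run_treatment(["electron"], interlock).log))
```

Running it shows the original path firing an x-ray-configured beam while the display reads electron, and the interlocked path refusing to fire on the same input.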
The Aftermath
It took over 2 years and multiple serious overdoses before the connection was made between the race condition and these catastrophic failures in the Therac-25's operation. In all there were at least 6 incidents in which patients were given massive overdoses of radiation that altered their lives forever.
The aftermath investigations found several major contributing causes beyond just the specific race condition software bug. These included:
- Poor software development practices at the manufacturer AECL, with a lack of independent code review, testing, and safety analysis. The software was developed by a single unidentified programmer over the course of several years.
- Overconfidence and assurances given to operators that overdoses were impossible, causing them to discount safety warnings and hardware safeties.
- Engineering issues like cryptic error messages, lack of hardware interlocks, reusing software from older models without verifying safety.
In addition to Malfunction 54 there were a number of other bugs potentially costing lives. The tragedy exposed widespread lapses in regulatory oversight and quality control processes for medical device software at the time. It prompted an overhaul of FDA regulations and safety standards, including:
- Mandating rigorous hazard analysis, risk assessment, and testing protocols for safety-critical code.
- Stricter requirements around software design, development lifecycles, and quality assurance.
- Enabling better traceability through design history file documentation.
Not many of us work on software projects like this, but the Therac-25 incidents serve as a wake-up call that software bugs can have catastrophic real-world consequences when proper development rigor and safety processes are not followed.
Links to learn more: