Software Architecture: Therac-25, the killer radiation machine
Introduction
Every day in class I insist to my students that software must be tested, because they are playing with people's lives. They think I am joking, since we are developing web services that perform small calculations behind a simple RESTful API. To them, tests are just extra work, not a fundamental part of software development.
History shows that this small recommendation becomes fundamental when developing software for medical purposes.
When I was a Computer Science student, more than 17 years ago :-(, our professors explained that Software Engineering carries the same civil responsibility as the classic engineering disciplines (for example, Architecture): we are as responsible for our work as an architect would be if a bridge collapsed. In one of the software quality classes we discussed the famous case of the Therac-25, which came back to my mind these days while dealing with my students.
Therac-25: the killer radiation machine
In 1982, a machine called the Therac-25, created by Atomic Energy of Canada Limited (AECL), appeared in the medical field for treating cancer with electron beams and X-rays. It was an improvement on the Therac-20 and cost approximately 1 million dollars. The machine worked well on most of the occasions it was used, but it was linked to the deaths of several people and to serious radiation injuries in others. During the investigation, up to 6 cases were identified in which the machine was related to patients' deaths or injuries.
Wikipedia:
The machine offered two modes of radiation therapy:
- Direct electron-beam therapy, in which a narrow low-current beam of high-energy (5 MeV to 25 MeV) electrons was scanned over the treatment area by magnets;
- Megavolt X-ray (or photon) therapy, which delivered a fixed width beam of X-rays, produced by colliding a narrow 100-times higher current beam of 25 MeV electrons with a target, then passing the emitted X-rays through both a flattening filter and a collimator.
It also included a "Field light" mode, which allowed the patient to be correctly positioned by illuminating the treatment areas with visible light.
The software was basically responsible for:
- monitoring the machine
- accepting the input for the treatment
- setting up the machine to administer this treatment
- and finally controlling the machine while it carried out the treatment
Software Bugs
The bugs in the software were quite difficult to identify. The sequence of events that triggered the failure was the following:
- The operator made an error in the machine configuration at the start of the treatment (using the user interface).
- The operator corrected the error using the machine's software.
- The user interface indicated that everything was fine, or at least did not stop the process, and allowed the irradiation to continue.
- The patients received radiation doses up to 125 times higher than what had been configured.
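The sequence above can be pictured with a toy model. This sketch is not the Therac-25 code: the `Console` class, its fields, and the mode letters are assumptions made purely for illustration. It shows how a correction typed while the hardware is still being set up can change the screen without changing the machine, and how a final consistency check would catch the mismatch.

```python
from dataclasses import dataclass


@dataclass
class Console:
    """Toy model of an operator console; all names are hypothetical."""
    displayed_mode: str = ""        # what the operator sees on screen
    hardware_mode: str = ""         # what the machine is actually set up for
    setup_in_progress: bool = False

    def enter_mode(self, mode: str) -> None:
        self.displayed_mode = mode
        if not self.setup_in_progress:
            # First entry: start configuring the hardware.
            self.hardware_mode = mode
            self.setup_in_progress = True
        # BUG (deliberate): an edit made during setup changes only the screen.

    def finish_setup(self) -> None:
        self.setup_in_progress = False

    def beam_allowed_unchecked(self) -> bool:
        # Trusts the screen and always lets the treatment continue.
        return True

    def beam_allowed_checked(self) -> bool:
        # Refuses to fire unless screen and hardware agree.
        return self.displayed_mode == self.hardware_mode


console = Console()
console.enter_mode("X")   # operator selects X-ray mode by mistake
console.enter_mode("E")   # operator quickly corrects to electron mode
console.finish_setup()

print(console.displayed_mode, console.hardware_mode)   # E X
print(console.beam_allowed_unchecked())                # True
print(console.beam_allowed_checked())                  # False
```

The point of the sketch is the last line: a check that compares what the operator sees with what the machine is really configured for turns a silent mismatch into a refusal to treat.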
The main problems in the development of this software were the following:
- The programmers did not anticipate that operators could make a mistake in the configuration and would have to reconfigure the machine. They assumed that the operator would always follow the correct configuration path. This is one of the main reasons why testing should be done by people other than those who develop the software: developers know the software so well that they rarely make mistakes when using it, so their use of it is quasi-perfect.
- No tests of any kind were written; the developers only tried the success cases while they were developing the software.
A commission concluded that the main causes of this disaster were an incorrect software architecture and poor software development practices, rather than specific programming errors. The accidents could have been avoided if minimum software quality standards had been applied.
Most of the software that newbies develop is difficult to test because of the strong coupling between its functions and elements. The really surprising thing was that the software of a machine costing about 1 million dollars, developed by a company dedicated to medical machinery, had exactly the same issues as a novice's code. The quality of the development was so low that the commission itself could not write tests for it.
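To make the coupling problem concrete, here is a minimal sketch of the opposite approach: control logic that receives its hardware as a parameter, so that it can be tested against a fake machine. Every name here (`Hardware`, `BeamController`, `FakeHardware`) and the safety rule itself are assumptions for illustration; nothing in it comes from the real Therac-25 code.

```python
from typing import Protocol


class Hardware(Protocol):
    """What the control logic needs from the machine (hypothetical)."""

    def target_in_place(self) -> bool: ...
    def fire(self, mode: str) -> None: ...


class BeamController:
    """Control logic that receives its hardware instead of creating it."""

    def __init__(self, hardware: Hardware) -> None:
        self.hardware = hardware

    def treat(self, mode: str) -> bool:
        # Assumed rule: X-ray mode needs the target in the beam path.
        if mode == "X" and not self.hardware.target_in_place():
            return False
        self.hardware.fire(mode)
        return True


class FakeHardware:
    """Test double standing in for the real machine."""

    def __init__(self, target: bool) -> None:
        self.target = target
        self.fired = []          # modes the controller asked us to fire

    def target_in_place(self) -> bool:
        return self.target

    def fire(self, mode: str) -> None:
        self.fired.append(mode)


# The dangerous case can now be tested without a real machine.
fake = FakeHardware(target=False)
controller = BeamController(fake)
print(controller.treat("X"), fake.fired)   # False []
```

Because the controller never touches real devices directly, the failure cases can be exercised on a developer's desk, long before the machine is assembled.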
The conclusions of the commission were the following:
- AECL did not have the software code independently reviewed (it was reviewed only by its own developers).
- AECL did not consider the design of the software during its assessment of how the machine might produce the desired results and what failure modes existed. These form parts of the general techniques known as reliability modeling and risk management.
- The system noticed that something was wrong and halted the X-ray beam, but merely displayed the word "MALFUNCTION" followed by a number from 1 to 64. The user manual did not explain or even address the error codes, so the operator pressed the P key to override the warning and proceed anyway.
- AECL personnel, as well as machine operators, initially did not believe complaints. This was likely due to overconfidence.
- AECL had never tested the Therac-25 with the combination of software and hardware until it was assembled at the hospital.
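The "MALFUNCTION" finding above suggests a simple design rule: every error code should carry a description, and safety-critical faults should not be overridable from the keyboard. The following is a minimal sketch of that rule. The fault table, its codes, and its descriptions are invented for illustration and do not correspond to the real Therac-25 error codes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    code: int
    description: str
    overridable: bool   # may the operator press P to proceed?


# Hypothetical fault table; the real codes and meanings are not given here.
FAULTS = {
    1: Fault(1, "printer offline", overridable=True),
    2: Fault(2, "delivered dose does not match the prescription",
             overridable=False),
}


def may_proceed(code: int) -> bool:
    """Only known, explicitly overridable faults can be skipped."""
    fault = FAULTS.get(code)
    return fault is not None and fault.overridable


def operator_message(code: int) -> str:
    """Build a message that tells the operator what happened and what to do."""
    fault = FAULTS.get(code)
    description = fault.description if fault else "unknown fault"
    if may_proceed(code):
        action = "press P to proceed"
    else:
        action = "treatment locked, call service"
    return f"MALFUNCTION {code}: {description}; {action}"


print(operator_message(1))
print(operator_message(2))
print(operator_message(64))
```

Note that an unknown code is treated as not overridable: when the software cannot explain a fault, the safe default is to stop.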
The researchers also found several engineering issues:
- The failure occurred only when a particular nonstandard sequence of keystrokes was entered on the VT-100 terminal which controlled the PDP-11 computer: an "X" to (erroneously) select 25 MeV photon mode followed by "cursor up", "E" to (correctly) select 25 MeV Electron mode, then "Enter", all within eight seconds.
- The design did not have any hardware interlocks to prevent the electron-beam from operating in its high-energy mode without the target in place.
- The engineer had reused software from older models. Such methods manifest in so-called cargo cult programming, where there is blind reliance on previously created code that is poorly understood and may or may not be applicable. These models had hardware interlocks that masked their software defects. Those hardware safeties had no way of reporting that they had been triggered, so there was no indication of the existence of faulty software commands.
- The hardware provided no way for the software to verify that sensors were working correctly (see open-loop controller). The table-position system was the first implicated in Therac-25's failures; the manufacturer revised it with redundant switches to cross-check their operation.
- The equipment control task did not properly synchronize with the operator interface task, so that race conditions occurred if the operator changed the setup too quickly. This was missed during testing, since it took some practice before operators were able to work quickly enough to trigger this failure mode.
- The software set a flag variable by incrementing it, rather than by setting it to a fixed non-zero value. Occasionally an arithmetic overflow occurred, causing the flag to return to zero and the software to bypass safety checks.
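The last item in the list is easy to reproduce. The sketch below simulates a flag stored in a single byte, which is an assumption about the variable's width made for illustration, and compares incrementing it with assigning a fixed value. The function names are hypothetical.

```python
def increment_flag(flag: int) -> int:
    """Mark 'check needed' by adding 1, as if stored in one byte (0..255)."""
    return (flag + 1) & 0xFF


def set_flag(flag: int) -> int:
    """Mark 'check needed' by assigning a fixed non-zero value."""
    return 1


def first_skipped_check(update, passes: int = 1000):
    """Return the pass number on which the flag reads zero, or None."""
    flag = 0
    for n in range(1, passes + 1):
        flag = update(flag)
        if flag == 0:            # zero means "no check needed"
            return n
    return None


print(first_skipped_check(increment_flag))  # 256
print(first_skipped_check(set_flag))        # None
```

With the incrementing version, the flag wraps around to zero on the 256th pass and the safety check is silently skipped; with a fixed assignment that can never happen.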
Conclusions
This article has presented a famous case in the history of medicine and Software Engineering, one that puts the spotlight on how bad software engineering practices have cost people their lives.
This story makes us reflect seriously on the need to regulate a minimum level of quality in software development, as in other engineering fields. Although most software applications are of no major importance, critical software is also being developed, such as the software installed in airports, airplanes, cars, medical equipment, elevators, etc. Without a minimum level of software quality, the final outcome can be dramatic.
Therefore, software development cannot be simply copying and pasting code. People who think they are super hackers or divine gods, and that they do not need to write tests for the code they produce, should be punished in the same way as an architect or naval engineer who does not meet the quality requirements.
However, accidents can always occur, and their cause is not necessarily ignorance or bad practice. Software development is a science that is beginning to reach considerable maturity. Nevertheless, new professionals join the field every day, and they need adequate training and a culture of good software engineering practices.
More, More and More...
- http://sunnyday.mit.edu/papers/therac.pdf
- Death and Denial, The Failure of the Therac-25 (http://cobra.csc.calpoly.edu/~dbutler/papers/THERAC25.html)
- The Downfall of the Therac-25 (http://net.cs.utexas.edu/users/dianelaw/cs378/therac.htm)
- The Therac-25 Incident (http://ei.cs.vt.edu/~cs3604/lib/Therac_25/TheracClass.html)
- Human Error in Medicine (http://www.smi.stanford.edu/people/felciano/research/humanerror/humanerrortalk.html)
- An Investigation of the Therac-25 Accidents (part 1-5) (http://ei.cs.vt.edu/~cs3604/lib/Therac_25/Therac_1.html)