How to Kill People with Bad Software


The Incidents

Linda Knight: June 3, 1985

61-year-old Linda Knight had been receiving follow-up treatment at the Kennestone Regional Oncology Center (Marietta, GA) after the removal of a malignant breast tumor. On June 3, staff at Kennestone prepared Knight for electron treatment to the clavicle area, using a Therac-25 radiation therapy machine. Knight had been through the process before, and it was ordinarily uneventful. This time, when the machine was turned on, Knight felt a "tremendous force of heat ... this red-hot sensation." When the technician re-entered the therapy room, Knight said, "you burned me." The technician replied that this was "not possible."

For a week, doctors continued to send Knight back to Kennestone for Therac treatment, but when the welt on her chest began to break down and lose layers of skin, Knight refused to undergo any more radiation treatment.

About two weeks later, the physicist at Kennestone noticed that Knight had a matching burn on her back, as though the burn had passed through her body. The swelling on her back had also begun to slough off skin. Knight was in great pain, and her shoulder had become immobile. These clues led the physicist to conclude that Knight had indeed suffered a major radiation burn. Knight had probably received one or two radiation doses in the 20,000-rad (radiation absorbed dose) range, well above the typical prescribed dose of around 200 rads.

Linda Knight was left in constant pain, lost the use of her shoulder and arm, and eventually had her left breast removed because of the radiation burns.

Radiation Absorbed Dose

Medical linear accelerators (linacs) accelerate electrons to create high-energy beams that can destroy tumors with minimal impact on the surrounding healthy tissue. Relatively shallow tissue is treated with the accelerated electrons; to reach deeper tissue, the electron beam is converted into X-ray photons. The unit used for therapeutic dosing is the radiation absorbed dose (rad), a measure of the radiation absorbed by tissue during a treatment. Standard single radiation treatments are in the range of 200 rads. A whole-body exposure of about 500 rads is generally accepted as the dose that will kill roughly half of those exposed. The unprotected electron beam in the Therac-25 is capable of producing between 15,000 and 20,000 rads in a single treatment. The unprotected beam is never supposed to be aimed directly at a patient: it is either spread to a safe concentration by scanning magnets or converted into X-rays and attenuated by a beam flattener.
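
To put those numbers in perspective, here is a rough back-of-the-envelope comparison using the figures above (a small C sketch; the arithmetic is purely illustrative, not dosimetry):

    #include <stdio.h>

    int main(void)
    {
        /* Figures taken from the paragraph above (all in rads). */
        double prescribed_dose   =   200.0;   /* typical single treatment         */
        double lethal_whole_body =   500.0;   /* whole-body dose lethal to ~50%   */
        double raw_beam_dose     = 20000.0;   /* unscanned Therac-25 beam, upper  */

        printf("raw beam vs. typical treatment: %.0f times\n",
               raw_beam_dose / prescribed_dose);
        printf("raw beam vs. lethal whole-body dose: %.0f times\n",
               raw_beam_dose / lethal_whole_body);
        /* Roughly 100 times a normal treatment and 40 times the whole-body
           lethal dose, although delivered to a small area rather than the
           whole body.                                                         */
        return 0;
    }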

Donna Gartner: July 26, 1985

Donna Gartner, a 40-year-old cancer patient, was at the Ontario Cancer Foundation clinic in Hamilton, Ontario, Canada for her 24th Therac treatment for carcinoma of the cervix. During the procedure the software indicated a minor operating error, and the treatment continued. Although the Therac indicated that no radiation dose had been given during any of Gartner's five therapy attempts that day, she complained of a burning sensation in the treated area of her hip, which she described as an "electric tingling shock."

Donna Gartner died on November 3, 1985 from cancer. An autopsy revealed that had the cancer not killed Gartner, a total hip replacement would have been necessary because of the radiation overexposure.

Janis Tilman: December 1985

Janis Tilman was being treated with the Therac-25 machine at the Yakima Valley Memorial Hospital in Yakima, Washington. After one treatment in December 1985, the skin in the treatment area, her right hip, began to redden in a parallel striped pattern. The reddening did not immediately follow the treatment because it generally takes at least several days before skin reddens or swells from a radiation overexposure. When the case was investigated in February 1987, the Yakima staff found that Tilman had a chronic skin ulcer and dead tissue in her hip and was in constant pain, further evidence of a radiation overexposure. Tilman underwent surgery and skin grafts and recovered from the incident with minor disability and some scarring related to the overdose.

Isaac Dahl: March 22, 1986

At the East Texas Cancer Center (ETCC) in Tyler, Texas, 33-year-old Isaac Dahl was to receive his ninth Therac-25 radiation therapy session after a tumor had been successfully removed from his left shoulder.

The operator was isolated from Dahl because the Therac-25 operates inside a shielded treatment room. On this day the video monitor was unplugged and the audio monitor was broken, leaving the operator no way to know what was happening inside. Isaac Dahl had been lying on the treatment table, waiting for the usually uneventful radiation therapy, when he saw a bright flash of light, heard a frying, buzzing sound, and felt a thump and heat like an electric shock.

Dahl, knowing from his previous 8 sessions that this was not normal, began to get up from the treatment table when the second "attempt" at treatment occurred. This time the electric-like jolt hit him in the neck and shoulder. He rolled off the table and pounded on the treatment room door until the surprised Therac-25 operator opened it. Dahl was immediately examined by a physician, who observed reddening of the skin but suspected only an electric shock. Dahl was discharged and told to return if he suffered any further complications.

Isaac Dahl's condition worsened as he lost the use of his left arm and had constant pain and periodic nausea and vomiting spells. He was later hospitalized for several major radiation-induced symptoms (including vocal cord paralysis, paralysis of his left arm and both legs, and a lesion on his left lung). Dahl died in August of 1986 due to complications from the radiation overdose.

Daniel McCarthy: April 11, 1986

Technicians could find nothing wrong with the Therac-25 unit at the East Texas Cancer Center (ETCC) and left it in service.

Daniel McCarthy was being treated for skin cancer on the side of his face. The same Therac operator who had treated Isaac Dahl began McCarthy's treatment, but the Therac-25 shut down within a few seconds, making a noise audible through the newly repaired intercom. The Therac monitor read "Malfunction 54." The operator rushed into the treatment room and found McCarthy moaning for help. He said that his face was on fire. The hospital physicist was called. McCarthy said that something had hit the side of his face, and that he had seen a flash of light and heard a sizzling sound.

Over the next three weeks Daniel McCarthy became very disoriented and then fell into a coma. He had a fever as high as 104 degrees and had suffered neurological damage. He died on May 1, 1986.

Anders Engman: January 17, 1987

Anders Engman was at the Yakima Valley Memorial Hospital on January 17, 1987 to receive three sets of radiation treatment from the Therac-25. CMC engineers estimated that Engman received between 8,000 and 10,000 rads instead of the prescribed 86 rads.

Anders Engman died in April 1987. He had been suffering from a terminal form of cancer before the Therac accident, but it was determined that his death was primarily caused by complications related to the radiation overdose, not the cancer.

Because some accidents were never officially investigated, some information on the Therac-25 software development, management, and quality control procedures is not available. What is included below has been gleaned from lawsuits and depositions, government records, and copies of correspondence and other material obtained from the U.S. Food and Drug Administration (FDA), which regulates these devices.

All lawsuits arising from these incidents were settled out of court.


Background

In the early 1970s, Atomic Energy of Canada Limited (AECL) and a French company called CGR went into business together building linear accelerators. The products of this cooperation were (1) the Therac-6, a 6 million electron volt (MeV) accelerator capable of producing X-rays only, and later (2) the Therac-20, a 20 MeV, dual-mode (X-rays or electrons) accelerator.

Several features of the Therac-25 are important in understanding the accidents. First, like the Therac-6 and the Therac-20, the Therac-25 is controlled by a PDP-11 computer. However, AECL designed the Therac-25 to take advantage of computer control from the outset; they did not build on a stand-alone machine. The Therac-6 and Therac-20 had been designed around machines that already had histories of clinical use without computer control.

In addition, the Therac-25 software has more responsibility for maintaining safety than the software in the previous machines. The Therac-20 has independent protective circuits for monitoring the electron-beam scanning plus mechanical interlocks for policing the machine and ensuring safe operation. The Therac-25 relies more on software for these functions. AECL took advantage of the computer's abilities to control and monitor the hardware and decided not to duplicate all the existing hardware safety mechanisms and interlocks.


Real-Time Software

Real-time software is software that interacts with the world on the world's schedule, not the software's. For instance, software to keep a radio tuner on the signal of a drifting station could take two approaches. It might simply update the tuning every 0.1 seconds, searching for the strongest signal within some bandwidth. Another approach is to include a sensor that detects when the signal loses strength and only then search for a stronger signal nearby. This latter approach is real-time: it senses the world and responds to changes in the world when those changes occur.
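
To make the contrast concrete, here is a toy sketch of the two tuner strategies in C. The hardware functions (signal_weak, retune_to_strongest, sleep_ms) are hypothetical stubs, and the event-driven loop stands in for what would normally be an interrupt handler:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical hardware stubs, for illustration only. */
    bool signal_weak(void)         { return false; }   /* has the station drifted?  */
    void retune_to_strongest(void) { puts("retuning"); }
    void sleep_ms(int ms)          { (void)ms; }        /* stand-in for a delay call */

    /* Approach 1: polling, on the software's own schedule. */
    void tuner_polling(void)
    {
        for (;;) {
            retune_to_strongest();   /* search every cycle, needed or not */
            sleep_ms(100);
        }
    }

    /* Approach 2: real-time, responding only when the world changes. */
    void tuner_event_driven(void)
    {
        for (;;) {
            if (signal_weak())       /* the sensor reports an event in the world */
                retune_to_strongest();
            sleep_ms(1);
        }
    }

    int main(void) { return 0; }     /* stubs only; neither loop is started here */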

Real-time control involves the software in reading and responding to sensors that report the state of "the world." With the Therac-25, these sensors indicated things like the intensity of the beam, the position of various parts of the machine (e.g., the turntable), and commands entered at the console by the operator. Sensors, of course, can go bad or give incorrect readings. When they do, the software needs to be able to detect these problems and respond accordingly, or at least fail in a graceful manner that does not endanger life. So the Therac software needed to track and respond to several things in real time without missing any critical events.
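
As a minimal illustration of that kind of defensive sensing (a sketch, not the actual Therac-25 design; the names and encoder values are invented), the software can refuse to permit the beam whenever a position sensor disagrees with the selected mode:

    #include <stdio.h>

    typedef enum { MODE_XRAY, MODE_ELECTRON } beam_mode;

    /* Hypothetical turntable positions, in encoder counts. */
    #define TURNTABLE_XRAY_POS      100
    #define TURNTABLE_ELECTRON_POS  900
    #define POSITION_TOLERANCE        5

    /* Fail safe: the beam is withheld unless the turntable agrees with the mode. */
    int beam_permitted(beam_mode mode, int turntable_position)
    {
        int expected = (mode == MODE_XRAY) ? TURNTABLE_XRAY_POS
                                           : TURNTABLE_ELECTRON_POS;
        int diff = turntable_position - expected;
        if (diff < 0)
            diff = -diff;

        /* An out-of-range or mismatched reading is treated as a fault, not ignored. */
        return diff <= POSITION_TOLERANCE;
    }

    int main(void)
    {
        /* X-ray mode requested while the turntable is still in the electron
           position: the safe response is to withhold the beam and report it. */
        if (!beam_permitted(MODE_XRAY, 900))
            printf("FAULT: turntable position does not match selected mode\n");
        return 0;
    }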

The main tasks for which the software is responsible include:

Operator: reading, checking, and acting on the treatment data entered at the console.

Machine: monitoring and controlling the hardware, including the beam and the position of the turntable.


The Reactions:

After each overdose the creators of the Therac-25 were contacted. After the first incident, AECL's response was simple: "After careful consideration, we are of the opinion that this damage could not have been produced by any malfunction of the Therac-25 or by any operator error" (Leveson, 1993).

After the second incident, AECL sent a service technician to the Therac-25 machine. He was unable to recreate the malfunction and therefore concluded that nothing was wrong with the software. Some minor adjustments were made to the hardware, but the main problems remained.

It was not until the fifth incident that AECL took any formal action. Even then, it was a physicist at the hospital in Tyler, Texas, where the fourth and fifth incidents took place, who actually reproduced the mysterious "Malfunction 54." AECL finally took action and made a variety of changes to the software of the Therac-25 radiation treatment system. The machine itself is still in use today.


The Blame

The general consensus is that Atomic Energy of Canada Limited is to blame. Only one person programmed the code for this system, and he largely did all of the testing himself. The machine was tested for only 2,700 hours of use; for code that controls such a critical machine, many more hours should have been put into the testing phase. The Therac-25 was also tested as a whole machine rather than in separate modules, and testing in separate modules would have discovered many of the bugs, as the sketch below suggests. Finally, if AECL had accepted that there were problems with the Therac-25 right after the first incident, most of the five later incidents, and possibly the three fatalities, could have been avoided.
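
As a rough illustration of what module-level testing could have caught (this is a sketch, not AECL's code; setup_test_pass and collimator_checked are invented names), a test that drives the setup routine through more than 256 passes exposes the one-byte counter rollover described later under "Specific Problem Example":

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    static uint8_t class3;              /* one-byte pass counter, wraps at 256       */
    static bool    collimator_checked;  /* did this pass check the collimator?       */

    /* Simplified stand-in for one pass of the Therac-25 "Set Up Test" routine. */
    static void setup_test_pass(void)
    {
        class3++;                            /* rolls over to 0 every 256th pass     */
        collimator_checked = (class3 != 0);  /* zero is treated as "no check needed" */
    }

    int main(void)
    {
        /* Module-level test: the safety check must run on EVERY pass. */
        for (int pass = 1; pass <= 1000; pass++) {
            setup_test_pass();
            assert(collimator_checked);      /* fails on pass 256, exposing the bug  */
        }
        return 0;
    }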


Causal Factors

Many lessons can be learned from this series of accidents. A few are considered here.

Overconfidence in Software.

A common mistake in engineering, in this case and in many others, is to put too much confidence in software. There seems to be a feeling among non-software professionals that software will not or cannot fail, which leads to complacency and overreliance on computer functions.

A related tendency among engineers is to ignore software. The first safety analysis on the Therac-25 did not include software, even though nearly full responsibility for safety rested on it. When problems started occurring, it was assumed that hardware had caused them, and the investigation looked only at the hardware.

Confusing Reliability with Safety.

This software was highly reliable. It worked tens of thousands of times before overdosing anyone, and occurrences of erroneous behavior were few and far between. AECL assumed that their software was safe because it was reliable, and this led to complacency.

Lack of Defensive Design.

The software did not contain self-checks or other error-detection and error-handling features that would have detected the inconsistencies and coding errors. Audit trails were limited because of a lack of memory. However, larger memories are available today, and audit trails and other design techniques must be given high priority in making trade-off decisions.

Patient reactions were the only real indications of the seriousness of the problems with the Therac-25; there were no independent checks that the machine and its software were operating correctly. Such verification cannot be assigned to operators without providing them with some means of detecting errors: The Therac-25 software "lied" to the operators, and the machine itself was not capable of detecting that a massive overdose had occurred. The ion chambers on the Therac-25 could not handle the high density of ionization from the unscanned electron beam at high beam current; they thus became saturated and gave an indication of a low dosage. Engineers need to design for the worst case.
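
One form such an independent check could take (a minimal sketch, not the Therac-25 design; the monitor names and tolerance are assumptions) is to require two independent dose monitors to agree with each other and with the prescription before their readings are believed, so that a single saturated chamber produces a detectable fault instead of a plausible-looking low-dose indication:

    #include <math.h>
    #include <stdio.h>

    /* Hypothetical tolerance; a real value would come from the machine's physics. */
    #define AGREEMENT_TOLERANCE_RADS 5.0

    typedef enum { DOSE_OK, DOSE_FAULT } dose_status;

    /* Cross-check two independent dose monitors against each other and the
       prescription; any disagreement is a fault, never a "low dose".         */
    dose_status check_dose(double chamber_a, double chamber_b, double prescribed)
    {
        if (fabs(chamber_a - chamber_b) > AGREEMENT_TOLERANCE_RADS)
            return DOSE_FAULT;                       /* monitors disagree        */
        if (fabs(chamber_a - prescribed) > AGREEMENT_TOLERANCE_RADS)
            return DOSE_FAULT;                       /* dose far from prescribed */
        return DOSE_OK;
    }

    int main(void)
    {
        /* Example: one chamber saturates and under-reads while the other reads high. */
        if (check_dose(40.0, 900.0, 200.0) == DOSE_FAULT)
            printf("FAULT: suspend the beam and require manual investigation\n");
        return 0;
    }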

Failure to Eliminate Root Causes.

One of the lessons to be learned from the Therac-25 experiences is that focusing on particular software design errors is not the way to make a system safe. Virtually all complex software can be made to behave in an unexpected fashion under some conditions: There will always be another software bug. Just as engineers would not rely on a design with a hardware single point of failure that could lead to catastrophe, they should not do so if that single point of failure is software. The Therac-20 contained the same software error implicated in the Tyler deaths, but this machine included hardware interlocks that mitigated the consequences of the error. Protection against software errors can and should be built into both the system and the software itself. We cannot eliminate all software errors, but we can often protect against their worst effects, and we can recognize their likelihood in our decision making.

One of the serious mistakes that led to the multiple Therac-25 accidents was the tendency to believe that the cause of an accident had been determined (e.g., a microswitch failure in the case of Hamilton) without adequate evidence to come to this conclusion and without looking at all possible contributing factors. Without a thorough investigation, it is not possible to determine whether a sensor provided the wrong information, the software provided an incorrect command, or the actuator had a transient failure and did the wrong thing on its own. In the case of the Hamilton accident, a transient microswitch failure was assumed to be the cause even though the engineers were unable to reproduce the failure or to find anything wrong with the microswitch.

In general, it is a mistake to patch just one causal factor (such as the software) and assume that future accidents will be eliminated. Accidents are unlikely to occur in exactly the same way again. If we patch only the symptoms and ignore the deeper underlying causes, or if we fix only the specific cause of one accident, we are unlikely to have much effect on future accidents. The series of accidents involving the Therac-25 is a good example of exactly this problem: Fixing each individual software flaw as it was found did not solve the safety problems of the device.

Complacency.

Often it takes an accident to alert people to the dangers involved in technology. A medical physicist wrote about the Therac-25 accidents:

In the past decade or two, the medical accelerator "industry" has become perhaps a little complacent about safety. We have assumed that the manufacturers have all kinds of safety design experience since they've been in the business a long time. We know that there are many safety codes, guides, and regulations to guide them and we have been reassured by the hitherto excellent record of these machines. Except for a few incidents in the 1960's (e.g., at Hammersmith, Hamburg) the use of medical accelerators has been remarkably free of serious radiation accidents until now. Perhaps, though, we have been spoiled by this success [6]. This problem seems to be common in all fields.

Unrealistic Risk Assessments.

The first hazard analyses initially ignored software, and then they treated it superficially by assuming that all software errors were equally likely. The probabilistic risk assessments generated undue confidence in the machine and in the results of the risk assessments themselves. When the first Yakima accident was reported to AECL, the company did not investigate. Their evidence for their belief that the radiation burn could not have been caused by their machine included a probabilistic risk assessment showing that safety had increased by five orders of magnitude as a result of the microswitch fix.

The belief that safety had been increased by such a large amount seems hard to justify. Perhaps it was based on the probability of failure of the microswitch (typically 10^-5) ANDed with the failure probabilities of the other interlocks. The problem with all such analyses is that they typically make many independence assumptions and exclude aspects of the problem, in this case software, that are difficult to quantify but that may have a larger impact on safety than the quantifiable factors that are included.
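
A small sketch shows how such a reassuring number is typically produced and why it can mislead: multiplying individual failure probabilities assumes the interlocks fail independently and at random, and it silently leaves the unmodeled software out of the calculation entirely. The probabilities below are made up for illustration:

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative, made-up failure probabilities per treatment. */
        double p_microswitch = 1e-5;   /* a "typical" microswitch failure rate   */
        double p_interlock   = 1e-4;   /* some other hardware interlock failing  */

        /* Independence assumption: an overdose requires BOTH to fail at once,
           so the probabilities are simply multiplied (ANDed). Software does
           not appear in the model at all.                                      */
        double p_accident = p_microswitch * p_interlock;

        printf("claimed overdose probability: %g per treatment\n", p_accident);

        /* 1e-9 looks like an enormous safety margin, but a software flaw that
           fires on a particular operator editing sequence is not a rare,
           independent random event, so this number says nothing about it.     */
        return 0;
    }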

Inadequate Investigation or Followup on Accident Reports.

Every company building safety-critical systems should have audit trails and incident analysis procedures that are applied whenever any hint of a problem is found that might lead to an accident. The first phone call by Tim Still should have led to an extensive investigation of the events at Kennestone. Certainly, learning about the first lawsuit should have triggered an immediate response.

Inadequate Software Engineering Practices.

Some basic software engineering principles that apparently were violated in the case of the Therac-25 include the following:

The design is unnecessarily complex for such critical software. It is untestable in the sense that the design ensured that the known errors (there may very well be more that have just not been found) would most likely not have been found using standard testing and verification techniques. This does not mean that software testing is not important, only that software must be designed to be testable and that simple designs may prevent errors in the first place.

Poor Software Reuse.

Important lessons about software reuse can be found in these accidents. A naive assumption is often made that reusing software or using commercial off-the-shelf software will increase safety because the software will have been exercised extensively. Reusing software modules does not guarantee safety in the new system to which they are transferred and sometimes leads to awkward and dangerous designs. Safety is a quality of the system in which the software is used; it is not a quality of the software itself. Rewriting the entire software in order to get a clean and simple design may be safer in many cases.

Safe versus Friendly User Interfaces.

Making the machine as easy as possible to use may conflict with safety goals. Certainly, the user interface design left much to be desired, but eliminating multiple data entry and assuming that operators would check the values carefully before pressing the return key was unrealistic.

Error messages provided to the operator were cryptic, and some merely consisted of the word malfunction followed by a number from 1 to 64 denoting an analog/digital channel number. According to an FDA memorandum written after one accident:

The operator's manual supplied with the machine does not explain nor even address the malfunction codes. The Maintance [sic] Manual lists the various malfunction numbers but gives no explanation. The materials provided give no indication that these malfunctions could place a patient at risk. The program does not advise the operator if a situation exists wherein the ion chambers used to monitor the patient are saturated, thus are beyond the measurement limits of the instrument.

This software package does not appear to contain a safety system to prevent parameters being entered and intermixed that would result in excessive radiation being delivered to the patient under treatment.

A radiation therapist at another clinic reported that an average of 40 dose-rate malfunctions, attributed to underdoses, occurred on some days. The operator further testified that during instruction she had been taught that there were "so many safety mechanisms" that she understood it was virtually impossible to overdose a patient.
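
As a rough sketch of the alternative (not AECL's code; the codes, severities, and wording below are invented for illustration), fault codes can be mapped to messages that say what happened, how serious it is, and what the operator must do:

    #include <stdio.h>

    /* Hypothetical fault table: code, severity, and an actionable message. */
    struct fault {
        int         code;
        const char *severity;
        const char *message;
    };

    static const struct fault faults[] = {
        { 54, "CRITICAL",
          "Delivered dose disagrees with prescribed dose; possible overdose. "
          "Do NOT resume. Remove the patient and call the physics staff." },
        { 12, "WARNING",
          "Turntable position could not be verified. Treatment suspended." },
    };

    /* Print a human-readable report instead of a bare "MALFUNCTION nn". */
    void report_fault(int code)
    {
        for (size_t i = 0; i < sizeof faults / sizeof faults[0]; i++) {
            if (faults[i].code == code) {
                printf("[%s] Fault %d: %s\n",
                       faults[i].severity, code, faults[i].message);
                return;
            }
        }
        printf("[CRITICAL] Unknown fault %d: suspend treatment and investigate.\n",
               code);
    }

    int main(void)
    {
        report_fault(54);   /* compare with the Therac-25's bare "Malfunction 54" */
        return 0;
    }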

User and Government Oversight and Standards.

Once the FDA got involved in the Therac-25, their response was impressive, especially considering how little experience they had with similar problems in computer-controlled medical devices. Since the Therac-25 events, the FDA has moved to improve the reporting system and to augment their procedures and guidelines to include software. The input and pressure from the user group was also important in getting the machine fixed and provides an important lesson to users in other industries.


Conclusion:

The Therac-25 is one of the most devastating computer-related engineering disasters to date. The machine was designed to help people, and largely it did. Yet sloppy engineering on the part of AECL led to the death or serious injury of six people. These incidents could have been avoided if AECL had reacted instead of denying responsibility.


Specific Problem Example

During machine setup, Set Up Test will be executed several hundred times because it reschedules itself waiting for other events to occur. In the code, the Class3 variable is incremented by one in each pass through Set Up Test. Since the Class3 variable is one byte, it can only contain a maximum value of 255 decimal. Thus, on every 256th pass through the Set Up Test code, the variable will overflow and have a zero value. That means that on every 256th pass through Set Up Test, the upper collimator will not be checked and an upper collimator fault will not be detected. The overexposure occurred when the operator hit the "set" button at the precise moment that Class3 rolled over to zero. Thus, Chkcol was not executed and F$mal was not set to indicate that the upper collimator was still in the field-light position.
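
The original routine was written in PDP-11 assembly language; the sketch below re-creates the same logic in C so the failure is easy to see. The names mirror those in the description (Class3, Chkcol, F$mal), but the surrounding structure is a simplified assumption:

    #include <stdbool.h>
    #include <stdint.h>

    static uint8_t class3;        /* one-byte pass counter, wraps from 255 to 0     */
    static bool    fmal;          /* F$mal: "treatment parameters are not correct"  */
    static bool    collimator_in_field_light_position = true;   /* unsafe state     */

    /* Chkcol: verify the upper collimator and set F$mal if it is wrong. */
    static void chkcol(void)
    {
        if (collimator_in_field_light_position)
            fmal = true;          /* flags the fault so treatment cannot proceed    */
    }

    /* One pass of Set Up Test, which reschedules itself hundreds of times. */
    static void set_up_test(void)
    {
        class3++;                 /* on every 256th pass this wraps to zero         */
        if (class3 != 0)          /* zero is treated as "no check required"...      */
            chkcol();             /* ...so the collimator check is silently skipped */
    }

    int main(void)
    {
        for (int pass = 1; pass <= 256; pass++) {
            fmal = false;
            set_up_test();
        }
        /* After pass 256, class3 has wrapped to 0: chkcol() was never called,
           so fmal is still false even though the collimator is in the unsafe
           field-light position. Pressing "set" at this moment lets treatment
           proceed with the fault undetected.                                   */
        return fmal ? 1 : 0;      /* returns 0: the fault went undetected          */
    }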


The following list of causal factors relates to the Therac-25 incidents discussed in this paper. These are the factors that led to the serious problems with the Therac-25 radiation therapy machine.

Causal Factors

Overconfidence in software
Confusing reliability with safety
Lack of defensive design
Failure to eliminate root causes
Complacency
Unrealistic risk assessments
Inadequate investigation or followup on accident reports
Inadequate software engineering practices
Poor software reuse
Safe versus friendly user interfaces
User and government oversight and standards

Links

For more information on the Therac-25, please consult these links. The list of causal factors is taken from Nancy Leveson's excellent paper, Medical Devices: The Therac-25, which is the first link on this list.

Medical Devices: The Therac-25