From the readings, what were the root causes of the Therac-25 accidents? What are the challenges for software developers working on safety-critical systems, how should they approach these projects, and should they be held liable when accidents happen?

Fatal Flaws

Software bugs often seem insignificant in comparison to flaws in mechanical or electronic systems. It's hard to imagine comparing the design flaws of a bridge that collapses with the design flaws of a software program that can be exploited to misuse a system. However, the Therac-25 accidents demonstrate that flaws in software design are often harder to pinpoint and recognize, yet can be just as dangerous. Although the Therac-25 accidents may appear to be extreme cases, after reading the articles I recognized a few similar situations from my internship, where I worked on the management of a large control system.

One of the key things that made the Therac-25 accidents appear to be blatant negligence was the apparent lack of testing performed on the system and the engineers' inability to replicate the bug to determine the root cause of the overexposures. The bug was a race condition between two different agents that were both attempting to change the position of the magnetic spreading plates (which directed the radiation between the machine's two modes). If both agents tried to change the positions of the plates at the same time, the system could no longer determine where the plates were and could not control the dose of radiation properly. This type of error is difficult to pinpoint because it depends on the timing of the different agents: it could only be replicated through the fast entry of a specific sequence of commands, so that both agents were trying to access the same item at the same time.
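To make the lost-update pattern concrete, here is a minimal sketch of that kind of race, written in Python rather than the Therac's actual assembly, with all names (plate_position, move_plate) invented for illustration. Two "agents" share one position variable and each performs a non-atomic read-then-write; a short sleep widens the race window so the interleaving is visible.

```python
import threading
import time

# Illustrative sketch only, not Therac-25 code: two concurrent agents
# update a shared plate position with a non-atomic read-modify-write.
plate_position = 0

def move_plate(steps):
    """Advance the plate `steps` increments, one at a time."""
    global plate_position
    for _ in range(steps):
        current = plate_position      # read shared state
        time.sleep(0.0001)            # widen the race window for the demo
        plate_position = current + 1  # write back, possibly clobbering
                                      # the other agent's update

a = threading.Thread(target=move_plate, args=(50,))
b = threading.Thread(target=move_plate, args=(50,))
a.start(); b.start()
a.join(); b.join()

# Both agents issued 50 moves, so the position "should" be 100, but
# interleaved reads and writes lose updates and it usually ends up short.
print(plate_position)
```

Because the outcome depends entirely on thread scheduling, running this twice rarely prints the same number, which mirrors why the Therac engineers could not reproduce the failure on demand. Wrapping the read-modify-write in a `threading.Lock` makes the result deterministic again.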
From an outside perspective, this looks like an easy bug to catch with automated testing and an easy flaw to recognize once it occurred. However, after working with mutexes in my OS course, my experience has taught me that race conditions are notoriously difficult to detect or even to replicate. I do agree that testing should have been built into the design and production process, rather than assuming that the software from the previous version was effective even without the hardware controls. Still, it's hard to say concretely that testing alone would have prevented the bug from becoming an issue; the fact that the engineers were unable to replicate the bug in a controlled environment, even after reports of overexposure, demonstrates this. The bug actually existed in the code of the earlier Therac-20 systems as well, but the Therac-20 had hardware controls that blew a fuse whenever the magnetic plates were out of position. Those hardware controls were eliminated in the newer Therac-25, so the flaw in the code was no longer masked by them. I believe that the removal of these controls, combined with the assumption that the code was already tested, were the largest mistakes leading to these accidents. Knowing that the hardware controls protected against errors through mechanical interfaces, removing them required assuming that the software alone protected against dangerous overexposure situations. In other words, the hardware safety controls could be removed only if they provided no additional protection; since the software alone did not prevent dangerous overexposure, the hardware controls were still necessary.
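The Therac-20's fuse can be thought of as an independent check that never trusts the software's belief about the plates. The following is a hedged sketch of that idea in Python, an analogy rather than the real mechanism, with every name (fire_beam, InterlockTripped, the mode strings) invented: the interlock compares the commanded mode against a separately measured plate state before allowing the beam.

```python
# Illustrative analogy, not actual Therac code: an independent interlock,
# like the Therac-20 fuse, vetoes the beam unless the plates measured in
# hardware match the commanded mode. All names here are hypothetical.

class InterlockTripped(Exception):
    """Raised when the independent check vetoes the beam."""

def fire_beam(commanded_mode, measured_plate_mode):
    # The interlock does not trust the software's idea of where the
    # plates are; it compares against an independent measurement.
    if commanded_mode != measured_plate_mode:
        raise InterlockTripped(
            f"plates set for {measured_plate_mode!r}, "
            f"not {commanded_mode!r}")
    return f"beam fired in {commanded_mode} mode"

print(fire_beam("xray", "xray"))    # plates agree: beam allowed
try:
    fire_beam("electron", "xray")   # a race left the plates wrong: vetoed
except InterlockTripped as err:
    print("interlock tripped:", err)
```

The design point is defense in depth: the check succeeds or fails based on measured state, so even a software race that leaves the plates in the wrong position cannot fire the beam. Removing this layer means the software's internal bookkeeping becomes the only safeguard.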
In my internship I experienced scenarios where automated testing of the software alone would indicate that the software was correct, even when it did not produce the proper hardware output in the system. Automated testing of software alone is not enough to ensure that the system and its hardware are functioning properly. If a company produces a product with a flaw in it, that company becomes responsible for its design and the flaws associated with it. Just as a car manufacturer is forced to issue a recall when there is a flaw in its design, a company is responsible to its shareholders to disclose the flaws in its design and take responsibility for the product.
Kaitlyn