The Mars Pathfinder mission, while widely proclaimed to be flawless, actually had a small bug in the scheduling code which handled tasks while the craft was operating on the surface of Mars. This bug caused intermittent computer resets, each of which resulted in the loss of data.

The Pathfinder computer handled three tasks: a high priority bus management thread which was responsible for moving data on and off the information bus (basically a shared memory area), a medium priority communications thread, and a low priority thread responsible for gathering meteorological data and publishing it to the information bus.

The low priority meteorological thread would occasionally lock a mutex (mutual exclusion lock) protecting access to the information bus so that it could publish its data. The reset bug was the result of a stall that occasionally occurred when the communications task was scheduled during the short interval in which the meteorological thread held the information bus lock and the high priority bus management thread was blocked waiting for it. Since the communications task was long running and had a higher priority than the meteorological task, it prevented the low priority task from running. But that task still held the mutex, so the bus management thread could not complete its work. After the bus management task hadn't executed for a given period of time, a watchdog timer would go off and reset the computer.

The Pathfinder computer was a victim of priority inversion. Priority inversion occurs when the execution of a high priority task is prevented by a lower priority task. In Pathfinder's case, the low priority meteorological task blocked the high priority bus management task by holding the mutex to the information bus, while the medium priority communications task kept the meteorological task from running and releasing that mutex.
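
The scenario can be sketched as a toy single-CPU simulation. This is only an illustration, not the VxWorks scheduler or the Pathfinder flight code: the task names, tick counts, and WATCHDOG_LIMIT are made-up values, and the inherit flag models priority inheritance, a standard remedy in which the task holding a mutex temporarily runs at the priority of the most urgent task waiting for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    priority: int     # larger number = more urgent
    release: int      # tick at which the task becomes ready
    work: int         # ticks of CPU time needed
    needs_bus: bool   # True if the task holds the bus mutex for all of its work


def simulate(tasks, inherit, max_ticks=100):
    """Return {task name: tick at which it finished} on a single CPU."""
    remaining = {t.name: t.work for t in tasks}
    finished = {}
    owner = None  # task currently holding the bus mutex

    for tick in range(max_ticks):
        ready = [t for t in tasks if t.release <= tick and remaining[t.name] > 0]
        # A task that needs the bus is blocked while another task owns the mutex.
        waiting = [t for t in ready if t.needs_bus and owner not in (None, t)]
        runnable = [t for t in ready if t not in waiting]
        if not runnable:
            continue

        def effective_priority(t):
            # Priority inheritance: the mutex owner borrows the priority
            # of the most urgent task waiting on that mutex.
            if inherit and t is owner and waiting:
                return max(t.priority, max(w.priority for w in waiting))
            return t.priority

        current = max(runnable, key=effective_priority)
        if current.needs_bus and owner is None:
            owner = current  # lock the mutex
        remaining[current.name] -= 1
        if remaining[current.name] == 0:
            finished[current.name] = tick + 1
            if owner is current:
                owner = None  # unlock the mutex
    return finished


# Hypothetical workload: all numbers are illustrative, not mission values.
TASKS = [
    Task("weather", priority=1, release=0, work=3, needs_bus=True),
    Task("bus_mgmt", priority=3, release=1, work=2, needs_bus=True),
    Task("comms", priority=2, release=2, work=10, needs_bus=False),
]
WATCHDOG_LIMIT = 8  # hypothetical: max ticks bus_mgmt may take after release


def bus_latency(inherit):
    """Ticks between bus_mgmt becoming ready and bus_mgmt finishing."""
    return simulate(TASKS, inherit)["bus_mgmt"] - TASKS[1].release


for inherit in (False, True):
    latency = bus_latency(inherit)
    outcome = "watchdog reset" if latency > WATCHDOG_LIMIT else "ok"
    print(f"inheritance={inherit}: bus_mgmt latency {latency} ticks, {outcome}")
```

With these made-up numbers, the run without inheritance leaves bus_mgmt waiting 14 ticks while comms runs to completion, which exceeds the hypothetical watchdog limit. With inheritance, weather is boosted above comms, releases the mutex promptly, and bus_mgmt finishes 4 ticks after becoming ready.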

The problem was identified and fixed by JPL engineers who worked with a computer identical to the one on Pathfinder.


Most of this information comes from various accounts of a talk given by David Wilner, Chief Technical Officer of Wind River Systems. Wind River's VxWorks was the RTOS which ran on the Pathfinder computer.

http://catless.ncl.ac.uk/Risks/19.49.html
