Mars Pathfinder – What really happened?
Launched by NASA on December 4, 1996 aboard a Delta II booster, Mars Pathfinder landed on July 4, 1997 in Mars’s Ares Vallis, in a region called Chryse Planitia in the Oxia Palus quadrangle. The lander then opened, exposing the rover, which conducted many experiments on the Martian surface.
Mars Pathfinder landed to media fanfare and began to transmit data back to Earth. Days later, the flow of information and images was interrupted by a series of total system resets. The cause was a priority inversion, which made a critical task miss its deadline. A watchdog timer detected the miss, and the programmed response to such a fault was to reset the spacecraft.
How this problem was diagnosed and then resolved makes a fascinating tale for embedded engineers, and a valuable lesson as well.
Diagnosing the issue
The applications of Pathfinder were scheduled by the VxWorks RTOS. Since VxWorks provides pre-emptive priority scheduling of threads, tasks were executed as threads with priorities determined by their relative urgency.
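The meteorological data gathering task ran as an infrequent, low priority thread and used the information bus, access to which was synchronized with semaphores. Other higher priority threads took precedence when necessary, including a very high priority bus management task, which also accessed the bus by acquiring the semaphore. Unfortunately, in this case the bus management task blocked waiting for the semaphore while the meteorological task still held it. A long-running communications task, with higher priority than the meteorological task but lower than the bus management task, then pre-empted the meteorological task. The meteorological task therefore could not run to release the semaphore, and the bus management task stayed blocked.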
Soon, a watchdog timer noticed that the bus management task had not been executed for some time, concluded that something had gone wrong, and ordered a total system reset. (Engineers later confessed that system resets had occurred during pre-flight tests. They put these down to a hardware glitch and returned to focusing on the mission-critical landing software.)
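The watchdog behavior described above can be sketched in a few lines. This is only an illustration, not the Pathfinder implementation: the function name, the tick-based timing and the timeout value are all assumptions made for the example.

```python
def first_reset_tick(ran_each_tick, timeout_ticks):
    """Return the tick at which a watchdog would order a reset, or None.

    ran_each_tick: one boolean per tick, True if the monitored task ran.
    timeout_ticks: hypothetical limit on consecutive ticks without a run.
    """
    missed = 0  # consecutive ticks in which the monitored task did not run
    for tick, ran in enumerate(ran_each_tick):
        missed = 0 if ran else missed + 1
        if missed >= timeout_ticks:
            return tick  # the watchdog gives up and resets the system
    return None


# A task that runs every third tick never trips a 3-tick watchdog...
healthy = [True, False, False, True, False, False, True]
# ...but a task that is starved after its first run does.
starved = [True, False, False, False, False]

print(first_reset_tick(healthy, 3))  # None
print(first_reset_tick(starved, 3))  # 3
```

The watchdog knows nothing about why the task failed to run. It only sees the symptom, which is why the resets alone gave so little diagnostic information.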
Finding a solution
Engineers worked frantically on a lab replica to diagnose and fix the problem, eventually spotting a priority inversion. A priority inversion occurs when a high priority task is indirectly pre-empted by a medium priority task, “inverting” the relative priorities of the two tasks (see Figure 1). This violates the priority model, under which a high priority task can be kept from running only by higher priority tasks and, briefly, by a low priority task that will quickly finish using a resource the two share.
To fix the problem, they turned on a boolean flag that tells the semaphore whether to apply priority inheritance to the low priority task holding it. The semaphore in question had been initialized with the flag off (the default setting); had it been on, the priority inversion would have been prevented.
Under priority inheritance, the task that holds the semaphore inherits the priority of a higher priority task when the higher priority task requests the semaphore. In Figure 1, task “low” would inherit the priority of task “high” when that task requested the semaphore. This allows “low” to pre-empt “medium”.
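The effect of the flag can be illustrated with a toy simulation. The sketch below is neither flight code nor the VxWorks API. It is a minimal tick-based model of a fixed-priority pre-emptive scheduler with one mutex, and every name and timing in it (Task, simulate, scenario, the step strings) is a hypothetical choice made for the example. The inherit argument plays the role of the boolean flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    name: str
    priority: int      # larger number = more urgent
    release: int       # tick at which the task becomes ready
    steps: str         # one char per tick: "C" needs the mutex, "W" does not
    pc: int = 0        # index of the next step to execute
    finished_at: Optional[int] = None


def simulate(tasks, inherit, max_ticks=50):
    """Schedule tasks that share one mutex; return (timeline, finish ticks).

    Simplification: a ready task whose next step is "C" counts as waiting
    for the mutex whenever another task holds it.
    """
    owner = None       # task currently holding the mutex, if any
    timeline = []

    for tick in range(max_ticks):
        if all(t.finished_at is not None for t in tasks):
            break
        ready = [t for t in tasks if t.release <= tick and t.finished_at is None]

        def blocked(t):
            return t.steps[t.pc] == "C" and owner is not None and owner is not t

        waiters = [t for t in ready if blocked(t)]
        runnable = [t for t in ready if not blocked(t)]
        if not runnable:
            timeline.append("-")  # idle tick
            continue

        def effective_priority(t):
            # With inheritance, the mutex owner runs at the priority of its
            # most urgent waiter.
            if inherit and t is owner and waiters:
                return max(t.priority, max(w.priority for w in waiters))
            return t.priority

        current = max(runnable, key=effective_priority)
        if current.steps[current.pc] == "C":
            owner = current       # take (or keep) the mutex
        current.pc += 1
        done = current.pc == len(current.steps)
        if owner is current and (done or current.steps[current.pc] != "C"):
            owner = None          # critical section over, release the mutex
        if done:
            current.finished_at = tick + 1
        timeline.append(current.name[0].upper())

    return "".join(timeline), {t.name: t.finished_at for t in tasks}


def scenario():
    # Assumed timings, loosely shaped like the "low", "medium" and "high" tasks.
    return [
        Task("low", 1, release=0, steps="CCC"),
        Task("high", 3, release=1, steps="CW"),
        Task("medium", 2, release=2, steps="WWWWW"),
    ]


for inherit in (False, True):
    timeline, finish = simulate(scenario(), inherit)
    print(f"inherit={inherit}: {timeline}  high finishes at tick {finish['high']}")
```

With these assumed timings, the run without inheritance produces the timeline LLMMMMMLHH and “high” finishes at tick 10, because “medium” runs to completion while “low” still holds the mutex. With inheritance the timeline becomes LLLHHMMMMM and “high” finishes at tick 5: “low” is boosted, leaves its critical section, and “medium” has to wait.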
The initialization parameter for the semaphore that caused the problem was stored in a global variable. Because VxWorks contains a C language interpreter, intended to let developers type in C expressions and functions to be executed during system debugging, it was possible to upload a short C program to the spacecraft which, when interpreted, changed the value of this variable from FALSE to TRUE. This put an end to the system resets.
What did we learn?
- Only detailed traces of actual system behavior enabled the faulty execution sequence to be captured and identified. A black box diagnosis without traces would have been impossible.
- The presence of “debugging” facilities in the system was extremely important. The problem could not have been corrected without the ability to modify the system.
- Spending extra time at the testing stage to make sure priority inheritance was applied correctly, even at some additional performance cost, would have been invaluable.
The problem was identified before this incident
When this story was told in a conference keynote, the speaker referred to the paper that first identified the priority inversion problem and proposed the solution. Then something extraordinary happened: the authors were all in the room, and they received a rapturous reception. The original paper was:
L. Sha, R. Rajkumar, and J. P. Lehoczky. Priority Inheritance Protocols: An Approach to Real-Time Synchronization. In IEEE Transactions on Computers, vol. 39, pp. 1175-1185, Sep. 1990.