FW: [InternetInterest@cats.ucsc.edu: What really happened on Mars

Dan Kohn (dan@teledesic.com)
Tue, 9 Dec 1997 10:46:42 -0800


> -----Forwarded message from InternetInterest@cats.ucsc.edu-----
>
> Reply-To: booloo@cats.ucsc.edu
>
>
> >From: Peter Langston <psl@langston.com>
> Date: Mon, 8 Dec 97 13:49:57 -0800
> To: Fun_People@langston.com
>
> [This one is rated CG-13 -- there are some mildly technical parts (so
> people
> over 13 might require child guidance) but they aren't necessary to an
>
> understanding of the article ... -psl]
>
> Forwarded-by: Nev Dull <nev@bostic.com>
> >From: Mike Jones <mbj@MICROSOFT.com>
>
> The Mars Pathfinder mission was widely proclaimed as "flawless" in the
> early
> days after its July 4th, 1997 landing on the Martian surface.
> Successes
> included its unconventional "landing" -- bouncing onto the Martian
> surface
> surrounded by airbags, deploying the Sojourner rover, and gathering
> and
> transmitting voluminous data back to Earth, including the panoramic
> pictures
> that were such a hit on the Web. But a few days into the mission, not
> long
> after Pathfinder started gathering meteorological data, the spacecraft
> began
> experiencing total system resets, each resulting in losses of data.
> The
> press reported these failures in terms such as "software glitches" and
> "the
> computer was trying to do too many things at once".
>
> This week at the IEEE Real-Time Systems Symposium I heard a
> fascinating
> keynote address by David Wilner, Chief Technical Officer of Wind River
> Systems. Wind River makes VxWorks, the real-time embedded systems
> kernel
> that was used in the Mars Pathfinder mission. In his talk, he
> explained in
> detail the actual software problems that caused the total system
> resets of
> the Pathfinder spacecraft, how they were diagnosed, and how they were
> solved. I wanted to share his story with each of you.
>
> VxWorks provides preemptive priority scheduling of threads. Tasks on
> the
> Pathfinder spacecraft were executed as threads with priorities that
> were
> assigned in the usual manner reflecting the relative urgency of these
> tasks.
>
> Pathfinder contained an "information bus", which you can think of as a
> shared memory area used for passing information between different
> components
> of the spacecraft. A bus management task ran frequently with high
> priority
> to move certain kinds of data in and out of the information bus.
> Access to
> the bus was synchronized with mutual exclusion locks (mutexes).
>
> The meteorological data gathering task ran as an infrequent, low
> priority
> thread, and used the information bus to publish its data. When
> publishing
> its data, it would acquire a mutex, do writes to the bus, and release
> the
> mutex. If an interrupt caused the information bus thread to be
> scheduled
> while this mutex was held, and if the information bus thread then
> attempted
> to acquire this same mutex in order to retrieve published data, this
> would
> cause it to block on the mutex, waiting until the meteorological
> thread
> released the mutex before it could continue. The spacecraft also
> contained
> a communications task that ran with medium priority.
>
> Most of the time this combination worked fine. However, very
> infrequently
> it was possible for an interrupt to occur that caused the (medium
> priority)
> communications task to be scheduled during the short interval while
> the
> (high priority) information bus thread was blocked waiting for the
> (low
> priority) meteorological data thread. In this case, the long-running
> communications task, having higher priority than the meteorological
> task,
> would prevent it from running, consequently preventing the blocked
> information bus task from running. After some time had passed, a
> watchdog
> timer would go off, notice that the data bus task had not been
> executed for
> some time, conclude that something had gone drastically wrong, and
> initiate
> a total system reset.
>
> This scenario is a classic case of priority inversion.
>
> HOW WAS THIS DEBUGGED?
>
> VxWorks can be run in a mode where it records a total trace of all
> interesting system events, including context switches, uses of
> synchronization objects, and interrupts. After the failure, JPL
> engineers
> spent hours and hours running the system on the exact spacecraft
> replica in
> their lab with tracing turned on, attempting to replicate the precise
> conditions under which they believed that the reset occurred. Early
> in the
> morning, after all but one engineer had gone home, the engineer
> finally
> reproduced a system reset on the replica. Analysis of the trace
> revealed
> the priority inversion.
>
> HOW WAS THE PROBLEM CORRECTED?
>
> When created, a VxWorks mutex object accepts a boolean parameter that
> indicates whether priority inheritance should be performed by the
> mutex.
> The mutex in question had been initialized with the parameter off; had
> it
> been on, the low-priority meteorological thread would have inherited
> the
> priority of the high-priority data bus thread blocked on it while it
> held
> the mutex, causing it be scheduled with higher priority than the
> medium-priority communications task, thus preventing the priority
> inversion.
> Once diagnosed, it was clear to the JPL engineers that using priority
> inheritance would prevent the resets they were seeing.
>
> VxWorks contains a C language interpreter intended to allow developers
> to
> type in C expressions and functions to be executed on the fly during
> system
> debugging. The JPL engineers fortuitously decided to launch the
> spacecraft
> with this feature still enabled. By coding convention, the
> initialization
> parameter for the mutex in question (and those for two others which
> could
> have caused the same problem) were stored in global variables, whose
> addresses were in symbol tables also included in the launch software,
> and
> available to the C interpreter. A short C program was uploaded to the
> spacecraft, which when interpreted, changed the values of these
> variables
> from FALSE to TRUE. No more system resets occurred.
>
> ANALYSIS AND LESSONS
>
> First and foremost, diagnosing this problem as a black box would have
> been
> impossible. Only detailed traces of actual system behavior enabled
> the
> faulty execution sequence to be captured and identified.
>
> Secondly, leaving the "debugging" facilities in the system saved the
> day.
> Without the ability to modify the system in the field, the problem
> could
> not have been corrected.
>
> Finally, the engineer's initial analysis that "the data bus task
> executes
> very frequently and is time-critical -- we shouldn't spend the extra
> time
> in it to perform priority inheritance" was exactly wrong. It is
> precisely
> in such time critical and important situations where correctness is
> essential, even at some additional performance cost.
>
> HUMAN NATURE, DEADLINE PRESSURES
>
> David told us that the JPL engineers later confessed that one or two
> system
> resets had occurred in their months of pre-flight testing. They had
> never
> been reproducible or explainable, and so the engineers, in a very
> human-nature response of denial, decided that they probably weren't
> important, using the rationale "it was probably caused by a hardware
> glitch".
>
> Part of it too was the engineers' focus. They were extremely focused
> on
> ensuring the quality and flawless operation of the landing software.
> Should
> it have failed, the mission would have been lost. It is entirely
> understandable for the engineers to discount occasional glitches in
> the
> less-critical land-mission software, particularly given that a
> spacecraft
> reset was a viable recovery strategy at that phase of the mission.
>
> THE IMPORTANCE OF GOOD THEORY/ALGORITHMS
>
> David also said that some of the real heroes of the situation were
> some
> people from CMU who had published a paper he'd heard presented many
> years
> ago who first identified the priority inversion problem and proposed
> the
> solution. He apologized for not remembering the precise details of
> the
> paper or who wrote it. Bringing things full circle, it turns out that
> the
> three authors of this result were all in the room, and at the end of
> the
> talk were encouraged by the program chair to stand and be
> acknowledged.
> They were Lui Sha, John Lehoczky, and Raj Rajkumar. When was the last
> time
> you saw a room of people cheer a group of computer science theorists
> for
> their significant practical contribution to advancing human knowledge?
> :-)
> It was quite a moment.
>
> POSTLUDE
>
> For the record, the paper was:
>
> L. Sha, R. Rajkumar, and J. P. Lehoczky. Priority Inheritance
> Protocols: An
> Approach to Real-Time Synchronization. In IEEE Transactions on
> Computers,
> vol. 39, pp. 1175-1185, Sep. 1990.
>
> -----End of forwarded message-----