RE: loom25 being used as a crash box

Brian Bershad (bershad@cs.washington.edu)
Wed, 28 Oct 1998 12:32:01 -0800

For background, the problem being solved here is:

- Porcupine One is now so reliable that most failures are kernel
  failures.
- There is presently no way to monitor and recover from kernel
  failures.
- Yasushi is building a Failure Recovery Service that runs on a
  single node:
    - When a porc node suspects that another porc node has gone
      down, it calls the failure recovery service and says
      "check it out!"
    - The failure recovery service pings the porc app on the
      suspect node:
        - if the porc app is there and running, it reports a
          false positive;
        - if the porc app is not there but the node is up, the
          porc app is restarted;
        - if the node is down or hung, the node is restarted.
    - At the end of the failure recovery, a mail message goes out
      recording what happened.
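
The checks above amount to a small decision procedure, sketched below in
Python. This is only an illustration, not Yasushi's implementation: the
names `recover` and `Outcome` and all of the probe, restart, and mail
callables are hypothetical, and how the real service pings a node or
power-cycles it is not described here.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    FALSE_POSITIVE = "false positive"
    APP_RESTARTED = "porc app restarted"
    NODE_RESTARTED = "node restarted"


def recover(node: str,
            app_alive: Callable[[str], bool],
            node_alive: Callable[[str], bool],
            restart_app: Callable[[str], None],
            restart_node: Callable[[str], None],
            send_mail: Callable[[str], None]) -> Outcome:
    """Handle one "check it out!" request for a suspect node.

    All callables are assumed stand-ins for whatever the real
    service uses to probe, restart, and report.
    """
    if app_alive(node):
        # The porc app answered the ping: the suspicion was wrong.
        outcome = Outcome.FALSE_POSITIVE
    elif node_alive(node):
        # The node is up but the porc app is gone: restart the app only.
        restart_app(node)
        outcome = Outcome.APP_RESTARTED
    else:
        # The node is down or hung: restart the whole node.
        restart_node(node)
        outcome = Outcome.NODE_RESTARTED
    # Every recovery attempt ends with a mail message recording the result.
    send_mail(f"{node}: {outcome.value}")
    return outcome
```

Passing the probes and actions in as callables keeps the decision logic
separate from the mechanism, so the three cases can be exercised with
stubs without rebooting a real machine.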

-----Original Message-----
From: yasushi@yasushi-pc [mailto:yasushi@yasushi-pc]
Sent: Wednesday, October 28, 1998 12:02 PM
To: porcupine@yasushi-pc; syn@yasushi-pc; spin-m3@yasushi-pc
Subject: loom25 being used as a crash box

For the next couple of days, I will be testing a watchdog mechanism on
loom25. This means loom25 will go through many involuntary reboots and
power cycles.

yaz