= ORBIT Reliability 2/2007 =

== Power Supplies ==

The power supplies in some ORBIT nodes are failing. Two power supply failure modes from regular operation have been identified. First, the power supply degrades to the point where the CM has enough power to report back to the CMC, but not enough power to reliably turn the node PC on or off. Although it is unclear, this first failure mode also seems to cause incorrect communication between the CM and the Node ID box. Second, the power supply degrades further, to the point where there is not even enough power to operate the CM at all. A node can operate in one of these failure modes for a while and then recover, so, for example, retrying the power-on operation might work on a node in the first failure mode. The power supplies seem to degrade over time rather than with how many times they are used in a particular way; we know this because nodes that are used more frequently, around (1, 1), do not fail any more often than other nodes. The only known remedy for a node with a failed power supply is to replace the power supply entirely, and it is presently unclear how best to do this. The power supplies in the nodes are not in a standard ATX form factor, and replacing a part in all 400 nodes of the grid is not a trivial undertaking. Currently, a small number of known-good power supplies is used to replace power supplies in nodes in either failure mode during weekly scheduled maintenance, if not sooner.

Once a node enters the first failure mode, the problem cascades into the software. The CMC receives regular watchdog messages from each CM, with which it makes decisions about node availability. In the first failure mode, the CM will report back to the CMC as if nothing is wrong. That is, you will see nodes listed as "available" on the status page, even when it is impossible for the CM to reliably turn the node on or off. The CMC in turn reports incorrect node availability to the NodeAgent and NodeHandler, which frustrates any attempt to run an experiment on every available node. Once the power supply has degraded into the second failure mode, the CMC stops getting watchdog messages, and can correctly mark the node as unavailable.

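As a rough illustration of why the cascade happens, the sketch below shows availability decided purely from recent watchdog traffic, which is how a node in the first failure mode stays listed as available. The class, timeout value, and method names are hypothetical, not the actual CMC code.

{{{
#!python
import time

WATCHDOG_TIMEOUT = 60  # seconds; placeholder value, not the real CMC setting

class AvailabilityTable:
    """Hypothetical sketch of watchdog-driven availability tracking."""

    def __init__(self):
        self.last_watchdog = {}  # node id -> time the last CM watchdog arrived

    def record_watchdog(self, node_id):
        self.last_watchdog[node_id] = time.time()

    def is_available(self, node_id):
        # Availability is inferred only from recent watchdog messages.  A CM
        # on a half-failed power supply keeps sending watchdogs, so its node
        # is reported "available" even though it cannot be turned on or off.
        last = self.last_watchdog.get(node_id)
        return last is not None and (time.time() - last) < WATCHDOG_TIMEOUT
}}}
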
== CM/CMC Software ==

We do not have enough evidence to be sure of this, but it seems that the CMC issuing UDP commands to CMs fails more often than expect scripts issuing equivalent telnet commands to CM consoles. Furthermore, the UDP commands seem to upset the internal state of the CM such that a reset makes future commands more reliable. There also exist error conditions in which the CM operates incorrectly, or freezes, such that issuing it a reset command does nothing; power must be interrupted to recover the CM from such a state. This is exceptionally bad for remote users, who cannot physically manipulate the grid to clear the error.

There is uncertainty associated with the development environment "Dynamic C". Dynamic C is not a mature compiler. Many language features a C programmer would expect have been left out or are subtly different. Dynamic C provides several different programming constructs for cooperative (or preemptive!) multitasking, and it is unclear whether or not the current CM code is using them correctly.

== Network Infrastructure ==

We regularly experience bugs in our network switches. Momentarily interrupting power to the switches often clears otherwise unidentifiable network errors. We strongly suspect that any strenuous utilization of the switches, such as would cause packets to be queued or discarded, makes their subsequent operation more error-prone. Additionally, we seem to lose one or two of our 27 Netgear switches every month, such that the switch becomes completely inoperable and must be sent back to Netgear for replacement. Higher quality switches are too expensive for us to obtain.

== Software Remedies ==

Rewriting the CMC as a properly threaded web service would prevent problems in failed CM software, as well as power supplies in the first failure mode described above, from cascading into the rest of the system. Changing the protocol between the CMC and CM to a stateful TCP-based protocol would make detection even quicker. Ultimately, failing power supplies must be replaced, and the CM code must be made more robust. Making CMs reset their nodes, rather than turn them on and off, can be used to extend the lifetime of the current grid. There is little we can do about the switches, but we can at least detect switch problems more quickly.

=== Threaded CMC ===

It is difficult to instrument the current CMC to compensate for a failed command telling a CM to turn its node on or off. One could imagine a CMC which checked the status of nodes after telling them to turn on, perhaps retrying if the first failure mode is detected. However, because the CM and the CMC communicate using a stateless, asynchronous protocol over UDP, and because the present implementation of the CMC is not threaded, it is impractical to determine whether status check results came from before or after a restart command was issued. Each interaction between the CMC and the CM would need to wait 20 to 40 seconds to be sure the reported status reflected the state after the command was issued. Because the present CMC implementation can only interact in this way with one node at a time, this mandatory wait time does not scale.

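A minimal sketch of what a threaded check-and-retry might look like, assuming hypothetical issue_on() and node_status() helpers and a placeholder 30-second settling time: because each node waits in its own thread, the total wall-clock cost stays near one settling period instead of one per node.

{{{
#!python
# Sketch only; the command and status helpers below are placeholders, not
# the real CMC/CM protocol.
from concurrent.futures import ThreadPoolExecutor
import time

SETTLE_SECONDS = 30  # placeholder, within the 20-40 second window above

def issue_on(node):
    """Hypothetical stand-in for the CMC sending a power-on command to a CM."""
    pass

def node_status(node):
    """Hypothetical stand-in for the CMC querying a CM for node status."""
    return "on"

def power_on_and_verify(node, retries=1):
    for _ in range(retries + 1):
        issue_on(node)
        time.sleep(SETTLE_SECONDS)   # wait so the status postdates the command
        if node_status(node) == "on":
            return True
    return False  # persistent failure, e.g. a first-failure-mode power supply

def power_on_all(nodes):
    # Each node absorbs its settling time in its own thread, so turning on
    # 400 nodes costs roughly SETTLE_SECONDS, not SETTLE_SECONDS * 400.
    with ThreadPoolExecutor(max_workers=64) as pool:
        return dict(zip(nodes, pool.map(power_on_and_verify, nodes)))
}}}
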
=== New CM ===

The CM is a relatively large program, and we do not have the resources to rewrite it all. However, a smaller feature set would not only make a rewrite possible, it would also reduce the amount of code. Less code gives the Dynamic C compiler less opportunity to err, and gives us less to maintain in the long run.

=== Switch Tools ===

We update the firmware in the switches as often as the vendor supplies changes, but this does not seem to make things better. Because the switches run closed source software on a closed hardware platform, there is nothing we can do to fix the problem directly. We are developing better tools for detecting when switch ports autonegotiate or otherwise enter unexpected states.

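One plausible shape for such a tool, assuming the switches answer standard IF-MIB queries over SNMP and the Net-SNMP snmpget utility is installed; the community string, host names, and expected speeds are placeholders, and this is not the actual tool under development.

{{{
#!python
# Sketch of a port-state checker using standard IF-MIB OIDs.
import subprocess

IF_OPER_STATUS = "1.3.6.1.2.1.2.2.1.8"  # IF-MIB ifOperStatus
IF_SPEED       = "1.3.6.1.2.1.2.2.1.5"  # IF-MIB ifSpeed (bits per second)

def snmp_get(host, oid, community="public"):
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Oqv", host, oid],
        capture_output=True, text=True, check=True)
    return out.stdout.strip()

def check_port(host, port, expected_speed_bps):
    status = snmp_get(host, IF_OPER_STATUS + "." + str(port))
    speed = snmp_get(host, IF_SPEED + "." + str(port))
    # Flag ports that are down or that renegotiated to an unexpected speed,
    # so they can be looked at before experiments are scheduled onto them.
    if status not in ("up", "1") or speed != str(expected_speed_bps):
        print("%s port %s: status=%s speed=%s" % (host, port, status, speed))
}}}
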
=== Reset to 'Off Image' ===

Even in the first failure mode of a power supply, a CM can reliably reset the node, causing it to reboot. The CMC could be modified to send reset commands in place of on and off commands. Additionally, the CMC could arrange for these reset commands to boot the node from the network, with the network boot image being a special 'off image' in the case of what would normally be an off command. The current software is careful to separate the job of selecting an image for a node into the NodeHandler and NodeAgent software, so this change would be a kludge.

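In outline, the kludge amounts to the following, where set_boot_image() and cm_reset() are hypothetical helpers and the image names are placeholders; image selection really belongs to the NodeHandler and NodeAgent, which is why this counts as a kludge.

{{{
#!python
# Sketch of on/off commands rewritten as reset-plus-image-selection.
OFF_IMAGE = "off-image"            # placeholder: special image that idles the node
DEFAULT_IMAGE = "baseline-image"   # placeholder: normal boot image

def set_boot_image(node, image):
    """Hypothetical: point the node's network-boot entry at an image."""
    pass

def cm_reset(node):
    """Hypothetical: ask the node's CM to reset (reboot) the node."""
    pass

def node_on(node):
    set_boot_image(node, DEFAULT_IMAGE)
    cm_reset(node)   # reset still works in the first power supply failure mode

def node_off(node):
    # "Off" becomes "reset into the off image" rather than a real power-off.
    set_boot_image(node, OFF_IMAGE)
    cm_reset(node)
}}}
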
Using just this kludge, the CM would always report the node as being on, and therefore it would be impossible to distinguish between a node being active or inactive in an experiment. The 'off image' would therefore be made to run an echo service on an obscure port number, and the CMC would need to be further modified to detect this in order to determine each node's activation state. Because it is the only software performing commands that could change the activation state, the CMC could instead keep a record of which nodes are active and which are not; however, this is a fragile arrangement: if the CMC failed for any reason, there would need to be something like the obscurely numbered echo port to rediscover what was going on.
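
A minimal sketch of the activation-state probe, assuming a hypothetical port number for the off image's echo service; if the obscure port echoes, the node is running the off image and is treated as inactive.

{{{
#!python
# Sketch only; the port number, probe string, and timeout are placeholders.
import socket

OFF_IMAGE_ECHO_PORT = 47806  # hypothetical obscure port served by the off image

def node_is_inactive(host, timeout=2.0):
    probe = b"orbit-off-image?"
    try:
        with socket.create_connection((host, OFF_IMAGE_ECHO_PORT), timeout) as s:
            s.settimeout(timeout)
            s.sendall(probe)
            return s.recv(len(probe)) == probe  # off image echoed: node inactive
    except OSError:
        return False  # no echo service: assume an experiment image is running
}}}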