= ORBIT Reliability 2/2007 =

== Power Supplies ==

The power supplies in some ORBIT nodes are failing.  Two power supply
failure modes from regular operation have been identified.  First, the
power supply degrades to the point where the CM has enough power to
report back to the CMC, but not enough power to reliably turn the node
PC on or off.  This first failure mode also appears to cause incorrect
communication between the CM and the Node ID box, though we are not
certain of that.  Second, the power supply degrades further, to the
point where there is not enough power to operate the CM at all.  It is
possible for a node to operate in one of these failure modes for a
while and then come back; for example, retrying the power-on operation
might succeed on a node in the first failure mode.  The power supplies
appear to degrade with age rather than with how many times they are
used in a particular way.  We know this because nodes that are used
more frequently, such as those around (1, 1), do not fail any more
frequently than other nodes.  The only known remedy for nodes with
failed power supplies is to replace the power supply entirely.  It is
presently unclear how best to do this.  The power supplies in the
nodes are not a regular ATX form factor, and replacing a part in all
400 nodes of the grid is not a trivial undertaking.  Currently, a
small number of known-good power supplies is used to replace power
supplies in nodes in either failure mode during weekly scheduled
maintenance, if not sooner.

Once a node enters the first failure mode, the problem cascades into
the software.  The CMC receives regular watchdog messages from each
CM, with which it makes decisions about node availability.  In the
first failure mode, the CM will report back to the CMC as if nothing
is wrong.  That is, you will see nodes listed as "available" on the
status page, even when it is impossible for the CM to reliably turn
the node on or off.  The CMC in turn reports incorrect node
availability to the NodeAgent and NodeHandler, which frustrates any
attempt to run an experiment on every available node.  Once the power
supply has degraded into the second failure mode, the CMC stops
getting watchdog messages, and can correctly mark the node as
unavailable.
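
The availability decision at the CMC is, in effect, a timeout on
watchdog messages.  The Python sketch below shows that idea in
miniature; the port number, payload format, and timeout are
assumptions for illustration, not the actual CMC protocol, and as
noted above a timeout like this only catches the second failure mode.

{{{#!python
# Sketch: mark nodes unavailable when their CM watchdog goes quiet.
# The port, payload format, and timeout are illustrative assumptions.
import socket
import time

WATCHDOG_PORT = 9030          # assumed UDP port for watchdog messages
TIMEOUT = 60.0                # seconds of silence before "unavailable"

last_seen = {}                # node id -> time of last watchdog
reported = {}                 # node id -> last state we reported

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", WATCHDOG_PORT))
sock.settimeout(1.0)

while True:
    try:
        data, (addr, _) = sock.recvfrom(1024)
        # Assume the payload names the sending node; fall back to the
        # source address if it does not.
        node = data.decode(errors="replace").strip() or addr
        last_seen[node] = time.time()
    except socket.timeout:
        pass
    now = time.time()
    for node, seen in last_seen.items():
        state = "available" if now - seen < TIMEOUT else "unavailable"
        if reported.get(node) != state:
            # A real CMC would pass this to the NodeHandler/NodeAgent.
            print(node, state)
            reported[node] = state
}}}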

== CM/CMC Software ==

We do not have enough evidence to be sure of this, but it seems that
the CMC issuing UDP commands to CMs fails more often than Expect
scripts issuing equivalent telnet commands to CM consoles.
Furthermore, the UDP commands seem to upset the internal state of the
CM, such that a reset makes future commands more reliable.  There also
exist error conditions in which the CM operates incorrectly, or
freezes, such that issuing it a reset command does nothing; power must
be interrupted to recover the CM from such a state.  This is
exceptionally bad for remote users, who cannot physically manipulate
the grid to clear the error.
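
The difference between the two command paths is easiest to see side by
side.  The sketch below contrasts a fire-and-forget UDP command with a
console command sent over TCP, which is roughly what an Expect script
drives; the hostname, ports, and command string are assumptions, not
the actual CM interface.

{{{#!python
# Sketch: two ways of issuing the same (hypothetical) command to a CM.
# Hostname, ports, and command strings are assumptions for illustration.
import socket

CM_HOST = "node1-1-cm"        # assumed CM hostname
UDP_PORT = 9029               # assumed UDP command port
CONSOLE_PORT = 23             # CM telnet console

def send_udp_command(cmd):
    """Fire-and-forget datagram, as the current CMC does."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(cmd.encode() + b"\n", (CM_HOST, UDP_PORT))
    # No reply is read: the CMC cannot tell whether the CM acted on it.

def send_console_command(cmd):
    """Send the same command over the CM console and read back output,
    roughly what an Expect script over telnet does."""
    with socket.create_connection((CM_HOST, CONSOLE_PORT), timeout=5) as s:
        s.settimeout(5)
        s.sendall(cmd.encode() + b"\r\n")
        return s.recv(1024).decode(errors="replace")
}}}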

There is uncertainty associated with the development environment
"Dynamic C".  Dynamic C is not a mature compiler.  Many language
features a C programmer would expect have been left out or are subtly
different.  Dynamic C provides several different programming
constructs for cooperative (or preemptive!) multitasking, and it is
unclear whether or not the current CM code is using them correctly.

== Network Infrastructure ==

We regularly experience bugs in our network switches.  Momentarily
interrupting the power of the switches often clears otherwise
unidentifiable network errors.  We strongly suspect that any strenuous
utilization of the switches, such as would cause packets to be queued
or discarded, makes future operation of the switches more error-prone.
Additionally, we seem to lose one or two of our 27 Netgear switches
every month, such that the switch becomes completely inoperable and
must be sent back to Netgear for replacement.  Higher quality switches
are too expensive for us to obtain.

== Software Remedies ==

Rewriting the CMC as a properly threaded web service would prevent
problems in failed CM software, as well as power supplies in the first
failure mode described above, from cascading into the rest of the
system.  Changing the protocol between the CMC and CM to a stateful,
TCP-based protocol would make detection even quicker.  Ultimately,
failing power supplies must be replaced, and the CM code must be made
more robust.  Making CMs reset their nodes, rather than turn them on
and off, can extend the lifetime of the current grid.  There is little
we can do about the switches, but we can at least detect switch
problems more quickly.

=== Threaded CMC ===

It is difficult to instrument the current CMC to compensate for any
failure in a command to a CM to turn the node on or off.  One could
imagine a CMC which checked the status of nodes after telling them to
turn on, perhaps retrying if the first failure mode is detected.
However, because the CM and the CMC communicate using a stateless,
asynchronous protocol over UDP, and because the present implementation
of the CMC is not threaded, it is impractical to determine whether
status check results came from before or after a restart command was
issued.  Each interaction between the CMC and a CM would need to wait
20 to 40 seconds to be sure that the status being reported reflects
the state after the command was issued.  Because the present CMC
implementation can only interact in this way with one node at a time,
this mandatory wait time does not scale.
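
The scaling problem is easier to see with a concrete sketch.  The
Python fragment below shows the shape of a threaded CMC: each node
still pays the 20 to 40 second settling delay before its status can be
trusted, but the delays overlap instead of adding up across 400 nodes.
The helper functions and retry policy are assumptions for
illustration, not the existing CMC code.

{{{#!python
# Sketch: overlap the per-node settling delay with a thread pool.
# send_command() and check_status() are stand-ins for the real
# CMC/CM exchange; they are assumptions, not the existing code.
import time
from concurrent.futures import ThreadPoolExecutor

SETTLE_SECONDS = 40        # worst-case wait before status is trustworthy

def send_command(node, cmd):
    pass                   # would send the command to the node's CM

def check_status(node):
    return "on"            # would query the CM for the node's real state

def power_on_and_verify(node, retries=1):
    """Turn a node on, wait out the settling period, then verify."""
    for _ in range(retries + 1):
        send_command(node, "on")
        time.sleep(SETTLE_SECONDS)     # status before this is ambiguous
        if check_status(node) == "on":
            return node, True
    return node, False                 # likely the first failure mode

nodes = [(x, y) for x in range(1, 21) for y in range(1, 21)]  # 400 nodes
with ThreadPoolExecutor(max_workers=50) as pool:
    results = dict(pool.map(power_on_and_verify, nodes))
}}}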

=== New CM ===

The CM is a relatively large program, and we do not have the resources
to rewrite it all.  However, a smaller feature set would not only make
a rewrite possible, it would reduce the amount of code.  Less code
gives the Dynamic C compiler less opportunity to err, and gives us
less to maintain in the long run.

=== Switch Tools ===

We update the firmware in the switches as often as the vendor supplies
changes, but this does not seem to make things better.  Because the
software on the switches is closed source on a closed hardware
platform, there is nothing we can do to directly fix the problem.  We
are developing better tools for detecting when switch ports
autonegotiate or otherwise enter unexpected states.
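
One inexpensive form of such a tool is to poll each switch port's
operational status and log any change.  The sketch below does this
with the Net-SNMP command line tools; the switch hostnames, community
string, port count, and polling interval are assumptions, and the
tools we are actually developing may work quite differently.

{{{#!python
# Sketch: poll ifOperStatus on every switch port and log transitions.
# Switch names, SNMP community, and interval are illustrative only.
import subprocess
import time

SWITCHES = ["sw-row1", "sw-row2"]        # hypothetical switch hostnames
COMMUNITY = "public"                     # assumed SNMP community string
PORTS = range(1, 25)                     # assume 24-port switches
IF_OPER_STATUS = "1.3.6.1.2.1.2.2.1.8"   # IF-MIB::ifOperStatus

def port_status(switch, port):
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv",
         switch, "%s.%d" % (IF_OPER_STATUS, port)],
        capture_output=True, text=True)
    return out.stdout.strip() or "no-response"

previous = {}
while True:
    for sw in SWITCHES:
        for port in PORTS:
            status = port_status(sw, port)
            if previous.get((sw, port)) not in (None, status):
                print("%s: %s port %d: %s -> %s" %
                      (time.ctime(), sw, port, previous[(sw, port)], status))
            previous[(sw, port)] = status
    time.sleep(30)
}}}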

=== Reset to 'Off Image' ===

Even in the first failure mode of a power supply, a CM can reliably
reset the node, causing it to reboot.  The CMC could be modified to
send reset commands in place of on and off commands.  Additionally,
the CMC could arrange for these reset commands to boot the node from
the network, with the network boot image being a special 'off image'
in the case of what would normally be an off command.  The current
software is careful to separate the job of selecting an image for a
node into the NodeHandler and NodeAgent software, so this change would
be a kludge.
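
In outline, the kludge is a translation layer in the CMC: both kinds
of request become a reset, and the requested state only determines
which network boot image the node is pointed at.  A minimal sketch of
that translation is below; the helper functions and image names are
hypothetical.

{{{#!python
# Sketch of the proposed kludge: "on" and "off" both become a reset,
# and the requested state only selects the network boot image.
# set_boot_image(), send_reset(), and the image names are hypothetical.

DEFAULT_IMAGE = "baseline.ndz"   # whatever the node would normally boot
OFF_IMAGE = "off-image.ndz"      # special image that idles the node

def set_boot_image(node, image):
    pass   # would point the node's network boot entry at `image`

def send_reset(node):
    pass   # would tell the node's CM to reset it

def request_power_state(node, state):
    """Handle what used to be an 'on' or 'off' command from the CMC."""
    set_boot_image(node, DEFAULT_IMAGE if state == "on" else OFF_IMAGE)
    send_reset(node)   # the CM can still do this in the first failure mode
}}}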

Using just this kludge, the CM would always report the node as being
on, and therefore it would be impossible to distinguish between a node
being active or inactive in an experiment.  The 'off image' would
therefore be made to run an echo service on an obscure port number,
and the CMC would need to be further modified to detect this in order
to determine each node's activation state.  Because it is the only
software performing commands that could change the activation state,
the CMC could instead keep a record of which nodes are active and
which are not; however, this is a fragile arrangement: if the CMC
failed for any reason, there would need to be something like the
obscurely numbered echo port to rediscover what was going on.
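
With the echo service in place, the activation check reduces to
probing that port.  The sketch below shows the idea; the port number
and timeout are assumed values, and the treatment of unreachable nodes
is deliberately simplistic.

{{{#!python
# Sketch: infer a node's activation state by probing the 'off image'
# echo service.  Port number and timeout are illustrative assumptions.
import socket

OFF_IMAGE_ECHO_PORT = 54321   # the "obscure port number"; assumed value

def activation_state(node_host):
    """Return 'inactive' if the off image's echo service answers, and
    'active' if nothing answers on that port (some other image, or no
    node at all, is running)."""
    try:
        with socket.create_connection((node_host, OFF_IMAGE_ECHO_PORT),
                                      timeout=2) as s:
            s.settimeout(2)
            s.sendall(b"ping\n")
            return "inactive" if s.recv(16) == b"ping\n" else "unknown"
    except OSError:
        return "active"
}}}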