= ORBIT Reliability 2/2007 =

== Power Supplies ==
          
The power supplies in some ORBIT nodes are failing.  Two power supply
failure modes arising from regular operation have been identified.  In
the first, the power supply degrades to the point where the CM has
enough power to report back to the CMC, but not enough to reliably
turn the node PC on or off.  Although the evidence is not conclusive,
this first failure mode may also cause incorrect communication between
the CM and the Node ID box.  In the second, the power supply degrades
further, to the point where there is not enough power to operate the
CM at all.  A node can operate in one of these failure modes for a
while and then recover, so, for example, retrying the power-on
operation might work on a node in the first failure mode.  The power
supplies appear to degrade with age rather than with any particular
pattern of use: nodes that are used more frequently, such as those
around (1, 1), do not fail any more often than other nodes.  The only
known remedy for a node with a failed power supply is to replace the
power supply entirely, and it is presently unclear how best to do
this.  The power supplies in the nodes do not use a standard ATX form
factor, and replacing a part in all 400 nodes of the grid is not a
trivial undertaking.  Currently, a small number of known-good power
supplies is used to replace the power supplies of nodes in either
failure mode during weekly scheduled maintenance, if not sooner.
          
Once a node enters the first failure mode, the problem cascades into
the software.  The CMC receives regular watchdog messages from each
CM, and uses them to make decisions about node availability.  In the
first failure mode, the CM reports back to the CMC as if nothing were
wrong.  That is, nodes appear as "available" on the status page even
when it is impossible for the CM to reliably turn the node on or off.
The CMC in turn reports incorrect node availability to the NodeAgent
and NodeHandler, which frustrates any attempt to run an experiment on
every available node.  Once the power supply has degraded into the
second failure mode, the CMC stops receiving watchdog messages, and
can correctly mark the node as unavailable.
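The watchdog bookkeeping described above can be sketched as follows.  This is an illustrative model, not the real CMC code: the timeout value, the class name, and the node-id representation are all assumptions.

```python
import time

# Hypothetical sketch of CMC watchdog bookkeeping.  The 40-second
# timeout and the interfaces here are assumptions, not the real CMC.
WATCHDOG_TIMEOUT = 40.0  # seconds of silence before a node is marked unavailable

class WatchdogTracker:
    def __init__(self, timeout=WATCHDOG_TIMEOUT):
        self.timeout = timeout
        self.last_seen = {}  # node id -> timestamp of last watchdog message

    def on_watchdog(self, node_id, now=None):
        """Record receipt of a watchdog message from a CM."""
        self.last_seen[node_id] = time.time() if now is None else now

    def available(self, node_id, now=None):
        """A node looks available if its CM has checked in recently.

        Note the limitation discussed above: a CM in the first power
        supply failure mode still sends watchdogs, so this check will
        report it available even though it cannot reliably switch its
        node on or off.  Only the second failure mode is detected.
        """
        now = time.time() if now is None else now
        seen = self.last_seen.get(node_id)
        return seen is not None and (now - seen) <= self.timeout
```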
          
== CM/CMC Software ==
          
We do not have enough evidence to be sure of this, but it seems that
UDP commands issued by the CMC to CMs fail more often than equivalent
telnet commands issued to CM consoles by expect scripts.
Furthermore, the UDP commands seem to upset the internal state of the
CM in such a way that a reset makes future commands more reliable.
There also exist error conditions in which the CM operates
incorrectly, or freezes, such that issuing it a reset command does
nothing; power must be interrupted to recover the CM from such a
state.  This is exceptionally bad for remote users, who cannot
physically manipulate the grid to clear the error.
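Since the UDP commands are fire-and-forget, the least a client can do is wrap them in a timeout-and-retry loop.  The following sketch shows the idea; the port number and the plain-text wire format are assumptions, not the real ORBIT command protocol.

```python
import socket

# Illustrative only: the CM command port and datagram format below are
# made up for this sketch, not taken from the real CM/CMC protocol.
CM_PORT = 9028
TIMEOUT_S = 2.0
RETRIES = 3

def send_cm_command(cm_host, command, port=CM_PORT):
    """Send a command datagram to a CM, retrying on timeout.

    Returns the CM's reply bytes, or None if every attempt timed out,
    which may indicate a frozen CM or a power supply in the second
    failure mode.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(TIMEOUT_S)
        for _ in range(RETRIES):
            sock.sendto(command.encode("ascii"), (cm_host, port))
            try:
                reply, _addr = sock.recvfrom(1024)
                return reply
            except socket.timeout:
                continue  # request or reply was lost; try again
    return None
```

Note that retrying only papers over lost datagrams; it cannot distinguish a dead CM from a congested network, which is part of the argument for a stateful TCP protocol below.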
          
There is uncertainty associated with the development environment
"Dynamic C".  Dynamic C is not a mature compiler: many language
features a C programmer would expect have been left out or are subtly
different.  Dynamic C provides several different programming
constructs for cooperative (or preemptive!) multitasking, and it is
unclear whether or not the current CM code is using them correctly.
          
            |  | 60 |  | 
          
            |  | 61 | == Network Infrastructure == | 
          
            |  | 62 |  | 
          
We regularly experience bugs in our network switches.  Momentarily
interrupting the power to the switches often clears otherwise
unidentifiable network errors.  We strongly suspect that any strenuous
utilization of the switches, such as would cause packets to be queued
or discarded, makes future erroneous operation more likely.
Additionally, we seem to lose one or two of our 27 Netgear switches
every month; the switch becomes completely inoperable and must be sent
back to Netgear for replacement.  Higher quality switches are too
expensive for us to obtain.
          
== Software Remedies ==
          
Rewriting the CMC as a properly threaded web service would prevent
problems in failed CM software, as well as power supplies in the first
failure mode described above, from cascading into the rest of the
system.  Changing the protocol between the CMC and the CM to a
stateful, TCP-based protocol would make failure detection even
quicker.  Ultimately, failing power supplies must be replaced, and the
CM code must be made more robust.  Making CMs reset their nodes,
rather than turn them on and off, can extend the lifetime of the
current grid.  There is little we can do about the switches, but we
can at least detect switch problems more quickly.
          
=== Threaded CMC ===
          
It is difficult to instrument the current CMC to compensate for the
failure of a command telling a CM to turn its node on or off.  One
could imagine a CMC which checked the status of nodes after telling
them to turn on, perhaps retrying if the first failure mode is
detected.  However, because the CM and the CMC communicate using a
stateless, asynchronous protocol over UDP, and because the present
implementation of the CMC is not threaded, it is impractical to
determine whether status check results came from before or after a
restart command was issued.  Each interaction between the CMC and the
CM would need to wait 20 to 40 seconds to be sure the reported status
postdates the command.  Because the present CMC implementation can
only interact with one node at a time in this way, this mandatory wait
time does not scale.
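A threaded CMC removes the scaling problem because the per-node settle time is spent concurrently instead of serially.  The sketch below illustrates only that point; the power-on and verify logic is a placeholder, not the real CMC API, and the settle time is scaled down from the real 20-40 seconds.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Stand-in for the real 20-40 second wait before a status report can
# be trusted; scaled down so the sketch runs quickly.
SETTLE_TIME = 0.1

def power_on_and_verify(node_id):
    # A real implementation would issue the power-on command here, wait
    # out the settle time, then check status and retry if the first
    # power supply failure mode is detected.
    time.sleep(SETTLE_TIME)
    return (node_id, "on")

def power_on_all(node_ids, workers=32):
    """Issue power-on to many nodes concurrently.

    Total wall time is roughly one settle period regardless of node
    count (up to the worker limit), rather than one period per node as
    in the current single-threaded CMC.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(power_on_and_verify, node_ids))
```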
          
=== New CM ===
          
The CM is a relatively large program, and we do not have the resources
to rewrite it all.  However, a smaller feature set would not only make
a rewrite possible, it would reduce the amount of code.  Less code
gives the Dynamic C compiler less opportunity to err, and gives us
less to maintain in the long run.
          
=== Switch Tools ===
          
We update the firmware in the switches as often as the vendor supplies
changes, but this does not seem to make things better.  Because the
software on the switches is closed source on a closed hardware
platform, there is nothing we can do to directly fix the problem.  We
are developing better tools for detecting when switch ports
autonegotiate or otherwise enter unexpected states.
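The core of such a tool is simple: periodically snapshot each port's negotiated state and flag any change between polls.  The sketch below shows only that diffing step; how the snapshots are collected (SNMP, the switch's management console) is left abstract, and the state tuples are invented for illustration.

```python
# Sketch of the port-monitoring idea.  A "snapshot" is a dict mapping
# port number to its negotiated state, e.g. (speed, duplex); the real
# collector and state representation are assumptions.

def diff_port_states(previous, current):
    """Return (port, old_state, new_state) for every port whose state
    changed, appeared, or disappeared between two polls.  A missing
    port is represented by None."""
    changes = []
    for port in sorted(set(previous) | set(current)):
        old = previous.get(port)
        new = current.get(port)
        if old != new:
            changes.append((port, old, new))
    return changes
```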
          
            |  | 118 |  | 
          
            |  | 119 | === Reset to 'Off Image' === | 
          
            |  | 120 |  | 
          
Even in the first failure mode of a power supply, a CM can reliably
reset the node, causing it to reboot.  The CMC could be modified to
send reset commands in place of on and off commands.  Additionally,
the CMC could arrange for these reset commands to boot the node from
the network, with the network boot image being a special 'off image'
in the case of what would normally be an off command.  The current
software is careful to separate the job of selecting an image for a
node into the NodeHandler and NodeAgent software, so this change would
be a kludge.
          
Using just this kludge, the CM would always report the node as being
on, and it would therefore be impossible to distinguish between a node
being active or inactive in an experiment.  The 'off image' would
therefore be made to run an echo service on an obscure port number,
and the CMC would need to be further modified to detect this service
in order to determine each node's activation state.  Because it is the
only software performing commands that could change the activation
state, the CMC could instead keep a record of which nodes are active
and which are not; however, this is a fragile arrangement.  If the CMC
failed for any reason, something like the obscurely numbered echo port
would still be needed to rediscover what was going on.
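The probe side of this scheme might look like the sketch below.  The port number, the probe payload, and the use of TCP are all assumptions made for illustration; the real 'off image' service could differ on any of these points.

```python
import socket

# Hypothetical probe for the 'off image' echo service described above.
OFF_IMAGE_PORT = 49152  # made-up "obscure" port number
PROBE = b"orbit-off-probe"

def node_is_off(host, port=OFF_IMAGE_PORT, timeout=2.0):
    """Return True if the node answers the off-image echo probe.

    A node that echoes our bytes back is running the 'off image' and
    is therefore inactive.  A refused or silent connection means the
    node is either running an experiment image or unreachable; the
    probe alone cannot tell those two cases apart.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(PROBE)
            return sock.recv(len(PROBE)) == PROBE
    except OSError:
        return False
```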