ORBIT-USER: Node not registered for testbed?

Max Ott max at semandex.net
Sun Aug 13 19:03:09 EDT 2006


Chris,

You ran into a problem with the grid services.

If you look at the design of the Orbit framework, you will see that
the interface between the user and the system is what we call 'grid
services'. They are a set of web services on which all user
activities build.

Specifically, there is a service called 'CMC' which allows a user to
control the node hardware: switch on/off power, get environment
measurements, such as voltage levels and temperature, and get access
to the serial console.

The nodehandler uses this service to switch the nodes on or off.
The CMC service exposes a few methods for this - actually far too many
at this stage. The CMC service also maintains a database of which
nodes are in service and which aren't, and that's what you are hitting.
It appears that node (4,1) is 'out-of-service'.

console:~$ wget -O - -q http://cmc:5012/cmc/allStatus | grep "'n_4_1'"
                    <node name = 'n_4_1' x='4' y='1' state='NODE NOT AVAILABLE' />

The problem is that the CMC service is unfortunately not implemented
consistently. When the nodehandler requests that an entire set of
nodes be switched on, the CMC service quietly ignores all the
'out-of-service' nodes and reports success. As the nodehandler doesn't
know that some nodes are out-of-service, it will later request that
those nodes be reset - assuming they went astray. This time the CMC
service reports an error, which leads to what you have observed.

Now, what should you, or we, do? Unfortunately, the CMC service is a
really important one for automating the operation of Orbit. We spent a
lot of engineering effort in getting it nicely integrated into the
hardware - and that part works really well. However, the software on
the server side never reached the same level of maturity. Lots of
history, water under the bridge.

There are ways for the nodehandler to work around this. Let me see if
I can come up with something.

In the meantime, I can only ask you to use the 'allStatus' command I
showed above to see if the nodes you need for your experiment are
really ready for use.
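
If you want to check a whole block of nodes rather than grep for one
at a time, something along these lines might help. It is only a rough
sketch: it assumes every entry in the 'allStatus' output looks like the
single line shown above (name, x, y, state, in that order) and that
'NODE NOT AVAILABLE' is the state string to look for. The function
names are made up for this example and are not part of the nodehandler
or the CMC service.

```python
import re
import urllib.request

# Assumed entry format, taken from the one line shown above:
#   <node name = 'n_4_1' x='4' y='1' state='NODE NOT AVAILABLE' />
NODE_RE = re.compile(
    r"<node\s+name\s*=\s*'(?P<name>[^']*)'\s+x\s*=\s*'(?P<x>\d+)'\s+"
    r"y\s*=\s*'(?P<y>\d+)'\s+state\s*=\s*'(?P<state>[^']*)'"
)


def parse_all_status(text):
    """Return {(x, y): state} for every <node .../> entry found in text."""
    # Collapse whitespace so an entry wrapped across lines still matches.
    flat = " ".join(text.split())
    return {(int(m["x"]), int(m["y"])): m["state"] for m in NODE_RE.finditer(flat)}


def not_available(status, xs, ys):
    """List the (x, y) in the requested block reported as NODE NOT AVAILABLE."""
    return sorted(
        (x, y)
        for x in xs
        for y in ys
        if status.get((x, y)) == "NODE NOT AVAILABLE"
    )


def fetch_all_status(url="http://cmc:5012/cmc/allStatus"):
    """Fetch the same document the wget command above retrieves."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def report(xs=range(1, 21), ys=range(1, 21)):
    """Print the unavailable nodes in the block; defaults to 1..20,1..20."""
    status = parse_all_status(fetch_all_status())
    for x, y in not_available(status, xs, ys):
        print(f"n_{x}_{y} is not available")
```

Calling report() on the console before starting an experiment should
list the nodes to leave out of the node set. Nodes that do not show up
in the output at all are not flagged by this sketch.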

Sorry for the inconvenience,

-max



On 8/14/06, chris at orderonenetworks.com <chris at orderonenetworks.com> wrote:
> Hello,
>
> I'm getting an odd message when trying to imageNodes on the main grid.
>
> FATAL run: ServiceException: ServiceException
>         Node (4,1) Not Registered for Testbed: '#<CMC::Testbed:0xa7ab5fc0>'
>
> I've pasted the entire run below.
>
> If I try subsections of the grid, I get other nodes that give the same error.
>
> I haven't seen this error on the sandboxes. How do I register a node for
> the testbed?
>
> Thanks,
> Chris
>
> ----------------------------
> Imaging nodes: 1..20,1..20 with image baseline.ndz
> Using config /etc/nodehandler/grid.cfg
> /etc/nodehandler/grid.cfg:20: warning: Insecure world writable dir /tmp,
> mode 040777
> Using logfile /etc/nodehandler/nodehandler_log.xml
>  INFO init: NodeHandler Version 3.6.4-1 (849)
>  INFO init: Experiment ID: grid_2006_08_13_16_56_19
>  INFO Experiment: load system:exp:stdlib
>  INFO prop.resetDelay: resetDelay = 180:Fixnum
>  INFO Experiment: load system:exp:imageNode
>  INFO prop.nodes: nodes = [1..20, 1..20]:Array
>  INFO prop.image: image = "baseline.ndz":String
>  INFO stdlib: 400 out of 400 node(s) still down n_16_19,n_6_1,n_20_16
>  INFO stdlib: 400 out of 400 node(s) still down n_16_19,n_6_1,n_20_16
>  INFO stdlib: 400 out of 400 node(s) still down n_16_19,n_6_1,n_20_16
> /tmp/eee.169/lib/util/communication.rb:127: warning: Insecure world
> writable dir /tmp, mode 040777
>  INFO stdlib: 400 out of 400 node(s) still down n_16_19,n_6_1,n_20_16
>  INFO n_18_19: Checked in as /ip/10.10.18.19 booting off baseline:1.0.9
>  WARN n_18_19: Expected image 'pxe:1.1.4', but node reported
> 'baseline:1.0.9'.
>  INFO n_18_19: Resseting node
> FATAL run: ServiceException: ServiceException
>         Node (4,1) Not Registered for Testbed: '#<CMC::Testbed:0xa7ab5fc0>'
>  INFO run: Experiment grid_2006_08_13_16_56_19 finished after 0:42
>  done.
>
>
>


-- 
Dr. Max Ott
Research Program Leader - Network and Pervasive Computing, NICTA Australia
Founder & CTO, Semandex Networks
Research Professor, WINLAB, Rutgers University
