ORBIT-USER: Node not registered for testbed?

chris at orderonenetworks.com chris at orderonenetworks.com
Sun Aug 13 19:58:11 EDT 2006


Max,

Thanks very much for the detailed reply. It was unexpected and welcome
coming on a Sunday evening!

I'm hoping to run a routing protocol scalability test making use of the
entire grid (if possible). So the more nodes that are able to be part of
it, the better.

Will I need to do anything special to ensure that my OML logs (when I get
them working) can be collected from that many nodes?

Is there a FAQ anywhere talking about issues when trying to use the entire
grid at once?

Thank you again,
Chris




> Chris,
>
> You ran into a problem with the grid services.
>
> If you look at the design of the Orbit framework, you will see that
> the interface between the user and the system is what we call 'grid
> services'. They are a set of web services on which all user activities
> are built.
>
> Specifically, there is a service called 'CMC' which allows a user to
> control the node hardware: switch on/off power, get environment
> measurements, such as voltage levels and temperature, and get access
> to the serial console.
>
> The nodehandler uses this service to switch the nodes on or off.
> The CMC service exposes a few methods for this - actually far too many
> at this stage. The CMC service also maintains a database of which
> nodes are in service and which aren't, and that's what you are hitting.
> It appears that node (4,1) is 'out-of-service'.
>
> console:~$ wget -O - -q http://cmc:5012/cmc/allStatus | grep "'n_4_1'"
>     <node name = 'n_4_1' x='4' y='1' state='NODE NOT AVAILABLE' />
>
> The problem is that the CMC service is unfortunately not implemented
> consistently. When the nodehandler requests that an entire set of
> nodes be switched on, the CMC service quietly ignores all the
> 'out-of-service' nodes and reports success. As the nodehandler doesn't
> know that some nodes are out-of-service, it will later request that
> those nodes be reset - assuming they went astray. At that point the CMC
> service reports an error, which leads to what you have observed.
>
> Now, what should you, or we, do? Unfortunately, the CMC service is
> really important for automating the operation of Orbit. We spent a
> lot of engineering effort in getting it nicely integrated into the
> hardware - and that part works really well. However, the software on
> the server side never reached the same level of maturity. Lots of
> history, water under the bridge.
>
> There are ways for the nodehandler to work around this inconsistency.
> Let me see if I can come up with something.
>
> In the meantime, I can only ask you to use the 'allStatus' command I
> showed above to see if the nodes you need for your experiment are
> really ready for use.
>
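
A minimal sketch of automating that check, assuming the 'allStatus'
output consists of <node .../> elements shaped like the one shown
above. The URL and the 'NODE NOT AVAILABLE' state string are taken from
that example; the function names are hypothetical, and any other state
values the CMC may report are not handled here.

```python
import re
import urllib.request

# URL taken from the wget command quoted above.
ALL_STATUS_URL = "http://cmc:5012/cmc/allStatus"

# Matches one <node .../> element in the format shown above.
# Tolerates optional spaces around '=' (as in "name = 'n_4_1'").
NODE_RE = re.compile(
    r"<node\s+name\s*=\s*'(?P<name>[^']*)'\s+"
    r"x\s*=\s*'(?P<x>\d+)'\s+"
    r"y\s*=\s*'(?P<y>\d+)'\s+"
    r"state\s*=\s*'(?P<state>[^']*)'"
)


def fetch_status(url=ALL_STATUS_URL):
    """Download the raw allStatus document (only works inside the testbed)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def unavailable_nodes(status_text, xs, ys):
    """Return sorted (x, y) pairs within xs/ys whose state is NODE NOT AVAILABLE."""
    # Collapse whitespace so an attribute wrapped across lines still matches.
    text = " ".join(status_text.split())
    bad = []
    for m in NODE_RE.finditer(text):
        x, y = int(m["x"]), int(m["y"])
        if x in xs and y in ys and m["state"] == "NODE NOT AVAILABLE":
            bad.append((x, y))
    return sorted(bad)
```

For the full grid, unavailable_nodes(fetch_status(), range(1, 21),
range(1, 21)) would list the coordinates to leave out of the node set
before calling imageNodes.
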
> Sorry for the inconvenience,
>
> -max
>
>
>
> On 8/14/06, chris at orderonenetworks.com <chris at orderonenetworks.com> wrote:
>> Hello,
>>
>> I'm getting an odd message when trying to imageNodes on the main grid.
>>
>> FATAL run: ServiceException: ServiceException
>>         Node (4,1) Not Registered for Testbed:
>> '#<CMC::Testbed:0xa7ab5fc0>'
>>
>> I've pasted the entire run below.
>>
>> If I try subsections of the grid, I get other nodes that give the same
>> error.
>>
>> I haven't seen this error on the sandboxes. How do I register a node for
>> the testbed?
>>
>> Thanks,
>> Chris
>>
>> ----------------------------
>> Imaging nodes: 1..20,1..20 with image baseline.ndz
>> Using config /etc/nodehandler/grid.cfg
>> /etc/nodehandler/grid.cfg:20: warning: Insecure world writable dir /tmp,
>> mode 040777
>> Using logfile /etc/nodehandler/nodehandler_log.xml
>>  INFO init: NodeHandler Version 3.6.4-1 (849)
>>  INFO init: Experiment ID: grid_2006_08_13_16_56_19
>>  INFO Experiment: load system:exp:stdlib
>>  INFO prop.resetDelay: resetDelay = 180:Fixnum
>>  INFO Experiment: load system:exp:imageNode
>>  INFO prop.nodes: nodes = [1..20, 1..20]:Array
>>  INFO prop.image: image = "baseline.ndz":String
>>  INFO stdlib: 400 out of 400 node(s) still down n_16_19,n_6_1,n_20_16
>>  INFO stdlib: 400 out of 400 node(s) still down n_16_19,n_6_1,n_20_16
>>  INFO stdlib: 400 out of 400 node(s) still down n_16_19,n_6_1,n_20_16
>> /tmp/eee.169/lib/util/communication.rb:127: warning: Insecure world
>> writable dir /tmp, mode 040777
>>  INFO stdlib: 400 out of 400 node(s) still down n_16_19,n_6_1,n_20_16
>>  INFO n_18_19: Checked in as /ip/10.10.18.19 booting off baseline:1.0.9
>>  WARN n_18_19: Expected image 'pxe:1.1.4', but node reported
>> 'baseline:1.0.9'.
>>  INFO n_18_19: Resseting node
>> FATAL run: ServiceException: ServiceException
>>         Node (4,1) Not Registered for Testbed:
>> '#<CMC::Testbed:0xa7ab5fc0>'
>>  INFO run: Experiment grid_2006_08_13_16_56_19 finished after 0:42
>>  done.
>>
>>
>>
>
>
> --
> Dr. Max Ott
> Research Program Leader - Network and Pervasive Computing, NICTA Australia
> Founder & CTO, Semandex Networks
> Research Professor, WINLAB, Rutgers University
>




