ORBIT-USER: Node not registered for testbed?
chris at orderonenetworks.com
Mon Aug 14 11:28:24 EDT 2006
Kishore,
Thanks for the reply. I'll let you know if it works when I reach that point.
Thanks,
Chris
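
The node list asked about in the quoted message below does not have to be typed out by hand. A minimal sketch in plain Ruby, assuming a 20x20 grid (as in the quoted "1..20,1..20" imaging run); the helper name `working_nodes` and the exclusion list are hypothetical and not part of nodehandler:

```ruby
# Hypothetical helper (not part of nodehandler): every [x, y] pair on a
# size-by-size grid, minus the pairs listed in `excluded`.
def working_nodes(size, excluded)
  coords = (1..size).to_a
  coords.product(coords) - excluded
end

# Assumption: only (4,1) is excluded here, because it is the one node this
# thread shows as unavailable. Extend the list as needed.
bad = [[4, 1]]
nodes = working_nodes(20, bad)
puts nodes.length          # 399 of the 400 grid positions remain

# Inside an experiment script the list would then be passed along, e.g.:
#   defNodes('sender', nodes) { |node| ... }
```

The pairs come out in the same order as the hand-written list ([1,1], [1,2], ...), so the result should be usable wherever the explicit array was.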
> Hi Chris:
>
>> In terms of working around the issue of some nodes not being registered
>> to the test bed, can I have an experiment with something like this in it:
>>
>> defNodes('sender', [[1,1],[1,2],[1,3],[1,4]..[2,1],[2,2]..[3,1]...]) {|node|
>> ...
>> }
>>
>> with basically a list of all 400 nodes, less those that don't work?
>
> Yes - that should work.
>
> regards,
> Kishore
>
> ---------------------------------------------------------------------------
> Kishore Ramachandran
> Graduate Assistant, WINLAB/ECE, Rutgers University.
> WWW : http://www.winlab.rutgers.edu/~kishore
> ---------------------------------------------------------------------------
>
> On Mon, 14 Aug 2006 chris at orderonenetworks.com wrote:
>
>> Max,
>>
>> I'm hoping to be able to do a full run in the next few days. I'll let
>> you know if I encounter any problems.
>>
>> In terms of working around the issue of some nodes not being registered
>> to the test bed, can I have an experiment with something like this in it:
>>
>> defNodes('sender', [[1,1],[1,2],[1,3],[1,4]..[2,1],[2,2]..[3,1]...]) {|node|
>> ...
>> }
>>
>> with basically a list of all 400 nodes, less those that don't work?
>>
>> Thanks,
>> Chris
>>
>>> Chris,
>>>
>>> I'm not sure how testing has been done with the currently installed
>>> versions of nodehandler and services.
>>>
>>> In the little time I have I've started to re-write many of the
>>> performance-limited parts and I have repeatedly worked with all 400
>>> nodes without many problems.
>>>
>>> As for OML, there are reported cases of problems and large amounts of
>>> dropped measurements, but I have not been able to reproduce them.
>>> There seem to be some issues with the reported sequence numbers, but
>>> again, I haven't gotten to the bottom of that. It normally works for
>>> me.
>>>
>>> We have made some improvements to OML in the last few weeks but haven't
>>> tested them. If you have an experiment which stress tests OML, please
>>> let me know.
>>>
>>> Thanks,
>>>
>>> -max
>>>
>>>
>>> On 8/14/06, chris at orderonenetworks.com <chris at orderonenetworks.com>
>>> wrote:
>>>> Max,
>>>>
>>>> Thanks very much for the detailed reply. It was unexpected and welcome
>>>> coming on a Sunday evening!
>>>>
>>>> I'm hoping to run a routing protocol scalability test making use of the
>>>> entire grid (if possible). So the more nodes that are able to be part of
>>>> it, the better.
>>>>
>>>> Will I need to do anything special to ensure that my OML logs (when I get
>>>> them working) can be collected from that many nodes?
>>>>
>>>> Is there a FAQ anywhere talking about issues when trying to use the entire
>>>> grid at once?
>>>>
>>>> Thank you again,
>>>> Chris
>>>>
>>>>
>>>>
>>>>
>>>>> Chris,
>>>>>
>>>>> You ran into a problem with the grid services.
>>>>>
>>>>> If you look at the design of the Orbit framework, you will see that
>>>>> the interface between the user and the system is what we call 'grid
>>>>> services'. They are a set of web services on which all user activities
>>>>> build.
>>>>>
>>>>> Specifically, there is a service called 'CMC' which allows a user to
>>>>> control the node hardware: switch on/off power, get environment
>>>>> measurements, such as voltage levels and temperature, and get access
>>>>> to the serial console.
>>>>>
>>>>> The nodehandler is using this service to switch the nodes on or off.
>>>>> The CMC service exposes a few methods for this - actually far too many
>>>>> at this stage. The CMC service also maintains a database of which
>>>>> nodes are in service and which aren't, and that's what you are hitting.
>>>>> It appears that node (4,1) is 'out-of-service'.
>>>>>
>>>>> console:~$ wget -O - -q http://cmc:5012/cmc/allStatus | grep "'n_4_1'"
>>>>> <node name = 'n_4_1' x='4' y='1' state='NODE NOT AVAILABLE' />
>>>>>
>>>>> The problem is that the CMC service is unfortunately not implemented
>>>>> consistently. When the nodehandler requests that an entire set of
>>>>> nodes be switched on, the CMC service quietly ignores all the
>>>>> 'out-of-service' nodes and reports success. As the nodehandler doesn't
>>>>> know that some nodes are out-of-service, it will later request those
>>>>> nodes to be reset - assuming they went astray. Now the CMC service
>>>>> reports an error, which leads to what you have observed.
>>>>>
>>>>> Now, what should you, or we, do? Unfortunately, the CMC service is a
>>>>> really important one to automate the operation of Orbit. We spent a
>>>>> lot of engineering effort in getting it nicely integrated into the
>>>>> hardware - and that part works really well. However, the software on
>>>>> the server side never reached the same level of maturity. Lots of
>>>>> history, water under the bridge.
>>>>>
>>>>> There are ways for the nodehandler to work around those. Let me see if
>>>>> I can come up with something.
>>>>>
>>>>> In the meantime, I can only ask you to use the 'allStatus' command I
>>>>> showed above to see if the nodes you need for your experiment are
>>>>> really ready for use.
>>>>>
>>>>> Sorry for the inconvenience,
>>>>>
>>>>> -max
>>>>>
>>>>>
>>>>>
>>>>> On 8/14/06, chris at orderonenetworks.com <chris at orderonenetworks.com> wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I'm getting an odd message when trying to imageNodes on the main grid.
>>>>>>
>>>>>> FATAL run: ServiceException: ServiceException
>>>>>> Node (4,1) Not Registered for Testbed:
>>>>>> '#<CMC::Testbed:0xa7ab5fc0>'
>>>>>>
>>>>>> I've pasted the entire run below.
>>>>>>
>>>>>> If I try subsections of the grid, I get other nodes that give the same
>>>>>> error.
>>>>>>
>>>>>> I haven't seen this error on the sandboxes. How do I register a node for
>>>>>> the testbed?
>>>>>>
>>>>>> Thanks,
>>>>>> Chris
>>>>>>
>>>>>> ----------------------------
>>>>>> Imaging nodes: 1..20,1..20 with image baseline.ndz
>>>>>> Using config /etc/nodehandler/grid.cfg
>>>>>> /etc/nodehandler/grid.cfg:20: warning: Insecure world writable dir /tmp,
>>>>>> mode 040777
>>>>>> Using logfile /etc/nodehandler/nodehandler_log.xml
>>>>>> INFO init: NodeHandler Version 3.6.4-1 (849)
>>>>>> INFO init: Experiment ID: grid_2006_08_13_16_56_19
>>>>>> INFO Experiment: load system:exp:stdlib
>>>>>> INFO prop.resetDelay: resetDelay = 180:Fixnum
>>>>>> INFO Experiment: load system:exp:imageNode
>>>>>> INFO prop.nodes: nodes = [1..20, 1..20]:Array
>>>>>> INFO prop.image: image = "baseline.ndz":String
>>>>>> INFO stdlib: 400 out of 400 node(s) still down
>>>>>> n_16_19,n_6_1,n_20_16
>>>>>> INFO stdlib: 400 out of 400 node(s) still down
>>>>>> n_16_19,n_6_1,n_20_16
>>>>>> INFO stdlib: 400 out of 400 node(s) still down
>>>>>> n_16_19,n_6_1,n_20_16
>>>>>> /tmp/eee.169/lib/util/communication.rb:127: warning: Insecure world
>>>>>> writable dir /tmp, mode 040777
>>>>>> INFO stdlib: 400 out of 400 node(s) still down
>>>>>> n_16_19,n_6_1,n_20_16
>>>>>> INFO n_18_19: Checked in as /ip/10.10.18.19 booting off baseline:1.0.9
>>>>>> WARN n_18_19: Expected image 'pxe:1.1.4', but node reported
>>>>>> 'baseline:1.0.9'.
>>>>>> INFO n_18_19: Resseting node
>>>>>> FATAL run: ServiceException: ServiceException
>>>>>> Node (4,1) Not Registered for Testbed:
>>>>>> '#<CMC::Testbed:0xa7ab5fc0>'
>>>>>> INFO run: Experiment grid_2006_08_13_16_56_19 finished after 0:42
>>>>>> done.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Dr. Max Ott
>>>>> Research Program Leader - Network and Pervasive Computing, NICTA
>>>>> Australia
>>>>> Founder & CTO, Semandex Networks
>>>>> Research Professor, WINLAB, Rutgers University
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Dr. Max Ott
>>> Research Program Leader - Network and Pervasive Computing, NICTA
>>> Australia
>>> Founder & CTO, Semandex Networks
>>> Research Professor, WINLAB, Rutgers University
>>>
>>
>>
>>
>
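
The 'allStatus' check suggested in the quoted thread above could also be scripted. A minimal sketch in plain Ruby, assuming each node is reported in the same attribute layout as the single sample line in this thread; the function name `unavailable_nodes` is hypothetical, and 'NODE NOT AVAILABLE' is the only state value known from the thread, so any other state is simply treated as usable here:

```ruby
# Hypothetical helper: return [x, y] pairs for nodes whose state is
# 'NODE NOT AVAILABLE' in the text returned by the allStatus service.
# The pattern follows the one sample line quoted in this thread; other
# attribute orders or layouts are not handled.
def unavailable_nodes(status_text)
  node_pattern =
    /<node\s+name\s*=\s*'n_\d+_\d+'\s+x\s*=\s*'(\d+)'\s+y\s*=\s*'(\d+)'\s+state\s*=\s*'([^']*)'/
  status_text.scan(node_pattern)
             .select { |_x, _y, state| state == 'NODE NOT AVAILABLE' }
             .map { |x, y, _state| [x.to_i, y.to_i] }
end

# The first line is copied from the thread; the second line, including its
# state value, is made up purely to show a node that is not filtered out.
sample = <<~STATUS
  <node name = 'n_4_1' x='4' y='1' state='NODE NOT AVAILABLE' />
  <node name = 'n_4_2' x='4' y='2' state='EXAMPLE OTHER STATE' />
STATUS

p unavailable_nodes(sample)
```

The input text could come from the same wget command shown in the thread, saved to a file and read back in. The returned pairs are in the [x, y] form used for the explicit node list discussed earlier in the thread.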