ORBIT-USER: Node not registered for testbed?
Kishore Ramachandran
kishore at winlab.rutgers.edu
Sun Aug 13 05:12:57 EDT 2006
Hi Chris:
> In terms of working around the issue of some nodes not being registered to
> the test bed, can I have an experiment with something like this in it:
>
> defNodes('sender', [[1,1],[1,2],[1,3],[1,4]..[2,1],[2,2]..[3,1]...]) {|node|
> ...
> }
>
> with basically a list of all 400 nodes, less those that don't work?
Yes - that should work.
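
Rather than typing out all 400 coordinates by hand, the list can be generated. The following is only a sketch: usable_nodes and unavailable_nodes are hypothetical helper names, not part of the nodehandler, and the parsing assumes that the allStatus output marks out-of-service nodes exactly as in the single sample line quoted further down in this thread (state='NODE NOT AVAILABLE'). Other state strings may exist and are not handled here.

```ruby
# Hypothetical helper (not part of the nodehandler): returns every [x, y]
# pair on a size-by-size grid, minus the pairs listed in excluded.
def usable_nodes(size, excluded)
  all = (1..size).to_a.product((1..size).to_a)
  all - excluded
end

# Hypothetical helper: extracts the [x, y] pairs of nodes reported as
# 'NODE NOT AVAILABLE'. The attribute order (x, y, state) is an assumption
# based on the one sample line of allStatus output shown in this thread.
def unavailable_nodes(status_xml)
  pattern = /<node\b[^>]*?x='(\d+)'\s+y='(\d+)'\s+state='NODE NOT\s+AVAILABLE'/
  status_xml.scan(pattern).map { |x, y| [x.to_i, y.to_i] }
end

# Example input, copied from the allStatus sample line in this thread.
sample = "<node name = 'n_4_1' x='4' y='1' state='NODE NOT AVAILABLE' />"
nodes = usable_nodes(20, unavailable_nodes(sample))
```

The resulting array could then be passed in place of the hand-written list, e.g. defNodes('sender', nodes) {|node| ... }, with the status text fetched beforehand using the wget command quoted below.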
regards,
Kishore
---------------------------------------------------------------------------
Kishore Ramachandran
Graduate Assistant, WINLAB/ECE, Rutgers University.
WWW : http://www.winlab.rutgers.edu/~kishore
---------------------------------------------------------------------------
On Mon, 14 Aug 2006 chris at orderonenetworks.com wrote:
> Max,
>
> I'm hoping to be able to do a full run in the next few days. I'll let you
> know if I encounter any problems.
>
> In terms of working around the issue of some nodes not being registered to
> the test bed, can I have an experiment with something like this in it:
>
> defNodes('sender', [[1,1],[1,2],[1,3],[1,4]..[2,1],[2,2]..[3,1]...]) {|node|
> ...
> }
>
> with basically a list of all 400 nodes, less those that don't work?
>
> Thanks,
> Chris
>
>> Chris,
>>
>> I'm not sure how testing has been done with the currently installed
>> versions of nodehandler and services.
>>
>> In the little time I have I've started to re-write many of the
>> performance-limited parts and I have repeatedly worked with all 400
>> nodes without many problems.
>>
>> As for OML, there are reported cases of problems and large amounts of
>> dropped measurements, but I have not been able to reproduce them.
>> There seem to be some issues with the reported sequence numbers, but
>> again, I haven't gotten to the bottom of that. It normally works for
>> me.
>>
>> We have made some improvements to OML in the last few weeks but haven't
>> tested them. If you have an experiment which stress tests OML, please
>> let me know.
>>
>> Thanks,
>>
>> -max
>>
>>
>> On 8/14/06, chris at orderonenetworks.com <chris at orderonenetworks.com> wrote:
>>> Max,
>>>
>>> Thanks very much for the detailed reply. It was unexpected and welcome
>>> coming on a Sunday evening!
>>>
>>> I'm hoping to run a routing protocol scalability test making use of the
>>> entire grid (if possible). So the more nodes that are able to be part of
>>> it, the better.
>>>
>>> Will I need to do anything special to ensure that my OML logs (when I get
>>> them working) can be collected from that many nodes?
>>>
>>> Is there a FAQ anywhere talking about issues when trying to use the entire
>>> grid at once?
>>>
>>> Thank you again,
>>> Chris
>>>
>>>
>>>
>>>
>>>> Chris,
>>>>
>>>> You ran into a problem with the grid services.
>>>>
>>>> If you look at the design of the Orbit framework, you will see that
>>>> the interface between the user and the system is what we call 'grid
>>>> services'. They are a set of web services on which all user activities
>>>> are built.
>>>>
>>>> Specifically, there is a service called 'CMC' which allows a user to
>>>> control the node hardware: switch on/off power, get environment
>>>> measurements, such as voltage levels and temperature, and get access
>>>> to the serial console.
>>>>
>>>> The nodehandler is using this service to switch the nodes on or off.
>>>> The CMC service exposes a few methods for this - actually far too many
>>>> at this stage. The CMC service also maintains a database for what
>>>> nodes are in service and which aren't and that's what you are hitting.
>>>> It appears that node 4 at 1 is 'out-of-service'.
>>>>
>>>> console:~$ wget -O - -q http://cmc:5012/cmc/allStatus | grep "'n_4_1'"
>>>> <node name = 'n_4_1' x='4' y='1' state='NODE NOT AVAILABLE' />
>>>>
>>>> The problem is that the CMC service is unfortunately not implemented
>>>> consistently. When the nodehandler requests that an entire set of
>>>> nodes be switched on, the CMC service quietly ignores all the
>>>> 'out-of-service' nodes and reports success. As the nodehandler doesn't
>>>> know that some nodes are out-of-service it will later request those
>>>> nodes to be reset - assuming they went astray. Now the CMC service is
>>>> reporting an error which leads to what you have observed.
>>>>
>>>> Now, what should you, or we, do? Unfortunately, the CMC service is a
>>>> really important one to automate the operation of Orbit. We spent a
>>>> lot of engineering effort in getting it nicely integrated into the
>>>> hardware - and that part works really well. However, the software on
>>>> the server side never reached the same level of maturity. Lots of
>>>> history, water under the bridge.
>>>>
>>>> There are ways for the nodehandler to work around those. Let me see if
>>>> I can come up with something.
>>>>
>>>> In the meantime, I can only ask you to use the 'allStatus' command I
>>>> showed above to see if the nodes you need for your experiment are
>>>> really ready for use.
>>>>
>>>> Sorry for the inconvenience,
>>>>
>>>> -max
>>>>
>>>>
>>>>
>>>> On 8/14/06, chris at orderonenetworks.com <chris at orderonenetworks.com> wrote:
>>>>> Hello,
>>>>>
>>>>> I'm getting an odd message when trying to imageNodes on the main grid.
>>>>>
>>>>> FATAL run: ServiceException: ServiceException
>>>>> Node (4,1) Not Registered for Testbed:
>>>>> '#<CMC::Testbed:0xa7ab5fc0>'
>>>>>
>>>>> I've pasted the entire run below.
>>>>>
>>>>> If I try subsections of the grid, I get other nodes that give the same
>>>>> error.
>>>>>
>>>>> I haven't seen this error on the sandboxes. How do I register a node for
>>>>> the testbed?
>>>>>
>>>>> Thanks,
>>>>> Chris
>>>>>
>>>>> ----------------------------
>>>>> Imaging nodes: 1..20,1..20 with image baseline.ndz
>>>>> Using config /etc/nodehandler/grid.cfg
>>>>> /etc/nodehandler/grid.cfg:20: warning: Insecure world writable dir /tmp, mode 040777
>>>>> Using logfile /etc/nodehandler/nodehandler_log.xml
>>>>> INFO init: NodeHandler Version 3.6.4-1 (849)
>>>>> INFO init: Experiment ID: grid_2006_08_13_16_56_19
>>>>> INFO Experiment: load system:exp:stdlib
>>>>> INFO prop.resetDelay: resetDelay = 180:Fixnum
>>>>> INFO Experiment: load system:exp:imageNode
>>>>> INFO prop.nodes: nodes = [1..20, 1..20]:Array
>>>>> INFO prop.image: image = "baseline.ndz":String
>>>>> INFO stdlib: 400 out of 400 node(s) still down n_16_19,n_6_1,n_20_16
>>>>> INFO stdlib: 400 out of 400 node(s) still down n_16_19,n_6_1,n_20_16
>>>>> INFO stdlib: 400 out of 400 node(s) still down n_16_19,n_6_1,n_20_16
>>>>> /tmp/eee.169/lib/util/communication.rb:127: warning: Insecure world writable dir /tmp, mode 040777
>>>>> INFO stdlib: 400 out of 400 node(s) still down n_16_19,n_6_1,n_20_16
>>>>> INFO n_18_19: Checked in as /ip/10.10.18.19 booting off baseline:1.0.9
>>>>> WARN n_18_19: Expected image 'pxe:1.1.4', but node reported 'baseline:1.0.9'.
>>>>> INFO n_18_19: Resseting node
>>>>> FATAL run: ServiceException: ServiceException
>>>>> Node (4,1) Not Registered for Testbed:
>>>>> '#<CMC::Testbed:0xa7ab5fc0>'
>>>>> INFO run: Experiment grid_2006_08_13_16_56_19 finished after 0:42
>>>>> done.
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Dr. Max Ott
>>>> Research Program Leader - Network and Pervasive Computing, NICTA Australia
>>>> Founder & CTO, Semandex Networks
>>>> Research Professor, WINLAB, Rutgers University
>>>>
>>>
>>>
>>>
>>
>>
>> --
>> Dr. Max Ott
>> Research Program Leader - Network and Pervasive Computing, NICTA Australia
>> Founder & CTO, Semandex Networks
>> Research Professor, WINLAB, Rutgers University
>>
>
>
>