ORBIT-USER: Node not registered for testbed?

chris at orderonenetworks.com chris at orderonenetworks.com
Mon Aug 14 08:15:47 EDT 2006


Max,

I'm hoping to be able to do a full run in the next few days. I'll let you
know if I encounter any problems.

To work around some nodes not being registered for the testbed, can I have
an experiment with something like this in it:

defNodes('sender', [[1,1],[1,2],[1,3],[1,4]..[2,1],[2,2]..[3,1]...]) {|node|
 ...
}

with basically a list of all 400 nodes, less those that don't work?
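
Rather than typing such a list by hand, it could be generated. Below is a
minimal Ruby sketch, assuming a 20x20 grid with coordinates 1..20 (as in the
imaging run quoted further down) and the 'allStatus' output format shown in
the quoted reply. The names GRID_SIZE, unavailable_from_status and
working_nodes are hypothetical, the attribute order in the regular expression
is assumed from the single sample line, and whether defNodes accepts an
explicit coordinate list is exactly the open question here.

```ruby
# Hypothetical helpers; nothing here is part of nodehandler itself.
GRID_SIZE = 20 # assumption: the main grid is 20x20, numbered from 1

# Extract [x, y] pairs for nodes reported as 'NODE NOT AVAILABLE'.
# Assumes the attribute order seen in the sample: name, x, y, state.
def unavailable_from_status(xml)
  xml.scan(/<node\s[^>]*x='(\d+)'\s+y='(\d+)'\s+state='NODE NOT AVAILABLE'/)
     .map { |x, y| [x.to_i, y.to_i] }
end

# Every [x, y] on the grid, minus the ones listed in 'broken'.
def working_nodes(broken, size = GRID_SIZE)
  all = (1..size).to_a.product((1..size).to_a)
  all - broken
end

# Sample input copied from the 'allStatus' line in the quoted reply.
status = "<node name = 'n_4_1' x='4' y='1' state='NODE NOT AVAILABLE' />"
broken = unavailable_from_status(status)
nodes  = working_nodes(broken)
puts "#{nodes.length} usable nodes, skipping #{broken.inspect}"
```

The resulting array has the same [[1,1],[1,2],...] shape as the list in the
defNodes call above, so it could be passed in its place if that form is
supported.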

Thanks,
Chris

> Chris,
>
> I'm not sure how much testing has been done with the currently installed
> versions of nodehandler and services.
>
> In the little time I have, I've started to rewrite many of the
> performance-limited parts, and I have repeatedly worked with all 400
> nodes without many problems.
>
> As for OML, there are reported cases of problems and large amounts of
> dropped measurements, but I have not been able to reproduce them.
> There seem to be some issues with the reported sequence numbers, but
> again, I haven't gotten to the bottom of that. It normally works for
> me.
>
> We have made some improvements to OML in the last few weeks but haven't
> tested them. If you have an experiment which stress-tests OML, please
> let me know.
>
> Thanks,
>
> -max
>
>
> On 8/14/06, chris at orderonenetworks.com <chris at orderonenetworks.com> wrote:
>> Max,
>>
>> Thanks very much for the detailed reply. It was unexpected and welcome
>> coming on a Sunday evening!
>>
>> I'm hoping to run a routing protocol scalability test making use of the
>> entire grid (if possible). So the more nodes that are able to be part of
>> it, the better.
>>
>> Will I need to do anything special to ensure that my OML logs (when I get
>> them working) can be collected from that many nodes?
>>
>> Is there a FAQ anywhere talking about issues when trying to use the entire
>> grid at once?
>>
>> Thank you again,
>> Chris
>>
>>
>>
>>
>> > Chris,
>> >
>> > You ran into a problem with the grid services.
>> >
>> > If you look at the design of the Orbit framework, you will see that
>> > the interface between the user and the system is what we call 'grid
>> > services'. They are a set of web services on which all user
>> > activities build.
>> >
>> > Specifically, there is a service called 'CMC' which allows a user to
>> > control the node hardware: switch on/off power, get environment
>> > measurements, such as voltage levels and temperature, and get access
>> > to the serial console.
>> >
>> > The nodehandler is using this service to switch the nodes on or off.
>> > The CMC service exposes a few methods for this - actually far too many
>> > at this stage. The CMC service also maintains a database for what
>> > nodes are in service and which aren't and that's what you are hitting.
>> > It appears that node (4,1) is 'out-of-service'.
>> >
>> > console:~$ wget -O - -q http://cmc:5012/cmc/allStatus | grep "'n_4_1'"
>> >   <node name = 'n_4_1' x='4' y='1' state='NODE NOT AVAILABLE' />
>> >
>> > The problem is that the CMC service is unfortunately not implemented
>> > consistently. When the nodehandler requests that an entire set of
>> > nodes be switched on, the CMC service quietly ignores all the
>> > 'out-of-service' nodes and reports success. As the nodehandler doesn't
>> > know that some nodes are out-of-service, it will later request that
>> > those nodes be reset, assuming they went astray. This time the CMC
>> > service reports an error, which leads to what you have observed.
>> >
>> > Now, what should you, or we, do? Unfortunately, the CMC service is a
>> > really important one to automate the operation of Orbit. We spent a
>> > lot of engineering effort in getting it nicely integrated into the
>> > hardware - and that part works really well. However, the software on
>> > the server side never reached the same level of maturity. Lots of
>> > history, water under the bridge.
>> >
>> > There are ways for the nodehandler to work around this. Let me see if
>> > I can come up with something.
>> >
>> > In the meantime, I can only ask you to use the 'allStatus' command I
>> > showed above to see if the nodes you need for your experiment are
>> > really ready for use.
>> >
>> > Sorry for the inconvenience,
>> >
>> > -max
>> >
>> >
>> >
>> > On 8/14/06, chris at orderonenetworks.com <chris at orderonenetworks.com> wrote:
>> >> Hello,
>> >>
>> >> I'm getting an odd message when trying to imageNodes on the main grid.
>> >>
>> >> FATAL run: ServiceException: ServiceException
>> >>         Node (4,1) Not Registered for Testbed: '#<CMC::Testbed:0xa7ab5fc0>'
>> >>
>> >> I've pasted the entire run below.
>> >>
>> >> If I try subsections of the grid, I get other nodes that give the same
>> >> error.
>> >>
>> >> I haven't seen this error on the sandboxes. How do I register a node for
>> >> the testbed?
>> >>
>> >> Thanks,
>> >> Chris
>> >>
>> >> ----------------------------
>> >> Imaging nodes: 1..20,1..20 with image baseline.ndz
>> >> Using config /etc/nodehandler/grid.cfg
>> >> /etc/nodehandler/grid.cfg:20: warning: Insecure world writable dir /tmp, mode 040777
>> >> Using logfile /etc/nodehandler/nodehandler_log.xml
>> >>  INFO init: NodeHandler Version 3.6.4-1 (849)
>> >>  INFO init: Experiment ID: grid_2006_08_13_16_56_19
>> >>  INFO Experiment: load system:exp:stdlib
>> >>  INFO prop.resetDelay: resetDelay = 180:Fixnum
>> >>  INFO Experiment: load system:exp:imageNode
>> >>  INFO prop.nodes: nodes = [1..20, 1..20]:Array
>> >>  INFO prop.image: image = "baseline.ndz":String
>> >>  INFO stdlib: 400 out of 400 node(s) still down n_16_19,n_6_1,n_20_16
>> >>  INFO stdlib: 400 out of 400 node(s) still down n_16_19,n_6_1,n_20_16
>> >>  INFO stdlib: 400 out of 400 node(s) still down n_16_19,n_6_1,n_20_16
>> >> /tmp/eee.169/lib/util/communication.rb:127: warning: Insecure world writable dir /tmp, mode 040777
>> >>  INFO stdlib: 400 out of 400 node(s) still down n_16_19,n_6_1,n_20_16
>> >>  INFO n_18_19: Checked in as /ip/10.10.18.19 booting off baseline:1.0.9
>> >>  WARN n_18_19: Expected image 'pxe:1.1.4', but node reported 'baseline:1.0.9'.
>> >>  INFO n_18_19: Resseting node
>> >> FATAL run: ServiceException: ServiceException
>> >>         Node (4,1) Not Registered for Testbed: '#<CMC::Testbed:0xa7ab5fc0>'
>> >>  INFO run: Experiment grid_2006_08_13_16_56_19 finished after 0:42
>> >>  done.
>> >>
>> >>
>> >>
>> >
>> >
>> > --
>> > Dr. Max Ott
>> > Research Program Leader - Network and Pervasive Computing, NICTA Australia
>> > Founder & CTO, Semandex Networks
>> > Research Professor, WINLAB, Rutgers University
>> >
>>
>>
>>
>
>
> --
> Dr. Max Ott
> Research Program Leader - Network and Pervasive Computing, NICTA Australia
> Founder & CTO, Semandex Networks
> Research Professor, WINLAB, Rutgers University
>




