ORBIT-USER: Problem with main grid

chris at orderonenetworks.com
Wed Aug 16 19:04:44 EDT 2006


I've been looking into this some more (experiment grid_2006_08_16_18_44_52).

Everything goes well until:

... stuff before
 INFO n_3_1: Device 'net/w0' reported 02:15:00:84:30:81
 INFO n_2_7: Device 'net/w1' reported 02:60:B3:AC:A1:72
 INFO n_2_7: Device 'net/w0' reported 02:60:B3:AC:A1:72
 INFO OML: Started: {"port"=>"7000", "iface"=>"eth1", "addr"=>"224.0.0.6"}
ERROR LOST_HANDLER_ERROR: 'n_3_1' lost us
ERROR LOST_HANDLER_ERROR: 'n_3_9' lost us
ERROR LOST_HANDLER_ERROR: 'n_2_15' lost us
ERROR LOST_HANDLER_ERROR: 'n_2_11' lost us
... (a bunch like this)
 INFO n_3_11: Checked in as /ip/10.10.3.11 booting off baseline:1.0.9
 INFO n_3_9: Checked in as /ip/10.10.3.9 booting off baseline:1.0.9
 INFO n_2_15: Checked in as /ip/10.10.2.15 booting off baseline:1.0.9
 INFO n_2_11: Checked in as /ip/10.10.2.11 booting off baseline:1.0.9
 INFO n_2_4: Checked in as /ip/10.10.2.4 booting off baseline:1.0.9
... (bunch like this)
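
To see at a glance which nodes were lost and which checked back in, the
log lines can be tallied with a short script. This is only a sketch: it
assumes the log format shown in the excerpt above, the sample lines are
copied from that excerpt, and the names SAMPLE_LOG, lost, checked_in and
not_seen_again are made up for illustration. Because the excerpt is elided,
a node missing from the check-in list here may simply be in the part that
is not shown.

```ruby
# Sample lines copied from the log excerpt above (the real log is longer).
SAMPLE_LOG = <<~LOG_TEXT
  ERROR LOST_HANDLER_ERROR: 'n_3_1' lost us
  ERROR LOST_HANDLER_ERROR: 'n_3_9' lost us
  ERROR LOST_HANDLER_ERROR: 'n_2_15' lost us
  ERROR LOST_HANDLER_ERROR: 'n_2_11' lost us
  INFO n_3_11: Checked in as /ip/10.10.3.11 booting off baseline:1.0.9
  INFO n_3_9: Checked in as /ip/10.10.3.9 booting off baseline:1.0.9
  INFO n_2_15: Checked in as /ip/10.10.2.15 booting off baseline:1.0.9
  INFO n_2_11: Checked in as /ip/10.10.2.11 booting off baseline:1.0.9
  INFO n_2_4: Checked in as /ip/10.10.2.4 booting off baseline:1.0.9
LOG_TEXT

# Nodes the handler reported as lost.
lost = SAMPLE_LOG.scan(/LOST_HANDLER_ERROR: '(n_\d+_\d+)' lost us/).flatten

# Nodes that checked in (again) afterwards.
checked_in = SAMPLE_LOG.scan(/INFO (n_\d+_\d+): Checked in as/).flatten

# Lost nodes with no check-in line in this (elided) sample.
not_seen_again = lost - checked_in

puts "lost:           #{lost.join(', ')}"
puts "checked in:     #{checked_in.join(', ')}"
puts "not seen again: #{not_seen_again.join(', ')}"
```

Running the same two scans over the full nodehandler log, rather than
this sample, would show whether every lost node eventually checks back in.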

I made a point of watching the console as well. Just before the ERROR
LOST_HANDLER_ERROR lines appear, the console dumps:

ath_pci: driver unloaded
ath_rate_sample: unloaded
wlan: driver unloaded
ath_hal: driver unloaded

To confirm this, I ssh'd to the node, and iwconfig reported 'no wireless
extensions'.

I modified this experiment a little by making it wait forever (almost).

whenAllInstalled() { |node|
  wait 30

  allNodes.startApplications
  wait 180000  # long pause (almost forever)

  Experiment.done
}

The thing that really gets me is that the identical experiment works if I
limit the number of nodes:

Works: [2..3,1..8]
Dies: [2..3,1..15]
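
For a sense of scale, the two node sets can be expanded and counted. This
is a rough sketch, assuming the first range is x and the second is y, and
that nodes are named n_<x>_<y> as in the log output; expand_nodes is a
hypothetical helper, not part of the nodehandler.

```ruby
# Expand an [x_range, y_range] node rectangle into node names, using the
# n_<x>_<y> naming seen in the log. Hypothetical helper for illustration.
def expand_nodes(x_range, y_range)
  x_range.flat_map { |x| y_range.map { |y| "n_#{x}_#{y}" } }
end

works = expand_nodes(2..3, 1..8)    # the set that runs fine
dies  = expand_nodes(2..3, 1..15)   # the set that loses its nodes
extra = dies - works                # nodes only present in the failing run

puts "works: #{works.size} nodes"
puts "dies:  #{dies.size} nodes"
puts "extra: #{extra.size} nodes, starting with #{extra.first(3).join(', ')}"
```

Under those assumptions the working set has 16 nodes and the failing one
has 30, so the failure shows up somewhere between those two sizes.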

This is driving me buggy! Any suggestions for alternatives would be welcome,
or ideally, advice on how to make this nodehandler thing work.

Thanks again,
Chris


> I don't think the IP address is the main problem. From the experiment
> log, it looks like the experiment has progressed beyond the point where
> OML creates a database (in fact, a database with the experiment ID
> was also created). Beyond that point, the experiment is stuck when
> sending the startAllApplications, where the nodehandler keeps receiving
> resend requests from nodes for the app command (an example is below).
>
> At that point, Ctrl-C was hit.
>
>
> 2006-08-16 15:08:33 DEBUG nodeHandler::NodeHandler: Resend request for 14
> 2006-08-16 15:08:34 DEBUG nodeHandler::comm: process message (434):
> '/r_3/c_13 0 HEARTBEAT 2 13 15:00:30}'
> 2006-08-16 15:08:35 DEBUG nodeHandler::comm: process message (435):
> '/r_3/c_3 0 HEARTBEAT 3 13 15:00:30}'
> 2006-08-16 15:08:38 DEBUG nodeHandler::NodeHandler: Resend
> message(224.4.0.1:9006-14): '-14 /B/* exec app:oon2 env -i
> %OML_NAME=node%x-%y LD_LIBRARY_PATH=/usr/lib/
> OML_CONFIG=http://consolec:4000/omlc?id=oml_B_oon2 /root/oon2 -m 4 -i
> ath0 -r 1000 -l 10 -p 500 -c 0 -k 10 -f'
>
> I am not sure if the experiment would have worked if not aborted....
>
> On 8/16/06, Sumathi Gopal <sumathi at winlab.rutgers.edu> wrote:
>> Hi Chris,
>>
>> As an aside, I noticed that you have 192.167.%x.%y as the IP addresses
>> for net/w1. Those are not valid private IP addresses.
>>
>> Sumathi
>>
>> On Wed, 16 Aug 2006 chris at orderonenetworks.com wrote:
>>
>> > Hello,
>> >
>> > I'm running into problems with the main grid.
>> >
>> > I've got a test that runs great on the sandboxes. When I run the same
>> > test on the grid at small scale (i.e. 2 nodes) it works fine, but when
>> > I scale it up to about 100 nodes it looks like things have hung.
>> >
>> > The experiment where it hangs is grid_2006_08_16_14_53_59. I've
>> > attached my experiment to the end of this email.
>> >
>> > It seems to hang here:
>> > INFO n_3_17: Device 'net/w1' reported 02:60:B3:AC:2C:DF
>> > INFO n_2_16: Device 'net/w1' reported 02:60:B3:AC:2C:DF
>> > INFO n_3_18: Device 'net/w1' reported 02:60:B3:AC:2C:DF
>> > INFO OML: Started: {"port"=>"7000", "iface"=>"eth1", "addr"=>"224.0.0.6"}
>> >
>> > after all the nodes have checked in.
>> >
>> > Using 'top' and 'ps -ax', I see that this process on console.grid:
>> > 3757 pts/3    Rl+    5:30 /tmp/eee.305/bin/ruby
>> > /tmp/eee.305/app/nodeHandler.rb bootgrid
>> >
>> > is sitting at 100% CPU usage.
>> >
>> > When I ssh to an individual node and run 'ps -ax', I see the following
>> > (omitting pids < 1000, which are all the standard stuff):
>> >
>> > 1463 ?        Ss     0:00 /sbin/syslogd
>> > 1476 ?        Ss     0:00 /sbin/klogd
>> > 1487 ?        Ss     0:00 /usr/sbin/inetd
>> > 1496 ?        Ss     0:00 /usr/sbin/sshd
>> > 1551 ?        Ss     0:00 dhclient -e -pf /var/run/dhclient.eth1.pid -lf
>> > /var/run/dhclient.eth1.leases eth1
>> > 1558 ?        Ss     0:00 /usr/sbin/cron
>> > 1573 ?        Ss     0:00 /usr/sbin/nodeagent
>> > 1576 ?        Sl     0:02 /tmp/eee.114/bin/ruby
>> > /tmp/eee.114/app/nodeAgent.rb
>> > 1579 ?        SLs    0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid
>> > 1604 tty1     Ss+    0:00 /sbin/getty 38400 tty1
>> > 1605 tty2     Ss+    0:00 /sbin/getty 38400 tty2
>> > 1606 ttyS0    Ss+    0:00 /bin/bash --
>> > 1636 ?        Z      0:00 [ifconfig] <defunct>
>> > 1662 ?        S      0:00 wget -q -O
>> > /tmp/0b3f171ac5454878a1d294406e987fcb.xml
>> > http://consolec:4000/omlc?id=oml_A_oon1
>> > 1719 ?        Rs     0:00 sshd: root@pts/0
>> > 1723 pts/0    Ss     0:00 -bash
>> >
>> > I let it sit here for quite a while, and I kept checking with ps -ax.
>> > There was no change. I manually entered the command line for 1662, and
>> > it just sat there for me as well until I hit Ctrl-C.
>> >
>> > It appears that when things work correctly:
>> >
>> > 1662 ?        S      0:00 wget -q -O
>> > /tmp/0b3f171ac5454878a1d294406e987fcb.xml
>> > http://consolec:4000/omlc?id=oml_A_oon1
>> >
>> > gets replaced with a running version of my application.
>> >
>> > Is there any advice on how I can solve this?
>> > Thanks,
>> > Chris
>> >
>> >
>> > experiment.rb
>> > -------------------
>> > Experiment.name = "oonScaleTest"
>> > Experiment.project = "test:oonscale"
>> >
>> >
>> > defNodes('A',[1..3,1..18]) { |node|
>> >   node.prototype("test:proto:oonrouter1", {
>> >       'rate' => '1000',
>> >       'interface' => 'ath1',
>> >       'nfilter' => nil,
>> >       'logroute' => '10',
>> >       'logother' => '4',
>> >       'keepalive' => '10'
>> >   })
>> > }
>> >
>> > defNodes('B',[1..3,1..18]) { |node|
>> >   node.prototype("test:proto:oonrouter2", {
>> >     'rate' => '1000',
>> >     'interface' => 'ath0',
>> >     'nfilter' => nil,
>> >     'logroute' => '10',
>> >     'logother' => '4',
>> >     'keepalive' => '10'
>> >   })
>> > }
>> >
>> > allNodes.net.w0 { |w|
>> >  w.type = 'g'
>> >  w.essid = "oontest"
>> >  w.ip = "%192.168.%x.%y"
>> >  w.mode = "ad-hoc"
>> >  w.channel = 11
>> >  w.rate = "11M"
>> > }
>> >
>> > allNodes.net.w1 { |w|
>> >  w.type = 'g'
>> >  w.essid = "oontest"
>> >  w.ip = "%192.167.%x.%y"
>> >  w.mode = "ad-hoc"
>> >  w.channel = 11
>> >  w.rate = "11M"
>> > }
>> >
>> > whenAllInstalled() {|node|
>> >  wait 60
>> >
>> >  allNodes.startApplications
>> >  wait 3000
>> >
>> >  Experiment.done
>> > }
>> >
>> >
>> >
>>
>
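
On the quoted aside about 192.167.%x.%y: a quick way to check whether an
address falls in one of the RFC 1918 private ranges is sketched below. The
helper name rfc1918? and the sample addresses (the w0 and w1 patterns from
the quoted experiment with x=2, y=15 filled in) are assumptions made for
illustration.

```ruby
require 'ipaddr'

# The three private IPv4 blocks defined by RFC 1918.
RFC1918_BLOCKS = %w[10.0.0.0/8 172.16.0.0/12 192.168.0.0/16].map do |cidr|
  IPAddr.new(cidr)
end

# True if the address lies inside any RFC 1918 block. Hypothetical helper.
def rfc1918?(address)
  ip = IPAddr.new(address)
  RFC1918_BLOCKS.any? { |block| block.include?(ip) }
end

# Sample addresses: the w0 and w1 patterns with x=2, y=15 substituted.
%w[192.168.2.15 192.167.2.15].each do |addr|
  puts "#{addr}: #{rfc1918?(addr) ? 'private' : 'not private'}"
end
```

With these inputs the 192.168 address is reported as private and the
192.167 address is not, which matches the remark in the quoted reply.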




