ORBIT-USER: Problem with main grid

Sumathi Gopal sumathi at winlab.rutgers.edu
Wed Aug 16 16:00:10 EDT 2006


Hi Chris,

As an aside, I noticed that you have 192.167.%x.%y as the IP addresses for
net/w1. Those are not valid private IP addresses (192.167.0.0/16 lies outside
the private 192.168.0.0/16 block).
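
If it helps, here is a rough way to check this, using only Ruby's standard
ipaddr library. The expand helper and the sample node coordinates (3, 17) are
made up for illustration; the helper only mimics what the %x.%y pattern in the
script appears to do and is not the nodeHandler's own code.

```ruby
require 'ipaddr'

# RFC 1918 private IPv4 ranges.
PRIVATE_RANGES = %w[10.0.0.0/8 172.16.0.0/12 192.168.0.0/16].map do |cidr|
  IPAddr.new(cidr)
end

# Returns true if the dotted-quad address falls inside an RFC 1918 range.
def rfc1918?(address)
  ip = IPAddr.new(address)
  PRIVATE_RANGES.any? { |range| range.include?(ip) }
end

# Hypothetical helper (an assumption, not ORBIT code): substitutes the
# node's x and y coordinates into an address pattern.
def expand(pattern, x, y)
  pattern.sub('%x', x.to_s).sub('%y', y.to_s)
end

w0 = expand('192.168.%x.%y', 3, 17)
w1 = expand('192.167.%x.%y', 3, 17)

puts "#{w0} private? #{rfc1918?(w0)}"  # 192.168.3.17 private? true
puts "#{w1} private? #{rfc1918?(w1)}"  # 192.167.3.17 private? false
```

Moving net/w1 to another subnet inside one of the private ranges (distinct
from the one used by net/w0) would avoid the problem.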

Sumathi

On Wed, 16 Aug 2006 chris at orderonenetworks.com wrote:

> Hello,
>
> I'm running into problems with the main grid.
>
> I've got a test that runs great on the sandboxes. When I run the same
> test on the grid at small scale (i.e. 2 nodes) it works fine, but when I
> scale it up to about 100 nodes it looks like things have hung.
>
> The experiment where it hangs is grid_2006_08_16_14_53_59. I've attached
> my experiment to the end of this email.
>
> It seems to hang here:
> INFO n_3_17: Device 'net/w1' reported 02:60:B3:AC:2C:DF
> INFO n_2_16: Device 'net/w1' reported 02:60:B3:AC:2C:DF
> INFO n_3_18: Device 'net/w1' reported 02:60:B3:AC:2C:DF
> INFO OML: Started: {"port"=>"7000", "iface"=>"eth1", "addr"=>"224.0.0.6"}
>
> after all the nodes have checked in.
>
> Using 'top' and 'ps -ax', I can see that this process on console.grid:
> 3757 pts/3    Rl+    5:30 /tmp/eee.305/bin/ruby
> /tmp/eee.305/app/nodeHandler.rb bootgrid
>
> is sitting at 100% CPU usage.
>
> When I ssh to an individual node and run 'ps -ax', I see the following
> (omitting PIDs < 1000, which are all the standard stuff):
>
> 1463 ?        Ss     0:00 /sbin/syslogd
> 1476 ?        Ss     0:00 /sbin/klogd
> 1487 ?        Ss     0:00 /usr/sbin/inetd
> 1496 ?        Ss     0:00 /usr/sbin/sshd
> 1551 ?        Ss     0:00 dhclient -e -pf /var/run/dhclient.eth1.pid -lf
> /var/run/dhclient.eth1.leases eth1
> 1558 ?        Ss     0:00 /usr/sbin/cron
> 1573 ?        Ss     0:00 /usr/sbin/nodeagent
> 1576 ?        Sl     0:02 /tmp/eee.114/bin/ruby
> /tmp/eee.114/app/nodeAgent.rb
> 1579 ?        SLs    0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid
> 1604 tty1     Ss+    0:00 /sbin/getty 38400 tty1
> 1605 tty2     Ss+    0:00 /sbin/getty 38400 tty2
> 1606 ttyS0    Ss+    0:00 /bin/bash --
> 1636 ?        Z      0:00 [ifconfig] <defunct>
> 1662 ?        S      0:00 wget -q -O
> /tmp/0b3f171ac5454878a1d294406e987fcb.xml
> http://consolec:4000/omlc?id=oml_A_oon1
> 1719 ?        Rs     0:00 sshd: root@pts/0
> 1723 pts/0    Ss     0:00 -bash
>
> I let it sit here for quite a while, and I kept checking with ps -ax.
> There was no change. I manually entered the command line for PID 1662 and
> it just sat there for me as well until I hit ctrl-c.
>
> It appears that when things work correctly:
>
> 1662 ?        S      0:00 wget -q -O
> /tmp/0b3f171ac5454878a1d294406e987fcb.xml
> http://consolec:4000/omlc?id=oml_A_oon1
>
> gets replaced with a running version of my application.
>
> Is there any advice on how I can solve this?
> Thanks,
> Chris
>
>
> experiment.rb
> -------------------
> Experiment.name = "oonScaleTest"
> Experiment.project = "test:oonscale"
>
>
> defNodes('A',[1..3,1..18]) { |node|
>   node.prototype("test:proto:oonrouter1", {
>       'rate' => '1000',
>       'interface' => 'ath1',
>       'nfilter' => nil,
>       'logroute' => '10',
>       'logother' => '4',
>       'keepalive' => '10'
>   })
> }
>
> defNodes('B',[1..3,1..18]) { |node|
>   node.prototype("test:proto:oonrouter2", {
>     'rate' => '1000',
>     'interface' => 'ath0',
>     'nfilter' => nil,
>     'logroute' => '10',
>     'logother' => '4',
>     'keepalive' => '10'
>   })
> }
>
> allNodes.net.w0 { |w|
>  w.type = 'g'
>  w.essid = "oontest"
>  w.ip = "%192.168.%x.%y"
>  w.mode = "ad-hoc"
>  w.channel = 11
>  w.rate = "11M"
> }
>
> allNodes.net.w1 { |w|
>  w.type = 'g'
>  w.essid = "oontest"
>  w.ip = "%192.167.%x.%y"
>  w.mode = "ad-hoc"
>  w.channel = 11
>  w.rate = "11M"
> }
>
> whenAllInstalled() {|node|
>  wait 60
>
>  allNodes.startApplications
>  wait 3000
>
>  Experiment.done
> }
>
>
>


