ORBIT-USER: Problem with main grid
chris at orderonenetworks.com
Wed Aug 16 15:21:59 EDT 2006
Hello,
I'm running into problems with the main grid.
I have a test that runs well on the sandboxes. When I run the same test on
the grid with a small number of nodes (i.e. 2) it also works fine, but when
I scale it up to about 100 nodes it appears to hang.
The experiment where it hangs is grid_2006_08_16_14_53_59. I've attached
my experiment to the end of this email.
It seems to hang here:
INFO n_3_17: Device 'net/w1' reported 02:60:B3:AC:2C:DF
INFO n_2_16: Device 'net/w1' reported 02:60:B3:AC:2C:DF
INFO n_3_18: Device 'net/w1' reported 02:60:B3:AC:2C:DF
INFO OML: Started: {"port"=>"7000", "iface"=>"eth1", "addr"=>"224.0.0.6"}
after all the nodes have checked in.
Using 'top' and 'ps -ax', I can see that this process on console.grid:
3757 pts/3 Rl+ 5:30 /tmp/eee.305/bin/ruby /tmp/eee.305/app/nodeHandler.rb bootgrid
is sitting at 100% CPU usage.
When I ssh to an individual node and run 'ps -ax', I see the following
(omitting PIDs < 1000, which are all the standard system processes):
1463 ? Ss 0:00 /sbin/syslogd
1476 ? Ss 0:00 /sbin/klogd
1487 ? Ss 0:00 /usr/sbin/inetd
1496 ? Ss 0:00 /usr/sbin/sshd
1551 ? Ss 0:00 dhclient -e -pf /var/run/dhclient.eth1.pid -lf /var/run/dhclient.eth1.leases eth1
1558 ? Ss 0:00 /usr/sbin/cron
1573 ? Ss 0:00 /usr/sbin/nodeagent
1576 ? Sl 0:02 /tmp/eee.114/bin/ruby /tmp/eee.114/app/nodeAgent.rb
1579 ? SLs 0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid
1604 tty1 Ss+ 0:00 /sbin/getty 38400 tty1
1605 tty2 Ss+ 0:00 /sbin/getty 38400 tty2
1606 ttyS0 Ss+ 0:00 /bin/bash --
1636 ? Z 0:00 [ifconfig] <defunct>
1662 ? S 0:00 wget -q -O /tmp/0b3f171ac5454878a1d294406e987fcb.xml http://consolec:4000/omlc?id=oml_A_oon1
1719 ? Rs 0:00 sshd: root@pts/0
1723 pts/0 Ss 0:00 -bash
I let it sit in this state for quite a while and kept checking with
'ps -ax'. There was no change. I also manually entered the command line of
PID 1662 (the wget), and it just sits there for me as well until I Ctrl-C it.
It appears that when things work correctly, this process:
1662 ? S 0:00 wget -q -O /tmp/0b3f171ac5454878a1d294406e987fcb.xml http://consolec:4000/omlc?id=oml_A_oon1
gets replaced with a running version of my application.
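For anyone trying to reproduce the hanging wget request from a node, here is
a minimal sketch of the same HTTP fetch with explicit timeouts, so that it
fails with a clear result instead of blocking indefinitely. It assumes only
Ruby's standard net/http library; the function name probe_url and the timeout
values are hypothetical and not part of the ORBIT tools.

```ruby
require 'net/http'
require 'uri'

# Fetch a URL, giving up after a fixed number of seconds instead of
# blocking forever. Returns one of:
#   [:ok, body], [:http_error, status_code],
#   [:timeout, exception_class_name], [:error, description]
# (probe_url is a hypothetical helper, not an ORBIT API.)
def probe_url(url, open_timeout: 5, read_timeout: 10)
  uri = URI.parse(url)
  # Third argument nil: ignore any proxy set in the environment
  # (assumption: the console is reached directly from the node).
  http = Net::HTTP.new(uri.host, uri.port, nil)
  http.open_timeout = open_timeout   # seconds allowed to establish the connection
  http.read_timeout = read_timeout   # seconds allowed to wait for response data
  # Disable the automatic retry of idempotent requests where supported,
  # so a stuck server is reported after a single read timeout.
  http.max_retries = 0 if http.respond_to?(:max_retries=)

  response = http.request_get(uri.request_uri)
  if response.is_a?(Net::HTTPSuccess)
    [:ok, response.body]
  else
    [:http_error, response.code]
  end
rescue Net::OpenTimeout, Net::ReadTimeout => e
  [:timeout, e.class.name]
rescue StandardError => e
  [:error, "#{e.class}: #{e.message}"]
end
```

Calling probe_url('http://consolec:4000/omlc?id=oml_A_oon1') on a node would
then distinguish a server that accepts the connection but never answers
(:timeout) from one that refuses or cannot be resolved (:error).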
Is there any advice on how I can solve this?
Thanks,
Chris
experiment.rb
-------------------
Experiment.name = "oonScaleTest"
Experiment.project = "test:oonscale"

defNodes('A',[1..3,1..18]) { |node|
  node.prototype("test:proto:oonrouter1", {
    'rate' => '1000',
    'interface' => 'ath1',
    'nfilter' => nil,
    'logroute' => '10',
    'logother' => '4',
    'keepalive' => '10'
  })
}

defNodes('B',[1..3,1..18]) { |node|
  node.prototype("test:proto:oonrouter2", {
    'rate' => '1000',
    'interface' => 'ath0',
    'nfilter' => nil,
    'logroute' => '10',
    'logother' => '4',
    'keepalive' => '10'
  })
}

allNodes.net.w0 { |w|
  w.type = 'g'
  w.essid = "oontest"
  w.ip = "%192.168.%x.%y"
  w.mode = "ad-hoc"
  w.channel = 11
  w.rate = "11M"
}

allNodes.net.w1 { |w|
  w.type = 'g'
  w.essid = "oontest"
  w.ip = "%192.167.%x.%y"
  w.mode = "ad-hoc"
  w.channel = 11
  w.rate = "11M"
}

whenAllInstalled() { |node|
  wait 60
  allNodes.startApplications
  wait 3000
  Experiment.done
}