ORBIT-USER: Problem with main grid

Sachin Ganu sachinganu at gmail.com
Wed Aug 16 18:42:24 EDT 2006


I don't think the IP address is the main problem. From the experiment
log, it looks like the experiment has progressed beyond the point where
OML creates a database (in fact, a database with the experiment ID
was also created). Beyond that point, the experiment is stuck while
sending startAllApplications: the nodehandler keeps receiving
resend requests from nodes for the app command (an example is below).

At that point, Ctrl-C was hit.


2006-08-16 15:08:33 DEBUG nodeHandler::NodeHandler: Resend request for 14
2006-08-16 15:08:34 DEBUG nodeHandler::comm: process message (434):
'/r_3/c_13 0 HEARTBEAT 2 13 15:00:30}'
2006-08-16 15:08:35 DEBUG nodeHandler::comm: process message (435):
'/r_3/c_3 0 HEARTBEAT 3 13 15:00:30}'
2006-08-16 15:08:38 DEBUG nodeHandler::NodeHandler: Resend
message(224.4.0.1:9006-14): '-14 /B/* exec app:oon2 env -i
%OML_NAME=node%x-%y LD_LIBRARY_PATH=/usr/lib/
OML_CONFIG=http://consolec:4000/omlc?id=oml_B_oon2 /root/oon2 -m 4 -i
ath0 -r 1000 -l 10 -p 500 -c 0 -k 10 -f'
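
To see how often the nodes keep asking for the same message, the debug log can be
tallied by sequence number. The following is only a rough sketch: the helper name
count_resend_requests is made up, the pattern is taken from the "Resend request
for 14" line above, and the sample lines are modeled on the excerpt rather than
copied from the full log.

```ruby
# Tally "Resend request for <seq>" lines from nodeHandler debug output.
# Assumption: the line format matches the excerpt above.
def count_resend_requests(lines)
  counts = Hash.new(0)
  lines.each do |line|
    # Capture the sequence number that the nodes are asking to have resent.
    if (m = line.match(/Resend request for (\d+)/))
      counts[m[1].to_i] += 1
    end
  end
  counts
end

# Illustrative sample lines; the repeated request is assumed, not quoted.
sample = [
  "2006-08-16 15:08:33 DEBUG nodeHandler::NodeHandler: Resend request for 14",
  "2006-08-16 15:08:34 DEBUG nodeHandler::comm: process message (434):",
  "2006-08-16 15:08:39 DEBUG nodeHandler::NodeHandler: Resend request for 14"
]

count_resend_requests(sample).each do |seq, n|
  puts "message #{seq}: #{n} resend request(s)"
end
```

A count that keeps growing for a single sequence number would suggest that the
nodes never receive that one command, rather than a general slowdown.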

I am not sure whether the experiment would have worked had it not been aborted.
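
On the IP address point in the quoted message below: even though it is probably
not the cause of the hang, the addresses are easy to check. This is a minimal
sketch using Ruby's standard ipaddr library; the helper name private_ipv4? is
made up, and %x and %y are replaced with the sample coordinates 1 and 1 purely
for illustration.

```ruby
require 'ipaddr'

# The three RFC 1918 private IPv4 ranges.
PRIVATE_RANGES = %w[10.0.0.0/8 172.16.0.0/12 192.168.0.0/16].map do |cidr|
  IPAddr.new(cidr)
end

# True if the dotted-quad address lies inside one of the private ranges.
def private_ipv4?(address)
  ip = IPAddr.new(address)
  PRIVATE_RANGES.any? { |range| range.include?(ip) }
end

# Sample addresses built from the two patterns in the experiment script.
puts private_ipv4?("192.168.1.1")
puts private_ipv4?("192.167.1.1")
```

The first address falls inside 192.168.0.0/16 and the second does not, so
192.167.%x.%y is a publicly routable range.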

On 8/16/06, Sumathi Gopal <sumathi at winlab.rutgers.edu> wrote:
> Hi Chris,
>
> As an aside, I noticed that you have 192.167.%x.%y as the IP addresses for
> net/w1. These are not valid private IP addresses.
>
> Sumathi
>
> On Wed, 16 Aug 2006 chris at orderonenetworks.com wrote:
>
> > Hello,
> >
> > I'm running into problems with the main grid.
> >
> > I've got a test that runs great on the sandboxes. When I run the
> > same test on the grid with a few nodes (i.e., 2 nodes) it also works fine,
> > but when I scale it up to about 100 nodes it looks like things have hung.
> >
> > The experiment where it hangs is grid_2006_08_16_14_53_59. I've attached
> > my experiment to the end of this email.
> >
> > It seems to hang here:
> > INFO n_3_17: Device 'net/w1' reported 02:60:B3:AC:2C:DF
> > INFO n_2_16: Device 'net/w1' reported 02:60:B3:AC:2C:DF
> > INFO n_3_18: Device 'net/w1' reported 02:60:B3:AC:2C:DF
> > INFO OML: Started: {"port"=>"7000", "iface"=>"eth1", "addr"=>"224.0.0.6"}
> >
> > after all the nodes have checked in.
> >
> > Using 'top' and 'ps -ax', I see that this process on console.grid:
> > 3757 pts/3    Rl+    5:30 /tmp/eee.305/bin/ruby
> > /tmp/eee.305/app/nodeHandler.rb bootgrid
> >
> > is sitting at 100% CPU usage.
> >
> > When I ssh to an individual node and run 'ps -ax' I see (omitting pids < 1000,
> > which are all the standard stuff):
> >
> > 1463 ?        Ss     0:00 /sbin/syslogd
> > 1476 ?        Ss     0:00 /sbin/klogd
> > 1487 ?        Ss     0:00 /usr/sbin/inetd
> > 1496 ?        Ss     0:00 /usr/sbin/sshd
> > 1551 ?        Ss     0:00 dhclient -e -pf /var/run/dhclient.eth1.pid -lf
> > /var/run/dhclient.eth1.leases eth1
> > 1558 ?        Ss     0:00 /usr/sbin/cron
> > 1573 ?        Ss     0:00 /usr/sbin/nodeagent
> > 1576 ?        Sl     0:02 /tmp/eee.114/bin/ruby
> > /tmp/eee.114/app/nodeAgent.rb
> > 1579 ?        SLs    0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid
> > 1604 tty1     Ss+    0:00 /sbin/getty 38400 tty1
> > 1605 tty2     Ss+    0:00 /sbin/getty 38400 tty2
> > 1606 ttyS0    Ss+    0:00 /bin/bash --
> > 1636 ?        Z      0:00 [ifconfig] <defunct>
> > 1662 ?        S      0:00 wget -q -O
> > /tmp/0b3f171ac5454878a1d294406e987fcb.xml
> > http://consolec:4000/omlc?id=oml_A_oon1
> > 1719 ?        Rs     0:00 sshd: root@pts/0
> > 1723 pts/0    Ss     0:00 -bash
> >
> > I let it sit here for quite a while, and I kept checking with ps -ax.
> > There was no change. When I manually entered the command line for 1662, it
> > just sat there for me as well until I hit Ctrl-C.
> >
> > It appears that when things work correctly:
> >
> > 1662 ?        S      0:00 wget -q -O
> > /tmp/0b3f171ac5454878a1d294406e987fcb.xml
> > http://consolec:4000/omlc?id=oml_A_oon1
> >
> > gets replaced with a running version of my application.
> >
> > Is there any advice on how I can solve this?
> > Thanks,
> > Chris
> >
> >
> > experiment.rb
> > -------------------
> > Experiment.name = "oonScaleTest"
> > Experiment.project = "test:oonscale"
> >
> >
> > defNodes('A',[1..3,1..18]) { |node|
> >   node.prototype("test:proto:oonrouter1", {
> >       'rate' => '1000',
> >       'interface' => 'ath1',
> >       'nfilter' => nil,
> >       'logroute' => '10',
> >       'logother' => '4',
> >       'keepalive' => '10'
> >   })
> > }
> >
> > defNodes('B',[1..3,1..18]) { |node|
> >   node.prototype("test:proto:oonrouter2", {
> >     'rate' => '1000',
> >     'interface' => 'ath0',
> >     'nfilter' => nil,
> >     'logroute' => '10',
> >     'logother' => '4',
> >     'keepalive' => '10'
> >   })
> > }
> >
> > allNodes.net.w0 { |w|
> >  w.type = 'g'
> >  w.essid = "oontest"
> >  w.ip = "%192.168.%x.%y"
> >  w.mode = "ad-hoc"
> >  w.channel = 11
> >  w.rate = "11M"
> > }
> >
> > allNodes.net.w1 { |w|
> >  w.type = 'g'
> >  w.essid = "oontest"
> >  w.ip = "%192.167.%x.%y"
> >  w.mode = "ad-hoc"
> >  w.channel = 11
> >  w.rate = "11M"
> > }
> >
> > whenAllInstalled() {|node|
> >  wait 60
> >
> >  allNodes.startApplications
> >  wait 3000
> >
> >  Experiment.done
> > }
> >
> >
> >
>


