ORBIT-USER: Image Problems
Ivan Seskar
Seskar at winlab.rutgers.edu
Fri Jan 19 09:05:33 EST 2007
Hi Chris,
Just to add to that: we did fix 14 nodes yesterday and are actively
working on improved status reporting, but because of the grid occupancy
this all takes much longer than it should.
Ivan.
-----Original Message-----
From: owner-orbit-user at winlab.rutgers.edu
[mailto:owner-orbit-user at winlab.rutgers.edu] On Behalf Of Max Ott
Sent: Friday, January 19, 2007 2:11 AM
To: orbit-user at winlab.rutgers.edu
Subject: Re: ORBIT-USER: Image Problems
Chris,
The list of successful nodes can be extracted from the log file with
a grep and a cut looking for DONE.OK. If you are adventurous, you can
run some XSLT code on the generated XML file :)
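
A minimal sketch of that extraction in Python, in case it is easier to
feed into the rest of an experiment than a grep/cut pipeline. The log
layout is not shown in this thread, so two things here are assumptions:
that a successful node is reported on a line containing DONE.OK, and
that the node name appears on that same line in the n_X_Y form (as in
n_14_18). The function name and the sample lines are hypothetical.

```python
import re

# Assumption: node names look like n_14_18 (n_<x>_<y>).
NODE_RE = re.compile(r"\bn_\d+_\d+\b")


def successful_nodes(lines):
    """Return node names found on lines containing DONE.OK.

    Names are returned in first-seen order without duplicates.
    """
    nodes = []
    seen = set()
    for line in lines:
        # Equivalent of `grep DONE.OK` (as a fixed string).
        if "DONE.OK" not in line:
            continue
        # Equivalent of the `cut`: pull out the node name(s).
        for name in NODE_RE.findall(line):
            if name not in seen:
                seen.add(name)
                nodes.append(name)
    return nodes


if __name__ == "__main__":
    # Hypothetical log lines, for illustration only.
    sample = [
        "INFO n_1_1 DONE.OK",
        "INFO n_1_2 FAILED",
        "INFO n_2_3 DONE.OK",
    ]
    print(successful_nodes(sample))
```

To run it against a real log, pass an open file object instead of the
sample list, and adjust the pattern if the actual lines look different.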
The number of nodes available should be much higher than what you see
(and we all see). The current number is the result of many factors, most
of which are being addressed, while some are out of our reach due to
budget constraints.
Finally, the reason it won't finish has only shown up recently and is
the result of frisbee not finishing. It would help me a lot if you could
telnet into the node shown in parentheses (n_14_18 in your recent case),
download the /var/log/nodeAgent.log file and send it to me.
I haven't figured out if frisbee goes deaf, the network link goes down,
or there is a disk write error which goes unreported.
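
For anyone who wants to pick the stalled node out of the progress
output automatically, here is a rough sketch that parses a progress
line of the kind quoted below. The only format information available is
that single line, so the pattern is an assumption. Reading 0/98/100 as
min/avg/max follows the labels on the line itself; the three counters
inside Progress(...) are returned unnamed because their meaning is not
stated in this thread. The function name is hypothetical.

```python
import re

# Assumption: the line looks like
#   Progress(308/1/311): 0/98/100 min(n_14_18)/avg/max
PROGRESS_RE = re.compile(
    r"Progress\((\d+)/(\d+)/(\d+)\):\s*"
    r"(\d+)/(\d+)/(\d+)\s+min\(([^)]+)\)/avg/max"
)


def parse_progress(line):
    """Parse one progress line; return None if it does not match."""
    m = PROGRESS_RE.search(line)
    if m is None:
        return None
    return {
        # Meaning of these three counters is not documented here.
        "counters": (int(m.group(1)), int(m.group(2)), int(m.group(3))),
        "min": int(m.group(4)),
        "avg": int(m.group(5)),
        "max": int(m.group(6)),
        # The node named in parentheses, i.e. the one holding things up.
        "min_node": m.group(7),
    }


if __name__ == "__main__":
    line = ("INFO exp: Progress(308/1/311): 0/98/100 "
            "min(n_14_18)/avg/max (139.040551)")
    info = parse_progress(line)
    print(info["min_node"], info["min"], info["avg"], info["max"])
```

The node name it returns is the one whose /var/log/nodeAgent.log would
be worth collecting.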
Please remember that imaging is most likely the most stressful
application for the support infrastructure, and installing a full
operating system on 300+ nodes in about 15 minutes is a very impressive
number. That said, I agree that being kept waiting for another 45
minutes because the last three nodes do not finish is frustrating.
Hopefully the numbers and reliability will increase. There is a lot of
work going on behind the scenes to improve stability and reliability and
we all hope it will bear fruit soon.
Thanks,
-max
On 1/19/07, chris at orderonenetworks.com <chris at orderonenetworks.com>
wrote:
> Ivan,
>
> I ran the image test today and was only able to image 308 nodes. Do
> you anticipate these other nodes being able to be fixed? Or is this
> about the size that it will be?
>
> I'd like to be able to use as many nodes as possible, but I have no
> idea which nodes were able to image and which were not. Is there a way
> to get a list of all nodes that imaged successfully? If I had this
> list, I could then feed it into the rest of my experiment.
>
> As well, the imageNodes4 command still doesn't end if all the nodes
> don't image. I think I let it run for an hour or so, and it basically
> kept repeating this line:
>
> INFO exp: Progress(308/1/311): 0/98/100 min(n_14_18)/avg/max
> (139.040551)
>
> The experiment ID was 'grid_2007_01_18_17_37_25'.
>
> Thanks,
> Chris
>
>
>
--
Dr. Max Ott
Research Program Leader - Network and Pervasive Computing, NICTA Australia
Founder & CTO, Semandex Networks
Research Professor, WINLAB, Rutgers University
Senior Visiting Fellow, School of EE&T, UNSW