ORBIT-USER: most of the grid does not work!

Mesut Ali Ergin ergin at eden.rutgers.edu
Wed Feb 14 22:37:33 EST 2007


Dear Andrea,

From time to time I get to conduct experiments with the maximum possible
number of ORBIT nodes. Please allow me to make some suggestions
before you come to conclusive (and potentially offensive) generic
statements like "Can somebody try to fix the grid?" and "Most of the
nodes do not work..." etc. My experience with full imaging (that is,
attempting to image all 400 nodes) has been, contrary to what you have
said, acceptably good. I cannot say that the process is perfect; in
fact, it is far from perfect. However, getting more than 320 nodes
imaged in 70% of my attempts, and never getting fewer than 250 nodes
imaged, qualifies as 'acceptably good' from my viewpoint. To be more
precise on this matter, I am attaching an imaging performance histogram
from my 'full imaging required' experiments over the last three months
for the attention of the ORBIT community. More detailed stats (which
nodes were more robust than others, etc.) for those experiments are
available in case someone needs them. Long story short, let me provide
a few suggestions for getting more nodes imaged in a shorter amount of
time, and for getting help from the list about such subjects in general.

(i) Please make sure that you check the status of the nodes before you
start imaging (I cannot stress this enough if you are attempting a
full imaging). The easiest way to do it is via
http://orbit-lab.org/wiki/Status . Red nodes there are the ones that
will be reported missing if you start imaging right away. Luckily,
there are usually only a handful of them (fewer than that these
days). Also, there is a small possibility that the blue nodes there
might fail to get an image. So the best practice is to use the CMC
allOffSoft command (and, if absolutely necessary, allOffHard after a
couple of allOffSoft attempts) to turn off as many nodes as possible
(kindly refer to the CMC documentation for details on using the CMC web
service). More gray nodes there mean a higher chance of getting more
nodes to complete imaging successfully. Although there are some
automated processes in the ORBIT infrastructure that help get nodes into
a healthy state before imaging and after a fellow user logs out, extra
care won't hurt and will save you a lot of time if you run into
"a bad day for imaging".

(ii) Use imageNodes4 instead of imageNodes if you are attempting full 
imaging.

(iii) Do not forget that imaging (like almost everything involved in
ORBIT experiments) is a *physical process* that literally goes to 400
nodes, instructs their power supplies to turn on, registers them for PXE
imaging, and delivers gigabytes of data in minutes to be written to 400
magnetic disk drives. We, as people spending countless hours in front of
a plain old UNIX terminal, sometimes forget that fact and expect
things to happen instantaneously after hitting Enter. An example of that
is hitting Ctrl-C at any time during imaging and restarting imaging
right afterwards. In my experience it takes somewhere between 2 and 5
minutes for the ORBIT infrastructure to restore the nodes to the state
they were in before imaging started. I think the majority of the cases
where hundreds of nodes get reported missing (or reported as given
up on) are somehow related to this. If you had to abort the imaging
process by hitting Ctrl-C, think about doing the things listed in (i)
before restarting imaging. I assure you that the time you spend doing
that will save you much more when it comes to achieving your objective
here (the maximum possible number of nodes imaged).

(iv) Do not tie yourself to specific nodes, and take advantage of the
nodeHandler imaging log in the /tmp folder (use your experiment ID to
find the right log file) to determine which nodes were imaged
successfully. If applicable, automate your scripting so that the
duties of the nodes in your experiment are assigned automatically from
the successfully imaged nodes found as described above. If you like, you
can ask me for a simple script that extracts a list of nodes (in a
format useful for ssh, sftp, scp, etc.) from that log. Keep in mind
that we are talking about 400 tiny PCs that are repeatedly "abused"
(imaged again and again, forced to hard reset, etc.) every single
day. They absolutely do fail, and the ORBIT administrators try hard to
keep as many of them as possible alive. However, there is no guarantee
that a given node (e.g., your faithful servant AP 14-5) will always be
there waiting for your image.
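
For illustration only, here is a minimal Python sketch of the kind of
post-processing described in (iv). It is not the script mentioned above.
It only recognizes the two warning formats quoted later in this thread
("Giving up on node n_X_Y" and "Ignoring missing node 'X@Y'", which the
list archive displays as 'X at Y'), and it assumes that every requested
node not named in such a warning was imaged successfully. The 20x20
coordinate range, the function names and the role names are assumptions
made for the example.

```python
import re

# Patterns taken from the warning lines quoted later in this thread.
GIVE_UP = re.compile(r"Giving up on node n_(\d+)_(\d+)")
# The list archive shows "8 at 1"; the raw log presumably prints "8@1".
MISSING = re.compile(r"Ignoring missing node '(\d+)\s*(?:@|at)\s*(\d+)'")


def failed_nodes(log_text):
    """Return the set of (x, y) coordinates reported as missing or given up."""
    failed = set()
    for pattern in (GIVE_UP, MISSING):
        for x, y in pattern.findall(log_text):
            failed.add((int(x), int(y)))
    return failed


def usable_nodes(requested, log_text):
    """Requested nodes minus failed ones, sorted.

    Assumption: a node with no warning in the log imaged fine.
    """
    return sorted(set(requested) - failed_nodes(log_text))


def assign_roles(nodes, roles):
    """Hand out roles, in order, to whichever nodes survived imaging."""
    if len(nodes) < len(roles):
        raise ValueError("not enough imaged nodes for the requested roles")
    return dict(zip(roles, nodes))


if __name__ == "__main__":
    sample = """
    WARN -:topo:image: Ignoring missing node '8@1'
    WARN stdlib: Giving up on node n_10_3
    WARN stdlib: Giving up on node n_4_4
    """
    # Assumed layout: 400 nodes on a 20x20 grid with coordinates 1..20.
    requested = [(x, y) for x in range(1, 21) for y in range(1, 21)]
    good = usable_nodes(requested, sample)
    print(len(good), "usable nodes")
    # Hypothetical role names; roles go to the first surviving nodes.
    print(assign_roles(good, ["ap", "sender", "receiver"]))
```

Because roles are assigned from the surviving list rather than from
fixed coordinates, the experiment script does not need to change when a
particular node fails to image.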

(v) Do not insist on reaching a certain number of nodes if you can
already live with what has completed so far. The number of completed
nodes is reported periodically by the nodeHandler in the form of:

*INFO nodeHandler::exp: Progress(308/0/324): 90/98/100 
min(n_2_2)/avg/max (144.379793)*

The above line means that nodeHandler is actively attempting to image
324 nodes and that 308 of them have already completed imaging
successfully. You can safely hit Ctrl-C once, let nodeHandler clean up
gracefully, and use the mentioned log file to extract the useful nodes
(i.e., the 308 in the above case). If you absolutely need to reach a
certain number, try checking the ones that do not seem to be getting
the image (e.g. the ones reported as *min(n_X_Y)* by the nodeHandler)
by logging onto their respective consoles (telnet 10.1.X.Y 3025) and
looking for odd things (hda errors in system logs, lack of an IP address
reflecting the coordinate of the node, etc.). If you see such things,
you can cross that node off your list (at least for your current slot),
stop waiting for it to complete, and move forward with some other nodes.
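
As a rough sketch of how the progress line could be read by a script,
the Python below pulls out the completed and total counts and the node
named in min(n_X_Y), and builds the console command from (v). The
meaning of the middle counter and of the trailing number in parentheses
is not stated in this e-mail, so they are ignored; the three
min/avg/max values are returned as-is without interpreting them. The
function and key names are made up for the example.

```python
import re

# Matches e.g. "Progress(308/0/324): 90/98/100 min(n_2_2)/avg/max (...)".
# \s+ also tolerates the line break that mail wrapping may introduce.
PROGRESS = re.compile(
    r"Progress\((\d+)/(\d+)/(\d+)\):\s*(\d+)/(\d+)/(\d+)\s+min\(n_(\d+)_(\d+)\)"
)


def parse_progress(line):
    """Return a dict describing a nodeHandler progress line, or None."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    done, _unknown, total, vmin, vavg, vmax, x, y = map(int, match.groups())
    return {
        "completed": done,          # nodes that finished imaging
        "total": total,             # nodes being imaged
        "min": vmin,
        "avg": vavg,
        "max": vmax,
        "slowest_node": f"n_{x}_{y}",
        # Console address as given in (v): telnet 10.1.X.Y 3025
        "console": f"telnet 10.1.{x}.{y} 3025",
    }


if __name__ == "__main__":
    line = ("INFO nodeHandler::exp: Progress(308/0/324): 90/98/100 "
            "min(n_2_2)/avg/max (144.379793)")
    info = parse_progress(line)
    print(info["completed"], "of", info["total"], "done")
    print("check the slowest node with:", info["console"])
```

A wrapper could watch these values and tell you when the completed
count is already enough for your experiment, so that you can stop with
a single Ctrl-C instead of waiting for the stragglers.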


(vi) Last but not least (as reminded numerous times before), please
give as much detail as possible about the error you have experienced,
including the experiment ID (the identifier in the following form:
grid_XXXX_XX_XX_XX_XX_XX ). Simply pasting the list of nodes that are
throwing errors will be of little help to someone who decides to
spend some of his/her time helping you out.
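
If you want to check that a report actually carries an experiment ID
before sending it, something like the following Python sketch would do.
It assumes that each X in grid_XXXX_XX_XX_XX_XX_XX stands for a decimal
digit, which this e-mail does not state explicitly; the function name
and the sample ID are made up for the example.

```python
import re

# Assumed shape: "grid_" + 4 digits, then five groups of "_" + 2 digits.
EXPERIMENT_ID = re.compile(r"\bgrid_\d{4}(?:_\d{2}){5}\b")


def find_experiment_ids(text):
    """Return every string in text that looks like an experiment ID."""
    return EXPERIMENT_ID.findall(text)


if __name__ == "__main__":
    # Hypothetical report text with a made-up ID.
    report = "Imaging failed during grid_2007_02_14_22_37_33, log attached."
    ids = find_experiment_ids(report)
    if ids:
        print("experiment ID found:", ids[0])
    else:
        print("no experiment ID in the report; please add one")
```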

Obviously, the above list is not complete or exhaustive in any way. So
you, I and others may, and probably will, encounter errors in the future
during imaging and other processes involved in ORBIT. My suggestion
would be to stay calm and ask for help right away via the list. People
involved in ORBIT have proven to be friendly about all sorts of issues
and cooperative in solving them, no matter what their responsibilities
are. Finally, all suggestions and statements in this e-mail come
from my personal experience as a fellow ORBIT user, and do not
necessarily constitute ORBIT's official view. I hope these will make
your life a bit easier on ORBIT.

Best regards,

-- 
Mesut Ali Ergin
ergin at winlab.rutgers.edu

Rutgers University, WINLAB,
Technology Center of New Jersey,
671 Rt. 1 South, North Brunswick,
New Jersey, 08902-3390, USA

Phone: 862-368-6620
Fax:   732-932-6882


Andrea G Forte wrote:
> Dear ORBIT administrators,
> 
> could somebody please try to fix the nodes in the grid? Most of the 
> nodes do not work and even imaging fails miserably. On top of this, many 
> nodes succeed in coming up but then the image process never ends, 
> everything gets stuck and I have to re-start again (an example is node 
> 5_5).
> *Please* could someone look into fixing the grid?
> Here is the list:
> WARN -:topo:image: Ignoring missing node '8 at 1'
> WARN -:topo:image: Ignoring missing node '14 at 18'
> WARN -:topo:image: Ignoring missing node '16 at 11'
> WARN stdlib: Giving up on node n_10_3
> WARN stdlib: Giving up on node n_19_7
> WARN stdlib: Giving up on node n_4_4
> WARN stdlib: Giving up on node n_5_14
> WARN stdlib: Giving up on node n_13_3
> WARN stdlib: Giving up on node n_6_1
> WARN stdlib: Giving up on node n_1_14
> WARN stdlib: Giving up on node n_15_8
> WARN stdlib: Giving up on node n_3_10
> WARN stdlib: Giving up on node n_17_15
> WARN stdlib: Giving up on node n_2_15
> WARN stdlib: Giving up on node n_12_14
> WARN stdlib: Giving up on node n_13_2
> WARN stdlib: Giving up on node n_7_19
> WARN stdlib: Giving up on node n_5_12
> WARN stdlib: Giving up on node n_4_19
> WARN stdlib: Giving up on node n_16_17
> WARN stdlib: Giving up on node n_10_19
> WARN stdlib: Giving up on node n_19_4
> WARN stdlib: Giving up on node n_6_10
> WARN stdlib: Giving up on node n_18_8
> WARN stdlib: Giving up on node n_4_1
> WARN stdlib: Giving up on node n_4_16
> WARN stdlib: Giving up on node n_9_16
> WARN stdlib: Giving up on node n_11_11
> WARN stdlib: Giving up on node n_6_12
> WARN stdlib: Giving up on node n_1_12
> WARN stdlib: Giving up on node n_16_13
> WARN stdlib: Giving up on node n_4_13
> WARN stdlib: Giving up on node n_3_4
> WARN stdlib: Giving up on node n_19_15
> WARN stdlib: Giving up on node n_9_5
> WARN stdlib: Giving up on node n_16_5
> WARN stdlib: Giving up on node n_8_4
> WARN stdlib: Giving up on node n_9_19
> WARN stdlib: Giving up on node n_12_13
> WARN stdlib: Giving up on node n_2_14
> WARN stdlib: Giving up on node n_10_14
> WARN stdlib: Giving up on node n_12_9
> WARN stdlib: Giving up on node n_12_7
> WARN stdlib: Giving up on node n_1_3
> WARN stdlib: Giving up on node n_9_15
> WARN stdlib: Giving up on node n_11_2
> WARN stdlib: Giving up on node n_12_3
> WARN stdlib: Giving up on node n_15_11
> WARN stdlib: Giving up on node n_11_5
> WARN stdlib: Giving up on node n_3_16
> WARN stdlib: Giving up on node n_7_13
> WARN stdlib: Giving up on node n_7_3
> WARN stdlib: Giving up on node n_5_17
> WARN stdlib: Giving up on node n_19_11
> WARN stdlib: Giving up on node n_19_5
> WARN stdlib: Giving up on node n_14_17
> WARN stdlib: Giving up on node n_17_17
> WARN stdlib: Giving up on node n_10_12
> WARN stdlib: Giving up on node n_16_10
> WARN stdlib: Giving up on node n_2_17
> WARN stdlib: Giving up on node n_4_11
> WARN stdlib: Giving up on node n_11_8
> WARN stdlib: Giving up on node n_3_7
> WARN stdlib: Giving up on node n_19_1
> WARN stdlib: Giving up on node n_9_18
> WARN stdlib: Giving up on node n_11_7
> WARN stdlib: Giving up on node n_1_13
> WARN stdlib: Giving up on node n_9_11
> WARN stdlib: Giving up on node n_11_4
> WARN stdlib: Giving up on node n_17_6
> WARN stdlib: Giving up on node n_14_19
> WARN stdlib: Giving up on node n_14_2
> WARN stdlib: Giving up on node n_7_8
> WARN stdlib: Giving up on node n_19_17
> WARN stdlib: Giving up on node n_13_6
> WARN stdlib: Giving up on node n_7_17
> WARN stdlib: Giving up on node n_3_3
> WARN stdlib: Giving up on node n_3_12
> WARN stdlib: Giving up on node n_12_6
> WARN stdlib: Giving up on node n_2_16
> WARN stdlib: Giving up on node n_8_11
> WARN stdlib: Giving up on node n_15_4
> WARN stdlib: Giving up on node n_4_2
> WARN stdlib: Giving up on node n_3_19
> WARN stdlib: Giving up on node n_14_8
> WARN stdlib: Giving up on node n_13_17
> WARN stdlib: Giving up on node n_11_12
> WARN stdlib: Giving up on node n_11_17
> WARN stdlib: Giving up on node n_10_9
> WARN stdlib: Giving up on node n_4_12
> WARN stdlib: Giving up on node n_5_2
> WARN stdlib: Giving up on node n_8_2
> WARN stdlib: Giving up on node n_13_1
> WARN stdlib: Giving up on node n_5_15
> WARN stdlib: Giving up on node n_17_11
> WARN stdlib: Giving up on node n_19_18
> WARN stdlib: Giving up on node n_10_4
> WARN stdlib: Giving up on node n_13_5
> WARN stdlib: Giving up on node n_14_1
> WARN stdlib: Giving up on node n_8_15
> WARN stdlib: Giving up on node n_16_9
> WARN stdlib: Giving up on node n_6_2
> WARN stdlib: Giving up on node n_11_13
> WARN stdlib: Giving up on node n_6_17
> WARN stdlib: Giving up on node n_14_9
> WARN stdlib: Giving up on node n_14_7
> WARN stdlib: Giving up on node n_8_9
> WARN stdlib: Giving up on node n_5_19
> WARN stdlib: Giving up on node n_11_9
> 
> -Andrea
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: imagingPerf.png
Type: image/png
Size: 29406 bytes
Desc: not available
Url : http://orbit-lab.org/pipermail/orbit-user/attachments/20070214/59f522b9/attachment.png 

