ORBIT-USER: most of the grid does not work!
Mesut Ali Ergin
ergin at eden.rutgers.edu
Wed Feb 14 22:37:33 EST 2007
Dear Andrea,
From time to time I run experiments with the maximum possible number of
ORBIT nodes. Please allow me to make a few suggestions before you arrive
at conclusive (and potentially offensive) generic statements like "Can
somebody try to fix the grid?" and "Most of the nodes do not work...".
My experience with full imaging (that is, attempting to image all 400
nodes) has been, contrary to what you have said, acceptably good. I
cannot say that the process is perfect; in fact it is far from perfect.
However, getting more than 320 nodes imaged in 70% of my attempts, and
never fewer than 250, qualifies as 'acceptably good' from my viewpoint.
To be more precise on this matter, I am attaching an imaging performance
histogram from my 'full imaging required' experiments over the last
three months for the attention of the ORBIT community. More detailed
statistics for those experiments, such as which nodes were more robust
than others, are available in case someone needs them.
Long story short, let me provide a few suggestions for getting more
nodes imaged in a shorter amount of time, and for getting help from the
list on such subjects in general.
(i) Please make sure that you check the status of the nodes before you
start imaging (I cannot stress this enough if you are attempting a full
imaging). The easiest way of doing so is via
http://orbit-lab.org/wiki/Status . Red nodes there are the ones that
will be reported missing if you start imaging right away. Luckily there
are usually only a handful of them (fewer than that these days). There
is also a small possibility that the blue nodes there will fail to get
an image. So the best practice is to use the allOffSoft command of CMC
(and, only if absolutely necessary, allOffHard after a couple of
allOffSoft attempts) to turn off as many nodes as possible (kindly refer
to the CMC documentation for details on using the CMC web service). More
gray nodes there mean a higher chance of more nodes completing imaging
successfully. Although there are some automated processes in the ORBIT
infrastructure that help bring nodes to a healthy state before imaging
and after a fellow user logs out, extra care won't hurt and will save
you a lot of time if you happen to hit "a bad day for imaging".
(ii) Use imageNodes4 instead of imageNodes if you are attempting full
imaging.
(iii) Do not forget that imaging (like almost everything involved in
ORBIT experiments) is a *physical process* that literally goes to 400
nodes, instructs their power supplies to turn on, registers them for PXE
imaging, and delivers gigabytes of data in minutes to be written to 400
magnetic disk drives. We, as people spending countless hours in front of
a plain old UNIX terminal, sometimes forget that fact and expect things
to happen instantaneously after hitting enter. An example is hitting
Ctrl-C at some point during imaging and restarting imaging right
afterwards. In my experience it takes somewhere between 2 and 5 minutes
for the ORBIT infrastructure to restore the nodes to the state they were
in before imaging started. I think the majority of the cases where
hundreds of nodes get reported missing (or given up on) are somewhat
related to this. If you had to abort imaging by hitting Ctrl-C, consider
doing the things listed in (i) before restarting. I assure you that the
time you spend doing that will save you much more when it comes to
achieving your objective here (the maximum possible number of nodes
imaged).
(iv) Do not tie yourself to specific nodes. Take advantage of the
nodeHandler imaging log in the /tmp folder (use your experiment ID to
find the right log file) to determine which nodes were imaged
successfully. If applicable, automate your scripts so that the roles of
the nodes in your experiment are assigned automatically from the
successfully imaged nodes found this way. If you like, you can ask me
for a simple script that extracts a list of nodes (in a format useful
for ssh, sftp, scp, etc.) from that log. Keep in mind that we are
talking about 400 tiny PCs that are repeatedly "abused" (imaged again
and again, forced to do hard resets, etc.) every single day. They
absolutely do fail, and the ORBIT administrators try hard to keep as
many of them alive as possible. However, there is no guarantee that a
given node (e.g., your faithful servant AP 14-5) will always be there
waiting for your image.
(v) Do not insist on reaching a certain number of nodes if you can
already live with what has completed so far. The number of completed
nodes is reported periodically by the nodeHandler in the form:
*INFO nodeHandler::exp: Progress(308/0/324): 90/98/100 min(n_2_2)/avg/max (144.379793)*
The above line means that nodeHandler is actively attempting to image
324 nodes and that 308 of them have already completed imaging
successfully. You can safely hit Ctrl-C once, let nodeHandler clean up
gracefully, and use the log file mentioned in (iv) to extract the useful
nodes (i.e., 308 of them in the above case). If you absolutely need to
reach a certain number, try checking the nodes that do not seem to be
getting the image (e.g., the ones reported as *min(n_X_Y)* by the
nodeHandler) by logging onto their respective consoles (telnet 10.1.X.Y
3025) and looking for odd things (hda errors in the system logs, lack of
an IP address reflecting the coordinates of the node, etc.). If you see
such things, you can cross that node off your list (at least for your
current slot), save the time you would spend waiting for it to complete,
and move forward with other nodes.
(vi) Last but not least (as reminded numerous times before), please
give as much detail as possible about the error you have experienced,
including the experiment ID (the identifier of the form
grid_XXXX_XX_XX_XX_XX_XX ). Simply pasting the list of nodes that are
throwing errors will be of little help to someone who decides to spend
some of his/her time helping you out.
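To make (iv) and (v) a bit more concrete, here is a minimal Python
sketch of the kind of log processing described there. It is not the
script offered in (iv), and it rests on assumptions that should be
checked against a real log: that the grid is 20 x 20 with coordinates
1..20 (400 nodes), that every node not reported as "missing" or "given
up" was imaged successfully, and that failures appear in the log exactly
as in the warnings quoted in this thread. The function names are
hypothetical, and the meaning of the middle Progress field is not known
here, so it is ignored.

```python
import re

# Assumption: a 20 x 20 grid with coordinates 1..20 (400 nodes in total).
GRID_SIZE = 20

# Patterns modelled on the log lines quoted in this thread.
# The list archive shows "8 at 1"; the raw log may use "8@1", so accept both.
GIVE_UP = re.compile(r"Giving up on node n_(\d+)_(\d+)")
MISSING = re.compile(r"Ignoring missing node '(\d+)(?:@| at )(\d+)'")
PROGRESS = re.compile(r"Progress\((\d+)/(\d+)/(\d+)\).*?min\(n_(\d+)_(\d+)\)")


def failed_nodes(log_lines):
    """Return the set of (x, y) nodes reported as missing or given up."""
    failed = set()
    for line in log_lines:
        for pattern in (GIVE_UP, MISSING):
            match = pattern.search(line)
            if match:
                failed.add((int(match.group(1)), int(match.group(2))))
    return failed


def usable_nodes(log_lines, size=GRID_SIZE):
    """Return the sorted grid coordinates that were not reported as failed."""
    grid = {(x, y) for x in range(1, size + 1) for y in range(1, size + 1)}
    return sorted(grid - failed_nodes(log_lines))


def parse_progress(line):
    """Pull completed/attempted counts and the min(...) node from a Progress line."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    done, _unknown, attempted, x, y = (int(g) for g in match.groups())
    return {"done": done, "attempted": attempted, "slowest": (x, y)}


def console_address(node):
    """Console address of node (x, y), following 'telnet 10.1.X.Y 3025'."""
    x, y = node
    return f"10.1.{x}.{y}"
```

Feeding the lines of the imaging log to usable_nodes() gives a candidate
node list to assign experiment roles from, and
console_address(parse_progress(line)["slowest"]) gives the console to
check for a node that seems stuck.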
Obviously, the above list is not complete or exhaustive in any way. You,
I, and others will probably encounter errors in the future during
imaging and other processes involved in ORBIT. My suggestion would be to
stay calm and ask for help right away via the list. People involved in
ORBIT have proven to be friendly about all sorts of issues and
cooperative in solving them, no matter what their responsibilities are.
Finally, all suggestions and statements in this e-mail come from my
personal experience as a fellow ORBIT user and do not necessarily
constitute ORBIT's official view. I hope these will make your life a bit
easier on ORBIT.
Best regards,
--
Mesut Ali Ergin
ergin at winlab.rutgers.edu
Rutgers University, WINLAB,
Technology Center of New Jersey,
671 Rt. 1 South, North Brunswick,
New Jersey, 08902-3390, USA
Phone: 862-368-6620
Fax: 732-932-6882
Andrea G Forte wrote:
> Dear ORBIT administrators,
>
> could somebody please try to fix the nodes in the grid? Most of the
> nodes do not work and even imaging fails miserably. On top of this, many
> nodes succeed in coming up but then the image process never ends,
> everything gets stuck and I have to re-start again (an example is node
> 5_5).
> *Please* could someone look into fixing the grid?
> Here is the list:
> WARN -:topo:image: Ignoring missing node '8 at 1'
> WARN -:topo:image: Ignoring missing node '14 at 18'
> WARN -:topo:image: Ignoring missing node '16 at 11'
> WARN stdlib: Giving up on node n_10_3
> WARN stdlib: Giving up on node n_19_7
> WARN stdlib: Giving up on node n_4_4
> WARN stdlib: Giving up on node n_5_14
> WARN stdlib: Giving up on node n_13_3
> WARN stdlib: Giving up on node n_6_1
> WARN stdlib: Giving up on node n_1_14
> WARN stdlib: Giving up on node n_15_8
> WARN stdlib: Giving up on node n_3_10
> WARN stdlib: Giving up on node n_17_15
> WARN stdlib: Giving up on node n_2_15
> WARN stdlib: Giving up on node n_12_14
> WARN stdlib: Giving up on node n_13_2
> WARN stdlib: Giving up on node n_7_19
> WARN stdlib: Giving up on node n_5_12
> WARN stdlib: Giving up on node n_4_19
> WARN stdlib: Giving up on node n_16_17
> WARN stdlib: Giving up on node n_10_19
> WARN stdlib: Giving up on node n_19_4
> WARN stdlib: Giving up on node n_6_10
> WARN stdlib: Giving up on node n_18_8
> WARN stdlib: Giving up on node n_4_1
> WARN stdlib: Giving up on node n_4_16
> WARN stdlib: Giving up on node n_9_16
> WARN stdlib: Giving up on node n_11_11
> WARN stdlib: Giving up on node n_6_12
> WARN stdlib: Giving up on node n_1_12
> WARN stdlib: Giving up on node n_16_13
> WARN stdlib: Giving up on node n_4_13
> WARN stdlib: Giving up on node n_3_4
> WARN stdlib: Giving up on node n_19_15
> WARN stdlib: Giving up on node n_9_5
> WARN stdlib: Giving up on node n_16_5
> WARN stdlib: Giving up on node n_8_4
> WARN stdlib: Giving up on node n_9_19
> WARN stdlib: Giving up on node n_12_13
> WARN stdlib: Giving up on node n_2_14
> WARN stdlib: Giving up on node n_10_14
> WARN stdlib: Giving up on node n_12_9
> WARN stdlib: Giving up on node n_12_7
> WARN stdlib: Giving up on node n_1_3
> WARN stdlib: Giving up on node n_9_15
> WARN stdlib: Giving up on node n_11_2
> WARN stdlib: Giving up on node n_12_3
> WARN stdlib: Giving up on node n_15_11
> WARN stdlib: Giving up on node n_11_5
> WARN stdlib: Giving up on node n_3_16
> WARN stdlib: Giving up on node n_7_13
> WARN stdlib: Giving up on node n_7_3
> WARN stdlib: Giving up on node n_5_17
> WARN stdlib: Giving up on node n_19_11
> WARN stdlib: Giving up on node n_19_5
> WARN stdlib: Giving up on node n_14_17
> WARN stdlib: Giving up on node n_17_17
> WARN stdlib: Giving up on node n_10_12
> WARN stdlib: Giving up on node n_16_10
> WARN stdlib: Giving up on node n_2_17
> WARN stdlib: Giving up on node n_4_11
> WARN stdlib: Giving up on node n_11_8
> WARN stdlib: Giving up on node n_3_7
> WARN stdlib: Giving up on node n_19_1
> WARN stdlib: Giving up on node n_9_18
> WARN stdlib: Giving up on node n_11_7
> WARN stdlib: Giving up on node n_1_13
> WARN stdlib: Giving up on node n_9_11
> WARN stdlib: Giving up on node n_11_4
> WARN stdlib: Giving up on node n_17_6
> WARN stdlib: Giving up on node n_14_19
> WARN stdlib: Giving up on node n_14_2
> WARN stdlib: Giving up on node n_7_8
> WARN stdlib: Giving up on node n_19_17
> WARN stdlib: Giving up on node n_13_6
> WARN stdlib: Giving up on node n_7_17
> WARN stdlib: Giving up on node n_3_3
> WARN stdlib: Giving up on node n_3_12
> WARN stdlib: Giving up on node n_12_6
> WARN stdlib: Giving up on node n_2_16
> WARN stdlib: Giving up on node n_8_11
> WARN stdlib: Giving up on node n_15_4
> WARN stdlib: Giving up on node n_4_2
> WARN stdlib: Giving up on node n_3_19
> WARN stdlib: Giving up on node n_14_8
> WARN stdlib: Giving up on node n_13_17
> WARN stdlib: Giving up on node n_11_12
> WARN stdlib: Giving up on node n_11_17
> WARN stdlib: Giving up on node n_10_9
> WARN stdlib: Giving up on node n_4_12
> WARN stdlib: Giving up on node n_5_2
> WARN stdlib: Giving up on node n_8_2
> WARN stdlib: Giving up on node n_13_1
> WARN stdlib: Giving up on node n_5_15
> WARN stdlib: Giving up on node n_17_11
> WARN stdlib: Giving up on node n_19_18
> WARN stdlib: Giving up on node n_10_4
> WARN stdlib: Giving up on node n_13_5
> WARN stdlib: Giving up on node n_14_1
> WARN stdlib: Giving up on node n_8_15
> WARN stdlib: Giving up on node n_16_9
> WARN stdlib: Giving up on node n_6_2
> WARN stdlib: Giving up on node n_11_13
> WARN stdlib: Giving up on node n_6_17
> WARN stdlib: Giving up on node n_14_9
> WARN stdlib: Giving up on node n_14_7
> WARN stdlib: Giving up on node n_8_9
> WARN stdlib: Giving up on node n_5_19
> WARN stdlib: Giving up on node n_11_9
>
> -Andrea
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: imagingPerf.png
Type: image/png
Size: 29406 bytes
Desc: not available
Url : http://orbit-lab.org/pipermail/orbit-user/attachments/20070214/59f522b9/attachment.png