ORBIT-USER: imaging of nodes.

Andrea G Forte andreaf at cs.columbia.edu
Tue Jan 2 12:22:10 EST 2007


My bad. Sorry about this, I will be more careful in the future.
So, if I understand things correctly, when the imaging of nodes gets 
stuck on a particular node, the only way out is to press Control-C; it 
does not give up on that node and continue the imaging process with the 
other nodes, right?

-Andrea

Max Ott wrote:
> Hi,
>
> Unfortunately you typed something slightly different. You requested
> 'basline.ndz' which doesn't exist. Now doing that should result in
> something a bit more meaningful than a ServiceException with no real
> consequences.
>
> As for the duration of imaging: it should really never take more than
> 20 to 25 minutes to image the entire grid. I just did it and indeed it
> got stuck on one node, but the INFO messages were very clear about it:
>
> INFO exp: Progress(322/0/323): 10/99/100 min(n_17_19)/avg/max (151.971942)
>
> 322 of 323 nodes were imaged. The one which didn't make it past the
> 10% mark was 'n_17_19'.
>
> If you don't need that one, simply ^C out of it. I guess we should add
> some more sanity checking code in to do this automatically. But we
> first need to fix the reason that 63 nodes didn't come up properly in
> the first place.
>
> However, I'm wondering if you really need 400 nodes? If you need fewer,
> and know which ones you want, create a topology file listing all the
> nodes you need and then use this one for imaging. Information on that
> should be on the wiki somewhere.
>
> -max
>
> On 1/2/07, Andrea G. Forte <andreaf at cs.columbia.edu> wrote:
>> Dear all,
>>
>> I am trying to image the nodes of the grid using the command:
>> imageNodes4 [1..20,1..20] baseline.ndz
>>
>> After issuing this command the imaging starts. Every day I get warnings
>> about different nodes not working, but this is not the issue.
>> Is it normal that after 45 minutes the imaging is still going on? Also,
>> I got the error "FATAL service_call: Exception: ServiceException
>> (ServiceException)
>> ERROR run: ServiceException: ServiceException". Should I ignore it? If
>> this is normal, one hour of the two hours that I can reserve each day
>> goes to imaging alone, which is really inefficient.
>>
>> After pressing "Control C" to terminate the imaging, I got the 
>> following:
>> ERROR ExecApp: Application 'commServer' failed (code=2)
>> ERROR Communicator: ComServer failed: status: 2
>> FATAL service_call: Exception:  (Interrupt)
>> /opt/nodehandler-4.1.2/app/nodeHandler.rb:94:in `service_call': ServiceException (ServiceException)
>>         from /opt/nodehandler-4.1.2/lib/handler/cmc.rb:187:in `nodeAllOffSoft'
>>         from /opt/nodehandler-4.1.2/app/nodeHandler.rb:419:in `shutdown'
>>         from /opt/nodehandler-4.1.2/app/nodeHandler.rb:651
>>
>> Unfortunately I do not have the exact experiment ID because the console
>> froze after interrupting the imaging process. However, it is January 1st
>> and I started the experiment at around 4:01 PM.
>>
>> Your help is very much appreciated as always.
>>
>> -Andrea
>>
>>
>>
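To illustrate how the progress line Max quotes can be read, here is a minimal Python sketch that pulls out the counters and the name of the slowest node. It is not part of nodehandler. The field names are assumptions based only on Max's explanation (322 of 323 nodes imaged, 'n_17_19' stuck at the 10% mark). The thread does not say what the middle counter or the trailing number in parentheses mean, so they are kept under neutral names. The sketch also assumes the log line is on a single line, without the wrapping introduced by email.

```python
import re
from typing import NamedTuple, Optional

# Pattern for lines such as:
#   INFO exp: Progress(322/0/323): 10/99/100 min(n_17_19)/avg/max (151.971942)
# Group names are guesses; "middle" and "trailing" are not explained
# in the thread and are only carried along.
PROGRESS_RE = re.compile(
    r"Progress\((?P<done>\d+)/(?P<middle>\d+)/(?P<total>\d+)\):\s*"
    r"(?P<min>\d+)/(?P<avg>\d+)/(?P<max>\d+)\s+"
    r"min\((?P<node>[^)]+)\)/avg/max\s*"
    r"\((?P<trailing>[\d.]+)\)"
)


class Progress(NamedTuple):
    done: int          # nodes that finished imaging (per Max's reading)
    middle: int        # meaning not stated in the thread
    total: int         # nodes taking part in the imaging run
    min_pct: int       # progress of the slowest node, in percent
    avg_pct: int
    max_pct: int
    slowest_node: str  # node named inside min(...)
    trailing: float    # meaning not stated in the thread


def parse_progress(line: str) -> Optional[Progress]:
    """Return the parsed progress fields, or None if the line does not match."""
    m = PROGRESS_RE.search(line)
    if m is None:
        return None
    return Progress(
        int(m["done"]), int(m["middle"]), int(m["total"]),
        int(m["min"]), int(m["avg"]), int(m["max"]),
        m["node"], float(m["trailing"]),
    )


if __name__ == "__main__":
    line = ("INFO exp: Progress(322/0/323): 10/99/100 "
            "min(n_17_19)/avg/max (151.971942)")
    p = parse_progress(line)
    if p is not None and p.done == p.total - 1:
        # Exactly one node is outstanding; report which one and how far it got.
        print(f"only {p.slowest_node} is outstanding at {p.min_pct}%")
```

Watching for the same slowest node staying at the same percentage across several progress lines would be one way to decide when to ^C, or to drive the kind of automatic sanity check Max mentions. What counts as "stuck" is a judgment call that the thread does not settle.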



