[orbit-user] Warning: Almost lost handler

Fehmi Ben Abdesslem Fehmi.BEN-ABDESSLEM at lip6.fr
Thu Jul 10 08:29:08 EDT 2008


Hi Thierry,

Thank you very much for all the explanations!
Cheers
Fehmi

Thierry Rakotoarivelo wrote:
> Hi Fehmi and all,
>
> I had a look at your experiment log file: 
> grid_2008_07_09_09_47_21.log, and focused on the first occurrence of 
> the "ALMOST_LOST_HANDLER" at 09:53:33 in the log.
>
> It seems to me that either (or both?) of the following things happened:
>
> 1) There were a lot of messages exchanged between the experiment nodes 
> and the NodeHandler at that time (a quick line-number difference gave 
> me 600+ messages within 1 second). These messages come from stdout on 
> the nodes (showing ntpdate output), and all go to the NodeHandler. 
> Thus, as Ivan said, there is a chance that these long and numerous 
> stdout messages create heavy traffic on the control subnet. (You may 
> want to send part of your application's output to /dev/null on the 
> experiment nodes.)
>
> 2) And/or, according to these outputs, your nodes seem to be 
> synchronizing their clocks with NTP. Maybe as a result of this 
> process, the ones with a "late" clock push their clock forward 
> ("jumping" ahead of their time before the NTP sync). Thus, the 
> communication module of the nodeAgent erroneously thinks that a long 
> time has gone by without a keep-alive from the NodeHandler, and 
> triggers the "almost lost" warnings. In that case, as Ivan said, you 
> can dismiss these false warnings.
>
> Regards,
> Thierry.
>
>
>
>
> Fehmi Ben Abdesslem wrote:
>> Hi Ivan,
>>
>> Thanks for your answer. It is happening while running experiments.
>> Here are the two last experiment IDs where I observed the warnings:
>> grid_2008_07_08_17_03_58.log
>> grid_2008_07_09_09_47_21.log
>> Since they do not affect the experiment results, that's ok for me :)
>>
>> Thanks again
>> Fehmi
>>
>> Ivan Seskar wrote:
>>> Hi Fehmi,
>>>
>>> Is this happening during imaging or while running an experiment? 
>>> Typically this is a sign that there is heavy traffic on the 
>>> control subnet and, since nodeAgent packets are not prioritized (it 
>>> is on the TODO list), this tends to starve the keep-alive packets 
>>> (that is why it sometimes happens during imaging). An alternative is 
>>> to use the data subnet (second Ethernet) for experimental data.
>>> Ivan.
>>>
>>> PS: Also what was the experiment id - maybe we can see something in 
>>> the logs?
>>>
>>> PPS: The warning shouldn't affect the experiment at all (it is there 
>>> just to "suggest" that there is a lot of Ethernet traffic).
>>>
>>> -----Original Message-----
>>> From: orbit-user-bounces at orbit-lab.org 
>>> [mailto:orbit-user-bounces at orbit-lab.org] On Behalf Of Fehmi Ben 
>>> Abdesslem
>>> Sent: Wednesday, July 09, 2008 10:04 AM
>>> To: Orbit user discussion mailing list
>>> Subject: [orbit-user] Warning: Almost lost handler
>>>
>>> Hi all,
>>>
>>> Since last week, I keep getting this kind of warning while executing 
>>> scripts from the grid console:
>>>
>>>  WARN warning: n_2_16: n_2_16 senderId: n_2_16:ALMOST_LOST_HANDLER,1,110239061.991212
>>>  WARN warning: n_16_11: n_16_11 senderId: n_16_11:ALMOST_LOST_HANDLER,1,55.987045
>>>  WARN warning: n_6_16: n_6_16 senderId: n_6_16:ALMOST_LOST_HANDLER,1,110239059.639049
>>>  WARN warning: n_18_18: n_18_18 senderId: n_18_18:ALMOST_LOST_HANDLER,1,41.983455
>>>
>>> Does anyone understand what they mean, whether they affect the 
>>> experiment results, and how to fix them?
>>> Thanks for your help
>>> Fehmi
>>>
>>> -- 
>>> Fehmi Ben Abdesslem
>>> PhD Candidate
>>>
>>> University of Paris VI - Pierre et Marie Curie
>>> LIP6/CNRS - Networks & Performance Analysis
>>> http://rp.lip6.fr/site_npa/
>>>
>>> Office 716 - 7th floor
>>> 104, avenue du président Kennedy
>>> F-75016 Paris
>>> Tel: +331 4427 8867
>>>
>>> _______________________________________________
>>> orbit-user mailing list
>>> orbit-user at orbit-lab.org
>>> http://orbit-lab.org/cgi-bin/mailman/listinfo/orbit-user
>>>
>>>
>>>   
>>
>
>
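
To illustrate Thierry's point 2), here is a minimal sketch of how a forward clock step can make a keep-alive check fire even though nothing was lost. It is not the nodeAgent code: the class names, the 10-second timeout, and the size of the clock step are all hypothetical, chosen only to show the effect. A check based on wall-clock time is compared with one based on a monotonic clock, which is not affected by clock adjustments.

```python
import time


class KeepAliveWatchdog:
    """Hypothetical watchdog: measures the silence since the last keep-alive."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock            # any zero-argument callable returning seconds
        self.last_seen = clock()

    def keep_alive_received(self):
        self.last_seen = self.clock()

    def silence(self):
        return self.clock() - self.last_seen

    def almost_lost(self):
        return self.silence() > self.timeout_s


class SteppableWallClock:
    """Fake wall clock that can be stepped forward, as a clock sync might do."""

    def __init__(self, start):
        self.now = start

    def __call__(self):
        return self.now

    def step(self, seconds):
        self.now += seconds


wall = SteppableWallClock(start=1_000_000.0)
naive = KeepAliveWatchdog(timeout_s=10, clock=wall)   # wall-clock based
robust = KeepAliveWatchdog(timeout_s=10)              # time.monotonic based

# Hypothetical clock step: the node's "late" clock is pushed forward.
wall.step(110_239_061)

print(naive.almost_lost())    # True: the step looks like a long silence
print(robust.almost_lost())   # False: the monotonic clock ignores the step
```

The step size used here, about 110 million seconds, is roughly 3.5 years, which is far longer than any experiment could have been silent.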

-- 
Fehmi Ben Abdesslem
PhD Candidate

University of Paris VI - Pierre et Marie Curie
LIP6/CNRS - Networks & Performance Analysis
http://rp.lip6.fr/site_npa/

Office 716 - 7th floor
104, avenue du président Kennedy
F-75016 Paris
Tel: +331 4427 8867
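
For anyone who wants to separate the two cases discussed in this thread, below is a rough sketch that pulls the ALMOST_LOST_HANDLER warnings out of a log and flags implausibly large values. It rests on assumptions that the thread does not confirm: that each warning is a single line in the real log (the mail client wrapped them), and that the last field is a time in seconds since the last keep-alive. The function name and the one-day threshold are hypothetical.

```python
import re

# Matches e.g. "n_2_16:ALMOST_LOST_HANDLER,1,110239061.991212"
WARNING_RE = re.compile(
    r"(n_\d+_\d+):ALMOST_LOST_HANDLER,(\d+),(\d+(?:\.\d+)?)"
)


def classify_warnings(log_text, implausible_s=86_400.0):
    """Return (node, value, label) for each warning found in log_text.

    ASSUMPTION: the last field is a duration in seconds. Values above
    implausible_s (one day by default) are labelled as a possible clock
    step; smaller values as a possible real delay.
    """
    results = []
    for node, _count, value in WARNING_RE.findall(log_text):
        seconds = float(value)
        label = "clock-step?" if seconds > implausible_s else "delay?"
        results.append((node, seconds, label))
    return results


# The four warnings quoted in this thread, rejoined into single lines.
sample = """\
 WARN warning: n_2_16: n_2_16 senderId: n_2_16:ALMOST_LOST_HANDLER,1,110239061.991212
 WARN warning: n_16_11: n_16_11 senderId: n_16_11:ALMOST_LOST_HANDLER,1,55.987045
 WARN warning: n_6_16: n_6_16 senderId: n_6_16:ALMOST_LOST_HANDLER,1,110239059.639049
 WARN warning: n_18_18: n_18_18 senderId: n_18_18:ALMOST_LOST_HANDLER,1,41.983455
"""

for node, seconds, label in classify_warnings(sample):
    print(node, seconds, label)
```

Under these assumptions, n_2_16 and n_6_16 fall in the "clock-step?" group and n_16_11 and n_18_18 in the "delay?" group.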


