[orbit-user] Warning: Almost lost handler
Fehmi Ben Abdesslem
Fehmi.BEN-ABDESSLEM at lip6.fr
Thu Jul 10 08:29:08 EDT 2008
Hi Thierry,
Thank you very much for all the explanations!
Cheers
Fehmi
Thierry Rakotoarivelo wrote:
> Hi Fehmi and all,
>
> I had a look at your experiment log file:
> grid_2008_07_09_09_47_21.log, and focused on the first occurrence of
> the "ALMOST_LOST_HANDLER" at 09:53:33 in the log.
>
> It seems to me that either (or both?) of the following things happened:
>
> 1) There are a lot of messages exchanged between the experiment nodes
> and the NodeHandler at that time (quick line number difference gave me
> 600+ messages within 1sec). These messages are from the stdout on the
> nodes (showing ntpdate outputs), and all go to the NodeHandler. Thus
> as Ivan said, there is a chance that all these long and numerous
> stdout messages create some heavy traffic on the control subnet.
> (Perhaps you could send part of your application's output to
> /dev/null on the experiment nodes?)
>
> 2) And/or, according to these outputs, your nodes seem to be
> synchronizing their clocks with ntp. Maybe as a result of this process,
> the ones with a "late clock" push their clock forward ("jumping" ahead
> of the time they had before the ntp sync). Thus, the communication
> module of the nodeAgent erroneously thinks that a long time has gone by
> without a keep-alive from the nodeHandler, and triggers the "almost
> lost" warnings. In that case, as Ivan said, you can dismiss these false
> warnings.
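
The clock-jump effect described in point 2 can be illustrated with a small
sketch. This is not the nodeAgent's code: the class names, the 10-second
threshold, and the step size are all hypothetical, chosen only to show why a
keep-alive timer driven by wall-clock time reports a huge gap after an
ntp-style step, while one driven by a monotonic clock does not.

```python
class FakeClocks:
    """Simulated wall and monotonic clocks, so the example is deterministic."""

    def __init__(self, wall_start, mono_start=0.0):
        self.wall = wall_start
        self.mono = mono_start

    def advance(self, seconds):
        # Real time passing moves both clocks.
        self.wall += seconds
        self.mono += seconds

    def step_wall(self, seconds):
        # An ntpdate-style correction: wall time jumps, monotonic time does not.
        self.wall += seconds


class KeepAliveWatchdog:
    """Measures time since the last keep-alive using an injectable clock."""

    def __init__(self, clock, warn_after=10.0):
        self.clock = clock            # callable returning seconds as a float
        self.warn_after = warn_after  # hypothetical warning threshold (seconds)
        self.last_seen = clock()

    def keep_alive(self):
        self.last_seen = self.clock()

    def elapsed(self):
        return self.clock() - self.last_seen

    def almost_lost(self):
        return self.elapsed() > self.warn_after


def demo():
    clocks = FakeClocks(wall_start=1_000_000.0)
    wall_dog = KeepAliveWatchdog(lambda: clocks.wall)
    mono_dog = KeepAliveWatchdog(lambda: clocks.mono)

    clocks.advance(2.0)               # two real seconds pass
    clocks.step_wall(110_239_060.0)   # illustrative forward step of the wall clock

    return (wall_dog.elapsed(), mono_dog.elapsed(),
            wall_dog.almost_lost(), mono_dog.almost_lost())


if __name__ == "__main__":
    print(demo())
```

In a real program the monotonic variant would pass `time.monotonic` as the
clock and the wall-clock variant `time.time`; only the former is unaffected by
clock steps.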
>
> Regards,
> Thierry.
>
>
>
>
> Fehmi Ben Abdesslem wrote:
>> Hi Ivan,
>>
>> Thanks for your answer. It is happening while running experiments.
>> Here are the two last experiment IDs where I observed the warnings:
>> grid_2008_07_08_17_03_58.log
>> grid_2008_07_09_09_47_21.log
>> Since they do not affect the experiment results, that's ok for me :)
>>
>> Thanks again
>> Fehmi
>>
>> Ivan Seskar wrote:
>>> Hi Fehmi,
>>>
>>> Is this happening during imaging or while running an experiment?
>>> Typically this is the sign that there is heavy traffic on the
>>> control subnet and, since nodeAgent packets are not prioritized (it
>>> is on a TODO list), this tends to starve the keep-alive packets
>>> (that is why it sometimes happens during imaging). An alternative is
>>> to use the data subnet (second Ethernet) for experimental data.
>>> Ivan.
>>>
>>> PS: Also what was the experiment id - maybe we can see something in
>>> the logs?
>>>
>>> PPS: The warning shouldn't affect the experiment at all (it is there
>>> just to "suggest" that there is a lot of Ethernet traffic).
>>>
>>> -----Original Message-----
>>> From: orbit-user-bounces at orbit-lab.org
>>> [mailto:orbit-user-bounces at orbit-lab.org] On Behalf Of Fehmi Ben
>>> Abdesslem
>>> Sent: Wednesday, July 09, 2008 10:04 AM
>>> To: Orbit user discussion mailing list
>>> Subject: [orbit-user] Warning: Almost lost handler
>>>
>>> Hi all,
>>>
>>> Since last week, I always get this kind of warning while executing
>>> scripts from the grid console:
>>>
>>> WARN warning: n_2_16: n_2_16 senderId:
>>> n_2_16:ALMOST_LOST_HANDLER,1,110239061.991212
>>> WARN warning: n_16_11: n_16_11 senderId:
>>> n_16_11:ALMOST_LOST_HANDLER,1,55.987045
>>> WARN warning: n_6_16: n_6_16 senderId:
>>> n_6_16:ALMOST_LOST_HANDLER,1,110239059.639049
>>> WARN warning: n_18_18: n_18_18 senderId:
>>> n_18_18:ALMOST_LOST_HANDLER,1,41.983455
>>>
>>> Does anyone understand what they mean, whether they affect the
>>> experiment results, and how to fix those warnings?
>>> Thanks for your help
>>> Fehmi
>>>
>>> --
>>> Fehmi Ben Abdesslem
>>> PhD Candidate
>>>
>>> University of Paris VI - Pierre et Marie Curie LIP6/CNRS - Networks
>>> & Performance Analysis http://rp.lip6.fr/site_npa/
>>>
>>> Office 716 - 7th floor
>>> 104, avenue du président Kennedy
>>> F-75016 Paris
>>> Tel: +331 4427 8867
>>>
>>> _______________________________________________
>>> orbit-user mailing list
>>> orbit-user at orbit-lab.org
>>> http://orbit-lab.org/cgi-bin/mailman/listinfo/orbit-user
>>>
>>>
>>>
>>
>
>
--
Fehmi Ben Abdesslem
PhD Candidate
University of Paris VI - Pierre et Marie Curie
LIP6/CNRS - Networks & Performance Analysis
http://rp.lip6.fr/site_npa/
Office 716 - 7th floor
104, avenue du président Kennedy
F-75016 Paris
Tel: +331 4427 8867