[orbit-user] Warning: Almost lost handler
Thierry Rakotoarivelo
Thierry.Rakotoarivelo at nicta.com.au
Wed Jul 9 19:55:02 EDT 2008
Hi Fehmi and all,
I had a look at your experiment log file: grid_2008_07_09_09_47_21.log,
and focused on the first occurrence of the "ALMOST_LOST_HANDLER" at
09:53:33 in the log.
It seems to me that either (or both?) of the following things happened:
1) There are a lot of messages exchanged between the experiment nodes
and the NodeHandler at that time (a quick line-number difference gave me
600+ messages within 1 second). These messages come from the stdout on
the nodes (showing ntpdate outputs), and all go to the NodeHandler. Thus,
as Ivan said, there is a chance that all these long and numerous stdout
messages create some heavy traffic on the control subnet. (You might
want to send part of your application's output to /dev/null on the
experiment nodes?)
2) And/or, according to these outputs, your nodes seem to be
synchronizing their clocks with ntp. Maybe as a result of this process,
the ones with a "late clock" push their clock forward ("jumping" ahead of
the time they had before the ntp sync). Thus, the communication module of
the nodeAgent erroneously thinks that a long time has gone by without a
keep-alive from the NodeHandler, and triggers the "almost lost"
warnings. In that case, as Ivan said, you can dismiss these false warnings.
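
To illustrate point 2, here is a minimal sketch in Python. It is not the
actual nodeAgent code: the class name, the 10-second timeout, and the fake
clocks are all assumptions made up for the example. It only shows how a
keep-alive check based on a wall clock can misfire when that clock is
stepped forward, while the same check based on a monotonic clock does not.

```python
import time

KEEPALIVE_TIMEOUT = 10.0  # seconds; hypothetical value, not taken from nodeAgent


class KeepAliveMonitor:
    """Tracks the time elapsed since the last keep-alive, using a given clock."""

    def __init__(self, clock):
        self.clock = clock          # any callable returning seconds as a float
        self.last_seen = clock()

    def on_keepalive(self):
        self.last_seen = self.clock()

    def elapsed(self):
        return self.clock() - self.last_seen

    def almost_lost(self):
        return self.elapsed() > KEEPALIVE_TIMEOUT


class FakeClock:
    """Manually driven clock, used here to simulate a clock step."""

    def __init__(self, start):
        self.now = start

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


wall = FakeClock(1000.0)  # stands in for time.time (can be stepped by ntp)
mono = FakeClock(0.0)     # stands in for time.monotonic (never stepped)

wall_monitor = KeepAliveMonitor(wall)
mono_monitor = KeepAliveMonitor(mono)

# One real second passes on both clocks.
wall.advance(1.0)
mono.advance(1.0)

# The clock sync then steps only the wall clock forward by one hour.
wall.advance(3600.0)

print("wall clock: elapsed", wall_monitor.elapsed(),
      "almost lost?", wall_monitor.almost_lost())
print("monotonic:  elapsed", mono_monitor.elapsed(),
      "almost lost?", mono_monitor.almost_lost())
```

In a real program one would pass time.time or time.monotonic instead of
the fake clocks; the fake ones are only there so the clock step can be
reproduced on demand.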
Regards,
Thierry.
Fehmi Ben Abdesslem wrote:
> Hi Ivan,
>
> Thanks for your answer. It is happening while running experiments.
> Here are the last two experiment IDs where I observed the warnings:
> grid_2008_07_08_17_03_58.log
> grid_2008_07_09_09_47_21.log
> Since they do not affect the experiment results, that's ok for me :)
>
> Thanks again
> Fehmi
>
> Ivan Seskar wrote:
>> Hi Fehmi,
>>
>> Is this happening during imaging or while running an experiment?
>> Typically this is a sign that there is heavy traffic on the control
>> subnet and, since nodeAgent packets are not prioritized (it is on a
>> TODO list), this tends to starve the keep-alive packets (that is why
>> it sometimes happens during imaging). An alternative is to use the data
>> subnet (second Ethernet) for experimental data.
>> Ivan.
>>
>> PS: Also what was the experiment id - maybe we can see something in
>> the logs?
>>
>> PPS: The warning shouldn't affect the experiment at all (it is there
>> just to "suggest" that there is a lot of Ethernet traffic).
>>
>> -----Original Message-----
>> From: orbit-user-bounces at orbit-lab.org
>> [mailto:orbit-user-bounces at orbit-lab.org] On Behalf Of Fehmi Ben
>> Abdesslem
>> Sent: Wednesday, July 09, 2008 10:04 AM
>> To: Orbit user discussion mailing list
>> Subject: [orbit-user] Warning: Almost lost handler
>>
>> Hi all,
>>
>> Since last week, I always get this kind of warning while executing
>> scripts from the grid console:
>>
>> WARN warning: n_2_16: n_2_16 senderId:
>> n_2_16:ALMOST_LOST_HANDLER,1,110239061.991212
>> WARN warning: n_16_11: n_16_11 senderId:
>> n_16_11:ALMOST_LOST_HANDLER,1,55.987045
>> WARN warning: n_6_16: n_6_16 senderId:
>> n_6_16:ALMOST_LOST_HANDLER,1,110239059.639049
>> WARN warning: n_18_18: n_18_18 senderId:
>> n_18_18:ALMOST_LOST_HANDLER,1,41.983455
>>
>> Does anyone understand what they mean, whether they affect the
>> experiment results, and how to fix those warnings?
>> Thanks for your help
>> Fehmi
>>
>> --
>> Fehmi Ben Abdesslem
>> PhD Candidate
>>
>> University of Paris VI - Pierre et Marie Curie LIP6/CNRS - Networks &
>> Performance Analysis http://rp.lip6.fr/site_npa/
>>
>> Office 716 - 7th floor
>> 104, avenue du président Kennedy
>> F-75016 Paris
>> Tel: +331 4427 8867
>>
>> _______________________________________________
>> orbit-user mailing list
>> orbit-user at orbit-lab.org
>> http://orbit-lab.org/cgi-bin/mailman/listinfo/orbit-user
>>
>>
>>
>
--
-----
Thierry Rakotoarivelo
Networked Systems, ATP Lab - NICTA
Locked Bag 9013, Alexandria, NSW 1435, Australia
Tel. +61 2 8374 5245 / Fax. +61 2 8374 5531
Web. www.nicta.com.au