ORBIT-USER: most of the grid does not work!

Andrea G Forte andreaf at cs.columbia.edu
Fri Feb 16 16:31:19 EST 2007


Thank you Chris. I will definitely give it a try.

-Andrea


chris at orderonenetworks.com wrote:
> I was a bit quick to post that code :) This version here seems better :)
>
> Don't forget the 'chmod a+x gne'
>
> gne
> ----------------------
> #!/usr/bin/ruby
> # Author: Chris Davies (chris at orderonenetworks.com)
>
> # determine the output type
> if (ARGV[0] == nil || (ARGV[0] != "nh" &&  ARGV[0] != "oh")) then
>     puts " Good Node Extractor v1.0"
>     puts "  usage: gne nh/oh"
>     puts " "
>     puts "   This utility extracts the nodes that imaged successfully"
>     puts "   from the last experiment that was run. The parameter"
>     puts "   specifies the output type"
>     puts " "
>     puts "      nh - node handler format for use with imageNodes4"
>     puts "      oh - orbit handler format"
>     puts " "
>     exit
> end
>
> # get the most recent log file
> command = "ls -c -1  /tmp/comm*.log"
>
> firstfile = IO.popen(command,"r").readlines[0]
>
> if (firstfile == nil) then
>     puts "Error: log file not found."
>     exit
> end
>
> # grep the file for nodes that imaged
> command = "grep 'INFO.*Wrote' " + firstfile
> if (ARGV[0] == "nh") then
>     puts "defTopology('my:topo') { |t|"
> end
>
> IO.popen(command,"r").each { |line|
>     # skip grep matches that lack the 'n_' node marker
>     line = line.split('n_')[1]
>     if (line != nil) then
>         line = line.split('>')[0]
>
>         first = line.split('_')[0];
>         second = line.split('_')[1];
>         if (ARGV[0] == "nh")
>             puts "    t.addNode(" + first + "," + second + ");"
>         else
>             puts "["+first+","+second+"]"
>         end
>     end
> }
>
> if (ARGV[0] == "nh")
>     puts "}"
> end
>
>
>
>
>
>   
>> Ivan and Andrea,
>>
>> I'd prefer to have access to all the nodes - since the quantity counts :)
>>
>> I've come up with a little ruby script that may help. It basically parses
>> the output from the last experiment and generates a list of nodes that
>> imaged successfully. The list may be formatted in either 'nodehandler'
>> format (for use with imageNodes4) or 'orbithandler' format.
>>
>> This basically allows you to image all the nodes, then stop the process
>> when it images enough of them. Run the script to make the list of nodes
>> that passed and then use that for the rest of your experiment.
>>
>> The code is new, so any bugs, please let me know.
>>
>> Thanks,
>> Chris
>>
>>
>> sample output:
>> ------------------------------
>>     
>>> gne nh
>>>       
>> defTopology('my:topo') { |t|
>>     t.addNode(1,10);
>>     t.addNode(8,4);
>>     t.addNode(9,1);
>> }
>>
>>
>> script (save as gne) (make sure to type 'chmod a+x gne' so it can run):
>> -----------------------------------------
>> #!/usr/bin/ruby
>> # Author: Chris Davies (chris at orderonenetworks.com)
>>
>> # determine the output type
>> if (ARGV[0] == nil || (ARGV[0] != "nh" &&  ARGV[0] != "oh")) then
>>     puts " Good Node Extractor v1.0"
>>     puts "  usage: gne nh/oh"
>>     puts " "
>>     puts "   This utility extracts the nodes that imaged successfully from the"
>>     puts "   last experiment that was run. The parameter specifies the output type"
>>     puts "      nh - node handler format for use with imageNodes4"
>>     puts "      oh - orbit handler format"
>>     puts " "
>>     exit
>> end
>>
>> # get the most recent log file
>> command = "ls -c -1 -r /tmp/*.log"
>>
>> firstfile = IO.popen(command,"r").readlines[0]
>>
>> if (firstfile == nil) then
>>     puts "Error: log file not found."
>>     exit
>> end
>>
>> # grep the file for nodes that imaged
>> command = "grep Wrote " + firstfile
>>
>> if (ARGV[0] == "nh") then
>>     puts "defTopology('my:topo') { |t|"
>> end
>>
>> IO.popen(command,"r").each { |line|
>> #puts line
>>     line = line.split('msg: <n_')[1]
>>     line = line.split('>')[0]
>>
>>     first = line.split('_')[0];
>>     second = line.split('_')[1];
>>     if (ARGV[0] == "nh")
>>        puts "    t.addNode(" + first + "," + second + ");"
>>     else
>>         puts "["+first+","+second+"]"
>>     end
>>
>> }
>>
>> if (ARGV[0] == "nh")
>>     puts "}"
>> end
>>
>>
>>
>>
>>
>>
>>     
>>> Hi Andrea,
>>>
>>> If everybody agrees, we can try that. The problem is that people who
>>> want to have as many nodes as possible even if they are not 100%
>>> guaranteed to come up lose big time (once we declare nodes as
>>> "administratively down" you can't access them at all). What do others
>>> think about it?
>>>
>>> Ivan.
>>>
>>> -----Original Message-----
>>> From: owner-orbit-user at winlab.rutgers.edu
>>> [mailto:owner-orbit-user at winlab.rutgers.edu] On Behalf Of Andrea G Forte
>>> Sent: Thursday, February 15, 2007 12:17 PM
>>> To: orbit-user at winlab.rutgers.edu
>>> Subject: Re: ORBIT-USER: most of the grid does not work!
>>>
>>> Ivan,
>>>
>>> thank you. This might be very helpful but perhaps I need to understand
>>> it better. What is the difference between a node being off and being
>>> unavailable? Currently the grid shows only 4 nodes as unavailable but
>>> from my experiments there is a very large number of nodes that do not
>>> turn on and others that turn on but do not complete the imaging process.
>>>
>>> In my opinion it would be very helpful to mark all of these nodes as
>>> unavailable and just turn them off. In this way we would be able to
>>> image the good nodes with minimum effort and without having to start the
>>> imaging process again and again because of nodes getting stuck.
>>> In other words, nodes that get stuck or do not turn on cause only
>>> problems and should be disconnected.
>>>
>>> -Andrea
>>>
>>>
>>> Ivan Seskar wrote:
>>>       
>>>>
>>>>> From: owner-orbit-user at winlab.rutgers.edu
>>>>> [mailto:owner-orbit-user at winlab.rutgers.edu] On Behalf Of Mesut Ali Ergin
>>>>> Sent: Wednesday, February 14, 2007 10:38 PM
>>>>> To: orbit-user at winlab.rutgers.edu
>>>>> Subject: Re: ORBIT-USER: most of the grid does not work!
>>>>>
>>>> ...
>>>>
>>>> Just to add to this discussion: we are having problems with node power
>>>> supplies (the ones with the red dots on the status page actually have
>>>> dead power supplies). Unfortunately, the first symptoms of failing PSs
>>>> are CM lockups and nodes stuck in on or off state; it looks like we will
>>>> have to replace all of them which is not a trivial thing to do. We are
>>>> trying to find ways of using an interim software solution that will
>>>> (hopefully) prolong the life of power supplies as well as enable us to
>>>> do incremental replacement (rather than force us to shut down the grid
>>>> and replace all power supplies at once).
>>>>
>>>> Ivan.
>>>>
>>>> PS: Even better page for big grid status is
>>>> http://www.orbit-lab.org/wiki/Status/Grid - it has a webcam feed as well
>>>> :-). (status pages do not auto-refresh so you will have to do it
>>>> manually - after all they are not really finished yet as you will
>>>> discover if you try to select individual nodes).
>>>>
>>>>         
>>>
>>>       
>>
>>     
>
>   
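
For reference, the parsing step in the quoted gne script can also be sketched with a regular expression in place of the chained split calls, so that lines without a node marker are skipped instead of raising an error. This is only a sketch under assumptions: the helper names extract_nodes and format_nodes are hypothetical, the sample log lines are invented, and it assumes the coordinates in the 'n_<x>_<y>>' marker are plain digits. The real log format may differ.

```ruby
# Sketch only: extract_nodes and format_nodes are hypothetical helpers,
# and the sample log lines below are invented for illustration.

# Pull [x, y] coordinate pairs out of lines containing "n_<x>_<y>>".
# Lines without that marker are skipped rather than raising an error.
def extract_nodes(lines)
  lines.map { |line|
    m = line.match(/n_(\d+)_(\d+)>/)
    m ? [m[1].to_i, m[2].to_i] : nil
  }.compact
end

# Render the node list in one of the two output styles used by gne.
def format_nodes(nodes, type)
  if type == "nh"
    body = nodes.map { |x, y| "    t.addNode(#{x},#{y});" }
    (["defTopology('my:topo') { |t|"] + body + ["}"]).join("\n")
  else
    nodes.map { |x, y| "[#{x},#{y}]" }.join("\n")
  end
end

# Invented sample input; the middle line has no node marker.
sample = [
  "INFO comm: Wrote image, msg: <n_1_10>",
  "DEBUG comm: unrelated line",
  "INFO comm: Wrote image, msg: <n_8_4>",
]
nodes = extract_nodes(sample)
puts format_nodes(nodes, "nh")
puts format_nodes(nodes, "oh")
```

Keeping extraction and formatting in separate functions makes it possible to check the parsing against a few saved log lines without running an experiment first.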



