ORBIT-USER: most of the grid does not work!

chris at orderonenetworks.com
Fri Feb 16 14:06:16 EST 2007


Ivan and Andrea,

I'd prefer to have access to all the nodes - since the quantity counts :)

I've come up with a little Ruby script that may help. It parses
the output from the last experiment and generates a list of nodes that
imaged successfully. The list can be formatted in either 'nodehandler'
format (for use with imageNodes4) or 'orbithandler' format.

This lets you start imaging all the nodes, then stop the process
once enough of them have imaged. Run the script to make the list of nodes
that passed, and then use that list for the rest of your experiment.

The code is new, so please let me know about any bugs.

Thanks,
Chris


sample output:
------------------------------
> gne nh

defTopology('my:topo') { |t|
    t.addNode(1,10);
    t.addNode(8,4);
    t.addNode(9,1);
}


script (save as gne) (make sure to type 'chmod a+x gne' so it can run):
-----------------------------------------
#!/usr/bin/ruby
# Author: Chris Davies (chris at orderonenetworks.com)

# determine the output type
if (ARGV[0] == nil || (ARGV[0] != "nh" &&  ARGV[0] != "oh")) then
    puts " Good Node Extractor v1.0"
    puts "  usage: gne nh/oh"
    puts " "
    puts "   This utility extracts the nodes that imaged successfully from the"
    puts "   last experiment that was run. The parameter specifies the output type"
    puts "      nh - node handler format for use with imageNodes4"
    puts "      oh - orbit handler format"
    puts " "
    exit
end

# get the most recent log file (-t sorts by modification time, newest first)
command = "ls -1 -t /tmp/*.log"

firstfile = IO.popen(command,"r").readlines[0]

if (firstfile == nil) then
    puts "Error: log file not found."
    exit
end

# grep the file for nodes that imaged
command = "grep Wrote " + firstfile

if (ARGV[0] == "nh") then
    puts "defTopology('my:topo') { |t|"
end

IO.popen(command,"r").each { |line|
    # keep only the text between 'msg: <n_' and '>', e.g. "1_10"
    name = line.split('msg: <n_')[1]
    next if name == nil    # skip lines that carry no node name

    first, second = name.split('>')[0].to_s.split('_')
    next if first == nil || second == nil

    if (ARGV[0] == "nh")
        puts "    t.addNode(" + first + "," + second + ");"
    else
        puts "["+first+","+second+"]"
    end
}

if (ARGV[0] == "nh")
    puts "}"
end
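
For anyone who would rather avoid the calls to ls and grep, here is a rough pure-Ruby sketch of the same steps: find the newest log, pull out the node coordinates, and print them in either format. The helper names (newest_log, good_nodes, format_nodes) and the sample log lines are made up for illustration; the only thing taken from the script above is the 'msg: <n_X_Y>' pattern on lines containing "Wrote". The real log format may differ, so treat this as a starting point rather than a drop-in replacement.

```ruby
# Sketch only: the sample log lines below are hypothetical and merely
# mimic the 'msg: <n_X_Y>' pattern that the gne script splits on.

# Return the most recently modified file matching pattern, or nil if none.
def newest_log(pattern)
  Dir.glob(pattern).max_by { |path| File.mtime(path) }
end

# Extract [x, y] pairs from lines that contain "Wrote" and carry a
# node name of the form n_X_Y right after 'msg: <'.
def good_nodes(lines)
  lines.map { |line|
    next unless line.include?("Wrote")
    m = line.match(/msg: <n_(\d+)_(\d+)/)
    m && [m[1].to_i, m[2].to_i]
  }.compact
end

# Render the node list in 'nh' (nodehandler) or 'oh' (orbithandler) form.
def format_nodes(nodes, type)
  if type == "nh"
    body = nodes.map { |x, y| "    t.addNode(#{x},#{y});" }
    (["defTopology('my:topo') { |t|"] + body + ["}"]).join("\n")
  else
    nodes.map { |x, y| "[#{x},#{y}]" }.join("\n")
  end
end

sample = [
  "INFO msg: <n_1_10> Wrote 1024 bytes",   # hypothetical line
  "INFO msg: <n_8_4> Wrote 1024 bytes",    # hypothetical line
  "WARN msg: <n_2_2> timed out"            # no "Wrote", so it is skipped
]
puts format_nodes(good_nodes(sample), "nh")
```

To run it against a real log, something like good_nodes(File.readlines(newest_log("/tmp/*.log"))) should work, provided at least one log file exists (newest_log returns nil otherwise).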






> Hi Andrea,
>
> If everybody agrees, we can try that. The problem is that people who
> want to have as many nodes as possible even if they are not 100%
> guaranteed to come up lose big time (once we declare nodes as
> "administratively down" you can't access them at all). What do others
> think about it?
>
> Ivan.
>
> -----Original Message-----
> From: owner-orbit-user at winlab.rutgers.edu
> [mailto:owner-orbit-user at winlab.rutgers.edu] On Behalf Of Andrea G Forte
> Sent: Thursday, February 15, 2007 12:17 PM
> To: orbit-user at winlab.rutgers.edu
> Subject: Re: ORBIT-USER: most of the grid does not work!
>
> Ivan,
>
> thank you. This might be very helpful but perhaps I need to understand
> it better. What is the difference between a node being off and being
> unavailable? Currently the grid shows only 4 nodes as unavailable, but
> from my experiments there is a very large number of nodes that do not
> turn on and others that turn on but do not complete the imaging process.
>
> In my opinion it would be very helpful to mark all of these nodes as
> unavailable and just turn them off. In this way we would be able to
> image the good nodes with minimum effort and without having to start the
> imaging process again and again because of nodes getting stuck.
> In other words, nodes that get stuck or do not turn on cause only
> problems and should be disconnected.
>
> -Andrea
>
>
> Ivan Seskar wrote:
>>
>>
>>
>>> From: owner-orbit-user at winlab.rutgers.edu
>>> [mailto:owner-orbit-user at winlab.rutgers.edu] On Behalf Of Mesut Ali Ergin
>>> Sent: Wednesday, February 14, 2007 10:38 PM
>>> To: orbit-user at winlab.rutgers.edu
>>> Subject: Re: ORBIT-USER: most of the grid does not work!
>>>
>>
>> ...
>>
>> Just to add to this discussion: we are having problems with node power
>> supplies (the ones with the red dots on the status page actually have
>> dead power supplies). Unfortunately, the first symptoms of failing PSs
>> are CM lockups and nodes stuck in on or off state; it looks like we will
>> have to replace all of them, which is not a trivial thing to do. We are
>> trying to find an interim software solution that will
>> (hopefully) prolong the life of the power supplies as well as enable us to
>> do incremental replacement (rather than force us to shut down the grid
>> and replace all power supplies at once).
>>
>> Ivan.
>>
>> PS: An even better page for big grid status is
>> http://www.orbit-lab.org/wiki/Status/Grid - it has a webcam feed as well
>> :-). (The status pages do not auto-refresh, so you will have to do it
>> manually - after all, they are not really finished yet, as you will
>> discover if you try to select individual nodes.)
>>
>
>
>




