Group Experimentation Support

Experiment scheduler

Installing and configuring packages on Ubuntu

This section describes how to set up torque PBS. If it's already set up, skip this section. First, we install the torque PBS system:

apt-get install torque-server torque-scheduler torque-mom torque-client

Then stop all the running torque processes:

/etc/init.d/torque-mom stop
/etc/init.d/torque-scheduler stop
/etc/init.d/torque-server stop

Create the PBS server (say "yes" when prompted):

pbs_server -t create

Then kill the PBS server process:

killall pbs_server

Set up the PBS server:

echo $(hostname -f) > /etc/torque/server_name
echo $(hostname -f) >  /var/spool/torque/server_priv/acl_svr/acl_hosts
echo $(hostname -f) > /var/spool/torque/mom_priv/config

echo root@$(hostname -f) > /var/spool/torque/server_priv/acl_svr/operators
echo root@$(hostname -f) > /var/spool/torque/server_priv/acl_svr/managers

echo "$(hostname -f) np=4" > /var/spool/torque/server_priv/nodes

If you have a line in your /etc/hosts file that resolves your hostname to 127.0.1.1, you have to comment it out, e.g.

#127.0.1.1  console.grid.orbit-lab.org  console
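
If you'd rather not edit the file by hand, the same change can be sketched in Ruby. This is only an illustration: comment_out_loopback_alias is a hypothetical helper name, and the sample text stands in for a real /etc/hosts file.

```ruby
# Sketch: comment out any active hosts line whose address is 127.0.1.1.
# comment_out_loopback_alias is a hypothetical helper, not part of torque.
def comment_out_loopback_alias(hosts_text)
  hosts_text.lines.map do |line|
    # Lines that are already commented start with '#' and do not match.
    line =~ /\A\s*127\.0\.1\.1\s/ ? "#" + line : line
  end.join
end

# Sample input standing in for /etc/hosts (assumed content).
sample = <<~HOSTS
  127.0.0.1  localhost
  127.0.1.1  console.grid.orbit-lab.org  console
HOSTS

puts comment_out_loopback_alias(sample)
```

To apply it for real you would read /etc/hosts, write the result back as root, and keep a backup copy first.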

Once you've done that, start everything back up again:

/etc/init.d/torque-server start
/etc/init.d/torque-scheduler start
/etc/init.d/torque-mom start

Now we'll set up some configuration values:

qmgr -c "set server scheduling = True"
qmgr -c "set server acl_host_enable = True"
qmgr -c "set server acl_hosts = $(hostname -f)"
qmgr -c "set server allow_node_submit = True"

You'll have to run the commands above as root, since you've set up the root user as the only PBS operator and manager.

Next we'll set up queues: one for each node.

For various reasons, we've decided to make a queue per node and have the console be the single "compute" node, instead of having the nodes be the "compute" nodes. (Mainly because this way we can still use legacy disk images and don't have to configure the nodes to work with torque.) It might seem "neater" to use torque "nodes" instead of queues, because that would make it simpler to run an experiment with multiple nodes. In practice, though, multi-node experiments would still be awkward, because you generally care which specific nodes you get (e.g. you want nodes that are close to each other).

First we get a list of nodes, then we'll set up a queue for each one:

list=$(omf stat -t system:topo:all | grep "Node:" | awk -F" " '{print $2}' | cut -f1 -d$'.')

for l in $list
do 
    qmgr -c "create queue $l"
    qmgr -c "set queue $l queue_type = Execution"
    qmgr -c "set queue $l max_running = 1"
    qmgr -c "set queue $l enabled = True"
    qmgr -c "set queue $l started = True"
done

(Again, this must be done as root or as a queue manager.)
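
To preview what the loop will do before running it as root, here is a rough Ruby sketch that mirrors the shell pipeline and prints the qmgr commands instead of executing them. The helper names (node_names, queue_commands) are made up for this example, and the sample "omf stat" lines are an assumed format, not captured output.

```ruby
# Sketch: mirror `grep "Node:" | awk '{print $2}' | cut -f1 -d.` in Ruby.
def node_names(stat_output)
  stat_output.lines
             .select { |line| line.include?("Node:") }
             .map { |line| line.split[1].to_s.split(".").first }
             .compact
end

# Build (but do not run) the qmgr commands for one queue.
def queue_commands(node)
  [
    "create queue #{node}",
    "set queue #{node} queue_type = Execution",
    "set queue #{node} max_running = 1",
    "set queue #{node} enabled = True",
    "set queue #{node} started = True"
  ].map { |directive| %(qmgr -c "#{directive}") }
end

# Assumed sample lines; the real `omf stat` output may look different.
sample = <<~OUT
  Node: node1-1.grid.orbit-lab.org   State: POWEROFF
  Node: node1-2.grid.orbit-lab.org   State: POWERON
OUT

node_names(sample).each do |node|
  queue_commands(node).each { |cmd| puts cmd }
end
```

Printing the commands first makes it easy to spot a badly parsed node name before any queue is created.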

Now we'll disable all the queues, because we are going to re-enable them selectively.

list=$(qstat -Q | tail -n+3 | awk -F" " '{print $1}')

for l in $list
do 
    qdisable "$l"
done

Check with

qstat -Q
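
If there are many queues, checking the listing by eye gets tedious. A small Ruby sketch like the following could pick out any queue that is still enabled. It assumes the listing has a header row containing an "Ena" column followed by a dashed separator row, as in the sample; enabled_queues is a hypothetical helper name.

```ruby
# Sketch: return the names of queues whose "Ena" column reads "yes".
def enabled_queues(qstat_output)
  rows = qstat_output.lines.map(&:split).reject(&:empty?)
  header = rows.first
  ena = header && header.index("Ena")
  return [] if ena.nil?
  # Skip the header and separator rows, like `tail -n+3` in the shell loop.
  rows.drop(2).select { |fields| fields[ena] == "yes" }.map(&:first)
end

# Assumed sample of `qstat -Q` output, for illustration only.
sample = <<~OUT
  Queue              Max   Tot   Ena   Str   Que   Run   Hld   Wat   Trn   Ext T   Cpt
  ----------------   ---   ---   ---   ---   ---   ---   ---   ---   ---   --- -   ---
  node1-1              1     0    no   yes     0     0     0     0     0     0 E     0
  node1-2              1     0   yes   yes     0     0     0     0     0     0 E     0
OUT

puts enabled_queues(sample).inspect
```

An empty list means every queue was disabled successfully.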

Scheduler utility script

Retrieve the scheduler utility script with

wget http://witestlab.poly.edu/repos/misc/omf

then make it executable

chmod a+x omf

You may run it locally (./omf) or put it in /usr/local/bin or a similar location.

Enabling/disabling "scheduled" experiments

The easiest way to turn the scheduler on and off is to put the scheduler utility script in /usr/local/bin/omf (assuming the "real" omf is in /usr/bin). When you want it to be turned on, make it executable:

chmod a+x /usr/local/bin/omf

and when you want it to be turned off, turn off the execute bit:

chmod a-x /usr/local/bin/omf
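
The reason this works is command lookup order: a file without the execute bit is skipped, so the next omf on the search path is used. The Ruby sketch below imitates that lookup in a temporary directory. It assumes /usr/local/bin comes before /usr/bin in the PATH; resolve_command and the directory names are made up for the illustration, and the check looks only at the execute bits that chmod a+x / a-x toggle.

```ruby
require "fileutils"
require "tmpdir"

# Sketch: return the first candidate that is a regular file with an
# execute bit set, roughly how a shell picks a command from PATH.
def resolve_command(name, dirs)
  dirs.map { |dir| File.join(dir, name) }
      .find { |path| File.file?(path) && (File.stat(path).mode & 0o111) != 0 }
end

Dir.mktmpdir do |root|
  # local_bin and usr_bin stand in for /usr/local/bin and /usr/bin.
  local_bin = File.join(root, "local_bin")
  usr_bin = File.join(root, "usr_bin")
  [local_bin, usr_bin].each do |dir|
    FileUtils.mkdir_p(dir)
    File.write(File.join(dir, "omf"), "#!/bin/sh\n")
  end
  FileUtils.chmod("a+x", File.join(usr_bin, "omf"))

  wrapper = File.join(local_bin, "omf")
  FileUtils.chmod("a+x", wrapper)  # scheduler on: the wrapper is found first
  puts resolve_command("omf", [local_bin, usr_bin])
  FileUtils.chmod("a-x", wrapper)  # scheduler off: lookup falls through
  puts resolve_command("omf", [local_bin, usr_bin])
end
```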

On WITest, as part of the script that runs at the beginning of each reservation, the scheduler script is turned on if it is a group reservation and off if it isn't, and all queues are disabled.

It is then up to the group "leader" (i.e. tutorial instructor, teaching assistant, etc.) to enable individual queues for the experiment that the group will run. The group "leader" also needs to know to use the full path /usr/bin/omf for "load" commands, which are disabled in the utility script (because it is generally not desirable to load images in a group reservation.)

Making OMF scripts that work with the scheduler

To work with the scheduler utility script, the OMF experiment script should define a property 'node' that specifies the resource to use. Its value should also be the name of a queue. For an experiment that requires multiple resources, the additional resources should be derived from 'node', e.g. with a dictionary.

For example, this script defines a "sndnode" and a "rcvnode":

defProperty('prefix', '', "Prefix for HRN")
defProperty('suffix', '.grid.orbit-lab.org', "Suffix for HRN")
defProperty('node', 'node3-19', "ID of sender node, will be passed by job scheduler")

# Set up node pair
sndnode = "#{property.prefix}#{property.node}#{property.suffix}"
assignments = {"node3-19" => "node3-18", "node14-10" => "node14-11"}
rcvnode = assignments["#{property.node}"].to_s
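
One thing to watch in the lookup above: a node that is missing from the dictionary yields nil, and .to_s turns that into an empty receiver name. The sketch below shows the same pairing idea in plain Ruby (outside OMF) with a stricter lookup. ASSIGNMENTS and resolve_pair are illustrative names, not part of OMF or the scheduler script.

```ruby
# Sketch: resolve a sender/receiver pair from the 'node' value.
ASSIGNMENTS = { "node3-19" => "node3-18", "node14-10" => "node14-11" }.freeze

def resolve_pair(node, prefix: "", suffix: ".grid.orbit-lab.org")
  # fetch raises KeyError for an unknown node instead of silently
  # producing an empty receiver name.
  receiver = ASSIGNMENTS.fetch(node)
  ["#{prefix}#{node}#{suffix}", receiver]
end

sender, receiver = resolve_pair("node3-19")
puts "sender=#{sender} receiver=#{receiver}"
```

Failing early like this may be preferable in a group setting, where a job submitted to the wrong queue would otherwise run with no receiver at all.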

To report status to the dashboard, the OMF experiment script should define an EXP_SUCCESS event. For example:

require 'sqlite3' 
# OMF 5.4.1+ no longer has the 'ms' to access experiment data so we need to use
# this workaround, which is not so nice because of database locking :(
defEvent(:EXP_SUCCESS, 5) do |event|
  sq3Filename= "/var/lib/oml2/#{Experiment.ID}.sq3" 
  if File.file?(sq3Filename)
    db = SQLite3::Database.new(sq3Filename)
    begin
      if db.get_first_value( "SELECT COUNT(seq_no) FROM otr2_udp_in;" ).to_i >= 10
        event.fire(:comment=>"Received at least 10 measurements",:node=>"#{property.node.value}")
      end
    rescue
      # Do nothing in case database is locked or table doesn't exist yet
    end
  end
end

onEvent(:EXP_SUCCESS) do |event|
  info "Experiment success (measured ten incoming UDP packets)"
end
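
The success check can also be tried on its own, without an experiment or a database. In this sketch, enough_measurements? is a hypothetical helper; the block passed to it supplies the current row count, and any error raised while counting is treated as "not yet", like the bare rescue in the event above.

```ruby
# Sketch: true once the count supplied by the block reaches the threshold.
def enough_measurements?(threshold)
  yield.to_i >= threshold
rescue StandardError
  # A locked database or a missing table counts as "not yet".
  false
end

puts enough_measurements?(10) { 12 }
puts enough_measurements?(10) { raise "database is locked" }
```

Inside the event, the block would wrap the query, e.g. enough_measurements?(10) { db.get_first_value("SELECT COUNT(seq_no) FROM otr2_udp_in;") }.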

Using the scheduler

Some important commands follow.

To enable a queue named e.g. node13-10 so that it starts accepting jobs (user must be listed as a torque operator or manager):

qenable node13-10

To disable a queue named e.g. node13-10 so that it will finish currently queued jobs but not accept new ones (user must be listed as a torque operator or manager):

qdisable node13-10

To see the current torque server configuration:

qmgr -c 'p s'

To add a user e.g. ffund01 as a torque manager:

qmgr -c "set server managers += ffund01@console.grid.orbit-lab.org"

To kill a job with job ID e.g. 13:

qdel 13

To see currently queued and running jobs (all if you're a torque manager, otherwise your own):

qstat

OMF with reporting

Patching OMF EC

Make some changes to the nodeHandler.rb file of the OMF EC:

Near the top, require the oml4r gem and define the measurement points:

require 'oml4r'

class OmfNodeMP < OML4R::MPBase
  name :node
  param :user
  param :expID
  param :node
  param :property
  param :value
  param :status
end
class OmfEventMP < OML4R::MPBase
  name :event
  param :user
  param :expID
  param :event
  param :hash
end
class OmfExperimentMP < OML4R::MPBase
  name :experiment
  param :user
  param :expID
  param :property
  param :value
end
class OmfNetMP < OML4R::MPBase
  name :net
  param :expID
  param :node
  param :property
  param :value
  param :status
end
class OmfNodeStatMP < OML4R::MPBase
  name :nodestat
  param :expID
  param :node
  param :status
end

After the definition of interactive?, add

  #
  # Return the OMLize state of the Node Handler
  #
  # [Return] true/false
  # 
  def self.omlize?
    self.instance.omlize?
  end

  def omlize?
    @omlize
  end

  def self.injectNode node,property,status,value
    OmfNodeMP.inject("#{Experiment.User}","#{Experiment.ID}","#{node}","#{property}","#{value}","#{status}")
  end

  def self.injectEvent  event, hash
    OmfEventMP.inject("#{Experiment.User}","#{Experiment.ID}", "#{event}", "#{hash}")
  end

  def self.injectExperiment  property, value
    OmfExperimentMP.inject("#{Experiment.User}","#{Experiment.ID}", "#{property}", "#{value}")
  end

  def self.injectNet node, property, value, status
    OmfNetMP.inject("#{Experiment.ID}", "#{node}", "#{property}", "#{value}", "#{status}")
  end

  def self.injectNodeStat node, status
    OmfNodeStatMP.inject("#{Experiment.ID}", "#{node}", "#{status}")
  end

Before the line Signal.trap('SIGINT') {Experiment.interrupt}, add

    if omlize?
      opts = {:appName => 'omf',
        :id => 'expctl',
        :domain => 'experiment',
        :omlCollectUri => 'tcp:localhost:3003'}
      OML4R::init([], opts)
    end

After the other opts.on definitions, add

    opts.on("--omlize",
    "Report the experiment controller status to OML") { @omlize = true }

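To see how such a switch behaves in isolation, here is a small stand-alone sketch using Ruby's OptionParser. It assumes that opts in nodeHandler.rb is an OptionParser instance, which the opts.on call suggests; parse_omlize is a hypothetical helper and is not part of the patch.

```ruby
require "optparse"

# Sketch: a boolean --omlize switch that defaults to off.
def parse_omlize(argv)
  omlize = false
  parser = OptionParser.new
  parser.on("--omlize",
            "Report the experiment controller status to OML") { omlize = true }
  parser.parse(argv)  # non-destructive: argv itself is left untouched
  omlize
end

puts parse_omlize(["--omlize"])
puts parse_omlize([])
```

Because the flag defaults to off, an EC patched this way should behave as before unless --omlize is given.
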
In experiment.rb:

Add before attr_reader :domain:

  @@user = %x(whoami).downcase.chomp

and after def Experiment.ID:

  # 
  # Return the user
  #
  def Experiment.User
    return @@user
  end

In traceState.rb,

Inside the definition of self.experiment, at the beginning,

    if NodeHandler.instance.omlize?
      NodeHandler.injectExperiment command, arg
    end

Inside the definition of self.nodeStatus at the end,

    if NodeHandler.instance.omlize?
      NodeHandler.injectNodeStat node, status
    end

Inside the definition of self.nodeConfigure, at the end,

    if NodeHandler.instance.omlize?
      NodeHandler.injectNode node, name, value, status
    end

Inside the definition of self.nodeOnAppEvent, at the end,

    if NodeHandler.instance.omlize?
      NodeHandler.injectNode node, appName, eventName, message
    end

Inside the first definition of onEvent, at the top,

    if NodeHandler.instance.omlize?
      NodeHandler.injectNode node, "event", eventName, message
    end

Inside the second definition of onEvent, at the top,

    if NodeHandler.instance.omlize?
      NodeHandler.injectNode node, op, message, eventName
    end

In event.rb,

After if @@events[@name][:fired],

            if NodeHandler.instance.omlize?
              NodeHandler.injectEvent @name, @@events[@name][:actionOptions].inspect
            end

Usage instructions

Experiment dashboard

Installing/configuring packages on Ubuntu

Install a JavaScript runtime:

sudo apt-get install nodejs

Get the dashing gem:

gem install dashing
gem install bundler
gem install execjs --version 1.4.0  # see https://github.com/Shopify/dashing/issues/195
gem uninstall execjs --version 2.0.2

In the dashing gemspec (wherever it is located), you may have to change the version for the "execjs" dependency to 1.4.0.

Get the source code for the dashboard. cd to that directory, then run

bundle
dashing start

Source code for experiment dashboard

Usage

To see the dashboard, set up an SSH tunnel through gw.orbit-lab.org:

ssh -L 9000:grid.orbit-lab.org:3030 gw.orbit-lab.org

(specify your username if necessary),

then visit http://localhost:9000 in a browser.

Last modified on Jun 28, 2016, 5:45:16 PM
