HOW TO USE THE SEYCHELLES CLUSTER
Access to the SEYCHELLES CLUSTER is restricted to approved accounts.
To get your account approved, send email to help@nmr.mgh.harvard.edu
asking for access and outlining what you plan to use the cluster for.
IMPORTANT! Before using the cluster, make sure to subscribe to the
batch-users mailing list so you get email about changes and outages
on the cluster. Go to:
https://mail.nmr.mgh.harvard.edu/mailman/listinfo/batch-users
To use the cluster, you must first ssh to seychelles which is the cluster
master. DO NOT run any analysis programs on seychelles itself. Instead,
you must submit jobs into the PBS queue system to be run on the batch
nodes when it is your job's turn. Normal jobs are submitted with the PBS
command 'qsub' though most users will want to use the 'pbsubmit' wrapper
script described below. Here is a quick list of commonly used commands:
    qsub      Submit (i.e. create) jobs
    qdel      Delete jobs
    qalter    Modify attributes of job in queue
    qhold     Put job on hold
    qrls      Release hold on job
    qstat     Check status of job(s)
Each command has a man page with more info (e.g. 'man qsub').
There are six queues in the seychelles PBS system that define the
priority given to jobs. These are:
general -> captain -> lieutenant -> sergeant -> corporal -> private
By default jobs are run in the corporal queue. Other queues can
be selected by giving the '-q queuename' option to qsub or pbsubmit.
To select a queue higher than sergeant you will need to be given
special permission.
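As a minimal sketch of queue selection, the snippet below builds a pbsubmit
command line using the '-q queuename' option described above. The helper
name build_submit and the job command ./myjob.sh are hypothetical. The
sketch only prints the command so you can review it before running it
on seychelles.

```shell
# Hypothetical helper: print a pbsubmit command line for a given queue.
# Only the '-q' and '-c' options described in this document are used.
build_submit() {
    queue="$1"   # e.g. corporal (the default) or sergeant
    cmd="$2"     # the non-interactive command line to run
    printf "pbsubmit -q %s -c '%s'\n" "$queue" "$cmd"
}

# Print the command for a job in the sergeant queue.
# prints: pbsubmit -q sergeant -c './myjob.sh input.dat'
build_submit sergeant "./myjob.sh input.dat"
```

Leaving out '-q' entirely gives you the default corporal queue.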
The available batch nodes break down as follows:
    Description                  Properties             Qty
    -----------                  ----------             ---
    AMD 1.57GHz (1GB RAM)        amd                      4
    P4 3.0GHz (2GB RAM)          P4, new                  3
    Dual Xeon 2.2GHz (3GB RAM)   xeon, new               14
    Dual Opteron 246 (4GB RAM)   opteron, new            90
    Dual Opteron 248 (16GB RAM)  opteron, bigmem, new    13
UPDATE: As of June 2010, only opteron nodes still exist in
the seychelles cluster.
The properties can be used to choose a certain set of nodes
by giving qsub or pbsubmit the '-l nodes=1:property' option, replacing
'property' with one of P4, xeon, opteron or new from above.
Therefore, if you want your jobs to run only on the faster opteron nodes
that have more RAM, submit your jobs with the option '-l nodes=1:opteron'.
Realize that if everyone starts doing this and the queue gets full, your
jobs will stay queued while the P4/Xeon nodes just sit idle. So do this
only if (a) the queue is mostly unused, or (b) your jobs require the
larger RAM.
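The node selection above can be sketched as follows. The helper name
node_submit and the job command ./myjob.sh are hypothetical, and the list
of accepted properties is simply copied from the table above. The sketch
prints the pbsubmit command rather than running it.

```shell
# Hypothetical helper: print a pbsubmit command restricted to nodes
# that carry one property from the node table.
node_submit() {
    prop="$1"   # node property, e.g. opteron
    cmd="$2"    # the non-interactive command line to run
    case "$prop" in
        amd|P4|xeon|opteron|bigmem|new) ;;
        *) echo "unknown node property: $prop" >&2; return 1 ;;
    esac
    printf "pbsubmit -l nodes=1:%s -c '%s'\n" "$prop" "$cmd"
}

# Print the command for a job that should run only on opteron nodes.
node_submit opteron "./myjob.sh input.dat"
```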
Two nodes, a dual Xeon box named 'reward' and a dual Opteron box
named 'oct', have been taken out of the queue and set up so that any
user can ssh in directly. These are to be used for testing your jobs
interactively (32-bit and 64-bit respectively) before submitting
to the main queue.
We have written a script called pbsubmit as a wrapper around the PBS
qsub command. To submit a job (that must be non-interactive) to the
queue, run
pbsubmit -c 'cmd arg ...'
where 'cmd arg ...' is the command line you want run. It will report
two numbers back to you: a batch queue number and a pbsjob number.
You can check the status of your job by running 'qstat'. The output
and error from the job will be in ~/.pbsrc (only accessible on
seychelles) in files marked by the pbsjob number. At the job end,
there should be 4 files. For example:
-rwxr-xr-x 1 raines raines 543 Feb 10 12:40 pbsjob_3
-rw------- 1 raines raines 0 Feb 10 12:47 pbsjob_3.e16996
-rw------- 1 raines raines 1140 Feb 10 12:47 pbsjob_3.o16996
-rw-r--r-- 1 raines raines 377 Feb 10 12:47 pbsjob_3.status
The first is the script pbsubmit actually submits to the queue system
with the 'qsub' command. The second is the stderr output, the third is
the stdout output, and the last is the job status log. Run 'pbsubmit -h'
to get a list of other options like getting email notification.
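To find the files for one job, something like the sketch below may help.
The helper name job_files and the PBSRC_DIR override are hypothetical;
by default it looks in ~/.pbsrc as described above and matches the four
file names shown in the listing.

```shell
# Hypothetical helper: list the files that belong to one pbsjob number.
# PBSRC_DIR is an optional override; the default is ~/.pbsrc.
job_files() {
    n="$1"
    dir="${PBSRC_DIR:-$HOME/.pbsrc}"
    for f in "$dir/pbsjob_$n" "$dir/pbsjob_$n".e* \
             "$dir/pbsjob_$n".o* "$dir/pbsjob_$n.status"; do
        # Skip patterns that matched nothing.
        [ -e "$f" ] && echo "$f"
    done
    return 0
}

# List whatever exists for pbsjob number 3 (prints nothing if none).
job_files 3
```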
NOTE: The files in ~/.pbsrc are erased on a regular basis. If for
some reason you want to keep the output or other files for longer
than a few days after the job finished, copy them elsewhere.
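One way to keep a job's files is sketched below. The helper name
save_job_files, the PBSRC_DIR override and the destination directory in
the usage comment are all hypothetical; it just copies every file for
one pbsjob number out of ~/.pbsrc.

```shell
# Hypothetical helper: copy all files for one pbsjob number to a
# directory of your choice. PBSRC_DIR is an optional override; the
# default source is ~/.pbsrc.
save_job_files() {
    n="$1"
    dest="$2"
    src="${PBSRC_DIR:-$HOME/.pbsrc}"
    mkdir -p "$dest" || return 1
    for f in "$src/pbsjob_$n" "$src/pbsjob_$n".*; do
        # Skip patterns that matched nothing; keep timestamps with -p.
        [ -e "$f" ] && cp -p "$f" "$dest/"
    done
    return 0
}

# Usage (the destination is only an example):
#   save_job_files 3 "$HOME/pbs_archive/job3"
```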
The 'jobinfo' command can be used to get some basic info on queued,
running and completed jobs (qstat only works on queued and running
jobs). Run 'jobinfo '. Also, to see the stdout of a running job,
run 'jobinfo -o ' and to see the stderr run 'jobinfo -e '.
Jobs are by default limited to 72 hours. If you need more time,
you will have to email help@nmr to ask for the limit to be extended.
==== MATLAB
The center has a limited number of MATLAB licenses. You can run no more
than 10 MATLAB jobs at once across all locations, including launchpad,
seychelles or your group workstations. You also may not use toolboxes in
your jobs on the cluster.
To automate matlab jobs on the cluster, first create a *.m file with your
actual matlab commands to run. The last line of the script should be 'exit'.
Set up your environment and then give a command like the following to
pbsubmit's -c option:
matlab.new -nodisplay -nodesktop -nojvm -r matlabfile
Do not include the ".m" from your matlab script name. In the example above,
the script is "matlabfile.m" and this must exist in the current directory
or in your MATLAB path.
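Putting the MATLAB steps together, a minimal sketch might look like the
following. The contents of matlabfile.m are made up for illustration. The
sketch writes the script into the current directory and prints the
pbsubmit command instead of running it.

```shell
# Write a small example MATLAB script. The commands are placeholders;
# the important part is that the last line is 'exit'.
cat > matlabfile.m <<'EOF'
x = magic(4);
disp(sum(x(:)));
exit
EOF

# Name of the script WITHOUT the ".m" extension.
script=matlabfile

# Print the submit command using the matlab.new invocation from above.
echo "pbsubmit -c 'matlab.new -nodisplay -nodesktop -nojvm -r $script'"
```

Without the final 'exit', MATLAB would presumably sit waiting at its
prompt and hold a license until the job hits its time limit.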
==== INTERACTIVE JOBS
If you have a program you would like to run on a node that requires
interactive use such as something using the Matlab GUI, then instead of
pbsubmit run:
qsub -I -X
Your shell will go into the queue and when a node is free you will get a
shell login on a node in the cluster. In that shell, you should be able
to run stuff with GUI/X-windows. When you are done, exit off the node.
Do not leave these interactive node logins running when you are not actively
using them. Save your state, exit your program(s) and the node. Then go
through the qsub process again the next day or when you next need to run an
interactive job. This is because if you are not actively using the program or
having it do a calculation, you are wasting compute cycles that could be used
by others waiting in the batch queue. Normally you should run interactive
jobs like these on your own local Linux workstation at your desk if you have
one.
==== IMPORTANT INFO
The PBS commands hate it when your .cshrc or .login files produce
output on login. In fact they will fail. So you will need to modify
these files if they write any lines to the terminal other than your
prompt on login.
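A rough way to check for stray output is sketched below. The helper name
check_quiet is hypothetical. The commented csh example assumes csh is
installed and relies on csh reading ~/.cshrc at startup; it does not
exercise .login, which is only read by login shells.

```shell
# Hypothetical helper: succeed only if the given command prints
# nothing to stdout or stderr.
check_quiet() {
    out=$("$@" 2>&1)
    if [ -n "$out" ]; then
        echo "unexpected output:"
        echo "$out"
        return 1
    fi
    echo "no output, OK"
}

# Demonstration with a command that is always silent.
check_quiet true

# To test your own startup file (assumes csh is installed):
#   check_quiet csh -c true
```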
Keep in mind these rules:
* Never run jobs on seychelles itself
* Never directly ssh to any of the nodes to run jobs
(you can ssh to them solely to check on things. Run 'qstat -n'
to see the node your job is running on, ssh to it and look in
/var/spool/PBS/spool).
* If you have jobs that do a huge amount of I/O at the start, space
out the submissions.
* DO NOT USE MATLAB unless you can limit it to 10 jobs or less at one
time (the seychelles cluster is limited to 60 Matlab jobs at once
for all users). The limit for EXTERNAL users is 5 jobs.
* DO NOT use any MATLAB toolboxes, just base MATLAB. If you are not
using a toolbox on your desktop, you can use it on JUST ONE
of the nodes.
* Fix your .cshrc/.login to have no output to stdout or stderr
* Jobs are limited to 72 hours.
* Keep usage to 15 jobs during the day and 40 at night. The limit
for EXTERNAL users is 8 jobs during the day and 20 at night. If you