MARTINOS CENTER COMPUTE CLUSTER

NOTICE

Access to the Compute Clusters is restricted to approved accounts. To get your account approved, send email to help@nmr.mgh.harvard.edu asking for access and outline what you plan to use the clusters for.

Before using the cluster, make sure to subscribe to the batch-users mailing list so you get email about changes and outages on the cluster. Go to:    https://mail.nmr.mgh.harvard.edu/mailman/listinfo/batch-users

Table of Contents

Introduction
Creating/Preparing Jobs
Submitting Jobs
Memory and CPU Limitations on Jobs
Number and Time Limits on Jobs
How Queue Priority Works
File I/O Limitations of the Cluster
Using MATLAB on the Cluster
Temporary Disk Space
Interactive Jobs
Launchpad Use Summary
Tensor Cluster Differences
Launchpad FAQ

INTRODUCTION

The Martinos Center has two compute clusters, named launchpad and tensor (formerly seychelles). launchpad is the much newer of the two, with faster CPUs and more memory, and should normally be the cluster everyone uses. tensor should be used for the special I/O needs discussed below or when the queue on launchpad is backed up.

To use either cluster, you must first ssh to the master node, which has the same name as the cluster (e.g. ssh launchpad). DO NOT run any analysis programs on the master node itself. Instead, you must submit jobs into the PBS/Torque queue system to be run on the batch nodes when it is your job's "turn". Normally jobs are submitted with the PBS/Torque command 'qsub', though most users will want to use the 'pbsubmit' wrapper script described below. Here is a quick list of commonly used commands:

pbsubmit   Submit jobs using "easy mode" wrapper
qsub       Submit jobs directly into system (advanced mode)
qstat      List status of running and queued jobs
qdel       Delete jobs
qalter     Modify attributes of a job in the queue
qhold      Put a job on hold
qrls       Release a hold on a job
jobinfo    Get a short report/stats on completed jobs

Each 'q' command has a man page with more info (e.g. 'man qsub'). Run 'pbsubmit -h' for help on pbsubmit and see below.

There are seven generic queues in the PBS system that define the priority given to jobs. From highest to lowest priority, these are:

p60 -> p50 -> p40 -> p30 -> p20 -> p10 -> p5

By default, jobs run in the default queue, which has the same priority as the p10 queue. Other queues can be selected by giving the '-q queuename' option to qsub or pbsubmit. Before submitting jobs to a queue higher than p30, you should email help@nmr.mgh.harvard.edu first for special permission. [Discussion of priority system is on the TODO list]
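
For example, to submit to the p20 queue with pbsubmit (the command string here is just a placeholder for your own command line):

pbsubmit -q p20 -c "cmd arg ..."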

The launchpad cluster, built in 2009, has 127 batch nodes, each with two quad-core Xeon 5472 3.0GHz CPUs and 32GB of RAM. All nodes run the 64-bit version of CentOS 6. Each node can run 8 jobs at once, so there can be up to 1016 jobs running at one time. Sixteen of the nodes are reserved for GPU computing only and have NVIDIA C2050 Tesla GPUs in them. The GPU queue must be specified to run jobs on these nodes.

The tensor cluster, built in 2005, has 108 batch nodes, each with two Opteron 246 CPUs. Most nodes have only 4GB of RAM, but eighteen of them have 16GB (these are given the property 'bigmem'). Each node can run 2 jobs at once.

Users who do not have their own 64-bit workstation in their group for compiling and testing jobs can use the 'tensor' master node for this purpose.

The rest of this page is specific to the launchpad cluster. The Tensor Cluster Differences section near the end discusses the differences to be aware of when using the tensor cluster.


CREATING/PREPARING JOBS

A job that you wish to run on the launchpad cluster should have the following characteristics:

  1. It should run non-interactively, in batch. There is a way to run interactive programs, described in the Interactive Jobs section below, but this is generally discouraged.

  2. It should not write more than about 100MB to standard out/error. If you expect more output than that, make a job script that runs the underlying programs of your job and redirects standard out/error to log files in your own file space.

  3. Your .cshrc and .login files should produce no output on login. The PBS commands hate copious output on login and in fact will often fail because of it. This issue is quite common when users automatically source the FreeSurfer environment and other such things in their .cshrc files. If anything other than your prompt is written to the terminal on login and you have trouble submitting jobs, you may need to modify .cshrc and .login.


SUBMITTING JOBS

Jobs can be submitted using either the Torque qsub command or a script called pbsubmit written specifically at this center as a wrapper around qsub for easier submission. Using qsub directly is an advanced topic that is not covered on this page; the user should read its man page and other online resources.

The syntax for submitting a job with pbsubmit is:

pbsubmit [<options>] -c "cmd arg ..."

where 'cmd arg ...' is the command line you want to run. It will report two numbers back to you: a batch queue number and a pbsjob number. In most cases you should give the full path to your command/script.
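
For example, to submit a script with one argument (the path, script name and argument here are just placeholders):

pbsubmit -c "/path/to/myscript.csh subj01"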

You can check the status of your job by running qstat. The output and error from the job will be in /pbs/USERNAME (only accessible on the master node) in files marked by the pbsjob number. At the job end, there should be 5 files. For example:

-rwxr-xr-x    1 raines   raines        543 Feb 10 12:40 pbsjob_3
-rw-------    1 raines   raines          0 Feb 10 12:47 pbsjob_3.e16996
-rw-------    1 raines   raines        112 Feb 10 12:47 pbsjob_3.env
-rw-------    1 raines   raines       1140 Feb 10 12:47 pbsjob_3.o16996
-rw-r--r--    1 raines   raines        377 Feb 10 12:47 pbsjob_3.status

The first is the script pbsubmit actually submits to the queue system via the 'qsub' command. The second is the stderr output, the third is a list of environment variables at the time of submission, the fourth is the stdout output, and the last is the job status log. Run 'pbsubmit -h' to get a list of other options, such as e-mail notification.

NOTE: the files in /pbs are erased on a regular basis. If for some reason you want to keep the output or other files for longer than a few days after the job finishes, copy them elsewhere.

The jobinfo command can be used to get some basic info on queued, running, and completed jobs (qstat only works on queued and running jobs). Run 'jobinfo <jobnumber>'. To see the stdout of a running job, run 'jobinfo -o <jobnumber>', and to see the stderr of a running job, run 'jobinfo -e <jobnumber>'. For completed jobs, look in your /pbs/USERNAME directory as stated above.

If you want to create dependencies between jobs such that one submitted job does not run until a previously submitted job completes, you can use the "-W" option with the "afterok" keyword as described in the qsub man page.
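
For example, a minimal sketch using qsub directly (the job number 16996 and the script name are placeholders); second_step.csh will not start until job 16996 completes successfully:

qsub -W depend=afterok:16996 second_step.csh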


MEMORY AND CPU LIMITATIONS ON JOBS

Each launchpad node has 32GB RAM and 8 CPU cores. By default a job is assumed to need only a max of 7GB virtual memory and one CPU core. If your jobs need more memory and/or CPUs than that, you will have to specify that when you submit your job. The defaults in the Torque system assume and enforce a 7GB virtual memory limit unless you ask for more when submitting the job.

For those using pbsubmit there is a "-n" option you can use to increase your limit by asking for more "job slots". Each slot corresponds to requesting +7GB virtual memory and +1 CPU core. So if you specify "-n 4", for example, you will be requesting 28GB and 4 CPU cores for your job. The jobinfo command will list the memory used (resident and virtual) by completed jobs. It is virtual memory that the limit applies to, although pbsubmit will try to use the ulimit command to also limit resident memory to 4GB per slot requested.
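
For example, to request 4 slots (28GB of virtual memory and 4 CPU cores; the command string is a placeholder):

pbsubmit -n 4 -c "cmd arg ..."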

If you use qsub directly or you are using the "-l" option to pbsubmit, you will need to specify the number of CPUs and amount of virtual memory you require explicitly in the "-l" option. For example, to specify you want 4 CPUs and 50GB of virtual memory, you would give the option:

-l nodes=1:ppn=4,vmem=50gb

Nothing enforces the CPU limit on nodes, so if your job uses more CPU resources than you requested, you just cause more CPU sharing with other jobs, making your jobs and everyone else's jobs on the node run slower. However, memory limits are enforced, and your job will be killed automatically if it goes over its limit.

IMPORTANT: do not ask for more vmem or CPUs (ppn) in your job than you really need, or you will waste the resources of launchpad and possibly block other users' jobs from running when they could. If you want to test your job, you can ssh to hydra or trabant and run the job with /usr/bin/time (you must type the complete path to override the shell's time command). Ignore the result you see for maxresident, as that does not work on Linux. Instead, take the number of minor pagefaults and multiply by 4 kbytes. This is a fair estimate of the total amount of virtual memory your job took.
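
For example (the numbers here are hypothetical), /usr/bin/time might end its report with something like:

(12major+2000000minor)pagefaults 0swaps

2,000,000 minor pagefaults x 4 kbytes is roughly 8GB of virtual memory, so two job slots ("-n 2", giving 14GB) would be a comfortable request.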


NUMBER AND TIME LIMITS ON JOBS

At present only some queues (see the table in the next section) have hardcoded limits on the number of jobs users can run. This may change. For now we want users to police themselves, using as a general rule no more than 100 CPU jobs during the day and 200 CPU jobs at night. The limit for external/guest users is 40 CPU jobs during the day and 100 CPU jobs at night. Use the showq command and look at the bottom of the ACTIVE JOBS section to see how busy the cluster is. If the cluster seems less than 70% busy, you can submit more than the limits above. However, we reserve the right to kill excess jobs at any time if we see the queue backing up. Email us if you have a huge number of jobs to run and we will assess the situation and tell you how to proceed.

Five queues named max20, max50, max75, max100 and max200 exist so that users can queue a bunch of jobs to these queues and guarantee that only X CPU jobs ever run at one time, where X is the respective number in the queue name. Also, the highest priority queues have max CPU job limits so users do not abuse them.
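
For example, to queue up many jobs while guaranteeing that at most 50 of them run at once (the command string is a placeholder):

pbsubmit -q max50 -c "cmd arg ..."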

By default, in most queues, jobs have a limit of 96 hours to complete before they are killed. If you need more time than that, submit to the extended queue, which has a default limit of 196 hours. If you know your job needs more time than that, you can append a higher number of hours to the "-l" options. For example:

-q extended -l nodes=1,walltime=272:00:00

will request 272 hours to run the job.


HOW QUEUE PRIORITY WORKS

The following table lists all the queues in the launchpad system and their configuration:

Queue          Priority  Nodes    Max CPU/User  Max CPU Total  Default Time Limit
p10, default   10100     nonGPU   150           Unlimited      96 Hours
p20            10200     nonGPU   Unlimited     Unlimited      96 Hours
p30            10300     nonGPU   Unlimited     Unlimited      96 Hours
p40            10400     nonGPU   50            Unlimited      96 Hours
p50            10500     nonGPU   30            Unlimited      96 Hours
p60            10600     nonGPU   20            Unlimited      96 Hours
max20          10100     nonGPU   20            Unlimited      96 Hours
max50          10100     nonGPU   50            Unlimited      96 Hours
max75          10100     nonGPU   75            Unlimited      96 Hours
max100         10100     nonGPU   100           Unlimited      96 Hours
max200         8000      nonGPU   200           Unlimited      96 Hours
matlab         10100     nonGPU   20 jobs       60 jobs        96 Hours
extended       8000      nonGPU   50            250            196 Hours
GPU            90        GPU      Unlimited     Unlimited      96 Hours

The key to understanding the job queue and priority is running the command showq. You will see three sections of jobs: ACTIVE, IDLE, and BLOCKED. ACTIVE holds the currently running jobs. IDLE+BLOCKED hold those queued. Only those in IDLE are being considered for running according to priority. Users can have a total of 4 jobs in the IDLE section at one time.

When a user has fewer than 4 jobs in the IDLE section, their oldest job in the BLOCKED section will move to the IDLE section at the priority of the queue it was submitted to. From then on the job gains priority at a rate of 1 per minute. The job with the highest priority in the IDLE section will be the next job run when enough node slots become free for it.

For example, let's say UserA submits 100 one-CPU jobs and UserB submits 50 two-CPU jobs a split second later, all in the same priority queue. The first 8 of UserA's jobs will run, then the first 4 of UserB's, then the next 8 of UserA's, then the next 4 of UserB's, and so on.

Using different queues with different priorities adds complexity to this and allows newer jobs to jump over older ones that have not yet accrued enough priority. Jobs submitted to p10 start with priority 10100, p20 with 10200, p30 with 10300, and so on. Jobs accrue 1 more priority point for every minute in the queue, but only once they are in the IDLE section of the queue.

One can run 'showq -i' to see just the jobs in the IDLE section of the queue with their accrued priority values shown. The job at the top is the one waiting to run next. If you check back a minute later and the same job is still at the top, then it is 'stuck', which means there are no nodes available with the resources the job is asking for.

What we typically see on launchpad causing a 'stuck' job at the head of the IDLE section is one of the following:

  1. the job asks for 4 or more CPUs (e.g. ppn=8) and there are no nodes with that many CPU slots free.
  2. the job asks for more than 7GB of vmem and there are no nodes with the requested amount of vmem free.

By default in pbsubmit, if no other option is given, a job asks for just 1 CPU slot and 7GB of vmem. Using the "-n X" option, the job will ask for X CPU slots (i.e. ppn=X) and X * 7GB of vmem. Using the "-l" option, one can set the ppn and vmem requests independently.
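
For example, the following two submissions are a sketch of requesting the same Torque resources, 2 CPU slots and 14GB of vmem (the command string is a placeholder):

pbsubmit -n 2 -c "cmd arg ..."
pbsubmit -l nodes=1:ppn=2,vmem=14gb -c "cmd arg ..."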

Now, if you know your job really might use as much as 56GB of virtual memory, then by all means, submit your job with vmem=56gb.

BUT DO NOT SUBMIT jobs with large vmem values unless you really know your job needs that much, as this results in wasted job slots that are not available to run other jobs. The same goes for CPU slots (ppn).


FILE I/O LIMITATIONS OF THE CLUSTER

The launchpad cluster is located in Needham with only a 1 gigabit connection to the Partners network on the master node. All batch nodes have to funnel through that single 1 gig connection on the master to access any network file system not on the cluster's own private network. This includes both storage on your group workstations and the central storage servers; the one exception is the new GPFS storage cluster.

The upshot is that any jobs you run on launchpad that do heavy I/O to files outside the cluster, such as data format conversion jobs, will overload the 1 gig uplink and severely impact the cluster for everyone.

How do you know if your jobs are I/O intensive? Well, I cannot tell you. We in the IT group know very little about the internals of analysis jobs. You should know what kind of input and output is going into or coming out of your jobs. If not, ask someone in your group who does.

A lot of jobs only do a bit of heavy I/O at the start and end of their run. I am told 'recon-all' falls into this category. For these jobs, space out their submission by a few seconds between each pbsubmit/qsub and things are usually fine.

Remember, you are allowed to ssh to the nodes your jobs are running on (run 'qstat -n' to find the node a job is running on) in order to check progress and state. So you can ssh to a node and run 'top' to see if your job is using close to 100% CPU. If not, it is probably bogged down in this I/O bottleneck and should not be run on the cluster.
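
For example (the job number and node name here are placeholders):

qstat -n 16996     (lists the node the job is running on)
ssh nodeXXX
top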

The best solution to this I/O issue is to put your data on the GPFS storage cluster. This avoids the 1 gigabit bottleneck on the master node, since all nodes access data on the GPFS storage cluster directly. Still, the storage cluster is not infinitely powerful. It has its limits, and if you have heavy I/O jobs you should limit yourself to running 20 or fewer at a time. Talk to the IT Support group first if you want to run more.

Also, each node has a /scratch directory that can be used for temporary local disk space; see the Temporary Disk Space section below for details.


USING MATLAB ON THE CLUSTER

The center has a limited number of MATLAB licenses. Submit any jobs that use MATLAB to the matlab queue. All users are limited to no more than 20 MATLAB licenses in use at once over all locations (launchpad, tensor or your group workstations). The MATLAB license server enforces a strict limit of no more than 60 MATLAB licenses checked out over all launchpad and tensor nodes. So if 60 jobs using licenses are already running on the cluster, any additional jobs attempting to run MATLAB will fail with license errors.

Also, if your job requires any toolbox licenses, then you are limited to just ONE such job running on the cluster. And if you do use a toolbox license on the cluster, you are not allowed to use that same toolbox at the same time anywhere else such as your desktop. So essentially the only reason to run a matlab job that uses a toolbox on the cluster instead of your desktop is if the cluster is faster than your desktop.

To automate MATLAB jobs on the cluster, first create a *.m file with the actual MATLAB commands you want to run. The last line of the script should be 'exit'. Set up your environment and then give a command like the following to pbsubmit's -c option:

matlab.new -nodisplay -nodesktop -nojvm -r matlabfile

Do not include the ".m" extension of your MATLAB script in the -r argument. In the example above, the script is matlabfile.m, and it must exist in the current directory or in your MATLAB path.
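
Putting it together, a full submission might look like the following sketch (the script name mytest.m is a placeholder; 'exit' must be its last line):

pbsubmit -q matlab -c "matlab.new -nodisplay -nodesktop -nojvm -r mytest"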

Another option is to "compile" your MATLAB program into a stand-alone executable. This will not normally use up a license. Use 'lmstat -a' to check that it doesn't after running it once. If you see no licenses in use while it runs, either base or toolbox, then you can run the job as if it were a non-MATLAB job in any of the other batch queues. Details on how to use deploytool to compile your code have been put together by Jean-Philippe Coutu on this page.


TEMPORARY DISK SPACE

Each node has a /scratch directory that can be used for temporary local disk space. For example, suppose your job extracts files from a zip/tar.gz archive, processes them, and generates lots of other small files which are then zipped up into a new archive. All of that can be done on /scratch, with your job copying the final archive file to your normal group storage space. The /scratch area is cleaned of files unused for 21 days.

For temporary files that you want visible to multiple jobs on all nodes, there is a scratch volume on the storage cluster located at /cluster/scratch. Create your files in the day-of-the-week named directories you find there. Those days indicate when each directory gets cleaned of files more than 21 days old.
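
For example (the directory name here is illustrative; use whichever day-named directories actually exist there):

mkdir -p /cluster/scratch/monday/$USER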


INTERACTIVE JOBS

If you have a program you would like to run on a node that requires interactive use such as something using the Matlab GUI, then run:

qsub -I -X

Your shell will wait on the scheduler and when a node is free you will get a shell login on a node in the cluster. In that shell, you should be able to run stuff with GUI/X-windows. When you are done, exit off the node and this will end the job.

When there is a long wait for a node (run 'qstat' before you run qsub to check), you might want to add the option '-m b' so you get an email when you do finally get a node. Also, if you are running just one job, it is okay to give it '-q p50' to get a higher priority.
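
For example, combining these options for an interactive session:

qsub -I -X -m b -q p50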

If you need more than 1 CPU and/or more than 7GB of virtual memory, you need to use the "-l" option as discussed above. For example, you would need to run the following to get a node all to yourself:

qsub -I -X -l nodes=1:ppn=8,vmem=56gb

Do not leave these interactive node logins running when you are not actively using them. Save your state, exit your program(s) and log off the node. Then go through the qsub process again the next day or whenever you next need to run an interactive job. If you are not actively using the program or having it do a calculation, you are wasting compute cycles that could be used by others waiting in the batch queue. Normally you should run interactive jobs like these on your own local Linux workstation at your desk if you have one.

If the cluster is especially busy, we reserve the right to kill interactive jobs without warning where we see the login being idle.


LAUNCHPAD USE SUMMARY

Keep in mind these rules:

  1. Do not run analysis programs on the master node; submit them into the queue with pbsubmit or qsub.
  2. Do not request more vmem or CPU slots (ppn) than your jobs really need.
  3. Police your own job counts: as a general rule, no more than 100 CPU jobs during the day and 200 at night (40/100 for external/guest users), unless the cluster is mostly idle.
  4. Avoid heavy I/O to storage outside the cluster; put data on the GPFS storage cluster or use /scratch.
  5. Submit MATLAB jobs to the matlab queue and respect the license limits.
  6. Do not leave interactive jobs idle on nodes.
  7. Copy anything you want to keep out of /pbs and /scratch; both are cleaned regularly.


TENSOR CLUSTER DIFFERENCES

On tensor, the primary differences from launchpad to be aware of are:

  1. There are only two CPUs per node, so the -n argument cannot be larger than 2.
  2. Most nodes have only 4GB of RAM. If you need all of it, make sure you request the whole node with "-n 2". Some nodes have 16GB; to request those nodes, specify the bigmem property (see the sketch after this list).
  3. There are no GPU nodes.
  4. All I/O has to bottleneck through the master node, so tensor is NOT the place to run I/O-intensive jobs.
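
A sketch of requesting a whole 16GB bigmem node (the command string is a placeholder; the exact resource syntax is as described in the qsub man page):

pbsubmit -l nodes=1:bigmem:ppn=2 -c "cmd arg ..."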

LAUNCHPAD FAQ

If the job at the top of the priority queue is asking for 8 CPUs (or 56GB vmem) and there are tons of jobs behind it needing only 1 CPU (or 7GB vmem) and tons of nodes with 7 or more free CPUs, why not just run those 1-CPU jobs and let them skip over the 8-CPU job?

If we did that, the 8-CPU job would never run until it was the only job left, because a node would never become completely free with all 8 job slots open.

How do I know how much memory my job needs?

For past jobs, you can run 'jobinfo' on the job number and see how much memory it used. This should give you an idea of what vmem to ask for on future jobs, with no more than a 3-4GB extra buffer. If it is the first time you have ever run a particular job, submit just one with an overestimate for vmem, wait for it to finish, and run jobinfo to see how much it really used. Or better yet, run it on your workstation and monitor it with top.
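
For example (the job number and memory figures here are hypothetical):

jobinfo 16996

If this reports that the job used about 10GB of virtual memory, then asking for roughly 13-14GB, i.e. two job slots with "-n 2", would be a reasonable request for similar future jobs.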

How do I know how many CPUs I need?

I cannot really tell you this. Run 'top' while the job is running and see if it looks like it is multi-threaded at all. If you never see more than one CPU in use by the process, then it is almost certainly a one-CPU job.

I sometimes see other users' jobs with state HOLD on them. Are they holding up the queue?

No, only themselves. It usually means they are waiting on a previously submitted job that they depend on to complete.