Department of Information Technology

High Performance Research Cluster

Research Cluster General Information

Cluster hardware

The cluster runs Red Hat Enterprise Linux version 5.4 on Intel processors with 64 bit processing capability.

The cluster contains 83 nodes (362 cores in total), each node with two processors: 17 nodes with single-core Xeon 3.2 GHz processors and 6 GB of RAM per node, 20 nodes with single-core Xeon 3.2 GHz processors and 8 GB of RAM per node, 20 nodes with dual-core Xeon 2.33 GHz processors and 8 GB of RAM, and 26 nodes with quad-core Xeon 2.33 GHz processors and 16 GB of RAM.

Only one job is run on a given core at one time to ensure that each job finishes as fast as possible. This gives the cluster the capacity to run 362 jobs simultaneously.

There is an 8 GB memory limit per job.

For questions, email cluster_admin@hsph.harvard.edu.

Cluster access privileges

To use the cluster, you must be affiliated with a faculty member or department that has sponsored compute nodes on the cluster. Contact cluster_admin for more information.

Using the cluster

Accessing the cluster

To access the cluster you will need to ssh (see section below) to hpcc.sph.harvard.edu. Once logged in, you should change your password with the passwd command.
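
For example, assuming your cluster username is jdoe (substitute your own login):

ssh jdoe@hpcc.sph.harvard.edu
passwd   (run this once you are logged in to change your password)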

Copying your files to the cluster

To copy files to the cluster you must use the sftp protocol, i.e., an ftp client that supports secure file transfer (see the next section).
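
As an illustration, assuming your username is jdoe and you want to copy a local file called analysis.R to your home directory on the cluster, either of the following works from your local machine:

scp analysis.R jdoe@hpcc.sph.harvard.edu:~/
sftp jdoe@hpcc.sph.harvard.edu   (then use the put command, e.g. put analysis.R)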

SSH/SFTP client software

To access the cluster for submitting jobs and to transfer files you need client software on your local machine that uses the ssh and sftp protocols.

For Windows, PuTTY is an ssh client available from http://www.chiark.greenend.org.uk/~sgtatham/putty/

WinSCP is a Windows sftp/scp client available from http://winscp.net/eng/index.php

From another Linux/UNIX machine, you can use the ssh and scp commands. These also work from a terminal on Mac OS X.

Displaying graphical output

To be able to open X windows (UNIX graphical windows) from the cluster on your desktop, you need X windows server software on your machine.

For Windows, you’ll need to install X server software such as X-Win32.

From Linux/UNIX/Mac OS X, just use the command 'ssh -X -l hpcc_login hpcc.sph.harvard.edu' (replacing hpcc_login with your username) to enable X11 forwarding.
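
To confirm that X11 forwarding is working, try launching a simple X application once you are logged in; xclock is used here only as an example and is assumed to be installed on the cluster:

xclock &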

Running jobs

The cluster uses a scheduling program called LSF from Platform Computing. A comprehensive list of commands for the scheduler is available at http://my.platform.com/docs/lsf/7.0/reference.  Here is an overview of the key ones.

When you are submitting a job to the cluster, use the bsub command followed by your usual UNIX batch command.

bsub R CMD BATCH --no-save job1.R job1.out

In this case an R job called job1.R is submitted, with output written to job1.out.
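
You can also have LSF save its own report for the job (exit status, resource usage) using the standard -o option; the file name here is just an example:

bsub -o job1.lsf.out R CMD BATCH --no-save job1.R job1.out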

What jobs of mine are running?

Once you have submitted jobs, you can see what is running by using the bjobs command.

bjobs

JOBID  USER    STAT  QUEUE       FROM_HOST  EXEC_HOST    JOB_NAME    SUBMIT_TIME
1014   gmazzu  RUN   preemptabl  hpcc       compute-0-2  *.2 1 1000  Aug 31 10:22
1011   gmazzu  RUN   preemptabl  hpcc       compute-0-1  *.2 1 1000  Aug 31 10:22
1015   gmazzu  RUN   preemptabl  hpcc       compute-0-0  *.2 1 1000  Aug 31 10:22
1018   gmazzu  RUN   preemptabl  hpcc       compute-0-2  *.2 1 1000  Aug 31 10:22
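
A few other standard bjobs options may be helpful:

bjobs -p   (show only your pending jobs, with the reason each is pending)
bjobs -a   (show recently finished jobs as well as running and pending ones)
bjobs -l 1014   (show full details for a single job, here job 1014 from the example above)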

Stopping jobs

If you would like to stop a job, use the bkill command followed by the JOBID, which you can see in the bjobs output above.

bkill 1018
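
To stop all of your jobs at once, bkill accepts the special job ID 0:

bkill 0   (kills all of your own jobs)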

Choosing a queue for your job

There are currently 3 queues available for submission, in addition to  group-specific queues (see below for information about these). They can be viewed by typing bqueues.

bqueues

Each queue has special attributes and can be used for different purposes. Users should be using the normal, preemptable, and long queues unless they are in a group that has set up their own queue.
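
To see the full configuration of a particular queue (run limits, job slot limits, and so on), pass -l and the queue name to bqueues, for example:

bqueues -l normal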

normal (default queue) – This queue is best used for jobs that are time-sensitive and need to run without being paused. The time limit for jobs in this queue is 5 days; jobs running longer will be killed. Jobs here have the highest priority and users can run up to 4 simultaneous jobs. At any one time each group can only run as many normal jobs as the number of slots leased by the group owner, so an individual user may not be able to run 4 simultaneous jobs if other users in the group are tying up the slots.

preemptable – This queue is for submitting jobs on other owners’ nodes when they are not in use, or for running more than 4 jobs. Users can run up to 22 simultaneous jobs. There is no time limit for jobs running in this queue. Jobs here can be paused if a user submits a job in the normal queue; a paused job will restart when the normal queue job finishes.

long – The long queue has no time limit and can run up to 40 simultaneous jobs. Jobs in the normal queue can pause jobs in this queue. Use the long queue when you have used up all of your normal job slots but there are still open slots on the cluster. Also, if you have low-priority jobs, you may wish to use this queue solely to allow other jobs (your own or others’) to run at higher priority. Please understand that jobs in this queue can be suspended for a long time.

To submit a job to a queue other than normal, use bsub -q "queue name"

bsub -q long R CMD BATCH --no-save job1.R job1.out

In this case job1.R was submitted to the long queue.
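
If a job is still pending, you can also move it to a different queue without resubmitting it by using LSF’s bswitch command with the destination queue and the job ID, for example:

bswitch long 1014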

Prioritization of jobs

The queuing software accounts for the number of jobs that users in each group are currently running. The PENDING list of jobs is ordered so that the first job (within a given queue) to start when a slot on the cluster becomes available belongs to a user whose group is using the lowest proportion of its leased portion of the cluster. Note that there is no prioritization within a group; jobs run first come, first served.

Running Interactive Jobs

Interactive jobs are run through the normal queue, so users can only run 4 at once. Please make sure to shut down the job when you’re done, or you will tie up a job slot and prevent someone else from running a job in that slot.

The basic syntax is:

bsub -Ip R

bsub -Ip -XF sas   (the option -XF is needed for X windows)

bsub -Ip bash

depending on what you want to run. Matlab should NOT be run interactively: the licensing is such that a single interactive job will prevent any new Matlab jobs from being submitted by any other user.
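
When you are finished with an interactive session, exit the application normally so that LSF releases the job slot, for example:

q()   (to quit an interactive R session)
exit   (to quit an interactive bash session)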

For applications that start graphical windows (e.g., R when making a graphic, or SAS), see the information above on X windows, as your local machine needs to be set up to display the X window.

64-bit processing

The cluster processors can handle 64-bit processing. This means, among other things, that applications can address more than the 4 GB of RAM to which one is limited in 32-bit processing.

All core software on the cluster runs in 64 bit mode, including R, SAS and Matlab.

Software

R

Many R packages have been installed by the administrators and are accessible to all users. For additional packages, email cluster_admin to have them installed.

SAS

To run SAS in batch mode:

bsub sas -noterminal code.sas -log file.log

Matlab

Batch mode:

To properly submit a batch job that does not tie up the licenses except during the submission process, you must follow the instructions in the readme.txt file in the attached zip file. Submitting jobs in other ways may tie up the licenses and prevent other users from starting jobs.

Interactive use:

We have bought only two licenses for Matlab. If two interactive jobs were running at once, no one could submit a batch job, so we need to make sure only one interactive job is running at a time, keeping a license free for batch-mode submission. The way to check this is to type the following after logging on:
bjobs -u all -l | grep 'matlab'
The '-u all' option lists all users’ jobs, '-l' prints the full information on each job, and the grep part searches for Matlab jobs.

If you see something like
"terminal mode, Command"
that means there is another interactive Matlab job running and you should not submit an interactive job.

If you do this and find no other interactive jobs running, you can start an interactive session using only the command line with
bsub -Ip matlab -nodisplay
or, if you have an X windows server set up on your local machine, you can start the Matlab GUI with
bsub -Ip matlab

OpenBUGS

Users can run OpenBUGS, the open source version of WinBUGS, through the command 'linbugs'. This uses the old command-line functionality of BUGS. The easiest way to use it is to run it from R using code prepared by Chris Paciorek, which allows one to run batch jobs. Please see this zip file for instructions and template code.

Other software

Other installed software includes Mathematica, Octave (an open source alternative to Matlab), PBAT, FBAT, SaTScan, and Splus.

If you need to run other software, you can install it locally in your home directory or contact cluster_admin about having it installed on the cluster for anyone’s use.

The HSPH IT Department does not support third-party software; check with your department or sponsor for support.

Compilation

To compile C, C++, or Fortran code, you can use either the GNU compilers or the Intel compilers (icc and ifort) from the command line on the head node when you log in to the cluster (i.e., don’t submit a compilation job using bsub). The Intel compilers are generally expected to produce faster code, as they are optimized for the Intel processors.
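
As a quick illustration (the file and program names are just placeholders), a C source file could be compiled on the head node with either compiler as follows:

gcc -O2 -o myprog myprog.c
icc -O2 -o myprog myprog.c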

Disk space

Users can email cluster_admin to find out what their individual quota is. If a user goes above their quota, they will receive an email notifying them. After approximately 5 days over the limit, the user will not be able to write to disk until they have removed enough files to get below their quota. Users can check their disk usage by typing "du -s" from their home directory.
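
For a more readable summary, du also accepts the -h flag:

du -sh   (reports the total size of the current directory in human-readable units)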

The disk space is backed up to tape weekly.

Additional disk space can be leased for $1.76/GB ($1760/TB) annually. Contact cluster_admin@hsph.harvard.edu.