Research Cluster General Information
Cluster hardware
The cluster runs Red Hat Enterprise Linux version 4.5 on Intel processors with 64-bit processing capability.
The cluster contains 81 nodes with 262 cores in total. Each node has two processors:
31 nodes with single-core Xeon 3.2 GHz processors and 6 GB RAM per node
20 nodes with single-core Xeon 3.2 GHz processors and 8 GB RAM per node
20 nodes with dual-core Xeon 2.33 GHz processors and 8 GB RAM per node
10 nodes with quad-core Xeon 2.33 GHz processors and 16 GB RAM per node
Only one job is run on a given core at one time to ensure that each job finishes as fast as possible. This gives the cluster the capacity to run 262 jobs simultaneously.
There is a 6 GB memory limit per job.
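The job capacity quoted above follows from the node inventory. As a quick check, the sketch below multiplies nodes by processors per node by cores per processor for each hardware group. The three-column layout is only a convenience for this example.

```shell
#!/bin/sh
# Each input line: node_count processors_per_node cores_per_processor
# (figures taken from the hardware description above).
total_cores() {
    awk '{ nodes += $1; cores += $1 * $2 * $3 }
         END { printf "%d nodes, %d cores\n", nodes, cores }' <<'EOF'
31 2 1
20 2 1
20 2 2
10 2 4
EOF
}

total_cores   # prints: 81 nodes, 262 cores
```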
Email aliases for questions and communications are:
cluster_admin: cluster administrators
cluster_users: all individuals with accounts
cluster_owners: all individuals who have purchased nodes
Cluster access privileges
To use the cluster, you must be affiliated with a faculty member who has leased nodes on the cluster. Contact cluster_admin for more information. For biostatistics department members, the department has leased nodes for general use; contact the department computing committee chair for access.
Using the cluster
Accessing the cluster
To access the cluster you will need to ssh (see section below) to hpcc.sph.harvard.edu. Your user name will be the first initial of your first name and your last name up to eight characters. Your password will be hu followed by the first 6 digits of your Harvard ID. You should change your password with the passwd command.
Copying your files to the cluster
To copy files to the cluster you must use the sftp protocol, i.e., an ftp client that supports secure file transfer (see next section).
SSH/SFTP client software
To access the cluster for submitting jobs and to transfer files you need client software on your local machine that uses the ssh and sftp protocols.
For Windows, PuTTY is an ssh client available from the IT downloads page (http://www.hsph.harvard.edu/administrative-offices/information-technology/downloads/).
WinSCP is a Windows sftp client that is also available on the HSPH IT downloads page.
From another Linux/UNIX machine, you can use the UNIX ssh and scp commands. These should also work from an X terminal on Mac OS X.
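As a rough illustration of command-line file transfer from a Linux/UNIX or Mac machine, the sketch below wraps scp in two small helper functions. The account name jsmith and the function names are hypothetical; substitute your own user name.

```shell
#!/bin/sh
# Sketch only: CLUSTER_USER is a hypothetical account name.
CLUSTER_HOST=hpcc.sph.harvard.edu
CLUSTER_USER=${CLUSTER_USER:-jsmith}

# Copy a local file into your home directory on the cluster.
push_to_cluster() {
    scp "$1" "${CLUSTER_USER}@${CLUSTER_HOST}:"
}

# Copy a file from your cluster home directory into the current directory.
pull_from_cluster() {
    scp "${CLUSTER_USER}@${CLUSTER_HOST}:$1" .
}
```

For example, push_to_cluster job1.R before submitting a job and pull_from_cluster job1.out once it has finished.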
Displaying graphical output
To be able to open X windows (UNIX graphical windows) from the cluster on your desktop, you need X windows server software on your machine.
For Windows, you'll need to install X server software such as cygwin or Xwin32. Start the program. Then in your ssh software (e.g., putty), enable X11 forwarding (under 'Connection', then 'SSH', then 'X11'). Then ssh into hpcc.sph.harvard.edu and open an X window (e.g., using xterm or starting SAS or creating an R graphic).
From Linux/UNIX/Mac OS X, just use the command 'ssh -X hpcc.sph.harvard.edu' to enable X11 forwarding.
Running jobs
The cluster uses a scheduling program called LSF from Platform Computing. A comprehensive list of commands for the scheduler is available at http://my.platform.com/docs/lsf/7.0/reference/index.html. Here is an overview of the key ones.
When you are submitting a job to the cluster, use the bsub command followed by your usual UNIX batch command.
[gmazzu@hpcc /]$ bsub R CMD BATCH --no-save job1.R job1.out
In this case user gmazzu is submitting an R job called job1.R.
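When submitting several R scripts it can help to give each job a name and a log file. The sketch below is one possible wrapper, not a cluster-provided tool: submit_r is a hypothetical helper that uses the standard bsub options -J (job name) and -o (file for LSF's job report). The BSUB variable exists only so the command can be previewed without submitting anything.

```shell
#!/bin/sh
# Hypothetical helper: submit one R script, naming the job after the script.
# Set BSUB=echo to print the arguments instead of submitting (dry run).
submit_r() {
    script=$1
    base=${script%.R}          # job1.R -> job1
    ${BSUB:-bsub} -J "$base" -o "$base.lsf.log" \
        R CMD BATCH --no-save "$script" "$base.out"
}
```

Running BSUB=echo submit_r job1.R shows the arguments that would be passed to bsub; submit_r job1.R on the cluster submits the job.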
What jobs of mine are running?
Once you have submitted jobs you can see what is running by using the bjobs command.
[gmazzu@hpcc /]$ bjobs
JOBID  USER    STAT  QUEUE       FROM_HOST  EXEC_HOST    JOB_NAME    SUBMIT_TIME
1014   gmazzu  RUN   preemptabl  hpcc       compute-0-2  *.2 1 1000  Aug 31 10:22
1011   gmazzu  RUN   preemptabl  hpcc       compute-0-1  *.2 1 1000  Aug 31 10:22
1015   gmazzu  RUN   preemptabl  hpcc       compute-0-0  *.2 1 1000  Aug 31 10:22
1018   gmazzu  RUN   preemptabl  hpcc       compute-0-2  *.2 1 1000  Aug 31 10:22
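With many jobs submitted, a per-state summary can be easier to read than the full listing. The following is a minimal sketch that assumes the default bjobs layout shown above, where the third column is the job state. count_by_state is a hypothetical helper name.

```shell
#!/bin/sh
# Read bjobs output on stdin and print the number of jobs in each state.
# Skips the header line; assumes STAT is the third column.
count_by_state() {
    awk 'NR > 1 { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage on the cluster:
#   bjobs | count_by_state
```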
Stopping jobs
If you would like to stop a job, use the bkill command followed by the JOBID, which you can see in the example above.
[gmazzu@hpcc /]$ bkill 1018
Choosing a queue for your job
There are currently 3 queues available for submission, in addition to group-specific queues (see below for information about these). They can be viewed by typing bqueues.
[gmazzu@hpcc /]$ bqueues
Each queue has special attributes and can be used for different purposes. Users should be using the normal, preemptable, and long queues unless they are in a group that has set up their own queue.
normal (default queue) - This queue is best used for jobs that are time-sensitive and need to run without being paused. The time limit for jobs in this queue is 5 days; jobs running longer will be killed. Jobs here have the highest priority and users can run up to 4 simultaneous jobs. At any one time each group can only run as many normal jobs as the number of slots leased by the group owner, so an individual user may not be able to run 4 simultaneous jobs if other users in the group are tying up the slots.
preemptable - This queue is for submitting jobs on other owners' nodes when they are not in use, or for running more than 4 jobs. Users can run up to 16 simultaneous jobs. There is no time limit for jobs running in this queue. Jobs here can be paused when a user submits a job to the normal queue. A paused job resumes when the normal queue job finishes.
long - The long queue is unique because it has no time limit or limit on the number of jobs running. If no one else is using the cluster, you can run as many jobs as there are job slots on the cluster. It also has the lowest priority of all the queues. Both normal and preemptable can pause jobs in this queue. This is used when you have used up all of your normal and preemptable job slots but there are still open slots on the cluster. Also, if you have jobs that are low priority, you may wish to use this queue solely to allow other jobs of your own or others to run at higher priority. Please understand that jobs in this queue can be suspended for a long time.
To submit a job to a queue other than normal, use bsub -q followed by the queue name:
[gmazzu@hpcc /]$ bsub -q long R CMD BATCH --no-save job1.R job1.out
In this case job1.R was submitted to the long queue.
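For a large batch of scripts, the per-user limits described above (4 jobs in normal, 16 in preemptable, no limit in long) suggest one way to spread jobs across queues. The sketch below is only an illustration: pick_queue and submit_batch are hypothetical helpers, and the logic assumes none of your jobs are already running and ignores the group-level slot limit on the normal queue. Setting BSUB=echo previews the commands without submitting.

```shell
#!/bin/sh
# Choose a queue for the n-th job of a batch (n starts at 1):
# jobs 1-4 -> normal, jobs 5-20 -> preemptable, the rest -> long.
pick_queue() {
    n=$1
    if [ "$n" -le 4 ]; then
        echo normal
    elif [ "$n" -le 20 ]; then
        echo preemptable
    else
        echo long
    fi
}

# Submit every R script given as an argument to the queue chosen above.
submit_batch() {
    i=0
    for script in "$@"; do
        i=$((i + 1))
        base=${script%.R}
        ${BSUB:-bsub} -q "$(pick_queue "$i")" \
            R CMD BATCH --no-save "$script" "$base.out"
    done
}
```

For example, BSUB=echo submit_batch sim*.R lists the queue each script would be sent to.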
Prioritization of jobs
The queuing software accounts for the number of jobs that users are currently running as a function of their group. The PENDING list of jobs is ordered so that, when a slot on the cluster becomes available, the first job to start (within a given queue) belongs to a user whose group is using the lowest proportion of its leased portion of the cluster. Note that there is no prioritization within a group; jobs run first come, first served.
Switching queues after job submission
You can also move a running or pending job from one queue to another using the bswitch command.
[gmazzu@hpcc]$ bjobs
JOBID   USER    STAT  QUEUE   FROM_HOST  EXEC_HOST    JOB_NAME    SUBMIT_TIME
143217  gmazzu  RUN   normal  hpcc       compute-0-7  * Sens12.R  Sep 14 14:08
143218  gmazzu  RUN   normal  hpcc       compute-0-1  * Sens12.R  Sep 14 14:08
143219  gmazzu  PEND  normal  hpcc                    * Sens12.R  Sep 14 14:08
143220  gmazzu  PEND  normal  hpcc                    * Sens12.R  Sep 14 14:08
143221  gmazzu  PEND  normal  hpcc                    * Sens12.R  Sep 14 14:08
143222  gmazzu  PEND  normal  hpcc                    * Sens12.R  Sep 14 14:08
[gmazzu@hpcc]$ bswitch long 143219
Job <143219> is switched to queue <long>
[gmazzu@hpcc]$ bjobs
JOBID   USER    STAT  QUEUE   FROM_HOST  EXEC_HOST    JOB_NAME    SUBMIT_TIME
143217  gmazzu  RUN   normal  hpcc       compute-0-7  * Sens12.R  Sep 14 14:08
143218  gmazzu  RUN   normal  hpcc       compute-0-1  * Sens12.R  Sep 14 14:08
143219  gmazzu  RUN   long    hpcc       compute-1-6  * Sens12.R  Sep 14 14:08
143220  gmazzu  PEND  normal  hpcc                    * Sens12.R  Sep 14 14:08
143221  gmazzu  PEND  normal  hpcc                    * Sens12.R  Sep 14 14:08
143222  gmazzu  PEND  normal  hpcc                    * Sens12.R  Sep 14 14:08
In this example job 143219 was moved from the normal queue to the long queue.
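To move every pending job at once rather than one JOBID at a time, the job IDs can be pulled out of the bjobs listing. This is a minimal sketch that assumes the default bjobs layout shown above (JOBID in column 1, STAT in column 3). pending_ids is a hypothetical helper name.

```shell
#!/bin/sh
# Read bjobs output on stdin and print the JOBID of each pending job.
pending_ids() {
    awk 'NR > 1 && $3 == "PEND" { print $1 }'
}

# Usage on the cluster: switch all of your pending jobs to the long queue.
#   for id in $(bjobs | pending_ids); do
#       bswitch long "$id"
#   done
```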
Running Interactive Jobs
Interactive jobs are run through the normal queue so users can only run 4 at once. Please make sure to shut down the job when you're done or you will tie up a job slot and prevent someone else from running a job in that slot.
The basic syntax is:
bsub -Ip R
bsub -Ip sas
bsub -Ip bash
depending on what you want to run. Matlab should NOT be run interactively as the licensing is such that the single interactive job will prevent any new Matlab jobs from being submitted by any other user.
For applications that start graphical windows (e.g., R when making a graphic, or SAS), see the information above on X windows, as your local machine needs to be set up to display the X window.
64-bit processing
The cluster processors can handle 64-bit processing. This means, among other things, that applications can address more than the 4 GB of RAM that one is limited to in 32-bit processing.
All core software on the cluster runs in 64-bit mode, including R, SAS and Matlab.
Software
R
R is compiled using the Intel compilers and uses Goto's BLAS, both of which greatly improve speed, particularly of basic linear algebra operations.
Many R packages have been installed by the administrators and are accessible to all users. If you need additional packages, we recommend that you email cluster_admin to have them installed. However, if you need the package right away or it is a specialized package that others are unlikely to need, you can install the package locally in your home directory as follows.
1.) create a directory to store the packages, e.g., 'mkdir ~/Rlibs'
2.) start R on the head node
3.) > install.packages('packageName',lib='~/Rlibs')
4.) quit R
5.) start R using bsub and load the package as '> library(packageName,lib.loc='~/Rlibs')'
Note that this is the only time you should run R from the head node without invoking the bsub command. Package installation often involves compiling C or Fortran code, which under our setup requires the Intel Fortran or C compilers, which are only available on the head node.
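As an alternative to passing lib.loc in every library() call, R also reads the R_LIBS environment variable at startup and adds its directories to the library search path. The sketch below is one possible setup, not the administrators' recommended method. setup_rlibs is a hypothetical helper, and it assumes LSF passes the submission environment on to the job, which is its default behaviour but worth confirming on this cluster.

```shell
#!/bin/sh
# Create a personal R library directory (default ~/Rlibs) and export R_LIBS
# so that R sessions started from this shell search it automatically.
setup_rlibs() {
    dir=${1:-$HOME/Rlibs}
    mkdir -p "$dir" || return 1
    R_LIBS=$dir
    export R_LIBS
}

# Usage on the head node, before submitting:
#   setup_rlibs
#   bsub R CMD BATCH --no-save job1.R job1.out
# Inside job1.R, library(packageName) should then find packages in ~/Rlibs.
```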
SAS
Owners have purchased SAS licenses for 14 nodes. In general to use SAS we ask that you be affiliated with an owner who has contributed to the cost of the software. Email cluster_admin for more information. For biostat department members, the department has purchased SAS licenses for moderate amounts of use. If you are a biostat department member and you use SAS intensively we ask that you talk to the department computing committee chair about contributing to the cost of the licensing, which is approximately $900 per node per year.
To run SAS in batch mode:
[gmazzu@hpcc /]$ bsub sas -noterminal code.sas -log file.log
Matlab
Batch mode:
To properly submit a batch job that does not tie up the licenses except during the submission process, you must follow the instructions in the readme.txt file in the attached zip file. Submitting jobs in other ways may tie up the licenses and prevent other users from starting jobs.
Interactive use:
We have bought only two licenses for Matlab. If two interactive jobs were running at once, no one could submit a batch job, so we need to make sure that only one interactive job is running at a time, keeping a license free for batch mode submission. To check this, type the following after logging on:
bjobs -u all -l | grep 'matlab'
The '-u all' lists jobs for all users, the '-l' prints the full information on each job, and the grep part searches for matlab jobs.
If you see something like
"terminal mode, Command "
that means there is another interactive Matlab job running and you should not submit an interactive job.
If you find no other interactive jobs running, you can start an interactive session using only the command line as
bsub -Ip matlab -nodisplay
or, if you have an X windows server set up on your local machine, you can start the Matlab GUI as
bsub -Ip matlab
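The check and the launch can be combined so that an interactive session only starts when no other one is running. The following is a sketch under the assumption that the matching line of bjobs -l output contains both "matlab" and "terminal mode", as in the example above. Long job descriptions may wrap across lines, so treat a negative result with some caution. matlab_interactive_running is a hypothetical helper name.

```shell
#!/bin/sh
# Read `bjobs -u all -l` output on stdin; succeed if any line mentions both
# matlab and "terminal mode", i.e. an interactive Matlab job appears to exist.
matlab_interactive_running() {
    grep 'matlab' | grep -q 'terminal mode'
}

# Usage on the cluster:
#   if bjobs -u all -l | matlab_interactive_running; then
#       echo "An interactive Matlab job is already running; try again later."
#   else
#       bsub -Ip matlab -nodisplay
#   fi
```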
OpenBUGS
Users can run OpenBUGS, the open source version of WinBUGS, through the command 'linbugs'. This uses the old command line functionality of BUGS. The easiest way to use this is to run from R using code prepared by Chris Paciorek. This allows one to run batch jobs. Please see this zip file for instructions and template code.
Other software
Other installed software includes Mathematica, Octave (an open source alternative to Matlab), PBAT, FBAT, SaTScan, and Splus.
If you need to run other software, you can install it locally in your home directory or contact cluster_admin about having it installed on the cluster for anyone's use.
The HSPH IT Department does not support third party software; check with your department or sponsor for support.
Compilation
To compile C, C++, or Fortran code, you can use either the GNU compilers or the Intel compilers (icc and ifort) from the command line on the head node when you log in to the cluster (i.e., don't submit a compilation job using bsub). The Intel compilers are generally expected to give faster code as they are optimized for Intel processors.
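A typical workflow is to compile on the head node and then submit the resulting executable with bsub. The sketch below shows one way to switch between the GNU and Intel C compilers. build is a hypothetical helper, sim.c is a placeholder file name, and -O2 is just a common optimization level rather than a cluster recommendation.

```shell
#!/bin/sh
# Compile a C source file into an executable named after it (sim.c -> sim).
# Uses gcc unless CC is set, e.g. CC=icc for the Intel compiler.
build() {
    src=$1
    ${CC:-gcc} -O2 -o "${src%.c}" "$src"
}

# Usage on the head node:
#   build sim.c            # GNU compiler
#   CC=icc build sim.c     # Intel compiler
#   bsub ./sim             # run the compiled program as a cluster job
```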
Disk space
Each node is allotted 5 GB of disk space, so if you belong to a group that owns 4 nodes, your group has access to 20 GB of space, to be split among users in the group as the owner requests. Users can email cluster_admin to find out what their individual quota is. A user who goes above their quota will receive an email notification. After approximately 5 days over the limit, the user will not be able to write to disk until they have removed enough files to get below their quota. Users can check their disk usage by typing "du -s" from their home directory.
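The quota arithmetic and the usage check can be scripted. The following is a minimal sketch: the helper names are hypothetical, and the conversion assumes 1 GB = 1024 * 1024 KB, which may not match how the cluster's quota system counts.

```shell
#!/bin/sh
# Disk usage of a directory in kilobytes (defaults to the home directory).
usage_kb() {
    du -sk "${1:-$HOME}" | awk '{ print $1 }'
}

# Space available to a group leasing a given number of nodes, at 5 GB per node.
group_space_gb() {
    echo $(( $1 * 5 ))
}

# Succeed when a usage figure in KB exceeds a quota given in GB.
# Assumes 1 GB = 1024 * 1024 KB.
over_quota() {
    [ "$1" -gt $(( $2 * 1024 * 1024 )) ]
}

# Usage, with a hypothetical individual quota of 5 GB:
#   if over_quota "$(usage_kb)" 5; then echo "Over quota: remove some files."; fi
```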
The disk space is backed up to tape daily.
Additional disk space will be leased to users for a cost of $2.58/GB ($2580/TB) annually. Contact cluster_admin.
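To estimate the annual cost of extra space at the rate quoted above, multiply the number of gigabytes by 2.58. annual_cost is a hypothetical helper used only for this illustration.

```shell
#!/bin/sh
# Annual cost in dollars of leasing the given number of GB at $2.58 per GB.
annual_cost() {
    awk -v gb="$1" 'BEGIN { printf "%.2f\n", gb * 2.58 }'
}

annual_cost 100    # prints 258.00
annual_cost 1000   # prints 2580.00
```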
Purchasing Nodes
Only SPH-affiliated researchers are allowed to purchase nodes in the cluster. If interested, please contact Bill Mahoney, Assistant Director of Information Technology, 432-1751, cluster_admin@hsph.harvard.edu.
Setting up a queue specific for your group
If you are the owner of nodes, you can choose to have your nodes enter the overall pool of nodes with your group assigned the number of job slots that you have leased in the normal queue, or you can set up your own queue. With your own queue, you guarantee access at any time to as many job slots as you have leased and your individual users can run more than 4 jobs in the normal queue and run high priority jobs for longer than 5 days, if you so choose. Please contact cluster_admin to better understand the tradeoffs involved in setting up your own queue.
Cluster policies
Cluster policies were initially determined based on conversations between node owners and IT personnel. At this stage, there is no formal procedure for determining policies, but owners and users can email cluster_admin or cluster_owners with comments and suggestions.