Cluster computing: when your dinky laptop just isn't enough

We sometimes run into situations where it's necessary to, say, fit a whole series of mixed-effects regression models to a very large dataset, or do lots of sampling to estimate a Bayesian model.

Our personal computers are generally good enough for simple tasks, but for these computationally intensive tasks they're not always big or fast enough. For this, we have clusters: many smaller computers which have been taped together. There are two clusters that people in the lab generally use: the Computer Science cluster, and the University's Center for Integrated Research Computing (CRC) cluster, Bluehive (the CRC also has a Bluegene supercomputer, which we will probably never be allowed to use).

I'll assume for the purposes of this tutorial that you already have an account, or know how to get one, and will walk through logging into the cluster/setting up your ssh keys, running things interactively, submitting batch jobs, and writing (at least minimally) parallel code. There's information elsewhere on the wiki about getting a CS cluster account, and there is a form on the CRC website to register for an account there.

Logging in

You'll use ssh, because it is the secure shell. On a Mac, this is as easy as opening up a Terminal, and typing in ssh username@bluehive.crc.rochester.edu (to log into Bluehive), and then typing in your password when prompted. If you're off-campus, you'll need to first connect to the University's VPN.

This process can be made much more secure (and faster, if you like to live dangerously or get a kick out of editing your ~/.profile) by using ssh keys. ssh keys replace the normal password exchange that takes place when you log in to a server over ssh with public-key cryptography. You will have two keys: a public key that everyone can see, and a private key that no one else can ever be allowed to see. You distribute your public key far and wide, putting a copy of it on every server you want to log into, while your private key sits securely (and, if you're really paranoid, encrypted) on your computer. The public key is a very large number that the server uses to encrypt a small challenge message, which is very, very difficult to decrypt without the private key (unless you have a quantum computer).

If you haven't already done so, you need to create your public-private key pair:

cd ~/.ssh
ssh-keygen

This will prompt you to enter a passphrase. If you do enter a passphrase, you will have to either type it in every time you log in to a server using these keys, or futz with ssh agents. If you don't enter a passphrase, then you will never have to enter your password again when using these keys. The trade-off is that anyone who can read your private key file (i.e., anyone logged into your computer as you) can then log into every server which uses that key pair and cause all kinds of mischief.
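
If you do use a passphrase, an ssh agent will remember the unlocked key for the rest of your session, so you only have to type the passphrase once. A minimal sketch (recent Macs and most Linux desktops already run an agent for you, so the first line is often unnecessary):

eval `ssh-agent`       # start an agent, if one isn't already running
ssh-add ~/.ssh/id_rsa  # prompts for your passphrase once, then caches the unlocked key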

The next step is to distribute your public key, which is probably in a file called ~/.ssh/id_rsa.pub, to the server, and append it to a file called authorized_keys in the .ssh directory on the server:

scp ~/.ssh/id_rsa.pub username@bluehive.crc.rochester.edu:pubkey.txt
ssh username@bluehive.crc.rochester.edu
mkdir -p ~/.ssh
chmod 700 ~/.ssh
cat pubkey.txt >> ~/.ssh/authorized_keys
rm ~/pubkey.txt
chmod 600 ~/.ssh/*
exit
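
As an aside, many systems have a helper called ssh-copy-id that does all of the above in one step (it ships with OpenSSH on most Linux distributions; on a Mac you may need to install it yourself, e.g. via Homebrew):

ssh-copy-id username@bluehive.crc.rochester.edu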

Now you can log in using the keys instead of your password: ssh username@bluehive.crc.rochester.edu. If you used a passphrase to create the private key, you'll have to enter it again here.

Another, optional step is to set up even more shortcuts. You can specify nicknames for commonly used ssh servers, and set default usernames (since you still have to type in your username every time if it's not the same as the one on your local machine, which it probably isn't for us). This information goes in the file ~/.ssh/config. Here are the contents of mine:

Host bluehive
  HostName bluehive.crc.rochester.edu
  User dkleins2

Host slate
  HostName slate.hlp.rochester.edu

Because my username on my laptop is dkleinschmidt, this saves me from having to type in my netID every time I log into bluehive, and it saves me from having to type that .crc.rochester.edu garbage. Instead, I type ssh bluehive, and I'm in!

Getting code and data files to the cluster

Use scp to transfer files from your local machine to whichever server you're using. There are some GUIs which support the scp protocol, but the command-line interface is the simplest:

scp sourcefile username@remote.host:~/remote/directory
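
If you need to copy a whole directory (a folder of code, say), add the -r flag to copy it recursively:

scp -r ~/local/directory username@remote.host:~/remote/directory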

Getting files back can be a bit trickier, since you need to either know the external hostname of your local machine (which may not even exist in any reasonable form), or remember/cut and paste the path to the remote files. From your local machine, you'd run:

scp username@remote.host:~/remote/sourcefile ~/local/destination/newfilename

For clarity's sake, ~ is shorthand for your home directory, which is of the form /Users/username/ on a Mac; this shorthand works on any unix system (including the various clusters). In the scp commands above, a ~ after remote.host: refers to your home directory on the remote machine, while a bare ~ refers to the one on your local machine.

If you have an ongoing project with lots of different code modules which you are updating all the time, it may make sense to use version-control software like Git to keep things synchronized across your local and remote machines. This is a topic for another day, although I would be thrilled to share how wonderful Git is.

Running interactive jobs

Bluehive

On Bluehive, running R interactively is really easy. ssh into Bluehive and run the command qR:

[dkleins2@bluehive ~]$ qR

qR: Starting R with 1 processors and 2 GB of RAM..

If you want more than one processor (up to 8), pass the number you want as an argument, for instance:

qR 8

to reserve 8 processors.

You can now interact with R in the same way you would if you were running it in the terminal on your laptop. You can also install your own packages exactly as you do on your own computer, using the install.packages() command. There was at one point some funny business about lme4 not installing correctly because of a mismatched version of the Matrix library, but hopefully that has been resolved.
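
For instance, to get lme4 into your own library on the cluster, start R (with qR) and run something like the following; the first time around, R should offer to create a personal library in your home directory if you don't have write access to the system one:

install.packages('lme4')   # prompts for a CRAN mirror, and a personal library if needed
library(lme4)              # check that it loads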

SSH display tunneling

It's even possible to interact with R (or whatever other software you please) graphically using X-windows tunneling. Make sure X11 is running (if you're on a Mac), and when you connect to the server, add the -Y flag to the ssh command:

ssh -Y username@bluehive.crc.rochester.edu

Not quite as pretty as the Mac-native Quartz graphics, and probably a little sluggish, but it gets the job done.
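
A quick sanity check that the forwarding works: log in with -Y, start R, and make a throwaway plot; an X11 window should pop up on your local screen.

plot(rnorm(100), main='X11 forwarding test')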

Batch jobs

Batch jobs are a little more complicated. Bluehive and the CS cluster both use the TORQUE queueing system. Many users want to use the clusters at the same time, and TORQUE provides a system for more-or-less fairly dividing the computing resources among the different users. In order to reserve time on the cluster, you submit a small script with some information about how much computing power you need, how long you need it for, etc., as well as the shell commands that actually run your job. There is a good tutorial on the CRC website about how to submit jobs to Bluehive, but we'll go over an example here which should tell you everything you need to know to run R scripts.

All of the files for this section are in this tarball: demo.tar.gz. If you have access to the cluster and want to follow along, copy the archive to the cluster and expand it, like so:

scp ~/demo.tar.gz username@bluehive.crc.rochester.edu:~
ssh username@bluehive.crc.rochester.edu
## ...logs into Bluehive...
tar -xvzf demo.tar.gz
cd cluster_demo/

Example PBS file

=== ~/cluster_demo/sampler-example.pbs ===

#!/bin/bash
#PBS -q standard
#PBS -l nodes=1:ppn=1
#PBS -l walltime=1:00:00
#PBS -o samp-examp.log
#PBS -N samp-examp
#PBS -j oe
#
cd $PBS_O_WORKDIR
echo Running on host `hostname`
echo Time is `date`
. /usr/local/modules/init/bash
module load R

R CMD BATCH ~/cluster_demo/sampler-example.R 

exit

The PBS file looks generally like a bash script; the only real differences are the #PBS lines at the beginning, which tell the queueing system what kind of resources this job requires. The lines you will most likely want to change are the second and third #PBS lines. The second, #PBS -l nodes=1:ppn=1, controls how many nodes and how many processors-per-node (ppn) you require. For the kind of simple parallelism described below (splitting a big job into smaller, independent parts with the multicore package, which runs everything on a single machine), keep nodes=1 and change ppn=1 to however many cores you want. The third line, #PBS -l walltime=1:00:00, asks for one hour of runtime. It is very, very important not to underestimate how long your job will take: if it goes over, the whole job gets unceremoniously killed, and unless your script has been saving data incrementally, all will be lost (from that run, anyway). For more advanced info on these settings (like allocating more memory), see the CRC guide.
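
For example, to ask for four cores on a single node, four hours of walltime, and extra memory, the resource lines might look like the following (the mem line is my assumption of the generic TORQUE way of asking for memory; double-check the CRC guide for the syntax Bluehive actually expects):

#PBS -l nodes=1:ppn=4
#PBS -l walltime=4:00:00
#PBS -l mem=8gb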

The fourth and fifth lines control the name of the log file that the shell script's output will be written to, and the name that the job will appear under in the queue listing.

The next (comment-free) section is boilerplate code which changes the working directory, prints some helpful information (to the log file named above), initializes the module system, and loads the R module. Finally, we have the meat of the script: R CMD BATCH ~/cluster_demo/sampler-example.R. This executes the named R script in batch mode, exactly as would happen on your computer. The output from R is redirected to the file sampler-example.Rout in the directory where the job was submitted from.
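
If you'd rather name the R output file yourself (instead of the default sampler-example.Rout), R CMD BATCH takes an optional second argument giving the output file, for instance:

R CMD BATCH ~/cluster_demo/sampler-example.R ~/cluster_demo/sampler-example-run1.Rout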

To submit this job, we use the qsub command:

[dkleins2@bluehive ~]$ cd ~/cluster_demo
[dkleins2@bluehive cluster_demo]$ qsub sampler-example.pbs
1432442.bhsn-int.bluehive.crc.private

This doesn't really produce much output, just a confirmation that it was received by the queueing system. You can check your jobs' status by using the qstat command, and limiting the results to jobs with your username using grep (there is almost certainly a better way to do this, but this is just the hacked-together strategy I've settled on):

[dkleins2@bluehive cluster_demo]$ qstat | grep dkleins2
1432445.bhsn-int           samp-examp       dkleins2               0 R standard    

Depending on how much use the cluster is getting, your job may run immediately (most of the time this is the case for small jobs on Bluehive). A status of R means running, Q means queued, and C means completed (or died from an error). The number is how long it's been running (or how long it ran, before it finished).
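
It turns out there is a slightly less hacky way to do the filtering: qstat's -u flag restricts the listing to a given user's jobs:

qstat -u dkleins2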

Parallelizing code

This particular code samples from an unusual probability distribution (a plateau, with positive likelihood only between -2 and 2, gradually increasing from -2 to -1, steady until 1, and decreasing between 1 and 2), via slice sampling. It samples five independent chains, for 10,000 iterations each. Since what happens in each chain is (by definition) independent of what happens in all the others, this is a perfect opportunity for parallelization. Parallel code takes a job, splits it up into smaller, independent sub-pieces, and runs each of those on its own processor. This is a great way to make use of more of the computational power available on large clusters, and can really speed up long jobs. However, it can be difficult to convert "regular" code into parallel code.
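
For reference, the log-density of that plateau distribution might look something like the sketch below; the real Lg lives in the tarball's R script, so treat this as an illustration rather than the actual code (it may, for instance, work with the density rather than the log-density):

## a plateau on [-2, 2]: rises linearly from -2 to -1, flat from -1 to 1,
## falls linearly from 1 to 2, and has zero density everywhere else
Lg <- function(x) {
  d <- ifelse(x < -2 | x > 2, 0,
       ifelse(x < -1, x + 2,
       ifelse(x <= 1, 1, 2 - x)))
  log(d)   # log(0) is -Inf, which slice samplers treat as "impossible"
}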

Luckily, R developers have made things pretty easy for us, and there are a variety of libraries out there which provide simple front-end interfaces for parallelizing code. The one I've used the most is the multicore package (CRAN). This package happens to be compatible with the particular parallelization backend available on the CRC cluster, but crashes my laptop, so be forewarned.

The multicore package provides two ways to parallelize code. The first, and probably easiest if you're already writing code this way, replaces the lapply (list-apply) command with the mclapply (multi-core list apply), which takes a list and a function, and applies the function separately to each item in the list, returning a list of outputs. For instance, we can convert the R script above by replacing the statement which runs the sampler for five chains:

samps <- lapply(1:5, function(n) slice(x0=0, g=Lg, ss=10000))

with the mclapply equivalent:

samps <- mclapply(1:5, function(n) slice(x0=0, g=Lg, ss=10000))
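
By default, mclapply decides for itself how many worker processes to spawn; if you want to pin that to the number of processors you requested in your PBS file, my understanding is that you can pass it explicitly via the mc.cores argument:

samps <- mclapply(1:5, function(n) slice(x0=0, g=Lg, ss=10000), mc.cores=5)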

The file cluster_demo/sampler-example-parallel.R has this parallelized version, and the PBS file of the same name will run that job in batch mode. Alternatively, you can always run R interactively (with qR), and then call the script with source('~/cluster_demo/sampler-example-parallel.R').

Another interface provided by multicore is the parallel command, which is used to start jobs, and the collect command, used to collect the results. In this example, we might say:

jobs <- list()
for (i in 1:5) {
  jobs[[i]] <- parallel(slice(x0=0, g=Lg, ss=10000))  # starts chain i in the background
}
samps <- collect(jobs)  # blocks until all five chains have finished

Each time parallel is called, it spins off a separate job, and only once collect is called does the script stop and wait for these jobs to finish.

These methods can speed up your code quite a bit, but keep in mind that you have to specifically request the extra processors in your PBS script; since multicore runs everything on one machine, that means raising ppn on a single node rather than asking for more nodes (see below).
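
For the five-chain example above, something along these lines in the PBS file should give mclapply five cores to work with (a sketch; adjust to whatever your cluster allows):

#PBS -l nodes=1:ppn=5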
