The FLARECAST node is the computing server named cluster-r730-1.u-psud.fr,
which is part of the IAS (PSUD) cluster. This cluster is managed by the SLURM system, which should be used for launching any computing-intensive task (in particular a Docker container doing computations).
SLURM can be used as a queuing system, but at PSUD it is only used as a resource allocation system at the moment (as of 2016-04-22).
Please use your PSUD login/password to connect to cluster-head.ias.u-psud.fr. Please note that a connection from outside PSUD requires going through the ias-ssh.ias.u-psud.fr gateway.
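For example, with a recent OpenSSH client (ProxyJump support), the connection from outside PSUD could look like the following sketch, where mylogin stands for your PSUD login:

ssh -J mylogin@ias-ssh.ias.u-psud.fr mylogin@cluster-head.ias.u-psud.fr

or equivalently, with an entry in ~/.ssh/config:

Host cluster-head.ias.u-psud.fr
    User mylogin
    ProxyJump mylogin@ias-ssh.ias.u-psud.fr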
Useful SLURM commands:
sinfo: information about SLURM queues
squeue: list of jobs in the queues
The FLARECAST queue (or "partition" in SLURM) is called flarecast, and corresponds to the FLARECAST node of the cluster (cluster-r730-1). Other queues (and nodes) can be used if requirements exceed the FLARECAST node availability, but these are shared with other projects, so please be considerate in your usage of other nodes. Some of these queues have nodes with GPGPUs (Nvidia K20) or Xeon Phi processors.
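For example, to check the flarecast partition and its jobs:

sinfo -p flarecast     # state of the flarecast partition and its node(s)
squeue -p flarecast    # jobs queued or running in the flarecast partition
squeue -u $USER        # your own jobs, in all partitions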
flarecast is a high-priority queue, meaning that jobs (from other projects) running on nodes including cluster-r730-1 will be suspended if you start a job in the flarecast queue.
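To check the partition configuration (node list, limits, priority), scontrol can display it, for example:

scontrol show partition flarecast    # settings of the flarecast partition
scontrol show node cluster-r730-1    # state and resources of the FLARECAST node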
The salloc command allocates resources within a queue and opens a shell from which commands using these resources can be launched with the srun command. This is mainly used for testing. Exiting the shell will cancel the resource allocation. If you want to keep the resource allocation (and the job running) and log out from your terminal, you will need to run salloc within a screen session (then don't forget to cancel the resource allocation later!).
ebuchlin@cluster-head:~$ salloc -p flarecast -n 2    # 2 tasks in partition "flarecast"
salloc: Granted job allocation 2765
ebuchlin@cluster-head:~$ srun hostname
cluster-r730-1
cluster-r730-1
ebuchlin@cluster-head:~$ exit
salloc: Relinquishing job allocation 2765
salloc: Job allocation 2765 has been revoked.
ebuchlin@cluster-head:~$
(hostname is used as an example, to show that it is run on the FLARECAST node; in practice you will use your own command: a Python script, mpirun, ...)
srun can also be used without salloc, but you then need to specify the SLURM options for each srun call:
ebuchlin@cluster-head:~$ srun -p flarecast -n 2 hostname
cluster-r730-1
cluster-r730-1
ebuchlin@cluster-head:~$
Again, please use screen if you plan to log out after launching the job.
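A minimal sketch of this workflow with standard GNU screen (the session name is arbitrary):

screen -S flarecast          # open a detachable terminal session
salloc -p flarecast -n 2     # allocate resources inside the screen session
srun ./my_executable         # run your command on the allocated node
# detach with Ctrl-a d, log out, then later reattach with:
screen -r flarecast
# when done, exit the salloc shell to release the allocation, then exit screen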
srun and salloc have various options for selecting a queue, the desired number of nodes or tasks per node, and so on.
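For reference, a few commonly used options (they are shared by srun, salloc and sbatch, and can also be put on #SBATCH lines; see the man pages for the full list):

-p <partition>           queue/partition, e.g. -p flarecast
-n <ntasks>              total number of tasks
-N <nnodes>              number of nodes
--ntasks-per-node=<n>    number of tasks per node
-t <time>                time limit, e.g. -t 02:00:00
-J <name>                job name

For example: srun -p flarecast -N 1 --ntasks-per-node=4 hostname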
A batch job can be launched using
ebuchlin@cluster-head:~$ sbatch script.sh
where script.sh is a shell script including one or more lines starting with #SBATCH followed by SLURM options.
For example, for 10 independent tasks, script.sh
can be:
#!/bin/bash
#SBATCH -n 10 -p flarecast
cd some_directory
srun ./my_executable
For MPI parallelization on 12 processors:
#!/bin/bash
#SBATCH --job-name my_job_name
#SBATCH -n 12
echo "$SLURM_NNODES nodes: $SLURM_NODELIST"
cd my_directory
mpirun ./my_executable
For an IDL job (using the full node; otherwise, please change !cpu.tpool_nthreads accordingly):
#!/bin/bash
#SBATCH -N 1 -p flarecast
cat > idlscript.pro << EOF
my_idl_command1
my_idl_command2
EOF
idl idlscript.pro
rm idlscript.pro
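Once submitted, the job output goes by default to a file named slurm-<jobid>.out in the submission directory (unless changed with the --output option); the job ID is printed by sbatch. You can then monitor or cancel the job:

sbatch script.sh            # prints "Submitted batch job <jobid>"
squeue -u $USER             # check the state of your jobs
tail -f slurm-<jobid>.out   # follow the job output
scancel <jobid>             # cancel the job if needed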