Running computations directly on the front-end node (genossh.genouest.org) is forbidden. You MUST first connect to a node (using srun) or
submit a job to a node (using sbatch).
Listing available nodes
When you submit a job, it is dispatched to one of the computing nodes of the cluster.
Those nodes have different characteristics (CPU, RAM): they have from 128 GB up to 755 GB of RAM, with 8 to 40 cores each. Run the following command to display the list of available nodes with their characteristics and load (memory in MB):
sinfo -N -O nodelist,partition,cpusstate,memory,allocmem,freemem
- Column 1: the node name
- Column 2: the partition the node belongs to
- Column 3: number of CPUs of the node (allocated/idle/other/total)
- Column 4: total amount of memory (in MB)
- Column 5: total amount of allocated memory (in MB)
- Column 6: total amount of unused (but potentially allocated) memory (in MB)
("allocated" means "reserved by someone")
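As a minimal sketch (not part of the official tooling), the listing produced by the sinfo command above can be filtered with awk to keep only the nodes that currently have enough free memory. The helper name nodes_with_free_mem is hypothetical, the sample data is made up, and the sketch assumes the six columns are separated by whitespace as described in the list above.

```shell
# Hypothetical helper: read the sinfo listing on stdin and print the names of
# the nodes whose free memory (column 6, in MB) is at least the value given in $1.
nodes_with_free_mem() {
    # NR > 1 skips the header line; "$6 + 0" forces a numeric comparison.
    awk -v min="$1" 'NR > 1 && $6 + 0 >= min { print $1 }'
}

# Made-up sample data in the same column layout, so the sketch runs anywhere.
sample='NODELIST PARTITION CPUS(A/I/O/T) MEMORY ALLOCMEM FREE_MEM
nodeA genouest 0/8/0/8 128000 0 120000
nodeB genouest 8/0/0/8 128000 128000 5000'

# Only nodeA has at least 100000 MB of free memory in the sample.
printf '%s\n' "$sample" | nodes_with_free_mem 100000
```

On the cluster, you would pipe the real command into the helper instead of the sample, for example: sinfo -N -O nodelist,partition,cpusstate,memory,allocmem,freemem | nodes_with_free_mem 100000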
Creating a job
You can launch a shell on a computing node using:
srun --pty bash
You can submit a job script with the sbatch command:
sbatch my_script.sh
You can add submission options in the header of the script using SBATCH directives:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --chdir=workingdirectory
#SBATCH --output=res.txt
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
You can submit jobs to a specific partition:
sbatch -p genouest my_script.sh
By default, jobs are submitted to the main partition.
You only need to use this option for very specific cases.
You can monitor your jobs with the squeue command (it lists all the jobs by default; restrict the list to a specific user with the -u option):
squeue
squeue -u username
Reserving CPU and memory
By default, each job is limited to 1 CPU and 6 GB of memory. If you need more (or fewer) resources, add the following options to the srun or sbatch command (or use the equivalent SBATCH directives):
sbatch --cpus-per-task=8 --mem=50G job_script.sh
In this example, we request 8 CPUs and 50 GB of memory on a node to execute the bash
script job_script.sh. Many options are available to fine-tune the amount
of CPU and memory reserved for your job; have a look at the srun manual.
If at least 1 CPU and 6GB are not available on one node, you may have to wait
to be placed. You can use the same options when using srun.
These limits are strict: your job will not be allowed to use more than was requested. If it uses more RAM than requested, it will be killed.
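A job script can read its own reservation instead of hard-coding a thread count. The sketch below assumes the SLURM_CPUS_PER_TASK environment variable, which Slurm sets inside a job when --cpus-per-task is given; outside of Slurm it falls back to 1. The tool named in the last comment is purely hypothetical.

```shell
#!/bin/bash
#SBATCH --cpus-per-task=8
#SBATCH --mem=50G

# Number of CPUs reserved for this job (falls back to 1 outside of Slurm).
threads="${SLURM_CPUS_PER_TASK:-1}"
echo "Running with ${threads} thread(s)"

# Pass the value to your program so that it does not start more threads than
# the CPUs you reserved, e.g. (hypothetical tool): my_tool --threads "$threads"
```

Keeping the thread count tied to the reservation avoids a program starting more threads than the CPUs it was given.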
If your job is stuck with the message “srun: job xxxx queued and waiting for resources” and nothing happens, it means that no more resources are available on the cluster. In this case, you can try the “tiny” partition, where you can launch very short jobs with limited resources:
srun -p tiny --pty bash
Jobs in this partition get limited resources: at most 2 GB of memory and 2 CPUs, with a time limit of 2 hours. In return, these tiny jobs have a higher priority. There is a limit of 2 simultaneous jobs per user on this partition. These limits ensure that anyone can get a slot on a node for very short tasks at any time. Please don’t abuse it.
Execution time limit
All the jobs have a default maximum runtime of 15 days. If one of your jobs is still running after 15 days, it will automatically be stopped by the system.
If you know your job will take longer than 15 days, you can ask for more (e.g. 25 days) when launching your job:
sbatch --time 25-00:00:00 my_short_script.sh
You can also modify the time limit while a submitted job is still pending (PD state):
scontrol update JobId=<job-id> TimeLimit=25-00:00:00
You will get a "permission denied" error if you try to run this command on a running job. In this case, contact us and we will do it for you (keep in mind that we might not be available immediately, so don't ask us 2 minutes before the job reaches its time limit).
There is a hard limit of 30 days for all jobs. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
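To make these formats concrete, here is a small illustrative helper that converts a time specification into seconds. The function name slurm_time_to_seconds is hypothetical and is not a Slurm command; it only mirrors the six formats listed above and does no input validation.

```shell
# Hypothetical helper: convert a Slurm-style time specification to seconds.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; hours = 0; minutes = 0; seconds = 0
        spec = $0
        if (index(spec, "-") > 0) {
            # A dash separates the days from hours[:minutes[:seconds]].
            split(spec, parts, "-")
            days = parts[1]
            n = split(parts[2], f, ":")
            hours = f[1]; minutes = f[2]; seconds = f[3]
        } else {
            # Without a dash: minutes, minutes:seconds or hours:minutes:seconds.
            n = split(spec, f, ":")
            if (n == 1)      { minutes = f[1] }
            else if (n == 2) { minutes = f[1]; seconds = f[2] }
            else             { hours = f[1]; minutes = f[2]; seconds = f[3] }
        }
        print days * 86400 + hours * 3600 + minutes * 60 + seconds
    }'
}

slurm_time_to_seconds 60            # 60 minutes
slurm_time_to_seconds 10:00         # 10 minutes, 0 seconds
slurm_time_to_seconds 25-00:00:00   # 25 days
```

Note that a bare number is a number of minutes, and that "10:00" means 10 minutes, not 10 hours.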
If you know one of your jobs will finish well before 15 days, you can use the
--time option with sbatch or srun:
sbatch --time 60 my_short_script.sh
This command means that the job will have a maximum lifetime of 60 minutes. If it lasts more than that, it will be killed.
Using --time has a big advantage: if the cluster is under heavy load, your job has a higher chance of starting quickly (the job scheduler takes the expected maximum run time into account).
Monitoring resource usage
To get more information on resource usage, you can:
- display more information for running jobs using squeue:
squeue -l -o "%.18i %.9P %.70j %.8u %.2t %.10M %.6D %4C %10m %15R %20p %7q %Z"
Note that if you used the --mem-per-cpu option, MIN_MEMORY does not take it into account: you need to multiply it by the number of CPUs reserved to get the amount actually reserved. Alternatively, you can use this command and look at the TRES column to see what is really reserved:
squeue -l -O "JobID:18,partition:10,name:40,username:9,state:10,timeused:11,tres:45,NodeList:11,reason:20,priority,qos:10,workdir:"
- get information on a specific job:
scontrol show job <job_id>
- check the maximum memory used by a running job:
sstat -j <job_id>.batch --format="JobID,MaxRSS"
- check the cpu time and maximum memory used by a finished job:
sacct -j <job_id> --format="JobID,CPUTime,MaxRSS,ReqMem"
Many other options to squeue, scontrol, sacct, and sstat are available; you can consult their documentation by running them with the --help option.
You can find a quick tutorial on Slurm on this web site.
Killing a job
To kill a job, simply execute scancel with the ID of the job:
scancel <job_id>
It is possible to submit many similar jobs at once using job arrays. See the job arrays documentation for more details. Briefly, if you launch this command:
sbatch --array=1-50%5 my_script.sh
an array of 50 jobs will be created for the script
my_script.sh, with a maximum of 5 jobs running simultaneously. In my_script.sh, you have access to the SLURM_ARRAY_TASK_ID environment variable which corresponds to the index of the task between 1 and 50.
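A minimal sketch of what my_script.sh could look like. The input file naming scheme (sample_1.fastq to sample_50.fastq) is an assumption made for illustration; adapt it to your own data. The --array directive in the header is equivalent to passing the option on the sbatch command line.

```shell
#!/bin/bash
#SBATCH --array=1-50%5

# Index of this task in the array (falls back to 1 when run outside of Slurm).
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# Hypothetical naming scheme: one input file per task.
input_file="sample_${task_id}.fastq"
echo "Task ${task_id} will process ${input_file}"
```

Each of the 50 tasks runs the same script and only differs by the value of SLURM_ARRAY_TASK_ID, so the index is the natural way to select the data a task works on.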
Our cluster is a shared and limited resource. Some limits are enforced to prevent a single user from monopolizing all the resources:
- Jobs are automatically killed after 15 days of computing
- No more than 25 jobs per user can run simultaneously
- Jobs in the pending queue are prioritized automatically by Slurm, depending on the requested resources and on the resources used in the past by each user
These limits can be modified without warning depending on the load of the cluster, and the available physical resources.
Please refrain from launching insane amounts of jobs that would block other users for too long.
Please be patient if your job is in the queue: it will be executed sooner or later.
Long-running interactive jobs (srun)
If you start an interactive job (srun --pty bash), run a long-running command, and then need to disconnect before it finishes, the job will unfortunately be killed and your command stopped.
There is a way to avoid that: use tmux, which is a terminal multiplexer.
First connect to genossh.genouest.org as usual, then start a tmux session:
tmux
Then you can connect to a node using srun, and launch the commands you like. When you need to disconnect, detach from your tmux session by typing Ctrl+B on your keyboard, then the letter d. You can then safely disconnect from genossh (and the internet).
Later, when you want to reconnect to your interactive job, just connect to genossh and attach to the tmux session you created before by running:
tmux attach
You will be able to continue your work just as if you had never disconnected from the cluster.
Tmux can manage multiple parallel sessions like this; look at its documentation for more advanced usage.
Two compute nodes with Nvidia GPUs are available on the Slurm cluster. To use them, run srun as for a normal Slurm job, but with 2 specific options:
srun --gres=gpu:1 -p gpu --pty bash
The -p option selects the gpu partition, which contains the nodes equipped with GPUs. The --gres option determines the number of GPUs that Slurm will reserve for you. Slurm automatically populates an environment variable (CUDA_VISIBLE_DEVICES) with the id(s) of the GPU(s) that you can use. CUDA applications read this environment variable to use the reserved GPU(s).
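Once connected to a GPU node, you can check what was reserved for you. The sketch below assumes CUDA_VISIBLE_DEVICES holds a comma-separated list of GPU ids; the helper name count_gpus is hypothetical.

```shell
# Hypothetical helper: count the ids in a comma-separated list of GPU ids.
count_gpus() {
    if [ -z "$1" ]; then
        echo 0
    else
        printf '%s\n' "$1" | awk -F, '{ print NF }'
    fi
}

# Outside of a GPU job the variable is usually unset, hence the fallbacks.
echo "Reserved GPU id(s): ${CUDA_VISIBLE_DEVICES:-none}"
echo "Number of GPUs: $(count_gpus "${CUDA_VISIBLE_DEVICES:-}")"
```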
Note that access to GPU Performance Counters is not restricted to admin, which means that when you compute data using GPUs, other users of the GPUs can potentially gain access to the data processed by your job. If this is a problem and you absolutely need data privacy, please contact us. For more background on this, have a look at the corresponding security advisory.
Some tools need to run on recent processors supporting specific instruction sets like AVX or AVX2. A few old compute nodes don’t support these instructions. To make sure that your job runs on a recent node supporting them, add the --constraint option to srun or sbatch:
sbatch --constraint avx2 my_script.sh
If you want to run software that requires access to an X11 server, you can enable X forwarding by following these steps:
First, connect to the cluster with the -XC options (X is to enable X forwarding, C is to enable compression):
ssh -XC <your-login>@genossh.genouest.org
You first need to set up a specific ssh key (you only need to do this once, the first time you use X11 forwarding). Do it like this:
ssh-keygen -f ~/.ssh/id_slurm -t rsa -b 4096
cat ~/.ssh/id_slurm.pub >> ~/.ssh/authorized_keys
You must not protect this ssh key with a passphrase (just press Enter when asked for one). This creates 2 files in your home (~/.ssh/id_slurm and ~/.ssh/id_slurm.pub) that you must not share with anyone.
You can then simply run the following commands to start using an X application:
ssh -X <your-login>@genossh.genouest.org
srun --x11 --pty bash
If you need to use the DRMAA library (to launch jobs from python code for example), you’ll need to define these environment variables:
export LD_LIBRARY_PATH=/data1/slurm/drmaa/lib/:$LD_LIBRARY_PATH
export DRMAA_LIBRARY_PATH=/data1/slurm/drmaa/lib/libdrmaa.so
Singularity is a new technology that makes it possible to use containers in a High-Performance Computing environment.
Like Docker, it allows you to launch applications inside containers, completely isolated from the rest of the system. However, unlike Docker, you don’t have access to the root account inside the container. This makes it possible to use it on a standard cluster like the GenOuest one.
Singularity is installed on the newest computing nodes of the cluster. To use it, you need to source it:
Then you can launch any singularity container, for example:
singularity run library://sylabsed/examples/lolcow
Singularity is compatible with Docker images; you can run one like this:
singularity shell docker://quay.io/biocontainers/bowtie2:188.8.131.52--py35h2d50403_1
If you want to have access to some specific directories from the cluster, you can use the -B option like this:
singularity shell -B /db:/db -B /omaha-beach:/omaha-beach docker://quay.io/biocontainers/bowtie2:184.108.40.206--py35h2d50403_1
See the official website for more information on how to use Singularity.
You can use Jupyter in multiple ways using the GenOuest resources:
- By launching a VM in the Genostack cloud
- By running it inside a Docker container with GO-Docker
- By running it on the Slurm cluster
Here’s some help to run it on our cluster (inspired by https://alexanderlabwhoi.github.io/post/2019-03-08_jpn-slurm/)
First, connect to the cluster and connect to a compute node:
ssh <login>@genossh.genouest.org
srun --pty bash
Then source the preinstalled Jupyter:
Then run a jupyter notebook, with the option --no-browser as no web browser is installed on our cluster. In the following commands we use the port 8888, but you should use another port of your choice between 10000 and 20000 for example.
jupyter notebook --no-browser --port 8888
Then, open another console on your local machine (laptop), and create an ssh bridge like this:
ssh -A -t -t <login>@genossh.genouest.org -L 8888:localhost:8888 ssh cl1nXXX -L 8888:localhost:8888
Replace cl1nXXX with the name of the node where the Jupyter notebook is running.
Then you can use your favorite web browser and connect to http://localhost:8888/
The port you chose may already be used by someone else, in which case you’ll get an “Address already in use” error. If so, choose another port and rerun everything with the new port number.
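To avoid typing the port number in several places, a small sketch like the following picks a random port in the suggested 10000 to 20000 range and prints the two commands with that port filled in. The variable names are illustrative, and the login and node values are placeholders to replace with your own.

```shell
# Pick a random port between 10000 and 20000 (inclusive).
port=$(awk 'BEGIN { srand(); print 10000 + int(rand() * 10001) }')

# Placeholders: replace with your login and the node running the notebook.
login="<login>"
node="cl1nXXX"

echo "On the compute node:"
echo "  jupyter notebook --no-browser --port ${port}"
echo "On your local machine:"
echo "  ssh -A -t -t ${login}@genossh.genouest.org -L ${port}:localhost:${port} ssh ${node} -L ${port}:localhost:${port}"
```

A random port only makes a collision less likely; if you still get the “Address already in use” error, run the sketch again to get another port.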
If you want to use the brand new JupyterLab instead, do the same, but source jupyterlab instead: