Greetings HPC User,

We would like to announce a change to how the HPC cluster allocates resources to interactive jobs. Please bear with us, as this is a lengthy write-up.

What is the problem?

Over the past few weeks, we have noticed that many interactive jobs start and then sit idle for prolonged periods. Most of these jobs were submitted during the day but waited in the queue for many hours because of insufficient resources. When resources finally became available, the jobs started around midnight, when the job owner was already asleep. A job that starts this way occupies its resources while idle until the owner wakes up and attends to it. In other cases, users request resources but never come back to use them. All of these wasted resources could have been used by other jobs to get research done.

Solution and Justification

As HPC resources are very limited, we want to make sure every resource is properly utilised. Since interactive jobs are mainly meant for debugging, we want to make sure that no interactive job starts at odd hours when its owner is away from the keyboard (typically asleep), because nobody will be there to use it. Based on the cluster usage data we have collected, we believe that if the system cannot serve your interactive job within 60 minutes of the request, it is very unlikely to serve you in the next few hours either.

What will be changed?

From now on, the system will terminate any interactive job that has been queueing for more than 60 minutes. This rule applies to all interactive jobs, whether started from the terminal or from the OnDemand portal. That includes jobs submitted with the “srun” and “salloc” commands, and all interactive jobs launched from the OnDemand portal, such as Jupyter and RStudio. If you are still at your PC waiting for resources, you may resubmit your interactive job to try again.
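The rule can be illustrated with a short Python sketch. This is not the cluster's actual enforcement code; the function name `should_cancel`, its parameters, and the use of a "PENDING" state string are assumptions made purely for illustration. It only shows the decision described above: an interactive job that is still waiting in the queue after more than 60 minutes is terminated, while batch jobs and jobs that have already started are not affected by this rule.

```python
from datetime import datetime, timedelta

# Queue-time limit for interactive jobs, taken from the announcement.
MAX_QUEUE_TIME = timedelta(minutes=60)


def should_cancel(is_interactive, state, submitted_at, now):
    """Return True if a job falls under the 60-minute queue rule.

    Hypothetical helper: the names and the "PENDING" state label are
    illustrative assumptions, not the cluster's real implementation.
    """
    # Batch jobs and jobs that are no longer waiting are not covered.
    if not is_interactive or state != "PENDING":
        return False
    # Cancel only once the job has queued for MORE than 60 minutes.
    return now - submitted_at > MAX_QUEUE_TIME


if __name__ == "__main__":
    submitted = datetime(2024, 1, 1, 14, 0)
    # Still queued 61 minutes after submission: covered by the rule.
    print(should_cancel(True, "PENDING", submitted, datetime(2024, 1, 1, 15, 1)))
    # A batch job queued for the same time is left alone.
    print(should_cancel(False, "PENDING", submitted, datetime(2024, 1, 1, 15, 1)))
```

In this sketch, a job that has queued for exactly 60 minutes is not yet cancelled, matching the "more than 60 minutes" wording.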

Should you worry?

If you only submit jobs with a batch submission script, we believe this change has no direct impact on you. You may continue as before, provided you do not violate our policy.

If you are a frequent user of interactive jobs, for example a heavy Jupyter user, your jobs will now queue for at most 1 hour instead of queueing until resources become available. If resources are free at the time of your request, you will be granted the allocation, provided your priority is high enough. Otherwise, if the cluster is very busy, you may need several attempts to get your interactive job running. We suggest converting your work into batch jobs where possible to avoid this problem.
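As a rough illustration of what converting interactive work into a batch job can look like, here is a minimal sketch of a Python script that carries its own "#SBATCH" directives. The job name, resource values, output file name, and the `run_analysis` function are placeholders rather than recommended settings for this cluster, and any partition or account options your site requires are left out.

```python
#!/usr/bin/env python3
#SBATCH --job-name=analysis
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00
#SBATCH --output=analysis-%j.out
"""Sketch of a batch job: placeholder values, not site-recommended settings."""
import os


def run_analysis(values):
    """Stand-in for the work you would otherwise run in a notebook cell."""
    return sum(values) / len(values)


def main():
    # Slurm sets SLURM_JOB_ID inside a job; fall back when run by hand.
    job_id = os.environ.get("SLURM_JOB_ID", "not-under-slurm")
    result = run_analysis([1.0, 2.0, 3.0])
    print(f"job {job_id}: mean = {result}")


if __name__ == "__main__":
    main()
```

Saved as, say, analysis.py (a hypothetical file name), the script would be submitted with "sbatch analysis.py". A batch job submitted this way waits in the queue and runs unattended once resources are free, so nobody needs to be at the keyboard when it starts.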

What is not covered?

This change does not resolve the case where an interactive job starts immediately but the user has already left the PC. Since this is a different issue, we will use different measures to deal with users who occupy resources without using them. We will monitor such jobs for at least one hour before deciding whether to terminate them and release the resources.

Please keep in mind that wasting expensive resources is a serious offence that we do not tolerate on the cluster, and it could result in account suspension if resource usage does not improve.

If you are worried that this change will severely impact your daily work, please let us know by raising a ticket at our service desk. Our team will be there to assist you.

Thank you.

Categories: HPCNews