Greetings HPC User,

We would like to share some updates on the changes and new features introduced in the HPC cluster in June 2024.

Cluster Node Information

Querying cluster node information using sinfo or scontrol has been temporarily disabled due to several cases of misuse. We found that some users were using the idle-resource information in the command output to ‘squeeze’ their jobs into the remaining gaps. This has indirectly caused inefficiency in job scheduling, where jobs requesting large numbers of CPUs are forced to queue until jobs with unusual resource requests (6 CPUs, 9 CPUs, etc.) complete.
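
For clarity, the commands affected are the node-level queries, for example the ones below (the node name is only illustrative):

    sinfo -N -l                # list every node with its state and CPU/memory details
    scontrol show node gpu01   # detailed information for a single node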

We have never supported or recommended job ‘squeezing’ in the HPC cluster, and scheduling should be left entirely to the HPC scheduler. The reduction in maximum wall time from 7 days to 3 days has also reduced the impact of job ‘squeezing’. This decision is not permanent and may change based on feedback, so if you think disabling node information queries will heavily impact your work on the HPC cluster, please let us know.

Migration of Documentation Site

Our Atlassian Confluence license has expired and is no longer renewable, which has made it difficult for us to maintain a safe and secure service on the current version of Confluence. There have been a few cases since last year where the Confluence site was compromised; fortunately, we had backups and were able to recover it.

We have been looking for an alternative solution and have decided to migrate everything to our new platform “Outline”. The new documentation site is now available at https://docs.dicc.um.edu.my/s/hpc, and no login is required to access it. The previous Confluence site will no longer be accessible to the public starting July 2024. If you have any problems accessing the new documentation site, please let us know.

Introduction of MATLAB in HPC Cluster

We have just released our first MATLAB integration in the HPC cluster. Based on our internal testing, the integrated MATLAB is able to utilise up to 1024 CPU cores to run parallel MATLAB code. If you have parallel MATLAB code that you have been wanting to run, you may try out the MATLAB cluster. Please refer to our documentation for more information at https://docs.dicc.um.edu.my/s/hpc/doc/matlab-MwM0X6I1CB.
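
As a rough sketch only, a batch submission for the MATLAB integration could look like the example below; the partition name, module name and resource sizes are placeholders rather than the actual values for our cluster, so please follow the documentation above:

    #!/bin/bash
    #SBATCH --job-name=matlab-parallel
    #SBATCH --partition=cpu-epyc       # placeholder partition name
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=16         # keep this in line with the parpool size in your script
    #SBATCH --time=1-00:00:00

    module load matlab                 # placeholder module name
    # Run the script non-interactively; parpool inside it should not request
    # more workers than the CPUs allocated above.
    matlab -batch "my_parallel_script"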

If you have colleagues who have been wanting to run MATLAB in the HPC cluster, please feel free to forward this news to them.

Scratch Cleanup Policy

We will re-enable scratch cleanup starting July 2024. Files that have not been accessed for 90 days will be wiped; files that were created or extracted within the last 90 days will remain untouched. If you need a longer cleanup period, please talk to us via the service desk.
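
If you want to check which of your files would be affected before the cleanup runs, a simple access-time check with find is enough; the scratch path below is only an assumption, so substitute your actual scratch directory:

    # List files not accessed in the last 90 days under your scratch directory.
    find /scratch/$USER -type f -atime +90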

Resources in Job Submission

We would like to remind everyone again about the allocation of resources during job submission. Please do not submit CPU-only jobs to GPU partitions, as this blocks other users from utilising the GPUs. If such jobs are found, we will immediately terminate them and suspend the user account.

Also, for GPU jobs, please make sure a proper CPU-to-GPU ratio is allocated for your jobs. For example, in the gpu-a100 partition, do not allocate 1 GPU together with 64 of the server's 128 CPUs. We will terminate such jobs when found, and if such jobs are submitted repeatedly, we will have no choice but to temporarily suspend the user account.
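
As an illustration only, a more reasonable request on the gpu-a100 partition keeps the CPU count proportional to the GPUs requested; the exact numbers below are our assumption, so size them to what your job actually uses:

    #!/bin/bash
    #SBATCH --partition=gpu-a100
    #SBATCH --gres=gpu:1          # 1 GPU
    #SBATCH --cpus-per-task=16    # roughly 1/8 of the node's 128 CPUs for 1 of its GPUs
    #SBATCH --time=1-00:00:00

    # launch your GPU application here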

We have also seen users submitting jobs that spawn processes without proper control. These jobs spawned an excessive number of processes, driving CPU load well above the number of allocated CPUs. Such jobs cause performance issues for everyone on the node and, if not dealt with, will crash the affected nodes.
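
A common cause is applications that decide their own thread or process count instead of respecting the Slurm allocation. A minimal sketch of keeping them in line, assuming an OpenMP-style application, is:

    # In your batch script: limit threads to the CPUs Slurm actually allocated.
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
    # Likewise, pass this value to any tool that accepts a thread/worker flag
    # instead of letting it autodetect all cores on the node.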

Since the HPC cluster is a shared resource available free of charge to the entire UM research community, please be considerate of one another when using it.

MIG Profile Removal

We are planning to remove the MIG profile from one of the DGX servers. Initially, we thought it was a good idea to enable MIG on one of the DGX servers to support more GPU jobs. However, based on the job queue over the past 2 months since the new cluster became available, there have not been enough jobs submitted to justify MIG usage.

We understand that part of the reason is a CUDA limitation with MIG profiles that prevents certain applications from utilising multiple GPUs. We are planning to enable multi-node, multi-GPU execution across both DGX servers in the near future to allow larger jobs to run.

Clarification on Job Scheduling

We would like to clarify some rumours that users may have heard. We do not provide VIP treatment in job scheduling (such as queue-jumping or higher priority) for any user, and we treat and serve all users equally when providing support. If you have heard claims that specific nodes are reserved for certain parties, please be assured that those claims are untrue.

Thank you and have a nice day.

Categories: HPCNews