We would like to inform you that our HPC service is accessible again as of 1:00 pm, 30 September 2021.
We have identified the root cause: an internal service that was causing random failures in the HPC core component. We have applied fixes to that internal service and have observed no further failures over the past two days.
Some users' jobs have been requeued, and some jobs have failed because of the service failure. Users will need to resubmit jobs if necessary. We have also updated the memory configuration on all compute nodes, so every compute node now has less memory available, as part of its memory is reserved for system stability.
If you are unable to access the HPC service now that it has reopened, and you attempted to access it multiple times before it reopened, your IP address may have been blocked automatically by the system. In this case, please try again in one hour. If the problem persists, please let us know at the service desk.
Thank you for your patience.