Greetings HPC User,

We have been experiencing a major Scratch Storage service disruption since Monday. We have been trying to resolve the issue without bringing down the cluster, but the storage has been unable to heal properly or in a timely manner because of the continuous high load from the cluster.

We have no choice but to bring down the HPC cluster temporarily until we can confirm the issue is resolved. We believe this is the best way to let the storage heal faster and to avoid any potential data loss. We apologize for the unscheduled downtime, and please rest assured that we will try to resolve the issue as soon as possible.

We will keep everyone updated as soon as we have more information. Thank you for your understanding.

Update 1 (15 August 2024, 3:42pm):

We have reopened access to the HPC cluster. For the time being, we will continue to monitor the status and stability of the Scratch Storage and will take action if we observe anything unusual. Please let us know if you encounter any issues accessing the Scratch Storage in the HPC cluster.

Thank you for your patience.

Categories: HPCIncident