Greetings HPC User,

We would like to give an update on what has been happening with our HPC scratch storage over the past few weeks.

There have been multiple disruptions to the scratch storage service in which the storage became unresponsive to all read and write operations, eventually causing every job in the cluster to hang. We observed that the scratch storage service had become too unstable for general computing use, and we also noticed that a small number of files had become corrupted and inaccessible.

To prevent further data loss and prolonged service disruption, we have decided to retire the current scratch storage service and rebuild it with a new layout. This means giving up some features that we expected to be useful in the long run but that have proved unnecessary for current usage.

Scheduled maintenance will take place on Tuesday, 27 August 2024 at 8am to bring down the scratch storage service. All data in scratch will be temporarily moved to our current CephFS storage, and everything that currently uses scratch storage will temporarily use CephFS instead until the new scratch storage layout is ready. Jobs are therefore expected to see some drop in performance during this period.

The cluster is expected to reopen once all current scratch storage data has been migrated to CephFS, which should take a day or two. We will post an update once the cluster has reopened.

We are sincerely sorry for the service disruptions over the past two weeks. We appreciate your patience while we work on the storage issues.

Categories: HPCIncident