HPC CPU Node Crashed

Dear all HPC Users,

One of the CPU compute nodes in the HPC pool, cpu06, crashed due to high CPU load caused by processes that were stuck on the machine indefinitely. As a result, all jobs running on cpu06 failed, because the worker daemon could not communicate with the scheduler under the high CPU load.

The machine was physically rebooted this morning at around 7:45 am. If you had jobs running on the affected machine, please check them and resubmit if necessary. We are still investigating the root causes of the incident and will take appropriate action to prevent it from happening again.

If you have any issues or questions, please do not hesitate to contact us through the service desk.

Thank you.

Published by DICC on January 5, 2021
