2022

| Status | Type | Start | End | Scope | User Impact | Reason |
|---|---|---|---|---|---|---|
| Resolved | Service alert | 2022-11-21 09:00 GMT | 2022-11-21 11:00 GMT | Login nodes | The Cirrus login nodes are currently unavailable to users | The Ceph home file system has issues due to a failure at the data centre |
| Resolved | Service alert | 2022-11-13 10:00 GMT | 2022-11-14 09:00 GMT | SAFE website | Users will get a security warning when trying to access the SAFE website; some web browsers (e.g. Chrome) will not connect to the SAFE website; the Cirrus load plot on the status page will not work | The website certificate has expired |
| Resolved | Service alert | 2022-10-24 08:00 BST | 2022-10-24 09:30 BST | Login nodes | Login access to Cirrus is currently unavailable | Login nodes became unresponsive |
| Resolved | Service alert | 2022-09-12 07:00 BST | 2022-09-14 14:30 BST | Login nodes, compute nodes | Access to solid state storage (/scratch) is now available from login nodes, CPU and GPU nodes. Software modules are now available and are loaded from Lustre (/work). | The RPOOL solid state storage has an error on one of the NVMe devices |
| Resolved | Partial | 2022-08-24 09:00 | 2022-08-24 13:00 | No access to login nodes, data and SAFE. Running jobs will not be affected and new jobs will start. | Up to 4 hours loss of connection to Cirrus login nodes, data and SAFE access | Essential updates to the network configuration. Users will be notified when service is resumed. |
| Resolved | At-Risk | 2022-07-21 09:00 | 2022-07-21 10:50 | Slurm control and accounting services | Submitting new jobs to the scheduler, starting of jobs and querying of jobs will be unavailable for the duration of the work. Running jobs will continue without interruption. Work was completed, ahead of schedule, at 10:50. | Update to Slurm |
| Resolved | At-Risk | 2022-07-06 09:00 | 2022-07-06 11:25 | RPOOL /scratch filesystem | No user impact expected | RAID check to be performed on a disk reporting issues |
| Resolved | Full | 2022-06-23 15:00 | 2022-06-24 15:00 | Full Cirrus Service - login nodes, compute nodes (CPU and GPU), all filesystems | Users will not be able to connect to the service, no jobs will run and filesystems are unavailable. | Issues with the scratch filesystem following a successful RAID card replacement |
| Resolved | At-Risk | 2022-06-08 09:00 | 2022-06-08 17:00 | Servers which provide the /scratch filesystem | No user impact is expected. | Two configuration changes are required to resolve reliability and integrity issues |
| Resolved | Issue | 2022-05-18 09:00 | 2022-05-18 12:00 | Cirrus compute and login nodes including the /scratch servers | Users will not be able to connect to Cirrus and no jobs will run. The system has been drained ahead of this maintenance. Once this is confirmed a user mailing will be sent out. | A configuration setting change is required for /scratch to resolve the issues which were impacting users' work last week. |
| Resolved | Issue | 2022-05-09 09:00 | 2022-05-09 17:00 | Cirrus /scratch servers | All user activity was affected, as modules could not be loaded. Partial resolution in the afternoon allowed users to access Cirrus but modules still could not be loaded, and new jobs would not run correctly. | Under investigation by systems and HPE. |
| Resolved | Issue | 2022-04-21 10:30 | 2022-04-21 11:30 | Issues with /home which is also impacting the module environment | Users are unable to access their data within /home and cannot load the module environment. | Issues with the E1000 server. |
| Resolved | Issue | 2022-06-22 09:00 | 2022-06-23 15:00 | Rack containing 280 CPU nodes | The RAID card has been replaced successfully. There are a few issues on the /scratch filesystem which are not related to the RAID card, so further testing will be performed before the full service is resumed. A new alert has been created for this. | Replacement of RAID card in the main CPU rack. |
| Resolved | Issue | 2022-03-28 16:50 | 2022-03-28 16:53 | Cirrus | Users logged out of login node 1. | Login node 1 shut down due to operator error. Users can reconnect and will be automatically directed to other login nodes. |
| Resolved | Issue | 2022-01-09 10:00 | 2022-01-10 10:30 | Cirrus, SAFE | Outgoing network access from Cirrus to external sites was not working. SAFE response was slow or degraded. | DNS issue at datacentre |

2021

| Status | Type | Start | End | Scope | User Impact | Reason |
|---|---|---|---|---|---|---|
| Resolved | Issue | 2021-11-05 08:00 | 2021-11-05 11:30 | Cirrus: CPU compute nodes | A small number of CPU compute nodes were unavailable to run jobs | A power incident in the Edinburgh area caused a number of CPU compute nodes to lose power. |
| Resolved | Issue | 2021-10-28 11:30 | 2021-10-28 16:00 | Cirrus | Some external connections are unavailable, such as git pulls and pushes, and licence servers | Network changes at the ACF have caused issues. The University network team have resolved the issues. |

2022

| Status | Type | Start | End | Scope | User Impact | Reason |
|---|---|---|---|---|---|---|
| Resolved | Issue | 2021-07-01 09:00 | 2022-09-13 | Object Store (WoS) | The WoS has been removed from service. | We are working with the hardware vendor to restore the WoS to service again. |