## 2024

| Status | Type | Start | End | Scope | User Impact | Reason |
|--------|------|-------|-----|-------|-------------|--------|
| Resolved | Service Alert | 2024-04-17 10:30 | 2024-04-18 09:00 | RPOOL, solid state file system | There should be no user impact due to the resilience of the system | Swapping and commissioning of one of the disks within the solid state file system |
| Resolved | Service Alert | 2024-04-15 10:00 | 2024-04-15 10:30 | DNS server, impacting Cirrus and Cirrus SAFE | Users can still connect to the service but may be unable to access external websites (e.g. GitLab) | Migration of the server in preparation for the wider power work affecting the site the following week |
| Resolved | Service Alert | 2024-04-11 10:00 | 2024-04-11 10:40 | Cirrus ticketing server | There may be a delay in processing new user requests via SAFE | Migration of the Rundeck ticketing system |
| Resolved | Service Alert | 2024-03-24 23:00 | 2024-03-25 11:00 | CPU compute nodes | Around half of the CPU compute nodes on Cirrus are unavailable. Jobs running on affected compute nodes will have failed. | A power incident caused CPU compute nodes to fail |
| Resolved | Service Alert | 2024-03-21 10:15 | 2024-03-21 17:30 | /scratch | /scratch unavailable, but login and compute nodes are restored to service. Work is running once more and new jobs can be submitted. | The system has been restarted, but without /scratch |
| Resolved | Service Alert | 2024-03-04 10:00 | 2024-03-06 12:10 | /scratch (solid state storage) | /scratch (solid state storage) is unavailable from compute and login nodes | A disk has failed within the solid state storage appliance |
| Resolved | Service Alert | 2024-02-02 10:15 | 2024-02-05 14:00 | All jobs | The failed cooling system has been fixed and the system returned to service. During the outage no jobs were able to run; jobs running at the time should be automatically refunded. Users could still access the login nodes, access their data and submit jobs, but the short QoS did not work and queued work remained queued until the compute nodes were returned. | The cooling system failed and all compute nodes (CPU and GPU) were shut down. The hardware vendor came onsite on the Monday to repair Cirrus. |
| Resolved | Service Alert | 2024-01-29 14:00 | 2024-02-01 10:00 | /scratch (solid state storage) | /scratch (solid state storage) is unavailable from compute and login nodes | Issue with the file system under investigation |
| Resolved | Service Alert | 2024-01-30 | 2024-03-12 | SAFE accounts | Users may notice duplicate accounts within SAFE: username@cirrus will be replicated and a username@eidf will also appear | The Cirrus service is transitioning from LDAP to IPA |

## 2023

| Status | Type | Start | End | Scope | User Impact | Reason |
|--------|------|-------|-----|-------|-------------|--------|
| Resolved | At-Risk | 2023-12-21 09:00 | 2023-12-21 12:00 | Slurm | Submitting new jobs to the scheduler, starting of jobs and querying of jobs will be unavailable for the duration of the work. Running jobs will continue without interruption. | Update to Slurm |
| Resolved | At-Risk | 2023-11-29 09:00 | 2023-11-29 17:00 | Slurm | Submitting new jobs to the scheduler, starting of jobs and querying of jobs will be unavailable for the duration of the work. Running jobs will continue without interruption. | Update to Slurm |
| Resolved | Service Alert | 2023-11-15 09:00 | 2023-11-15 09:30 | Cirrus | The process to connect to Cirrus will change from SSH key and password to SSH key and TOTP code | Implementing MFA on Cirrus |
| Resolved | Service Alert | 2023-10-27 09:30 | 2023-10-27 | /work file system | Cannot create new files on /work | The /work file system has reached its maximum number of inodes (i.e. number of files). Users have been asked to reduce their number of files by tarring up directories and deleting or moving data that is no longer required on Cirrus (see the example after this table). |
| Resolved | Service Alert | 2023-08-07 | 2023-08-08 14:10 | /scratch solid state storage | /scratch unavailable | Update 14:00 BST, 8 August: system restored and work running once more.<br>Update 11:50 BST: following the reboot, access to login nodes has been restored. Compute nodes are currently rebooting and work will restart once this has completed.<br>Full system reboot taking place starting at 11:30 BST, Tuesday 8 August.<br>Our systems team are working with the vendor to understand and try to resolve the issue. |
| Resolved | Service Alert | 2023-07-08 10:30 BST | 2023-07-11 18:30 BST | /scratch solid state storage | /scratch unavailable | Update Tuesday 11th, 18:30 BST: /scratch has been restored to service.<br>Update Monday 10th, 18:00 BST: the systems team have been working with HPE to understand and resolve the issue. Running work is continuing, but new work is blocked from starting. Investigations will continue overnight and we will post further updates tomorrow.<br>The systems team are investigating. |
| Resolved | Service Alert | 2023-06-18 15:30 BST | 2023-06-19 11:15 BST | Compute nodes | All compute nodes unavailable | A power event at the site that hosts Cirrus caused cooling issues that required the Cirrus compute nodes to be taken offline |
| Resolved | Service Alert | 2023-06-03 21:10 BST | 2023-06-05 13:30 BST | CPU compute nodes | The majority of CPU compute nodes are now returned to service (8 nodes remain out of service) | A power event at the site that hosts Cirrus made most of the CPU compute nodes unavailable |
| Resolved | Service Alert | 2023-05-30 12:30 BST | 2023-06-02 10:00 BST | Cirrus /home file system | An issue on the network impacts the Ceph file system underpinning /home on Cirrus. /home is currently unavailable to users. | Update 2023-06-01 18:00: progress has been made in understanding the complex issues, and the vendor and our systems team believe that we will be able to restore a stable network tomorrow. Once we have validation of a stable network we will restore the Cirrus /home file system. |
| Resolved | Service Alert | 2023-05-26 10:00 BST | 2023-06-01 15:00 BST | Slurm scheduler | Slurm commands (e.g. `squeue`, `sbatch`) may not work | Update 2023-06-01 15:00: the validation tests and configuration are now complete.<br>Update 2023-05-31 17:30 BST: the replacement for the faulty hardware component on a Cirrus administration node has arrived on site and has been swapped successfully. Our team now need to perform validation tests, remove the workaround and complete the configuration. The service continues to run at risk until the work has been completed. |
| Resolved | Service Alert | 2023-05-21 19:00 BST | 2023-05-25 10:30 BST | /scratch solid state storage | /scratch is now available on login and compute nodes | HPE were investigating issues with the stability of the /scratch file system |
| Resolved | Service Alert | 2023-05-21 19:00 BST | 2023-05-22 13:30 BST | Login, compute, file systems | Users will not be able to connect to Cirrus; all running jobs will have failed | Power and cooling issues at the ACF data centre |
| Resolved | Service Alert | 2023-05-19 14:00 BST | 2023-06-12 12:00 BST | Arm Forge (MAP/DDT) tools | Arm Forge software is not available on Cirrus | The licence server is hosted by ARCHER2, which is down for a multi-week upgrade |
| Resolved | Service Alert | 2023-03-13 21:00 GMT | 2023-03-14 15:25 GMT | /work (Lustre) parallel file system | Users may see a lack of responsiveness on the Cirrus login nodes and reduced I/O performance | Heavy load on the Lustre file system is causing contention for shared resources |
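
The 2023-10-27 inode alert above asked users to tar up directories. As a minimal sketch of that kind of clean-up, assuming a hypothetical directory name `old_results` (only the owner can judge which directories are safe to archive):

```bash
# Count the files (inodes) under a directory before deciding what to archive.
find old_results -type f | wc -l

# Pack the directory into a single archive, then verify the archive reads back
# cleanly before removing the originals: many files become one, freeing inodes.
tar -czf old_results.tar.gz old_results
tar -tzf old_results.tar.gz > /dev/null && rm -rf old_results
```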

## 2022

| Status | Type | Start | End | Scope | User Impact | Reason |
|--------|------|-------|-----|-------|-------------|--------|
| Resolved | Service Alert | 2022-11-21 09:00 GMT | 2022-11-21 11:00 GMT | Login nodes | The Cirrus login nodes are currently unavailable to users | The Ceph home file system has issues due to a failure at the data centre |
| Resolved | Service Alert | 2022-11-13 10:00 GMT | 2022-11-14 09:00 GMT | SAFE website | Users will get a security warning when trying to access the SAFE website; some web browsers (e.g. Chrome) will not connect to the SAFE website; the Cirrus load plot on the status page will not work | The website certificate has expired (see the example after this table) |
| Resolved | Service Alert | 2022-10-24 08:00 BST | 2022-10-24 09:30 BST | Login nodes | Login access to Cirrus is currently unavailable | Login nodes became unresponsive |
| Resolved | Service Alert | 2022-09-12 07:00 BST | 2022-09-14 14:30 BST | Login nodes, compute nodes | Access to solid state storage (/scratch) is now available from login nodes and from CPU and GPU compute nodes. Software modules are now available and are loaded from Lustre (/work). | The RPOOL solid state storage has an error on one of the NVMe devices |
| Resolved | Partial | 2022-08-24 09:00 | 2022-08-24 13:00 | Login nodes, data access and SAFE | Up to 4 hours loss of connection to Cirrus login nodes, data and SAFE access. Running jobs will not be affected and new jobs will start. | Essential updates to the network configuration. Users will be notified when service is resumed. |
| Resolved | At-Risk | 2022-07-21 09:00 | 2022-07-21 10:50 | Slurm control and accounting services | Submitting new jobs to the scheduler, starting of jobs and querying of jobs will be unavailable for the duration of the work. Running jobs will continue without interruption.<br>Work was completed, ahead of schedule, at 10:50. | Update to Slurm |
| Resolved | At-Risk | 2022-07-06 09:00 | 2022-07-06 11:25 | RPOOL /scratch file system | No user impact expected | RAID check to be performed on a disk reporting issues |
| Resolved | Full | 2022-06-23 15:00 | 2022-06-24 15:00 | Full Cirrus service: login nodes, compute nodes (CPU and GPU), all file systems | Users will not be able to connect to the service, no jobs will run and the file systems are unavailable | Issues with the /scratch file system following a successful RAID card replacement |
| Resolved | Issue | 2022-06-22 09:00 | 2022-06-23 15:00 | Rack containing 280 CPU nodes | The RAID card has been replaced successfully. There are a few issues on the /scratch file system which are not related to the RAID card, so further testing will be performed before the full service is resumed. A new alert has been created for this. | Replacement of a RAID card in the main CPU rack |
| Resolved | At-Risk | 2022-06-08 09:00 | 2022-06-08 17:00 | Servers which provide the /scratch file system | No user impact is expected | Two configuration changes are required to resolve reliability and integrity issues |
| Resolved | Issue | 2022-05-18 09:00 | 2022-05-18 12:00 | Cirrus compute and login nodes, including the /scratch servers | Users will not be able to connect to Cirrus and no jobs will run. The system has been drained ahead of this maintenance; once this is confirmed, a user mailing will be sent out. | A configuration setting change is required for /scratch to resolve the issues which were impacting users' work last week |
| Resolved | Issue | 2022-05-09 09:00 | 2022-05-09 17:00 | Cirrus /scratch servers | All user activity was affected, as modules could not be loaded. A partial resolution in the afternoon allowed users to access Cirrus, but modules still could not be loaded and new jobs would not run correctly. | Under investigation by systems and HPE |
| Resolved | Issue | 2022-04-21 10:30 | 2022-04-21 11:30 | /home, which also impacts the module environment | Users are unable to access their data within /home and cannot load the module environment | Issues with the E1000 server |
| Resolved | Issue | 2022-03-28 16:50 | 2022-03-28 16:53 | Cirrus | Users were logged out of login node 1. Users can reconnect and will be automatically directed to the other login nodes. | Login node 1 shut down due to operator error |
| Resolved | Issue | 2022-01-09 10:00 | 2022-01-10 10:30 | Cirrus, SAFE | Outgoing network access from Cirrus to external sites was not working. SAFE response was slow or degraded. | DNS issue at the data centre |
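
For the 2022-11-13 certificate alert above, one way a user could confirm an expired certificate from the command line; the SAFE hostname shown is an assumption, so substitute the address you actually use:

```bash
# Fetch the server certificate and print its validity window; an expired
# "notAfter" date is what triggers browser security warnings.
echo | openssl s_client -connect safe.epcc.ed.ac.uk:443 \
    -servername safe.epcc.ed.ac.uk 2>/dev/null | openssl x509 -noout -dates
```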

## 2021

| Status | Type | Start | End | Scope | User Impact | Reason |
|--------|------|-------|-----|-------|-------------|--------|
| Resolved | Issue | 2021-11-05 08:00 | 2021-11-05 11:30 | Cirrus: CPU compute nodes | A small number of CPU compute nodes were unavailable to run jobs | A power incident in the Edinburgh area caused a number of CPU compute nodes to lose power |
| Resolved | Issue | 2021-10-28 11:30 | 2021-10-28 16:00 | Cirrus | Some external connections, such as git pulls and pushes and licence servers, were unavailable | Network changes at the ACF caused issues. The University network team have resolved the issues. |
| Resolved | Issue | 2021-07-01 09:00 | 2022-09-13 | Object Store (WoS) | The WoS has been removed from service | We are working with the hardware vendor to restore the WoS to service again |