2025
Status | Type | Start | End | Scope | User Impact | Reason |
---|---|---|---|---|---|---|
Resolved | At-Risk | 2025-01-08 09:00 | 2025-01-08 17:00 | Scratch (RPOOL) Solid state file system | Scratch (RPOOL) solid state file system will be unavailable. | Replacement of InfiniBand card by vendor |
Resolved | Service Alert | 2025-03-12 08:53 | 2025-03-12 10:00 | Slurm controller | Users can connect to the login node but jobs will not start on the compute nodes. Users will not be able to issue Slurm commands. | Issues with the Slurm controller have been observed; the systems team are investigating. |
Resolved | Service Alert | 2025-02-26 09:30 | 2025-02-26 12:00 | A group of nodes on Cirrus developed a technical fault. | Work has been prevented from starting on the affected nodes. Work already running on these nodes may fail but should be uncharged. | Our systems team have identified a technical fault with some Cirrus nodes. These nodes have now been restored. |
Resolved | Service Alert | 2025-02-13 12:30 | 2025-02-25 09:00 | Solid state (/scratch) RPOOL file system | Any jobs using /scratch file system will fail | /scratch file system is 100% full |
Resolved | Service Alert | 2025-01-27 12:00 | 2025-01-29 12:00 | Cirrus service | Service at higher risk of disruption than usual | Work to repair the power grid following a storm in the Edinburgh area means that power issues are more likely during this period. |
Resolved | Service Alert | 2025-01-24 10:00 | 2025-01-24 17:00 | Whole Cirrus service | Service at higher risk of disruption than usual. If issues arise, service may take longer to restore. | A red weather warning for high winds in the Edinburgh area led to travel restrictions and a higher than usual risk of power/building damage issues. |
2024
Status | Type | Start | End | Scope | User Impact | Reason |
---|---|---|---|---|---|---|
Resolved | Service Alert | 2024-12-11 09:00 | 2024-12-11 16:00 | work file system | We do not expect any impact to users. | The hardware vendor is working on the E1000, where the work file system is hosted |
Resolved | Service Alert | 2024-11-29 11:30 | 2024-11-29 18:00 | Access to /work. Slurm. | Users unable to access their data within /work and cannot load the module environment. No new jobs starting. Running jobs may fail. If the login shell hangs, use "Ctrl+C" to get a shell and access data on /home. | Hardware failure in the E1000 Lustre storage |
Resolved | Issue | 2024-11-27 13:00 | 2024-11-27 18:00 | Issues with /work which is also impacting the module environment | Users unable to access their data within /work and cannot load module environment. | Issues with the E1000 server. |
Resolved | Service Alert | 2024-11-21 11:40 GMT | 2024-11-21 12:00 GMT | Access to Cirrus from outside University of Edinburgh | No users will be able to access Cirrus from outside the University of Edinburgh network. Running/queued jobs are unaffected. | Loss of power to part of the communication network |
Resolved | Service Alert | 2024-11-21 11:40 GMT | 2024-11-21 15:25 | Cirrus compute nodes | Cirrus compute nodes are down. Login nodes and the /home and /work file systems are available. | Power incident at the ACF |
Resolved | Service Alert | 2024-10-28 11:15 GMT | 2024-10-28 12:15 GMT | EPCC SAFE | SAFE DNS name will be migrated - SAFE and TOTP availability at-risk | Migration to new SAFE Containers |
Resolved | Service Alert | 2024-10-03 16:00 | 2024-10-04 19:20 | Slurm | Slurm commands on the login nodes fail. New work cannot be started; already running work will continue but will be held until Slurm is functional again. Logins and filesystems remain accessible (see the example after this table). | System restored to normal function. |
Resolved | At-Risk | 2024-09-10 09:00 | 2024-09-11 17:00 | Scratch (RPOOL) Solid state file system | Scratch (RPOOL) solid state file system will be unavailable. | Update to firmware |
Resolved | Service Alert | 2024-09-12 17:15 | 2024-09-13 10:30 | RPOOL /scratch filesystem | /scratch is unavailable | Cause is being investigated with HPE |
Resolved | At-Risk | 2024-09-11 12:15 | 2024-09-11 14:40 | Slurm scheduler | All jobs may have stopped running. New jobs would not start. | Returning the scratch file system to service caused issues with the Slurm scheduler |
Resolved | At-Risk | 2024-08-14 10:00 | 2024-08-14 11:00 | Slurm | No user impact expected. Small risk that the Slurm daemon may be unavailable briefly, but if you experience any issue, please wait a few minutes and resubmit. Running jobs will continue without interruption. | Update to Slurm to reduce the maximum memory available to jobs. |
Resolved | At-Risk | 2024-08-28 11:00 | 2024-08-28 12:00 | E1000 work file system | We do not expect any user impact but if there is an issue it will be a short connectivity outage | Changing power supply for the JANET CIENA unit |
Resolved | At-Risk | 2024-08-27 12:00 | 2024-08-27 15:00 | Ceph home file system | We do not expect any user impact but the file system may appear slower than usual | Rebalancing the Ceph file system |
Resolved | At-Risk | 2024-08-21 11:00 | 2024-08-21 12:00 | E1000 work file system | We do not expect any user impact but if there is an issue it will be a short connectivity outage | Changing power supply for the JANET CIENA unit |
Resolved | Service Alert | 2024-07-28 20:55 | 2024-07-29 08:44 | Cirrus login access | Cirrus login unavailable | Login nodes exhausted memory |
Resolved | Service Alert | 2024-06-10 10:50 | 2024-06-11 14:00 | Solid State file system, RPOOL | Scratch file system unavailable. | Unknown - systems teams are investigating |
Resolved | Service Alert | 2024-05-27 03:09 | 2024-05-27 14:45 | Cirrus compute nodes | Compute nodes are unavailable, any jobs running at the time of the power incident will have failed | Power incident on UK national grid in the Edinburgh area resulted in loss of power to Cirrus compute nodes |
Resolved | Service Alert | 2024-05-22 11:30 | 2024-05-22 14:15 | Access to License server | The license server is inaccessible; our team are working to restore access | |
Resolved | Service Alert | 2024-05-02 10:50 | 2024-05-03 10:00 | Solid State file system, RPOOL | Scratch file system unavailable. | Failed disk, systems are investigating. |
Resolved | Service Alert | 2024-04-30 10:00 | 2024-04-30 10:00 | Cirrus nodes | We do not expect any user impact | Configuration change to enable a specific workflow for a project |
Resolved | Service Alert | 2024-04-30 08:35 | 2024-04-30 10:00 | Cirrus compute nodes and scratch (solid state RPOOL) file system | Jobs may have failed. Please contact the service desk if you think you require a refund | A switch which is connected to the slurm controller is down, this is causing lots of hangs on all nodes. Systems team are investigating but it may mean that all running jobs have failed. |
Resolved | Service Alert | 2024-04-22 07:00 | 2024-04-26 17:00 | Full Cirrus service - No service available | Users will be unable to connect to Cirrus and will not be able to access their data. Jobs will not run. | Work at the Advanced Computing Facility (ACF) where Cirrus is located is taking place and the power will be removed from Cirrus. |
Resolved | Service Alert | 2024-04-17 10:30 | 2024-04-18 09:00 | RPOOL, Solid state file system | There should be no user impact due to the resilience of the system | Swapping and commissioning of one of the disks within the solid state file system |
Resolved | Service Alert | 2024-04-15 10:00 | 2024-04-15 10:30 | Outage to DNS server which will impact Cirrus and Cirrus SAFE | Users can still connect to the service but may be unable to access external websites (e.g. GitLab) | Migration of the server in preparation for the wider power work affecting the site the following week |
Resolved | Service Alert | 2024-04-11 10:00 | 2024-04-11 10:40 | Cirrus ticketing server | There may be a delay in processing new user requests via SAFE | Migration of the Rundeck ticketing system |
Resolved | Service Alert | 2024-03-24 23:00 | 2024-03-25 11:00 | CPU compute nodes | Around half of the CPU compute nodes on Cirrus are unavailable. Jobs running on affected compute nodes will have failed. | A power incident caused CPU compute nodes to fail. |
Resolved | Service Alert | 2024-03-21 10:15 | 2024-03-21 17:30 | /scratch file system | /scratch is unavailable, but login and compute nodes are now restored to service. Work is running once more and new jobs can be submitted. | The system has been restarted but without /scratch. |
Resolved | Service Alert | 2024-03-04 10:00 | 2024-03-06 12:10 | /scratch (solid state storage) | /scratch (solid state storage) is unavailable from compute and login nodes | A disk has failed within the solid state storage appliance |
Resolved | Service Alert | 2024-02-02 10:15 | 2024-02-05 14:00 | All jobs | All compute nodes (CPU and GPU) were shut down, so no jobs were able to run; jobs running at the time should be automatically refunded. Users could still access the login nodes, access their data and submit jobs, although the short QoS did not work and queued work remained queued until the compute nodes were returned. | The cooling system failed. The hardware vendor came onsite to repair Cirrus and the system has been returned to service. |
Resolved | Service Alert | 2024-01-29 14:00 | 2024-02-01 10:00 | /scratch (solid state storage) | /scratch (solid state storage) is unavailable from compute and login nodes | Issue with file system under investigation |
Resolved | Service Alert | 2024-01-30 | 2024-03-12 | New accounts will appear in SAFE | Users may notice that they have duplicate accounts within SAFE: username@cirrus will be replicated and a username@eidf will also appear | The Cirrus service is transitioning from using LDAP to IPA |
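
Several of the alerts above concern the Slurm controller being unreachable while the login nodes and file systems stayed up. As a minimal, unofficial sketch (standard Slurm commands, not Cirrus-specific guidance), a user in that situation can check whether the scheduler is responding again before resubmitting work:

```bash
# Ask the Slurm controller (slurmctld) whether it is responding; this
# reports the controller as DOWN (or errors) while it is unavailable
# and as UP once service is restored.
scontrol ping

# Once the controller answers, confirm that partitions are accepting
# work and review your own queued/running jobs before resubmitting.
sinfo
squeue -u "$USER"
```
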
2023
Status | Type | Start | End | Scope | User Impact | Reason |
---|---|---|---|---|---|---|
Resolved | At-Risk | 2023-12-21 09:00 | 2023-12-21 12:00 | Slurm | Submitting new jobs to the scheduler, starting of jobs and querying of jobs will be unavailable for the duration of the work. Running jobs will continue without interruption. | Update to Slurm |
Resolved | At-Risk | 2023-11-29 09:00 | 2023-11-29 17:00 | Slurm | Submitting new jobs to the scheduler, starting of jobs and querying of jobs will be unavailable for the duration of the work. Running jobs will continue without interruption. | Update to Slurm |
Resolved | Service alert | 2023-11-15 09:00 | 2023-11-15 09:30 | Cirrus | The process to connect to Cirrus will change from SSH key and password to SSH key and TOTP factor | Implementing MFA on Cirrus |
Resolved | Service alert | 2023-10-27 09:30 | 2023-10-27 | /work file system | Cannot create new files on /work | The /work file system has reached the maximum number of inodes (number of files). Users have been asked to reduce their number of files by tarring up directories and by deleting or moving data that is no longer required on Cirrus (see the example after this table). |
Resolved | Service alert | 2023-08-07 | 2023-08-08 14:10 | /scratch solid state storage | /scratch unavailable | Update 14:00 BST 8th August: System restored and work running once more. Update 11:50 BST: Following the reboot, access to login nodes has now been restored. Compute nodes are currently rebooting and work will restart once this has completed. Full system reboot taking place starting at 11:30 BST Tues 8th August. Our systems team are working with the vendor to understand and try to resolve the issue. |
Resolved | Service alert | 2023-07-08 10:30 BST | 2023-07-11 18:30 BST | /scratch solid state storage | /scratch unavailable | Update Tuesday 11th 18:30 BST: /scratch has been restored to service. Update Monday 10th 18:00 BST: Systems team have been working with HPE to understand and resolve the issue. Running work is continuing, but new work is blocked from starting. Investigations will continue overnight and we will post further updates tomorrow. Systems team are investigating. |
Resolved | Service alert | 2023-06-18 15:30 BST | 2023-06-19 11:15 BST | Compute nodes | All compute nodes unavailable | Due to a power event at the site that hosts Cirrus there are cooling issues that required the Cirrus compute nodes to be taken offline. |
Resolved | Service alert | 2023-06-03 21:10 BST | 2023-06-05 13:30 BST | CPU compute nodes | The majority of CPU compute nodes are now returned to service (8 nodes remain out of service) | Due to a power event at the site that hosts Cirrus, most of the CPU compute nodes were unavailable. |
Resolved | Service alert | 2023-05-30 12:30 BST | 2023-06-02 10:00 BST | Cirrus /home filesystem | Issue on the network which impacts the Ceph filesystem underpinning /home on Cirrus. /home is currently unavailable to users. | Update 2023-06-01 18:00: Progress has been made understanding the complex issues, and the vendor and our systems team believe that we will be able to restore a stable network tomorrow. Once we have validation of a stable network we will restore the Cirrus /home filesystem. |
Resolved | Service alert | 2023-05-26 10:00 BST | 2023-06-01 15:00 BST | Slurm scheduler | Slurm commands (e.g. `squeue`, `sbatch`) may not work. | Update 2023-06-01 15:00: The validation tests and configuration are now complete. Update 2023-05-31 17:30 BST: The replacement for the faulty hardware component on a Cirrus administration node has now arrived on site and has been swapped successfully. Our team now need to perform validation tests, remove the workaround and complete the configuration. The service continues to run at risk until the work has been completed. |
Resolved | Service alert | 2023-05-21 19:00 BST | 2023-05-25 10:30 BST | /scratch solid state storage | /scratch is now available on login and compute nodes. | HPE were investigating issues with the stability of the /scratch file system |
Resolved | Service alert | 2023-05-21 19:00 BST | 2023-05-22 13:30 BST | Login, compute, file systems | Users will not be able to connect to Cirrus, all running jobs will have failed | Power and cooling issues at the ACF data centre |
Resolved | Service alert | 2023-05-19 14:00 BST | 2023-06-12 12:00 BST | Arm Forge (MAP/DDT) tools | Arm Forge software is not available on Cirrus | Licence server is hosted by ARCHER2 which is down for multiple weeks for upgrade |
Resolved | Service alert | 2023-03-13 21:00 GMT | 2023-03-14 15:25 GMT | Work (Lustre) parallel file system | Users may see lack of responsiveness on Cirrus login nodes and reduced IO performance | Heavy load on the Lustre file system is causing contention for shared resources |
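
The 2023-10-27 alert above concerns inode (file-count) exhaustion on /work. As a minimal sketch of the kind of clean-up that was requested, assuming a Lustre-backed /work and a placeholder directory name `old_results` (an illustration, not an official procedure):

```bash
# Show block and inode (file-count) usage for your user on /work.
lfs quota -u "$USER" -h /work

# Count the files under a directory before deciding whether to archive it.
find old_results/ -type f | wc -l

# Pack many small files into a single archive (one inode instead of many),
# verify the archive can be read back, then remove the originals.
tar -czf old_results.tar.gz old_results/
tar -tzf old_results.tar.gz > /dev/null && rm -rf old_results/
```
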
2022
Status | Type | Start | End | Scope | User Impact | Reason |
---|---|---|---|---|---|---|
Resolved | Service alert | 2022-11-21 09:00 GMT | 2022-11-21 11:00 GMT | Login nodes | The Cirrus login nodes are currently unavailable to users | The Ceph home file system has issues due to a failure at the data centre |
Resolved | Service alert | 2022-11-13 10:00 GMT | 2022-11-14 09:00 GMT | SAFE website | Users will get a security warning when trying to access SAFE website; some web browsers (e.g. Chrome) will not connect to SAFE website; Cirrus load plot on status page will not work | The website certificate has expired |
Resolved | Service alert | 2022-10-24 08:00 BST | 2022-10-24 09:30 BST | Login nodes | Login access to Cirrus is currently unavailable | Login nodes became unresponsive |
Resolved | Service alert | 2022-09-12 07:00 BST | 2022-09-14 14:30 BST | Login nodes, compute nodes | Access to solid state storage (/scratch) is now available from login nodes, CPU nodes and GPU nodes. Software modules are now available and are loaded from Lustre (/work). | The RPOOL solid state storage has an error on one of the NVMe devices |
Resolved | Partial | 2022-08-24 09:00 | 2022-08-24 13:00 | Login nodes, data access and SAFE | No access to login nodes, data or SAFE for up to 4 hours. Running jobs will not be affected and new jobs will start. | Essential updates to the network configuration. Users will be notified when service is resumed. |
Resolved | At-Risk | 2022-07-21 09:00 | 2022-07-21 10:50 | Slurm control and accounting services | Submitting new jobs to the scheduler, starting of jobs and querying of jobs will be unavailable for the duration of the work. Running jobs will continue without interruption. Work was completed, ahead of schedule, at 10:50. | Update to Slurm |
Resolved | At-Risk | 2022-07-06 09:00 | 2022-07-06 11:25 | RPOOL /scratch filesystem | No user impact expected | RAID Check to be performed on a disk reporting issues |
Resolved | Full | 2022-06-23 15:00 | 2022-06-24 15:00 | Full Cirrus Service - login nodes, compute node (CPU and GPU), all filesystems | Users will not be able to connect to the service, no jobs will run and filesystems are unavailable. | Issues with the scratch filesystem following a successful RAID card replacement |
Resolved | At-Risk | 2022-06-08 09:00 | 2022-06-08 17:00 | Servers which provide the /scratch filesystem | No user impact is expected. | Two configuration changes are required to resolve reliability and integrity issues |
Resolved | Issue | 2022-05-18 09:00 | 2022-05-18 12:00 | Cirrus compute and login nodes including the /scratch servers | Users will not be able to connect to Cirrus and no jobs will run. The system has been drained ahead of this maintenance. Once this is confirmed a user mailing will be sent out. | A configuration setting change is required for /scratch to resolve the issues which were impacting users' work last week. |
Resolved | Issue | 2022-05-09 09:00 | 2022-05-09 17:00 | Cirrus /scratch servers | All user activity was affected as modules could not be loaded. A partial resolution in the afternoon allowed users to access Cirrus, but modules still could not be loaded and new jobs would not run correctly. | Under investigation by systems and HPE. |
Resolved | Issue | 2022-04-21 10:30 | 2022-04-21 11:30 | Issues with /home which is also impacting the module environment | Users unable to access their data within /home and cannot load module environment. | Issues with the E1000 server. |
Resolved | Issue | 2022-06-22 09:00 | 2022-06-23 15:00 | Rack containing 280 CPU Nodes | The RAID Card has been replaced successfully. There are a few issues on the /scratch filesystem which are not related to the RAID card so further testing will be performed before the full service is resumed. A new alert has been created for this. | Replacement of RAID Card in the main CPU rack. |
Resolved | Issue | 2022-03-28 16:50 | 2022-03-28 16:53 | Cirrus | Users logged out of login node 1. | Login node 1 shut down due to operator error. Users can reconnect and will be automatically directed to other login nodes. |
Resolved | Issue | 2022-01-09 10:00 | 2022-01-10 10:30 | Cirrus, SAFE | Outgoing network access from Cirrus to external sites was not working. SAFE response was slow or degraded. | DNS issue at datacentre |
2021
Status | Type | Start | End | Scope | User Impact | Reason |
---|---|---|---|---|---|---|
Resolved | Issue | 2021-11-05 08:00 | 2021-11-05 11:30 | Cirrus: CPU compute nodes | A small number of CPU compute nodes were unavailable to run jobs | A power incident in the Edinburgh area caused a number of CPU compute nodes to lose power. |
Resolved | Issue | 2021-10-28 11:30 | 2021-10-28 16:00 | Cirrus | Some external connections are unavailable such as git pull, pushes, licence servers | Network changes at the ACF have caused issues. The University network team have resolved the issues. |
Resolved | Issue | 2021-07-01 09:00 | 2022-09-13 | Object Store (WoS) | The WoS has been removed from service. | We are working with the hardware vendor to restore the WoS to service again. |