2025
Status | Type | Start | End | Scope | User Impact | Reason |
---|---|---|---|---|---|---|
Resolved | At-Risk | 2025-01-08 09:00 | 2025-01-08 17:00 | Scratch (RPOOL) Solid state file system | Scratch (RPOOL) solid state file system will be unavailable. | Replacement of InfiniBand card by vendor |
Resolved | Service Alert | 2025-03-12 08:53 | 2025-03-12 10:00 | Slurm controller | Users can connect to the login node but jobs will not start on the compute nodes. Users will not be able to issue Slurm commands. | Issues with the Slurm controller have been observed; the systems team are investigating. |
Resolved | Service Alert | 2025-02-26 09:30 | 2025-02-26 12:00 | A group of nodes on Cirrus developed a technical fault. | Work has been prevented from starting on the affected nodes. Work already running on these nodes may fail but should be uncharged. | Our systems team have identified a technical fault with some Cirrus nodes. These nodes have now been restored. |
Resolved | Service Alert | 2025-02-13 12:30 | 2025-02-25 09:00 | Solid state (/scratch) RPOOL file system | Any jobs using /scratch file system will fail | /scratch file system is 100% full |
Resolved | Service Alert | 2025-01-27 12:00 | 2025-01-29 12:00 | Cirrus service | Service at higher risk of disruption than usual | Work to repair the power grid following a storm in the Edinburgh area means that power issues are more likely during this period. |
Resolved | Service Alert | 2025-01-24 10:00 | 2025-01-24 17:00 | Whole Cirrus service | Service at higher risk of disruption than usual. If issues arise, service may take longer to restore. | A red weather warning for high winds in the Edinburgh area led to travel restrictions and a higher than usual risk of power/building damage issues. |
2024
Status | Type | Start | End | Scope | User Impact | Reason |
---|---|---|---|---|---|---|
Resolved | Service Alert | 2024-12-11 09:00 | 2024-12-11 16:00 | work file system | We do not expect any impact to users. | The hardware vendor is working on the E1000, where the work file system is hosted |
Resolved | Service Alert | 2024-11-29 11:30 | 2024-11-29 18:00 | Access to /work. Slurm. | Users unable to access their data within /work and cannot load the module environment. No new jobs starting. Running jobs may fail. If the login shell hangs, use "Ctrl+C" to get a shell and access data on /home. | Hardware failure in the E1000 Lustre storage |
Resolved | Issue | 2024-11-27 13:00 | 2024-11-27 18:00 | Issues with /work which is also impacting the module environment | Users unable to access their data within /work and cannot load module environment. | Issues with the E1000 server. |
Resolved | Service Alert | 2024-11-21 11:40 GMT | 2024-11-21 12:00 GMT | Access to Cirrus from outside University of Edinburgh | No users will be able to access Cirrus from outside the University of Edinburgh network. Running/queued jobs are unaffected. | Loss of power to part of the communication network |
Resolved | Service Alert | 2024-11-21 11:40 GMT | 2024-11-21 15:25 | Cirrus compute nodes | Cirrus compute nodes are down. Login nodes and the /home and /work file systems are available. | Power incident at the ACF |
Resolved | Service Alert | 2024-10-28 11:15 GMT | 2024-10-28 12:15 GMT | EPCC SAFE | SAFE DNS name will be migrated - SAFE and TOTP availability at-risk | Migration to new SAFE Containers |
Resolved | Service Alert | 2024-10-03 16:00 | 2024-10-04 19:20 | Slurm | Slurm commands on the login nodes fail. New work cannot be started; already running work will continue but will be held until Slurm is functional again. Logins and filesystems remain accessible (see the example after this table). | System restored to normal function. |
Resolved | At-Risk | 2024-09-10 09:00 | 2024-09-11 17:00 | Scratch (RPOOL) Solid state file system | Scratch (RPOOL) solid state file system will be unavailable. | Update to firmware |
Resolved | Service Alert | 2024-09-12 17:15 | 2024-09-13 10:30 | RPOOL /scratch filesystem | /scratch is unavailable | Cause is being investigated with HPE |
Resolved | At-Risk | 2024-09-11 12:15 | 2024-09-11 14:40 | Slurm scheduler | All jobs may have stopped running. New jobs would not start. | Returning the scratch file system to service caused issues with the Slurm scheduler |
Resolved | At-Risk | 2024-08-14 10:00 | 2024-08-14 11:00 | Slurm | No user impact expected. Small risk that the Slurm daemon may be unavailable briefly, but if you experience any issue, please wait a few minutes and resubmit. Running jobs will continue without interruption. | Update to Slurm to reduce the maximum memory available to jobs. |
Resolved | At-Risk | 2024-08-28 11:00 | 2024-08-28 12:00 | E1000 work file system | We do not expect any user impact but if there is an issue it will be a short connectivity outage | Changing power supply for the JANET CIENA unit |
Resolved | At-Risk | 2024-08-27 12:00 | 2024-08-27 15:00 | Ceph home file system | We do not expect any user impact but the file system may appear slower than usual | Rebalancing the Ceph file system |
Resolved | At-Risk | 2024-08-21 11:00 | 2024-08-21 12:00 | E1000 work file system | We do not expect any user impact but if there is an issue it will be a short connectivity outage | Changing power supply for the JANET CIENA unit |
Resolved | Service Alert | 2024-07-28 20:55 | 2024-07-29 08:44 | Cirrus login access | Cirrus login unavailable | Login nodes exhausted memory |
Resolved | Service Alert | 2024-06-10 10:50 | 2024-06-11 14:00 | Solid State file system, RPOOL | Scratch file system unavailable. | Unknown - systems teams are investigating |
Resolved | Service Alert | 2024-05-27 03:09 | 2024-05-27 14:45 | Cirrus compute nodes | Compute nodes are unavailable, any jobs running at the time of the power incident will have failed | Power incident on UK national grid in the Edinburgh area resulted in loss of power to Cirrus compute nodes |
Resolved | Service Alert | 2024-05-22 11:30 | 2024-05-22 14:15 | Access to License server | The license server is inaccessible; our team are working to restore access | |
Resolved | Service Alert | 2024-05-02 10:50 | 2024-05-03 10:00 | Solid State file system, RPOOL | Scratch file system unavailable. | Failed disk, systems are investigating. |
Resolved | Service Alert | 2024-04-30 10:00 | 2024-04-30 10:00 | Cirrus nodes | We do not expect any user impact | Configuration change to enable a specific workflow for a project |
Resolved | Service Alert | 2024-04-30 08:35 | 2024-04-30 10:00 | Cirrus compute nodes and scratch (solid state RPOOL) file system | Jobs may have failed. Please contact the service desk if you think you require a refund | A switch which is connected to the slurm controller is down, this is causing lots of hangs on all nodes. Systems team are investigating but it may mean that all running jobs have failed. |
Resolved | Service Alert | 2024-04-22 07:00 | 2024-04-26 17:00 | Full Cirrus service - No service available | Users will be unable to connect to Cirrus and will not be able to access their data. Jobs will not run. | Work at the Advanced Computing Facility (ACF) where Cirrus is located is taking place and the power will be removed from Cirrus. |
Resolved | Service Alert | 2024-04-17 10:30 | 2024-04-18 09:00 | RPOOL, Solid state file system | There should be no user impact due to the resilience of the system | Swapping and commissioning of one of the disks within the solid state file system |
Resolved | Service Alert | 2024-04-15 10:00 | 2024-04-15 10:30 | Outage to DNS server which will impact Cirrus and Cirrus SAFE | Users can still connect to the service but may be unable to access external websites (e.g. GitLab) | Migration of the server in preparation for the wider power work affecting the site the following week |
Resolved | Service Alert | 2024-04-11 10:00 | 2024-04-11 10:40 | Cirrus ticketing server | There may be a delay in processing new user requests via SAFE | Migration of the Rundeck ticketing system |
Resolved | Service Alert | 2024-03-24 23:00 | 2024-03-25 11:00 | CPU compute nodes | Around half of the CPU compute nodes on Cirrus are unavailable. Jobs running on affected compute nodes will have failed. | A power incident caused CPU compute nodes to fail. |
Resolved | Service Alert | 2024-03-21 10:15 | 2024-03-21 17:30 | /scratch file system | /scratch is unavailable, but login and compute nodes are now restored to service. Work is running once more and new jobs can be submitted. | The system has been restarted but without /scratch. |
Resolved | Service Alert | 2024-03-04 10:00 | 2024-03-06 12:10 | /scratch (solid state storage) | /scratch (solid state storage) is unavailable from compute and login nodes | A disk has failed within the solid state storage appliance |
Resolved | Service Alert | 2024-02-02 10:15 | 2024-02-05 14:00 | All jobs | All compute nodes (CPU and GPU) were shut down, so no jobs were able to run; jobs running at the time should be automatically refunded. Users could still access the login nodes, access their data and submit jobs, although the short QoS did not work and queued work remained queued until the compute nodes were returned. | The cooling system failed. The hardware vendor came onsite to repair Cirrus and the system has been returned to service. |
Resolved | Service Alert | 2024-01-29 14:00 | 2024-02-01 10:00 | /scratch (solid state storage) | /scratch (solid state storage) is unavailable from compute and login nodes | Issue with file system under investigation |
Resolved | Service Alert | 2024-01-30 | 2024-03-12 | New accounts will appear in SAFE | Users may notice that they have duplicate accounts within SAFE: username@cirrus will be replicated and a username@eidf will also appear | The Cirrus service is transitioning from using LDAP to IPA |
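
Several of the alerts above concern the Slurm controller being unreachable while the login nodes and file systems stayed up. As a minimal, unofficial sketch (standard Slurm commands, not Cirrus-specific guidance), a user in that situation can check whether the scheduler is responding again before resubmitting work:

```bash
# Ask the Slurm controller (slurmctld) whether it is responding; this
# reports the controller as DOWN (or errors) while it is unavailable
# and as UP once service is restored.
scontrol ping

# Once the controller answers, confirm that partitions are accepting
# work and review your own queued/running jobs before resubmitting.
sinfo
squeue -u "$USER"
```
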
2023
Status | Type | Start | End | Scope | User Impact | Reason |
---|---|---|---|---|---|---|
Resolved | At-Risk | 2023-12-21 09:00 | 2023-12-21 12:00 | Slurm | Submitting new jobs to the scheduler, starting of jobs and querying of jobs will be unavailable for the duration of the work. Running jobs will continue without interruption. | Update to Slurm |
Resolved | At-Risk | 2023-11-29 09:00 | 2023-11-29 17:00 | Slurm | Submitting new jobs to the scheduler, starting of jobs and querying of jobs will be unavailable for the duration of the work. Running jobs will continue without interruption. | Update to Slurm |
Resolved | Service alert | 2023-11-15 09:00 | 2023-11-15 09:30 | Cirrus | The process to connect to Cirrus will change from SSH key and password to SSH key and TOTP factor | Implementing MFA on Cirrus |
Resolved | Service alert | 2023-10-27 09:30 | 2023-10-27 | /work file system | Cannot create new files on /work | The /work file system has reached the maximum number of inodes (number of files). Users have been asked to reduce their number of files by tarring up directories and by deleting or moving data that is no longer required on Cirrus (see the example after this table). |
Resolved | Service alert | 2023-08-07 | 2023-08-08 14:10 | /scratch solid state storage | /scratch unavailable | Update 14:00 BST 8th August: System restored and work running once more. Update 11:50 BST: Following the reboot, access to login nodes has now been restored. Compute nodes are currently rebooting and work will restart once this has completed. Full system reboot taking place starting at 11:30 BST Tues 8th August. Our systems team are working with the vendor to understand and try to resolve the issue. |
Resolved | Service alert | 2023-07-08 10:30 BST | 2023-07-11 18:30 BST | /scratch solid state storage | /scratch unavailable | Update Tuesday 11th 18:30 BST: /scratch has been restored to service. Update Monday 10th 18:00 BST: Systems team have been working with HPE to understand and resolve the issue. Running work is continuing, but new work is blocked from starting. Investigations will continue overnight and we will post further updates tomorrow. Systems team are investigating. |
Resolved | Service alert | 2023-06-18 15:30 BST | 2023-06-19 11:15 BST | Compute nodes | All compute nodes unavailable | Due to a power event at the site that hosts Cirrus there are cooling issues that required the Cirrus compute nodes to be taken offline. |
Resolved | Service alert | 2023-06-03 21:10 BST | 2023-06-05 13:30 BST | CPU compute nodes | The majority of CPU compute nodes are now returned to service (8 nodes remain out of service) | Due to a power event at the site that hosts Cirrus, most of the CPU compute nodes were unavailable. |
Resolved | Service alert | 2023-05-30 12:30 BST | 2023-06-02 10:00 BST | Cirrus /home filesystem | Issue on the network which impacts the Ceph filesystem underpinning /home on Cirrus. /home is currently unavailable to users. | Update 2023-06-01 18:00: Progress has been made understanding the complex issues, and the vendor and our systems team believe that we will be able to restore a stable network tomorrow. Once we have validation of a stable network we will restore the Cirrus /home filesystem. |
Resolved | Service alert | 2023-05-26 10:00 BST | 2023-06-01 15:00 BST | Slurm scheduler | Slurm commands (e.g. `squeue`, `sbatch`) may not work. | Update 2023-06-01 15:00: The validation tests and configuration are now complete. Update 2023-05-31 17:30 BST: The replacement for the faulty hardware component on a Cirrus administration node has now arrived on site and has been swapped successfully. Our team now need to perform validation tests, remove the workaround and complete the configuration. The service continues to run at risk until the work has been completed. |
Resolved | Service alert | 2023-05-21 19:00 BST | 2023-05-25 10:30 BST | /scratch solid state storage | /scratch is now available on login and compute nodes. | HPE were investigating issues with the stability of the /scratch file system |
Resolved | Service alert | 2023-05-21 19:00 BST | 2023-05-22 13:30 BST | Login, compute, file systems | Users will not be able to connect to Cirrus, all running jobs will have failed | Power and cooling issues at the ACF data centre |
Resolved | Service alert | 2023-05-19 14:00 BST | 2023-06-12 12:00 BST | Arm Forge (MAP/DDT) tools | Arm Forge software is not available on Cirrus | Licence server is hosted by ARCHER2 which is down for multiple weeks for upgrade |
Resolved | Service alert | 2023-03-13 21:00 GMT | 2023-03-14 15:25 GMT | Work (Lustre) parallel file system | Users may see lack of responsiveness on Cirrus login nodes and reduced IO performance | Heavy load on the Lustre file system is causing contention for shared resources |
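
The 2023-10-27 alert above concerns inode (file-count) exhaustion on /work. As a minimal sketch of the kind of clean-up that was requested, assuming a Lustre-backed /work and a placeholder directory name `old_results` (an illustration, not an official procedure):

```bash
# Show block and inode (file-count) usage for your user on /work.
lfs quota -u "$USER" -h /work

# Count the files under a directory before deciding whether to archive it.
find old_results/ -type f | wc -l

# Pack many small files into a single archive (one inode instead of many),
# verify the archive can be read back, then remove the originals.
tar -czf old_results.tar.gz old_results/
tar -tzf old_results.tar.gz > /dev/null && rm -rf old_results/
```
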
2022
Status | Type | Start | End | Scope | User Impact | Reason |
---|---|---|---|---|---|---|
Resolved | Service alert | 2022-11-21 09:00 GMT | 2022-11-21 11:00 GMT | Login nodes | The Cirrus login nodes are currently unavailable to users | The Ceph home file system has issues due to a failure at the data centre |
Resolved | Service alert | 2022-11-13 10:00 GMT | 2022-11-14 09:00 GMT | SAFE website | Users will get a security warning when trying to access SAFE website; some web browsers (e.g. Chrome) will not connect to SAFE website; Cirrus load plot on status page will not work | The website certificate has expired |
Resolved | Service alert | 2022-10-24 08:00 BST | 2022-10-24 09:30 BST | Login nodes | Login access to Cirrus is currently unavailable | Login nodes became unresponsive |
Resolved | Service alert | 2022-09-12 07:00 BST | 2022-09-14 14:30 BST | Login nodes, compute nodes | Access to solid state storage (/scratch) is now available from login nodes, CPU nodes and GPU nodes. Software modules are now available and are loaded from Lustre (/work). | The RPOOL solid state storage has an error on one of the NVMe devices |
Resolved | Partial | 2022-08-24 09:00 | 2022-08-24 13:00 | Login nodes, data access and SAFE | No access to login nodes, data or SAFE for up to 4 hours. Running jobs will not be affected and new jobs will start. | Essential updates to the network configuration. Users will be notified when service is resumed. |
Resolved | At-Risk | 2022-07-21 09:00 | 2022-07-21 10:50 | Slurm control and accounting services | Submitting new jobs to the scheduler, starting of jobs and querying of jobs will be unavailable for the duration of the work. Running jobs will continue without interruption. Work was completed, ahead of schedule, at 10:50. | Update to Slurm |
Resolved | At-Risk | 2022-07-06 09:00 | 2022-07-06 11:25 | RPOOL /scratch filesystem | No user impact expected | RAID Check to be performed on a disk reporting issues |
Resolved | Full | 2022-06-23 15:00 | 2022-06-24 15:00 | Full Cirrus Service - login nodes, compute node (CPU and GPU), all filesystems | Users will not be able to connect to the service, no jobs will run and filesystems are unavailable. | Issues with the scratch filesystem following a successful RAID card replacement |
Resolved | At-Risk | 2022-06-08 09:00 | 2022-06-08 17:00 | Servers which provide the /scratch filesystem | No user impact is expected. | Two configuration changes are required to resolve reliability and integrity issues |
Resolved | Issue | 2022-05-18 09:00 | 2022-05-18 12:00 | Cirrus compute and login nodes including the /scratch servers | Users will not be able to connect to Cirrus and no jobs will run. The system has been drained ahead of this maintenance. Once this is confirmed a user mailing will be sent out. | A configuration setting change is required for /scratch to resolve the issues which were impacting users' work last week. |
Resolved | Issue | 2022-05-09 09:00 | 2022-05-09 17:00 | Cirrus /scratch servers | All user activity was affected as modules could not be loaded. A partial resolution in the afternoon allowed users to access Cirrus, but modules still could not be loaded and new jobs would not run correctly. | Under investigation by systems and HPE. |
Resolved | Issue | 2022-04-21 10:30 | 2022-04-21 11:30 | Issues with /home which is also impacting the module environment | Users unable to access their data within /home and cannot load module environment. | Issues with the E1000 server. |
Resolved | Issue | 2022-06-22 09:00 | 2022-06-23 15:00 | Rack containing 280 CPU Nodes | The RAID Card has been replaced successfully. There are a few issues on the /scratch filesystem which are not related to the RAID card so further testing will be performed before the full service is resumed. A new alert has been created for this. | Replacement of RAID Card in the main CPU rack. |
Resolved | Issue | 2022-03-28 16:50 | 2022-03-28 16:53 | Cirrus | Users logged out of login node 1. | Login node 1 shut down due to operator error. Users can reconnect and will be automatically directed to other login nodes. |
Resolved | Issue | 2022-01-09 10:00 | 2022-01-10 10:30 | Cirrus, SAFE | Outgoing network access from Cirrus to external sites was not working. SAFE response was slow or degraded. | DNS issue at datacentre |
2021
Status | Type | Start | End | Scope | User Impact | Reason |
---|---|---|---|---|---|---|
Resolved | Issue | 2021-11-05 08:00 | 2021-11-05 11:30 | Cirrus: CPU compute nodes | A small number of CPU compute nodes were unavailable to run jobs | A power incident in the Edinburgh area caused a number of CPU compute nodes to lose power. |
Resolved | Issue | 2021-10-28 11:30 | 2021-10-28 16:00 | Cirrus | Some external connections are unavailable such as git pull, pushes, licence servers | Network changes at the ACF have caused issues. The University network team have resolved the issues. |
Resolved | Issue | 2021-07-01 09:00 | 2022-09-13 | Object Store (WoS) | The WoS has been removed from service. | We are working with the hardware vendor to restore the WoS to service again. |