# Cirrus Service Alert History

## Service Alerts

A list of all service alerts for Cirrus.

| Status | Start | End | Scope | Impact | Reason |
| ------ | ----- | --- | ----- | ------ | ------ |
| Ongoing | 2025-10-16 12:00 | 2025-10-31 18:00 | /work file system | Risk of unexpected I/O performance issues | Commissioning/testing of new Cirrus hardware sharing the same file system |
| Resolved | 2025-10-15 08:00 | 2025-10-16 14:30 | Login nodes | Risk of unexpected issues with new accounts on 15/16 Oct | Essential upgrade of authorisation servers |
| Resolved | 2025-08-11 13:50 | 2025-08-11 14:17 | SAFE, MFA at login | Cirrus login and SAFE not accessible | Work on the SAFE database made SAFE and Cirrus login MFA temporarily unavailable |
| Resolved | 2025-05-05 08:00 | 2025-05-05 10:30 | Slurm batch system, Lustre file system (/work) and the solid state file system (RPOOL) | Users can connect to login nodes and access their data, but no new jobs will start until further investigations take place | Issues with a switch on the Cirrus front end |
| Resolved | 2025-03-12 08:53 | 2025-03-12 10:00 | Slurm controller | Users can connect to the login node but jobs will not start on the compute nodes. Users will not be able to issue Slurm commands | Issues with the Slurm controller have been observed; systems team are investigating |
| Resolved | 2025-02-26 09:30 | 2025-02-26 12:00 | Group of Cirrus compute nodes | Work has been prevented from starting on the affected nodes. Work already running on these nodes may fail but should be uncharged | Systems team identified a technical fault with some Cirrus nodes; these nodes have now been restored |
| Resolved | 2025-02-19 09:00 | 2025-02-19 17:00 | /work file system | We do not expect any impact to users | Hardware vendor is working on the E1000 where the /work file system is hosted |
| Resolved | 2025-02-13 12:30 | 2025-02-25 09:00 | Solid state (/scratch) RPOOL file system | Any jobs using the /scratch file system will fail | The /scratch file system is 100% full |
| Resolved | 2025-01-27 12:00 | 2025-01-29 12:00 | Cirrus service | Service at higher risk of disruption than usual | Work to repair the power grid following a storm in the Edinburgh area means power issues are more likely during this period |
| Resolved | 2025-01-24 10:00 | 2025-01-24 17:00 | Whole Cirrus service | Service at higher risk of disruption than usual; if issues arise, service may take longer to restore | A red weather warning for high winds in the Edinburgh area has led to travel restrictions and a higher than usual risk of power/building damage |
| Resolved | 2025-01-08 09:00 | 2025-01-08 17:00 | Scratch (RPOOL) solid state file system | Scratch (RPOOL) solid state file system will be unavailable | Replacement of an InfiniBand card by the vendor |
| Resolved | 2024-12-11 09:00 | 2024-12-11 16:00 | /work file system | We do not expect any impact to users | Hardware vendor is working on the E1000 where the /work file system is hosted |
| Resolved | 2024-11-29 11:30 | 2024-11-29 18:00 | Access to /work; Slurm | Users unable to access their data within /work and cannot load the module environment. No new jobs starting; running jobs may fail. If the login shell hangs, use Ctrl+C to get a shell and access data on /home | Hardware failure in the E1000 Lustre storage |
| Resolved | 2024-11-27 13:00 | 2024-11-27 18:00 | /work file system and module environment | Users unable to access their data within /work and cannot load the module environment | Issues with the E1000 server |
| Resolved | 2024-11-21 11:40 | 2024-11-21 15:25 | Cirrus compute nodes | Cirrus compute nodes are down. Login nodes and the /home and /work file systems are available | Power incident at the ACF |
| Resolved | 2024-11-21 11:40 | 2024-11-21 12:00 | Access to Cirrus from outside the University of Edinburgh | No users will be able to access Cirrus from outside the University of Edinburgh network. Running/queued jobs are unaffected | Loss of power to part of the communication network |
| Resolved | 2024-10-28 11:15 | 2024-10-28 12:15 | EPCC SAFE | SAFE DNS name will be migrated; SAFE and TOTP availability at risk | Migration to new SAFE containers |
| Resolved | 2024-10-03 16:00 | 2024-10-04 19:20 | Slurm | Slurm commands on the login nodes fail. New work cannot be started; already running work will continue but will be held until Slurm is functional again. Logins and filesystems remain accessible | System restored to normal function |
| Resolved | 2024-09-12 17:15 | 2024-09-13 10:30 | RPOOL /scratch filesystem | /scratch is unavailable | Cause is being investigated with HPE |
| Resolved | 2024-09-11 12:15 | 2024-09-11 14:40 | Slurm scheduler | All jobs may have stopped running; new jobs would not start | Returning the scratch file system to service caused issues with the Slurm scheduler |
| Resolved | 2024-09-10 09:00 | 2024-09-11 17:00 | Scratch (RPOOL) solid state file system | Scratch (RPOOL) solid state file system will be unavailable | Firmware update |
| Resolved | 2024-08-28 11:00 | 2024-08-28 12:00 | E1000 /work file system | We do not expect any user impact, but if there is an issue it will be a short connectivity outage | Changing the power supply for the JANET CIENA unit |
| Resolved | 2024-08-27 12:00 | 2024-08-27 15:00 | Ceph home file system | We do not expect any user impact, but the file system may appear slower than usual | Rebalancing the Ceph file system |
| Resolved | 2024-08-21 11:00 | 2024-08-21 12:00 | E1000 /work file system | We do not expect any user impact, but if there is an issue it will be a short connectivity outage | Changing the power supply for the JANET CIENA unit |
| Resolved | 2024-08-14 10:00 | 2024-08-14 11:00 | Slurm | No user impact expected. There is a small risk that the Slurm daemon may be briefly unavailable; if you experience any issue, please wait a few minutes and resubmit. Running jobs will continue without interruption | Update to Slurm to reduce the maximum memory available to jobs |
| Resolved | 2024-07-28 20:55 | 2024-07-29 08:44 | Cirrus login access | Cirrus login unavailable | Login nodes exhausted their memory |
| Resolved | 2024-06-10 10:50 | 2024-06-11 14:00 | Solid state file system (RPOOL) | Scratch file system unavailable | Unknown; systems team are investigating |
| Resolved | 2024-05-27 03:09 | 2024-05-27 14:45 | Cirrus compute nodes | Compute nodes are unavailable; any jobs running at the time of the power incident will have failed | Power incident on the UK national grid in the Edinburgh area resulted in loss of power to Cirrus compute nodes |
| Resolved | 2024-05-22 11:30 | 2024-05-22 14:15 | Access to licence server | The licence server is inaccessible; our team are working to restore it | |
| Resolved | 2024-05-16 10:00 | 2024-05-16 17:00 | Home (Ceph) file system | No user impact expected, although users may see some connection issues | Update to the file system operating system and administration package updates |
| Resolved | 2024-05-02 10:50 | 2024-05-03 10:00 | Solid state file system (RPOOL) | Scratch file system unavailable | Failed disk; systems team are investigating |
| Resolved | 2024-04-30 10:00 | 2024-04-30 10:00 | Cirrus nodes | We do not expect any user impact | Configuration change to enable a specific workflow for a project |
| Resolved | 2024-04-30 08:35 | 2024-04-30 10:00 | Cirrus compute nodes and scratch (solid state RPOOL) file system | Jobs may have failed. Please contact the service desk if you think you require a refund (see the first example below this table for identifying affected jobs) | A switch connected to the Slurm controller is down, causing hangs on all nodes. Systems team are investigating, but all running jobs may have failed |
| Resolved | 2024-04-22 07:00 | 2024-04-26 17:00 | Full Cirrus service; no service available | Users will be unable to connect to Cirrus and will not be able to access their data. Jobs will not run | Power work is taking place at the Advanced Computing Facility (ACF) where Cirrus is located, and power will be removed from Cirrus |
| Resolved | 2024-04-17 10:30 | 2024-04-18 09:00 | RPOOL solid state file system | There should be no user impact due to the resilience of the system | Swapping and commissioning of one of the disks within the solid state file system |
| Resolved | 2024-04-15 10:00 | 2024-04-15 10:30 | DNS server for Cirrus and Cirrus SAFE | Users can still connect to the service but may be unable to access external websites (e.g. GitLab) | Migration of a server in preparation for the wider power work affecting the site the following week |
| Resolved | 2024-04-11 10:00 | 2024-04-11 10:40 | Cirrus ticketing server | There may be a delay in processing new user requests via SAFE | Migration of the Rundeck ticketing system |
| Resolved | 2024-03-24 23:00 | 2024-03-25 11:00 | CPU compute nodes | Around half of the CPU compute nodes on Cirrus are unavailable. Jobs running on affected compute nodes will have failed | A power incident caused CPU compute nodes to fail |
| Resolved | 2024-03-21 10:15 | 2024-03-21 17:30 | /scratch | /scratch unavailable, but login and compute nodes are now restored to service. Work is running once more and new jobs can be submitted | System has been restarted, but without /scratch |
| Resolved | 2024-03-04 10:00 | 2024-03-06 12:10 | /scratch (solid state storage) | /scratch (solid state storage) is unavailable from compute and login nodes | A disk has failed within the solid state storage appliance |
| Resolved | 2024-02-02 10:15 | 2024-02-05 14:00 | All jobs | The failed cooling system has been fixed and the system returned to service. While it was down, no jobs were able to run; jobs running at the time should be automatically refunded. Users could still access the login nodes, access their data and submit jobs, though the short QoS did not work and queued work remained queued until the compute nodes were returned | The cooling system failed and all compute nodes (CPU and GPU) were shut down. The hardware vendor came onsite on the Monday to repair Cirrus |
| Resolved | 2024-01-30 08:00 | 2024-03-12 18:00 | SAFE accounts | New accounts will appear in SAFE: users may notice duplicate accounts, as username@cirrus will be replicated and a username@eidf will also appear | Cirrus service is transitioning from LDAP to IPA |
| Resolved | 2024-01-29 14:00 | 2024-02-01 10:00 | /scratch (solid state storage) | /scratch (solid state storage) is unavailable from compute and login nodes | Issue with the file system under investigation |
| Resolved | 2023-12-21 09:00 | 2023-12-21 12:00 | Slurm | Submitting new jobs to the scheduler, starting of jobs and querying of jobs will be unavailable for the duration of the work. Running jobs will continue without interruption | Update to Slurm |
| Resolved | 2023-11-29 09:00 | 2023-11-29 17:00 | Slurm | Submitting new jobs to the scheduler, starting of jobs and querying of jobs will be unavailable for the duration of the work. Running jobs will continue without interruption | Update to Slurm |
| Resolved | 2023-11-15 09:00 | 2023-11-15 09:30 | Cirrus | The process to connect to Cirrus will change from SSH key and password to SSH key and TOTP | Implementing MFA on Cirrus |
| Resolved | 2023-10-27 09:30 | 2023-10-27 12:00 | /work file system | Cannot create new files on /work | The /work file system has reached its maximum number of inodes (number of files). Users have been asked to reduce their file counts by tarring up directories and deleting or moving data that is no longer required on Cirrus (see the archiving example below this table) |
| Resolved | 2023-08-07 12:00 | 2023-08-08 14:10 | /scratch solid state storage | /scratch unavailable | Update 14:00, 8 August: system restored and work running once more. Update 11:50: following the reboot, access to login nodes has been restored; compute nodes are rebooting and work will restart once this completes. A full system reboot took place starting at 11:30 on Tuesday 8 August. Our systems team worked with the vendor to understand and resolve the issue |
| Resolved | 2023-07-08 10:30 | 2023-07-11 18:30 | /scratch solid state storage | /scratch unavailable | Update Tuesday 11th 18:30: /scratch has been restored to service. Update Monday 10th 18:00: systems team have been working with HPE to understand and resolve the issue; running work is continuing, but new work is blocked from starting. Systems team are investigating |
| Resolved | 2023-06-18 15:30 | 2023-06-19 11:15 | Compute nodes | All compute nodes unavailable | A power event at the site that hosts Cirrus caused cooling issues that required the Cirrus compute nodes to be taken offline |
| Resolved | 2023-06-03 21:10 | 2023-06-05 13:30 | CPU compute nodes | The majority of CPU compute nodes are now returned to service (8 nodes remain out of service) | A power event at the site that hosts Cirrus made most of the CPU compute nodes unavailable |
| Resolved | 2023-05-30 12:30 | 2023-06-02 10:00 | Cirrus /home filesystem | An issue on the network impacts the Ceph filesystem underpinning /home on Cirrus; /home is currently unavailable to users | Update 2023-06-01 18:00: progress has been made in understanding the complex issues, and the vendor and our systems team believe a stable network can be restored tomorrow. Once we have validation of a stable network we will restore the Cirrus /home filesystem |
| Resolved | 2023-05-26 10:00 | 2023-06-01 15:00 | Slurm scheduler | Slurm commands (e.g. `squeue`, `sbatch`) may not work | Update 2023-06-01 15:00: the validation tests and configuration are now complete. Update 2023-05-31 17:30: the replacement for the faulty hardware component on a Cirrus administration node has arrived on site and been swapped successfully. Our team now need to perform validation tests, remove the workaround and complete the configuration; the service continues to run at risk until the work has been completed |
| Resolved | 2023-05-21 19:00 | 2023-05-22 13:30 | Login, compute, file systems | Users will not be able to connect to Cirrus; all running jobs will have failed | Power and cooling issues at the ACF data centre |
| Resolved | 2023-05-21 19:00 | 2023-05-25 10:30 | /scratch solid state storage | /scratch is now available on login and compute nodes | HPE were investigating issues with the stability of the /scratch file system |
| Resolved | 2023-05-19 14:00 | 2023-06-12 12:00 | Arm Forge (MAP/DDT) tools | Arm Forge software is not available on Cirrus | The licence server is hosted by ARCHER2, which is down for a multi-week upgrade |
| Resolved | 2023-03-13 21:00 | 2023-03-14 15:25 | /work (Lustre) parallel file system | Users may see a lack of responsiveness on Cirrus login nodes and reduced I/O performance | Heavy load on the Lustre file system is causing contention for shared resources |
| Resolved | 2022-11-21 09:00 | 2022-11-21 11:00 | Login nodes | The Cirrus login nodes are currently unavailable to users | The Ceph home file system has issues due to a failure at the data centre |
| Resolved | 2022-11-13 10:00 | 2022-11-14 09:00 | SAFE website | Users will get a security warning when trying to access the SAFE website; some web browsers (e.g. Chrome) will not connect to the SAFE website; the Cirrus load plot on the status page will not work | The website certificate has expired (see the certificate-check example below this table) |
| Resolved | 2022-10-24 08:00 | 2022-10-24 09:30 | Login nodes | Login access to Cirrus is currently unavailable | Login nodes became unresponsive |
| Resolved | 2022-09-12 07:00 | 2022-09-14 14:30 | Login nodes, compute nodes | Access to solid state storage (/scratch) is now available from login nodes and from CPU and GPU nodes. Software modules are now available and are loaded from Lustre (/work) | The RPOOL solid state storage has an error on one of the NVMe devices |
| Resolved | 2022-08-24 09:00 | 2022-08-24 13:00 | Login nodes, data access and SAFE | Up to 4 hours' loss of connection to Cirrus login nodes, data and SAFE. Running jobs will not be affected and new jobs will start | Essential updates to the network configuration. Users will be notified when service is resumed |
| Resolved | 2022-07-21 09:00 | 2022-07-21 10:50 | Slurm control and accounting services | Submitting new jobs to the scheduler, starting of jobs and querying of jobs will be unavailable for the duration of the work. Running jobs will continue without interruption. Work was completed ahead of schedule at 10:50 | Update to Slurm |
| Resolved | 2022-07-06 09:00 | 2022-07-06 11:25 | RPOOL /scratch filesystem | No user impact expected | RAID check to be performed on a disk reporting issues |
| Resolved | 2022-06-23 15:00 | 2022-06-24 15:00 | Full Cirrus service: login nodes, compute nodes (CPU and GPU), all filesystems | Users will not be able to connect to the service, no jobs will run and filesystems are unavailable | Issues with the scratch filesystem following a successful RAID card replacement |
| Resolved | 2022-06-22 09:00 | 2022-06-23 15:00 | Rack containing 280 CPU nodes | The RAID card has been replaced successfully. There are a few issues on the /scratch filesystem which are not related to the RAID card, so further testing will be performed before the full service is resumed; a new alert has been created for this | Replacement of a RAID card in the main CPU rack |
| Resolved | 2022-06-08 09:00 | 2022-06-08 17:00 | Servers which provide the /scratch filesystem | No user impact is expected | Two configuration changes are required to resolve reliability and integrity issues |
| Resolved | 2022-05-18 09:00 | 2022-05-18 12:00 | Cirrus compute and login nodes, including the /scratch servers | Users will not be able to connect to Cirrus and no jobs will run. The system has been drained ahead of this maintenance. Once the work is complete a user mailing will be sent out | A configuration setting change is required for /scratch to resolve the issues which were impacting users' work last week |
| Resolved | 2022-05-09 09:00 | 2022-05-09 17:00 | Cirrus /scratch servers | All user activity was affected as modules could not be loaded. Partial resolution in the afternoon allowed users to access Cirrus, but modules still could not be loaded and new jobs would not run correctly | Under investigation by systems team and HPE |
| Resolved | 2022-04-21 10:30 | 2022-04-21 11:30 | /home file system and module environment | Users unable to access their data within /home and cannot load the module environment | Issues with the E1000 server |
| Resolved | 2022-03-28 16:50 | 2022-03-28 16:53 | Cirrus | Users logged out of login node 1 | Login node 1 was shut down due to operator error. Users can reconnect and will be automatically directed to other login nodes |
| Resolved | 2022-01-09 10:00 | 2022-01-10 10:30 | Cirrus, SAFE | Outgoing network access from Cirrus to external sites was not working. SAFE response was slow or degraded | DNS issue at the datacentre |
| Resolved | 2021-11-05 08:00 | 2021-11-05 11:30 | Cirrus CPU compute nodes | A small number of CPU compute nodes were unavailable to run jobs | A power incident in the Edinburgh area caused a number of CPU compute nodes to lose power |
| Resolved | 2021-10-28 11:30 | 2021-10-28 16:00 | Cirrus | Some external connections, such as git pulls/pushes and licence servers, are unavailable | Network changes at the ACF caused issues; the University network team have resolved them |
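
Several alerts above (e.g. 2024-04-30 and 2024-02-26) note that jobs running during an incident may have failed and may be eligible for a refund. Below is a minimal, illustrative sketch of how to list such jobs with Slurm's `sacct`; the time window shown is taken from the 2024-04-30 alert and is an assumption to be replaced with the window of the alert in question.

```bash
# List your jobs that ended in a failure state during an incident
# window, to support a refund query to the service desk.
# The window below is illustrative -- substitute the alert's own times.
sacct -X \
  --starttime=2024-04-30T08:35:00 \
  --endtime=2024-04-30T10:00:00 \
  --state=FAILED,NODE_FAIL,CANCELLED \
  --format=JobID,JobName%20,Partition,State,Elapsed,ExitCode
```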
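
The 2023-10-27 alert was an inode (file count) exhaustion on /work rather than a byte-quota problem, so the user-side fix is to pack many small files into a single archive. A minimal sketch, assuming a hypothetical project path and directory name:

```bash
# Pack a finished run directory into one archive and remove the
# originals, turning thousands of inodes into one.
# "proj01", "username" and "old_run" are placeholders -- use your own.
cd /work/proj01/proj01/username
tar -czf old_run.tar.gz old_run/
rm -rf old_run/

# Lustre can report per-user usage, including file (inode) counts:
lfs quota -hu "$USER" /work
```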
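
For incidents like the expired SAFE website certificate on 2022-11-13, users can verify a certificate's expiry date themselves with `openssl`. A minimal sketch; the host name is an assumption and should be replaced with the SAFE address you actually use:

```bash
# Print the expiry date of a site's TLS certificate.
# safe.epcc.ed.ac.uk is assumed here -- substitute the real SAFE host.
echo | openssl s_client -connect safe.epcc.ed.ac.uk:443 \
        -servername safe.epcc.ed.ac.uk 2>/dev/null |
  openssl x509 -noout -enddate
```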