Current System Load

The plot below shows the status of the CPU nodes on the current Cirrus service for the past day (note: the Cirrus GPU nodes are not included in this plot).

A description of each of the status types is provided below the plot.

CPU

Cirrus Node Status graph

GPU

Cirrus GPU Node Status graph

Known Issues

We are experiencing a heavy load on the metadata server. Our systems team are investigating, but we suspect this is due to user(s) performing many I/O operations. We apologise for the inconvenience this is causing users.

Service Alerts

| Status | Type | Start | End | Scope | User Impact | Reason |
|---|---|---|---|---|---|---|
| Ongoing | Service Alert | 2024-03-21 10:15 | | /scratch unavailable but login and compute nodes are now restored to service. | Work is running once more and new jobs can be submitted. | System has been restarted but without /scratch. |
| Ongoing | Service Alert | 2024-01-30 00:00 | | New accounts will appear in SAFE | Users may notice that they have duplicate accounts within SAFE. username@cirrus will be replicated and a username@eidf will also appear | Cirrus service is transitioning from using ldap to ipa |

Recently Resolved Service Alerts

This table lists resolved service alerts from the past 30 days. A full list of historical resolved service alerts is available.

| Status | Type | Start | End | Scope | User Impact | Reason |
|---|---|---|---|---|---|---|
| Resolved | Service Alert | 2024-03-24 23:00 | 2024-03-25 11:00 | CPU compute nodes | Around half of the CPU compute nodes on Cirrus are unavailable. Jobs running on affected compute nodes will have failed. | A power incident caused CPU compute nodes to fail. |
| Resolved | Service Alert | 2024-03-04 10:00 | 2024-03-06 12:10 | /scratch (solid state storage) | /scratch (solid state storage) is unavailable from compute and login nodes | A disk has failed within the solid state storage appliance |

Service Calendar and Maintenance

This section lists recent and upcoming maintenance sessions. A full list of past maintenance sessions is available.

| Status | Type | Start | End | Scope | User Impact | Reason |
|---|---|---|---|---|---|---|
| Planned | Full | 12 March 2024 09:00 | 12 March 2024 17:00 | Cirrus | No login access. No access to any data on the system. Jobs will not run, and queued jobs will be deleted. | Migration to E1000 including the change in authentication protocol and addition of new file system. |

Maintenance Logs for previous periods

Previous maintenance logs

At Risk Maintenance Sessions

There is an ‘At-Risk’ session provisionally booked every Wednesday from 10:00 to 12:00. A user mailing will be sent if any work that may impact users is going to take place.

Service Calendar

We maintain a calendar for the Cirrus service that lists upcoming events (such as training courses and maintenance sessions):

We keep maintenance downtime to a minimum on the service but do occasionally need to perform essential work on the system. Maintenance sessions are used to ensure that:

Additional maintenance sessions can be scheduled for major hardware or software updates; major upgrades to facility plant and infrastructure; acceptance testing following major service upgrades; and statutory electrical testing.