- Current System Load
- Known Issues
- Current Issues
- Recent Issues
- Maintenance
- Module Updates
- Service Calendar and Maintenance
Current System Load
The plot below shows the status of the CPU nodes on the current Cirrus service for the past day (note: the Cirrus GPU nodes are not included in this plot).
A description of each of the status types is provided below the plot.
CPU
- alloc: Nodes running user jobs
- idle: Nodes available for user jobs
- resv: Nodes in reservation and not available for standard user jobs
- down, drain, maint, drng, comp: Nodes unavailable for user jobs
- mix: Nodes in multiple states
GPU
- alloc: Nodes running user jobs
- idle: Nodes available for user jobs
- resv: Nodes in reservation and not available for standard user jobs
- down, drain, maint, drng, comp: Nodes unavailable for user jobs
- mix: Nodes in multiple states
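As an illustration of how the status types above group together, here is a minimal Python sketch that tallies node counts per legend category. The `STATE COUNT` input format, the helper names `categorise` and `tally`, the handling of trailing flag characters, and the sample figures are all assumptions made for this example; they are not real Cirrus data or tooling.

```python
from collections import Counter

# States that the legend above groups together as unavailable for user jobs.
UNAVAILABLE = {"down", "drain", "maint", "drng", "comp"}


def categorise(state: str) -> str:
    """Map a short node state to one of the legend categories above."""
    # Assumption: a state name may carry trailing flag characters such as "*".
    name = state.lower().rstrip("*~#!%$@^-+")
    if name in UNAVAILABLE:
        return "unavailable"
    if name in {"alloc", "idle", "resv", "mix"}:
        return name
    return "other"


def tally(lines):
    """Sum node counts per category from 'STATE COUNT' lines."""
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        state, count = parts
        totals[categorise(state)] += int(count)
    return dict(totals)


# Hypothetical sample input; not real Cirrus data.
sample = ["alloc 200", "idle 40", "mix 12", "resv 8", "drain 3", "down* 2"]
print(tally(sample))
```

In this sketch the `drain` and `down*` entries are both counted under a single `unavailable` category, matching the way the legend groups them.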
Known Issues
We are experiencing a heavy load on the metadata server. Our systems team are investigating, but we suspect this is due to one or more users performing many I/O operations. We apologise for the inconvenience this is causing.
Service Alerts
No current service alerts
Recently Resolved Service Alerts
This table lists resolved service alerts from the past 30 days. A full list of historical resolved service alerts is available.
Status | Type | Start | End | Scope | User Impact | Reason |
---|---|---|---|---|---|---|
Resolved | Service alert | 2023-03-13 21:00 | 2023-03-14 15:25 | Work (Lustre) parallel file system | Users may see lack of responsiveness on Cirrus login nodes and reduced IO performance | Heavy load on the Lustre file system is causing contention for shared resources |
Service Calendar and Maintenance
Maintenance Sessions: Quarter 4 2022 (1st October - 31st December 2022)
Quarter 1 2023
Status | Type | Start | End | System | User Impact | Reason |
---|---|---|---|---|---|---|
Planned | Partial Maintenance | 2023-02-07 09:00 | 2023-02-07 17:00 | Cirrus | CPU and GPU compute nodes will be unavailable. Login access and access to data will still be available. | Essential maintenance to the Cirrus liquid cooling system. |
Maintenance Logs for previous periods
Module Updates
Module Update following Cirrus Upgrade September 2022
Description | Reason | Advice |
---|---|---|
Removed Molpro module and user doc section | No longer functional | No longer centrally supported on Cirrus |
Forge to be updated | v20.0.3 found to have security flaw | Pending. Newer version will be installed as a replacement. |
Updated mpi4py | All the mpi4py modules are tied to a particular version of python, 3.8.12. More flexibility is required such that users can run python-based parallel code using different python versions. | The mpi4py modules have been replaced by a suite of python modules: python/3.8.13, python/3.8.13-gpu, python/3.9.12, and python/3.9.12-gpu. The gpu modules load a miniconda3 python environment containing mpi4py 3.1.3 linked with OpenMPI 4.1.x and CUDA 11.6; whereas the cpu modules (no -gpu suffix) load a python environment containing mpi4py 3.1.3 linked with HPE MPT 2.25. (The python/3.8.13-gpu module is linked with OpenMPI 4.1.2 and the python/3.9.12-gpu module is linked with OpenMPI 4.1.4.) |
Updated horovod | Updated | Module version 0.24.2-gpu has been replaced by 0.25.0-gpu. |
Updated pytorch | Updated | Module version 1.11.0-gpu has been replaced by 1.12.0-gpu. |
Updated tensorflow | Updated | Module version 2.8.0-gpu has been replaced by 2.9.1-gpu. |
Updated scalasca | Version 2.5 no longer functional. | Please use 2.6-gcc8-mpt225 or 2.6-intel19-mpt225 instead. |
Removed spack/2020 module | Not used. | Not required. Please contact the service desk if Spack installation is needed. |
Updated tmux | Version 3.1b no longer functional. | Version 3.3a provided as replacement. |
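The replacement of the mpi4py modules described in the table above can be summarised as a small lookup. This is a sketch only: the module names and library versions are transcribed from the table, while `PYTHON_MODULES` and `pick_module` are hypothetical names introduced for illustration and are not part of any Cirrus tooling.

```python
# Details transcribed from the "Updated mpi4py" row of the table above.
# Every listed module provides mpi4py 3.1.3.
PYTHON_MODULES = {
    "python/3.8.13": {"mpi": "HPE MPT 2.25", "cuda": None},
    "python/3.8.13-gpu": {"mpi": "OpenMPI 4.1.2", "cuda": "11.6"},
    "python/3.9.12": {"mpi": "HPE MPT 2.25", "cuda": None},
    "python/3.9.12-gpu": {"mpi": "OpenMPI 4.1.4", "cuda": "11.6"},
}


def pick_module(python_version: str, gpu: bool = False) -> str:
    """Return the module name for a Python version, or raise if none is listed."""
    name = f"python/{python_version}" + ("-gpu" if gpu else "")
    if name not in PYTHON_MODULES:
        raise ValueError(f"no such module listed: {name}")
    return name


module = pick_module("3.9.12", gpu=True)
print(f"module load {module}  # mpi4py 3.1.3 with {PYTHON_MODULES[module]['mpi']}")
```

The printed line is the `module load` command a user would type for that combination; versions not listed in the table raise an error rather than guessing a module name.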
At Risk Maintenance Sessions
An ‘At-Risk’ session is provisionally booked every Wednesday from 10:00 to 12:00. A mailing will be sent to users if any work that may affect them is going to take place.
Service Calendar
We maintain a calendar for the Cirrus service that lists upcoming events (such as training courses and maintenance sessions):
We keep maintenance downtime to a minimum on the service but do occasionally need to perform essential work on the system. Maintenance sessions are used to ensure that:
- software versions are kept up to date;
- firmware levels on HPE and third-party peripheral equipment are kept up to date;
- essential security patches are applied;
- failed/suspect hardware can be replaced;
- new software can be installed;
- periodic essential maintenance on HPE electrical and mechanical support equipment (refrigeration systems, air blowers and power distribution units) can be undertaken safely.
Additional maintenance sessions can be scheduled for major hardware or software updates; major upgrades to facility plant and infrastructure; acceptance testing following major service upgrades; and statutory electrical testing.