
Farm HPC cluster Status
Real-time updates of Farm HPC cluster issues and outages
Farm HPC cluster status is Minor Service Outage
Farm HPC cluster Slurm
Active Incidents
Farm's slurmdbd is having intermittent issues. If you see an error like the one below, the problem has occurred again, and we will restart slurmdbd to bring it back into service.
"""sacctmgr: error: _open_persist_conn: failed to open persistent connection to host:monitoring-ib:6819: Connection timed out
sacctmgr: error: Sending PersistInit msg: Connection timed out"""
We have a support case open with SchedMD and will update this issue as we learn more.
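If you wrap Slurm accounting commands in your own scripts, it may help to recognize this failure programmatically rather than treat it as a problem with your job. The following is a minimal sketch, not an official Farm tool: the function name `looks_like_slurmdbd_timeout` is hypothetical, and the patterns are taken only from the error text quoted above, so other slurmdbd failure modes would not be matched.

```python
import re

# Patterns derived from the error message quoted in this incident.
# Assumption: the host and port may differ, so the host part is matched loosely.
_SLURMDBD_TIMEOUT_PATTERNS = (
    re.compile(
        r"_open_persist_conn: failed to open persistent connection "
        r"to host:\S+: Connection timed out"
    ),
    re.compile(r"Sending PersistInit msg: Connection timed out"),
)


def looks_like_slurmdbd_timeout(output: str) -> bool:
    """Return True if the text contains either slurmdbd timeout signature."""
    return any(p.search(output) for p in _SLURMDBD_TIMEOUT_PATTERNS)


# The sample below is the message from this incident, one error per line.
sample = (
    "sacctmgr: error: _open_persist_conn: failed to open persistent "
    "connection to host:monitoring-ib:6819: Connection timed out\n"
    "sacctmgr: error: Sending PersistInit msg: Connection timed out\n"
)

if looks_like_slurmdbd_timeout(sample):
    print("slurmdbd appears unreachable; retry later")
```

A script could pass the captured error output of a Slurm accounting command to this function and back off instead of failing outright when it returns `True`.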
Recently Resolved Incidents
No recent incidents
Farm HPC cluster Outage Survival Guide
Farm HPC cluster Components
Farm HPC cluster Login
Farm HPC cluster Storage
Farm HPC cluster File transfer node
Farm HPC cluster high2,med2,low2
Farm HPC cluster high,med,low
Farm HPC cluster bmh,bmm
Farm HPC cluster bigmemh,bigmemm
Farm HPC cluster bgpu
Farm HPC cluster gpuh,gpum
Farm HPC cluster Email
Farm HPC cluster Virtualization
Proxmox Virtualization Nodes
Ganeti cluster
Farm HPC cluster Slurm