
Farm HPC cluster Status
Real-time updates of Farm HPC cluster issues and outages
Farm HPC cluster status is Operational
Farm HPC cluster Login
Farm HPC cluster Storage
Farm HPC cluster File transfer node
Farm HPC cluster high2,med2,low2
Farm HPC cluster high,med,low
Farm HPC cluster bmh,bmm
Farm HPC cluster bigmemh,bigmemm
Active Incidents
Farm's slurmdbd is having intermittent issues. If you see an error like the one below, the problem has recurred, and we will restart slurmdbd to bring it back into service.
"""
sacctmgr: error: _open_persist_conn: failed to open persistent connection to host:monitoring-ib:6819: Connection timed out
sacctmgr: error: Sending PersistInit msg: Connection timed out
"""
We have a support case open with SchedMD and will update this issue as we learn more.
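Until the issue is resolved, scripts that call Slurm accounting commands may want to tolerate this error. The following is a minimal sketch, not an official Farm recommendation: it assumes the error text appears on stderr as quoted above, and the names `is_slurmdbd_timeout` and `run_with_retry`, along with the retry count and delay, are hypothetical choices for illustration.

```python
import subprocess
import time

# Substrings taken from the error text quoted in the incident above.
TIMEOUT_MARKERS = (
    "failed to open persistent connection",
    "Sending PersistInit msg: Connection timed out",
)


def is_slurmdbd_timeout(stderr_text):
    """Return True if the text contains the slurmdbd timeout error."""
    return any(marker in stderr_text for marker in TIMEOUT_MARKERS)


def run_with_retry(cmd, attempts=3, delay_seconds=60):
    """Run cmd, retrying only when the slurmdbd timeout error appears.

    attempts and delay_seconds are illustrative defaults, not site policy.
    """
    result = None
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if not is_slurmdbd_timeout(result.stderr):
            return result  # success, or an unrelated failure
        if attempt < attempts:
            time.sleep(delay_seconds)  # give slurmdbd time to be restarted
    return result
```

A call such as `run_with_retry(["sacctmgr", "show", "cluster"])` would retry only on the timeout signature and return any other result unchanged. A long delay between attempts is deliberate here, since the service has to be restarted by staff before the command can succeed.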
Recently Resolved Incidents
Bi-annual maintenance due to data center generator test and upgrades.
Farm HPC cluster Outage Survival Guide
Farm HPC cluster Components
Farm HPC cluster Login
Bi-annual maintenance due to data center generator test and upgrades.
Farm HPC cluster Storage
Bi-annual maintenance due to data center generator test and upgrades.
Farm HPC cluster File transfer node
Bi-annual maintenance due to data center generator test and upgrades.
Farm HPC cluster high2,med2,low2
Bi-annual maintenance due to data center generator test and upgrades.
Farm HPC cluster high,med,low
Bi-annual maintenance due to data center generator test and upgrades.
Farm HPC cluster bmh,bmm
Bi-annual maintenance due to data center generator test and upgrades.
Farm HPC cluster bigmemh,bigmemm
Bi-annual maintenance due to data center generator test and upgrades.
Farm HPC cluster bgpu
Bi-annual maintenance due to data center generator test and upgrades.
Farm HPC cluster gpuh,gpum
Bi-annual maintenance due to data center generator test and upgrades.
Farm HPC cluster Email
Bi-annual maintenance due to data center generator test and upgrades.
Farm HPC cluster Virtualization
Proxmox Virtualization Nodes
Bi-annual maintenance due to data center generator test and upgrades.
Ganeti cluster
Bi-annual maintenance due to data center generator test and upgrades.
Farm HPC cluster Slurm
Farm's slurmdbd is having intermittent issues. If you see an error like the one below, the problem has recurred, and we will restart slurmdbd to bring it back into service.
"""
sacctmgr: error: _open_persist_conn: failed to open persistent connection to host:monitoring-ib:6819: Connection timed out
sacctmgr: error: Sending PersistInit msg: Connection timed out
"""
We have a support case open with SchedMD and will update this issue as we learn more.
Bi-annual maintenance due to data center generator test and upgrades.
Farm HPC cluster Software
Bi-annual maintenance due to data center generator test and upgrades.