Farm HPC cluster status is Minor Service Outage

Status timeline: Tue 27 to Mon 2 (now)

Farm HPC cluster Slurm

Last updated 1 minute ago from official status page.

Active Incidents

Farm's slurmdbd having intermittent issues
Started 24 Apr 2025 00:22:22 (1 month ago), still ongoing
Major Incident
Investigating
Slurm

Farm's slurmdbd is having intermittent issues. If you see an error like the one below, the problem has recurred, and we will restart slurmdbd to bring it back into service.

"""
sacctmgr: error: _open_persist_conn: failed to open persistent connection to host:monitoring-ib:6819: Connection timed out
sacctmgr: error: Sending PersistInit msg: Connection timed out
"""

We have a support case open with SchedMD and will update this issue as we learn more.
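
For users who script around Slurm, the error quoted above can be recognized in captured command output. The following is a minimal sketch, not an official tool: the function name `slurmdbd_timed_out` and the regular expression are illustrative assumptions based only on the sample message in this incident, and real messages may differ in host, port, or wording.

```python
import re

# Assumption: the pattern is derived solely from the sample error quoted
# in the incident text; it is not an exhaustive list of slurmdbd failures.
SLURMDBD_TIMEOUT = re.compile(
    r"sacctmgr: error: .*(_open_persist_conn|PersistInit).*Connection timed out"
)


def slurmdbd_timed_out(output: str) -> bool:
    """Return True if any line of `output` resembles the slurmdbd timeout error."""
    return any(SLURMDBD_TIMEOUT.search(line) for line in output.splitlines())


# The sample message from the incident report, split into its two lines.
sample = (
    "sacctmgr: error: _open_persist_conn: failed to open persistent connection "
    "to host:monitoring-ib:6819: Connection timed out\n"
    "sacctmgr: error: Sending PersistInit msg: Connection timed out"
)

if slurmdbd_timed_out(sample):
    print("slurmdbd appears unreachable; retry after it has been restarted")
```

A script could pass the captured stderr of a `sacctmgr` call to this function and back off before retrying, rather than failing outright.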

Recently Resolved Incidents

No recent incidents

    Farm HPC cluster Components

    Farm HPC cluster Login
    Farm HPC cluster Storage
    Farm HPC cluster File transfer node
    Farm HPC cluster high2,med2,low2
    Farm HPC cluster high,med,low
    Farm HPC cluster bmh,bmm
    Farm HPC cluster bigmemh,bigmemm
    Farm HPC cluster bgpu
    Farm HPC cluster gpuh,gpum
    Farm HPC cluster Email
    Farm HPC cluster Virtualization
        Proxmox Virtualization Nodes
        Ganetti cluster
    Farm HPC cluster Slurm
    Farm HPC cluster Software