OSCER users,

OSCER scheduled maintenance outage: Wed July 19, 8am-midnight CT

Affected systems: supercomputer, OURdisk, OURRstore

---

(1) OURRstore: Fix the mount on dtn2.

---

(2) OURdisk: Begin testing auto-compression, to reduce the rate at which physical disk capacity is consumed. (Illustrative sketch in the P.S. below.)

---

(3) Supercomputer: Slurm batch scheduler

(3a) Move the Slurm batch job database to faster hardware. (Illustrative sketch in the P.S. below.)

Before the outage:

(3a-i) Install additional RAM capacity in the physical servers that Slurm runs on (already completed).

(3a-ii) Move the Slurm database from slow spinning disk to fast SSD.

During the outage:

(3a-iii) Adjust the Slurm virtual machine to include enough RAM to fit the entire database.

(3b) Update the Slurm version, to allow use of a burst buffer of very fast NVMe SSDs for applications that do heavy random read I/O (that is, many I/O Operations Per Second, or IOPS), such as machine learning. (Illustrative sketch in the P.S. below.)

(3b-i) Update Slurm to version 22. (It's currently at version 20, and it's unwise to upgrade Slurm by more than 2 major versions at a time.) Historically, this procedure took many hours to complete, when the Slurm database was on spinning disk. But, once the Slurm database is entirely in RAM, we expect this procedure to go much faster. (We plan to do some pre-testing, to get a feel for how long we should expect this upgrade to take.)

(3b-ii) If time permits, update Slurm to the most recent stable version (version 23).

(3c) Set the maximum possible priority plus-up for pending duration much higher, to ensure that long-pending jobs continue to increase in priority, so that they don't get stuck pending forever. (Illustrative sketch in the P.S. below.)

As always, we apologize for the inconvenience -- our goal is to make OSCER resources even better.

The OSCER Team ([log in to unmask])
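
P.S. For the technically curious, here are a few illustrative sketches of the kinds of changes described above. These are generic examples with made-up names and numbers, NOT our actual commands or configuration files.

Re (2), OURdisk auto-compression: assuming a Ceph/BlueStore storage pool (with a hypothetical pool name "ourdisk"), transparent compression can be enabled per pool like this:

    # Hypothetical pool name; the algorithm and mode settings
    # are standard Ceph/BlueStore pool options.
    ceph osd pool set ourdisk compression_algorithm lz4
    ceph osd pool set ourdisk compression_mode aggressive

With compression_mode set to "aggressive", Ceph attempts to compress all newly written data, which is what slows the rate of physical capacity consumption.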
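
Re (3a), fitting the Slurm database in RAM: Slurm's accounting database runs on MariaDB/MySQL, and Slurm's accounting documentation recommends sizing the InnoDB buffer pool so the database stays cached in memory. A hypothetical excerpt (the buffer pool size is made up, not our real setting):

    # /etc/my.cnf.d/server.cnf (hypothetical values)
    [mysqld]
    # Big enough to hold the entire Slurm accounting database in RAM,
    # so reads rarely touch disk once the cache is warm.
    innodb_buffer_pool_size = 64G
    # Log and lock settings per Slurm's accounting documentation.
    innodb_log_file_size = 64M
    innodb_lock_wait_timeout = 900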
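
Re (3b), using a burst buffer from a batch job: the exact syntax will depend on which burst buffer plugin gets deployed, and we'll announce the details separately. As a generic sketch, sbatch has a plugin-dependent --bb option, and a job might stage its dataset onto the NVMe buffer before computing against it:

    #!/bin/bash
    # Hypothetical job script: the --bb specification format and the
    # $BB_DIR variable are plugin-dependent placeholders, not real
    # OSCER settings.
    #SBATCH --job-name=ml_train
    #SBATCH --time=12:00:00
    #SBATCH --bb="capacity=1TB"

    # Stage the training data onto the fast NVMe burst buffer, then
    # read it from there to exploit the high random-read IOPS.
    cp -r /scratch/$USER/dataset "$BB_DIR"/dataset
    python train.py --data "$BB_DIR"/dataset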
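
Re (3c), the pending-duration priority plus-up: in Slurm's multifactor priority plugin, the relevant knobs are PriorityWeightAge (the maximum priority points a job can earn from time spent pending) and PriorityMaxAge (the pending time at which that boost tops out). An illustrative slurm.conf excerpt (numbers made up):

    # slurm.conf (illustrative values only)
    PriorityType=priority/multifactor
    # Maximum priority points earned from pending duration ("age").
    PriorityWeightAge=100000
    # Pending time at which the age boost saturates (days-hours).
    PriorityMaxAge=14-0

Raising PriorityWeightAge, and/or stretching PriorityMaxAge, keeps long-pending jobs climbing in priority instead of stalling at a low cap.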