OSCER-USERS-L Archives

OSCER users

OSCER-USERS-L@LISTS.OU.EDU

Subject: OSCER scheduled maintenance outage Wed July 19 8am-midnight CT
From: Henry Neeman <[log in to unmask]>
Reply-To: Henry Neeman <[log in to unmask]>
Date: Thu, 13 Jul 2023 10:46:23 -0500
OSCER users,

OSCER scheduled maintenance outage Wed July 19 8am-midnight CT

Affected systems: supercomputer, OURdisk, OURRstore

---

(1) OURRstore: Fix the mount on dtn2.
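
(For illustration only: the commands below are generic Linux examples,
and the actual mount point and fix on dtn2 aren't specified here.
Checking and restoring a filesystem mount typically looks like this:)

    # Check whether the filesystem is currently mounted
    # (the mount point name here is hypothetical):
    findmnt /ourrstore || echo "not mounted"

    # After correcting the relevant /etc/fstab entry,
    # mount everything listed there that isn't already mounted:
    sudo mount -a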

---

(2) OURdisk: Begin testing auto-compression, to
reduce the rate of physical disk capacity consumption.
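
(This note doesn't say what storage software OURdisk runs, so the
following is purely an illustration of the kind of setting involved.
On a Ceph-based system, for example, inline compression could be
enabled roughly like this:)

    # Hypothetical Ceph example; not necessarily what OURdisk uses.
    # Turn on aggressive inline compression on all OSDs, using zstd:
    ceph config set osd bluestore_compression_mode aggressive
    ceph config set osd bluestore_compression_algorithm zstd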

---

(3) Supercomputer: Slurm batch scheduler

(3a) Move the Slurm batch job database to faster hardware.

Before the outage:

(3a-i) Install additional RAM capacity in the physical servers
that Slurm runs on (already completed).

(3a-ii) Move the Slurm database from slow spinning disk to
fast SSD.

During the outage:

(3a-iii) Adjust the Slurm virtual machine to include
enough RAM to fit the entire database.
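
(For a sense of what "fit the entire database" means in practice:
slurmdbd stores its accounting data in MySQL/MariaDB, so the goal is
for the database server's buffer pool to hold the whole database.
A rough sketch, using slurmdbd's default database name:)

    # Estimate the on-disk size of the Slurm accounting database
    # ('slurm_acct_db' is slurmdbd's default database name):
    mysql -e "SELECT ROUND(SUM(data_length + index_length)/1024/1024) AS size_mb
              FROM information_schema.tables
              WHERE table_schema = 'slurm_acct_db';"

    # Then size the VM's RAM so that MariaDB's InnoDB buffer pool
    # (innodb_buffer_pool_size) is at least that large.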

(3b) Update the Slurm version, to allow use of
a burst buffer of very fast NVMe SSDs for applications that
do heavy random-read I/O (that is, high IOPS),
such as machine learning.

(3b-i) Update the Slurm version to 22 (it's currently
at version 20, and it's unwise to upgrade Slurm by
more than 2 versions at a time).

Historically, this procedure took many hours to complete
when the Slurm database was on spinning disk.

But once the Slurm database is entirely in RAM, we expect
this procedure to go much faster.

(We plan to do some pre-testing of that, to get a feel for
how long we should expect this upgrade to take.)

(3b-ii) If time permits, update Slurm to the most recent
stable version (version 23).
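
(For context, here is a minimal sketch of the checks around an upgrade
like this; the package upgrade mechanism itself varies by site and
isn't shown:)

    # Report the currently installed Slurm version before the upgrade:
    sinfo --version
    scontrol version

    # Slurm's documented upgrade order: upgrade slurmdbd first,
    # then slurmctld, then the slurmd daemons on the compute nodes.

    # After the packages are upgraded, confirm the new version:
    sinfo --version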

(3c) Raise the cap on the priority boost that Slurm gives jobs
for time spent pending, to ensure that long-pending jobs
continue to increase their priority, so that they don't get
stuck pending forever.
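
(If this refers to Slurm's multifactor priority plugin, which is an
assumption on our part, the relevant knobs live in slurm.conf; the
values below are placeholders, not OSCER's actual settings:)

    # In slurm.conf:
    #   PriorityType=priority/multifactor
    #   PriorityWeightAge=10000   # weight of time-spent-pending in priority
    #   PriorityMaxAge=30-0       # age keeps accruing for up to 30 days
    # Then apply the change without a restart:
    scontrol reconfigure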

As always, we apologize for the inconvenience --
our goal is to make OSCER resources even better.

The OSCER Team ([log in to unmask])
