Sender: OSCER users <[log in to unmask]>
Date: Mon, 17 Jul 2023 11:18:23 -0500
Reply-To: Henry Neeman <[log in to unmask]>
MIME-Version: 1.0
Message-ID: <[log in to unmask]>
In-Reply-To: <[log in to unmask]>
Content-Type: text/plain; charset=US-ASCII
From: Henry Neeman <[log in to unmask]>
Parts/Attachments: text/plain (113 lines)
OSCER users,


OSCER scheduled maintenance outage Wed July 19 8am-midnight CT

NOTE: Jobs submitted now that request more runtime than
there's time left before the maintenance outage starts
will be held back until after the maintenance outage ends.

For example, if you submit a job right now that requests
48 hours, the earliest it could be *guaranteed* to end
would be Wednesday a bit after 11:00am, during the
maintenance outage -- so that job wouldn't start until
after the maintenance outage ends.
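The arithmetic above can be sketched in a few lines of Python. This is an illustrative sketch, not an OSCER tool: the helper name `would_be_held` is made up, and the times are taken from this message (submitted Mon Jul 17 ~11:18am CT; outage starts Wed Jul 19 8:00am CT).

```python
from datetime import datetime, timedelta

def would_be_held(submit_time, requested_runtime, outage_start):
    """Return True if the job cannot be *guaranteed* to finish
    before the outage begins, so the scheduler would hold it
    until after the outage ends."""
    earliest_guaranteed_end = submit_time + requested_runtime
    return earliest_guaranteed_end > outage_start

# Times from this announcement (Central Time).
submit = datetime(2023, 7, 17, 11, 18)
outage = datetime(2023, 7, 19, 8, 0)

print(would_be_held(submit, timedelta(hours=48), outage))  # True: held until after the outage
print(would_be_held(submit, timedelta(hours=24), outage))  # False: could start before the outage
```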

If you have a job *PENDING* and you want to reduce
its runtime request, to make it possible for that job to
start before the maintenance outage, you can do this:

scontrol update JobId=####### TimeLimit=DD-HH:MM:SS

where (a) you replace ####### with the job ID number, and
(b) you replace DD-HH:MM:SS with the number of days,
hours, minutes, and seconds, each as a 2-digit number.

For example,

scontrol update JobId=123456 TimeLimit=1-00:00:00

would change the requested runtime for job 123456 to
1 day.
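If you'd rather not build the DD-HH:MM:SS string by hand, it can be computed. This is a hypothetical helper (the name `slurm_time_limit` is made up), sketching the conversion described above:

```python
from datetime import timedelta

def slurm_time_limit(duration):
    """Format a timedelta as a DD-HH:MM:SS string for
    scontrol update's TimeLimit option."""
    total = int(duration.total_seconds())
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days:02d}-{hours:02d}:{minutes:02d}:{seconds:02d}"

print(slurm_time_limit(timedelta(days=1)))                # 01-00:00:00
print(slurm_time_limit(timedelta(hours=36, minutes=30)))  # 01-12:30:00
```

(Slurm also accepts a 1-digit day count, as in the 1-00:00:00 example above; the 2-digit form works the same way.)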

If you ran that command more than 24 hours before
the start of the maintenance outage, it would
in principle allow (but NOT guarantee) the job to
start (and end) before the maintenance outage.

NOTE: If a job is already running, scontrol typically
won't work.



On Thu, 13 Jul 2023, Henry Neeman wrote:

>OSCER users,
>
>OSCER scheduled maintenance outage Wed July 19 8am-midnight CT
>
>Affected systems: supercomputer, OURdisk, OURRstore
>
>(1) OURRstore: Fix the mount on dtn2.
>
>(2) OURdisk: Begin testing auto-compression, to
>reduce the rate of physical disk capacity consumption.
>
>(3) Supercomputer: Slurm batch scheduler
>
>(3a) Move the Slurm batch job database to faster hardware.
>
>Before the outage:
>(3a-i) Install additional RAM capacity in the physical servers
>that Slurm runs on (already completed).
>(3a-ii) Move the Slurm database from slow spinning disk to
>fast SSD.
>
>During the outage:
>(3a-iii) Adjust the Slurm virtual machine to include
>enough RAM to fit the entire database.
>
>(3b) Update the Slurm version, to allow use of
>a burst buffer of very fast NVMe SSDs for applications that
>do heavy random read I/O operations (IOPS),
>such as machine learning.
>
>(3b-i) Update the Slurm version to 22 (it's currently
>at version 20, and it's unwise to upgrade Slurm by
>more than 2 versions at a time).
>Historically, this procedure took many hours to complete,
>when the Slurm database was on spinning disk.
>But, once the Slurm database is entirely in RAM, we expect
>that this procedure will go much faster.
>(We plan to do some pre-testing of that, to get a feel for
>how long we should expect this upgrade to take.)
>
>(3b-ii) If time permits, update Slurm to the most recent
>stable version (# 23).
>
>(3c) Set Slurm's maximum possible pending duration priority
>plus-up much higher, to ensure that long-pending jobs
>continue to increase their priority, so that they don't get
>stuck pending forever.
>
>As always, we apologize for the inconvenience --
>our goal is to make OSCER resources even better.
>
>The OSCER Team ([log in to unmask])