OSCER users,

Regarding our recent difficulties with OURdisk and /scratch: We're working on getting several improvements and fixes in place.

Happily, all of our near term tasks can be done with hardware that we already have on hand -- so we don't have to wait weeks or months for hardware delivery.

Please note that, until a particular capability is fully tested, we can't know whether it'll work the way we want, nor how fast it'll be.

See below for details.

---

I. WHAT WENT WRONG?

I.A. Large numbers of files open at the same time
I.B. A compatibility bug between Ceph versions 15 and 16
I.C. Network switch firmware bug

II. WHAT ARE WE DOING TO FIX THESE PROBLEMS?

II.A. Near term improvements (NO new hardware needed)
II.A.1. Network switch firmware upgrade
II.A.2. Shifting off of the "ceph-fuse" client
II.A.3. Reconfiguring OURdisk's metadata, monitoring and S3 gateway servers
II.A.4. Slurm burst buffer
II.A.5. FS-cache

II.B. Medium term improvements (new hardware IS needed)
II.B.1. Additional RAM in some metadata servers
II.B.2. More redundancy in monitoring and S3 gateway servers
II.B.3. Multiple FS-cache servers

===

I. WHAT WENT WRONG?

I.A. Large numbers of files open at the same time
I.B. Compatibility bug between Ceph versions 15 and 16
I.C. Network switch firmware bug

---

I.A. Large numbers of files open at the same time

Recently, we've started to experience large numbers of files being open at the same time -- zillions of such files -- sometimes opened by a single job.

On OURdisk's metadata servers, each open file consumes a small amount of RAM. But a small amount of RAM, times zillions of files open at the same time, can overconsume the RAM of OURdisk's metadata servers. This can disrupt OURdisk services.

---

I.B. Compatibility bug between Ceph versions 15 and 16

OURdisk uses an open source parallel filesystem named "Ceph".
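(For the curious: on a Ceph system, the standard Ceph CLI can show which client version a given node is running, and which versions the cluster's daemons are running. The second command needs cluster admin credentials, so it's for our staff rather than for users.)

```shell
# Version of the Ceph client software installed on this node:
ceph --version

# Versions of all the daemons in the cluster (admin credentials required);
# on a mixed cluster, this is where a 15-vs-16 split shows up:
ceph versions
```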
We currently run Ceph version 15 ("Octopus") on our compute nodes, and Ceph version 16 ("Pacific") on OURdisk's diskfull, metadata, monitoring and S3 gateway servers (see below for why).

Octopus and Pacific are *supposed* to be compatible, but there's a compatibility bug between them that has recently started to get triggered. We're exposed to this bug because:

(i) on OURdisk's diskfull, metadata, monitoring and S3 gateway servers, we run Ceph Pacific (16), and for Linux we run CentOS 8, BUT

(ii) on our supercomputer's compute nodes, we have to run Ceph Octopus (15), and for Linux we run CentOS 7.

On OURdisk's Ceph servers, we do (i) because we need the features and bug fixes in CentOS 8 and Ceph Pacific.

But on our supercomputer's compute nodes, we have to run Ceph Octopus (15), because Octopus IS compatible with CentOS 7, whereas Pacific (16) *ISN'T*. So we can't run Ceph Pacific on our compute nodes.

In principle, we could upgrade our compute nodes to CentOS 8 or 9. But that would be very labor intensive -- and we'd rather focus that labor on moving to our new supercomputer, Sooner, instead of upgrading our old one, Schooner.

---

I.C. Network switch firmware bug

There's a bug in the firmware of some of our internal, dedicated network switches that needs to get fixed.

===

II. WHAT ARE WE DOING TO FIX THESE PROBLEMS?

II.A. Near term improvements (NO new hardware needed)

II.A.1. Network switch firmware upgrade
II.A.2. Shifting off of the "ceph-fuse" client
II.A.3. Reconfiguring OURdisk's metadata, monitoring and S3 gateway servers
II.A.4. Slurm burst buffer
II.A.5. FS-cache

---

II.A.1. Network switch firmware upgrade

Dell support personnel have told us that the firmware version on some of our Dell network switches has been seen to cause similar network problems at other institutions. While that's not absolute proof that a firmware upgrade will address our network issue, the probability is favorable.
So we're going to schedule a maintenance outage to upgrade that firmware, which will require NO new hardware. We hope that'll resolve the network bug, but of course we don't yet know that for sure.

We've extensively examined the network paths between OURdisk and our compute nodes, and we haven't found any issues there.

---

II.A.2. Shifting off of the "ceph-fuse" client

Some of the problems we're experiencing, especially the Octopus/Pacific incompatibility bug, can be resolved by shifting off of the "ceph-fuse" client software that we're currently running on many of our supercomputer's compute nodes. That shift is already underway, and requires NO new hardware.

---

II.A.3. Reconfiguring OURdisk's metadata, monitoring and S3 gateway servers

Based on a discussion with a 3rd party Ceph expert:

* 3 of OURdisk's 5 metadata servers will each (initially) host 2 metadata server "instances" of different types (active, standby, or standby/replay), using hardware that's already on hand. For example, an OURdisk physical metadata server might host one active and one standby instance.

* OURdisk's 5 monitoring servers will shift to dedicated physical servers, using hardware that's already on hand.

* OURdisk's 4 object storage "S3" gateway servers will shift to dedicated physical servers, using hardware that's already on hand.

---

II.A.4. Slurm burst buffer

OURdisk's metadata servers need to be protected from having zillions of files open at the same time, to avoid these issues going forward.

Slurm supports use of a "burst buffer" to do I/O on a specific, designated filesystem during live number crunching, even though the files in question are supposed to reside elsewhere. For example, imagine a small filesystem made of fast SSDs, in addition to our large filesystem made of spinning hard drives.

(We already have a server with 16 x 3.2 TB NVMe SSDs and 8 InfiniBand ports, intended for exactly this purpose, so NO new hardware will be needed.)
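Concretely, a burst-buffer job script might look something like the sketch below. All directive names, paths, and the $BB_DIR variable are hypothetical; the real syntax will depend on which Slurm burst buffer plugin we end up deploying, and we'll document it before the capability goes live.

```shell
#!/bin/bash
#SBATCH --partition=normal
#SBATCH --ntasks=1
#SBATCH --time=12:00:00
#
# Burst buffer directives -- ILLUSTRATIVE ONLY. Different Slurm burst
# buffer plugins use different prefixes (e.g. "#BB" for the generic
# plugin, "#DW" for DataWarp), and the paths below are made up.
#
#BB stage_in  source=/ourdisk/hpc/mygroup/inputs  destination=$BB_DIR/inputs
#BB stage_out source=$BB_DIR/outputs              destination=/ourdisk/hpc/mygroup/outputs

# During the job, read and write on the small fast SSD filesystem,
# not directly on OURdisk:
my_program --input "$BB_DIR/inputs" --output "$BB_DIR/outputs"
```

The point of the sketch: your job's live I/O touches only the SSD filesystem, and OURdisk is touched only during the controlled copy-in and copy-out.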
By adding just a few new batch directives to your batch script file, you'll be able to tell Slurm:

* to copy your input files to the small fast SSD filesystem;
* to do your live number crunching on the small fast SSD filesystem;
* to copy your output files from the small fast SSD filesystem to the large spinning disk filesystem.

For example, you might tell Slurm to:

* for your job's input, use a certain set of files (or directories) on OURdisk or /scratch;
* auto-copy those files to the small fast SSD filesystem before starting the job;
* while running your job, output to the small fast SSD filesystem;
* at the end of your job, auto-copy the output files from the small fast SSD filesystem to OURdisk or /scratch.

The burst buffer will use a separate software technology as its own internal filesystem. So, when you use the burst buffer, all the I/O during your job will bypass OURdisk, and therefore won't bog down the OURdisk metadata servers. And we can limit the number of files open at the same time during the copy-in at the beginning and the copy-out at the end.

Please note that the burst buffer will only be available for people doing lots of IOPS, especially read IOPS -- not for people who are writing to or reading from a few large files. (IOPS = I/O operations per second -- that is, zillions of tiny reads and writes.)

People who do "sustained" reads and writes with a few large files won't stress OURdisk's metadata servers, because they won't open zillions of files at the same time. And using the burst buffer would actually be slower for them. So we don't expect those folks to get value from our burst buffer.

---

II.A.5. FS-cache

Recapping: OURdisk's metadata servers need to be protected from having zillions of files open at the same time, to avoid these issues going forward.

We've identified a technology, FS-cache ("FileSystem Cache"), that we can put "in front of" OURdisk, for users who open zillions of files at the same time.
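(For background, assuming this is the Linux kernel facility of that name: FS-Cache is the kernel's generic caching layer for network filesystems, typically backed by the cachefilesd daemon and a local SSD. A rough sketch of how a client opts in follows -- the server name, paths, and config values are illustrative, and our actual deployment may look quite different.)

```shell
# /etc/cachefilesd.conf on the client -- the cache lives on a local SSD:
#   dir /var/cache/fscache
#   tag myfscache

# Start the local cache daemon:
systemctl enable --now cachefilesd

# Mount a network filesystem with the "fsc" option, so file data is
# cached on the local SSD instead of hitting the backend every time:
mount -t nfs -o fsc fs-cache-server:/export/scratch /mnt/cached-scratch
```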
That way, users who are running codes that open zillions of files at the same time will open those files on FS-cache instead of on OURdisk. (Those files will auto-migrate between OURdisk and FS-cache, but with far fewer files on OURdisk open at the same time.)

FS-cache will use a separate software technology as its own internal filesystem. That means the zillions of open files won't affect OURdisk's metadata servers -- or at least not nearly as much. And, by deploying FS-cache on servers full of SSDs, we can provide not only high bandwidth, but especially high IOPS.

Shortly, we'll deploy and start testing FS-cache on some servers and SSDs that we already have on hand (separate hardware from the burst buffer). Once our initial FS-cache testing and benchmarking is complete, we'll test putting FS-cache into production for y'all with that same hardware.

---

II.B. Medium term improvements (new hardware IS needed)

II.B.1. Additional RAM in some metadata servers
II.B.2. More redundancy in monitoring and S3 gateway servers
II.B.3. Multiple FS-cache servers

---

II.B.1. Additional RAM in some metadata servers

We'll purchase additional RAM to put in some of OURdisk's metadata servers. That extra RAM is needed for recovery from certain kinds of problems (like some of the recent issues OURdisk has experienced).

---

II.B.2. More redundancy in monitoring and S3 gateway servers

For OURdisk's monitoring servers and object storage S3 gateway servers, we're using hardware that we already have on hand. Two of the servers we plan to use have dual operating system hard drives, dual data SSDs, and dual power supplies. But most of those servers have a single drive and a single power supply, which reduces resiliency per such server.

So, we'll upgrade each such server to dual OS disk drives (which we have on hand), dual data drives (which we have on hand), and dual power supplies and power cords (which we'll have to buy).

---

II.B.3.
Multiple FS-cache servers

If FS-cache is successful, then we'll also purchase additional hardware (servers and SSDs), so we can deploy several different FS-cache configurations, providing various levels of speed and capacity.

This approach is expected to provide high aggregate capacity, high aggregate speed (both IOPS and bandwidth), and high resiliency to failures of individual hardware components. In particular, if one FS-cache subsystem fails, its users will auto-failover to another FS-cache subsystem.

In addition, we'll pair up FS-cache servers and synchronously mirror them, for maximum resiliency for files that have been written to FS-cache but haven't yet drained off to OURdisk. That is, if an FS-cache server fails, its mirrored twin will typically continue, so that server's FS-cache service -- and the files that are on that server's SSDs but not yet on OURdisk -- will still be fully accessible.