To clarify one point: Ceph is excellent at handling I/O, as long as the number of files open at the same time is modest. So, research software that does sustained reads and/or writes on a few large files gets lots of value out of Ceph.

But Ceph seems to be more sensitive to having large numbers of files open at the same time -- which typically only happens during heavy IOPS (zillions of tiny reads and writes).

By shifting IOPS-heavy jobs to separate storage subsystems, we expect to handle both kinds of I/O well -- everything should run faster, and everything should be more resilient. And we expect that'll require very limited changes to your workflow and batch scripts -- low labor cost to you!

Henry

----------

On Wed, 2 Nov 2022, Henry Neeman wrote:

>OSCER users,
>
>Regarding our recent difficulties with OURdisk and /scratch:
>
>We're working on getting several improvements and fixes in place.
>
>Happily, all of our near term tasks can be done with hardware that we already have on hand -- so we don't have to wait weeks or months for hardware delivery.
>
>Please note that, until a particular capability is fully tested, we can't know whether it'll work the way we want, nor how fast it'll be.
>
>See below for details.
>
>---
>
>I. WHAT WENT WRONG?
>
>I.A. Large numbers of files open at the same time
>I.B. Compatibility bug between Ceph versions 15 and 16
>I.C. Network switch firmware bug
>
>II. WHAT ARE WE DOING TO FIX THESE PROBLEMS?
>
>II.A. Near term improvements (NO new hardware needed)
>
>II.A.1. Network switch firmware upgrade
>II.A.2. Shifting off of the "ceph-fuse" client
>II.A.3. Reconfiguring OURdisk's metadata, monitoring and S3 gateway servers
>II.A.4. Slurm burst buffer
>II.A.5. FS-cache
>
>II.B. Medium term improvements (new hardware IS needed)
>
>II.B.1. Additional RAM in some metadata servers
>II.B.2. More redundancy in monitoring and S3 gateway servers
>II.B.3. Multiple FS-cache servers
>
>===
>
>I. WHAT WENT WRONG?
>
>I.A. Large numbers of files open at the same time
>I.B. Compatibility bug between Ceph versions 15 and 16
>I.C. Network switch firmware bug
>
>---
>
>I.A. Large numbers of files open at the same time
>
>Recently, we've started to experience large numbers of files being open at the same time -- zillions of such files -- sometimes opened by a single job.
>
>On OURdisk's metadata servers, each open file consumes a small amount of RAM.
>
>But a small amount of RAM, times zillions of files that are open at the same time, can overconsume the RAM of OURdisk's metadata servers.
>
>This can cause disruption of OURdisk services.
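>To make that concrete, with purely illustrative numbers (these are NOT OURdisk's actual per-file cost or RAM sizes): if each open file pins roughly 4 KB of RAM on a metadata server, then 50 million simultaneously open files pin roughly
>
>  50,000,000 files x 4 KB/file = 200 GB of RAM
>
>which can easily exceed what a metadata server has available.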
>---
>
>I.B. Compatibility bug between Ceph versions 15 and 16
>
>OURdisk uses an open source parallel filesystem named "Ceph".
>
>We currently run Ceph version 15 ("Octopus") on our compute nodes, and Ceph version 16 ("Pacific") on OURdisk's diskfull, metadata, monitoring and S3 gateway servers (see below for why).
>
>Octopus and Pacific are *supposed* to be compatible, but there's a compatibility bug between them that has recently started to get triggered.
>
>We run this mixed-version setup because:
>
>(i) on OURdisk's diskfull, metadata, monitoring and S3 gateway servers, we run Ceph Pacific (16), and for Linux we run CentOS 8,
>
>BUT
>
>(ii) on our supercomputer's compute nodes, we have to run Ceph Octopus (15), and for Linux we run CentOS 7.
>
>On OURdisk's Ceph servers, we do (i) because we need the features and bug fixes in CentOS 8 and Ceph Pacific.
>
>But on our supercomputer's compute nodes, we have to run Ceph Octopus (15), because Ceph Octopus IS compatible with CentOS 7, whereas Ceph Pacific (16) *ISN'T*. So we can't run Ceph Pacific on our compute nodes.
>
>In principle, we could upgrade our compute nodes to CentOS 8 or 9. But that would be very labor intensive -- and we'd rather focus our labor effort on moving to our new supercomputer, Sooner, than on upgrading our old one, Schooner.
>
>---
>
>I.C. Network switch firmware bug
>
>There's a bug in some of our internal, dedicated network switches that needs to get fixed.
>
>===
>
>II. WHAT ARE WE DOING TO FIX THESE PROBLEMS?
>
>II.A. Near term improvements (NO new hardware needed)
>
>II.A.1. Network switch firmware upgrade
>II.A.2. Shifting off of the "ceph-fuse" client
>II.A.3. Reconfiguring OURdisk's metadata, monitoring and S3 gateway servers
>II.A.4. Slurm burst buffer
>II.A.5. FS-cache
>
>---
>
>II.A.1. Network switch firmware upgrade
>
>Dell support personnel have told us that the firmware version on some of our Dell network switches has been seen to cause similar network problems at other institutions.
>
>While that's not absolute proof that a firmware upgrade will address our network issue, the odds are favorable. So we're going to schedule a maintenance outage to upgrade that firmware, which will require NO new hardware.
>
>We hope that'll resolve the network bug, but of course we don't yet know that for sure.
>
>(We've also extensively examined the network paths between OURdisk and our compute nodes, and we haven't found any issues there.)
>
>---
>
>II.A.2. Shifting off of the "ceph-fuse" client
>
>Some of the problems we're experiencing, especially regarding the Octopus/Pacific incompatibility bug, can be resolved by shifting off of the "ceph-fuse" client software that we're currently running on many of our supercomputer's compute nodes.
>
>That shift is already underway, and requires NO new hardware.
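>(For the technically curious: shifting off of ceph-fuse means mounting CephFS with the Linux kernel client instead of the FUSE userspace client, which generally has higher per-operation overhead. Purely as an illustration -- the monitor address, mount point and credential file below are placeholders, not our real settings -- the difference looks roughly like this:
>
>  # FUSE userspace client (what we're moving away from):
>  ceph-fuse -m mon1.example.edu:6789 /ourdisk
>
>  # Linux kernel client (what we're moving toward):
>  mount -t ceph mon1.example.edu:6789:/ /ourdisk -o name=ourdisk,secretfile=/etc/ceph/ourdisk.secret
>
>On our compute nodes, this change happens behind the scenes, so your paths shouldn't change.)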
>---
>
>II.A.3. Reconfiguring OURdisk's metadata, monitoring and S3 gateway servers
>
>Based on a discussion with a 3rd party Ceph expert:
>
>* 3 of OURdisk's 5 metadata servers will each (initially) host 2 metadata server "instances," of different types (active, standby, or standby/replay), using hardware that's already on hand. For example, an OURdisk physical metadata server might host one active and one standby instance.
>
>* OURdisk's 5 monitoring servers will shift to dedicated physical servers, using hardware that's already on hand.
>
>* OURdisk's 4 object storage "S3" gateway servers will shift to dedicated physical servers, using hardware that's already on hand.
>
>---
>
>II.A.4. Slurm burst buffer
>
>OURdisk's metadata servers need to be protected from having zillions of files open at the same time, to avoid these issues going forward.
>
>Slurm supports use of a "burst buffer": a specific, designated filesystem where your job does its I/O during live number crunching, even though the files in question normally reside elsewhere.
>
>For example, imagine a small filesystem made of fast SSDs, in addition to our large filesystem made of spinning hard drives. (We already have a server with 16 x 3.2 TB NVMe SSDs and 8 InfiniBand ports, intended for exactly this purpose, so NO new hardware will be needed.)
>
>By adding just a few new batch directives to your batch script file (see the example at the end of this section), you'll be able to tell Slurm:
>
>* which input files (or directories) on OURdisk or /scratch your job needs;
>
>* to auto-copy those files to the small fast SSD filesystem before starting your job;
>
>* to do your live number crunching -- including writing your output -- on the small fast SSD filesystem;
>
>* at the end of your job, to auto-copy your output files from the small fast SSD filesystem back to OURdisk or /scratch.
>
>The burst buffer will use a separate software technology as its own internal filesystem. So, when you use the burst buffer, all the I/O during your job will bypass OURdisk, and therefore won't bog down the OURdisk metadata servers. And we can limit the number of files open at the same time during the copy-in at the beginning and the copy-out at the end.
>
>Please note that the burst buffer will only be available for people doing lots of IOPS, especially read IOPS -- not for people who are writing to or reading from a few large files.
>
>(IOPS = I/O operations per second -- that is, zillions of tiny reads and writes.)
>
>People who do "sustained" reads and writes on a few large files won't strain OURdisk's metadata servers, because they won't open zillions of files at the same time. And using the burst buffer would actually be slower for them. So we don't expect those folks to get value from our burst buffer.
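>As a purely hypothetical illustration -- the exact directives will depend on which Slurm burst buffer plugin we deploy, and the paths and program name below are placeholders -- a burst buffer batch script might look roughly like this (DataWarp-style syntax):
>
>  #!/bin/bash
>  #SBATCH --ntasks=1
>  #SBATCH --time=12:00:00
>  # Request burst buffer space, and stage files in and out:
>  #DW jobdw type=scratch access_mode=striped capacity=500GiB
>  #DW stage_in source=/ourdisk/hpc/yourgroup/inputs destination=$DW_JOB_STRIPED/inputs type=directory
>  #DW stage_out source=$DW_JOB_STRIPED/outputs destination=/ourdisk/hpc/yourgroup/outputs type=directory
>
>  # Do the number crunching on the fast SSD filesystem,
>  # not on OURdisk:
>  cd $DW_JOB_STRIPED
>  mkdir -p outputs
>  ./my_code inputs outputs
>
>Again, that's illustrative, not final syntax; we'll announce the real directives once the burst buffer is in production.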
>---
>
>II.A.5. FS-cache
>
>Recapping: OURdisk's metadata servers need to be protected from having zillions of files open at the same time, to avoid these issues going forward.
>
>We've identified a technology, FS-cache ("FileSystem Cache"), that we can put "in front of" OURdisk, for users who open zillions of files at the same time.
>
>That way, users running codes that open zillions of files at the same time will open those files on FS-cache instead of on OURdisk. (The files will auto-migrate between OURdisk and FS-cache, but with far fewer files on OURdisk open at the same time.)
>
>FS-cache will use a separate software technology as its own internal filesystem. That means the zillions of open files won't affect OURdisk's metadata servers -- or at least not nearly as much.
>
>And, by deploying FS-cache on servers full of SSDs, we can provide not only high bandwidth, but especially high IOPS.
>
>Shortly, we'll deploy and start testing FS-cache on some servers and SSDs that we already have on hand (separate hardware from the burst buffer).
>
>Once our initial FS-cache testing and benchmarking is complete, we'll test putting FS-cache into production for y'all on that same hardware.
>
>---
>
>II.B. Medium term improvements (new hardware IS needed)
>
>II.B.1. Additional RAM in some metadata servers
>II.B.2. More redundancy in monitoring and S3 gateway servers
>II.B.3. Multiple FS-cache servers
>
>---
>
>II.B.1. Additional RAM in some metadata servers
>
>We'll purchase additional RAM to put in some of OURdisk's metadata servers. That RAM is needed for recovery from certain kinds of problems, like some of the recent issues OURdisk has experienced.
>
>---
>
>II.B.2. More redundancy in monitoring and S3 gateway servers
>
>For OURdisk's monitoring servers and object storage S3 gateway servers, we're using hardware that we already have on hand.
>
>Two of the servers we plan to use have dual operating system hard drives, dual data SSDs, and dual power supplies. But most of those servers have single drives and a single power supply, which reduces per-server resiliency.
>
>So, we'll upgrade each such server to dual OS disk drives (which we have on hand), dual data drives (which we have on hand), and dual power supplies and power cords (which we'll have to buy).
>
>---
>
>II.B.3. Multiple FS-cache servers
>
>If FS-cache is successful, then we'll also purchase additional hardware (servers and SSDs), to deploy several different FS-cache configurations, providing various levels of speed and capacity.
>
>This approach is expected to provide high aggregate capacity, high aggregate speed (both IOPS and bandwidth), and high resiliency to failures of individual hardware components. In particular, if one FS-cache subsystem fails, the users on it will auto-failover to another FS-cache subsystem.
>
>In addition, we'll pair up FS-cache servers and synchronously mirror them, for maximum resiliency for files that have been written to FS-cache but haven't yet drained off to OURdisk.
>
>That is, if an FS-cache server fails, its mirrored twin will typically continue, so that server's FS-cache service -- and the files that are on its SSDs but not yet on OURdisk -- will still be fully accessible.