To clarify one point: Ceph is excellent at handling I/O, as long as the number of files open at the same time is modest. So, research software that does sustained reads and/or writes on a few large files gets lots of value out of Ceph.

But Ceph seems to be more sensitive to having large numbers of files open at the same time -- which typically only happens during heavy IOPS (zillions of tiny reads and writes).

By shifting IOPS-heavy jobs to separate storage subsystems, we expect to handle both kinds of I/O well -- everything should run faster, and everything should be more resilient. And we expect that'll require very limited changes to your workflow and batch scripts -- low labor cost to you!

Henry

----------

On Wed, 2 Nov 2022, Henry Neeman wrote:

>OSCER users,
>
>Regarding our recent difficulties with OURdisk and /scratch:
>
>We're working on getting several improvements and fixes in place.
>
>Happily, all of our near term tasks can be done with hardware that we already have on hand -- so we don't have to wait weeks or months for hardware delivery.
>
>Please note that, until a particular capability is fully tested, we can't know whether it'll work the way we want, nor how fast it'll be.
>
>See below for details.
>
>---
>
>I. WHAT WENT WRONG?
>
>I.A. Large numbers of files open at the same time
>I.B. Compatibility bug between Ceph versions 15 and 16
>I.C. Network switch firmware bug
>
>II. WHAT ARE WE DOING TO FIX THESE PROBLEMS?
>
>II.A. Near term improvements (NO new hardware needed)
>
>II.A.1. Network switch firmware upgrade
>II.A.2. Shifting off of the "ceph-fuse" client
>II.A.3. Reconfiguring OURdisk's metadata, monitoring and S3 gateway servers
>II.A.4. Slurm burst buffer
>II.A.5. FS-cache
>
>II.B. Medium term improvements (new hardware IS needed)
>
>II.B.1. Additional RAM in some metadata servers
>II.B.2. More redundancy in monitoring and S3 gateway servers
>II.B.3. Multiple FS-cache servers
>
>===
>
>I. WHAT WENT WRONG?
>
>I.A. Large numbers of files open at the same time
>I.B. Compatibility bug between Ceph versions 15 and 16
>I.C. Network switch firmware bug
>
>---
>
>I.A. Large numbers of files open at the same time
>
>Recently, we've started to experience large numbers of files being open at the same time -- zillions of such files -- sometimes opened by a single job.
>
>On OURdisk's metadata servers, each open file consumes a small amount of RAM.
>
>But a small amount of RAM, times zillions of files that are open at the same time, can overconsume the RAM of OURdisk's metadata servers.
>
>This can cause disruption of OURdisk services.
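>To make that concrete, with purely illustrative numbers (these are NOT OURdisk's actual per-file cost or RAM sizes): if each open file pins roughly 4 KB of RAM on a metadata server, then 50 million simultaneously open files pin roughly
>
>  50,000,000 files x 4 KB/file = 200 GB of RAM
>
>which can easily exceed what a metadata server has available.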
>---
>
>I.B. Compatibility bug between Ceph versions 15 and 16
>
>OURdisk uses an open source parallel filesystem named "Ceph".
>
>We currently run Ceph version 15 ("Octopus") on our compute nodes, and Ceph version 16 ("Pacific") on OURdisk's diskfull, metadata, monitoring and S3 gateway servers (see below for why).
>
>Octopus and Pacific are *supposed* to be compatible, but there's a compatibility bug between them that has recently started to get triggered.
>
>We run this mixed-version setup because:
>
>(i) on OURdisk's diskfull, metadata, monitoring and S3 gateway servers, we run Ceph Pacific (16), and for Linux we run CentOS 8,
>
>BUT
>
>(ii) on our supercomputer's compute nodes, we have to run Ceph Octopus (15), and for Linux we run CentOS 7.
>
>On OURdisk's Ceph servers, we do (i) because we need the features and bug fixes in CentOS 8 and Ceph Pacific.
>
>But on our supercomputer's compute nodes, we have to run Ceph Octopus (15), because Ceph Octopus IS compatible with CentOS 7, whereas Ceph Pacific (16) *ISN'T*. So we can't run Ceph Pacific on our compute nodes.
>
>In principle, we could upgrade our compute nodes to CentOS 8 or 9. But that would be very labor intensive -- and we'd rather focus our labor effort on moving to our new supercomputer, Sooner, than on upgrading our old one, Schooner.
>
>---
>
>I.C. Network switch firmware bug
>
>There's a bug in some of our internal, dedicated network switches that needs to get fixed.
>
>===
>
>II. WHAT ARE WE DOING TO FIX THESE PROBLEMS?
>
>II.A. Near term improvements (NO new hardware needed)
>
>II.A.1. Network switch firmware upgrade
>II.A.2. Shifting off of the "ceph-fuse" client
>II.A.3. Reconfiguring OURdisk's metadata, monitoring and S3 gateway servers
>II.A.4. Slurm burst buffer
>II.A.5. FS-cache
>
>---
>
>II.A.1. Network switch firmware upgrade
>
>Dell support personnel have told us that the firmware version on some of our Dell network switches has been seen to cause similar network problems at other institutions.
>
>While that's not absolute proof that a firmware upgrade will address our network issue, the odds are favorable. So we're going to schedule a maintenance outage to upgrade that firmware, which will require NO new hardware.
>
>We hope that'll resolve the network bug, but of course we don't yet know that for sure.
>
>(We've also extensively examined the network paths between OURdisk and our compute nodes, and we haven't found any issues there.)
>
>---
>
>II.A.2. Shifting off of the "ceph-fuse" client
>
>Some of the problems we're experiencing, especially regarding the Octopus/Pacific incompatibility bug, can be resolved by shifting off of the "ceph-fuse" client software that we're currently running on many of our supercomputer's compute nodes.
>
>That shift is already underway, and requires NO new hardware.
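>(For the technically curious: shifting off of ceph-fuse means mounting CephFS with the Linux kernel client instead of the FUSE userspace client, which generally has higher per-operation overhead. Purely as an illustration -- the monitor address, mount point and credential file below are placeholders, not our real settings -- the difference looks roughly like this:
>
>  # FUSE userspace client (what we're moving away from):
>  ceph-fuse -m mon1.example.edu:6789 /ourdisk
>
>  # Linux kernel client (what we're moving toward):
>  mount -t ceph mon1.example.edu:6789:/ /ourdisk -o name=ourdisk,secretfile=/etc/ceph/ourdisk.secret
>
>On our compute nodes, this change happens behind the scenes, so your paths shouldn't change.)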
>---
>
>II.A.3. Reconfiguring OURdisk's metadata, monitoring and S3 gateway servers
>
>Based on a discussion with a 3rd party Ceph expert:
>
>* 3 of OURdisk's 5 metadata servers will each (initially) host 2 metadata server "instances," of different types (active, standby, or standby/replay), using hardware that's already on hand. For example, an OURdisk physical metadata server might host one active and one standby instance.
>
>* OURdisk's 5 monitoring servers will shift to dedicated physical servers, using hardware that's already on hand.
>
>* OURdisk's 4 object storage "S3" gateway servers will shift to dedicated physical servers, using hardware that's already on hand.
>
>---
>
>II.A.4. Slurm burst buffer
>
>OURdisk's metadata servers need to be protected from having zillions of files open at the same time, to avoid these issues going forward.
>
>Slurm supports use of a "burst buffer": a specific, designated filesystem where your job does its I/O during live number crunching, even though the files in question normally reside elsewhere.
>
>For example, imagine a small filesystem made of fast SSDs, in addition to our large filesystem made of spinning hard drives. (We already have a server with 16 x 3.2 TB NVMe SSDs and 8 InfiniBand ports, intended for exactly this purpose, so NO new hardware will be needed.)
>
>By adding just a few new batch directives to your batch script file (see the example at the end of this section), you'll be able to tell Slurm:
>
>* which input files (or directories) on OURdisk or /scratch your job needs;
>
>* to auto-copy those files to the small fast SSD filesystem before starting your job;
>
>* to do your live number crunching -- including writing your output -- on the small fast SSD filesystem;
>
>* at the end of your job, to auto-copy your output files from the small fast SSD filesystem back to OURdisk or /scratch.
>
>The burst buffer will use a separate software technology as its own internal filesystem. So, when you use the burst buffer, all the I/O during your job will bypass OURdisk, and therefore won't bog down the OURdisk metadata servers. And we can limit the number of files open at the same time during the copy-in at the beginning and the copy-out at the end.
>
>Please note that the burst buffer will only be available for people doing lots of IOPS, especially read IOPS -- not for people who are writing to or reading from a few large files.
>
>(IOPS = I/O operations per second -- that is, zillions of tiny reads and writes.)
>
>People who do "sustained" reads and writes on a few large files won't strain OURdisk's metadata servers, because they won't open zillions of files at the same time. And using the burst buffer would actually be slower for them. So we don't expect those folks to get value from our burst buffer.
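>As a purely hypothetical illustration -- the exact directives will depend on which Slurm burst buffer plugin we deploy, and the paths and program name below are placeholders -- a burst buffer batch script might look roughly like this (DataWarp-style syntax):
>
>  #!/bin/bash
>  #SBATCH --ntasks=1
>  #SBATCH --time=12:00:00
>  # Request burst buffer space, and stage files in and out:
>  #DW jobdw type=scratch access_mode=striped capacity=500GiB
>  #DW stage_in source=/ourdisk/hpc/yourgroup/inputs destination=$DW_JOB_STRIPED/inputs type=directory
>  #DW stage_out source=$DW_JOB_STRIPED/outputs destination=/ourdisk/hpc/yourgroup/outputs type=directory
>
>  # Do the number crunching on the fast SSD filesystem,
>  # not on OURdisk:
>  cd $DW_JOB_STRIPED
>  mkdir -p outputs
>  ./my_code inputs outputs
>
>Again, that's illustrative, not final syntax; we'll announce the real directives once the burst buffer is in production.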
>---
>
>II.A.5. FS-cache
>
>Recapping: OURdisk's metadata servers need to be protected from having zillions of files open at the same time, to avoid these issues going forward.
>
>We've identified a technology, FS-cache ("FileSystem Cache"), that we can put "in front of" OURdisk, for users who open zillions of files at the same time.
>
>That way, users running codes that open zillions of files at the same time will open those files on FS-cache instead of on OURdisk. (The files will auto-migrate between OURdisk and FS-cache, but with far fewer files on OURdisk open at the same time.)
>
>FS-cache will use a separate software technology as its own internal filesystem. That means the zillions of open files won't affect OURdisk's metadata servers -- or at least not nearly as much.
>
>And, by deploying FS-cache on servers full of SSDs, we can provide not only high bandwidth, but especially high IOPS.
>
>Shortly, we'll deploy and start testing FS-cache on some servers and SSDs that we already have on hand (separate hardware from the burst buffer).
>
>Once our initial FS-cache testing and benchmarking is complete, we'll test putting FS-cache into production for y'all on that same hardware.
>
>---
>
>II.B. Medium term improvements (new hardware IS needed)
>
>II.B.1. Additional RAM in some metadata servers
>II.B.2. More redundancy in monitoring and S3 gateway servers
>II.B.3. Multiple FS-cache servers
>
>---
>
>II.B.1. Additional RAM in some metadata servers
>
>We'll purchase additional RAM to put in some of OURdisk's metadata servers. That RAM is needed for recovery from certain kinds of problems, like some of the recent issues OURdisk has experienced.
>
>---
>
>II.B.2. More redundancy in monitoring and S3 gateway servers
>
>For OURdisk's monitoring servers and object storage S3 gateway servers, we're using hardware that we already have on hand.
>
>Two of the servers we plan to use have dual operating system hard drives, dual data SSDs, and dual power supplies. But most of those servers have single drives and a single power supply, which reduces per-server resiliency.
>
>So, we'll upgrade each such server to dual OS disk drives (which we have on hand), dual data drives (which we have on hand), and dual power supplies and power cords (which we'll have to buy).
>
>---
>
>II.B.3. Multiple FS-cache servers
>
>If FS-cache is successful, then we'll also purchase additional hardware (servers and SSDs), to deploy several different FS-cache configurations, providing various levels of speed and capacity.
>
>This approach is expected to provide high aggregate capacity, high aggregate speed (both IOPS and bandwidth), and high resiliency to failures of individual hardware components. In particular, if one FS-cache subsystem fails, the users on it will auto-failover to another FS-cache subsystem.
>
>In addition, we'll pair up FS-cache servers and synchronously mirror them, for maximum resiliency for files that have been written to FS-cache but haven't yet drained off to OURdisk.
>
>That is, if an FS-cache server fails, its mirrored twin will typically continue, so that server's FS-cache service -- and the files that are on its SSDs but not yet on OURdisk -- will still be fully accessible.