OSCER users,

Regarding our recent difficulties with OURdisk and /scratch: We're working on getting several improvements and fixes in place.

Happily, all of our near term tasks can be done with hardware that we already have on hand -- so we don't have to wait weeks or months for hardware delivery.

Please note that, until a particular capability is fully tested, we can't know whether it'll work the way we want, nor how fast it'll be.

See below for details.

---

I. WHAT WENT WRONG?

I.A. Large numbers of files open at the same time
I.B. A compatibility bug between Ceph versions 15 and 16
I.C. Network switch firmware bug

II. WHAT ARE WE DOING TO FIX THESE PROBLEMS?

II.A. Near term improvements (NO new hardware needed)
II.A.1. Network switch firmware upgrade
II.A.2. Shifting off of the "ceph-fuse" client
II.A.3. Reconfiguring OURdisk's metadata, monitoring and S3 gateway servers
II.A.4. Slurm burst buffer
II.A.5. FS-cache

II.B. Medium term improvements (new hardware IS needed)
II.B.1. Additional RAM in some metadata servers
II.B.2. More redundancy in monitoring and S3 gateway servers
II.B.3. Multiple FS-cache servers

===

I. WHAT WENT WRONG?

I.A. Large numbers of files open at the same time
I.B. Compatibility bug between Ceph versions 15 and 16
I.C. Network switch firmware bug

---

I.A. Large numbers of files open at the same time

Recently, we've started to experience large numbers of files being open at the same time -- zillions of such files -- sometimes opened by a single job.

On OURdisk's metadata servers, each open file consumes a small amount of RAM. But a small amount of RAM, times zillions of files open at the same time, can overconsume the RAM of OURdisk's metadata servers. This can disrupt OURdisk services.

---

I.B. Compatibility bug between Ceph versions 15 and 16

OURdisk uses an open source parallel filesystem named "Ceph".
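(For the curious: on a Ceph system, the standard Ceph CLI can show which client version a given node is running, and which versions the cluster's daemons are running. The second command needs cluster admin credentials, so it's for our staff rather than for users.)

```shell
# Version of the Ceph client software installed on this node:
ceph --version

# Versions of all the daemons in the cluster (admin credentials required);
# on a mixed cluster, this is where a 15-vs-16 split shows up:
ceph versions
```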
We currently run Ceph version 15 ("Octopus") on our compute nodes, and Ceph version 16 ("Pacific") on OURdisk's diskfull, metadata, monitoring and S3 gateway servers (see below for why).

Octopus and Pacific are *supposed* to be compatible, but there's a compatibility bug between them that has recently started to get triggered. We're exposed to this bug because:

(i) on OURdisk's diskfull, metadata, monitoring and S3 gateway servers, we run Ceph Pacific (16), and for Linux we run CentOS 8, BUT

(ii) on our supercomputer's compute nodes, we have to run Ceph Octopus (15), and for Linux we run CentOS 7.

On OURdisk's Ceph servers, we do (i) because we need the features and bug fixes in CentOS 8 and Ceph Pacific.

But on our supercomputer's compute nodes, we have to run Ceph Octopus (15), because Octopus IS compatible with CentOS 7, whereas Pacific (16) *ISN'T*. So we can't run Ceph Pacific on our compute nodes.

In principle, we could upgrade our compute nodes to CentOS 8 or 9. But that would be very labor intensive -- and we'd rather focus that labor on moving to our new supercomputer, Sooner, instead of upgrading our old one, Schooner.

---

I.C. Network switch firmware bug

There's a bug in the firmware of some of our internal, dedicated network switches that needs to get fixed.

===

II. WHAT ARE WE DOING TO FIX THESE PROBLEMS?

II.A. Near term improvements (NO new hardware needed)

II.A.1. Network switch firmware upgrade
II.A.2. Shifting off of the "ceph-fuse" client
II.A.3. Reconfiguring OURdisk's metadata, monitoring and S3 gateway servers
II.A.4. Slurm burst buffer
II.A.5. FS-cache

---

II.A.1. Network switch firmware upgrade

Dell support personnel have told us that the firmware version on some of our Dell network switches has been seen to cause similar network problems at other institutions. While that's not absolute proof that a firmware upgrade will address our network issue, the probability is favorable.
So we're going to schedule a maintenance outage to upgrade that firmware, which will require NO new hardware. We hope that'll resolve the network bug, but of course we don't yet know that for sure.

We've extensively examined the network paths between OURdisk and our compute nodes, and we haven't found any issues there.

---

II.A.2. Shifting off of the "ceph-fuse" client

Some of the problems we're experiencing, especially the Octopus/Pacific incompatibility bug, can be resolved by shifting off of the "ceph-fuse" client software that we're currently running on many of our supercomputer's compute nodes. That shift is already underway, and requires NO new hardware.

---

II.A.3. Reconfiguring OURdisk's metadata, monitoring and S3 gateway servers

Based on a discussion with a 3rd party Ceph expert:

* 3 of OURdisk's 5 metadata servers will each (initially) host 2 metadata server "instances" of different types (active, standby, or standby/replay), using hardware that's already on hand. For example, an OURdisk physical metadata server might host one active and one standby instance.

* OURdisk's 5 monitoring servers will shift to dedicated physical servers, using hardware that's already on hand.

* OURdisk's 4 object storage "S3" gateway servers will shift to dedicated physical servers, using hardware that's already on hand.

---

II.A.4. Slurm burst buffer

OURdisk's metadata servers need to be protected from having zillions of files open at the same time, to avoid these issues going forward.

Slurm supports use of a "burst buffer" to do I/O on a specific, designated filesystem during live number crunching, even though the files in question are supposed to reside elsewhere. For example, imagine a small filesystem made of fast SSDs, in addition to our large filesystem made of spinning hard drives.

(We already have a server with 16 x 3.2 TB NVMe SSDs and 8 InfiniBand ports, intended for exactly this purpose, so NO new hardware will be needed.)
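Concretely, a burst-buffer job script might look something like the sketch below. All directive names, paths, and the $BB_DIR variable are hypothetical; the real syntax will depend on which Slurm burst buffer plugin we end up deploying, and we'll document it before the capability goes live.

```shell
#!/bin/bash
#SBATCH --partition=normal
#SBATCH --ntasks=1
#SBATCH --time=12:00:00
#
# Burst buffer directives -- ILLUSTRATIVE ONLY. Different Slurm burst
# buffer plugins use different prefixes (e.g. "#BB" for the generic
# plugin, "#DW" for DataWarp), and the paths below are made up.
#
#BB stage_in  source=/ourdisk/hpc/mygroup/inputs  destination=$BB_DIR/inputs
#BB stage_out source=$BB_DIR/outputs              destination=/ourdisk/hpc/mygroup/outputs

# During the job, read and write on the small fast SSD filesystem,
# not directly on OURdisk:
my_program --input "$BB_DIR/inputs" --output "$BB_DIR/outputs"
```

The point of the sketch: your job's live I/O touches only the SSD filesystem, and OURdisk is touched only during the controlled copy-in and copy-out.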
By adding just a few new batch directives to your batch script file, you'll be able to tell Slurm:

* to copy your input files to the small fast SSD filesystem;
* to do your live number crunching on the small fast SSD filesystem;
* to copy your output files from the small fast SSD filesystem to the large spinning disk filesystem.

For example, you might tell Slurm to:

* for your job's input, use a certain set of files (or directories) on OURdisk or /scratch;
* auto-copy those files to the small fast SSD filesystem before starting the job;
* while running your job, output to the small fast SSD filesystem;
* at the end of your job, auto-copy the output files from the small fast SSD filesystem to OURdisk or /scratch.

The burst buffer will use a separate software technology as its own internal filesystem. So, when you use the burst buffer, all the I/O during your job will bypass OURdisk, and therefore won't bog down the OURdisk metadata servers. And we can limit the number of files open at the same time during the copy-in at the beginning and the copy-out at the end.

Please note that the burst buffer will only be available for people doing lots of IOPS, especially read IOPS -- not for people who are writing to or reading from a few large files. (IOPS = I/O operations per second -- that is, zillions of tiny reads and writes.)

People who do "sustained" reads and writes with a few large files won't stress OURdisk's metadata servers, because they won't open zillions of files at the same time. And using the burst buffer would actually be slower for them. So we don't expect those folks to get value from our burst buffer.

---

II.A.5. FS-cache

Recapping: OURdisk's metadata servers need to be protected from having zillions of files open at the same time, to avoid these issues going forward.

We've identified a technology, FS-cache ("FileSystem Cache"), that we can put "in front of" OURdisk, for users who open zillions of files at the same time.
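(For background, assuming this is the Linux kernel facility of that name: FS-Cache is the kernel's generic caching layer for network filesystems, typically backed by the cachefilesd daemon and a local SSD. A rough sketch of how a client opts in follows -- the server name, paths, and config values are illustrative, and our actual deployment may look quite different.)

```shell
# /etc/cachefilesd.conf on the client -- the cache lives on a local SSD:
#   dir /var/cache/fscache
#   tag myfscache

# Start the local cache daemon:
systemctl enable --now cachefilesd

# Mount a network filesystem with the "fsc" option, so file data is
# cached on the local SSD instead of hitting the backend every time:
mount -t nfs -o fsc fs-cache-server:/export/scratch /mnt/cached-scratch
```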
That way, users who are running codes that open zillions of files at the same time will open those files on FS-cache instead of on OURdisk. (Those files will auto-migrate between OURdisk and FS-cache, but with far fewer files on OURdisk open at the same time.)

FS-cache will use a separate software technology as its own internal filesystem. That means the zillions of open files won't affect OURdisk's metadata servers -- or at least not nearly as much. And, by deploying FS-cache on servers full of SSDs, we can provide not only high bandwidth, but especially high IOPS.

Shortly, we'll deploy and start testing FS-cache on some servers and SSDs that we already have on hand (separate hardware from the burst buffer). Once our initial FS-cache testing and benchmarking is complete, we'll test putting FS-cache into production for y'all with that same hardware.

---

II.B. Medium term improvements (new hardware IS needed)

II.B.1. Additional RAM in some metadata servers
II.B.2. More redundancy in monitoring and S3 gateway servers
II.B.3. Multiple FS-cache servers

---

II.B.1. Additional RAM in some metadata servers

We'll purchase additional RAM to put in some of OURdisk's metadata servers. That extra RAM is needed for recovery from certain kinds of problems (like some of the recent issues OURdisk has experienced).

---

II.B.2. More redundancy in monitoring and S3 gateway servers

For OURdisk's monitoring servers and object storage S3 gateway servers, we're using hardware that we already have on hand. Two of the servers we plan to use have dual operating system hard drives, dual data SSDs, and dual power supplies. But most of those servers have a single drive and a single power supply, which reduces resiliency per such server.

So, we'll upgrade each such server to dual OS disk drives (which we have on hand), dual data drives (which we have on hand), and dual power supplies and power cords (which we'll have to buy).

---

II.B.3.
Multiple FS-cache servers

If FS-cache is successful, then we'll also purchase additional hardware (servers and SSDs), so we can deploy several different FS-cache configurations, providing various levels of speed and capacity.

This approach is expected to provide high aggregate capacity, high aggregate speed (both IOPS and bandwidth), and high resiliency to failures of individual hardware components. In particular, if one FS-cache subsystem fails, its users will auto-failover to another FS-cache subsystem.

In addition, we'll pair up FS-cache servers and synchronously mirror them, for maximum resiliency for files that have been written to FS-cache but haven't yet drained off to OURdisk. That is, if an FS-cache server fails, its mirrored twin will typically continue, so that server's FS-cache service -- and the files that are on that server's SSDs but not yet on OURdisk -- will still be fully accessible.