OSCER users,

UPDATE:

OURdisk is now available on OURcloud virtual servers and on mounts on external servers (as well as on Schooner).

Many thanks to OSCER's team of heroes!

Henry

----------

On Fri, 28 Apr 2023, Henry Neeman wrote:

> OSCER users,
>
> UPDATE:
>
> OURdisk *IS* now available on Schooner.
>
> Please alert us if you encounter any difficulties with OURdisk on Schooner.
>
> OURdisk *ISN'T* yet available on OURcloud virtual servers, nor on mounts on external servers.
>
> We'll alert y'all when OURdisk is available on OURcloud and external servers.
>
> Many thanks to the amazing OSCER team for making this happen!
>
> Henry
>
> ----------
>
> On Fri, 28 Apr 2023, Henry Neeman wrote:
>
> OSCER users,
>
> UPDATE:
>
> The OSCER Operations Team is currently bringing OURdisk up on our several hundred compute nodes.
>
> We'll send an update when that's completed.
>
> Henry
>
> ----------
>
> On Thu, 27 Apr 2023, Henry Neeman wrote:
>
> OSCER users,
>
> OURdisk will be returned to service tomorrow morning (Fri Apr 28), so that we can keep an eye on it all day.
>
> Many thanks to Keith Brewster for serving as our "guinea pig" friendly user in a successful test of the CAPS real time weather forecasting system, to make sure that things are in place.
>
> (CAPS = Center for Analysis & Prediction of Storms)
>
> And many thanks to the several heroes on the OSCER Operations Team, who went far above and beyond the call of duty to fix OURdisk, working long hours deep into the night to get this resolved.
>
> ---
>
> WHAT WENT WRONG?
>
> We believe the proximate cause was the following:
>
> Several of OURdisk's monitor nodes experienced a combination of a bug in their network cards' driver software and a bug in the specific version of the Ceph storage software that OURdisk is based on.
>
> WHY DID IT TAKE SO LONG TO FIX?
>
> Working with our external Ceph expert consultant in Europe, we needed quite a while to diagnose the problem well enough to be able to bring the bad monitor nodes up again.
>
> Then, we faced significant difficulties bringing up the three monitor nodes that are the minimum needed to maintain a "quorum" (agreement on which of the monitor nodes is the leader); a rough sketch of how to check quorum health appears at the end of this note.
>
> We also had to do some physical reconfiguring, to increase the amount of disk space available in the monitor nodes, so that they wouldn't crash again.
>
> Then, we had to add each disk drive in OURdisk's diskfull nodes into the drive pool by hand, at slightly over a minute per disk drive (for 624 drives).
>
> Then we had to clean up the mess left behind, as described in our previous notes.
>
> HOW WILL WE MINIMIZE THE LIKELIHOOD OF ANOTHER SUCH FAILURE?
>
> In January, we purchased several new servers for Ceph monitor, metadata and manager subsystems.
>
> We're working now to get them into production as quickly as possible, to replace the older legacy servers that we're currently using.
>
> Thus, all Ceph support capabilities will be on separate, dedicated hardware, with much bigger, faster CPUs, RAM and disk.
>
> We're also going to shift to a different network approach, and update Ceph to the most recent stable sub-version of major version 16 ("Pacific"), but those improvements can't happen as quickly.
>
> WHOSE FAULT WAS THIS?
>
> Mine. I made the executive decision to shift priority to our growing backlog of tasks unrelated to OURdisk, instead of getting OURdisk's new support servers into production.
>
> The buck stops here.
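>
> (Aside for the technically curious: throughout this saga, the two numbers we kept watching were how many monitors were in quorum and how many placement groups had returned to "active+clean". Below is a rough sketch of that kind of check, in Python against the Ceph CLI's JSON output. It assumes the "ceph" command and a valid client keyring on the host, and the JSON field names are as in Pacific-era releases; treat it as illustrative, not as our exact tooling.)
>
>     #!/usr/bin/env python3
>     # Rough sketch: check monitor quorum and placement group health.
>     # Assumes the "ceph" CLI and a valid client keyring; JSON field
>     # names are as in Ceph Pacific (v16) and may vary by release.
>     import json
>     import subprocess
>
>     def ceph_json(*args):
>         """Run a ceph CLI command and parse its JSON output."""
>         out = subprocess.run(["ceph", *args, "--format", "json"],
>                              check=True, capture_output=True,
>                              text=True).stdout
>         return json.loads(out)
>
>     # Quorum: with 3 monitors, losing 2 leaves no majority, so no
>     # leader can be elected and the whole cluster stalls.
>     q = ceph_json("quorum_status")
>     print("monitors in quorum:", ", ".join(q["quorum_names"]))
>     print("elected leader:", q["quorum_leader_name"])
>
>     # Placement groups: recovery is done when every PG is
>     # "active+clean".
>     pgmap = ceph_json("status")["pgmap"]
>     clean = sum(s["count"] for s in pgmap["pgs_by_state"]
>                 if s["state_name"] == "active+clean")
>     print(f"PGs active+clean: {clean} / {pgmap['num_pgs']}")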
>
> Henry
>
> ----------
>
> On Thu, 27 Apr 2023, Henry Neeman wrote:
>
> OSCER users,
>
> UPDATE:
>
> We HAVEN'T YET returned OURdisk to service, but:
>
> We've successfully brought it back up.
>
> Next, we're going to have one of our longtime users test it for us.
>
> If that test goes well, we plan to make it available for general use tomorrow morning (Fri Apr 28), so that we can keep an eye on it for the day.
>
> Henry
>
> ----------
>
> On Thu, 27 Apr 2023, Henry Neeman wrote:
>
> OSCER users,
>
> UPDATE:
>
> We HAVEN'T YET returned OURdisk to service, but:
>
> (a) we've successfully returned all of OURdisk's individual diskfull nodes to service;
>
> (b) in our Ceph monitor nodes, we've cleared out the backlog of previous states ("CRUSH maps"), reducing that database from 460 GB to less than 2 GB;
>
> (c) we're currently in the process of returning ~1/3 of the "placement groups" to best status ("active+clean").
>
> We don't yet know how long (c) will take, and it's probably UNSAFE to allow users onto OURdisk until that's completed.
>
> We also want to make sure that our Ceph monitor nodes are stable.
>
> So we're hoping to be able to let y'all back on to OURdisk tomorrow (Fri Apr 28).
>
> We apologize for this very unhappy situation.
>
> Henry
>
> ----------
>
> On Wed, 26 Apr 2023, Henry Neeman wrote:
>
> OSCER users,
>
> UPDATE:
>
> We hope to have OURdisk back in production tomorrow (Thu Apr 27).
>
> Many thanks to our heroic OSCER Operations Team for moving this forward!
>
> Specifically:
>
> * The monitor system has been returned to full service, after an incredible effort by our team.
>
> * We're currently adding diskfull nodes back into production, one disk drive at a time, based on guidance from our external Ceph expert consultant -- who's (a) in Europe and (b) on vacation, but still has been helping us.
>
> The time lag per disk drive is roughly 1 minute, and we have 624 such drives, so the minimum time to completion of this task is more than 10 hours (624 minutes is about 10.4 hours).
>
> As always, we're very aware of how disruptive this issue is and are doing our best to get it resolved as quickly as we can. We apologize for the trouble.
>
> Henry
>
> ----------
>
> On Tue, 25 Apr 2023, Henry Neeman wrote:
>
> OSCER users,
>
> UPDATE:
>
> The OSCER Operations Team has made good progress on the OURdisk problem, but isn't done yet, and there's a high probability that the issue won't get resolved today.
>
> Working closely all day with our external Ceph expert consultant, who has now gone to sleep (because it's a lot later in Europe than here), we've restored our Ceph monitor servers to service.
>
> But, we're still troubleshooting what's likely to be a server-level network issue, which is preventing our monitor servers from maintaining a "quorum" (mutual agreement about the state of the system).
>
> We're continuing to move forward on this and have multiple approaches to try.
>
> As always, we're very aware of how disruptive this issue is and are doing our best to get it resolved as quickly as we can. We apologize for the trouble.
>
> Henry
>
> ----------
>
> On Mon, 24 Apr 2023, Henry Neeman wrote:
>
> OSCER users,
>
> UPDATE:
>
> We hope to have OURdisk in production later today. Our team has been working on it since yesterday morning.
>
> We think we've isolated the issue to the network driver layer, and we're collaborating closely with our external Ceph expert consultant to resolve the issue.
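>
> (Aside for the technically curious: when chasing a suspected NIC driver bug, the first step is simply to confirm which driver and firmware version each node is actually running, so they can be compared against known-bad releases. Below is a rough Python sketch of that inventory step; it assumes a Linux host with sysfs and the ethtool utility, and it's illustrative rather than our actual tooling.)
>
>     #!/usr/bin/env python3
>     # Rough sketch: report the kernel driver and firmware version
>     # for each network interface. Assumes Linux sysfs and the
>     # ethtool utility; interface names differ per machine.
>     import os
>     import subprocess
>
>     SYS_NET = "/sys/class/net"
>
>     for iface in sorted(os.listdir(SYS_NET)):
>         if iface == "lo":
>             continue  # skip the loopback interface
>         # Physical NICs expose their driver as a sysfs symlink.
>         link = os.path.join(SYS_NET, iface, "device", "driver")
>         driver = (os.path.basename(os.readlink(link))
>                   if os.path.islink(link) else "unknown")
>         # "ethtool -i" reports driver and firmware versions.
>         info = subprocess.run(["ethtool", "-i", iface],
>                               capture_output=True, text=True).stdout
>         fields = dict(line.split(": ", 1)
>                       for line in info.splitlines() if ": " in line)
>         print(f"{iface}: driver={driver}"
>               f" version={fields.get('version', '?')}"
>               f" firmware={fields.get('firmware-version', '?')}")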
>
> We also hope to have a more permanent fix in place soon, most likely during the next maintenance outage, probably in mid-May, timed to avoid dissertation and thesis deadlines.
>
> But, we can't know for sure until we've returned OURdisk to service.
>
> Again, we apologize for the trouble.
>
> Henry
>
> ----------
>
> On Sun, 23 Apr 2023, Henry Neeman wrote:
>
> OSCER users,
>
> OURdisk is down. Our team is working to return it to service, but it's most likely that'll take until tomorrow (Mon Apr 24).
>
> We apologize for the trouble.
>
> ---
>
> Henry Neeman ([log in to unmask])
> Director, OU Supercomputing Center for Education & Research (OSCER)
> Associate Professor, Gallogly College of Engineering
> Adjunct Associate Professor, School of Computer Science
> OU Information Technology
> The University of Oklahoma
>
> Engineering Lab 212, 200 Felgar St, Norman OK 73019
> 405-325-5386 (office), 405-325-5486 (fax), 405-245-3823 (cell),
> [log in to unmask] (to e-mail me a text message)
> http://www.oscer.ou.edu/