OSCER-USERS-L Archives

OSCER users

OSCER-USERS-L@LISTS.OU.EDU

Subject:
From: Henry Neeman <[log in to unmask]>
Reply To: Henry Neeman <[log in to unmask]>
Date: Fri, 28 Apr 2023 09:27:36 -0500
Content-Type: text/plain
Parts/Attachments: text/plain (263 lines)
OSCER users,

UPDATE:

The OSCER Operations Team is currently bringing OURdisk up
on our several hundred compute nodes.
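
(For the curious, here is a minimal sketch, not our exact
procedure, of the kind of per-node check this involves:
confirming that a Ceph filesystem is mounted at the expected
OURdisk mount point. The mount point in the sketch is a
placeholder, not necessarily the real path on our nodes.)

    #!/usr/bin/env python3
    # Minimal sketch (not the exact OSCER procedure): check on one compute
    # node whether a Ceph filesystem is mounted at the expected OURdisk
    # mount point. The path below is a placeholder.
    MOUNT_POINT = "/ourdisk"   # placeholder mount point

    with open("/proc/mounts") as mounts:
        ceph_mounts = [line.split() for line in mounts
                       if line.split()[2] in ("ceph", "fuse.ceph-fuse")]

    if any(fields[1] == MOUNT_POINT for fields in ceph_mounts):
        print(f"A Ceph filesystem is mounted at {MOUNT_POINT}")
    else:
        print(f"No Ceph filesystem mounted at {MOUNT_POINT}")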

We'll send an update when that's completed.

Henry

----------

On Thu, 27 Apr 2023, Henry Neeman wrote:

>OSCER users,
>
>OURdisk will be returned to service tomorrow morning
>(Fri Apr 28), so that we can keep an eye on it all day.
>
>Many thanks to Keith Brewster for serving as our guinea pig
>(friendly user) in a successful test of the CAPS real time
>weather forecasting system, to make sure that things are in place.
>
>(CAPS = Center for Analysis & Prediction of Storms)
>
>And many thanks to the several heroes on the OSCER Operations
>Team, who went far above and beyond the call of duty to fix
>OURdisk, working long hours deep into the night to get this
>resolved.
>
>---
>
>WHAT WENT WRONG?
>
>We believe the proximate cause was the following:
>
>Several of the OURdisk monitor nodes experienced a combination
>of a bug in their network cards' driver software and a bug in
>the specific version of the Ceph storage software that OURdisk
>is based on.
>
>WHY DID IT TAKE SO LONG TO FIX?
>
>Working with our external Ceph expert consultant in Europe,
>it took quite a while to diagnose the problem sufficiently to
>be able to bring the bad monitor nodes up again.
>
>Then, we faced significant difficulties getting the
>three monitor nodes (the minimum number needed) to
>maintain a "quorum" (agreement on which of the monitor
>nodes is the leader).
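>
>(For the curious, here is a minimal sketch, not our exact
>diagnostic procedure, of how one can ask a Ceph cluster
>whether its monitors currently hold quorum and which one is
>the leader. It assumes the "ceph" command-line tool and an
>admin keyring are available on the host where it runs.)
>
>    #!/usr/bin/env python3
>    # Minimal sketch (not the exact OSCER procedure): report which Ceph
>    # monitors are currently in quorum and which one is the leader.
>    # Assumes the "ceph" CLI and an admin keyring on this host.
>    import json
>    import subprocess
>
>    out = subprocess.run(
>        ["ceph", "quorum_status", "--format", "json"],
>        check=True, capture_output=True, text=True,
>    ).stdout
>    status = json.loads(out)
>
>    print("monitors in quorum:", status["quorum_names"])
>    print("current leader:    ", status["quorum_leader_name"])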
>
>We also had to do some physical reconfiguring, to increase
>the amount of disk space available in the monitor nodes,
>so that they wouldn't crash again.
>
>Then, we had to add each disk drive in OURdisk's diskfull
>nodes into the drive pool by hand, at slightly over a minute
>per disk drive (for 624 drives).
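>
>(For the curious, here is a minimal sketch, not our exact
>procedure, of what that kind of one-drive-at-a-time loop
>looks like. The device names and the pause length are
>placeholders, and the commands must run as root on each
>diskfull node.)
>
>    #!/usr/bin/env python3
>    # Minimal sketch (not the exact OSCER procedure): add local drives to
>    # the Ceph cluster one at a time, pausing between drives so the
>    # cluster can absorb each new OSD. Devices and pause are placeholders.
>    import subprocess
>    import time
>
>    DEVICES = ["/dev/sdb", "/dev/sdc", "/dev/sdd"]   # placeholder devices
>    PAUSE_SECONDS = 60                               # roughly a minute each
>
>    for dev in DEVICES:
>        # Prepare and activate this drive as a Ceph OSD on the local node.
>        subprocess.run(["ceph-volume", "lvm", "create", "--data", dev],
>                       check=True)
>        # Give the cluster time to register the new OSD before the next one.
>        time.sleep(PAUSE_SECONDS)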
>
>Then we had to clean up the mess left behind, as described
>in our previous notes.
>
>HOW WILL WE MINIMIZE THE LIKELIHOOD OF ANOTHER SUCH FAILURE?
>
>In January, we purchased several new servers for Ceph
>monitor, metadata and manager subsystems.
>
>We're working now to get them into production as quickly as
>possible, to replace the older legacy servers that we're
>currently using.
>
>Thus, all Ceph support capabilities will be on separate,
>dedicated hardware, with much bigger, faster CPUs, RAM and
>disk.
>
>We're also going to shift to a different network approach,
>and update Ceph to the most recent stable sub-version of
>major version 16 ("Pacific"), but those improvements
>can't happen as quickly.
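>
>(Once that update happens, a quick way to confirm that every
>daemon in the cluster is on the intended Pacific sub-version
>is to ask Ceph itself. Here is a minimal sketch, not our
>exact procedure; it assumes the "ceph" CLI and an admin
>keyring on the host where it runs.)
>
>    #!/usr/bin/env python3
>    # Minimal sketch (not the exact OSCER procedure): list which Ceph
>    # release each daemon type is running, e.g. to confirm that the whole
>    # cluster has reached the intended 16.2.x ("Pacific") sub-version.
>    import json
>    import subprocess
>
>    out = subprocess.run(
>        ["ceph", "versions"],
>        check=True, capture_output=True, text=True,
>    ).stdout
>    for daemon_type, versions in json.loads(out).items():
>        print(daemon_type, versions)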
>
>WHOSE FAULT WAS THIS?
>
>Mine. I made the executive decision to shift priority to
>our growing backlog of tasks unrelated to OURdisk, instead of
>getting OURdisk's new support servers into production.
>
>The buck stops here.
>
>Henry
>
>----------
>
>On Thu, 27 Apr 2023, Henry Neeman wrote:
>
>OSCER users,
>
>UPDATE:
>
>We HAVEN'T YET returned OURdisk to service, but:
>
>We've successfully brought it back up.
>
>Next, we're going to have one of our longtime users
>test it for us.
>
>If that test goes well, we plan to make it available for
>general use tomorrow morning (Fri Apr 28), so that we can
>keep an eye on it for the day.
>
>Henry
>
>----------
>
>On Thu, 27 Apr 2023, Henry Neeman wrote:
>
>OSCER users,
>
>UPDATE:
>
>We HAVEN'T YET returned OURdisk to service, but:
>
>(a) we've successfully returned all of OURdisk's individual
>diskfull nodes to service;
>
>(b) in our Ceph monitor nodes, we've cleared out the backlog
>of previous states ("CRUSH maps"), reducing that database from
>460 GB to less than 2 GB;
>
>(c) we're currently in the process of returning ~1/3 of the
>"placement groups" to best status ("active+clean").
>
>We don't yet know how long (c) will take, and it's probably
>UNSAFE to allow users onto OURdisk until that's completed.
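>
>(For the curious, here is a minimal sketch, not our exact
>procedure, of how one can track that recovery: counting how
>many placement groups have reached "active+clean". It assumes
>the "ceph" CLI, an admin keyring, and the JSON field names
>used by recent Ceph releases.)
>
>    #!/usr/bin/env python3
>    # Minimal sketch (not the exact OSCER procedure): count how many
>    # placement groups are "active+clean", as a rough recovery progress
>    # check. JSON field names (pgmap, pgs_by_state) are per recent Ceph
>    # releases; adjust if your version's output differs.
>    import json
>    import subprocess
>
>    out = subprocess.run(
>        ["ceph", "status", "--format", "json"],
>        check=True, capture_output=True, text=True,
>    ).stdout
>    pgmap = json.loads(out)["pgmap"]
>
>    total = pgmap["num_pgs"]
>    clean = sum(s["count"] for s in pgmap.get("pgs_by_state", [])
>                if s["state_name"] == "active+clean")
>    print(f"{clean} of {total} placement groups are active+clean")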
>
>We also want to make sure that our Ceph monitor nodes are
>stable.
>
>So we're hoping to be able to let y'all back on to OURdisk
>tomorrow (Fri Apr 28).
>
>We apologize for this very unhappy situation.
>
>Henry
>
>----------
>
>On Wed, 26 Apr 2023, Henry Neeman wrote:
>
>OSCER users,
>
>UPDATE:
>
>We hope to have OURdisk back in production tomorrow
>(Thu Apr 27).
>
>Many thanks to our heroic OSCER Operations Team for
>moving this forward!
>
>Specifically:
>
>* The monitor system has been returned to full service,
>after an incredible effort by our team.
>
>* We're currently adding diskfull nodes back into production,
>one disk drive at a time, based on guidance from our
>external Ceph expert consultant -- who's (a) in Europe and
>(b) on vacation, but still has been helping us.
>
>The time lag per disk drive is roughly 1 minute, and we have
>624 such drives, so the minimum time to completion of this
>task is more than 10 hours.
>
>As always, we're very aware of how disruptive this issue is
>and are doing our best to get it resolved as quickly as
>we can. We apologize for the trouble.
>
>Henry
>
>----------
>
>On Tue, 25 Apr 2023, Henry Neeman wrote:
>
>OSCER users,
>
>UPDATE:
>
>The OSCER Operations Team has made good progress on
>the OURdisk problem, but isn't done yet, and there's
>a high probability that the issue won't get resolved today.
>
>Working closely all day with our external Ceph expert
>consultant, who has now gone to sleep (because it's
>a lot later in Europe than here), we've restored our
>Ceph monitor servers to service.
>
>But, we're still troubleshooting what's likely to be
>a server-level network issue, which is preventing our
>monitor servers from maintaining a "quorum" (mutual agreement
>about the state of the system).
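>
>(For the curious, here is a minimal sketch, not our exact
>diagnostic, of the kind of basic network check that helps
>separate a network problem from a Ceph-level problem: can a
>given host reach each monitor node on Ceph's standard monitor
>ports? The host names below are placeholders, not our real
>monitor nodes.)
>
>    #!/usr/bin/env python3
>    # Minimal sketch (not the exact OSCER diagnostic): check whether this
>    # host can reach each monitor node on Ceph's monitor ports
>    # (3300 for msgr2, 6789 for msgr1). Host names are placeholders.
>    import socket
>
>    MON_HOSTS = ["mon1.example.edu", "mon2.example.edu", "mon3.example.edu"]
>    MON_PORTS = [3300, 6789]
>
>    for host in MON_HOSTS:
>        for port in MON_PORTS:
>            try:
>                with socket.create_connection((host, port), timeout=3):
>                    print(f"{host}:{port}  reachable")
>            except OSError as err:
>                print(f"{host}:{port}  NOT reachable ({err})")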
>
>We're continuing to move forward on this and have
>multiple approaches to try.
>
>As always, we're very aware of how disruptive this issue is
>and are doing our best to get it resolved as quickly as
>we can. We apologize for the trouble.
>
>Henry
>
>----------
>
>On Mon, 24 Apr 2023, Henry Neeman wrote:
>
>OSCER users,
>
>UPDATE:
>
>We hope to have OURdisk in production later today. Our team
>has been working on it since yesterday morning.
>
>We think we've isolated the issue to the network driver layer,
>and we're collaborating closely with our external Ceph expert
>consultant to resolve the issue.
>
>We also hope to have a more permanent fix in place soon,
>most likely during the next maintenance outage, which is
>likely to be mid-May, in order to avoid dissertation and
>thesis deadlines.
>
>But, we can't know for sure until we've returned OURdisk to
>service.
>
>Again, we apologize for the trouble.
>
>Henry
>
>----------
>
>On Sun, 23 Apr 2023, Henry Neeman wrote:
>
>OSCER users,
>
>OURdisk is down. Our team is working to return it to service,
>but it's most likely that'll take until tomorrow (Mon Apr 24).
>
>We apologize for the trouble.
>
>---
>
>Henry Neeman ([log in to unmask])
>Director, OU Supercomputing Center for Education & Research (OSCER)
>Associate Professor, Gallogly College of Engineering
>Adjunct Associate Professor, School of Computer Science
>OU Information Technology
>The University of Oklahoma
>
>Engineering Lab 212, 200 Felgar St, Norman OK 73019
>405-325-5386 (office), 405-325-5486 (fax), 405-245-3823 (cell),
>[log in to unmask] (to e-mail me a text message)
>http://www.oscer.ou.edu/
