OSCER-USERS-L Archives

OSCER users

OSCER-USERS-L@LISTS.OU.EDU

From: Henry Neeman <[log in to unmask]>
Reply To: Henry Neeman <[log in to unmask]>
Date: Thu, 27 Apr 2023 21:16:24 -0500
OSCER users,

OURdisk will be returned to service tomorrow morning
(Fri Apr 28), so that we can keep an eye on it all day.

Many thanks to Keith Brewster for serving as our guinea pig
(friendly user) in a successful test of the CAPS real-time
weather forecasting system, to make sure that things are in place.

(CAPS = Center for Analysis & Prediction of Storms)

And many thanks to the several heroes on the OSCER Operations
Team, who went far above and beyond the call of duty to fix
OURdisk, working long hours deep into the night to get this
resolved.

---

WHAT WENT WRONG?

We believe the proximate cause was the following:

Several of the OURdisk monitor nodes experienced a combination
of a bug in their network cards' driver software and a bug in
the specific version of the Ceph storage software that OURdisk
is based on.

WHY DID IT TAKE SO LONG TO FIX?

Working with our external Ceph expert consultant in Europe,
we took quite a while to diagnose the problem sufficiently to
be able to bring the bad monitor nodes up again.

Then, we faced significant difficulties getting the
three monitor nodes (the minimum needed) to maintain
a "quorum" (agreement on which of the monitor nodes
is the leader).
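
(For context, whether the monitors currently hold quorum can be
checked from any node with the "ceph" CLI and admin credentials.
Below is a minimal sketch in Python wrapping that CLI; it assumes
"ceph quorum_status --format json" reports the quorum_names and
quorum_leader_name fields, which may vary across Ceph releases.)

    # Sketch: report whether the Ceph monitors currently hold quorum.
    import json
    import subprocess

    def monitor_quorum():
        # Ask the cluster which monitors are in quorum and who leads.
        out = subprocess.run(
            ["ceph", "quorum_status", "--format", "json"],
            capture_output=True, text=True, check=True,
        ).stdout
        status = json.loads(out)
        return status.get("quorum_names", []), status.get("quorum_leader_name", "")

    if __name__ == "__main__":
        members, leader = monitor_quorum()
        print(f"{len(members)} monitor(s) in quorum: {members}; leader: {leader}")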

We also had to do some physical reconfiguring, to increase
the amount of disk space available in the monitor nodes,
so that they wouldn't crash again.

Then, we had to add each disk drive in OURdisk's diskfull
nodes into the drive pool by hand, at slightly over a minute
per disk drive (for 624 drives).
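
(For a sense of scale, scripted, that per-drive loop looks roughly
like the sketch below. The device paths are hypothetical
placeholders, and "ceph-volume lvm create" is one common way to
add a drive as an OSD; the exact procedure depends on how a given
deployment is set up. At slightly over a minute per drive,
624 drives works out to more than 10 hours even when nothing
goes wrong.)

    # Sketch only: add each data drive on a diskfull node as a Ceph OSD,
    # one at a time, and report how long the loop took.
    import subprocess
    import time

    # Hypothetical placeholder list of data drives on one diskfull node.
    DEVICES = ["/dev/sdb", "/dev/sdc", "/dev/sdd"]

    start = time.time()
    for dev in DEVICES:
        # "ceph-volume lvm create --data <device>" prepares and activates
        # a new OSD on that drive; run it on the node that owns the drive.
        subprocess.run(["ceph-volume", "lvm", "create", "--data", dev], check=True)

    elapsed_min = (time.time() - start) / 60.0
    print(f"Added {len(DEVICES)} drive(s) in {elapsed_min:.1f} minutes")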

Then we had to clean up the mess left behind, as described
in our previous notes.
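
(Those notes describe returning the "placement groups" to their
best status, "active+clean". One way to watch that kind of
recovery is sketched below; it assumes "ceph status --format json"
reports per-state placement group counts under pgmap, a layout
that may differ across Ceph releases.)

    # Sketch: report what fraction of placement groups are "active+clean".
    import json
    import subprocess

    out = subprocess.run(
        ["ceph", "status", "--format", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    pgmap = json.loads(out).get("pgmap", {})
    total = pgmap.get("num_pgs", 0)
    clean = sum(
        state.get("count", 0)
        for state in pgmap.get("pgs_by_state", [])
        if state.get("state_name") == "active+clean"
    )
    print(f"{clean} of {total} placement groups are active+clean")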

HOW WILL WE MINIMIZE THE LIKELIHOOD OF ANOTHER SUCH FAILURE?

In January, we purchased several new servers for Ceph
monitor, metadata and manager subsystems.

We're working now to get them into production as quickly as
possible, to replace the older legacy servers that we're
currently using.

Thus, all Ceph support capabilities will be on separate,
dedicated hardware, with much bigger, faster CPUs, RAM and
disk.

We're also going to shift to a different network approach,
and update Ceph to the most recent stable sub-version of
major version 16 ("Pacific"), but those improvements
can't happen as quickly.

WHOSE FAULT WAS THIS?

Mine. I made the executive decision to shift priority to
our growing backlog of tasks unrelated to OURdisk, instead of
getting OURdisk's new support servers into production.

The buck stops here.

Henry

----------

On Thu, 27 Apr 2023, Henry Neeman wrote:

>OSCER users,
>
>UPDATE:
>
>We HAVEN'T YET returned OURdisk to service, but:
>
>We've successfully brought it back up.
>
>Next, we're going to have one of our longtime users
>test it for us.
>
>If that test goes well, we plan to make it available for
>general use tomorrow morning (Fri Apr 28), so that we can
>keep an eye on it for the day.
>
>Henry
>
>----------
>
>On Thu, 27 Apr 2023, Henry Neeman wrote:
>
>OSCER users,
>
>UPDATE:
>
>We HAVEN'T YET returned OURdisk to service, but:
>
>(a) we've successfully returned all of OURdisk's individual
>diskfull nodes to service;
>
>(b) in our Ceph monitor nodes, we've cleared out the backlog
>of previous states ("CRUSH maps"), reducing that database from
>460 GB to less than 2 GB;
>
>(c) we're currently in the process of returning ~1/3 of the
>"placement groups" to best status ("active+clean").
>
>We don't yet know how long (c) will take, and it's probably
>UNSAFE to allow users onto OURdisk until that's completed.
>
>We also want to make sure that our Ceph monitor nodes are
>stable.
>
>So we're hoping to be able to let y'all back on to OURdisk
>tomorrow (Fri Apr 28).
>
>We apologize for this very unhappy situation.
>
>Henry
>
>----------
>
>On Wed, 26 Apr 2023, Henry Neeman wrote:
>
>OSCER users,
>
>UPDATE:
>
>We hope to have OURdisk back in production tomorrow
>(Thu Apr 27).
>
>Many thanks to our heroic OSCER Operations Team for
>moving this forward!
>
>Specifically:
>
>* The monitor system has been returned to full service,
>after an incredible effort by our team.
>
>* We're currently adding diskfull nodes back into production,
>one disk drive at a time, based on guidance from our
>external Ceph expert consultant -- who's (a) in Europe and
>(b) on vacation, but still has been helping us.
>
>The time lag per disk drive is roughly 1 minute, and we have
>624 such drives, so the minimum time to completion of this
>task is more than 10 hours.
>
>As always, we're very aware of how disruptive this issue is
>and are doing our best to get it resolved as quickly as
>we can. We apologize for the trouble.
>
>Henry
>
>----------
>
>On Tue, 25 Apr 2023, Henry Neeman wrote:
>
>OSCER users,
>
>UPDATE:
>
>The OSCER Operations Team has made good progress on
>the OURdisk problem, but isn't done yet, and there's
>a high probability that the issue won't get resolved today.
>
>Working closely all day with our external Ceph expert
>consultant, who has now gone to sleep (because it's
>a lot later in Europe than here), we've restored our
>Ceph monitor servers to service.
>
>But, we're still troubleshooting what's likely to be
>a server-level network issue, which is preventing our
>monitor servers from maintaining a "quorum" (mutual agreement
>about the state of the system).
>
>We're continuing to move forward on this and have
>multiple approaches to try.
>
>As always, we're very aware of how disruptive this issue is
>and are doing our best to get it resolved as quickly as
>we can. We apologize for the trouble.
>
>Henry
>
>----------
>
>On Mon, 24 Apr 2023, Henry Neeman wrote:
>
>OSCER users,
>
>UPDATE:
>
>We hope to have OURdisk in production later today. Our team
>has been working on it since yesterday morning.
>
>We think we've isolated the issue to the network driver layer,
>and we're collaborating closely with our external Ceph expert
>consultant to resolve the issue.
>
>We also hope to have a more permanent fix in place soon,
>most likely during the next maintenance outage, which is
>likely to be mid-May, in order to avoid dissertation and
>thesis deadlines.
>
>But, we can't know for sure until we've returned OURdisk to
>service.
>
>Again, we apologize for the trouble.
>
>Henry
>
>----------
>
>On Sun, 23 Apr 2023, Henry Neeman wrote:
>
>OSCER users,
>
>OURdisk is down. Our team is working to return it to service,
>but it's most likely that'll take until tomorrow (Mon Apr 24).
>
>We apologize for the trouble.
>
>---
>
>Henry Neeman ([log in to unmask])
>Director, OU Supercomputing Center for Education & Research (OSCER)
>Associate Professor, Gallogly College of Engineering
>Adjunct Associate Professor, School of Computer Science
>OU Information Technology
>The University of Oklahoma
>
>Engineering Lab 212, 200 Felgar St, Norman OK 73019
>405-325-5386 (office), 405-325-5486 (fax), 405-245-3823 (cell),
>[log in to unmask] (to e-mail me a text message)
>http://www.oscer.ou.edu/
