OSCER-USERS-L Archives

OSCER users

OSCER-USERS-L@LISTS.OU.EDU

Subject:
From: Henry Neeman <[log in to unmask]>
Reply To: Henry Neeman <[log in to unmask]>
Date: Thu, 27 Apr 2023 17:24:14 -0500

OSCER users,

UPDATE:

We HAVEN'T YET returned OURdisk to service, but:

(a) we've successfully returned all of OURdisk's individual
diskfull nodes to service;

(b) in our Ceph monitor nodes, we've cleared out the backlog
of previous states ("CRUSH maps"), reducing that database from
460 GB to less than 2 GB;

(c) we're currently in the process of returning ~1/3 of the
"placement groups" to best status ("active+clean").

We don't yet know how long (c) will take, and it's probably
UNSAFE to allow users onto OURdisk until that's completed.
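
For readers who want to see what "returning placement groups to
active+clean" looks like as a number, here is a minimal sketch. It
assumes the JSON layout printed by `ceph status --format json`, with a
"pgmap" section holding "num_pgs" and a "pgs_by_state" list; those field
names are an assumption and may differ between Ceph releases. The sample
counts are made up for illustration and are not OURdisk's real figures.

```python
import json


def active_clean_fraction(status: dict) -> float:
    """Return the fraction of placement groups whose state is exactly active+clean."""
    pgmap = status["pgmap"]
    total = pgmap["num_pgs"]
    clean = sum(
        entry["count"]
        for entry in pgmap["pgs_by_state"]
        if entry["state_name"] == "active+clean"
    )
    # Guard against an empty cluster report.
    return clean / total if total else 0.0


# Hypothetical sample shaped like `ceph status --format json` output.
sample = json.loads("""
{
  "pgmap": {
    "num_pgs": 3000,
    "pgs_by_state": [
      {"state_name": "active+clean", "count": 2000},
      {"state_name": "active+undersized+degraded", "count": 1000}
    ]
  }
}
""")

print(f"{active_clean_fraction(sample):.1%} of placement groups are active+clean")
```

On a live cluster the same function could be fed the parsed output of
the status command, and a value of 1.0 would correspond to step (c)
being complete.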

We also want to make sure that our Ceph monitor nodes are
stable.

So we're hoping to be able to let y'all back onto OURdisk
tomorrow (Fri Apr 28).

We apologize for this very unhappy situation.

Henry

----------

On Wed, 26 Apr 2023, Henry Neeman wrote:

>OSCER users,
>
>UPDATE:
>
>We hope to have OURdisk back in production tomorrow
>(Thu Apr 27).
>
>Many thanks to our heroic OSCER Operations Team for
>moving this forward!
>
>Specifically:
>
>* The monitor system has been returned to full service,
>after an incredible effort by our team.
>
>* We're currently adding diskfull nodes back into production,
>one disk drive at a time, based on guidance from our
>external Ceph expert consultant -- who's (a) in Europe and
>(b) on vacation, but still has been helping us.
>
>The time lag per disk drive is roughly 1 minute, and we have
>624 such drives, so the minimum time to completion of this
>task is more than 10 hours.
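
The estimate above can be checked with a few lines of arithmetic. This is
only a sketch: it takes the "roughly 1 minute" per drive at face value
and assumes the drives are added strictly one after another, so it gives
a lower bound rather than a prediction.

```python
DRIVES = 624            # number of drives, from the message above
MINUTES_PER_DRIVE = 1   # "roughly 1 minute" per drive (approximate)

total_minutes = DRIVES * MINUTES_PER_DRIVE
hours = total_minutes / 60

print(f"{total_minutes} minutes is about {hours:.1f} hours")
```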
>
>As always, we're very aware of how disruptive this issue is
>and are doing our best to get it resolved as quickly as
>we can. We apologize for the trouble.
>
>Henry
>
>----------
>
>On Tue, 25 Apr 2023, Henry Neeman wrote:
>
>OSCER users,
>
>UPDATE:
>
>The OSCER Operations Team has made good progress on
>the OURdisk problem, but isn't done yet, and there's
>a high probability that the issue won't get resolved today.
>
>Working closely all day with our external Ceph expert
>consultant, who has now gone to sleep (because it's
>a lot later in Europe than here), we've restored our
>Ceph monitor servers to service.
>
>But, we're still troubleshooting what's likely to be
>a server-level network issue, which is preventing our
>monitor servers from maintaining a "quorum" (mutual agreement
>about the state of the system).
>
>We're continuing to move forward on this and have
>multiple approaches to try.
>
>As always, we're very aware of how disruptive this issue is
>and are doing our best to get it resolved as quickly as
>we can. We apologize for the trouble.
>
>Henry
>
>----------
>
>On Mon, 24 Apr 2023, Henry Neeman wrote:
>
>OSCER users,
>
>UPDATE:
>
>We hope to have OURdisk in production later today. Our team
>has been working on it since yesterday morning.
>
>We think we've isolated the issue to the network driver layer,
>and we're collaborating closely with our external Ceph expert
>consultant to resolve the issue.
>
>We also hope to have a more permanent fix in place soon,
>most likely during the next maintenance outage, which is
>likely to be mid-May, in order to avoid dissertation and
>thesis deadlines.
>
>But, we can't know for sure until we've returned OURdisk to
>service.
>
>Again, we apologize for the trouble.
>
>Henry
>
>----------
>
>On Sun, 23 Apr 2023, Henry Neeman wrote:
>
>OSCER users,
>
>OURdisk is down. Our team is working to return it to service,
>but it's most likely that'll take until tomorrow (Mon Apr 24).
>
>We apologize for the trouble.
>
>---
>
>Henry Neeman ([log in to unmask])
>Director, OU Supercomputing Center for Education & Research (OSCER)
>Associate Professor, Gallogly College of Engineering
>Adjunct Associate Professor, School of Computer Science
>OU Information Technology
>The University of Oklahoma
>
>Engineering Lab 212, 200 Felgar St, Norman OK 73019
>405-325-5386 (office), 405-325-5486 (fax), 405-245-3823 (cell),
>[log in to unmask] (to e-mail me a text message)
>http://www.oscer.ou.edu/
