OSCER-USERS-L Archives

OSCER users

OSCER-USERS-L@LISTS.OU.EDU

Subject:
From:
Henry Neeman <[log in to unmask]>
Reply To:
Henry Neeman <[log in to unmask]>
Date:
Thu, 27 Apr 2023 19:54:56 -0500
OSCER users,

UPDATE:

We HAVEN'T YET returned OURdisk to service, but:

We've successfully brought it back up.

Next, we're going to have one of our longtime users
test it for us.

If that test goes well, we plan to make it available for
general use tomorrow morning (Fri Apr 28), so that we can
keep an eye on it for the day.

Henry

----------

On Thu, 27 Apr 2023, Henry Neeman wrote:

>OSCER users,
>
>UPDATE:
>
>We HAVEN'T YET returned OURdisk to service, but:
>
>(a) we've successfully returned all of OURdisk's individual
>diskfull nodes to service;
>
>(b) in our Ceph monitor nodes, we've cleared out the backlog
>of previous states ("CRUSH maps"), reducing that database from
>460 GB to less than 2 GB;
>
>(c) we're currently in the process of returning ~1/3 of the
>"placement groups" to best status ("active+clean").
>
>We don't yet know how long (c) will take, and it's probably
>UNSAFE to allow users onto OURdisk until that's completed.
>
>We also want to make sure that our Ceph monitor nodes are
>stable.
>
>So we're hoping to be able to let y'all back on to OURdisk
>tomorrow (Fri Apr 28).
>
>We apologize for this very unhappy situation.
>
>Henry
>
>----------
>
>On Wed, 26 Apr 2023, Henry Neeman wrote:
>
>OSCER users,
>
>UPDATE:
>
>We hope to have OURdisk back in production tomorrow
>(Thu Apr 27).
>
>Many thanks to our heroic OSCER Operations Team for
>moving this forward!
>
>Specifically:
>
>* The monitor system has been returned to full service,
>after an incredible effort by our team.
>
>* We're currently adding diskfull nodes back into production,
>one disk drive at a time, based on guidance from our
>external Ceph expert consultant -- who's (a) in Europe and
>(b) on vacation, but still has been helping us.
>
>The time lag per disk drive is roughly 1 minute, and we have
>624 such drives, so the minimum time to completion of this
>task is more than 10 hours.
>
>As always, we're very aware of how disruptive this issue is
>and are doing our best to get it resolved as quickly as
>we can. We apologize for the trouble.
>
>Henry
>
>----------
>
>On Tue, 25 Apr 2023, Henry Neeman wrote:
>
>OSCER users,
>
>UPDATE:
>
>The OSCER Operations Team has made good progress on
>the OURdisk problem, but isn't done yet, and there's
>a high probability that the issue won't get resolved today.
>
>Working closely all day with our external Ceph expert
>consultant, who has now gone to sleep (because it's
>a lot later in Europe than here), we've restored our
>Ceph monitor servers to service.
>
>But, we're still troubleshooting what's likely to be
>a server-level network issue, which is preventing our
>monitor servers from maintaining a "quorum" (mutual agreement
>about the state of the system).
>
>We're continuing to move forward on this and have
>multiple approaches to try.
>
>As always, we're very aware of how disruptive this issue is
>and are doing our best to get it resolved as quickly as
>we can. We apologize for the trouble.
>
>Henry
>
>----------
>
>On Mon, 24 Apr 2023, Henry Neeman wrote:
>
>OSCER users,
>
>UPDATE:
>
>We hope to have OURdisk in production later today. Our team
>has been working on it since yesterday morning.
>
>We think we've isolated the issue to the network driver layer,
>and we're collaborating closely with our external Ceph expert
>consultant to resolve the issue.
>
>We also hope to have a more permanent fix in place soon,
>most likely during the next maintenance outage, which is
>likely to be mid-May, in order to avoid dissertation and
>thesis deadlines.
>
>But, we can't know for sure until we've returned OURdisk to
>service.
>
>Again, we apologize for the trouble.
>
>Henry
>
>----------
>
>On Sun, 23 Apr 2023, Henry Neeman wrote:
>
>OSCER users,
>
>OURdisk is down. Our team is working to return it to service,
>but that'll most likely take until tomorrow (Mon Apr 24).
>
>We apologize for the trouble.
>
>---
>
>Henry Neeman ([log in to unmask])
>Director, OU Supercomputing Center for Education & Research (OSCER)
>Associate Professor, Gallogly College of Engineering
>Adjunct Associate Professor, School of Computer Science
>OU Information Technology
>The University of Oklahoma
>
>Engineering Lab 212, 200 Felgar St, Norman OK 73019
>405-325-5386 (office), 405-325-5486 (fax), 405-245-3823 (cell),
>[log in to unmask] (to e-mail me a text message)
>http://www.oscer.ou.edu/
