OSCER-USERS-L Archives

OSCER users

OSCER-USERS-L@LISTS.OU.EDU

Subject:
From:
Henry Neeman <[log in to unmask]>
Reply To:
Henry Neeman <[log in to unmask]>
Date:
Wed, 26 Apr 2023 17:02:27 -0500
OSCER users,

UPDATE:

We hope to have OURdisk back in production tomorrow
(Thu Apr 27).

Many thanks to our heroic OSCER Operations Team for
moving this forward!

Specifically:

* The monitor system has been returned to full service,
after an incredible effort by our team.

* We're currently adding diskful nodes back into production,
one disk drive at a time, based on guidance from our
external Ceph expert consultant -- who's (a) in Europe and
(b) on vacation, but still has been helping us.

The time lag per disk drive is roughly 1 minute, and we have
624 such drives, so the minimum time to completion of this
task is more than 10 hours.
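
As a rough check of that estimate, here is a minimal sketch of the arithmetic in Python. It assumes the drives are added strictly one at a time with no overlap, and that the lag is exactly 1 minute per drive (the message only says "roughly"). The function name minimum_duration is hypothetical, chosen for illustration.

```python
from datetime import timedelta

DRIVES = 624           # drive count stated in the message
MINUTES_PER_DRIVE = 1  # approximate lag per drive, stated in the message


def minimum_duration(drives: int, minutes_per_drive: float) -> timedelta:
    """Lower bound on total time when drives are added one at a time."""
    return timedelta(minutes=drives * minutes_per_drive)


total = minimum_duration(DRIVES, MINUTES_PER_DRIVE)
hours = total.total_seconds() / 3600
print(f"{DRIVES} drives -> at least {hours:.1f} hours")
```

Under those assumptions this prints "624 drives -> at least 10.4 hours", consistent with "more than 10 hours". Any extra lag per drive only pushes the total higher.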

As always, we're very aware of how disruptive this issue is
and are doing our best to get it resolved as quickly as
we can. We apologize for the trouble.

Henry

----------

On Tue, 25 Apr 2023, Henry Neeman wrote:

>OSCER users,
>
>UPDATE:
>
>The OSCER Operations Team has made good progress on
>the OURdisk problem but isn't done yet, and there's
>a high probability that the issue won't get resolved today.
>
>Working closely all day with our external Ceph expert
>consultant, who has now gone to sleep (because it's
>a lot later in Europe than here), we've restored our
>Ceph monitor servers to service.
>
>But, we're still troubleshooting what's likely to be
>a server-level network issue, which is preventing our
>monitor servers from maintaining a "quorum" (mutual agreement
>about the state of the system).
>
>We're continuing to move forward on this and have
>multiple approaches to try.
>
>As always, we're very aware of how disruptive this issue is
>and are doing our best to get it resolved as quickly as
>we can. We apologize for the trouble.
>
>Henry
>
>----------
>
>On Mon, 24 Apr 2023, Henry Neeman wrote:
>
>OSCER users,
>
>UPDATE:
>
>We hope to have OURdisk in production later today. Our team
>has been working on it since yesterday morning.
>
>We think we've isolated the issue to the network driver layer,
>and we're collaborating closely with our external Ceph expert
>consultant to resolve the issue.
>
>We also hope to have a more permanent fix in place soon,
>most likely during the next maintenance outage, which is
>likely to be mid-May, in order to avoid dissertation and
>thesis deadlines.
>
>But, we can't know for sure until we've returned OURdisk to
>service.
>
>Again, we apologize for the trouble.
>
>Henry
>
>----------
>
>On Sun, 23 Apr 2023, Henry Neeman wrote:
>
>OSCER users,
>
>OURdisk is down. Our team is working to return it to service,
>but that'll most likely take until tomorrow (Mon Apr 24).
>
>We apologize for the trouble.
>
>---
>
>Henry Neeman ([log in to unmask])
>Director, OU Supercomputing Center for Education & Research (OSCER)
>Associate Professor, Gallogly College of Engineering
>Adjunct Associate Professor, School of Computer Science
>OU Information Technology
>The University of Oklahoma
>
>Engineering Lab 212, 200 Felgar St, Norman OK 73019
>405-325-5386 (office), 405-325-5486 (fax), 405-245-3823 (cell),
>[log in to unmask] (to e-mail me a text message)
>http://www.oscer.ou.edu/
