OSCER-USERS-L Archives

OSCER users

OSCER-USERS-L@LISTS.OU.EDU

Subject: Re: Problems on OSCER resources
From: "Neeman, Henry J." <[log in to unmask]>
Reply-To: Neeman, Henry J.
Date: Tue, 11 Feb 2025 20:16:47 +0000


OSCER users,


We suspect that today's OURdisk metadata server failure,
and the lack of automatic failover to a separate metadata server,
are side effects of a bug in the specific version of the Ceph
software that we're currently running on OURdisk.

So, we plan to upgrade to a more recent version of Ceph
(and to a more recent version of Linux), which we hope will
address the situation we encountered today.
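
For those interested in the mechanics, here's a minimal sketch
of what such an upgrade typically looks like, assuming a
cephadm-managed cluster (the target version number below is
purely illustrative, not our actual plan):

    # Check cluster health and the Ceph versions currently running
    ceph -s
    ceph versions

    # Start a rolling upgrade to a target release
    ceph orch upgrade start --ceph-version 18.2.4

    # Watch progress; the orchestrator upgrades daemons a few
    # at a time, so the cluster stays online throughout
    ceph orch upgrade status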

Henry


________________________________
From: OSCER users <[log in to unmask]> on behalf of Neeman, Henry J. <[log in to unmask]>
Sent: Tuesday, February 11, 2025 12:18 PM
To: [log in to unmask] <[log in to unmask]>
Subject: Re: Problems on OSCER resources



OSCER users,


A quick follow-up:

Worldwide, the overwhelming majority of unscheduled
cluster supercomputer outages are caused by storage issues.

I recently did a micro-study of cluster supercomputer outages
at NSF-funded national supercomputing centers (that is,
the biggest academic supercomputers in the US).

I found that, among the outages where a cause was identified,
79% were caused by storage issues.

That's across 13 different supercomputers over a 12-year period.

So what we experienced this morning was unfortunate, but normal.

Henry


________________________________
From: OSCER users <[log in to unmask]> on behalf of Neeman, Henry J. <[log in to unmask]>
Sent: Tuesday, February 11, 2025 11:46 AM
To: [log in to unmask] <[log in to unmask]>
Subject: Re: Problems on OSCER resources



OSCER users,


It appears that the issue is resolved, thanks to quick work
by our OSCER team.

Please log in and try things, and contact us if you see
any problems:

[log in to unmask]

The issue was an overloaded OURdisk metadata server,
which should have automatically failed over to its secondary
but instead had to be failed over manually.
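
For the curious, here's a minimal sketch of what a manual
CephFS metadata server failover looks like with the standard
Ceph tools (the daemon and filesystem names below are
placeholders, not our actual configuration):

    # See which MDS daemons are active and which are on standby
    ceph fs status

    # Mark the stuck active MDS as failed so a standby takes over
    ceph mds fail <mds-daemon-name>

    # Warn if the number of available standby MDS daemons
    # drops below the desired count
    ceph fs set <filesystem-name> standby_count_wanted 1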

We're going to work with our software consultants for Ceph
(the open-source storage technology that OURdisk is built
on top of) to figure out a way to reduce the probability of
needing a manual failover in cases like this.

Again, we apologize for the trouble.

Henry


________________________________
From: Neeman, Henry J. <[log in to unmask]>
Sent: Tuesday, February 11, 2025 11:29 AM
To: OSCER users <[log in to unmask]>
Subject: Re: Problems on OSCER resources



OSCER users,


UPDATE:

We believe we've identified the source of the trouble, and
we're working to resolve it.

Henry


________________________________
From: Neeman, Henry J.
Sent: Tuesday, February 11, 2025 11:06 AM
To: OSCER users <[log in to unmask]>
Subject: Problems on OSCER resources



OSCER users,


We're currently experiencing problems on multiple OSCER systems.

We're working on diagnosing and resolving the issue
and will send updates as we make progress.

We apologize for the trouble.

Henry Neeman ([log in to unmask])


