OSCER users,


We suspect that today's OURdisk metadata server failure, 
and the lack of automatic failover to a separate metadata server, 
are side effects of a bug in the specific version of the Ceph 
software that we're currently running on OURdisk.

So, we plan to upgrade to a more recent version of Ceph 
(and to a more recent version of Linux), which we hope will 
address the situation we encountered today.

Henry



From: OSCER users <[log in to unmask]> on behalf of Neeman, Henry J. <[log in to unmask]>
Sent: Tuesday, February 11, 2025 12:18 PM
To: [log in to unmask] <[log in to unmask]>
Subject: Re: Problems on OSCER resources
 


OSCER users,


A quick follow-up:

Worldwide, the overwhelming majority of unscheduled outages on 
cluster supercomputers are caused by storage issues.

I recently did a micro-study of cluster supercomputer outages 
at NSF-funded national supercomputing centers, meaning 
the biggest academic supercomputers in the US.

I found that 79% of their cluster supercomputer unscheduled 
outages were caused by storage issues (for those where a cause 
was identified).

That's on 13 different supercomputers over a 12-year period.

So what we experienced this morning was unfortunate, but normal.

Henry



From: OSCER users <[log in to unmask]> on behalf of Neeman, Henry J. <[log in to unmask]>
Sent: Tuesday, February 11, 2025 11:46 AM
To: [log in to unmask] <[log in to unmask]>
Subject: Re: Problems on OSCER resources
 


OSCER users,


It appears that the issue is resolved, thanks to quick work 
by our OSCER team.

Please log in and try things, and contact us if you see 
any problems:


The issue was an overloaded OURdisk metadata server, 
which should have automatically failed over to its secondary, 
but ended up needing to be failed over manually.

We're going to work with our software consultants for Ceph 
(the open source software technology that OURdisk is built 
on top of), to figure out a way to reduce the probability of 
needing manual failover in cases like this.

Again, we apologize for the trouble.

Henry



From: Neeman, Henry J. <[log in to unmask]>
Sent: Tuesday, February 11, 2025 11:29 AM
To: OSCER users <[log in to unmask]>
Subject: Re: Problems on OSCER resources
 


OSCER users,


UPDATE:

We believe we've identified the source of the trouble, and 
we're working to resolve it.

Henry



From: Neeman, Henry J.
Sent: Tuesday, February 11, 2025 11:06 AM
To: OSCER users <[log in to unmask]>
Subject: Problems on OSCER resources
 


OSCER users,


We're currently experiencing problems on multiple OSCER systems.

We're working on diagnosing and resolving the issue 
and will send updates as we make progress.

We apologize for the trouble.

Henry Neeman ([log in to unmask])