OSCER users,
We suspect that today's OURdisk metadata server failure,
and the lack of automatic failover to a separate metadata server,
are side effects of a bug in the specific version of the Ceph
software that we're currently running on OURdisk.
So, we plan to upgrade to a more recent version of Ceph
(and to a more recent version of Linux), which we hope will
address the situation we encountered today.
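For anyone curious about the mechanics, here's a rough sketch
(not our exact procedure) of how an administrator might check
the running Ceph release and the metadata server (MDS) standby
state from an admin node. The filesystem name "ourdisk" and the
Python wrapper around the standard ceph command-line tool are
assumptions for illustration only:

    # Sketch only: assumes the ceph CLI is available on an admin/monitor
    # node and that the CephFS filesystem is named "ourdisk" (hypothetical).
    import json
    import subprocess

    def ceph(*args):
        # Run a ceph command and return its parsed JSON output.
        out = subprocess.run(
            ["ceph", *args, "--format", "json"],
            capture_output=True, text=True, check=True,
        ).stdout
        return json.loads(out)

    # Which Ceph release each daemon type is running.
    print(ceph("versions"))

    # MDS ranks, their states, and available standby daemons.
    print(ceph("fs", "status", "ourdisk"))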
Henry
________________________________
From: OSCER users <[log in to unmask]> on behalf of Neeman, Henry J. <[log in to unmask]>
Sent: Tuesday, February 11, 2025 12:18 PM
To: [log in to unmask] <[log in to unmask]>
Subject: Re: Problems on OSCER resources
OSCER users,
A quick follow-up:
Worldwide, the overwhelming majority of unscheduled
cluster supercomputer outages are caused by storage issues.
I recently did a micro-study of cluster supercomputer outages
at NSF-funded national supercomputing centers, meaning
the biggest academic supercomputers in the US.
I found that 79% of their cluster supercomputer unscheduled
outages were caused by storage issues (for those where a cause
was identified).
That's across 13 different supercomputers over a 12-year period.
So what we experienced this morning was unfortunate, but normal.
Henry
________________________________
From: OSCER users <[log in to unmask]> on behalf of Neeman, Henry J. <[log in to unmask]>
Sent: Tuesday, February 11, 2025 11:46 AM
To: [log in to unmask] <[log in to unmask]>
Subject: Re: Problems on OSCER resources
OSCER users,
It appears that the issue is resolved, thanks to quick work
by our OSCER team.
Please log in and try things, and contact us if you see
any problems:
[log in to unmask]
The issue was an overloaded OURdisk metadata server,
which should have automatically failed over to its secondary,
but ended up needing to be failed over manually.
We're going to work with our software consultants for Ceph
(the open-source software technology that OURdisk is built
on top of) to figure out how to reduce the probability of
needing manual failover in cases like this.
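For the curious, here's a rough sketch (not necessarily the exact
commands our team ran) of what a manual MDS failover can look like.
The rank number, the filesystem name "ourdisk", and the Python
wrapper around the standard ceph command-line tool are assumptions
for illustration:

    # Sketch only: manually fail an unresponsive MDS rank so that a
    # standby daemon takes over. "ourdisk" and rank 0 are hypothetical.
    import subprocess

    def ceph(*args):
        subprocess.run(["ceph", *args], check=True)

    # Mark rank 0 of the filesystem as failed; a healthy standby MDS,
    # if one is available, should then be promoted to serve that rank.
    ceph("mds", "fail", "ourdisk:0")

    # Confirm that the rank is active again on another daemon.
    ceph("fs", "status", "ourdisk")

One option sometimes used to speed up failover is standby-replay
(a standby MDS that continuously follows the active MDS's journal),
but whether that's appropriate here is part of what we'll work out
with the consultants.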
Again, we apologize for the trouble.
Henry
________________________________
From: Neeman, Henry J. <[log in to unmask]>
Sent: Tuesday, February 11, 2025 11:29 AM
To: OSCER users <[log in to unmask]>
Subject: Re: Problems on OSCER resources
OSCER users,
UPDATE:
We believe we've identified the source of the trouble, and
we're working to resolve it.
Henry
________________________________
From: Neeman, Henry J.
Sent: Tuesday, February 11, 2025 11:06 AM
To: OSCER users <[log in to unmask]>
Subject: Problems on OSCER resources
OSCER users,
We're currently experiencing problems on multiple OSCER systems.
We're working on diagnosing and resolving the issue
and will send updates as we make progress.
We apologize for the trouble.
Henry Neeman ([log in to unmask])