OSCER users,

We suspect that today's OURdisk metadata server failure, and the lack of automatic failover to a separate metadata server, is a side effect of a bug in the specific version of the Ceph software that we're currently running on OURdisk. So, we plan to upgrade to a more recent version of Ceph (and to a more recent version of Linux), which we hope will address the situation we encountered today.

Henry

________________________________
From: OSCER users <[log in to unmask]> on behalf of Neeman, Henry J. <[log in to unmask]>
Sent: Tuesday, February 11, 2025 12:18 PM
To: [log in to unmask] <[log in to unmask]>
Subject: Re: Problems on OSCER resources

OSCER users,

A quick follow-up: Worldwide, the overwhelming majority of cluster supercomputer unscheduled outages are caused by storage issues. I recently did a micro-study of cluster supercomputer outages at NSF-funded national supercomputing centers, meaning the biggest academic supercomputers in the US. I found that 79% of their cluster supercomputer unscheduled outages were caused by storage issues (among those where a cause was identified). That's across 13 different supercomputers over a 12-year period. So what we experienced this morning was unfortunate, but normal.

Henry

________________________________
From: OSCER users <[log in to unmask]> on behalf of Neeman, Henry J. <[log in to unmask]>
Sent: Tuesday, February 11, 2025 11:46 AM
To: [log in to unmask] <[log in to unmask]>
Subject: Re: Problems on OSCER resources

OSCER users,

It appears that the issue is resolved, thanks to quick work by our OSCER team. Please log in and try things, and contact us if you see any problems: [log in to unmask]

The issue was an overloaded OURdisk metadata server, which should have automatically failed over to its secondary, but ended up needing to be failed over manually. We're going to work with our software consultants for Ceph (the open source software technology that OURdisk is built on top of) to figure out a way to reduce the probability of needing manual failover in cases like this.

Again, we apologize for the trouble.

Henry

________________________________
From: Neeman, Henry J. <[log in to unmask]>
Sent: Tuesday, February 11, 2025 11:29 AM
To: OSCER users <[log in to unmask]>
Subject: Re: Problems on OSCER resources

OSCER users,

UPDATE: We believe we've identified the source of the trouble, and we're working to resolve it.

Henry

________________________________
From: Neeman, Henry J.
Sent: Tuesday, February 11, 2025 11:06 AM
To: OSCER users <[log in to unmask]>
Subject: Problems on OSCER resources

OSCER users,

We're currently experiencing problems on multiple OSCER systems. We're working on diagnosing and resolving the issue and will send updates as we make progress. We apologize for the trouble.

Henry Neeman ([log in to unmask])