We encourage *ALL* research computing professionals to attend!
* Software-facing like research software engineers
* Data-facing like research data librarians
* System-facing like supercomputer system administrators
* Strategy/policy-facing like supercomputing center directors
* Researcher-facing like research computing facilitators
OSCER Scheduled maintenance partial outage Mon July 14 4:00pm - Tue July 15 5:00pm Central Time
Affected systems: Supercomputer, home1 users only (~50%)
To minimize risk and disruption for users on home1, we've decided to do the software rollback this coming Tuesday (July 15). User access to home1 will shut off Mon July 14 at 4:00pm Central Time so that we can make the final copy of user files and avoid data loss.
Access to home directories on home1 has been shut off, and we are making the final synchronization of home1 in preparation for the software rollback tomorrow.
If your home directory is on home1 and you log in, your working directory will be set to root (just a slash) instead of your home directory.
We hope to have your home directory access restored at a reasonable hour tomorrow (Tue July 15).
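For scripts that assume they start in the home directory, a small check can catch this case early. The following is a minimal sketch, assuming Python is available where you log in; the function name `home_is_usable` is hypothetical and not part of any OSCER tooling.

```python
import os


def home_is_usable(home: str) -> bool:
    """Return True if `home` is an existing directory we can read and enter."""
    return os.path.isdir(home) and os.access(home, os.R_OK | os.X_OK)


if __name__ == "__main__":
    home = os.path.expanduser("~")  # home directory as the system reports it
    cwd = os.getcwd()               # where this session actually started
    if not home_is_usable(home):
        print(f"Home directory {home} is not accessible; current directory is {cwd}")
    elif os.path.realpath(cwd) != os.path.realpath(home):
        print(f"Home directory is accessible, but the current directory is {cwd}")
    else:
        print("Home directory is accessible and is the current directory")
```

If the check reports that the home directory is not accessible during the outage window, that matches the behavior described above and needs no action on your part.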
An effect of today’s outage is that you may receive an error message from your job:
AssocGrpCPURunMinutesLimit
This is an effect of the outage today and does not indicate either an error with your job or your account. Please ignore this message.
Chris
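AssocGrpCPURunMinutesLimit is one of the reason codes that Slurm reports for a job. If you want to see which of your jobs carry it, the following is a minimal sketch, assuming the `squeue` command is on your path and using its `%i`, `%T`, and `%r` format fields (job ID, state, reason). The function names and the `|` separator are illustrative choices, not OSCER tooling.

```python
import getpass
import subprocess
from typing import List

REASON = "AssocGrpCPURunMinutesLimit"


def jobs_with_reason(squeue_output: str, reason: str = REASON) -> List[str]:
    """Return the job IDs whose reason field equals `reason`.

    Expects one job per line in the form "jobid|state|reason".
    """
    job_ids = []
    for line in squeue_output.splitlines():
        parts = line.strip().split("|")
        if len(parts) == 3 and parts[2] == reason:
            job_ids.append(parts[0])
    return job_ids


def query_my_jobs() -> str:
    """Run squeue for the current user; -h suppresses the header line."""
    result = subprocess.run(
        ["squeue", "-h", "-u", getpass.getuser(), "-o", "%i|%T|%r"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On a login node, `jobs_with_reason(query_my_jobs())` would list the affected job IDs. As noted above, the message itself can be ignored.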
From: OSCER users <[log in to unmask]> on behalf of Neeman, Henry J. <[log in to unmask]>
Date: Tuesday, July 15, 2025 at 08:31
To: [log in to unmask] <[log in to unmask]>
Subject: Re: OSCER Scheduled maintenance partial outage Mon July 14 4pm - Tue July 15 5pm CT
We have concluded home1 maintenance, and you should be able to submit jobs again.
We will continue to monitor the health of the home1 server over the coming days to verify that the software version rollback has been effective at improving server stability.
The OSCER Team
From: OSCER users <[log in to unmask]> on behalf of Little, Christopher <[log in to unmask]>
Date: Tuesday, July 15, 2025 at 9:51 AM
To: [log in to unmask] <[log in to unmask]>
Subject: Re: OSCER Scheduled maintenance partial outage Mon July 14 4pm - Tue July 15 5pm CT
OSCER users,
OSCER Users,
The /home fileserver issue this morning has been fixed. If you still have any issues, please contact us at [log in to unmask]. We strive to improve all OSCER systems. We apologize for any inconvenience.
As some of you have noticed, we're continuing to experience disruption on one of Schooner's /home subsystems (home1).
We believe that the cause is a bug in the combination of software versions that we're using to implement home1, specifically the versions of the Linux operating system and of ZFS.
(ZFS is the storage software that both home1 and some /scratch subsystems are built on.)
We've completed the first file synchronization of home1 to other storage, as a secondary copy in case of trouble. It took about 15 hours, well under a full day.
NOTE: For home1 users, long-running batch jobs will be killed, whether they are in the OSCER-owned longjobs or longlargemem queue or in a condominium queue.
Because home1 has been failing daily, we expect that it will happen again tomorrow (Thu July 10).
If that happens, we'll take that opportunity to do the final file synchronization (which we now hope will be much quicker than the first one), then roll back the software, and then reboot home1.
If things go well -- NOT GUARANTEED -- no running jobs will crash as a result.
It appears that the file system issue with one of the home directory servers has returned. We are working on it and will update you as soon as we have more information.
The systems are back up, so you should all be able to log in again.
Please send email to [log in to unmask] if you are still experiencing any issues.
Thanks,
Horst
Horst Severini <[log in to unmask]> wrote:
> Hi all,
>
> it appears that the file system issue with one of the home directory servers
> has returned. We are working on it, and will update you as soon as we
> have more information.
>
> Our apologies for the inconvenience,
>
> Horst
We're working to diagnose the issue and will then resolve it.
We apologize for the trouble!
---
Henry Neeman ([log in to unmask])
Director, OU Supercomputing Center for Education & Research (OSCER)
Associate Professor, Gallogly College of Engineering
Adjunct Associate Professor, School of Computer Science
OU Information Technology
The University of Oklahoma
Engineering Lab 212, 200 Felgar St, Norman OK 73019
405-325-5386 (office), 405-325-5486 (fax), 405-245-3823 (cell),
[log in to unmask] (to email me a text message)
http://www.oscer.ou.edu/