User alerts for HPCC's computing resources

25 minute read

Scheduled exceptions and downtimes

[7 Jul 2026] HPCC Cluster Maintenance Shutdown

UCR HPCC will be performing routine maintenance on the cluster between the following times:

Start: Thursday, Jul 9 2026, 8am
End: Friday, Jul 10 2026, 8pm

A Slurm reservation block has been set to prevent jobs from running during maintenance. If your job is stuck in the queue with the reason “ReqNodeNotAvail”, please be aware that jobs will not automatically start after the shutdown — any jobs remaining in the queue at the start of maintenance will be cancelled and will need to be manually requeued once the cluster is back online. We apologize for the disruption that this shutdown will cause.

Unscheduled exceptions and downtimes

[2 Jun 2026] MPI Job Failures Due to Security Update

In response to a recent vulnerability, the system’s ptrace scope has been restricted. This change can cause certain MPI jobs to fail, particularly those that rely on cross-memory attach (CMA) for inter-process communication.

If your MPI job is failing unexpectedly, check your job output for errors such as:

Read -1, expected 2642383392, errno = 1

If you see this, you can work around this by disabling the vader CMA mechanism with the following flag:

--mca btl_vader_single_copy_mechanism none

For example:

mpirun --mca btl_vader_single_copy_mechanism none -np 4 ./my_mpi_program
# or
mpiexec --mca btl_vader_single_copy_mechanism none -np 4 ./my_mpi_program

Or in a Slurm batch script:

#!/bin/bash
#SBATCH -p epyc
#SBATCH -n 4

mpirun --mca btl_vader_single_copy_mechanism none -np 4 ./my_mpi_program
# or
mpiexec --mca btl_vader_single_copy_mechanism none -np 4 ./my_mpi_program

If you continue to experience issues, please contact us at support@hpcc.ucr.edu or via Slack.

Past exceptions

[8 May 2026] Unavoidable HPCC Cluster Maintenance Shutdown

When: Friday, May 8 2026, 8am-8pm

UCR HPCC has to perform an unscheduled maintenance shutdown on Friday, May 8, to apply a critical security update. This issue affects almost all Linux-based systems worldwide, not just UCR’s HPCC.

We expect this maintenance to be shorter than a typical full maintenance shutdown.

Because this is an unscheduled shutdown, running jobs and active SSH, tmux, or screen sessions will need to be terminated when maintenance begins. Users are encouraged to save work and stop jobs or sessions before 8:00 AM to reduce the risk of lost work.

We apologize for the short notice and appreciate your understanding.

[18-Feb-2026] HPCC Cluster Maintenance Shutdown

UCR HPCC will be performing routine maintenance on the cluster between the following times:

Start: Wednesday, Feb 18 2026, 8am
End: Thursday, Feb 19 2026, 8pm

A Slurm reservation block has been set to prevent jobs from running during maintenance. If your job is stuck in the queue with the reason “ReqNodeNotAvail”, you may either reduce the duration of your job (using the -t flag) to finish before the shutdown, or leave the job in the queue where it will start once nodes are brought online. We apologize for the disruption that this shutdown will cause.

19-Oct-2025 UCR Campus Network Maintenance

UCR Information Technology Services (ITS) will be performing campus network maintenance on Sunday, Oct 19 2025, from 4am to 12pm. This will likely disrupt network activity for the HPCC facility. Please save any open sessions prior to the maintenance window, and avoid connecting to HPCC services until ITS confirms all operations are complete.

Official announcement - UCR ITS

TechAlerts - UCR ITS

5-Sept-2025 HPCC Follow-up Shutdown

Update Sept. 6 3:45PM: The Slurm reservation has been removed and queued jobs have started running!

Update Sept. 6 3:00PM: Head nodes and OnDemand are back online. The Slurm reservation is still in place but is expected to be removed shortly to allow for jobs to start running.

Start: Friday, September 5, 9pm PDT
Ends: Saturday, September 6, evening PDT

Please be prepared that the final install of the 500kW by the contractor will require this second electrical power outage in our server room.

HPCC is not planning any significant cluster maintenance tasks for this shutdown; we will only be shutting down the cluster to allow for electrical work to proceed. As such, downtime should be much shorter than previous scheduled shutdowns.

01-September-2025 (Updated Sept. 2)

Update: Most HPCC services should have resumed normal operations, including email support. Any emails sent between approx. 5pm Sept 1st and 8AM Sept 2nd should be resent as they were likely lost.

On the evening of September 1st, the machine that handles storage for various virtual machines was found in a locked up state. This caused many of the core HPCC services (including DNS, Slurm, and importantly email) to become unresponsive. HPCC Administrators are on site diagnosing the issue and are attempting to bring servers back online as soon as possible.

In the meantime, since email support might be flakey, our main form of communication will be via our Slack Workspace: https://ucr-hpcc.slack.com

15-Aug-2025

11:00 AM: The Slurm reservation block has been lifted. Computing jobs are running again on the cluster, and we are bringing more and more nodes back online. We are optimistic that all services will be back online today.
3:00 PM: Almost all services are back online, with the exception of few non-essential nodes. Again, sorry for the additional time it took to get back to normal.

14-Aug-2025

12:00 PM: The electrical work has been completed by the construction company. To avoid additional shutdowns, we are deploying as many maintenance upgrades as possible during this shutdown (e.g. file system, network, Slurm, etc), and we were trying to complete them within the planned shutdown time.
6:00 PM: At the moment, the headnodes are available for data access and light work. Compute nodes will be functional in the next 24 hours. We apologize for the delay in bringing all services back online.

12-August-2025 (Updated Aug 7)

Start: 9:00AM, Tuesday August 12th
End: evening, Thursday August 14th

The upcoming HPCC shutdown, due to major electrical upgrades in our server room (install of a 500kW UPS), were again delayed by the construction company. Unfortunately, these changes are out of our control. The final schedule for the shutdown has been set to Aug 12th-14th. The HPCC apologizes for these scheduling changes. These electrical upgrades are extremely important for the long-term health of UCR’s research computing infrastructure.

18-June-2025

2:45 PM: Power is restored and ssh into our head nodes is working again. More detailed information to follow…
12:45 PM: Serveral buidlings on campus including SOMED1 (where HPCC server room is located) lost electrical power. Updates will be posted here and/or on Slack.
3:00 PM: All HPCC Services should be available again. Please reach out to support if you find anything misbehaiving, and stay up to date with the remaining ITS outages on the TechAlerts page.

14-May-2025

1:00 AM: There was a power outage affecting several buildings on campus, including our server room in SOMED1 at approximately 1AM. Until the larger 500kW UPS will be installed in late summer, not all of our computing nodes are battery backed up. However, our core infrastructure is UPS backed up, including storage, administration nodes, etc. Our newer “epyc” and modern GPU nodes (gpu[06-09]) are also battery backed up and should not have lost any jobs. Nodes in the “intel”, “batch”, and “highmem” partitions, and nodes gpu[01-05], are not on battery backup at the moment, so any jobs running on them at the time of the outage were lost and will need to be resubmitted. We’re sorry for this disruption, and we’re working on bringing the remaining nodes back online.
9:00 AM: Most or all machines and services should be operational again. Jobs that were impacted should be requeued automatically.

17-Mar-2025

8:00 AM: We are still working on the getting the internal network configured in the new server room. We are optimistic to finalize the configuration later today. Sorry again for these delays.
4:30 PM: Headnodes are available for user login again!!! We’re now working on bringing up some compute nodes. It’s unlikely that all compute resources will come online by tonight.
8:00 PM: A significant portion of compute and GPU nodes have been brought online. The remaining nodes will follow when we’ve properly assessed power/network load balancing and server room cooling capacity. OnDemand should be functional. Other secondary services like JupyterHub and license servers will come online tomorrow.

15-Mar-2025

The cluster shutdown is unfortunately running behind schedule, due to unexpected delays in the moving process. We apologize for the extended shutdown, and the negative impact this will have on your research.

1:30 AM: Core HPCC infrastructure booted successfully.
10:00 AM: Sysadmins are on site to continue work on restoring cluster functionality.
9:00 PM: Sysadmins have concluded their efforts for the weekend. Network difficulties hindered our progress.

10-Mar-2025 to 14-Mar-2025

Move into New Server Room: The HPCC’s long-awaited move to a much larger and newly renovated server room in the SOM-ED1 building is finally happening! This major improvement will unfortunately require about 5 days of downtime to move all HPCC systems to the new server room, and redeploy and test them. We apologize for the disruption this will cause to your research. This is a rare and unique event (once every 20+ years) that requires extensive organization and coordination between multiple parties and is a major logistical undertaking.

Benefits: The relocation of the HPCC will result in considerably more rack space, electrical and cooling capacity, and more dependable system operation, allowing us to accommodate substantially more CPU/GPU computing resources and data storage now and in the future. These enhancements will translate to significant benefits for all users.

Schedule: Starting the morning of March 10th, users' access will be stopped, and we will begin shutting down all running jobs and services (see below). Any computing jobs that will still be running during the shutdown will need to be terminated by us. Users can restart or re-submit them after the move. Most services will be available again in the evening of March 14th.

Services Impacted: All services hosted by the HPCC will be down during this period. This includes, but is not limited to cluster access (SSH, Slurm, data access), including HTTP access to user data; as well as services provided by OnDemand; RStudio/Posit, MATLAB, Jupyter, etc. Please note, email sent to support@hpcc.ucr.edu will also not work (only direct email). During the move, HPCC’s Slack workspace (https://ucr-hpcc.slack.com) will be the main means of group communication with users.

Notes: As we get closer to the shutdown you might see your jobs get queued with the reason “ReqNodeNotAvail, Reserved for maintenance”. This will happen when the runtime of your job will overlap with the shutdown. If your job can tolerate a shorter runtime, please adjust it to not overlap. If the job requires a longer runtime, then it will need to wait until after our servers are back online.

23-Feb-2025: AC Issues

8:00AM: The condensers on the roof of the Genomics building both lost power, which caused both CRAC units in our server room to stop cooling. Facilities is on site and are working to bring the AC units back online.
9:00AM: Facilities was able to bring one of the condensers back online, and are working on bringing one of the AC units back to alleviate some of the heat. To prevent our power infrastructure from overheating, we’ve had to kill running jobs on the cluster. We apologize for this inconvenience.
12:00PM: Emergency repairs on the AC system are complete, and load-shedded servers have been brought back online. Thank you for your patience.

02-Nov-2024: AC Issues

11:00AM: One of our 2 redundent AC units has began blowing hot air into the server room. While facilities diagnoses and resolves the issue we have paused the Slurm queue. We are sorry for the interruption.
12:00PM: Update: to avoid overheating the power in the HPCC had to be shut down. This affects all HPCC services hosted from the Genomics Server room, including SSH login, Slurm jobs, and e-mail ticket system. We’re working on bringing services up safely and quickly.
4:00PM: Currently, the cluster is still down because of a problem with one of the AC units. Physical plant and the systems administrator are working hard to bring everything back online again. As of now, it is hard to tell how long it will take until things are back to normal again. Please be patient and check the alert page here as well as your Slack messages.
5:00AM (Nov 3): Headnodes are available for login. The /rhome and /bigdata storage systems are fully functional, so you may retrieve your files and do some light work. A limited number of compute and GPU nodes may come online depending on sysadmin discretion and cooling constraints. AC repairs will begin on Monday at the earliest, and these repairs must be completed before we can bring the cluster back fully online. Again, we apologize for the disruption.
12:00PM (Nov 3): A small number of nodes from each partition have been made available. We will slowly release more nodes as we monitor the temperature and work on the AC unit continues.
12:00PM (Nov 4): A temporary fix is in place as facilities awaits parts for a complete fix. We are slowly increasing the number of available compute nodes, though with limited job runtimes. Currently any job with a runtime that will end before Nov 10 at Midnight will run. As temperatures in the room hold, we will increase the time limit and number of nodes in operation.
11:00AM (Nov 7): Facilities is in the process of obtaining the equipment necessary for the repairs. We’ve been told by them that the repair will not require shutting down the AC units, so we can resume operations as normal. We have removed the existing reservation, so long-running jobs may begin. We will continue to release the remaining offline nodes throughout the day as well.

24-Jun-2024: AC Work Completed The AC repairs have been completed by facilities. Over the next few hours we will slowly begin to bring nodes back online.

22-Jun-2024: Update on AC Unit Problems

UCR Facilities was able to return the affected AC unit to working order. However, the unit is still in suboptimal condition, and will have to be powered off for full maintenance on Monday (Jun 24). Until then, HPCC will have to limit the capacity of the Slurm cluster. This compromise will allow maintenance to be safely performed without resorting to a full emergency cancellation of all jobs.

21-Jun-2024: Network Outage and HVAC Problem

Starting 5:15 PM the network connection to the HPCC server room is down. This might be related to a larger network problem on campus, see here. Update 7:30 PM: the network connection is working again, but as it turns out there also is a problem with one of the HVAC units in the server room. It is still unclear how the two problems are connected. Currently, facilities is working on the HVAC problem. To avoid overheating, the Slurm queue has been paused by the sys admins. Current jobs will continue to run, while pending jobs will need to wait until the situation improves.

CANCELLED: June 14th Shutdown Ahead of Genomics Electrical Maintenance

Starting on Friday, June 14th at 8am and extending to Saturday, June 15th at 11pm UCR HPCC will be powering down the cluster ahead of a scheduled electrical shutdown of the UCR Genomics Building. To make the most of this downtime, HPCC will offline the cluster on the 14th in order to install routine software updates and perform other minor maintenance tasks. During the shutdown, most of our online services hosted from the UCR Genomics Building will be unavailable including, but is not limited to: SSH, Slurm, Rstudio, JupyterHub, OnDemand, and web file access. Please save your work on any of these services before the maintenance window. E-mail support may be temporarily interrupted, but should otherwise remain online. We recommend checking our Slack channel for any minor status updates during the shutdown.

If you submit a Slurm or OnDemand job that extends into the maintenance window, you will receive an error containing “ReqNodeNotAvail”. Your job will queue, but will not start, until maintenance is over. If you want your job to start sooner, cancel the job request, and resubmit your job with a shorter –time duration, such that your job will finish before the maintenance window begins.

We apologize for the disruption to your research and teaching workflows. Thanks for your understanding.

7-May-2024: AC Unit Repairs, Slurm Paused

After investigating a High Temperature warning on one of our AC units, campus Facilities determined a component in one of the AC units appears to be faulty and needs to be replaced. They are planning on receiving the replacement part the morning of the 8th, but in order to do the replacement both AC units will need to be taken offline. We’ve began putting nodes into a “draining” state in an attempt to bring the load on the cluster as low as possible to best manage heat in the server room. Existing jobs will continue to run, but newly scheduled jobs will be put in the queue until we can confirm that the repair has been completed and the AC units are online again. We know this is disruptive to your research and teaching, and we apologize for this development.

Update 1: Facilities was unable to complete the job in their time allotted for today, and will continue work tomorrow the morning (May 9th).

Update 2: While performing the repair, facilities identified a leak which needs to be patched before refilling the refrigerant, otherwise any refrigerant would immediately escape. The ETA for this repair, as given to us by facilities, is May 10th.

Update 3: Facilities has contacted us to say that they are finishing up the repairs and that we can begin to start loading the servers again. As such we’ve released ~50% of the nodes to begin running jobs again.

Update 4: Facilities has completed the repairs, and all nodes have been released.

08-Mar-2024 AC Unit Refrigerant Leak, Queue Paused

The UCR HPCC Slurm queue has been paused due to another AC malfunction in the server room. Earlier on Thursday (March 7th) afternoon, UCR Facilities was called in to check an AC alarm and determined that one of our AC units had low refrigerant. While attempting to recharge the AC refrigerant a leak occurred, forcing personnel to evacuate the room for their own safety. They intend to start the repair first thing on Friday (March 8th) morning. But because that AC unit is still not running at full capacity, cooling is limited. As such, the Slurm queue has been paused as a precaution to keep the room within a safe temperature. We understand that this will be disruptive to your work, and we apologize for this development.

Update 1: After speaking with the person performing the repair, they said that the repair could take all day. We will make further announcements as we receive updates.

Update 2: The repair has been completed and the Slurm queue has been released.

22-Feb-2023: Upgrades of OS and GPFS

Starting Thursday, February 22nd at 8:00am and lasting until Friday, February 23rd at 8:00pm, the cluster will be unavailable due to an upgrade of the Operating System as well as our central storage system. Any submitted jobs that overlap with the shutdown time will be queued with the reason “ReqNodeNotAvail” and will need to be requeued with a shorter time or wait until the maintenance is over at which point they will automatically start.

During the shutdown services will be unavailable including, but not limited to: SSH, RStudio, JupyterHub, OnDemand, and web-based file access. If you need additional information or help, you can reach us at our Slack or outside of the shutdown through email (support@hpcc.ucr.edu).

Update: The shutdown had finished and the Slurm queue has been reopened.

19-Jan-2024: AC Unit Repair Followup

Due to followup maintenance required following the January 17th repairs, the slurm queue will once again need to be haulted in order to manage heat output in the server room. Maintenance should only take a few hours, after which the queue will be released again.

Update: The repair has been completed. Nodes will be resumed and queue opened.

17-Jan-2024: AC Unit repair

The AC units in the HPCC server room are experiencing issues. To keep the operating temperature within safe limits, the job queue has been halted. New and pending jobs will stay queued, while currently running jobs will be allowed to finish. We apologize for the inconvenience.

Update: The maintenance has completed, but a followup repair will be required in order to return to 100% operation of the AC units.

14 Aug, 2023: Login and GPFS storage issues

The login service on the primary headnode failed on Monday afternoon. A new headnode with updated software is now online, so users may access their files and do some light work. The GPFS cluster file system also exhibited some performance stalls. The Slurm scheduler has been paused until this issue subsides.

4-Aug-2023: Upgrades of OS, GPFS, Slurm, Head Nodes and Power Distribution

HPCC staff will be performing maintenance work on Aug 4th to Aug 5th that will require a shut down of the HPCC cluster. Objectives for this event include the following upgrades: GPFS, Slurm, OS upgrade to Rocky Linux 8.8, electrical reconfiguration of PDUs, and various other maintenance tasks.
Update Aug 6th: Due to underestimated workload on planned maintenance tasks, services are still offline. Basic functionality should be restored by the end of the day.
Update Aug 7th: Maintenance is still ongoing. Currently, we are redeploying computer nodes and other services. User login is restricted until the central data storage system has been re-mounted.
Update Aug 8th: SSH login has been restored. Users can access their data and perform light work on the head nodes until access to a larger number of compute nodes has been restored. In addition, web-based file sharing and JupyterHub are available again. RStudio Server will be restored next.

Update Aug 9th: Slurm has been redployed and is operational. A larger number of computer nodes have been redeployed and are available to users again.
Final update Aug 11th: All remaining services are available again. The maintenance is complete. This includes remaining CPU and GPU nodes, etc. We apologize for the extended time it took us to bring all services online again. Thank you for your understanding.

17-18 Jul, 2023: Slurm halted

Facilities had to perform maintenance on the AC units due to the formation of condensation water by the AC units, possibly caused by hot summer days. During the maintenance Slurm jobs had to be halted to avoid overheating. Running jobs will be allowed to continue, provided that the server room does not get too hot.

10-Apr-2023: Bigdata back

5:30 PM - Bigdata is back. Thank for for your patience.

10-Apr-2023: Bigdata down

4:30 PM - Bigdata is currently down. Please be patient…

22-Dec-2022: Network outage

9:00 AM - Due to a network outage the cluster was inaccessible for several hours.

1-Nov-2022: Network router repair

9:00 PM - ITS had to repair a router in the Genomics Building. Around 4:00 AM in the morning (Nov 2nd) network access to the Genomics Building became available again. During the affected time window the cluster was not accessible (e.g. via ssh). Processes running on the cluster were not affected.

25-Jun-2021: Bigdata storage repaired

5:00 PM - Server running our bigdata storage have been recovered, and all functions of bigdata directory is now back to normal.

25-Jun-2021: Bigdata storage failed

3:30 PM - Server running our bigdata storage crashed, and bigdata directory went down with it.

12-Jan-2020: AC unit repaired

5:00 PM - AC repairs have been completed. The reservation has been removed, and new Slurm jobs are now no longer suspended.

11-Jan-2020: AC unit failed

3:00 PM - One of our AC units is under emergency repairs. A Slurm reservation was put in place to suspend new jobs from running.

5-6 May, 2023: Maintenance and Electrical Power Upgrades in Server Room

UCR Facilities Services will be upgrading our electrical capacity in the Genomics server room. To take advantage of the unavoidable system downtime, the HPCC will perform general maintenance and various upgrades on the cluster. The shutdown is expected to last 2 days. It has been scheduled during a weekend to minimize disruptions for users.
Update: since Sat/Sun night, most systems are back online again. If users notice any missing functionality, please let us know at support@hpcc.ucr.edu, or at our Slack channel (https://ucr-hpcc.slack.com). Thanks you for your patience and understanding.

28-Oct-2020: Cluster jobs failed due to storage suspension

3:00 PM - During a routine extension of the bigdata filesystem, there were some complications and disk i/o had to be suspended.
5:30 PM - We have repaired the issue, and everything should be functioning as usual. However, this means that all computing jobs running during timeframe were stopped and will need to be restarted.

19-Aug-2020: Cluster inaccessible due to power outage in Genomics Bdg

11:30 PM - All systems were restored by Jordan Hayes and are opterational again.
10:30 PM - HPC systems admin Jordan Hayes is trying to restart the network, storage and cluster again.
10:00 PM - Facilities was able to bring up the power and cooling again.
8:30 PM - Facilities is investigating and trying to reactivate power and cooling.

10-Aug-2020: Cluster inaccessible due to power outage in Genomics Bdg

At 5:10 PM: Facilities has restored power and cooling systems in the server room. HPC systems admin Jordan Hayes is restarting the cluster and storage systems. At 10:10 PM: All HPCC services were restored (computing cluster, storage systems, web services).

22-Mar-2020: Cluster inaccessible due to campus-wide network outage

Due to a campus-wide network outage at UCR, many HPCC services were not accessible between 8:00 AM and 1:00 PM. Currently, most HPCC services are accessible again. Note, running jobs on the cluster should not have been affected by this disruption. Updates about the current situations can be found here.

13-Mar-2020: Routine maintenance shutdown

We have scheduled an HPCC Cluster Maintenance Shutdown for Friday, March 13th. This will require a shutdown of ALL services: Slurm (queuing system), hosted websites, virtual environments/machines, storage systems, backup systems, and network services. We ask that you please make sure that you do not have any jobs running in the queue, and that you completely logout of the cluster (pigeon, pelican, parrot) before the shutdown.

08-Jan-2020: Storage outage

We had some issues with our storage systems this evening that may have caused disruptions in your work. These issues should be resolved. We’re continuing to monitor the situation to ensure everything is operational, and we apologize for any inconveniences this may have caused. Please let us know at support@hpcc.ucr.edu if you require any assistance regarding job status and recovery.

21-Nov-2019: Routine filesystem maintenance and diagnostics

We have scheduled an HPCC Cluster Maintenance Shutdown for this Thursday, November 21st. This will require a shutdown of ALL services: Slurm (queuing system), hosted websites, virtual environments/machines, storage systems, backup systems, and network services. We ask that you please make sure that you do not have any jobs running in the queue, and that you completely logout of the cluster (pigeon, pelican, parrot) before the shutdown.

23-Aug-2019: Routine maintenance shutdown

We have scheduled an HPCC Cluster Maintenance Shutdown for Friday, Aug 23, 2019. This will require a shutdown of ALL services: Slurm (queuing system), hosted websites, virtual environments/machines, storage systems, backup systems, and network services. We ask that you please make sure that you do not have any jobs running in the queue, and that you completely logout of the cluster (pigeon, pelican, globus) before the shutdown. Status: completed. For user changes related to this maintenance please see here.

01-Mar-2019: Routine Maintenance Shutdown

We have scheduled an HPCC Cluster Maintenance Shutdown for Friday, March 1st. This will require a shutdown of ALL services: Slurm (queuing system), hosted websites, virtual environments/machines, storage systems, backup systems, and network services. We ask that you please make sure that you do not have any jobs running in the queue, and that you completely logout of the cluster (pigeon, pelican, globus) before the shutdown. Status: successfully completed.

1:00 PM, 20-Dec-18: Outage due to AC failure

All systems were down for 3 hours due to a failure of the AC units in our server room. Electricians and AC technicians have repaired the units.

2:30 PM, 11-Jul-18: Storage Issues

For the past several weeks we have been observing slower storage access. In some cases the /bigdata storage was inaccessible for several minutes and caused some jobs to terminate prematurely. We have identified the issue and have taken steps to ensure that this problem does not reoccur.

6:00 PM, 02-Jul-18: Storage Issues

Storage issues on the afternoon of July 2, 2018 caused disruptions in some cluster services. The issues should be resolved, but we’re continuing to monitor the situation for any other developments.

12:00 AM, 31-Jan-18: routine maintenance shutdown

For routine maintenance and upgrades we have scheduled an HPCC (Biocluster) shutdown for 12:00AM, Jan-31-2018 to 12:00AM, Feb-01-2018. (complete)

12:00 AM, 05-Dec-17: NFS & SMB issues

NFS and SMB services have been suspended temporarily. This will cause many of our web services to not function properly. These include, but not limited to:

https://rstudio.bioinfo.ucr.edu & https://rstudio2.bioinfo.ucr.edu
https://galaxy.bioinfo.ucr.edu
https://dashboard.bioinfo.ucr.edu
https://biocluster.ucr.edu/~username (.html directories)
mysql://bioclusterdb.int.bioinfo.ucr.edu (databases)

Note, this issue was resolved soon after it occurred.

11:00 AM, 13-Aug-17: Cooling problem

Since Sat morning one of the HVAC units is not working properly. To avoid overheating, we have shut down most of the idle nodes (1:30PM, Sun). As soon as the HVAC unit is repaired we will power these nodes back on. Note, this issue was resolved on 17-Aug-17. UCR facility services has repaired the broken HVAC unit and serviced the second one.

12:00 AM, 16-Jun-17 to 17-Jun-17: maintenance shutdown

To sustain future growth, the power load in the HPCC server room needs to be optimized. For this we have scheduled an HPCC (Biocluster) shutdown in four weeks from now which will start at noon on June 16th and last until noon June 17th. This will require a shutdown of ALL services: Slurm (queuing system), hosted websites, virtual environments/machines, storage access, backup systems and network services. We ask that you please make sure that you do not have any jobs running in the queue, and that you completely logout of the cluster (pigeon, owl, penguin, pelican, globus) before the shutdown.

10:02 AM, 13-Apr-17: UPS failure

Our UPS unit went down some time last night causing a power failure on all systems. Jordan is bypassing the UPS to bring things back up in the next few hours. Nationwide Power will come in asap to repair the UPS. Note, this failure has not resulted in any overheating problems since the AC units are running on a different power cricuit.

11:22 AM, 13-Apr-17: Cluster back up running

Situation is resolved for now and things are working. We are currently discussing the situation with our electricians to avoid future instances.

Unannounced exceptions

None currently observed.

Standard Operating Procedures

SOP for unscheduled outages

When unforeseen issues arise they are categorized by severity:

Green - Normal operation, no current issues
Yellow - Minor issue[s], likely not observed by users (ie. jobs are not affected)
Orange - Medium issue[s], likely observed by users but not fatal (ie. jobs may perform slower than usual)
Red - Critical issue[s], major service or entire cluster is not functioning as expected (ie. jobs have terminated prematurely)

Email notifications are only sent to users if there is a Red critical issue.

SOP for scheduled shutdowns

The following outlines the timeline for advance email notifications on scheduled shutdowns of the HPCC cluster and other exceptions:

Four weeks advance notice
Followed by weekly reminders
Final reminder the day before the outage

Twitter feed

For additional news and information, please consult the HPCC Twitter site. Also see the Tweets window at the bottom of this and other pages of the HPCC website.

Team collaborations with Slack

Last modified June 17, 2026: alerts.md - Move Scheduled section above Unscheduled section (8cdad060b)