Unannounced exceptions

Nothing to report at this point.

For the most current information on short-term exceptions affecting the cluster, please consult the HPCC Twitter site. Also see the Tweets window at the bottom of this and other pages of the HPCC website. Additional information and updates on exceptions may be posted here.

Scheduled exceptions and downtimes

To sustain future growth, the power load in the HPCC server room needs to be optimized. For this purpose, we have scheduled an HPCC (Biocluster) shutdown four weeks from now, starting at noon on June 16th and lasting until noon on June 17th. This will require a shutdown of ALL services: Slurm (queuing system), hosted websites, virtual environments/machines, storage access, backup systems and network services. Please make sure that you do not have any jobs running or waiting in the queue, and that you log out of the cluster completely (pigeon, owl, penguin, pelican, globus) before the shutdown.
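One way to check for leftover jobs before the shutdown is to ask Slurm for your own queue entries. The following is a minimal sketch, not an official HPCC tool: the function names `parse_squeue` and `my_jobs` are hypothetical, and it assumes the standard `squeue` command is on your PATH on a login node.

```python
import getpass
import subprocess


def parse_squeue(output):
    """Turn 'JOBID STATE' lines into a list of (job_id, state) tuples."""
    jobs = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # skip blank or malformed lines
            jobs.append((parts[0], parts[1]))
    return jobs


def my_jobs(user=None):
    """List the given user's jobs (default: current user) via squeue.

    Assumes squeue is installed; -h drops the header line and
    -o "%i %T" prints the job ID and the job state.
    """
    user = user or getpass.getuser()
    result = subprocess.run(
        ["squeue", "-u", user, "-h", "-o", "%i %T"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_squeue(result.stdout)
```

Calling `my_jobs()` on a login node should return an empty list once nothing of yours is left in the queue. Any job still listed can be cancelled with `scancel <jobid>`.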

No other downtimes or exceptions are scheduled at this point.

SOP for scheduled shutdowns

Standard operating procedures for scheduled shutdowns are here.

Team collaborations with Slack

Sign up for and use the Slack team collaboration app here: ucr-hpcc.slack

Past exceptions

11:00 AM, 13-Aug-17: Cooling problem

Since Saturday morning, one of the HVAC units has not been working properly. To avoid overheating, we have shut down most of the idle nodes (1:30 PM, Sun). As soon as the HVAC unit is repaired, we will power these nodes back on. Note: this issue was resolved on 17-Aug-17. UCR facility services has repaired the broken HVAC unit and serviced the second one.

10:02 AM, 13-Apr-17: UPS failure

Our UPS unit went down some time last night, causing a power failure on all systems. Jordan is bypassing the UPS to bring things back up in the next few hours. Nationwide Power will come in as soon as possible to repair the UPS. Note: this failure has not resulted in any overheating problems, since the AC units are running on a different power circuit.

11:22 AM, 13-Apr-17: Cluster back up and running

The situation is resolved for now and things are working. We are currently discussing the situation with our electricians to avoid future instances.