All critical services are isolated by a Xen-based paravirtualization layer, which provides the flexibility and scalability needed to meet customer requirements. For example, this structure makes it possible to isolate a user who is misusing the cluster without affecting the stability of the whole system. In addition, the technology provided by sNow! allows live, instantaneous migration of all services from one hypervisor to another without compromising or degrading them at any time. This is important for critical upgrades or hardware replacement on the administration server(s) of the cluster, and it helps ensure high availability.
Therefore, in scenarios where services cannot be interrupted, an update of the batch queue manager or a firmware upgrade of the InfiniBand HCAs on the cluster nodes no longer requires any downtime. Instead, it becomes a progressive upgrade process that can be completed in anything from a few hours to a few weeks, depending on the complexity of the task, the length of the running jobs, and the checkpoint-restart capabilities needed to migrate jobs to another node.
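The progressive upgrade described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the sNow! implementation: the function name `rolling_upgrade`, the parameters `max_offline` and `upgrade_hours`, and the hour-by-hour model are all hypothetical. It assumes that a node stops accepting new jobs once the upgrade begins, that a node can only be upgraded after its running job has finished, and that only a limited number of nodes may be out of service at once.

```python
def rolling_upgrade(job_hours_left, max_offline=1, upgrade_hours=1):
    """Simulate a progressive, no-downtime upgrade hour by hour.

    job_hours_left: dict mapping node name -> hours until its running
                    job finishes (0 means the node is already idle).
    max_offline:    how many nodes may be out of service at once
                    (hypothetical policy knob).
    upgrade_hours:  how long one node's upgrade takes (assumed constant).

    Returns a list of (hour, node) pairs, one per completed upgrade.
    """
    if max_offline < 1 or upgrade_hours < 1:
        raise ValueError("max_offline and upgrade_hours must be >= 1")

    draining = dict(job_hours_left)  # nodes still waiting to be upgraded
    in_progress = {}                 # node -> upgrade hours remaining
    done = []
    hour = 0

    while len(done) < len(job_hours_left):
        # Start upgrading idle nodes while the offline budget allows it.
        for node in sorted(draining, key=draining.get):
            if draining[node] == 0 and len(in_progress) < max_offline:
                in_progress[node] = upgrade_hours
                del draining[node]

        hour += 1  # one hour of wall-clock time passes

        # Advance upgrades that are under way.
        for node in list(in_progress):
            in_progress[node] -= 1
            if in_progress[node] == 0:
                del in_progress[node]
                done.append((hour, node))

        # Running jobs on the remaining nodes get one hour closer to finishing.
        for node in draining:
            draining[node] = max(0, draining[node] - 1)

    return done


if __name__ == "__main__":
    # Two idle nodes and one node with a job that still needs two hours.
    for hour, node in rolling_upgrade({"n1": 0, "n2": 2, "n3": 0}):
        print(f"hour {hour}: {node} upgraded")
```

In this model the total duration is driven by the longest running job and by how many nodes may be offline together, which is why the real process can range from hours to weeks. Checkpoint-restart, where available, would shorten the wait by moving a job elsewhere instead of letting it run to completion; the sketch does not model that.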
Detail of the bulletproof system with all available High-Availability options activated.