The monitoring solution developed by HPCNow! that helps to improve the efficiency of HPC clusters

Barcelona, February 8th 2023 – Scientists and engineers at HPCNow! have developed a solution to monitor the status of HPC clusters in real time. The monitoring stack combines open-source tools such as Grafana, Elasticsearch and Prometheus, for visualization and data storage, with Slurm plugins and customized scripts that gather all the information needed by the system administrator. The solution is delivered using Docker Compose for single-node monitoring scenarios, or using Docker Swarm if the customer requires high availability. It also includes the dashboards needed to display the gathered information, among them:

Slurm jobs: accounts for all Slurm jobs over a period of time.
Job detail: shows the details of each job (submission, start and end dates, CPUs used and their efficiency, memory used and its efficiency, Slurm script, etc.).
Slurm accounting: general overview of the HPC workload.
Job efficiency monitoring (CPU and memory): resources requested, used and wasted.
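
As an illustration of the kind of figures a job efficiency dashboard can report, the following minimal Python sketch computes CPU efficiency, memory efficiency and wasted core-hours for a single job. The formulas are common definitions and are an assumption here, not necessarily those used by the HPCNow! solution; the JobUsage record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobUsage:
    """Hypothetical per-job record; field names are illustrative only."""
    cpus_allocated: int       # CPUs requested and allocated to the job
    elapsed_seconds: float    # wall-clock run time of the job
    cpu_seconds_used: float   # total CPU time actually consumed
    mem_requested_mb: float   # memory requested at submission
    mem_peak_mb: float        # peak memory actually used


def cpu_efficiency(job: JobUsage) -> float:
    """CPU time used divided by CPU time available (assumed definition)."""
    available = job.cpus_allocated * job.elapsed_seconds
    return job.cpu_seconds_used / available if available > 0 else 0.0


def memory_efficiency(job: JobUsage) -> float:
    """Peak memory used divided by memory requested (assumed definition)."""
    if job.mem_requested_mb <= 0:
        return 0.0
    return job.mem_peak_mb / job.mem_requested_mb


def wasted_core_hours(job: JobUsage) -> float:
    """Core-hours that were allocated but not consumed."""
    available = job.cpus_allocated * job.elapsed_seconds
    return max(available - job.cpu_seconds_used, 0.0) / 3600.0
```

For example, a job that holds 8 CPUs for one hour but consumes only two hours of CPU time has a CPU efficiency of 25% and wastes 6 core-hours, which is the type of case the dashboard is meant to surface.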

The HPCNow! monitoring solution is flexible: it is tailored to each customer's needs in terms of availability, variables to monitor and visualization.

This new technology is a must for institutions that face cluster congestion, want to maximize their return on investment, or need to keep their cloud bursting budget under control. It also helps the HPC center define what counts as reasonable resource usage and educate users who allocate more resources than they need.

More information:

– Improving efficiency in HPC clusters using monitoring tools
