Santander, December 2019 — Interviewing David Del Prado Secadas, support and network technician at IHCantabria (University of Cantabria, Spain).
Question: Since when has IHCantabria been using the new job efficiency monitoring tools that HPCNow! implemented?
Answer: We consider monitoring vital in any infrastructure, and we have adopted it across all our systems. That is why we felt the HPC efficiency panels offered by HPCNow! were a perfect fit for our monitoring solution. In addition, this solution has a very important added value: it is fully customizable to the needs of each infrastructure. After some meetings and a few weeks of work, we put it into production and integrated it into our monitoring system in May 2019.
Question: What problems have you been able to identify thanks to these tools?
Answer: The main objective of these panels is to detect poorly sized jobs so that compute resources are not wasted. With these dashboards we have met that objective, because we can quickly detect incorrect use of CPU and RAM. In addition, thanks to the information we collect during monitoring, we can also optimize the configuration of the scheduler (Slurm), detect storage problems, relate accounting to efficiency, and profile the types of jobs run by each project and user. All this information helps us be more efficient and save costs, resources and time.
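As an editorial illustration of what "poorly sized" can mean in practice, the following is a minimal sketch and not the HPCNow! implementation. It assumes per-job accounting values like those Slurm reports (allocated CPUs, elapsed time, total CPU time, requested memory, peak memory). The `Job` class, the function names and the 50% threshold are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    """Hypothetical per-job record; fields mirror typical Slurm accounting values."""
    job_id: str
    alloc_cpus: int      # CPUs allocated to the job
    elapsed_s: float     # wall-clock run time in seconds
    total_cpu_s: float   # CPU time actually consumed, in seconds
    req_mem_mb: float    # memory requested, in MB
    max_rss_mb: float    # peak memory actually used, in MB


def cpu_efficiency(job: Job) -> float:
    """CPU time used divided by CPU time reserved (allocated CPUs x elapsed time)."""
    reserved = job.alloc_cpus * job.elapsed_s
    return job.total_cpu_s / reserved if reserved > 0 else 0.0


def mem_efficiency(job: Job) -> float:
    """Peak memory used divided by memory requested."""
    return job.max_rss_mb / job.req_mem_mb if job.req_mem_mb > 0 else 0.0


def poorly_sized(jobs: List[Job], threshold: float = 0.5) -> List[str]:
    """Return IDs of jobs whose CPU or memory efficiency is below the threshold.

    The 0.5 default is an illustrative assumption, not a recommended value.
    """
    return [
        j.job_id
        for j in jobs
        if cpu_efficiency(j) < threshold or mem_efficiency(j) < threshold
    ]
```

A job that reserves 16 CPUs for an hour but consumes only four CPU-hours would score 25% CPU efficiency under this definition and would be flagged for resizing.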
Question: What impact have you noticed on the computational resources in terms of waiting time, efficiency, successful jobs, etc.?
Answer: Monitoring lets us detect problems and correct them based on objective information. In this way, we have optimized the use of compute resources and now run them at close to 100% efficiency, whereas before we were far from that value. We also generate reports showing the difference in cost and time obtained by applying the configurations recommended after analyzing the efficiency panels. This gives us a significant improvement in project cost and execution time, and makes our use of the cluster resources more efficient.
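To make the idea of a cost and time comparison concrete, here is a small sketch of how such a report could be computed. It is an editorial illustration under stated assumptions, not the reports IHCantabria produces: the `Config` class, the `compare` function and the price per core-hour are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Config:
    """Hypothetical job configuration: CPUs reserved and observed run time."""
    cpus: int
    elapsed_hours: float

    @property
    def core_hours(self) -> float:
        # Core-hours reserved: allocated CPUs multiplied by wall-clock hours.
        return self.cpus * self.elapsed_hours


def compare(original: Config, recommended: Config, price_per_core_hour: float) -> dict:
    """Summarize the difference between the original and recommended configurations.

    price_per_core_hour is an assumed internal accounting rate.
    """
    saved = original.core_hours - recommended.core_hours
    return {
        "core_hours_saved": saved,
        "cost_saved": saved * price_per_core_hour,
        # Positive means the recommended configuration runs longer.
        "elapsed_delta_hours": recommended.elapsed_hours - original.elapsed_hours,
    }
```

For example, a job that moves from 16 CPUs for 10 hours to 4 CPUs for 11 hours frees 116 core-hours while running one hour longer, which is the kind of trade-off such a report would surface.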
Question: Do you think that this tool will help you in the next procurement process?
Answer: The more information you have about your infrastructure, the better the decisions you make in a procurement process. It will therefore be a key element when we start any future procurement, since it gives us a realistic view of the current state of our infrastructure as well as of our supercomputing needs.
Question: Are you exposing these dashboards to your users? (user maturity, better understanding of real needs, etc.)
Answer: We have rolled these dashboards out to users gradually. Because we collect so much information, we didn’t want users to be overwhelmed and eventually stop using the panels because they didn’t understand everything shown. That’s why, in a first phase, we met with them and explained everything the tool could offer, and later we introduced them to each of the panels step by step, always with the cluster administrators on hand to clear up any doubts. We also created customized reports from the data shown in the dashboards to make it easier to understand and to recommend job configurations.
Question: What results (with some examples) have you achieved so far thanks to these new tools?
Answer: Mainly, we have become more efficient in our use of cluster resources. As soon as we implemented the efficiency dashboards, we saw that most jobs were far from 100% efficient; some cases were below 50%. Thanks to these dashboards, we can now detect these cases quickly and correct them, getting as close to 100% efficiency as possible. In addition, by using CPU and RAM efficiently and no longer wasting resources, we have effectively extended the computational life of our cluster. The more efficient the jobs are, the more capacity is available in the cluster and the more computing opportunities users have.
Question: Any other relevant comments?
Answer: An HPC cluster can be a huge black hole when it comes to knowing what happens inside it. That is why we consider it very important to have a tool with these monitoring capabilities. Moreover, being able to adjust it to the needs of each environment, as is the case with the solution provided by HPCNow!, is a very important added value. Right now it is a key element in our daily operation and makes all administration and decision-making tasks much easier, both now and in the future.
Thank you very much for your answers, David!