HyperQueue facilitates better utilization of LUMI’s computational resources
Development marches on inexorably, and the supercomputers that topped the Top500 list ten years ago are largely forgotten. New supercomputers are far more powerful and can contain over a hundred cores in a single compute node. Yet running certain massively parallel workloads on them can be complicated, which risks leaving these machines underutilized. The solution is HyperQueue, a tool developed by scientists at the Czech IT4Innovations National Supercomputing Center, one of the LUMI consortium partners. The HyperQueue tool organizes computations efficiently to solve scientific problems.
IT4Innovations is involved in the LIGATE project, which aims to use European supercomputers for drug design. This involves Computer-Aided Drug Design (CADD) using cutting-edge supercomputers and, in the future, also exascale systems. Specifically, the LIGATE project deals with screening vast quantities of molecules, which is the kind of task that can benefit from exploiting the full potential of compute nodes. Thus, scientists from IT4Innovations at VSB – Technical University of Ostrava have delivered a unique solution in the form of HyperQueue. This tool allows a large number of computational tasks to be run efficiently and easily on modern heterogeneous supercomputers. What exactly does this mean?
– Modern supercomputers are characterized by combining different computer architectures and containing a large number of heterogeneous resources. Using them efficiently with traditional computational tools can be difficult, explains Jan Martinovič from the Advanced Data and Simulation Lab at IT4Innovations.
– That’s why we created the HyperQueue tool, which simplifies the use of supercomputers with complex resources and provides a simple interface for entering computational tasks. At the same time, it can efficiently use the available computational resources of a supercomputer, he adds.
Additionally, according to Branislav Jansík, IT4Innovations’ Supercomputing Services Director:
– HyperQueue has been successfully deployed and tested on several supercomputers with different hardware architectures. These include the EuroHPC supercomputer Karolina, operated by IT4Innovations, the Czech National Supercomputing Center, and Europe’s most powerful supercomputer, LUMI, operated by the Finnish CSC – IT Center for Science.
– HyperQueue has been immensely useful in scaling up existing workflows run by researchers, requiring little to no changes to their code. Working with HyperQueue is very straightforward. We have even added native support for HyperQueue to the bioinformatics workflow manager Nextflow so that users can benefit from HyperQueue when running their genomics workflows without even having to learn HyperQueue. It’s nice to encounter a tool that plays nicely with the system scheduler and does not negatively affect system stability when doing comprehensive workflows, says Henrik Nortamo, Applications Specialist at CSC.
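As a concrete illustration of the Nextflow integration mentioned above, Nextflow documents a HyperQueue executor that can be selected without changing the pipeline itself. The sketch below is a hedged example, not an official recipe: the pipeline file name is illustrative, and exact option spellings may vary between Nextflow and HyperQueue versions.

```shell
# Start the HyperQueue server (e.g., on a login node).
hq server start &

# Run an existing Nextflow pipeline, overriding the executor from the
# command line so tasks are dispatched through HyperQueue.
# "my_pipeline.nf" is a placeholder for your own workflow file.
nextflow run my_pipeline.nf -process.executor hyperqueue
```

The point of the integration is that users keep their genomics workflows unchanged and only swap the executor setting.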
HyperQueue is also being deployed on supercomputers at the Italian CINECA as part of the LIGATE project. It has the potential to become a primary tool for efficiently scheduling large numbers of jobs that, individually, could not otherwise use the full capacity of a supercomputer’s compute node.
– We found HyperQueue an easy-to-use tool that simplifies deployment on novel HPC machines, making the use of resources more efficient for workloads composed of many small tasks. This is exactly the case for the in-silico virtual screening application we are developing in the context of the LIGATE project. It has already been used to help fight the COVID-19 pandemic, added Gianluca Palermo from Politecnico di Milano, the Technical Manager of the LIGATE project.
Modern HPC clusters contain a large number of heterogeneous resources that provide vast amounts of computational power. Designing monolithic programs that can leverage that performance potential effectively (e.g., by scaling to hundreds of cores) is challenging, so HPC users often design their computational workflows as a set of smaller, interdependent tasks that each use only a fraction of the resources of a single cluster node. Yet executing such workflows on HPC clusters under job managers such as Torque/PBS or Slurm can be difficult. These managers can impose limits on the concurrent execution of multiple tasks on a single node, hampering node utilization, and their design is generally not suited to enormous numbers of smaller, less resource-intensive tasks, which can overload the manager.
HyperQueue is an HPC task execution framework that solves this problem. It allows users to submit tasks in a simple way, outside of a computational job. HyperQueue then asks the job manager for computational resources and executes the tasks on all available compute nodes. It uses a sophisticated scheduler to load balance the tasks while considering arbitrary resource specifications and current node utilization. As an example use case, it is trivial to define a computation with many tasks that each use a small number of cores and execute it on a cluster with very powerful nodes (e.g., 128 cores) while achieving very high node utilization out of the box.
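The workflow described above can be sketched as a short command-line session. This is a hedged illustration based on the HyperQueue CLI: the time limit, partition name, core count, and task script are all placeholders, and exact flag spellings may differ between HyperQueue versions.

```shell
# 1. Start the HyperQueue server, e.g., on a login node.
hq server start &

# 2. Ask HyperQueue's automatic allocator to request workers from
#    Slurm on our behalf (time limit and partition are illustrative;
#    arguments after "--" are passed through to Slurm).
hq alloc add slurm --time-limit 1h -- --partition=standard

# 3. Submit many small tasks; HyperQueue packs them onto the allocated
#    nodes. Here each task asks for 4 cores, and "compute_task.sh" is a
#    placeholder for the user's own script.
hq submit --cpus 4 ./compute_task.sh

# 4. Monitor progress.
hq job list
```

Because the server submits the actual Slurm allocations itself, the job manager only ever sees a handful of large jobs, while HyperQueue handles the fine-grained scheduling of thousands of tasks inside them.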
For more information, see: https://github.com/It4innovations/hyperqueue