Modern machine learning with supercomputers

What does it take to provide the most powerful AI system possible? For those of us working in supercomputing centres around the world, “most powerful” is what we aim for all the time. After all, pushing the limits of computing is what we are paid to do.

The world of supercomputing is moving into a GPU-accelerated future in which machine learning workloads play an increasing role. Supercomputers are not just faster versions of ordinary computers. Nor are they merely (literally) tons of normal computers piled on top of each other. The core idea of supercomputing is to maximise the amount of computing capacity you can harness to solve a single real-world computational problem. So there are indeed tons of “normal computers”, but they need to be interconnected in a way that is far from normal. This is also the context in which GPUs are introduced into supercomputing, or as it is more formally called, high-performance computing (HPC).

We at CSC – IT Center for Science are tasked with providing one of the three EuroHPC supercomputers. The system installed at CSC’s data center will be called LUMI. Its computing power will come largely from GPUs, and the whole system is designed with AI, machine learning and data analytics in mind. LUMI will be a huge step up in computing capacity, and our job is to make it a step up in user friendliness as well. For years we have been following the development of the best software frameworks and implementing them for high-performance computing workloads. Right now we are planning the machine learning support and tools we can provide to LUMI users.

With the boom of deep neural networks we have seen several new machine learning frameworks emerge, including TensorFlow, Keras and PyTorch. As machine learning has matured in the corporate world, it has led to the development of another layer of frameworks. These higher-level machine learning frameworks range from comprehensive platforms such as Kubeflow to more focused supporting frameworks such as MLflow. Broadly speaking, they all include features that let you manage your jobs, create automated workflows, and handle datasets and model files; in general, they abstract away the physical computing environment you are working with. In addition to the development of machine learning models, they are also used for “MLOps”, i.e. managing the production machine learning lifecycle. Often these frameworks are available as open source, but there are also closed-source offerings such as the Valohai platform.

When using HPC systems you are typically confronted with a batch processing system, such as Slurm. Batch processing systems are there to maximise the utilisation of the system: you submit your workload split into smaller jobs, and the system schedules jobs from many users very efficiently. After all, supercomputing systems cost tens to hundreds of millions of euros, so your workload might occupy a pretty expensive chunk of computing capacity. Therefore it is important to maintain a high utilisation rate.
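
To make the batch workflow a little more concrete, here is a minimal sketch of how a training job could be described in Python and rendered into a Slurm batch script. The `TrainingJob` class, the `render_sbatch_script` helper, the job name and the `train.py` command are all hypothetical names made up for illustration. The resource values are arbitrary, and a real system will usually require site-specific options (such as a partition or project account) that are left out here.

```python
from dataclasses import dataclass


@dataclass
class TrainingJob:
    """Hypothetical description of one batch job (illustrative only)."""
    name: str
    nodes: int
    gpus_per_node: int
    time_limit: str  # wall-clock limit in Slurm's HH:MM:SS format
    command: str     # the program to launch on the allocated nodes


def render_sbatch_script(job: TrainingJob) -> str:
    """Render the job as the text of a Slurm batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job.name}",
        f"#SBATCH --nodes={job.nodes}",
        f"#SBATCH --gpus-per-node={job.gpus_per_node}",
        f"#SBATCH --time={job.time_limit}",
        "",
        # srun launches the command inside the allocation Slurm grants.
        f"srun {job.command}",
    ]
    return "\n".join(lines) + "\n"


# Example values are assumptions, not recommendations for any real system.
job = TrainingJob(
    name="train-demo",
    nodes=2,
    gpus_per_node=4,
    time_limit="01:00:00",
    command="python train.py",
)
script = render_sbatch_script(job)
print(script)
```

The resulting text would typically be saved to a file and handed to the scheduler with `sbatch`. From there on, the batch system decides when and where the job runs, which is what lets it keep the utilisation of the whole machine high.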

Batch processing systems are really good at what they do. Many of the higher-level machine learning frameworks provide similar functionality, but implemented on a more modern stack, namely Kubernetes. However, supercomputers typically need to cater for a large variety of use cases outside the machine learning realm, and running the whole system on the relatively new and limited job scheduling features of modern machine learning frameworks is not a realistic possibility. Furthermore, the out-of-the-ordinary interconnect technologies used in supercomputers are not properly supported by commodity software stacks. Finally, you need to consider backwards compatibility and the huge legacy of HPC software that is not written for Kubernetes. From these observations we can derive our mission statement: we are looking for a machine learning framework that, while running on Kubernetes and using its full flexibility, also integrates beautifully with batch processing systems to allow you to run on really big iron.

Authors:

Juha Hulkkonen, CSC: The author is a data engineering and machine learning specialist in CSC’s data analytics group, working with machine learning and big data workflows.

Aleksi Kallio, CSC: The author is the manager of CSC’s data analytics group, coordinating development of machine learning and data engineering based services.

Markus Koskela, CSC: The author is a machine learning specialist in CSC’s data analytics group, working with various machine learning applications and computing environments.

Mats Sjöberg, CSC: The author is a machine learning specialist in CSC’s data analytics group, working with various machine learning applications and computing environments.

This blog post was originally published on CSC’s website.