
Research group created the largest Finnish language model ever with the LUMI supercomputer

Currently, one of the hottest topics in the field of technology is text-generating AI. I last met Sampo Pyysalo, a University Researcher at the Department of Computing at the University of Turku, Finland, and Professor Filip Ginter a year ago when their research team on Natural Language Processing, TurkuNLP, had been granted computing resources from the LUMI supercomputer for the development of Finnish language models. The project was one of almost thirty pilot projects run in the GPU partition of the LUMI supercomputer.

When I meet them again in early 2023, the group is buzzing with quiet satisfaction: a Finnish language model on the level of Generative Pre-trained Transformer 3 (GPT-3) has been completed in the LUMI pilot project, and it was published online for open access only a moment earlier.

– The LUMI pilot project went smoothly from a scientific standpoint, and we achieved more than we had initially dared to hope for, Pyysalo rejoices.


TurkuNLP researchers, University Researcher Sampo Pyysalo (left), Research Assistants Risto Luukkonen and Ville Komulainen and Professor Filip Ginter. Image: CSC – IT Center for Science.

Big and quick technological transformation

These massive language models are needed because they lay the foundation for next-generation AI applications.

The GPT-3 model, based on deep neural networks, predicts the words that are likely to follow a given piece of text. The much-discussed ChatGPT bot is also based on GPT-3, a closed language model developed by OpenAI.
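The idea of predicting the next word can be illustrated with a deliberately tiny sketch. This is not how GPT-3 works internally: GPT-3 is a large neural network that considers long stretches of context, whereas the toy below only counts which word most often follows another in a made-up example sentence. The function names and the sample text are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_counts(text):
    """Count, for every word, which words follow it in the text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Made-up training text, for illustration only.
corpus = "the model reads text and the model predicts the next word"
counts = train_bigram_counts(corpus)
print(predict_next(counts, "the"))
```

A real language model replaces the counting table with billions of learned parameters, but the task is the same: given the text so far, choose a likely continuation.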

– It is likely that the most important AI applications of this decade will be built on these kinds of language models. We are undergoing a pretty big and quick transition right now. The most significant applications have not yet been made, Pyysalo says.

Language-based AI applications are only at the beginning of their journey, but researchers say that the technology is here to stay.

The massive computational power of the LUMI supercomputer, which comes from GPUs (graphics processing units), was used to train an exceptionally large language model. During the pilot project, the group created a GPT-3 model with 13 billion parameters trained entirely on Finnish. This is the largest Finnish language model ever.

Several smaller, purely Finnish language models were also trained during the LUMI pilot project. In addition, the research group taught Finnish to a larger model with 176 billion parameters, based on the pre-trained BLOOM model (BigScience Large Open-science Open-access Multilingual Language Model).

Less swearing and hate speech

The development of language models is based on huge data sets, which are used by deep-learning neural networks to create a new language model. In the LUMI pilot project, the research group also created an identification system that filtered out the most problematic text segments from the data fed into the language model.

– We trained our language model with very high-quality data that meets EU requirements. By classifying different text types, we have a better-than-average understanding of what kind of data the model has read, and we were able to eliminate the most toxic and problematic texts from the model. For example, compared to previous models, we were able to cut the model’s spontaneous swearing in half, Pyysalo illustrates.

– Data pre-processing is a very important part of the training of language models. We eliminated hate speech from the data entered into language models and also deleted any personal data, such as personal identity codes, phone numbers as well as physical and electronic addresses. This way, we control what the language model learns and what it generates when used, Ginter continues.
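As a rough illustration of the kind of clean-up Ginter describes, the sketch below replaces e-mail addresses, Finnish personal identity codes and phone numbers with placeholder tags before text would be passed on for training. It is not the TurkuNLP pipeline: the patterns, the placeholder tags and the function name `redact_personal_data` are assumptions made for this example, and the patterns are deliberately loose.

```python
import re

# Hypothetical patterns for illustration only; not the TurkuNLP pipeline.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose match for a Finnish personal identity code: six digits, a century
# sign (only +, - and A are covered here), three digits, a check character.
ID_CODE_RE = re.compile(r"\b\d{6}[-+A]\d{3}[0-9A-Y]\b")
# Loose phone pattern: optional +, then 8 to 16 characters made of digits,
# spaces or hyphens, starting and ending with a digit.
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d \-]{6,14}\d(?!\w)")


def redact_personal_data(text):
    """Replace e-mail addresses, identity codes and phone numbers with tags."""
    # E-mails go first so that digits inside an address are not mistaken
    # for a phone number; identity codes go before phone numbers for the
    # same reason.
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = ID_CODE_RE.sub("[ID]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


sample = "Soita +358 40 123 4567 tai kirjoita matti@example.com, hetu 131052-308T."
print(redact_personal_data(sample))
```

Simple patterns like these miss many real-world formats and can also catch harmless digit sequences, so filtering at this scale would need considerably more care and validation than the sketch shows.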

Because Finnish is a relatively small language area, international commercial players have comparatively little interest in it. Training open language models outside large companies is important to promote open science.

– The language model we have developed is an open model, which means that it is available to everyone. It is also important to train open language models in the academic community and ensure that Finnish is involved, Ginter states.

Challenges with new technology

The LUMI pilot project did not go entirely without a hiccup. LUMI’s GPU partition is based on AMD’s latest MI250X graphics processors and there was much to learn.

– With the transition to new technology came challenges. Super-large language models utilize special code libraries, and optimizing them for new processor technology took a lot of time, Ginter explains.

The group received support for the completion of the project from the LUMI User Support Team, the processor manufacturer AMD and Hugging Face, which is known especially for its Natural Language Processing applications.

– We eventually got about 75–80% performance from LUMI compared to what we thought would be achievable. At this point in the life cycle of a supercomputer, it is probably a good number, Pyysalo estimates.

The computing power of the LUMI supercomputer accelerated the creation of language models enormously.

– This would never have been possible without a system like LUMI. With smaller systems, we would still be computing this model in 2025, Pyysalo estimates.

Finnish is running out

After the LUMI pilot project, the group will continue to develop language models in the LUMI Extreme Scale project, for which the group was granted 2 million GPU hours from the share of LUMI capacity reserved for Finnish researchers.

– In this project, we focus on how multilingual and translation data can support the development of the largest Finnish-language models, Pyysalo explains.

The problem with the further development of the models is that the amount of Finnish-language text is limited. Ginter has been working in the field since the early 2000s, and a project he previously led collected as much of the Finnish-language Internet as possible to develop language models. Finnish language data for the LUMI pilot project was also obtained from the National Library of Finland. Even this is not enough.

– All in all, there is simply not enough Finnish available in digital format for us to train a model with more than 100 billion parameters using Finnish alone. Finns talk little and there are not that many of them, Ginter laughs.

For the Finnish language, the model is a lifeline.

– These models and the technology based on them cause major changes in many sectors, and these models are owned exclusively by a few multinational companies. Our model is genuinely open and enables things that could not be built on the models developed by these large multinational companies, says Pyysalo.

Follow-up projects on the LUMI supercomputer

Pyysalo is also involved in the High Performance Language Technologies project of the Horizon Europe framework programme. The project produces language models for all EU languages. This project was granted 3 million GPU hours from LUMI. Professor Jörg Tiedemann from the University of Helsinki is also involved in the project.

– Experience in developing Finnish language models serves as the foundation for the project. Models for other European languages will follow the Finnish one. The language models are run on the LUMI supercomputer, says Pyysalo.

Ginter (TurkuNLP) and Tiedemann (University of Helsinki) are also involved in the Green NLP project together with CSC – IT Center for Science, Finland. In this project, the researchers work on making the training and use of language models more energy efficient. The aim is to create best practices for improving energy efficiency in the field of natural language processing. The LUMI supercomputer is used in this project too.

Author: Anni Jakobsson, CSC – IT Center for Science, Finland