May we introduce: LUMI

This blog post will give a general introduction to the LUMI system, its software stack, and the programming models that will be supported. We will also try to answer the most pressing questions about the system and how you can get your programs running on it. In later posts, we will do deeper dives into specific parts of the system and software and into how to prepare and port your codes. We will also do use-case-specific deep dives on, for instance, how to prepare your ML workloads for LUMI.

System introduction

As announced on October 21st, LUMI will be an HPE Cray EX supercomputer consisting of several partitions targeted at different use cases. The largest partition of the system is the LUMI-G partition, consisting of GPU-accelerated nodes using future-generation AMD Instinct GPUs. In addition, there is a smaller CPU-only partition, LUMI-C, that features AMD EPYC CPUs, and an auxiliary partition for data analytics with large-memory nodes and some GPUs for data visualization. We will detail the different partitions and the hardware in them closer to the system installation.

The system will be installed in two phases. In the first phase in the spring of 2021, everything except the GPU partition will be installed. The first phase also includes a small-scale porting platform for the AMD GPUs. This porting platform will include hardware that is similar to that which will be used in the GPU partition of the LUMI system, as well as a pre-release of the software stack that will be used for the GPUs. This platform will be accessible to the end-users, and they can carry out porting work on it; it is not intended for actual production workloads. The second phase will be installed by the end of 2021.

Programming environment introduction
The AMD programming environment comes under the ROCm (Radeon Open Compute) brand. As the name suggests, it consists mostly of open-source components that are actively developed and maintained by AMD; the source code is hosted at RadeonOpenCompute/ROCm, and the documentation can be found on the ROCm documentation platform.

The ROCm software stack contains the usual set of accelerated scientific libraries, such as rocBLAS for BLAS functionality, rocFFT for FFTs, and so on. The stack naturally also comes with the compilers needed to build code for the GPUs. The AMD GPU compiler will support offloading through OpenMP directives. In addition, the ROCm stack comes with HIP, AMD's counterpart to CUDA, together with the tools required to make translating code from CUDA to HIP much easier. Code translated to HIP can still run on CUDA hardware. We will cover translating codes to HIP in more detail in later posts.

The ROCm stack also includes the tools needed to debug code running on the GPUs in the form of ROCgdb, AMD's ROCm source-level debugger for Linux, based on the GNU Debugger (GDB). For profiling, the ROCm stack comes with rocProf, implemented on top of the rocProfiler and rocTracer APIs, which allows you to profile code running on the GPUs. For machine learning, AMD provides MIOpen, an open-source library of high-performance machine learning primitives. MIOpen provides optimized, hand-tuned implementations of operations such as convolutions, batch normalization, pooling, activation functions, and recurrent neural networks (RNNs).

In addition to the AMD software stack, LUMI will also come with the full Cray Programming Environment (CPE) stack. The Cray programming environment provides the compilers and tools that help users port, debug, and optimize for GPUs and conventional multi-core CPUs. It also includes fine-tuned scientific libraries that can use both the CPU host and the GPU accelerator when executing kernels.

Some highlights of the CPE:

  • The Cray Compiling Environment (CCE) consists of optimizing compilers for C/C++ and Fortran; CCE supports directive-based offloading to the GPUs through OpenMP.
  • The LibSci_acc libraries will include optimized dense linear algebra kernels tuned specifically for AMD GPUs.
  • The Cray Programming Environment Deep Learning Plugin (CPE DL Plugin) is a communication package based on MPI and optimized for DL workloads. It is easily incorporated into popular DL frameworks like TensorFlow via its Python or C API.
  • CrayPat (Cray Performance Analysis Tool) is the primary performance analysis tool on the Cray system. Programs written in Fortran, C, or C++ with MPI, OpenMP (including offloaded code), or a combination of these languages and models are supported.
  • For debugging, the Cray system comes with two primary tools: gdb4hpc and the Cray Comparative Debugger (CCDB). The gdb4hpc tool is a GDB-based parallel debugger that enables a scalable traditional debugging session with a gdb-like command-line interface. Comparative debugging is a data-centric debugging technique that lets users track down software errors by comparing two program executions concurrently.

Getting your code ready for LUMI
While LUMI does include a partition with CPU-only nodes, most of its performance comes from the GPU nodes. If you want to take full advantage of LUMI, you will need to make sure your code can do at least the majority of its work on the GPUs. If your codes currently don't utilize any GPU acceleration, now is a great time to start looking into which parts could benefit from it. If you are using a code that already uses GPU acceleration, some porting may still be needed.
In cases where the code is widely used in the community and you are not the developer, the porting effort will likely be made by someone else. But where the code is something you have developed yourself, you will need to do some porting work. If the current code uses CUDA, that code will need to be converted to AMD's equivalent, HIP. HIP is very similar to CUDA, and there are source-to-source translation tools available that will do the majority of the conversion work; see the Questions and Answers section below. Some level of OpenACC support is available for the AMD hardware used, but it may be worth considering converting OpenACC codes to use OpenMP offloading instead.

In later posts, we will provide more in-depth guides for preparing your codes for LUMI.

Questions and Answers

How was the LUMI vendor chosen?
The LUMI contract was awarded based on a combination of functional requirements and performance/cost metrics. The functional requirements included things like a working software stack. For the cost/performance metric, the performance figure combined synthetic benchmarks, such as the industry-standard HPL and HPCG, with a set of application benchmarks. The application benchmark set consisted of three benchmarks from the MLPerf benchmark suite and four codes with six different input cases; the codes are applications that are widely used within the LUMI consortium. The non-ML part of the application benchmarks consisted of codes using CUDA and one using OpenACC to offload their computation to the GPU. With this type of benchmark set, we enforced an additional functional requirement: there must be a robust way to translate or run programs using CUDA on the machine.

How will my CUDA-code work on LUMI?
Your CUDA code has to be converted to AMD HIP to work with the AMD GPUs. HIP is a C++ runtime API and kernel language that can be used to create portable applications for AMD and NVIDIA GPUs from the same source code. AMD provides various tools for porting CUDA code to HIP so that the developer does not have to port all of the code manually; we still expect some manual modifications and performance tuning for the hardware to be needed.

Code converted to HIP can also run on CUDA hardware using just the HIP header files, allowing you to maintain only the HIP version of your code. When running HIP on CUDA hardware, the performance should be similar to that of the original CUDA code. Many CUDA libraries have also been ported or are in the process of being ported. AMD provides its own optimized libraries; for example, the AMD BLAS library comes in the form of rocBLAS. AMD also offers an additional convenience library, hipBLAS, that allows HIP programs to call the appropriate BLAS library based on the hardware they are running on, i.e., rocBLAS on an AMD platform and cuBLAS on a CUDA platform. Recently, hipfort was also released, which allows Fortran codes to access the HIP libraries and offload code to AMD GPUs.

How will my OpenACC code work on LUMI?
A community effort, led by Oak Ridge National Laboratory, supports OpenACC in the GNU compilers. The most recent GCC release, currently 10.1, supports OpenACC 2.6. We expect GCC's OpenACC support and stability to improve by the time the GPU partition becomes available. If you can currently build your OpenACC programs with the GNU compiler, you should be able to use OpenACC on LUMI.

As an alternative to OpenACC, LUMI will support programs using OpenMP directives to offload code to the GPUs. OpenMP offloading has better support from the system vendor, so it may be worth considering porting your OpenACC code to OpenMP.

How can I run my favourite ML framework on LUMI?
The most well-known machine learning frameworks are going to be available for AMD GPUs. For example, TensorFlow and PyTorch are already supported by AMD, and AMD's efforts to bring the best experience to machine learning users continue. You can find more information on the ROCm documentation platform.

Can other programming approaches for GPUs be used on LUMI?

  • OpenMP: With OpenMP, it is possible to offload computation to the GPUs using directives similar to those used to create multithreaded applications. The AMD compiler and the Cray compiler both support offloading to the AMD GPUs used in LUMI and will support the latest version, 5.0, of the OpenMP standard.
  • OpenCL: AMD has long supported OpenCL, and that support should continue. The AMD OpenCL driver supports CPUs and GPUs, and the programming environment includes debugging and profiling tools (CodeXL) and performance libraries like clMath. Documentation for OpenCL on AMD GPUs can be found on the ROCm documentation platform.
  • SYCL: SYCL is a single-source, C++-based programming model targeting systems with heterogeneous architectures. Multiple SYCL implementations support AMD GPUs: hipSYCL is a SYCL implementation built on HIP and, as such, targets both AMD and NVIDIA GPUs; another alternative is Codeplay's ComputeCpp SYCL implementation, which also supports AMD GPUs.
  • Libraries: Many libraries offer GPU acceleration for specific operations. For instance, for linear algebra, AMD has created its own version of the BLAS library that can offload the computation to the GPU. Using these libraries can be an easy way to get part of your program offloaded to GPUs.

When can I get access to something to test on?
With the installation of the first phase, a small early-access platform will be installed as well. That system will include a GPU configuration similar to what the real LUMI GPU nodes will look like, but using previous-generation AMD GPUs. The porting platform will also have all the relevant software stacks for programming the AMD GPUs.

Authors: 

Pekka Manninen, Director, LUMI Leadership Computing Facility
Fredrik Robertsén, Technology strategist, LUMI Leadership Computing Facility
Georgios Markomanolis, Lead HPC scientist, CSC