
Preparing codes for LUMI: converting CUDA applications to HIP


In the first blog post, "May we introduce: LUMI", we discussed the LUMI high-level specifications, the software stack, and the programming models that can be used to get your GPU-accelerated codes running on the system. In this article we take a deeper dive into HIP, which is AMD's direct answer to CUDA. As there are a lot of existing GPU applications using CUDA, we need a way to translate these to run on LUMI, and in this article we will discuss the steps needed for that.

What is HIP?

As mentioned, HIP is AMD's answer to CUDA. However, whereas CUDA code can only run on Nvidia GPUs, programs using HIP can run on both AMD and Nvidia GPUs. The HIP API syntax is very similar to the CUDA API, and the abstraction level is the same, meaning that porting between the two is easy; we will cover the practical ways this can be done below.

Currently HIP supports a large set of the same features that CUDA does, but some are not supported; most of the unsupported features are newer ones introduced in later versions of CUDA.

On the AMD side, HIP is part of AMD's compute stack called Radeon Open Compute, ROCm. The ROCm stack encompasses everything from a compiler for the AMD GPU hardware to low-level libraries optimized for that hardware. The big distinction to make here is that HIP is the part that is portable between AMD and Nvidia hardware, whereas the rest of the ROCm stack is for AMD hardware.

Terminology explained

The way HIP handles the GPU hardware is the same as CUDA, and as such the terminology is very similar. Kernels are executed as grids of blocks, and each block contains a given number of threads. The major difference is that in CUDA we refer to 32 threads executing at the same time as a warp, whereas in HIP nomenclature these are a wavefront, and on AMD hardware the number of threads executing at the same time is 64.

How many threads are included in a wavefront is determined by the hardware used, not by whether it is HIP or CUDA code: when running the code on AMD hardware a wavefront consists of 64 threads, whereas when running HIP code on Nvidia hardware it is only 32 threads.
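
Since the size differs between platforms, portable code should not hard-code 32 or 64. As a minimal sketch (our own example, not taken from the codes converted later), the size can be queried at runtime through the device properties; inside device code the built-in warpSize variable can be used in both CUDA and HIP:

#include "hip/hip_runtime.h"
#include <stdio.h>

int main(void)
{
  hipDeviceProp_t prop;
  hipGetDeviceProperties(&prop, 0);   // query the properties of device 0
  // prop.warpSize is 64 on current AMD GPUs and 32 on Nvidia GPUs
  printf("wavefront/warp size: %d\n", prop.warpSize);
  return 0;
}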

To identify a thread within the grid when executing a GPU kernel, we use special built-in variables such as threadIdx.x and blockIdx.x. CUDA and HIP use the same terms here, while OpenCL uses different names, but that is out of the scope of this post.

CUDA       HIP
block      block
grid       grid
thread     thread
warp       wavefront

Table 1: Terminology for CUDA and HIP

Kernel code

Generally, the kernel code looks the same for CUDA and HIP and the same C++ language features are supported. The same memory qualifiers, such as __constant__ and __shared__, are available. HIP also supports most of the math functions that are used in CUDA; you can find the latest information in the ROCm docs.
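
To illustrate, below is a minimal sketch of a kernel (our own example, not taken from the codes converted later) that uses __shared__ memory and __syncthreads(); it compiles unchanged as both CUDA and HIP device code:

// Block-local sum using static shared memory; assumes a block size of 256 threads.
// All qualifiers and intrinsics used here are identical in CUDA and HIP.
__global__ void block_sum(const float *in, float *out, int n)
{
  __shared__ float buf[256];
  int tid = threadIdx.x;
  int i = blockIdx.x*blockDim.x + tid;

  buf[tid] = (i < n) ? in[i] : 0.0f;
  __syncthreads();

  // simple tree reduction within the block
  for (int s = blockDim.x/2; s > 0; s >>= 1) {
    if (tid < s) buf[tid] += buf[tid + s];
    __syncthreads();
  }
  if (tid == 0) out[blockIdx.x] = buf[0];
}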

There are, however, some differences that some codes may encounter: not all texture and surface memory functions are available in HIP, not all of the warp shuffle instructions are supported (though, as of ROCm 4.1, cooperative groups are available in HIP code), and inline PTX code will not work. In addition to this there are some minor differences; the exact list can be found in the HIP programming guide.

API (Runtime API)

The HIP runtime API generally mirrors the CUDA one: simply replacing the cuda text in a call with hip gives you the equivalent HIP runtime call in most cases. Table 2 shows a simple comparison of how the calls change between CUDA and HIP; the HIP version will naturally also include different header files for the runtime API. There are cases where the conversion is not direct, and in some cases certain arguments need to be passed in different ways, but generally, if there is an equivalent HIP call, it is just a question of replacing cuda with hip and the call will work.

CUDA:
#include "cuda.h"
cudaMalloc(&d_x, N*sizeof(double));
cudaMemcpy(d_x, x, N*sizeof(double),
    cudaMemcpyHostToDevice);
cudaDeviceSynchronize();

HIP:
#include "hip/hip_runtime.h"
hipMalloc(&d_x, N*sizeof(double));
hipMemcpy(d_x, x, N*sizeof(double),
    hipMemcpyHostToDevice);
hipDeviceSynchronize();

Table 2: Differences between CUDA and HIP API
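
The same replace-cuda-with-hip rule also covers error handling. As a minimal sketch (the allocation size is arbitrary and our own), hipError_t, hipSuccess and hipGetErrorString mirror their CUDA counterparts:

#include "hip/hip_runtime.h"
#include <stdio.h>

int main(void)
{
  double *d_x;
  // in the CUDA version: cudaError_t err = cudaMalloc(&d_x, ...);
  hipError_t err = hipMalloc(&d_x, 1024*sizeof(double));
  if (err != hipSuccess) {
    // in the CUDA version: cudaGetErrorString(err)
    printf("hipMalloc failed: %s\n", hipGetErrorString(err));
    return 1;
  }
  hipFree(d_x);
  return 0;
}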

The one major difference currently is how kernels are launched. CUDA uses a special syntax with angled brackets, i.e. kernelName<<<grid, block>>>, whereas in HIP the launching of kernels is done with a regular function call to hipLaunchKernelGGL, which takes as arguments the kernel name, grid and block size, etc. Table 3 shows the different ways the kernels are launched.

CUDA:
kernel_name<<<gridsize, blocksize,
    shared_mem_size, stream>>>
    (arg0, arg1, ...);

HIP:
hipLaunchKernelGGL(kernel_name, gridsize, blocksize,
    shared_mem_size, stream,
    arg0, arg1, ...);

Table 3: Launching a kernel

Libraries

AMD has ported many well-known GPU libraries to work with their GPUs; generally these come in two flavors. The HIP versions of the libraries work in the same way as the HIP runtime API: the calls are very similar to the CUDA versions, and codes that use them can run on both AMD and Nvidia hardware. In the background the appropriate underlying library will be called based on what hardware the code is compiled for. For instance, on an Nvidia system hipBLAS will call cuBLAS, whereas on AMD systems hipBLAS will call the appropriate AMD library, in this case rocBLAS. When a library's name starts with roc, it is the native library for AMD's GPUs, and the interface to these may differ somewhat from the HIP and CUDA versions. Table 4 lists some libraries with their CUDA, HIP, and ROCm equivalents; for a full and up to date list you can check the ROCm libraries website.

 

CUDA       HIP         ROCm        Description
cuBLAS     hipBLAS     rocBLAS     Basic Linear Algebra Subroutines
cuRAND     hipRAND     rocRAND     Random Number Generator Library
cuFFT      hipFFT      rocFFT      Fast Fourier Transform Library
cuSPARSE   hipSPARSE   rocSPARSE   Sparse BLAS + SPMV
NCCL       -           RCCL        Communications Primitives Library based on the MPI equivalents
CUB        hipCUB      rocPRIM     Low Level Optimized Parallel Primitives

Table 4: Libraries

The portability of HIP

The good thing with code converted to HIP is that it works on both AMD and Nvidia hardware, so you can still run the converted code on your existing hardware. The way this is achieved is neatly hidden from the user: you simply pass the code to the hipcc compiler and it will take care of compiling the code for the correct platform. The choice of which platform to build for can be controlled through environment variables; mainly, the HIP_PLATFORM variable tells the compiler what platform to target.
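
For example, selecting the platform could look like the following (the accepted values of HIP_PLATFORM have varied between ROCm releases, e.g. amd/nvidia in newer ones and hcc/nvcc in older ones, so check the documentation of the installed version):

$ export HIP_PLATFORM=nvidia
$ hipcc -o saxpy_hip saxpycuda.cpp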

Digging a bit further, when compiling for Nvidia hardware the hipcc compiler will simply call the nvcc compiler to build code for those GPUs. On that platform the HIP runtime headers will also simply call the corresponding CUDA runtime API calls. This means that on an Nvidia platform you can also supply just the HIP runtime headers and use the normal nvcc compiler to compile the code; the headers will default to calling the CUDA API functions.

On AMD GPU platforms the HIP runtime will instead link to the appropriate AMD hardware calls, and on those platforms the hipcc compiler will take care of the code compilation. In this case the hipcc compiler is based on clang, so you could make sure your code compiles with clang to make your transition to AMD hardware easier.

 

Figure 1: Diagram of HIP utilization on different GPUs

If, in converting your code, you also rename your .cu files to .cpp, you will need to pass the Nvidia compiler the additional -x cu argument to make it look for device code in normal .cpp files.
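
A hypothetical invocation could then look like the following (any include paths needed for the HIP runtime headers are installation specific and omitted here):

$ nvcc -x cu -o saxpy_hip saxpycuda.cpp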

How to "hipify" your code?

The AMD software stack includes tools that automatically convert existing CUDA code to HIP; these tools are used to "hipify" CUDA codes. There are two main tools, as well as some helper scripts that make converting an entire code base easier. The first tool is a Perl script called hipify-perl, and the other one is based on the Clang compiler and is called hipify-clang. Both tools work with C/C++ codes but not with Fortran; we will discuss the approaches that can be used for Fortran later.

Hipify-perl

The easier tool to use is hipify-perl. It will attempt to hipify the CUDA code through basic find-and-replace techniques: the cuda string in API calls is replaced with hip. It is, however, a bit more intelligent than that: it will also add the appropriate headers, and for HIP calls where the arguments differ it will attempt to correct this. In the majority of cases the script will manage to do the entire conversion, but you should always check the correctness of the translation.

In its simplest form the tool will just print out the translated version of any file passed to it. Adding an --inplace argument to the invocation will replace the contents of the file with the translated contents, and at the same time create a copy of the file with a .prehip extension which contains the original code. This way you can compile your code with the same file names, and if you want to change the original source code, you modify the .prehip file and execute the hipify-perl tool again. Another useful argument to the hipify-perl tool is --print-stats, which will print out statistics about what the tool did and how many calls it found and converted.
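
For example, the simplest form, which only prints the translated source to standard output, could be used like this (the file names are just illustrative):

$ hipify-perl saxpycuda.cpp > saxpyhip.cpp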

Hipify-clang

Hipify-clang is a more advanced tool for doing the conversion. It is based on the clang compiler and thus has more context about the code when doing the conversion than a simple find/replace. Since this tool is based on clang, it comes with the caveat that the code needs to be compilable, meaning that at times you need to add headers, defines etc. to make sure the code can be compiled.
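
A hypothetical invocation might look like the following; the --cuda-path option points the tool at a CUDA installation so that the code can be parsed, and the exact path and any additional compiler flags are system specific (check hipify-clang --help for the options of the installed version):

$ hipify-clang saxpycuda.cu --cuda-path=/usr/local/cuda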

Hipconvertinplace-perl/clang

For the conversion of larger code bases the AMD tools come with convenient scripts that can convert all the code under a certain directory. The tool again comes in two flavors: hipconvertinplace-perl.sh, which will call hipify-perl for the conversion, and hipconvertinplace.sh, which uses hipify-clang. In this section we will just cover the first tool, as the Perl script is often enough to convert most cases.

The hipconvertinplace-perl.sh script executes hipify-perl with the --inplace and --print-stats options for the whole directory that is given as an argument, or, if called without an argument, for the current directory. For example, if we want to hipify a directory called src we would call:

$ hipconvertinplace-perl.sh src

This command will hipify all the appropriate files and it will print statistics for each file including a summary.

Hipifying C/C++ code

Next, we have two examples of how codes that do saxpy, i.e. single precision A*X+Y, can be converted from CUDA to HIP. The first example includes the kernel code to do the computation, whereas the second one relies on cuBLAS for the operation.

Saxpy

In this example the file saxpycuda.cpp includes saxpy code written in CUDA; it consists of a few memory allocation and copy calls, as well as the kernel and the code to launch it. In order to convert it, these CUDA API calls have to be translated to HIP.

Using the hipify-perl program we hipify the code with the following call:

$ hipify-perl --inplace --print-stats saxpycuda.cpp
info: converted 14 CUDA->HIP refs ( error:0 init:0 version:0 device:0 context:0 module:0 memory:7 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 occupancy:0 texture:0 surface:0 peer:0 graphics:0 profiler:0 openGL:0 D3D9:0 D3D10:0 D3D11:0 VDPAU:0 EGL:0 thread:0 complex:0 library:0 device_library:0 device_function:3 include:0 include_cuda_main_header:0 type:0 literal:0 numeric_literal:3 define:0 extern_shared:0 kernel_launch:1 )
warn:0 LOC:40 in ‘saxpycuda.cpp’
hipMemcpy 3
hipFree 2
hipMemcpyHostToDevice 2
hipMalloc 2
hipLaunchKernelGGL 1
hipMemcpyDeviceToHost 1

From the output we see that the script did 14 conversions to HIP; the script also prints out how many of each call it found and converted. In case it encountered a call that it cannot convert, it would also print this out as a warning at this stage.

The conversion script also added an include statement for the hip_runtime header. The kernel remains exactly the same, as none of the code used there differs between HIP and CUDA; this is usually the case.

CUDA version:
#include <stdio.h>

__global__ void saxpy(int n, float a, float *x,
                      float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<15;
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), 
             cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), 
             cudaMemcpyHostToDevice);

  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, 
                              d_x, d_y);

  cudaMemcpy(y, d_y, N*sizeof(float), 
             cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}
include "hip/hip_runtime.h"
#include <stdio.h>
__global__ void saxpy(int n, float a, float *x, 
                      float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<15;
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  hipMalloc(&d_x, N*sizeof(float));
  hipMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  hipMemcpy(d_x, x, N*sizeof(float), 
            hipMemcpyHostToDevice);
  hipMemcpy(d_y, y, N*sizeof(float), 
            hipMemcpyHostToDevice);

  hipLaunchKernelGGL(saxpy, dim3((N+255)/256),
                     dim3(256), 0, 0, N, 2.0f, 
                     d_x, d_y);

  hipMemcpy(y, d_y, N*sizeof(float), 
            hipMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  hipFree(d_x);
  hipFree(d_y);
  free(x);
  free(y);
}

Table 5: Conversion through hipify-perl from saxpy with CUDA to saxpy with HIP

When compiling the converted code one simply passes it to the hipcc compiler wrapper:

$ hipcc -o csaxpy_hip saxpycuda.cpp

In this case the compilation works the same way on both AMD and Nvidia platforms.

Saxpy with cuBLAS

The second example uses saxpy from cuBLAS. In this case, in addition to converting the CUDA API calls to HIP, we also need to convert the cuBLAS calls to hipBLAS; fortunately the hipify tools can also convert these library calls.
As with the previous case we simply call hipify-perl to convert the code:

$ hipify-perl --inplace --print-stats saxpy.cpp
info: converted 12 CUDA->HIP refs ( error:0 init:0 version:0 … library:6 … include_cuda_main_header:1 type:1 )
warn:0 LOC:35 in ‘saxpy.cpp’
hipFree 2
hipMalloc 2

From the report we see that the tool did 12 replacements: 4 for memory operations, 6 for library operations such as cuBLAS calls, etc. From the output shown in Table 6 we see that the cuBLAS calls were converted and the corresponding header was included.

CUDA version:
#include "cublas_v2.h"
...
 cublasHandle_t handle;
 cublasCreate(&handle);

 cudaMalloc(&a_d, sizeof(float) * N);
 cudaMalloc(&b_d, sizeof(float) * N);

 cublasSetVector( N, sizeof(float), a_h, 1, a_d, 1);
 cublasSetVector( N, sizeof(float), b_h, 1, b_d, 1);

 cublasSaxpy(handle, N, &s, a_d, 1, b_d, 1);
 cublasGetVector( N, sizeof(float), b_d, 1, b_h, 1);
...
 cudaFree( a_d);
 cudaFree( b_d);
 cublasDestroy(handle);
...
#include "hipblas.h"
...
 hipblasHandle_t handle;
 hipblasCreate(&handle);

 hipMalloc(&a_d, sizeof(float) * N);
 hipMalloc(&b_d, sizeof(float) * N);

 hipblasSetVector( N, sizeof(float), a_h, 1, a_d, 1);
 hipblasSetVector( N, sizeof(float), b_h, 1, b_d, 1);

 hipblasSaxpy(handle, N, &s, a_d, 1, b_d, 1);
 hipblasGetVector( N, sizeof(float), b_d, 1, b_h, 1);
...
 hipFree( a_d);
 hipFree( b_d);
 hipblasDestroy(handle);
...

Table 6: Conversion through hipify-perl from saxpy with CUBLAS to saxpy with hipBLAS

For the compilation we again use hipcc, and we also need to link the code with hipBLAS, meaning that for compilation we would use:

$ hipcc -o saxpy_hip saxpy.cpp -I/path_to_hipblas/include/ -L/path_to_hipblas/lib/ -lhipblas

Again the compilation works the same way on both AMD and Nvidia platforms.

Hipify Fortran code

GPU code in Fortran can be written in a variety of ways. Our intention is to cover the more prominent ones in a later article, which will include converting Fortran that uses CUDA in different ways to HIP.

Converting CUDA to HIP is straightforward

In the end, converting CUDA code to HIP is usually quite straightforward, with the catch being that the most bleeding-edge CUDA features are not supported, though they may be in the future. The AMD GPU software stack comes with tools that significantly speed up the conversion process compared to doing it manually.

Have a look at the CUDA and HIP codes on Github.

Authors: George Markomanolis, Lead HPC Scientist at CSC – IT Center for Science Ltd., and Fredrik Robertsén, Technology Strategist at CSC – IT Center for Science Ltd.