“Short-leg” of RISC-V | RISC-V 的小短腿

Preface

I recently came across some code performance issues that show up only on RISC-V. Their root cause is the short offset of RISC-V branch instructions. This post shares the issues and discusses possible future solutions on both the compiler and the CPU-architecture side.

Branch offset length on other ISAs

On x86-64, conditional jump (Jcc) instructions take a signed 8-bit or 32-bit relative offset. Thus a single conditional branch instruction can reach ±2 GiB of code.

On aarch64, B.cond instructions have a signed 19-bit relative offset. The PC must be 4-byte aligned, so the relative address is encoded as imm19 × 4. Thus a single conditional branch instruction can reach ±1 MiB of code.

On RISC-V (RV32I and RV64I alike), branch instructions have a signed imm[12:1] offset: 12 encoded bits plus an implicit zero LSB, i.e. a 13-bit signed byte offset. The PC must be 2-byte aligned. Thus a single conditional branch instruction can only reach ±4 KiB of code.

On MIPS32, branch instructions have a signed 16-bit offset. The PC must be 4-byte aligned, so the relative address is encoded as imm16 × 4. Thus a single conditional branch instruction can reach ±128 KiB of code.

In conclusion, RISC-V has the shortest conditional-branch reach (±4 KiB) compared with x86-64 (±2 GiB), aarch64 (±1 MiB), and even MIPS32 (±128 KiB).
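For concreteness, here is a hand-written sketch (labels invented for illustration) of what the ±4 KiB reach means for a single RISC-V branch:

        # a conditional branch can reach PC-4096 .. PC+4094, in 2-byte steps
        bnez    a0, .Lnear           # fine as long as .Lnear is within +-4 KiB
        ...
        # if the target is farther away, one instruction is not enough;
        # the condition must be inverted around an unconditional jump:
        beqz    a0, .Lskip
        j       .Lfar                # jal x0: +-1 MiB reach
.Lskip:
        ...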

What will happen if the branch offset is too small?

Here is a case built with loop unrolling: https://godbolt.org/z/6jf5Pon3h

int a[64], b[64], c[64], d[64];
int e[64], f[64], g[64], h[64];

int foo(bool do_memset) {
    if (do_memset) { [[likely]] // some code > 4KB
        for (int j=0;j<8;j++)
            for (int i=0;i<8;i++) a[j*8+i] *=2333;
        for (int j=0;j<8;j++)
            for (int i=0;i<8;i++) b[j*8+i] *=2333;
        for (int j=0;j<8;j++)
            for (int i=0;i<8;i++) c[j*8+i] *=2333;
        for (int j=0;j<8;j++)
            for (int i=0;i<8;i++) d[j*8+i] *=2333;
        for (int j=0;j<8;j++)
            for (int i=0;i<8;i++) e[j*8+i] *=2333;
        for (int j=0;j<8;j++)
            for (int i=0;i<8;i++) f[j*8+i] *=2333;
        for (int j=0;j<8;j++)
            for (int i=0;i<8;i++) g[j*8+i] *=2333;
        for (int j=0;j<8;j++)
            for (int i=0;i<8;i++) h[j*8+i] *=2333;
        return 0;
    }
    else return 1;
}

Then both GCC 14.2.0 and LLVM 18.1.0 generate a "branch + jump" pair to skip over the if-true block:

foo(bool):
        mv      a1, a0
        li      a0, 1
        bnez    a1, .LBB0_1
        j       .LBB0_2
.LBB0_1:
        li      a0, 0
.Lpcrel_hi0:
        auipc   a1, %pcrel_hi(a)
        addi    a2, a1, %pcrel_lo(.Lpcrel_hi0)
        lw      a3, 0(a2)
        lui     a1, 1
        addi    a1, a1, -1763
        lw      a4, 4(a2)
        mul     a3, a3, a1
        sw      a3, 0(a2)
        lw      a3, 8(a2)
        mul     a4, a4, a1
...

But if the branch offset were long enough (emulated here by reducing the number of "for" blocks so that the if-true block fits within ±4 KiB), the code becomes:

foo(bool):
        mv      a1, a0
        li      a0, 1
        beqz    a1, .LBB0_2
        li      a0, 0
.Lpcrel_hi0:
        auipc   a1, %pcrel_hi(a)
        addi    a1, a1, %pcrel_lo(.Lpcrel_hi0)
        lw      a3, 0(a1)
        lui     a2, 1
        addi    a2, a2, -1763
        lw      a4, 4(a1)
        mul     a3, a3, a2
        sw      a3, 0(a1)
        lw      a3, 8(a1)
        mul     a4, a4, a2
...

What happens when this code runs on a real CPU?

The point is: a not-taken branch does not require the branch predictor to do any work.

In this case, without a sufficient branch offset, the compiler converts an "unlikely branch" into "likely branch + unlikely jump", so the branch predictor must now predict the first branch as taken and the branch target address must hit in the branch target buffer.

Suppose a workload has so many branches, with so little locality, that it overflows the storage of the branch predictor in the CPU itself. In that case, code like this performs very poorly on an ordinary OoO CPU because of frequent front-end and back-end flushes, and the OoO machinery gets no chance to improve throughput or hide long-latency memory accesses.

A very bad case

Verilator-generated code + Clang PGO

Features:

  • Branches everywhere (roughly one in every 10 instructions)
  • Huge code size on large designs
  • No locality
  • Huge memory footprint, many high-latency memory accesses
  • Branch direction can be easily determined by PGO

But on RISC-V, the generated code ends up full of exactly the "likely branch + unlikely jump" pairs described above.
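A schematic of that pattern (hand-written for illustration, not actual Verilator output; labels are made up):

        bnez    a5, .Lblk0           # "likely" branch: must be predicted taken
        j       .Lskip0              # rarely executed, yet occupies I-cache and BTB space
.Lblk0:
        ...                          # several KiB of straight-line update code
.Lskip0:
        bnez    a6, .Lblk1           # and the next guarded block, and the next...
        j       .Lskip1
.Lblk1:
        ...
.Lskip1:
        ...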

Possible Solutions

Our goal: reduce the number of likely-taken branch instructions.

CPU Side

Intuition: fuse "branch + jump" into a single long-offset branch in the frontend.

Related work: Macro-op fusion by SiFive

And a patent from SiFive already described exactly this case back in 2018.
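A minimal sketch of what such fusion would do, reusing the pair from the listing above (the fused form is a conceptual micro-op, not an architectural instruction):

        # the pair emitted by the compiler:
        bnez    a1, .LBB0_1          # likely taken, hops over the next instruction
        j       .LBB0_2              # unlikely, jal x0 with +-1 MiB reach
.LBB0_1:
        ...
        # what the frontend could treat the pair as after fusion, conceptually:
        #   beqz a1, .LBB0_2         (inverted condition + the jal's long offset)
        # the common path then simply falls through: no taken branch, no BTB entry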

Compiler Side

Since RISC-V has no conditional branch-to-register instruction, solutions like the jump-table island that LLVM uses on ARM are not possible.

But we can build an island out of the branch/jump instructions themselves. A simple scheme is to place an island of unconditional jumps in every 4 KiB code region and let out-of-range conditional branches target the nearest island; see the sketch below. However, if the processor already has macro-op fusion, emitting code this way increases code size and causes more jump instructions to be generated and executed.
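A hand-written sketch of such an island (labels invented; this is not the output of any existing compiler pass):

        # the unlikely branch targets a nearby island instead of the far label:
        beqz    a1, .Lisland0        # not taken on the hot path, so no BTB entry needed
        ...                          # first ~4 KiB of the large if-true block
        j       .Lresume0            # the hot path must now hop over the island it runs into
.Lisland0:
        j       .LBB0_2              # the island's jal x0 reaches the far target (+-1 MiB)
.Lresume0:
        ...                          # rest of the if-true block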

Talk

陈泱宇 – 探讨RISC-V的分支指令小短腿带来的性能问题 (Discussing the performance problems caused by the "short legs" of RISC-V branch instructions)

4 Responses

  1. “On RISC-V (whatever 32I/64I), branch instructions has signed imm[13:1] offset. The PC must be aligned to 2Bytes. Thus, we can only jump to +- 4KiB code space in a single conditional branch instruction.”

    A small typo, it should be “imm[12:1]”.
