How CCMP reduces branch predictor pressure on aarch64
Preface
When comparing branch MPKI (Misses Per Kilo Instructions) on aarch64 with other architectures such as RISC-V (including RVA23) or x86-64 (without APX), we often observe that certain unpredictable branches are eliminated on aarch64.
This is because aarch64 offers numerous conditional-execution instructions that significantly reduce the number of branch instructions that need to be executed. On other architectures such as RISC-V, even with the Bitmanip or Zicond extensions, eliminating a branch the same way requires inserting three or more ALU instructions before the remaining branch to materialize the combined condition. This increases the number of executed instructions, and it can even hurt performance when the branch is easy to predict.
This article delves into the design of conditional execution on aarch64 and its effect on the 429.mcf workload from the SPECINT 2006 benchmark suite.
ARM Conditional Execution
Unlike some RISC architectures such as RISC-V or MIPS, which lack a flags register, aarch64 has a CPSR (Current Program Status Register) that stores the condition flags in its N, Z, C, and V bits. Instructions such as cmp set these flags, much as on x86, and branch instructions such as beq consume them: beq checks whether the Z bit in the CPSR is set.
The aarch64 ISA also provides the ccmp instruction for evaluating cascaded conditions. Its format is either CCMP <Xn>, #<imm>, #<nzcv>, <cond> (immediate) or CCMP <Wn>, <Wm>, #<nzcv>, <cond> (register).
The <cond> parameter is a condition such as eq or ne, evaluated against the condition flags currently stored in the CPSR. The #<nzcv> parameter is a 4-bit immediate that is written into the CPSR.NZCV flags when the condition specified by <cond> is not met.
The register version of ccmp works as follows:
if (cond) {
    CPSR.nzcv = Compare(<Wn>, <Wm>);
} else {
    CPSR.nzcv = #<nzcv>;
}
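To make these semantics concrete, here is a minimal C model of the register form. This is a sketch for illustration only: the flag layout and helper names are my own, not taken from the ARM Architecture Reference Manual.

#include <stdint.h>

/* Condition flags, one bit each (hypothetical representation). */
typedef struct { unsigned n, z, c, v; } flags_t;

/* NZCV flags of the 64-bit subtraction a - b, as cmp would set them. */
static flags_t compare(uint64_t a, uint64_t b) {
    uint64_t r = a - b;
    flags_t f;
    f.n = r >> 63;                           /* result is negative */
    f.z = (r == 0);                          /* result is zero */
    f.c = (a >= b);                          /* no borrow occurred */
    f.v = (((a ^ b) & (a ^ r)) >> 63) & 1;   /* signed overflow */
    return f;
}

/* CCMP Xn, Xm, #nzcv, cond -- the caller evaluates cond against the
   current flags; the immediate packs N,Z,C,V into bits 3..0. */
static flags_t ccmp_reg(int cond_holds, uint64_t xn, uint64_t xm, unsigned nzcv) {
    if (cond_holds)
        return compare(xn, xm);
    flags_t f = { (nzcv >> 3) & 1, (nzcv >> 2) & 1, (nzcv >> 1) & 1, nzcv & 1 };
    return f;
}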
We can use cmp and ccmp together to create cascading conditions, similar to how we combine boolean expressions with the and and or operators. Here is an example:
int foo(int a, int b, int c, int d) {
    if (a == b && c == d) return 1;
    else return 2;
}
Compiled with GCC 14.2.0 at -O3 on aarch64:
foo:
        cmp     w0, w1          // set flags for a == b
        mov     w1, 2
        ccmp    w2, w3, 0, eq   // if eq, compare c and d; else force nzcv = 0000 (eq fails)
        cset    w0, eq          // w0 = (a == b && c == d)
        sub     w0, w1, w0      // 2 - w0: returns 1 or 2
        ret
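Note that there is no branch at all: cset materializes the combined condition, and the final subtraction selects the return value. In C terms, the sequence computes the following (my paraphrase of the assembly above, not compiler output):

int t = (a == b) && (c == d);  /* cmp + ccmp + cset */
return 2 - t;                  /* mov + sub: yields 1 or 2 */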
But on x86-64 (without APX), we get:
foo:
        cmp     edi, esi
        sete    al              # al = (a == b)
        cmp     edx, ecx
        sete    dl              # dl = (c == d)
        movzx   edx, dl
        and     edx, eax        # combine both conditions
        mov     eax, 2
        sub     eax, edx        # 2 - condition: returns 1 or 2
        ret
And on RISC-V (RVA23 with b_zicond), the compiler even splits the basic block:
foo:
        bne     a0,a1,.L3       # first branch: a != b
        sub     a2,a2,a3
        snez    a0,a2           # a0 = (c != d)
        addi    a0,a0,1         # returns 1 or 2
        ret
.L3:
        li      a0,2
        ret
SPECINT 2006 429.mcf
CCMP helps the 429.mcf benchmark from SPECINT 2006 on aarch64 processors.
In mcf, we have a function like this:
int bea_is_dual_infeasible( arc_t *arc, cost_t red_cost )
{
return( (red_cost < 0 && arc->ident == AT_LOWER)
|| (red_cost > 0 && arc->ident == AT_UPPER) );
}
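Because red_cost < 0 and red_cost > 0 are mutually exclusive, the whole predicate can be computed without any short-circuit branch. A hand-written branchless equivalent (my own sketch, not part of mcf) looks like this; note that it always dereferences arc, which the short-circuit original does only when red_cost is nonzero:

int bea_is_dual_infeasible_branchless( arc_t *arc, cost_t red_cost )
{
    /* Bitwise & and | evaluate both sides unconditionally -- this is
       essentially what cmp/ccmp/cset achieve on aarch64. */
    return ((red_cost < 0) & (arc->ident == AT_LOWER))
         | ((red_cost > 0) & (arc->ident == AT_UPPER));
}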
This function is then inlined into an if condition at its call site, but let us analyze the function itself.
On aarch64 with GCC 14.2.0 and -O3 -fverbose-asm -S, we see instructions like:
.L34:
// pbeampp.c:42: || (red_cost > 0 && arc->ident == AT_UPPER) );
ccmp w5, 2, 0, ne // _21,,,
beq .L35 //,
.L33:
// pbeampp.c:165: for( ; arc < stop_arcs; arc += nr_group )
add x0, x0, x8 // arc, arc, _34
// pbeampp.c:165: for( ; arc < stop_arcs; arc += nr_group )
cmp x2, x0 // stop_arcs, arc
bls .L32 //,
However, on x86-64 and RISC-V, things go differently. The x86-64 version:
.L37:
# pbeampp.c:42: || (red_cost > 0 && arc->ident == AT_UPPER) );
je .L36 #,
cmpl $2, %ecx #, _21
je .L38 #,
.p2align 4
.p2align 4
.p2align 3
.L36:
# pbeampp.c:165: for( ; arc < stop_arcs; arc += nr_group )
addq %rdi, %rax # _34, arc
# pbeampp.c:165: for( ; arc < stop_arcs; arc += nr_group )
cmpq %rsi, %rax # stop_arcs, arc
jnb .L35 #,
And the RISC-V version:
.L35:
# pbeampp.c:42: || (red_cost > 0 && arc->ident == AT_UPPER) );
beq a4,zero,.L34 #, red_cost,,
# pbeampp.c:42: || (red_cost > 0 && arc->ident == AT_UPPER) );
beq a0,a3,.L36 #, _21, tmp268,
.L34:
# pbeampp.c:165: for( ; arc < stop_arcs; arc += nr_group )
add a5,a5,t1 # _34, arc, arc
# pbeampp.c:165: for( ; arc < stop_arcs; arc += nr_group )
bleu a2,a5,.L33 #, stop_arcs, arc,
When the red_cost == 0 branch is hard to predict, these extra instructions can cause significantly more branch mispredictions, each of which flushes the entire CPU pipeline, even though the branch exists only to skip a condition that does not need to be checked.
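To get a feel for the cost of such a hard-to-predict branch, here is a small self-contained C microbenchmark. It is entirely my own construction, not from mcf, and an optimizing compiler may if-convert the first loop, so it is worth inspecting the generated assembly when reproducing:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { N = 1 << 20, REPS = 100 };

int main(void) {
    int32_t *cost  = malloc(N * sizeof *cost);
    int32_t *ident = malloc(N * sizeof *ident);
    if (!cost || !ident) return 1;

    /* Random values in {-1, 0, 1} / {0, 1, 2} make the branches
       hard to predict, like red_cost in mcf's hot loop. */
    srand(42);
    for (int i = 0; i < N; i++) {
        cost[i]  = (rand() % 3) - 1;
        ident[i] = rand() % 3;
    }

    long hits = 0;
    clock_t t0 = clock();
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++)
            /* Branchy form: compilers typically emit short-circuit branches. */
            if ((cost[i] < 0 && ident[i] == 1) || (cost[i] > 0 && ident[i] == 2))
                hits++;
    double t_branchy = (double)(clock() - t0) / CLOCKS_PER_SEC;

    long hits2 = 0;
    t0 = clock();
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++)
            /* Branch-free form: & and | evaluate both sides, much as
               cmp + ccmp + cset would on aarch64. */
            hits2 += ((cost[i] < 0) & (ident[i] == 1)) |
                     ((cost[i] > 0) & (ident[i] == 2));
    double t_branchless = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("branchy:    %ld hits in %.3f s\n", hits,  t_branchy);
    printf("branchless: %ld hits in %.3f s\n", hits2, t_branchless);
    free(cost);
    free(ident);
    return 0;
}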
Better RISC-V code might look like this (using Zicond, and assuming a3 is a free register):
.L35:
# pbeampp.c:42: || (red_cost > 0 && arc->ident == AT_UPPER) );
xor a3,a0,t6 #, tmp268, _21,
seqz a3,a3
# pbeampp.c:42: || (red_cost > 0 && arc->ident == AT_UPPER) );
czero.eqz a3,a3,a4 #,,red_cost,
bnez a3,.L36
.L34:
# pbeampp.c:165: for( ; arc < stop_arcs; arc += nr_group )
add a5,a5,t1 # _34, arc, arc
# pbeampp.c:165: for( ; arc < stop_arcs; arc += nr_group )
bleu a2,a5,.L33 #, stop_arcs, arc,
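For reference, the Zicond instruction czero.eqz rd, rs1, rs2 writes zero to rd when rs2 is zero, and rs1 otherwise. A one-line C model (my own) is:

#include <stdint.h>

/* czero.eqz rd, rs1, rs2 */
static uint64_t czero_eqz(uint64_t rs1, uint64_t rs2) {
    return rs2 == 0 ? 0 : rs1;
}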
Performance Evaluation
I emulated the two-branch code pattern on an Apple M1 CPU by patching the aarch64 assembly as follows:
--- pbeampp.s
+++ pbeampp.s
@@ -269,6 +269,7 @@
.p2align 2,,3
.L34:
// pbeampp.c:42: || (red_cost > 0 && arc->ident == AT_UPPER) );
+ beq .L33
ccmp w5, 2, 0, ne // _21,,,
beq .L35 //,
.L33:
The inserted beq .L33 consumes the flags of the preceding compare of red_cost and skips the ccmp, recreating the two-branch structure that x86-64 and RISC-V execute. Then I ran the benchmark on the Apple M1:
Original version:
Performance counter stats for 'numactl --physcpubind 4-7 ./mcf /home/cyy/spec06/benchspec/CPU2006/429.mcf/data/ref/input/inp.in mcf.out':
2,024,810,367 apple_firestorm_pmu/branch_mispred_nonspec:u/ (100.00%)
63,058,938,380 apple_firestorm_pmu/inst_branch:u/ (100.00%)
2,024,807,836 apple_firestorm_pmu/branch_cond_mispred_nonspec:u/ (100.00%)
288,346,808,544 apple_firestorm_pmu/instructions:u/ # 0.81 insn per cycle (100.00%)
355,818,120,265 apple_firestorm_pmu/cycles:u/ (100.00%)
112.009282105 seconds time elapsed
111.437283000 seconds user
0.073728000 seconds sys
Patched two branch version:
Performance counter stats for 'numactl --physcpubind 4-7 ./mcf /home/cyy/spec06/benchspec/CPU2006/429.mcf/data/ref/input/inp.in mcf.out':
3,449,189,303 apple_firestorm_pmu/branch_mispred_nonspec:u/ (100.00%)
67,591,263,076 apple_firestorm_pmu/inst_branch:u/ (100.00%)
3,449,186,323 apple_firestorm_pmu/branch_cond_mispred_nonspec:u/ (100.00%)
291,092,191,936 apple_firestorm_pmu/instructions:u/ # 0.76 insn per cycle (100.00%)
381,611,914,976 apple_firestorm_pmu/cycles:u/ (100.00%)
120.090912907 seconds time elapsed
119.539129000 seconds user
0.056814000 seconds sys
The patched two-branch version incurred 70% more branch-misprediction events (2.02e9 → 3.45e9) and a 7.3% increase in user time (111.44 s → 119.54 s). In MPKI terms, branch mispredictions rose from about 7.0 (2.02e9 / 288.3e9 × 1000) to about 11.9 (3.45e9 / 291.1e9 × 1000).
About other ISAs
x86
Intel also introduced a CCMP instruction for x86 in the APX extensions announced in 2023. However, no shipping chip supports APX yet. Nevertheless, we can already observe Intel's enabling work for CCMP in GCC and LLVM.
On the GCC master branch with -O3 -mapxf, the code becomes:
.L34:
# pbeampp.c:42: || (red_cost > 0 && arc->ident == AT_UPPER) );
ccmpnel {dfv=} $2, %ecx #,,, _21
je .L35 #,
.L33:
# pbeampp.c:165: for( ; arc < stop_arcs; arc += nr_group )
RISC-V
Perhaps we need an extension like a hypothetical Ziccmp on RISC-V?
See also
The AArch64 processor (aka arm64), part 16: Conditional execution – The Old New Thing