Is TSO a historical burden for modern CPUs? A case study on Apple M1


Preface

x86-TSO (Total Store Order) is the memory consistency model used by the x86 and x86-64 architectures. It strikes a balance between performance and ease of programming by allowing certain kinds of instruction reordering while maintaining a relatively strong consistency model. However, as modern CPUs have evolved, there has been ongoing debate about whether TSO remains the optimal choice for contemporary computing needs. Newer architectures, such as aarch64 and RISC-V, have adopted relaxed memory models that permit more reordering of memory operations, which can potentially lead to better performance in multi-core systems.

I’m interested in exploring whether TSO is becoming a historical burden for modern CPUs, especially in the context of multi-core and high-performance computing. Luckily, Apple Silicon, starting from the M1, has a custom implementation of TSO on top of the aarch64 architecture, and it can be toggled on and off via an MSR register. This provides a unique opportunity to compare the performance of TSO and a relaxed memory model on the same hardware platform.

I have developed tools to evaluate this on an Apple M1 running Asahi Linux: m1tso-linux and m1-pmu-gen. Together they make it possible to measure performance counter events under different memory models.

I assume the reader has basic knowledge of memory consistency models; you should at least know the difference between TSO and a relaxed memory model. You can refer to Memory Ordering for a quick introduction. (I’m sorry that post only has a Chinese version for now.)

Intuition

A reordering type describes a pair of memory operations that a CPU may or may not execute out of program order. The following table summarizes which reorderings are allowed under sequential consistency (Seq Cst), TSO, and relaxed memory models:

Reordering Type    Seq Cst       TSO           Relaxed
Load to Load       Disallowed    Disallowed    Allowed
Load to Store      Disallowed    Disallowed    Allowed
Store to Load      Disallowed    Allowed       Allowed
Store to Store     Disallowed    Disallowed    Allowed

As we can see from the table, TSO allows store-to-load reordering, which lets store buffers exist for better performance, but it disallows load-to-load, load-to-store, and store-to-store reordering. In contrast, relaxed memory models allow all four types of reordering, which can lead to better performance in multi-core systems.
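To see the one reordering TSO does permit in action, here is a minimal Dekker-style store-buffer litmus test of my own (a sketch, not part of the original experiments; it needs -std=c++20 and -pthread for std::barrier and std::thread). Even on x86 or on M1 with TSO enabled it should report a non-zero count, because each load is allowed to complete before the preceding store drains from the store buffer.

#include <atomic>
#include <barrier>
#include <cstdio>
#include <thread>

std::atomic<int> X{0}, Y{0};
int r1, r2;

int main() {
    constexpr int ITERS = 1000000;
    std::barrier sync(2);
    long witnessed = 0;

    std::thread t1([&] {
        for (int i = 0; i < ITERS; ++i) {
            sync.arrive_and_wait();                  // start the round together
            X.store(1, std::memory_order_relaxed);
            r1 = Y.load(std::memory_order_relaxed);  // may overtake the store to X
            sync.arrive_and_wait();                  // end of round
        }
    });
    std::thread t2([&] {
        for (int i = 0; i < ITERS; ++i) {
            sync.arrive_and_wait();
            Y.store(1, std::memory_order_relaxed);
            r2 = X.load(std::memory_order_relaxed);  // may overtake the store to Y
            sync.arrive_and_wait();
            if (r1 == 0 && r2 == 0) ++witnessed;     // both loads beat both stores
            X.store(0, std::memory_order_relaxed);   // reset for the next round
            Y.store(0, std::memory_order_relaxed);
        }
    });
    t1.join();
    t2.join();
    std::printf("r1 == 0 && r2 == 0 witnessed %ld times\n", witnessed);
    return 0;
}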

So how can we implement hardware TSO on a real OoO (out-of-order) CPU that already implements a relaxed memory model with a store buffer? A FIFO store buffer already prevents store-to-store reordering while still permitting store-to-load reordering. Load-to-store ordering comes essentially for free, because stores only become globally visible after they retire, by which point all earlier loads have completed. The only thing left is to prevent load-to-load reordering.

One possible solution is to track cache snoops against loads that have completed but not yet retired. If such a load’s cache line is snooped by another core, the CPU can replay the affected loads in program order, so that no illegal load-to-load reordering ever becomes architecturally visible. This can be implemented with minimal performance overhead, as it only requires extra tracking for loads whose lines are actually snooped by other cores.
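To make the mechanism concrete, here is a toy software model of such a load queue (purely my own sketch of the idea, not Apple’s actual design): completed-but-unretired loads sit in a queue in program order, a snoop marks matching entries for replay, and retirement re-executes anything marked.

#include <cstdint>
#include <vector>

// Toy model of snoop-triggered load replay (a sketch, not Apple's design).
struct LoadQueueEntry {
    uint64_t line;    // cache-line address this load reads
    bool     done;    // executed (value obtained), possibly out of order
    bool     replay;  // value may be stale; must re-execute before retiring
};

struct LoadQueue {
    std::vector<LoadQueueEntry> entries;  // index order == program order

    // Another core's store snooped/invalidated this line: any completed but
    // unretired load from it might now break load-to-load ordering.
    void on_snoop(uint64_t line) {
        for (auto &e : entries)
            if (e.done && e.line == line)
                e.replay = true;
    }

    // Try to retire the oldest load; returns true if it retired.
    bool retire_oldest() {
        if (entries.empty()) return false;
        LoadQueueEntry &oldest = entries.front();
        if (!oldest.done) return false;   // still executing
        if (oldest.replay) {
            oldest.done = false;          // re-issue the load; a real core
            oldest.replay = false;        // would also squash younger ops
            return false;                 // not retired this cycle
        }
        entries.erase(entries.begin());   // retired in program order
        return true;
    }
};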

That’s the simplest case. In practice, complex OoO CPUs have other optimizations that exploit a relaxed memory model, and Apple M1 can show us the real-world performance impact of TSO.

A simple TSO test

Motivated by testtso, I wrote a similar test program to evaluate the performance impact of TSO on Apple M1.

#include <atomic>
#include <thread>
#include <cassert>

constexpr size_t BUFFER_SIZE = 1 << 12;
std::atomic<int> buffer[BUFFER_SIZE];

void writer() {
    int local_counter = 0;
    while (true) {
        // 1. Sequentially write local_counter to the entire buffer
        for (size_t i = 0; i < BUFFER_SIZE; ++i) {
            buffer[i].store(local_counter, std::memory_order_relaxed);
        }
        local_counter++;
    }
}

void reader() {
    while (true) {
        for (size_t i = 0; i < BUFFER_SIZE - 1; ++i) {
            // 2. Check that values never increase along the index:
            //    buffer[i] >= buffer[i+1] must hold at all times
            int val1 = buffer[i+1].load(std::memory_order_relaxed);
            int val2 = buffer[i  ].load(std::memory_order_relaxed);
#ifndef NO_ASSERT
            /* Under TSO, stores become visible in order and loads are not
               reordered with other loads, so after reading buffer[i+1],
               the later read of buffer[i] can never return a smaller
               value.  Under a relaxed memory model, val1 > val2 is
               possible, because both store-to-store and load-to-load
               reordering are allowed. */
            assert(val1 <= val2);
#endif
        }
    }
}

int main() {
    std::thread writer_thread(writer);
    std::thread reader_thread(reader);

    writer_thread.join();
    reader_thread.join();

    return 0;
}

Then compile it with GCC 15 at -O3 and run it on different systems.

On typical x86-64 systems, this test passes without any assertion failures (unless you hit integer overflow after a very long run). However, I have observed that on Apple M1 (Firestorm cores) with TSO disabled, and on Raspberry Pi 4 (Cortex-A72), the test fails frequently, indicating that the relaxed memory model allows reordering that violates the expected invariant.

This raises a question: what is the performance impact of TSO on this test? And how do x86-64 systems perform on it compared to Apple M1 with TSO enabled and disabled?

We first look at the IPC (instructions per cycle) of each thread on Apple M1 with TSO enabled and disabled, using perf stat --per-thread -e instructions,cycles; the test was built with NO_ASSERT defined so that the reader thread skips the failing assertion. All experiments pinned the threads to a single Firestorm cluster on Apple M1 via numactl --physcpubind=4-5.
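For reference, a build-and-measure invocation along these lines reproduces the setup (the TSO on/off toggle via m1tso-linux is omitted here):

# build with the assertion compiled out (the source checks NO_ASSERT)
g++ -O3 -pthread -DNO_ASSERT tsotest.cpp -o tsotest

# pin to the Firestorm cluster and count per-thread instructions and cycles;
# interrupt with Ctrl-C after a while to print the counters
numactl --physcpubind=4-5 perf stat --per-thread -e instructions,cycles ./tsotest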

Thread    RMO IPC    TSO IPC
Writer    1.11       3.64
Reader    4.12       2.68

The reader thread shows a significant performance drop when TSO is enabled, while the writer thread shows a substantial performance improvement.

This raises another question: what exactly happens to the writer thread when TSO is enabled?

Let’s dive into the performance counter events. Since only a limited number of performance counters can be configured at once on Apple M1, I sampled two events per second, iterating over all available events, and kept only the events that occurred at least 0.0001 times per instruction in RMO mode. Since the program does exactly the same work in both modes regardless of what the other thread is doing, we can compare the per-instruction event counts between RMO and TSO modes to see what changed.

For the reader thread:

Event RMO per inst TSO per inst diff
CORE_ACTIVE_CYCLE 0.2427 0.3728 1.536
INST_ALL 1.0000 1.0000 1.000
INST_BRANCH 0.1666 0.1670 1.002
INST_BRANCH_TAKEN 0.1666 0.1666 1.000
INST_INT_ALU 0.4995 0.4998 1.001
INST_INT_LD 0.3329 0.3334 1.001
INST_LDST 0.3331 0.3331 1.000
L1D_CACHE_MISS_LD 0.1290 0.3272 2.536
L1D_CACHE_MISS_LD_NONSPEC 0.1185 0.2948 2.489
L1D_TLB_ACCESS 0.3336 0.3463 1.038
LD_UNIT_UOP 0.4388 0.6087 1.387
MAP_DISPATCH_BUBBLE 0.0002 0.0003 2.096
MAP_INT_UOP 0.6684 0.7086 1.060
MAP_LDST_UOP 0.3346 0.3545 1.059
MAP_REWIND 0.0002 0.0021 11.135
MAP_STALL 0.0880 0.2271 2.582
MAP_STALL_DISPATCH 0.0872 0.2226 2.553
RETIRE_UOP 0.9994 0.9998 1.000
SCHEDULE_EMPTY 0.0004 0.0025 6.561
SCHEDULE_UOP 0.8355 0.8798 1.053

As we can see, there are many more LD_UNIT_UOP and L1D_CACHE_MISS_LD events per instruction in TSO mode, indicating that loads take longer to complete under TSO’s additional constraints, most likely because of load flushes/replays when loads are snooped by the other core; the 11x jump in MAP_REWIND is consistent with such replays. This leads to more stalls in the scheduling and mapping stages, resulting in lower IPC.

For the writer thread:

Event RMO per inst TSO per inst diff
CORE_ACTIVE_CYCLE 0.9002 0.2748 0.305
INST_ALL 1.0000 1.0000 1.000
INST_BRANCH 0.2492 0.2497 1.002
INST_BRANCH_TAKEN 0.2492 0.2499 1.003
INST_INT_ALU 0.4999 0.4997 1.000
INST_INT_LD 0.0005 0.0001 0.300
INST_INT_ST 0.2488 0.2497 1.004
INST_LDST 0.2494 0.2498 1.002
INTERRUPT_PENDING 0.0002 0.0001 0.310
L1D_CACHE_MISS_ST 0.0042 0.0001 0.016
L1D_CACHE_MISS_ST_NONSPEC 0.0021 0.0001 0.032
L1D_CACHE_WRITEBACK 0.0155 0.0080 0.513
L1D_TLB_ACCESS 0.2494 0.2501 1.003
LD_UNIT_UOP 0.0006 0.0002 0.350
MAP_DISPATCH_BUBBLE 0.0002 0.0015 7.083
MAP_INT_UOP 0.7498 0.7520 1.003
MAP_LDST_UOP 0.2503 0.2509 1.002
MAP_REWIND 0.0002 0.0002 0.893
MAP_STALL 0.7536 0.0320 0.043
MAP_STALL_DISPATCH 0.7532 0.0318 0.042
RETIRE_UOP 1.0009 0.9996 0.999
SCHEDULE_EMPTY 0.0005 0.0001 0.252
SCHEDULE_UOP 0.7494 0.7512 1.002
ST_NT_UOP 0.2175 0.0000 0.000
ST_UNIT_UOP 0.2495 0.2498 1.001

Here we see a dramatic reduction in ST_NT_UOP (non-temporal store uops) in TSO mode. In RMO mode nearly all stores execute as non-temporal stores, which bypass the cache and go directly to memory; in TSO mode no non-temporal stores are observed at all. This suggests that Apple M1 disables its non-temporal store optimization under TSO, likely because it needs to maintain store order for TSO compliance. As a result, stores go through the cache hierarchy instead. For this writer, that is actually a win: the 16 KiB buffer fits easily in the L1 data cache and the same lines are rewritten on every pass, so cached stores complete far faster than streaming them to memory, which explains the writer thread’s improved IPC in TSO mode.
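As an aside: the test program contains no explicit non-temporal instructions, so the ST_NT_UOPs in RMO mode presumably come from a hardware heuristic that promotes streaming stores to non-temporal ones. For readers unfamiliar with the concept, aarch64 also exposes non-temporal stores explicitly via the STNP instruction; a minimal illustration (my own sketch, aarch64 only):

#include <cstdint>

// Store the pair (a, b) to dst with a non-temporal hint: the CPU is told
// the data has no temporal locality, so the cache need not retain the line.
static inline void nt_store_pair(uint64_t *dst, uint64_t a, uint64_t b) {
    asm volatile("stnp %1, %2, [%0]"
                 : /* no register outputs; memory is written */
                 : "r"(dst), "r"(a), "r"(b)
                 : "memory");
}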

From the above analysis, we can conclude that TSO imposes additional constraints on memory operations that degrade performance in certain scenarios, particularly load-heavy workloads. However, when it collides with an unprofitable use of non-temporal stores, Apple’s TSO implementation on M1 can actually improve performance by forcing a more cache-friendly store behavior. That’s quite interesting!

Real world applications

To further evaluate the performance impact of TSO on real-world applications, I ran the SPECrate 2006 and SPECrate 2017 benchmarks with only 1 copy (single-threaded) on Apple M1 with TSO enabled and disabled. All benchmarks were compiled with GCC 15 using -O3 -march=armv8.5-a. Note that I didn’t enable LTO like most SPECrate submissions do, to reduce the complexity of the build process. In the tables below, wmo denotes the weak (relaxed) memory ordering mode, i.e., the same RMO mode as above, and diff is the tso/wmo ratio. The results are as follows:
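For reference, the compiler settings correspond to a config fragment roughly like this (SPEC CPU 2017 syntax; a sketch rather than my literal config file):

default:
   CC       = gcc
   CXX      = g++
   FC       = gfortran
   OPTIMIZE = -O3 -march=armv8.5-a
   # deliberately no -flto, to keep the build simple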

SPECrate 2006:

benchmark wmo ratio tso ratio diff
400.perlbench 66.13 64.14 0.970
401.bzip2 38.80 38.45 0.991
403.gcc 79.43 78.48 0.988
429.mcf 83.03 83.32 1.003
445.gobmk 55.62 54.64 0.983
456.hmmer 108.70 108.64 0.999
458.sjeng 44.90 44.98 1.002
462.libquantum 167.27 167.30 1.000
464.h264ref 96.40 23.81 0.247
471.omnetpp 40.04 39.48 0.986
473.astar 40.12 39.96 0.996
483.xalancbmk 71.44 64.92 0.909
SPECint_rate 67.21 58.94 0.877
410.bwaves 132.59 112.64 0.850
416.gamess 75.95 71.12 0.936
433.milc 85.82 78.73 0.917
434.zeusmp 124.30 104.40 0.840
435.gromacs 64.64 63.93 0.989
436.cactusADM 103.22 37.49 0.363
437.leslie3d 121.80 95.74 0.786
444.namd 54.84 54.85 1.000
447.dealII 63.63 62.87 0.988
450.soplex 87.28 85.44 0.979
453.povray 99.17 88.44 0.892
454.calculix 21.84 21.75 0.996
459.GemsFDTD 94.95 86.32 0.909
465.tonto 93.31 69.35 0.743
470.lbm 164.07 145.85 0.889
481.wrf 134.64 127.36 0.946
482.sphinx3 106.57 102.15 0.958
SPECfp_rate 88.28 76.14 0.863

SPECrate 2017:

benchmark wmo ratio tso ratio diff
500.perlbench_r 8.50 8.45 0.995
502.gcc_r 12.34 11.16 0.904
505.mcf_r 8.25 8.16 0.989
520.omnetpp_r 4.69 4.60 0.981
523.xalancbmk_r 5.37 5.36 0.998
525.x264_r 23.01 17.44 0.758
531.deepsjeng_r 5.19 5.19 1.000
541.leela_r 5.64 5.33 0.945
548.exchange2_r 25.23 25.17 0.998
557.xz_r 3.82 3.79 0.993
SPECint_rate 8.15 7.77 0.953
503.bwaves_r 58.88 41.17 0.699
507.cactuBSSN_r 6.50 4.74 0.729
508.namd_r 9.57 9.46 0.988
510.parest_r 7.30 7.21 0.988
511.povray_r 11.12 9.99 0.898
519.lbm_r 8.60 8.09 0.942
521.wrf_r 16.35 13.71 0.839
526.blender_r 9.44 9.31 0.987
527.cam4_r 14.24 10.49 0.736
538.imagick_r 12.22 12.11 0.991
544.nab_r 10.05 10.01 0.996
549.fotonik3d_r 22.57 13.90 0.616
554.roms_r 10.35 8.62 0.832
SPECfp_rate 12.39 10.59 0.855

From the benchmark results, we can observe that several benchmarks suffer significant performance degradation when TSO is enabled. Notably, 464.h264ref, 483.xalancbmk, 410.bwaves, 434.zeusmp, 436.cactusADM, 437.leslie3d, 453.povray, 459.GemsFDTD, 465.tonto, and 470.lbm in SPECrate 2006 show drops ranging from roughly 9% to 75%. Similarly, in SPECrate 2017, 525.x264_r, 503.bwaves_r, 507.cactuBSSN_r, 511.povray_r, 521.wrf_r, 527.cam4_r, 549.fotonik3d_r, and 554.roms_r show reductions between roughly 10% and 38%.

How would x86 systems fare on these benchmarks? It’s hard to say without an identical hardware platform to compare against. But based on the earlier analysis of the simple TSO test, we can speculate that memory-intensive workloads, especially those with frequent loads and stores, are the most likely to be hurt by the additional ordering constraints TSO imposes.

Conclusion

From the experiments and analysis conducted on Apple M1 with TSO enabled and disabled, we can conclude that TSO can indeed be a performance burden for modern CPUs in certain scenarios. The simple TSO test demonstrated that TSO’s additional ordering constraints increase load latencies and stalls in load-heavy workloads. However, relaxing the memory model is not a universal win either: Apple’s non-temporal store optimization, which is disabled under TSO, backfired in our writer microbenchmark, so enabling TSO actually improved that workload.

A more thorough way to characterize these microarchitectural features would be to implement them in gem5; I hope to do that in the future.

Source code of the simple TSO test is available at github.com/cyyself/tsotest.
