Is TSO a historical burden for modern CPU? Case study on Apple M1
Preface
x86-TSO (Total Store Order) is the memory consistency model used by x86 and x86-64 architectures. It provides a balance between performance and ease of programming by allowing certain types of instruction reordering while maintaining a relatively strong consistency model. However, as modern CPUs have evolved, there has been ongoing debate about whether TSO remains the optimal choice for contemporary computing needs. Newer architectures, such as aarch64 and RISC-V, have adopted relaxed memory models that allow more reordering of memory operations between different threads, which can potentially lead to better performance in multi-core systems.
I’m interested in exploring whether TSO is becoming a historical burden for modern CPUs, especially in the context of multi-core and high-performance computing. Luckily, Apple Silicon, starting with the M1, has a custom implementation of TSO on top of the aarch64 architecture, and it can be toggled on and off via an MSR register. This provides a unique opportunity to compare the performance of TSO and a relaxed memory model on the same hardware platform.
I have developed tools to evaluate this on Apple M1 running Asahi Linux: m1tso-linux and m1-pmu-gen. Together they make it possible to measure performance counter events under different memory models.
I assume the reader has basic knowledge of memory consistency models; you should at least know the difference between TSO and a relaxed memory model. You can refer to Memory Ordering for a quick introduction. (I’m sorry that post is only available in Chinese for now.)
Intuition
A reordering type describes which pairs of memory operations a CPU is allowed to perform out of program order. The following table summarizes the reorderings allowed by TSO and relaxed memory models:
| Reordering Type | Seq Cst | TSO | Relaxed | 
|---|---|---|---|
| Load to Load | Disallowed | Disallowed | Allowed | 
| Load to Store | Disallowed | Disallowed | Allowed | 
| Store to Load | Disallowed | Allowed | Allowed | 
| Store to Store | Disallowed | Disallowed | Allowed | 
As we can see from the table, TSO allows store to load reordering, which is exactly what makes a store buffer possible and is good for performance. However, TSO disallows load to load, store to store, and load to store reordering. In contrast, relaxed memory models allow all four types of reordering, which can lead to better performance in multi-core systems.
So how can we implement hardware TSO on a real OoO (Out-of-Order) CPU that already implements a relaxed memory model with a store buffer? A FIFO store buffer already prevents store to store reordering while still permitting store to load reordering, and because stores only become visible after they retire (while older loads complete by retirement), load to store order is preserved as well. The only thing left is to prevent load to load reordering.
One possible solution is to track cache snoops against completed but not yet retired loads. If the cache line read by such a load is snooped by another core before the load retires, the CPU can flush and replay the affected loads in program order, so a load to load reordering is never architecturally visible. This approach has minimal performance overhead in the common case, since extra work is required only when a pending load is actually snooped.
That’s the simplest case. In practice, complex OoO CPUs may have other optimizations that make use of relaxed memory models, Apple M1 will tell us the real performance impact of TSO.
A simple TSO test
Motivated by testtso, I wrote a similar test program to evaluate the performance impact of TSO on Apple M1.
#include <atomic>
#include <thread>
#include <cassert>
constexpr size_t BUFFER_SIZE = 1 << 12;
std::atomic<int> buffer[BUFFER_SIZE];
void writer() {
    int local_counter = 0;
    while (true) {
        // 1. Sequentially write local_counter to the entire buffer
        for (size_t i = 0; i < BUFFER_SIZE; ++i) {
            buffer[i].store(local_counter, std::memory_order_relaxed);
        }
        local_counter ++;
    }
}
void reader() {
    while (true) {
        for (size_t i = 0; i < BUFFER_SIZE - 1; ++i) {
            // 2. Read buffer[i+1] first, then buffer[i]
            int val1 = buffer[i+1].load(std::memory_order_relaxed);
            int val2 = buffer[i  ].load(std::memory_order_relaxed);
#ifndef NO_ASSERT
            /* Under TSO, stores become visible in order and loads are
               not reordered with other loads, so the value read from
               buffer[i+1] can never exceed the value read afterwards
               from buffer[i].  Under a relaxed memory model this can
               fail, because both store to store and load to load
               reordering are permitted. */
            assert(val1 <= val2);
#endif
        }
    }
}
int main() {
    std::thread writer_thread(writer);
    std::thread reader_thread(reader);
    writer_thread.join();
    reader_thread.join();
    return 0;
}
Then compile it with GCC 15 at -O3 and run it on different systems.
On typical x86-64 systems, this test passes without any assertion failures (unless an integer overflow occurs after a very long run). However, the test fails frequently on Raspberry Pi 4 (Cortex-A72) and on Apple M1 (Firestorm cores) with TSO disabled, indicating that the relaxed memory model allows reordering that violates the expected behavior.
This raises a question: what is the performance impact of TSO on this test? And how do x86-64 systems perform on it compared to Apple M1 with TSO enabled and disabled?
We first look at the IPC (Instructions Per Cycle) metric on Apple M1 with TSO enabled and disabled, using perf stat --per-thread -e instructions,cycles, with the assertion disabled (built with -DNO_ASSERT) to avoid the failures. All experiments pinned the two threads to a single Firestorm cluster on Apple M1 via numactl --physcpubind=4-5.
| Thread | RMO IPC | TSO IPC | 
|---|---|---|
| Writer | 1.11 | 3.64 | 
| Reader | 4.12 | 2.68 | 
The reader thread shows a significant IPC drop when TSO is enabled, while the writer thread shows a substantial improvement. The reader-side slowdown is what we would expect; the writer result is the surprise. What happens to the writer thread when TSO is enabled?
Let’s dive into the performance counter events. Since only a few performance counters can be configured at once on Apple M1, I sampled 2 events every second, iterating over all events, and kept only the events that occurred at least 0.0001 times per instruction in RMO mode. Since our program does exactly the same work in both modes regardless of what the other thread is doing, we can compare the per-instruction event counts between RMO and TSO modes to see what changed.
For the reader thread:
| Event | RMO per inst | TSO per inst | diff | 
|---|---|---|---|
| CORE_ACTIVE_CYCLE | 0.2427 | 0.3728 | 1.536 | 
| INST_ALL | 1.0000 | 1.0000 | 1.000 | 
| INST_BRANCH | 0.1666 | 0.1670 | 1.002 | 
| INST_BRANCH_TAKEN | 0.1666 | 0.1666 | 1.000 | 
| INST_INT_ALU | 0.4995 | 0.4998 | 1.001 | 
| INST_INT_LD | 0.3329 | 0.3334 | 1.001 | 
| INST_LDST | 0.3331 | 0.3331 | 1.000 | 
| L1D_CACHE_MISS_LD | 0.1290 | 0.3272 | 2.536 | 
| L1D_CACHE_MISS_LD_NONSPEC | 0.1185 | 0.2948 | 2.489 | 
| L1D_TLB_ACCESS | 0.3336 | 0.3463 | 1.038 | 
| LD_UNIT_UOP | 0.4388 | 0.6087 | 1.387 | 
| MAP_DISPATCH_BUBBLE | 0.0002 | 0.0003 | 2.096 | 
| MAP_INT_UOP | 0.6684 | 0.7086 | 1.060 | 
| MAP_LDST_UOP | 0.3346 | 0.3545 | 1.059 | 
| MAP_REWIND | 0.0002 | 0.0021 | 11.135 | 
| MAP_STALL | 0.0880 | 0.2271 | 2.582 | 
| MAP_STALL_DISPATCH | 0.0872 | 0.2226 | 2.553 | 
| RETIRE_UOP | 0.9994 | 0.9998 | 1.000 | 
| SCHEDULE_EMPTY | 0.0004 | 0.0025 | 6.561 | 
| SCHEDULE_UOP | 0.8355 | 0.8798 | 1.053 | 
As we can see, LD_UNIT_UOP and L1D_CACHE_MISS_LD are significantly higher in TSO mode, indicating that loads take longer to complete under the additional constraints TSO imposes, likely because loads are flushed and replayed when their cache lines are snooped by the other core. This leads to more stalls in the mapping and scheduling stages, resulting in lower IPC.
For the writer thread:
| Event | RMO per inst | TSO per inst | diff | 
|---|---|---|---|
| CORE_ACTIVE_CYCLE | 0.9002 | 0.2748 | 0.305 | 
| INST_ALL | 1.0000 | 1.0000 | 1.000 | 
| INST_BRANCH | 0.2492 | 0.2497 | 1.002 | 
| INST_BRANCH_TAKEN | 0.2492 | 0.2499 | 1.003 | 
| INST_INT_ALU | 0.4999 | 0.4997 | 1.000 | 
| INST_INT_LD | 0.0005 | 0.0001 | 0.300 | 
| INST_INT_ST | 0.2488 | 0.2497 | 1.004 | 
| INST_LDST | 0.2494 | 0.2498 | 1.002 | 
| INTERRUPT_PENDING | 0.0002 | 0.0001 | 0.310 | 
| L1D_CACHE_MISS_ST | 0.0042 | 0.0001 | 0.016 | 
| L1D_CACHE_MISS_ST_NONSPEC | 0.0021 | 0.0001 | 0.032 | 
| L1D_CACHE_WRITEBACK | 0.0155 | 0.0080 | 0.513 | 
| L1D_TLB_ACCESS | 0.2494 | 0.2501 | 1.003 | 
| LD_UNIT_UOP | 0.0006 | 0.0002 | 0.350 | 
| MAP_DISPATCH_BUBBLE | 0.0002 | 0.0015 | 7.083 | 
| MAP_INT_UOP | 0.7498 | 0.7520 | 1.003 | 
| MAP_LDST_UOP | 0.2503 | 0.2509 | 1.002 | 
| MAP_REWIND | 0.0002 | 0.0002 | 0.893 | 
| MAP_STALL | 0.7536 | 0.0320 | 0.043 | 
| MAP_STALL_DISPATCH | 0.7532 | 0.0318 | 0.042 | 
| RETIRE_UOP | 1.0009 | 0.9996 | 0.999 | 
| SCHEDULE_EMPTY | 0.0005 | 0.0001 | 0.252 | 
| SCHEDULE_UOP | 0.7494 | 0.7512 | 1.002 | 
| ST_NT_UOP | 0.2175 | 0.0000 | 0.000 | 
| ST_UNIT_UOP | 0.2495 | 0.2498 | 1.001 | 
Here we can see that ST_NT_UOP (non-temporal store uops) essentially disappears in TSO mode. In RMO mode nearly all of the writer’s stores are non-temporal stores, which bypass the cache and go directly to memory; in TSO mode no non-temporal stores are observed at all. This suggests that Apple M1 restricts its non-temporal store optimization in TSO mode, likely because it cannot maintain the store ordering TSO requires. As a result, stores go through the cache hierarchy instead, and since the writer repeatedly overwrites the same 16 KiB buffer, which fits comfortably in L1, this is actually the faster choice here: note that L1D_CACHE_MISS_ST and L1D_CACHE_WRITEBACK both drop sharply in TSO mode. This change in store behavior explains the writer thread’s improved performance under TSO.
From the above analysis, we can conclude that TSO imposes additional constraints on memory operations that can degrade performance, particularly for load-heavy workloads. However, when it runs into a misfiring non-temporal store heuristic, Apple’s TSO implementation on M1 can actually improve performance by enforcing more cache-friendly store behavior. That’s quite interesting!
Real world applications
To further evaluate the performance impact of TSO on real-world applications, I ran the SPECrate 2006 and SPECrate 2017 benchmarks with only 1 copy (single-threaded) on Apple M1 with TSO enabled and disabled. All benchmarks were compiled with GCC 15 with -O3 -march=armv8.5-a. Note that I didn’t enable LTO, as most SPECrate submissions do, to reduce the complexity of the build process. The results are as follows:
SPECrate 2006:
| benchmark | RMO ratio | TSO ratio | diff | 
|---|---|---|---|
| 400.perlbench | 66.13 | 64.14 | 0.970 | 
| 401.bzip2 | 38.80 | 38.45 | 0.991 | 
| 403.gcc | 79.43 | 78.48 | 0.988 | 
| 429.mcf | 83.03 | 83.32 | 1.003 | 
| 445.gobmk | 55.62 | 54.64 | 0.983 | 
| 456.hmmer | 108.70 | 108.64 | 0.999 | 
| 458.sjeng | 44.90 | 44.98 | 1.002 | 
| 462.libquantum | 167.27 | 167.30 | 1.000 | 
| 464.h264ref | 96.40 | 23.81 | 0.247 | 
| 471.omnetpp | 40.04 | 39.48 | 0.986 | 
| 473.astar | 40.12 | 39.96 | 0.996 | 
| 483.xalancbmk | 71.44 | 64.92 | 0.909 | 
| SPECint_rate | 67.21 | 58.94 | 0.877 | 
| 410.bwaves | 132.59 | 112.64 | 0.850 | 
| 416.gamess | 75.95 | 71.12 | 0.936 | 
| 433.milc | 85.82 | 78.73 | 0.917 | 
| 434.zeusmp | 124.30 | 104.40 | 0.840 | 
| 435.gromacs | 64.64 | 63.93 | 0.989 | 
| 436.cactusADM | 103.22 | 37.49 | 0.363 | 
| 437.leslie3d | 121.80 | 95.74 | 0.786 | 
| 444.namd | 54.84 | 54.85 | 1.000 | 
| 447.dealII | 63.63 | 62.87 | 0.988 | 
| 450.soplex | 87.28 | 85.44 | 0.979 | 
| 453.povray | 99.17 | 88.44 | 0.892 | 
| 454.calculix | 21.84 | 21.75 | 0.996 | 
| 459.GemsFDTD | 94.95 | 86.32 | 0.909 | 
| 465.tonto | 93.31 | 69.35 | 0.743 | 
| 470.lbm | 164.07 | 145.85 | 0.889 | 
| 481.wrf | 134.64 | 127.36 | 0.946 | 
| 482.sphinx3 | 106.57 | 102.15 | 0.958 | 
| SPECfp_rate | 88.28 | 76.14 | 0.863 | 
SPECrate 2017:
| benchmark | RMO ratio | TSO ratio | diff | 
|---|---|---|---|
| 500.perlbench_r | 8.50 | 8.45 | 0.995 | 
| 502.gcc_r | 12.34 | 11.16 | 0.904 | 
| 505.mcf_r | 8.25 | 8.16 | 0.989 | 
| 520.omnetpp_r | 4.69 | 4.60 | 0.981 | 
| 523.xalancbmk_r | 5.37 | 5.36 | 0.998 | 
| 525.x264_r | 23.01 | 17.44 | 0.758 | 
| 531.deepsjeng_r | 5.19 | 5.19 | 1.000 | 
| 541.leela_r | 5.64 | 5.33 | 0.945 | 
| 548.exchange2_r | 25.23 | 25.17 | 0.998 | 
| 557.xz_r | 3.82 | 3.79 | 0.993 | 
| SPECint_rate | 8.15 | 7.77 | 0.953 | 
| 503.bwaves_r | 58.88 | 41.17 | 0.699 | 
| 507.cactuBSSN_r | 6.50 | 4.74 | 0.729 | 
| 508.namd_r | 9.57 | 9.46 | 0.988 | 
| 510.parest_r | 7.30 | 7.21 | 0.988 | 
| 511.povray_r | 11.12 | 9.99 | 0.898 | 
| 519.lbm_r | 8.60 | 8.09 | 0.942 | 
| 521.wrf_r | 16.35 | 13.71 | 0.839 | 
| 526.blender_r | 9.44 | 9.31 | 0.987 | 
| 527.cam4_r | 14.24 | 10.49 | 0.736 | 
| 538.imagick_r | 12.22 | 12.11 | 0.991 | 
| 544.nab_r | 10.05 | 10.01 | 0.996 | 
| 549.fotonik3d_r | 22.57 | 13.90 | 0.616 | 
| 554.roms_r | 10.35 | 8.62 | 0.832 | 
| SPECfp_rate | 12.39 | 10.59 | 0.855 | 
From the benchmark results, we can observe that several benchmarks experience significant performance degradation when TSO is enabled. Notably, benchmarks such as 464.h264ref, 483.xalancbmk, 410.bwaves, 434.zeusmp, 436.cactusADM, 437.leslie3d, 453.povray, 459.GemsFDTD, 465.tonto, and 470.lbm in SPECrate 2006 show performance drops ranging from roughly 9% to 75%. Similarly, in SPECrate 2017, benchmarks like 525.x264_r, 503.bwaves_r, 507.cactuBSSN_r, 511.povray_r, 521.wrf_r, 527.cam4_r, 549.fotonik3d_r, and 554.roms_r exhibit performance reductions between roughly 10% and 38%.
How would x86 systems fare on these benchmarks? It’s hard to say without a single hardware platform that implements both models to compare against. But based on the previous analysis of the simple TSO test, we can speculate that workloads with intensive memory access patterns, especially those involving frequent loads and stores, are more likely to be negatively impacted by TSO due to the additional ordering constraints it imposes on memory operations.
Conclusion
From the experiments and analysis conducted on Apple M1 with TSO enabled and disabled, we can conclude that TSO can indeed be a performance burden for modern CPUs in certain scenarios. The simple TSO test demonstrated that TSO imposes additional constraints on memory operations, leading to increased load latencies and stalls in load-heavy workloads. However, not every optimization that relaxed ordering enables is actually beneficial: Apple disables its non-temporal store optimization under TSO, and when that heuristic misfires, enabling TSO can paradoxically improve performance in specific workloads.
It would be better to implement such microarchitectural features in gem5 and characterize them more thoroughly; I hope to do that in the future.
Source code of the simple TSO test is available at github.com/cyyself/tsotest.