SPEC CPU 2017 mcf 与 2006 的差异浅析

内容纲要

背景

最近朋友 Easton Man 告诉了我一些SPEC CPU 2006 与 2017 的一些性能计数器 Topdown 统计结果,在观测他的结果后我惊讶地发现,SPEC CPU 2017中 mcf 这一 workload 对 Load 的占比明显低于 2006,这一发现吸引了我的兴趣。要知道,mcf 实际算的就是一个图上的车辆调度问题,具体可见SPEC文档,在SPEC 06中是一个十分具有代表性的对Cache压力较大的Workload。因此,我开始着手观测两个不同版本之间的区别。

第一眼

我首先分别使用最新版本的SPEC CPU 2006 与 SPEC CPU 2017 在13900K CPU上使用gcc 13.2并将spec的base编译的参数都调整为-O3,然后测试。

首先观察输入文件,对于mcf采用的是inp.in,结果是:

➜  ~ md5sum ~/spec06/benchspec/CPU2006/429.mcf/data/ref/input/inp.in
c9cbc843e228ad728ca48d99d8293b4a  /home/cyy/spec06/benchspec/CPU2006/429.mcf/data/ref/input/inp.in
➜  ~ md5sum ~/spec17/benchspec/CPU/505.mcf_r/data/refspeed/input/inp.in 
c9cbc843e228ad728ca48d99d8293b4a  /home/cyy/spec17/benchspec/CPU/505.mcf_r/data/refspeed/input/inp.in
➜  ~ head ~/spec17/benchspec/CPU/505.mcf_r/data/refspeed/input/inp.in
25137 185356
11220 13320
13020 15120
13560 15000
13680 15540
13680 15420
13920 16020
13980 14760
14040 16620
14040 15660

我们惊讶地发现,spec06与spec17的mcf居然采用的完全相同的输入,连md5都是一致的,然后我再使用time perf record ./mcf inp.in来运行他们,结果如下:

对于SPEC 06:

➜  run_base_ref_gcc-64bit.0000 git:(master) ✗ time perf record ./mcf_base.gcc-64bit inp.in 

MCF SPEC CPU2006 version 1.10
Copyright (c) 1998-2000 Zuse Institut Berlin (ZIB)
Copyright (c) 2000-2002 Andreas Loebel & ZIB
Copyright (c) 2003-2005 Andreas Loebel

nodes                      : 25137
active arcs                : 260767
simplex iterations         : 226605
objective value            : 15887540369
new implicit arcs          : 27067745
active arcs                : 27328512
simplex iterations         : 2591718
objective value            : 12129093300
erased arcs                : 27045002
new implicit arcs          : 27045002
active arcs                : 27328512
simplex iterations         : 4853421
objective value            : 11109478016
erased arcs                : 27032013
new implicit arcs          : 27032013
active arcs                : 27328512
simplex iterations         : 6008461
objective value            : 11006050760
erased arcs                : 27028145
new implicit arcs          : 27028145
active arcs                : 27328512
simplex iterations         : 6751264
objective value            : 10912231340
erased arcs                : 27022372
new implicit arcs          : 27022372
active arcs                : 27328512
simplex iterations         : 7224533
objective value            : 10860742840
checksum                   : 900463229
done
[ perf record: Woken up 64 times to write data ]
[ perf record: Captured and wrote 15.917 MB perf.data (346963 samples) ]
perf record ./mcf_base.gcc-64bit inp.in  86.77s user 0.41s system 99% cpu 1:27.25 total

对于SPEC 17:

➜  run_base_refspeed_mytest-m64.0002 git:(master) ✗ time perf record ./mcf_s_base.mytest-m64 inp.in

MCF SPEC CPU version 1.11
Copyright (c) 1998-2000 Zuse Institut Berlin (ZIB)
Copyright (c) 2000-2002 Andreas Loebel & ZIB
Copyright (c) 2003-2005 Andreas Loebel
Copyright (c) 2006-2010 Dres. Loebel, Borndoerfer & Weider GbR (LBW)

nodes                      : 25137
active arcs                : 260767
simplex iterations         : 186941
objective value            : 15887540369
new implicit arcs          : 24900000
active arcs                : 25160767
simplex iterations         : 910670
objective value            : 12149258952
erased arcs                : 24879927
new implicit arcs          : 24879927
active arcs                : 25160767
simplex iterations         : 1602744
objective value            : 11190089623
erased arcs                : 24865961
new implicit arcs          : 24865961
active arcs                : 25160767
simplex iterations         : 2037871
objective value            : 11117842678
erased arcs                : 24863099
new implicit arcs          : 24863099
active arcs                : 25160767
simplex iterations         : 2571954
objective value            : 10973400330
erased arcs                : 24856474
new implicit arcs          : 24856474
active arcs                : 25160767
simplex iterations         : 2848325
objective value            : 10941929690
erased arcs                : 24855075
new implicit arcs          : 24855075
active arcs                : 25160767
simplex iterations         : 3119034
objective value            : 10930801300
erased arcs                : 24852293
new implicit arcs          : 24852293
active arcs                : 25160767
simplex iterations         : 3222238
objective value            : 10900737970
erased arcs                : 24851921
new implicit arcs          : 10692330
active arcs                : 11001176
simplex iterations         : 3254309
objective value            : 10870751560
erased arcs                : 10693073
new implicit arcs          : 2674507
active arcs                : 2982610
simplex iterations         : 3272271
objective value            : 10830739950
erased arcs                : 2674399
new implicit arcs          : 3274293
active arcs                : 3582504
simplex iterations         : 3274536
objective value            : 10830734910
erased arcs                : 3274083
new implicit arcs          : 4272
active arcs                : 312693
simplex iterations         : 3275714
objective value            : 10810737850
erased arcs                : 4988
new implicit arcs          : 22220299
active arcs                : 22528004
simplex iterations         : 3290089
objective value            : 10790735340
erased arcs                : 22219641
new implicit arcs          : 263324
active arcs                : 571687
simplex iterations         : 3290490
objective value            : 10790734340
erased arcs                : 263180
new implicit arcs          : 2604
active arcs                : 311111
simplex iterations         : 3294939
objective value            : 10760755250
erased arcs                : 5797
new implicit arcs          : 12456506
active arcs                : 12761820
simplex iterations         : 3314430
objective value            : 10740734340
erased arcs                : 12454714
new implicit arcs          : 13513
active arcs                : 320619
simplex iterations         : 3321014
objective value            : 10710779070
erased arcs                : 16245
new implicit arcs          : 24856393
active arcs                : 25160767
simplex iterations         : 3342920
objective value            : 10700734370
erased arcs                : 24852861
new implicit arcs          : 192455
active arcs                : 500361
simplex iterations         : 3343079
objective value            : 10700734340
erased arcs                : 192474
new implicit arcs          : 1430
active arcs                : 309317
simplex iterations         : 3344535
objective value            : 10690737070
erased arcs                : 2477
new implicit arcs          : 21678146
active arcs                : 21984986
simplex iterations         : 3350975
objective value            : 10690734340
erased arcs                : 21677479
new implicit arcs          : 4137
active arcs                : 311644
simplex iterations         : 3351006
objective value            : 10690734340
erased arcs                : 4131
new implicit arcs          : 410
active arcs                : 307923
simplex iterations         : 3352523
objective value            : 10680737460
erased arcs                : 1621
new implicit arcs          : 13908801
active arcs                : 14215103
simplex iterations         : 3361973
objective value            : 10670734910
erased arcs                : 13907466
new implicit arcs          : 162824
active arcs                : 470461
simplex iterations         : 3362344
objective value            : 10670734340
erased arcs                : 162809
new implicit arcs          : 2034
active arcs                : 309686
simplex iterations         : 3362404
objective value            : 10670734340
erased arcs                : 2030
new implicit arcs          : 1079
active arcs                : 308735
simplex iterations         : 3362433
objective value            : 10670734340
erased arcs                : 1079
new implicit arcs          : 361
active arcs                : 308017
simplex iterations         : 3363035
objective value            : 10660738330
erased arcs                : 801
new implicit arcs          : 12537096
active arcs                : 12844312
simplex iterations         : 3365534
objective value            : 10660734340
erased arcs                : 12536730
new implicit arcs          : 5984
active arcs                : 313566
simplex iterations         : 3365574
objective value            : 10660734340
erased arcs                : 5983
new implicit arcs          : 4247
active arcs                : 311830
simplex iterations         : 3365626
objective value            : 10660734340
erased arcs                : 4244
new implicit arcs          : 946
active arcs                : 308532
simplex iterations         : 3366658
objective value            : 10650739260
erased arcs                : 1658
new implicit arcs          : 15021075
active arcs                : 15327949
simplex iterations         : 3369381
objective value            : 10650734340
erased arcs                : 15020452
new implicit arcs          : 660
active arcs                : 308157
simplex iterations         : 3370156
objective value            : 10640736920
erased arcs                : 1250
new implicit arcs          : 16305429
active arcs                : 16612336
simplex iterations         : 3371933
objective value            : 10640734340
erased arcs                : 16304937
new implicit arcs          : 4906
active arcs                : 312305
simplex iterations         : 3371977
objective value            : 10640734340
erased arcs                : 4903
new implicit arcs          : 2168
active arcs                : 309570
simplex iterations         : 3372003
objective value            : 10640734340
erased arcs                : 2166
new implicit arcs          : 1058
active arcs                : 308462
simplex iterations         : 3373383
objective value            : 10630742500
erased arcs                : 2117
new implicit arcs          : 11433214
active arcs                : 11739559
simplex iterations         : 3377666
objective value            : 10630734340
erased arcs                : 11431887
new implicit arcs          : 314
active arcs                : 307986
simplex iterations         : 3377674
objective value            : 10630734340
erased arcs                : 314
checksum                   : 10630734340
done
[ perf record: Woken up 270 times to write data ]
[ perf record: Captured and wrote 68.263 MB perf.data (1490407 samples) ]
perf record ./mcf_s_base.mytest-m64 inp.in  371.66s user 1.57s system 99% cpu 6:13.56 tota

可以看出,spec06的完成时对于User态的时间是86.77s,而spec17则使用了371.66s。

面对同一输入数据这么离谱的差距,我打开perf一看究竟。

perf发现

我们继续打开perf report观察代码路径的不同:

可以发现,最大的不同在于代码中排序的占比。

我们继续在两边的代码中观测sort被调用的位置的代码。

左边是17,右边是06,可以看出,在pbeampp.cprimal_bea_mpp最后会调用一个排序函数,然而spec17采用的是定义一个可以传入比较函数的spec_qsort函数实现,spec06则是实现了一个与要排序内容紧耦合的快排的快排。

这种传入比较函数的做法,在没有lto的情况下可能会是个非常大的灾难。

我们先来试试看如果把这一部分替换成spec06同款排序函数,结果会如何。

diff --git a/CPU/505.mcf_r/src/pbeampp.c b/CPU/505.mcf_r/src/pbeampp.c
index f359b83..616f20f 100644
--- a/CPU/505.mcf_r/src/pbeampp.c
+++ b/CPU/505.mcf_r/src/pbeampp.c
@@ -75,6 +75,50 @@ int cost_compare( b1, b2 )
     return -1;
 }

+#ifdef _PROTO_
+void sort_basket( long min, long max, BASKET **perm )
+#else
+void sort_basket( min, max )
+    long min, max;
+#endif
+{
+    long l, r;
+    cost_t cut;
+    int cut_id;
+    BASKET *xchange;
+
+    l = min; r = max;
+
+    cut = perm[ (long)( (l+r) / 2 ) ]->abs_cost;
+    cut_id = perm[ (long)( (l+r) / 2 ) ]->a->id;
+
+    do
+    {
+        while( perm[l]->abs_cost > cut || (perm[l]->abs_cost == cut && perm[l]->a->id < cut_id))
+            l++;
+        while( cut > perm[r]->abs_cost || (perm[r]->abs_cost == cut && cut_id < perm[r]->a->id))
+            r--;
+            
+        if( l < r )
+        {
+            xchange = perm[l];
+            perm[l] = perm[r];
+            perm[r] = xchange;
+        }
+        if( l <= r )
+        {
+            l++; r--;
+        }
+
+    }
+    while( l <= r );
+
+    if( min < r )
+        sort_basket( min, r, perm );
+    if( l < max)
+        sort_basket( l, max, perm ); 
+}
+

 #ifdef _PROTO_
 BASKET *primal_bea_mpp( LONG m,  arc_t *arcs, arc_t *stop_arcs,
@@ -175,6 +219,10 @@ LONG max_elems;
       return NULL;
     }

+    sort_basket(1, basket_sizes[thread], perm);
+
+    /*
+
 #if defined(SPEC)
     spec_qsort(perm + 1, basket_sizes[thread], sizeof(BASKET*),
             (int (*)(const void *, const void *))cost_compare);
@@ -182,6 +230,7 @@ LONG max_elems;
     qsort(perm + 1, basket_sizes[thread], sizeof(BASKET*),
             (int (*)(const void *, const void *))cost_compare);
 #endif
+     */

     return perm[1];

然后运行起来看看:

➜  run_base_refspeed_mytest-m64.0009 git:(master) ✗ time perf record ./mcf_s_base.mytest-m64 inp.in

MCF SPEC CPU version 1.11
Copyright (c) 1998-2000 Zuse Institut Berlin (ZIB)
Copyright (c) 2000-2002 Andreas Loebel & ZIB
Copyright (c) 2003-2005 Andreas Loebel
Copyright (c) 2006-2010 Dres. Loebel, Borndoerfer & Weider GbR (LBW)

nodes                      : 25137
active arcs                : 260767
...
checksum                   : 10630734340
done
[ perf record: Woken up 241 times to write data ]
[ perf record: Captured and wrote 60.397 MB perf.data (1318584 samples) ]
perf record ./mcf_s_base.mytest-m64 inp.in  328.86s user 1.38s system 99% cpu 5:30.53 total

可以看出,执行时间由原来的371.66s优化到了328.86s,性能一下子提升了13%。

再来和2006对比一下perf report:

排序占的时间比原来确实好些了,但问题依然存在。

然后我们不难看出,17的排序多考虑了id的比较,如果我们把这部分也删除(但会导致不等价)让其和06对齐,结果会如何呢?

diff --git a/CPU/505.mcf_r/src/pbeampp.c b/CPU/505.mcf_r/src/pbeampp.c
index 616f20f..a13da11 100644
--- a/CPU/505.mcf_r/src/pbeampp.c
+++ b/CPU/505.mcf_r/src/pbeampp.c
@@ -94,9 +94,9 @@ void sort_basket( min, max )

     do
     {
-        while( perm[l]->abs_cost > cut || (perm[l]->abs_cost == cut && perm[l]->a->id < cut_id))
+        while( perm[l]->abs_cost > cut)
             l++;
-        while( cut > perm[r]->abs_cost || (perm[r]->abs_cost == cut && cut_id < perm[r]->a->id))
+        while( cut > perm[r]->abs_cost)
             r--;

         if( l < r )
➜  run_base_refspeed_mytest-m64.0010 git:(mcf_replace_sort) ✗ time perf record ./mcf_s_base.mytest-m64 inp.in

MCF SPEC CPU version 1.11
Copyright (c) 1998-2000 Zuse Institut Berlin (ZIB)
Copyright (c) 2000-2002 Andreas Loebel & ZIB
Copyright (c) 2003-2005 Andreas Loebel
Copyright (c) 2006-2010 Dres. Loebel, Borndoerfer & Weider GbR (LBW)

nodes                      : 25137
active arcs                : 260767
simplex iterations         : 261564
objective value            : 15887540369
new implicit arcs          : 24900000
active arcs                : 25160767
simplex iterations         : 1015487
objective value            : 12169219830
erased arcs                : 24878556
new implicit arcs          : 24878556
active arcs                : 25160767
simplex iterations         : 1753460
objective value            : 11230266491
erased arcs                : 24865471
new implicit arcs          : 24865471
active arcs                : 25160767
simplex iterations         : 2143414
objective value            : 11108543865
erased arcs                : 24864129
new implicit arcs          : 24864129
active arcs                : 25160767
simplex iterations         : 2712227
objective value            : 10953288340
erased arcs                : 24856307
new implicit arcs          : 24856307
active arcs                : 25160767
simplex iterations         : 2901836
objective value            : 10932644130
erased arcs                : 24855784
new implicit arcs          : 24855784
active arcs                : 25160767
simplex iterations         : 3216746
objective value            : 10880756630
erased arcs                : 24851936
new implicit arcs          : 19467306
active arcs                : 19776137
simplex iterations         : 3294496
objective value            : 10820747660
erased arcs                : 19467809
new implicit arcs          : 11765603
active arcs                : 12073931
simplex iterations         : 3302523
objective value            : 10820734340
erased arcs                : 11765075
new implicit arcs          : 2460
active arcs                : 311316
simplex iterations         : 3304009
objective value            : 10810738360
erased arcs                : 3708
new implicit arcs          : 13756386
active arcs                : 14063994
simplex iterations         : 3324685
objective value            : 10760742560
erased arcs                : 13755546
new implicit arcs          : 760093
active arcs                : 1068541
simplex iterations         : 3327425
objective value            : 10760734340
erased arcs                : 760116
new implicit arcs          : 1441
active arcs                : 309866
simplex iterations         : 3329419
objective value            : 10750741810
erased arcs                : 3037
new implicit arcs          : 15972627
active arcs                : 16279456
simplex iterations         : 3337535
objective value            : 10740735780
erased arcs                : 15970876
new implicit arcs          : 15449
active arcs                : 324029
simplex iterations         : 3337640
objective value            : 10740734340
erased arcs                : 15411
new implicit arcs          : 1182
active arcs                : 309800
simplex iterations         : 3340281
objective value            : 10710757770
erased arcs                : 2686
new implicit arcs          : 12263971
active arcs                : 12571085
simplex iterations         : 3352689
objective value            : 10700734340
erased arcs                : 12262649
new implicit arcs          : 11165
active arcs                : 319601
simplex iterations         : 3353896
objective value            : 10690739140
erased arcs                : 11894
new implicit arcs          : 9782494
active arcs                : 10090201
simplex iterations         : 3362780
objective value            : 10670735150
erased arcs                : 9781899
new implicit arcs          : 7237
active arcs                : 315539
simplex iterations         : 3364220
objective value            : 10660739290
erased arcs                : 8345
new implicit arcs          : 7711402
active arcs                : 8018596
simplex iterations         : 3368616
objective value            : 10660734340
erased arcs                : 7710370
new implicit arcs          : 1887
active arcs                : 310113
simplex iterations         : 3368668
objective value            : 10660734340
erased arcs                : 1910
new implicit arcs          : 9812
active arcs                : 318015
simplex iterations         : 3368725
objective value            : 10660734340
erased arcs                : 9790
new implicit arcs          : 1066
active arcs                : 309291
simplex iterations         : 3369885
objective value            : 10650739860
erased arcs                : 1811
new implicit arcs          : 8634769
active arcs                : 8942249
simplex iterations         : 3372654
objective value            : 10650734340
erased arcs                : 8633832
new implicit arcs          : 7424
active arcs                : 315841
simplex iterations         : 3372747
objective value            : 10650734340
erased arcs                : 7415
new implicit arcs          : 672
active arcs                : 309098
simplex iterations         : 3372768
objective value            : 10650734340
erased arcs                : 672
new implicit arcs          : 81
active arcs                : 308507
simplex iterations         : 3372793
objective value            : 10650734340
erased arcs                : 85
new implicit arcs          : 2454
active arcs                : 310876
simplex iterations         : 3373971
objective value            : 10640739110
erased arcs                : 3383
new implicit arcs          : 17112782
active arcs                : 17420275
simplex iterations         : 3377663
objective value            : 10640734340
erased arcs                : 17111759
new implicit arcs          : 5765
active arcs                : 314281
simplex iterations         : 3377704
objective value            : 10640734340
erased arcs                : 5755
new implicit arcs          : 1778
active arcs                : 310304
simplex iterations         : 3377713
objective value            : 10640734340
erased arcs                : 1776
new implicit arcs          : 126
active arcs                : 308654
simplex iterations         : 3377729
objective value            : 10640734340
erased arcs                : 126
new implicit arcs          : 195
active arcs                : 308723
simplex iterations         : 3378321
objective value            : 10630736470
erased arcs                : 745
new implicit arcs          : 24852789
active arcs                : 25160767
simplex iterations         : 3380532
objective value            : 10630734340
erased arcs                : 24852258
new implicit arcs          : 3737
active arcs                : 312246
simplex iterations         : 3380595
objective value            : 10630734340
erased arcs                : 3736
new implicit arcs          : 1229
active arcs                : 309739
simplex iterations         : 3380627
objective value            : 10630734340
erased arcs                : 1227
checksum                   : 10630734340
done
[ perf record: Woken up 192 times to write data ]
[ perf record: Captured and wrote 48.189 MB perf.data (1051902 samples) ]
perf record ./mcf_s_base.mytest-m64 inp.in  262.20s user 1.37s system 99% cpu 4:24.41 tota

可以看出,尽管中间值不同,但最终的checksum还是10630734340。但计算时间进一步优化到了262.2s,相比baseline优化了惊人的41%。

再次观察perf report:

sort_backet仍然很多。

中间还做了其他尝试,包括修改MAX_NEW_ARCS_SMALL_NET,该值在spec06中定义为3000000,但17中为2000000,但未见显著差异。

我继续寻找,发现我在排序算法上依然与06不同,06中的快排在l<=B时不会继续递归,我继续修改:

diff --git a/CPU/505.mcf_r/src/pbeampp.c b/CPU/505.mcf_r/src/pbeampp.c
index a13da11..efe9e43 100644
--- a/CPU/505.mcf_r/src/pbeampp.c
+++ b/CPU/505.mcf_r/src/pbeampp.c
@@ -115,7 +115,7 @@ void sort_basket( min, max )

     if( min < r )
         sort_basket( min, r, perm );
-    if( l < max)
+    if( l < max && l <= B )
         sort_basket( l, max, perm ); 
 }

这样显然会带来运行结果的差异了。

先试试看运行(肉眼可见收敛速度快了不少):

➜  run_base_refspeed_mytest-m64.0012 git:(mcf_replace_sort) ✗ time perf record ./mcf_s_base.mytest-m64 inp.in

MCF SPEC CPU version 1.11
Copyright (c) 1998-2000 Zuse Institut Berlin (ZIB)
Copyright (c) 2000-2002 Andreas Loebel & ZIB
Copyright (c) 2003-2005 Andreas Loebel
Copyright (c) 2006-2010 Dres. Loebel, Borndoerfer & Weider GbR (LBW)

nodes                      : 25137
active arcs                : 260767
simplex iterations         : 261564
objective value            : 15887540369
new implicit arcs          : 24900000
active arcs                : 25160767
simplex iterations         : 1015487
objective value            : 12169219830
...
new implicit arcs          : 1229
active arcs                : 309739
simplex iterations         : 3380627
objective value            : 10630734340
erased arcs                : 1227
checksum                   : 10630734340
done
[ perf record: Woken up 151 times to write data ]
[ perf record: Captured and wrote 37.756 MB perf.data (824013 samples) ]
perf record ./mcf_s_base.mytest-m64 inp.in  205.08s user 1.45s system 99% cpu 3:27.38 tota

这样就直接优化到了205s了,相比最初提升了81%的性能。

我们再来看perf report,左边依然是和06对比,右边是新的17。

primal_bea_mpp占比已经与SPEC 06相当接近,至此,MCF的排序占比问题得到解决。

其他的不同等后续有了新结果再更。

结论

SPEC 06与SPEC 17的MCF差别很大一部分来自排序算法精度和稳定性的影响,SPEC 06排序算法使用inline比较+不比较id+快排mid左边元素足够多时不向下遍历(会导致结果不可靠),使其算法不被排序所干扰,但SPEC 17则不然,使用了非常精确的方式导致其大量的时间消耗在了排序上,而排序时访存模式非常具有局部性,造成缓存大小和内存延迟等硬件配置对性能影响减小,导致了17与06同样是mcf有着非常不同的硬件资源要求。

发表回复

您的电子邮箱地址不会被公开。 必填项已用 * 标注

Back to Top