SPEC CPU 2017 mcf 与 2006 的差异浅析
背景
最近朋友 Easton Man 告诉了我一些SPEC CPU 2006 与 2017 的一些性能计数器 Topdown 统计结果,在观测他的结果后我惊讶地发现,SPEC CPU 2017中 mcf 这一 workload 对 Load 的占比明显低于 2006,这一发现吸引了我的兴趣。要知道,mcf 实际算的就是一个图上的车辆调度问题,具体可见SPEC文档,在SPEC 06中是一个十分具有代表性的对Cache压力较大的Workload。因此,我开始着手观测两个不同版本之间的区别。
第一眼
我首先分别使用最新版本的SPEC CPU 2006 与 SPEC CPU 2017 在13900K CPU上使用gcc 13.2并将spec的base编译的参数都调整为-O3
,然后测试。
首先观察输入文件,对于mcf采用的是inp.in
,结果是:
➜ ~ md5sum ~/spec06/benchspec/CPU2006/429.mcf/data/ref/input/inp.in
c9cbc843e228ad728ca48d99d8293b4a /home/cyy/spec06/benchspec/CPU2006/429.mcf/data/ref/input/inp.in
➜ ~ md5sum ~/spec17/benchspec/CPU/505.mcf_r/data/refspeed/input/inp.in
c9cbc843e228ad728ca48d99d8293b4a /home/cyy/spec17/benchspec/CPU/505.mcf_r/data/refspeed/input/inp.in
➜ ~ head ~/spec17/benchspec/CPU/505.mcf_r/data/refspeed/input/inp.in
25137 185356
11220 13320
13020 15120
13560 15000
13680 15540
13680 15420
13920 16020
13980 14760
14040 16620
14040 15660
我们惊讶地发现,spec06与spec17的mcf居然采用的完全相同的输入,连md5都是一致的,然后我再使用time perf record ./mcf inp.in
来运行他们,结果如下:
对于SPEC 06:
➜ run_base_ref_gcc-64bit.0000 git:(master) ✗ time perf record ./mcf_base.gcc-64bit inp.in
MCF SPEC CPU2006 version 1.10
Copyright (c) 1998-2000 Zuse Institut Berlin (ZIB)
Copyright (c) 2000-2002 Andreas Loebel & ZIB
Copyright (c) 2003-2005 Andreas Loebel
nodes : 25137
active arcs : 260767
simplex iterations : 226605
objective value : 15887540369
new implicit arcs : 27067745
active arcs : 27328512
simplex iterations : 2591718
objective value : 12129093300
erased arcs : 27045002
new implicit arcs : 27045002
active arcs : 27328512
simplex iterations : 4853421
objective value : 11109478016
erased arcs : 27032013
new implicit arcs : 27032013
active arcs : 27328512
simplex iterations : 6008461
objective value : 11006050760
erased arcs : 27028145
new implicit arcs : 27028145
active arcs : 27328512
simplex iterations : 6751264
objective value : 10912231340
erased arcs : 27022372
new implicit arcs : 27022372
active arcs : 27328512
simplex iterations : 7224533
objective value : 10860742840
checksum : 900463229
done
[ perf record: Woken up 64 times to write data ]
[ perf record: Captured and wrote 15.917 MB perf.data (346963 samples) ]
perf record ./mcf_base.gcc-64bit inp.in 86.77s user 0.41s system 99% cpu 1:27.25 total
对于SPEC 17:
➜ run_base_refspeed_mytest-m64.0002 git:(master) ✗ time perf record ./mcf_s_base.mytest-m64 inp.in
MCF SPEC CPU version 1.11
Copyright (c) 1998-2000 Zuse Institut Berlin (ZIB)
Copyright (c) 2000-2002 Andreas Loebel & ZIB
Copyright (c) 2003-2005 Andreas Loebel
Copyright (c) 2006-2010 Dres. Loebel, Borndoerfer & Weider GbR (LBW)
nodes : 25137
active arcs : 260767
simplex iterations : 186941
objective value : 15887540369
new implicit arcs : 24900000
active arcs : 25160767
simplex iterations : 910670
objective value : 12149258952
erased arcs : 24879927
new implicit arcs : 24879927
active arcs : 25160767
simplex iterations : 1602744
objective value : 11190089623
erased arcs : 24865961
new implicit arcs : 24865961
active arcs : 25160767
simplex iterations : 2037871
objective value : 11117842678
erased arcs : 24863099
new implicit arcs : 24863099
active arcs : 25160767
simplex iterations : 2571954
objective value : 10973400330
erased arcs : 24856474
new implicit arcs : 24856474
active arcs : 25160767
simplex iterations : 2848325
objective value : 10941929690
erased arcs : 24855075
new implicit arcs : 24855075
active arcs : 25160767
simplex iterations : 3119034
objective value : 10930801300
erased arcs : 24852293
new implicit arcs : 24852293
active arcs : 25160767
simplex iterations : 3222238
objective value : 10900737970
erased arcs : 24851921
new implicit arcs : 10692330
active arcs : 11001176
simplex iterations : 3254309
objective value : 10870751560
erased arcs : 10693073
new implicit arcs : 2674507
active arcs : 2982610
simplex iterations : 3272271
objective value : 10830739950
erased arcs : 2674399
new implicit arcs : 3274293
active arcs : 3582504
simplex iterations : 3274536
objective value : 10830734910
erased arcs : 3274083
new implicit arcs : 4272
active arcs : 312693
simplex iterations : 3275714
objective value : 10810737850
erased arcs : 4988
new implicit arcs : 22220299
active arcs : 22528004
simplex iterations : 3290089
objective value : 10790735340
erased arcs : 22219641
new implicit arcs : 263324
active arcs : 571687
simplex iterations : 3290490
objective value : 10790734340
erased arcs : 263180
new implicit arcs : 2604
active arcs : 311111
simplex iterations : 3294939
objective value : 10760755250
erased arcs : 5797
new implicit arcs : 12456506
active arcs : 12761820
simplex iterations : 3314430
objective value : 10740734340
erased arcs : 12454714
new implicit arcs : 13513
active arcs : 320619
simplex iterations : 3321014
objective value : 10710779070
erased arcs : 16245
new implicit arcs : 24856393
active arcs : 25160767
simplex iterations : 3342920
objective value : 10700734370
erased arcs : 24852861
new implicit arcs : 192455
active arcs : 500361
simplex iterations : 3343079
objective value : 10700734340
erased arcs : 192474
new implicit arcs : 1430
active arcs : 309317
simplex iterations : 3344535
objective value : 10690737070
erased arcs : 2477
new implicit arcs : 21678146
active arcs : 21984986
simplex iterations : 3350975
objective value : 10690734340
erased arcs : 21677479
new implicit arcs : 4137
active arcs : 311644
simplex iterations : 3351006
objective value : 10690734340
erased arcs : 4131
new implicit arcs : 410
active arcs : 307923
simplex iterations : 3352523
objective value : 10680737460
erased arcs : 1621
new implicit arcs : 13908801
active arcs : 14215103
simplex iterations : 3361973
objective value : 10670734910
erased arcs : 13907466
new implicit arcs : 162824
active arcs : 470461
simplex iterations : 3362344
objective value : 10670734340
erased arcs : 162809
new implicit arcs : 2034
active arcs : 309686
simplex iterations : 3362404
objective value : 10670734340
erased arcs : 2030
new implicit arcs : 1079
active arcs : 308735
simplex iterations : 3362433
objective value : 10670734340
erased arcs : 1079
new implicit arcs : 361
active arcs : 308017
simplex iterations : 3363035
objective value : 10660738330
erased arcs : 801
new implicit arcs : 12537096
active arcs : 12844312
simplex iterations : 3365534
objective value : 10660734340
erased arcs : 12536730
new implicit arcs : 5984
active arcs : 313566
simplex iterations : 3365574
objective value : 10660734340
erased arcs : 5983
new implicit arcs : 4247
active arcs : 311830
simplex iterations : 3365626
objective value : 10660734340
erased arcs : 4244
new implicit arcs : 946
active arcs : 308532
simplex iterations : 3366658
objective value : 10650739260
erased arcs : 1658
new implicit arcs : 15021075
active arcs : 15327949
simplex iterations : 3369381
objective value : 10650734340
erased arcs : 15020452
new implicit arcs : 660
active arcs : 308157
simplex iterations : 3370156
objective value : 10640736920
erased arcs : 1250
new implicit arcs : 16305429
active arcs : 16612336
simplex iterations : 3371933
objective value : 10640734340
erased arcs : 16304937
new implicit arcs : 4906
active arcs : 312305
simplex iterations : 3371977
objective value : 10640734340
erased arcs : 4903
new implicit arcs : 2168
active arcs : 309570
simplex iterations : 3372003
objective value : 10640734340
erased arcs : 2166
new implicit arcs : 1058
active arcs : 308462
simplex iterations : 3373383
objective value : 10630742500
erased arcs : 2117
new implicit arcs : 11433214
active arcs : 11739559
simplex iterations : 3377666
objective value : 10630734340
erased arcs : 11431887
new implicit arcs : 314
active arcs : 307986
simplex iterations : 3377674
objective value : 10630734340
erased arcs : 314
checksum : 10630734340
done
[ perf record: Woken up 270 times to write data ]
[ perf record: Captured and wrote 68.263 MB perf.data (1490407 samples) ]
perf record ./mcf_s_base.mytest-m64 inp.in 371.66s user 1.57s system 99% cpu 6:13.56 tota
可以看出,spec06的完成时对于User态的时间是86.77s,而spec17则使用了371.66s。
面对同一输入数据这么离谱的差距,我打开perf一看究竟。
perf发现
我们继续打开perf report
观察代码路径的不同:
可以发现,最大的不同在于代码中排序的占比。
我们继续在两边的代码中观测sort被调用的位置的代码。
左边是17,右边是06,可以看出,在pbeampp.c
的primal_bea_mpp
最后会调用一个排序函数,然而spec17采用的是定义一个可以传入比较函数的spec_qsort函数实现,spec06则是实现了一个与要排序内容紧耦合的快排的快排。
这种传入比较函数的做法,在没有lto的情况下可能会是个非常大的灾难。
我们先来试试看如果把这一部分替换成spec06同款排序函数,结果会如何。
diff --git a/CPU/505.mcf_r/src/pbeampp.c b/CPU/505.mcf_r/src/pbeampp.c
index f359b83..616f20f 100644
--- a/CPU/505.mcf_r/src/pbeampp.c
+++ b/CPU/505.mcf_r/src/pbeampp.c
@@ -75,6 +75,50 @@ int cost_compare( b1, b2 )
return -1;
}
+#ifdef _PROTO_
+void sort_basket( long min, long max, BASKET **perm )
+#else
+void sort_basket( min, max )
+ long min, max;
+#endif
+{
+ long l, r;
+ cost_t cut;
+ int cut_id;
+ BASKET *xchange;
+
+ l = min; r = max;
+
+ cut = perm[ (long)( (l+r) / 2 ) ]->abs_cost;
+ cut_id = perm[ (long)( (l+r) / 2 ) ]->a->id;
+
+ do
+ {
+ while( perm[l]->abs_cost > cut || (perm[l]->abs_cost == cut && perm[l]->a->id < cut_id))
+ l++;
+ while( cut > perm[r]->abs_cost || (perm[r]->abs_cost == cut && cut_id < perm[r]->a->id))
+ r--;
+
+ if( l < r )
+ {
+ xchange = perm[l];
+ perm[l] = perm[r];
+ perm[r] = xchange;
+ }
+ if( l <= r )
+ {
+ l++; r--;
+ }
+
+ }
+ while( l <= r );
+
+ if( min < r )
+ sort_basket( min, r, perm );
+ if( l < max)
+ sort_basket( l, max, perm );
+}
+
#ifdef _PROTO_
BASKET *primal_bea_mpp( LONG m, arc_t *arcs, arc_t *stop_arcs,
@@ -175,6 +219,10 @@ LONG max_elems;
return NULL;
}
+ sort_basket(1, basket_sizes[thread], perm);
+
+ /*
+
#if defined(SPEC)
spec_qsort(perm + 1, basket_sizes[thread], sizeof(BASKET*),
(int (*)(const void *, const void *))cost_compare);
@@ -182,6 +230,7 @@ LONG max_elems;
qsort(perm + 1, basket_sizes[thread], sizeof(BASKET*),
(int (*)(const void *, const void *))cost_compare);
#endif
+ */
return perm[1];
然后运行起来看看:
➜ run_base_refspeed_mytest-m64.0009 git:(master) ✗ time perf record ./mcf_s_base.mytest-m64 inp.in
MCF SPEC CPU version 1.11
Copyright (c) 1998-2000 Zuse Institut Berlin (ZIB)
Copyright (c) 2000-2002 Andreas Loebel & ZIB
Copyright (c) 2003-2005 Andreas Loebel
Copyright (c) 2006-2010 Dres. Loebel, Borndoerfer & Weider GbR (LBW)
nodes : 25137
active arcs : 260767
...
checksum : 10630734340
done
[ perf record: Woken up 241 times to write data ]
[ perf record: Captured and wrote 60.397 MB perf.data (1318584 samples) ]
perf record ./mcf_s_base.mytest-m64 inp.in 328.86s user 1.38s system 99% cpu 5:30.53 total
可以看出,执行时间由原来的371.66s优化到了328.86s,性能一下子提升了13%。
再来和2006对比一下perf report:
排序占的时间比原来确实好些了,但问题依然存在。
然后我们不难看出,17的排序多考虑了id的比较,如果我们把这部分也删除(但会导致不等价)让其和06对齐,结果会如何呢?
diff --git a/CPU/505.mcf_r/src/pbeampp.c b/CPU/505.mcf_r/src/pbeampp.c
index 616f20f..a13da11 100644
--- a/CPU/505.mcf_r/src/pbeampp.c
+++ b/CPU/505.mcf_r/src/pbeampp.c
@@ -94,9 +94,9 @@ void sort_basket( min, max )
do
{
- while( perm[l]->abs_cost > cut || (perm[l]->abs_cost == cut && perm[l]->a->id < cut_id))
+ while( perm[l]->abs_cost > cut)
l++;
- while( cut > perm[r]->abs_cost || (perm[r]->abs_cost == cut && cut_id < perm[r]->a->id))
+ while( cut > perm[r]->abs_cost)
r--;
if( l < r )
➜ run_base_refspeed_mytest-m64.0010 git:(mcf_replace_sort) ✗ time perf record ./mcf_s_base.mytest-m64 inp.in
MCF SPEC CPU version 1.11
Copyright (c) 1998-2000 Zuse Institut Berlin (ZIB)
Copyright (c) 2000-2002 Andreas Loebel & ZIB
Copyright (c) 2003-2005 Andreas Loebel
Copyright (c) 2006-2010 Dres. Loebel, Borndoerfer & Weider GbR (LBW)
nodes : 25137
active arcs : 260767
simplex iterations : 261564
objective value : 15887540369
new implicit arcs : 24900000
active arcs : 25160767
simplex iterations : 1015487
objective value : 12169219830
erased arcs : 24878556
new implicit arcs : 24878556
active arcs : 25160767
simplex iterations : 1753460
objective value : 11230266491
erased arcs : 24865471
new implicit arcs : 24865471
active arcs : 25160767
simplex iterations : 2143414
objective value : 11108543865
erased arcs : 24864129
new implicit arcs : 24864129
active arcs : 25160767
simplex iterations : 2712227
objective value : 10953288340
erased arcs : 24856307
new implicit arcs : 24856307
active arcs : 25160767
simplex iterations : 2901836
objective value : 10932644130
erased arcs : 24855784
new implicit arcs : 24855784
active arcs : 25160767
simplex iterations : 3216746
objective value : 10880756630
erased arcs : 24851936
new implicit arcs : 19467306
active arcs : 19776137
simplex iterations : 3294496
objective value : 10820747660
erased arcs : 19467809
new implicit arcs : 11765603
active arcs : 12073931
simplex iterations : 3302523
objective value : 10820734340
erased arcs : 11765075
new implicit arcs : 2460
active arcs : 311316
simplex iterations : 3304009
objective value : 10810738360
erased arcs : 3708
new implicit arcs : 13756386
active arcs : 14063994
simplex iterations : 3324685
objective value : 10760742560
erased arcs : 13755546
new implicit arcs : 760093
active arcs : 1068541
simplex iterations : 3327425
objective value : 10760734340
erased arcs : 760116
new implicit arcs : 1441
active arcs : 309866
simplex iterations : 3329419
objective value : 10750741810
erased arcs : 3037
new implicit arcs : 15972627
active arcs : 16279456
simplex iterations : 3337535
objective value : 10740735780
erased arcs : 15970876
new implicit arcs : 15449
active arcs : 324029
simplex iterations : 3337640
objective value : 10740734340
erased arcs : 15411
new implicit arcs : 1182
active arcs : 309800
simplex iterations : 3340281
objective value : 10710757770
erased arcs : 2686
new implicit arcs : 12263971
active arcs : 12571085
simplex iterations : 3352689
objective value : 10700734340
erased arcs : 12262649
new implicit arcs : 11165
active arcs : 319601
simplex iterations : 3353896
objective value : 10690739140
erased arcs : 11894
new implicit arcs : 9782494
active arcs : 10090201
simplex iterations : 3362780
objective value : 10670735150
erased arcs : 9781899
new implicit arcs : 7237
active arcs : 315539
simplex iterations : 3364220
objective value : 10660739290
erased arcs : 8345
new implicit arcs : 7711402
active arcs : 8018596
simplex iterations : 3368616
objective value : 10660734340
erased arcs : 7710370
new implicit arcs : 1887
active arcs : 310113
simplex iterations : 3368668
objective value : 10660734340
erased arcs : 1910
new implicit arcs : 9812
active arcs : 318015
simplex iterations : 3368725
objective value : 10660734340
erased arcs : 9790
new implicit arcs : 1066
active arcs : 309291
simplex iterations : 3369885
objective value : 10650739860
erased arcs : 1811
new implicit arcs : 8634769
active arcs : 8942249
simplex iterations : 3372654
objective value : 10650734340
erased arcs : 8633832
new implicit arcs : 7424
active arcs : 315841
simplex iterations : 3372747
objective value : 10650734340
erased arcs : 7415
new implicit arcs : 672
active arcs : 309098
simplex iterations : 3372768
objective value : 10650734340
erased arcs : 672
new implicit arcs : 81
active arcs : 308507
simplex iterations : 3372793
objective value : 10650734340
erased arcs : 85
new implicit arcs : 2454
active arcs : 310876
simplex iterations : 3373971
objective value : 10640739110
erased arcs : 3383
new implicit arcs : 17112782
active arcs : 17420275
simplex iterations : 3377663
objective value : 10640734340
erased arcs : 17111759
new implicit arcs : 5765
active arcs : 314281
simplex iterations : 3377704
objective value : 10640734340
erased arcs : 5755
new implicit arcs : 1778
active arcs : 310304
simplex iterations : 3377713
objective value : 10640734340
erased arcs : 1776
new implicit arcs : 126
active arcs : 308654
simplex iterations : 3377729
objective value : 10640734340
erased arcs : 126
new implicit arcs : 195
active arcs : 308723
simplex iterations : 3378321
objective value : 10630736470
erased arcs : 745
new implicit arcs : 24852789
active arcs : 25160767
simplex iterations : 3380532
objective value : 10630734340
erased arcs : 24852258
new implicit arcs : 3737
active arcs : 312246
simplex iterations : 3380595
objective value : 10630734340
erased arcs : 3736
new implicit arcs : 1229
active arcs : 309739
simplex iterations : 3380627
objective value : 10630734340
erased arcs : 1227
checksum : 10630734340
done
[ perf record: Woken up 192 times to write data ]
[ perf record: Captured and wrote 48.189 MB perf.data (1051902 samples) ]
perf record ./mcf_s_base.mytest-m64 inp.in 262.20s user 1.37s system 99% cpu 4:24.41 tota
可以看出,尽管中间值不同,但最终的checksum还是10630734340。但计算时间进一步优化到了262.2s,相比baseline优化了惊人的41%。
再次观察perf report:
sort_backet仍然很多。
中间还做了其他尝试,包括修改MAX_NEW_ARCS_SMALL_NET,该值在spec06中定义为3000000,但17中为2000000,但未见显著差异。
我继续寻找,发现我在排序算法上依然与06不同,06中的快排在l<=B时不会继续递归,我继续修改:
diff --git a/CPU/505.mcf_r/src/pbeampp.c b/CPU/505.mcf_r/src/pbeampp.c
index a13da11..efe9e43 100644
--- a/CPU/505.mcf_r/src/pbeampp.c
+++ b/CPU/505.mcf_r/src/pbeampp.c
@@ -115,7 +115,7 @@ void sort_basket( min, max )
if( min < r )
sort_basket( min, r, perm );
- if( l < max)
+ if( l < max && l <= B )
sort_basket( l, max, perm );
}
这样显然会带来运行结果的差异了。
先试试看运行(肉眼可见收敛速度快了不少):
➜ run_base_refspeed_mytest-m64.0012 git:(mcf_replace_sort) ✗ time perf record ./mcf_s_base.mytest-m64 inp.in
MCF SPEC CPU version 1.11
Copyright (c) 1998-2000 Zuse Institut Berlin (ZIB)
Copyright (c) 2000-2002 Andreas Loebel & ZIB
Copyright (c) 2003-2005 Andreas Loebel
Copyright (c) 2006-2010 Dres. Loebel, Borndoerfer & Weider GbR (LBW)
nodes : 25137
active arcs : 260767
simplex iterations : 261564
objective value : 15887540369
new implicit arcs : 24900000
active arcs : 25160767
simplex iterations : 1015487
objective value : 12169219830
...
new implicit arcs : 1229
active arcs : 309739
simplex iterations : 3380627
objective value : 10630734340
erased arcs : 1227
checksum : 10630734340
done
[ perf record: Woken up 151 times to write data ]
[ perf record: Captured and wrote 37.756 MB perf.data (824013 samples) ]
perf record ./mcf_s_base.mytest-m64 inp.in 205.08s user 1.45s system 99% cpu 3:27.38 tota
这样就直接优化到了205s了,相比最初提升了81%的性能。
我们再来看perf report
,左边依然是和06对比,右边是新的17。
primal_bea_mpp
占比已经与SPEC 06相当接近,至此,MCF的排序占比问题得到解决。
其他的不同等后续有了新结果再更。
结论
SPEC 06与SPEC 17的MCF差别很大一部分来自排序算法精度和稳定性的影响,SPEC 06排序算法使用inline比较+不比较id+快排mid左边元素足够多时不向下遍历(会导致结果不可靠),使其算法不被排序所干扰,但SPEC 17则不然,使用了非常精确的方式导致其大量的时间消耗在了排序上,而排序时访存模式非常具有局部性,造成缓存大小和内存延迟等硬件配置对性能影响减小,导致了17与06同样是mcf有着非常不同的硬件资源要求。