GDB Watchpoint Experiments on x86, MIPS, and ARM, and an Illegal Instruction on MIPS

Published on: March 28, 2021 / March 30, 2021 | Author: Yangyu Chen

A glibc pitfall on MIPS

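I recently had the idea of using the debug facilities provided by hardware for some simple security protection, so I ran a small experiment to measure the performance cost of setting a GDB watchpoint on a platform without hardware debug support. While testing on MIPS, however, I ran into a small problem.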

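I first cross-compiled the test program on another machine with mipsel-linux-gnu-gcc, adding only -static -g for static linking and debug info, and confirmed that it ran on x86 under qemu-user-static. I then copied it over sftp to the MIPS machine. (Apart from an FPGA soft core, the only MIPS hardware I had at hand was a router with an mt7621 CPU, so the router became my test environment.) This is what happened: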

root@OpenWrt:/tmp/gdb-perf# file a.out
a.out: ELF 32-bit LSB executable, MIPS, MIPS32 rel2 version 1 (SYSV), statically linked, BuildID[sha1]=d70d168be029f5285b50467e6446459f777b6d6d, for GNU/Linux 3.2.0, with debug_info, not stripped
root@OpenWrt:/tmp/gdb-perf# ./a.out
Illegal instruction
root@OpenWrt:/tmp/gdb-perf#

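This puzzled me: my code contains no floating-point operations, so in principle there should be no need for a compiler option that forces soft float.

I then ran it under gdb, with the following output: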

(gdb) r
Starting program: /tmp/gdb-perf/a.out

Program received signal SIGILL, Illegal instruction.
0x00438bb8 in __sigsetjmp_aux ()
(gdb) x/i $pc
=> 0x438bb8 <__sigsetjmp_aux+8>:   sdc1    $f20,56(a0)
(gdb) info registers
          zero       at       v0       v1       a0       a1       a2       a3
 R0   00000000 00000001 00496720 004271c0 7fffbcf8 00000000 7fffbce0 00000000
            t0       t1       t2       t3       t4       t5       t6       t7
 R8   0049643c 0049643c 2f2f2f2f ffffffff 00000000 00490000 00000000 00494c4c
            s0       s1       s2       s3       s4       s5       s6       s7
 R16  004011c0 00459c8c 7fffbd94 004308a1 77fd9ed8 78005000 77fffdb0 00000000
            t8       t9       k0       k1       gp       sp       s8       ra
 R24  0000000c 00438bb0 00470000 00000000 0049daf0 7fffbce0 00000000 00400c20
        status       lo       hi badvaddr    cause       pc
      01008413 0000000b 0000002c 00438a84 5080002c 00438bb8
          fcsr      fir      hi1      lo1      hi2      lo2      hi3      lo3
      00000000 00f30000 00000000 00000000 00000000 00000000 00000000 00000000
        dspctl  restart
      00000000 00000000

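At this point the cause is clear: this CPU has no Coprocessor 1 (FPU), so it cannot execute the sdc1 instruction.

So why is there an sdc1 when my code has no floating-point instructions? It turned out that glibc was responsible. The relevant code is: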

int __attribute__ ((nomips16))
inhibit_stack_protector
__sigsetjmp_aux (jmp_buf env, int savemask, int sp, int fp)
{
#ifdef __mips_hard_float
  /* Store the floating point callee-saved registers...  */
  asm volatile ("s.d $f20, %0" : : "m" (env[0].__jmpbuf[0].__fpregs[0]));
  asm volatile ("s.d $f22, %0" : : "m" (env[0].__jmpbuf[0].__fpregs[1]));
  asm volatile ("s.d $f24, %0" : : "m" (env[0].__jmpbuf[0].__fpregs[2]));
  asm volatile ("s.d $f26, %0" : : "m" (env[0].__jmpbuf[0].__fpregs[3]));
  asm volatile ("s.d $f28, %0" : : "m" (env[0].__jmpbuf[0].__fpregs[4]));
  asm volatile ("s.d $f30, %0" : : "m" (env[0].__jmpbuf[0].__fpregs[5]));
#endif
  /* .. and the PC;  */
  asm volatile ("sw $31, %0" : : "m" (env[0].__jmpbuf[0].__pc));
  /* .. and the stack pointer;  */
  env[0].__jmpbuf[0].__sp = (void *) sp;
  /* .. and the FP; it'll be in s8. */
  env[0].__jmpbuf[0].__fp = (void *) fp;
  /* .. and the GP; */
  asm volatile ("sw $gp, %0" : : "m" (env[0].__jmpbuf[0].__gp));
  /* .. and the callee-saved registers; */
  asm volatile ("sw $16, %0" : : "m" (env[0].__jmpbuf[0].__regs[0]));
  asm volatile ("sw $17, %0" : : "m" (env[0].__jmpbuf[0].__regs[1]));
  asm volatile ("sw $18, %0" : : "m" (env[0].__jmpbuf[0].__regs[2]));
  asm volatile ("sw $19, %0" : : "m" (env[0].__jmpbuf[0].__regs[3]));
  asm volatile ("sw $20, %0" : : "m" (env[0].__jmpbuf[0].__regs[4]));
  asm volatile ("sw $21, %0" : : "m" (env[0].__jmpbuf[0].__regs[5]));
  asm volatile ("sw $22, %0" : : "m" (env[0].__jmpbuf[0].__regs[6]));
  asm volatile ("sw $23, %0" : : "m" (env[0].__jmpbuf[0].__regs[7]));
  /* Save the signal mask if requested.  */
  return __sigjmp_save (env, savemask);
}

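This happens because glibc was compiled with __mips_hard_float defined, which GCC does automatically when targeting the hard-float ABI, and my local glibc is a prebuilt one that is linked in as is. That leaves two options: rebuild a suitable glibc locally and link against it, or do the linking (ld) on the router itself.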

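How I solved it

After struggling for an entire evening without success, and finding that none of the odd tools I tried offered a quick fix, I decided to run the compiler directly on the hardware. Storage space was insufficient but there was plenty of RAM, so I installed gcc straight into RAM:

opkg install gcc -d ram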

GDB Watchpoint performance experiment

First, the test program, which is very simple:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define loop_round 10000

int arr[2048];//8kb
int main() {
    //Make sure the page size configured in the kernel .config is 4KB; some MIPS builds prefer 16KB pages
    if (((long long)arr & (~((1ll<<12)-1))) != ((long long)(arr+1) & (~((1ll<<12)-1)))) {
        printf("arr[0] and arr[1] are on the different pages.\n");
        return 1;
    }
    if (((long long)arr & (~((1ll<<12)-1))) == ((long long)(arr+2047) & (~((1ll<<12)-1)))) {
        printf("arr[0] and arr[2047] are on the same page.\n");
        return 1;
    }
    //gdb watch arr[1]
    clock_t start,end;
    arr[0] = 0xff;//make tlb hit
    start = clock();
    int i;
    for (i=0;i<loop_round;i++) arr[0] = 0xff;
    end = clock();
    printf("time_spent on the watching page = %ld\n",(long)(end-start));
    arr[2047] = 0xff;//make tlb hit
    start = clock();
    for (i=0;i<loop_round;i++) arr[2047] = 0xff;
    end = clock();
    printf("time_spent on the different page = %ld\n",(long)(end-start));
    return 0;
}

x86 platform

Below are the x86 results with loop_round set to 1e9:

Run directly:

time_spent on the watching page = 2131625
time_spent on the different page = 2057869

With a gdb watchpoint set on arr[1]:

time_spent on the watching page = 2124418
time_spent on the different page = 2060249

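As you can see, setting the gdb watchpoint has almost no effect on performance.

(Tested on both AMD Zen+ and Intel Coffee Lake with the same conclusion; the results shown are from an AMD Ryzen 2700.)

ARM platform

I tested on both a Raspberry Pi 2 (ARMv7) and an NVIDIA Jetson Nano (ARMv8) and saw no performance loss. The results are similar to x86, so I am omitting the data.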

MIPS platform

On MIPS, to get results quickly, I set loop_round to 10000.

Test platform: mt7621

Run directly:

time_spent on the watching page = 159
time_spent on the different page = 203

With a gdb watchpoint set on arr[1]:

time_spent on the watching page = 1372008
time_spent on the different page = 198

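As you can see, the time increases dramatically once the watchpoint is set, and the result is still reproducible when the watchpoint is moved to the second page.

Unlike x86, MIPS has no DR0~DR3 debug registers with which to implement hardware watchpoints. As far as I can tell, the watchpoint can then only be implemented by clearing the W bit in the page table so that every write to that page raises an exception, which the kernel forwards to gdb. This means each such memory access costs at least 4 context switches, which hurts performance badly.

Finally, here is the htop status while the test was running: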

  1  [|||||||||||||||||||||||| 77.4%]   Tasks: 37, 0 thr; 2 running
  2  [||||||||                 25.8%]   Load average: 0.50 0.40 0.52
  3  [|||||||||||              32.3%]   Uptime: 02:43:58
  4  [|||||||                  21.3%]
  Mem[||||||||||||         148M/501M]
  Swp[                         0K/0K]

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
 6995 root       20   0  6468  4596  3540 R 98.7  0.9  0:44.66 gdb a.out
16039 root       20   0   932   516   484 t  4.5  0.1  0:00.79 /tmp/lab/a.out

Takeaway

This experiment shows how much hardware/software co-design pays off.