Given: a Broadwell CPU with Hyper-Threading disabled in the BIOS

[root@ny4srv03 ~]# lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  44
  On-line CPU(s) list:   0-43
Vendor ID:               GenuineIntel
  BIOS Vendor ID:        Intel
  Model name:            Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz
    BIOS Model name:     Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz
    CPU family:          6
    Model:               79
    Thread(s) per core:  1
    Core(s) per socket:  22
    Socket(s):           2
    Stepping:            1
    CPU max MHz:         3700.0000
    CPU min MHz:         1200.0000
    BogoMIPS:            4399.69
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aper
                         fmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_
                         l3 invpcid_single intel_ppin tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local d
                         therm ida arat pln pts
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   1.4 MiB (44 instances)
  L1i:                   1.4 MiB (44 instances)
  L2:                    11 MiB (44 instances)
  L3:                    110 MiB (2 instances)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-21
  NUMA node1 CPU(s):     22-43
Vulnerabilities:
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX vulnerable, SMT disabled
  Mds:                   Vulnerable; SMT disabled
  Meltdown:              Vulnerable
  Mmio stale data:       Vulnerable
  Retbleed:              Not affected
  Spec store bypass:     Vulnerable
  Spectre v1:            Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
  Spectre v2:            Vulnerable, STIBP: disabled, PBRSB-eIBRS: Not affected
  Srbds:                 Not affected
  Tsx async abort:       Vulnerable

According to the Intel 64 and IA-32 Architectures Software Developer's Manual:

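If a processor core is shared by two logical processors, each logical processor can access up to four counters (IA32_PMC0-IA32_PMC3). This is the same as in the prior generation for processors based on Nehalem microarchitecture. If a processor core is not shared by two logical processors, up to eight general-purpose counters are visible. If CPUID.0AH:EAX[15:8] reports 8 counters, then IA32_PMC4-IA32_PMC7 would cover the MSR addresses 0C5H through 0C8H. Each counter is accompanied by an event select MSR (IA32_PERFEVTSEL4-IA32_PERFEVTSEL7).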

So there should be 8 accessible performance counters, and cpuid confirms this:

[root@ny4srv03 ~]# cpuid -1 | grep counters
      number of counters per logical processor = 0x8 (8)
      number of contiguous fixed counters      = 0x3 (3)
      bit width of fixed counters              = 0x30 (48)
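
The counter count that cpuid prints comes from CPUID leaf 0AH, the same field the manual excerpt refers to. Below is a minimal Python sketch of how that register is laid out and which MSR addresses the extra counters would occupy. The EAX value is made up for illustration (it was not read from this machine), and the function names are my own.

```python
# MSR addresses of the first general-purpose counter and its event-select register.
IA32_PMC0 = 0xC1
IA32_PERFEVTSEL0 = 0x186


def decode_cpuid_0a_eax(eax):
    """Split CPUID.0AH:EAX into the architectural perfmon fields."""
    return {
        "version": eax & 0xFF,             # EAX[7:0]   perfmon version ID
        "gp_counters": (eax >> 8) & 0xFF,  # EAX[15:8]  GP counters per logical processor
        "gp_width": (eax >> 16) & 0xFF,    # EAX[23:16] bit width of the GP counters
    }


def counter_msrs(index):
    """Return (counter MSR, event-select MSR) for general-purpose counter `index`."""
    return IA32_PMC0 + index, IA32_PERFEVTSEL0 + index


# Hypothetical register value: version 3, 8 counters, 48 bits wide.
example_eax = 0x07300803
info = decode_cpuid_0a_eax(example_eax)
print(info)

# Counters 4 and up only exist when the core is not shared by two logical processors.
for i in range(4, info["gp_counters"]):
    pmc, sel = counter_msrs(i)
    print(f"PMC{i}: counter MSR {pmc:#x}, event select MSR {sel:#x}")
```

With 8 counters reported, the loop lists PMC4-PMC7 at 0xc5-0xc8 and their event-select registers at 0x18a-0x18d, matching the addresses in the manual.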

However, if I try to use perf as follows (as root, with kernel.perf_event_paranoid set to -1), I get some strange results:

[root@ny4srv03 ~]# perf stat \
  -r 100 \
  -e cycles:u \
  -e instructions:u \
  -e branches:u \
  -e branch-misses:u \
  -e cache-references:u \
  -e cache-misses:u \
  -e faults:u \
  ls>/dev/null

 Performance counter stats for 'ls' (100 runs):

                 0      cycles:u
            668753      instructions:u                                                ( +-  0.01% )
            131991      branches:u                                                    ( +-  0.00% )
              6936      branch-misses:u           #    5.25% of all branches          ( +-  0.33% )
             11105      cache-references:u                                            ( +-  0.13% )
                 6      cache-misses:u            #    0.055 % of all cache refs      ( +-  5.86% )
               103      faults:u                                                      ( +-  0.19% )

        0.00100211 +- 0.00000487 seconds time elapsed  ( +-  0.49% )

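No matter how many times I run perf (note the -r 100 argument), it always shows cycles:u equal to 0 until I remove one of the branches:u, branch-misses:u, cache-references:u or cache-misses:u events. In that case perf works as expected: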

[root@ny4srv03 ~]# perf stat \
  -r 100 \
  -e cycles:u \
  -e instructions:u \
  -e branches:u \
  -e branch-misses:u \
  -e cache-references:u \
  -e faults:u \
  ls>/dev/null

 Performance counter stats for 'ls' (100 runs):

            614142      cycles:u                                                      ( +-  0.06% )
            668790      instructions:u            #    1.09  insn per cycle           ( +-  0.00% )
            132052      branches:u                                                    ( +-  0.00% )
              6874      branch-misses:u           #    5.21% of all branches          ( +-  0.11% )
             10735      cache-references:u                                            ( +-  0.05% )
               101      faults:u                                                      ( +-  0.06% )

        0.00095650 +- 0.00000108 seconds time elapsed  ( +-  0.11% )

perf also works as expected in these cases:

  1. When the cycles event is measured either with no modifier at all or with the :k modifier
[root@ny4srv03 ~]# perf stat \
  -r 100 \
  -e cycles \
  -e instructions:u \
  -e branches:u \
  -e branch-misses:u \
  -e cache-references:u \
  -e cache-misses:u \
  -e faults:u \
  ls>/dev/null

 Performance counter stats for 'ls' (100 runs):

           1841276      cycles                                                        ( +-  0.79% )
            668400      instructions:u                                                ( +-  0.00% )
            131966      branches:u                                                    ( +-  0.00% )
              6121      branch-misses:u           #    4.64% of all branches          ( +-  0.40% )
             10987      cache-references:u                                            ( +-  0.16% )
                 0      cache-misses:u            #    0.000 % of all cache refs
               102      faults:u                                                      ( +-  0.18% )

        0.00102359 +- 0.00000649 seconds time elapsed  ( +-  0.63% )

  2. When Hyper-Threading is enabled in the BIOS and disabled with the nosmt kernel parameter
[root@ny4srv03 ~]# perf stat \
  -r 100 \
  -e cycles:u \
  -e instructions:u \
  -e branches:u \
  -e branch-misses:u \
  -e cache-references:u \
  -e cache-misses:u \
  -e faults:u \
  ls>/dev/null

 Performance counter stats for 'ls' (100 runs):

            618443      cycles:u                                                      ( +-  0.39% )
            668466      instructions:u            #    1.05  insn per cycle           ( +-  0.00% )
            131968      branches:u                                                    ( +-  0.00% )
              6529      branch-misses:u           #    4.95% of all branches          ( +-  0.34% )
             11096      cache-references:u                                            ( +-  0.47% )
                 1      cache-misses:u            #    0.010 % of all cache refs      ( +- 53.16% )
               107      faults:u                                                      ( +-  0.18% )

        0.00097825 +- 0.00000554 seconds time elapsed  ( +-  0.57% )

In this case cpuid also shows that only 4 performance counters are available:

[root@ny4srv03 ~]# cpuid -1 | grep counters
      number of counters per logical processor = 0x4 (4)
      number of contiguous fixed counters      = 0x3 (3)
      bit width of fixed counters              = 0x30 (48)

So I wonder whether this is a bug in perf or some kind of system misconfiguration. Could you help?

Update 1

Running perf stat with -d shows that the NMI watchdog is enabled:

[root@ny4srv03 likwid]# perf stat \
   -e cycles:u \
   -e instructions:u \
   -e branches:u \
   -e branch-misses:u \
   -e cache-references:u \
   -e cache-misses:u \
   -e faults:u \
   -d \
   ls>/dev/null

 Performance counter stats for 'ls':

                 0      cycles:u
            709098      instructions:u
            140131      branches:u
              6826      branch-misses:u           #    4.87% of all branches
             11287      cache-references:u
                 0      cache-misses:u            #    0.000 % of all cache refs
               104      faults:u
            593753      L1-dcache-loads
             32677      L1-dcache-load-misses     #    5.50% of all L1-dcache accesses
              8679      LLC-loads
     <not counted>      LLC-load-misses                                               (0.00%)

       0.001102213 seconds time elapsed

       0.000000000 seconds user
       0.001134000 seconds sys


Some events weren't counted. Try disabling the NMI watchdog:
    echo 0 > /proc/sys/kernel/nmi_watchdog
    perf stat ...
    echo 1 > /proc/sys/kernel/nmi_watchdog

Disabling it gives the expected result:

echo 0 > /proc/sys/kernel/nmi_watchdog

[root@ny4srv03 likwid]# perf stat \
   -e cycles:u \
   -e instructions:u \
   -e branches:u \
   -e branch-misses:u \
   -e cache-references:u \
   -e cache-misses:u \
   -e faults:u \
   -d \
   ls>/dev/null

 Performance counter stats for 'ls':

            745760      cycles:u
            708833      instructions:u            #    0.95  insn per cycle
            140122      branches:u
              6757      branch-misses:u           #    4.82% of all branches
             11503      cache-references:u
                 0      cache-misses:u            #    0.000 % of all cache refs
               101      faults:u
            586223      L1-dcache-loads
             32856      L1-dcache-load-misses     #    5.60% of all L1-dcache accesses
              8794      LLC-loads
                29      LLC-load-misses           #    0.33% of all LL-cache accesses

       0.001000925 seconds time elapsed

       0.000000000 seconds user
       0.001080000 seconds sys
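
The disable/measure/re-enable sequence that perf suggests can be wrapped so the previous setting is always restored. This is only a sketch in Python: the helper name nmi_watchdog_disabled is my own, it needs root to write the real /proc file, and the perf invocation in the comment is just an example.

```python
import contextlib
from pathlib import Path

# Same file that the perf hint above writes to.
NMI_WATCHDOG = Path("/proc/sys/kernel/nmi_watchdog")


@contextlib.contextmanager
def nmi_watchdog_disabled(path=NMI_WATCHDOG):
    """Write 0 to the watchdog control file, then restore the previous value on exit."""
    previous = path.read_text().strip()
    path.write_text("0\n")
    try:
        yield previous
    finally:
        # Restore whatever was there before, even if the measurement failed.
        path.write_text(previous + "\n")


# Example use (as root); the command list is illustrative:
#
#   import subprocess
#   with nmi_watchdog_disabled():
#       subprocess.run(["perf", "stat", "-e", "cycles:u", "ls"], check=True)
```

The path parameter exists only so the helper can be tried against an ordinary file without touching the real setting.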

But that still doesn't explain why cycles:u is 0 when the NMI watchdog is enabled, even though dmesg shows:

[    0.300779] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.

Update 2

I found this nice comment in the likwid tool suite documentation:

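Please be aware that the counters PMC4-7 are broken on Intel Broadwell. They don't increment if either user- or kernel-level filtering is applied. User-level filtering is default in LIKWID, hence kernel-level filtering is added automatically for PMC4-7. The returned counts can be much higher.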

So that would explain the behavior. It would now be interesting to find the source of this information, and whether the OS is aware of it.

Recommended answer

It's erratum BDE104, combined with the NMI watchdog occupying a fixed counter, so cycles has to use a programmable counter.


From Intel's Xeon-D "specification update" (errata) document (I didn't find one for regular Xeon v4):

BDE104: General-Purpose Performance Monitoring Counters 4-7 Will Not Increment Do Not Count With USR Mode Only Filtering

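Problem: The IA32_PMC4-7 MSR (C5H-C8H) general-purpose performance monitoring counters will not count when the associated CPL filter selection in IA32_PERFEVTSELx MSR's (18AH-18DH) USR field (bit 16) is set while OS field (bit 17) is not set.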

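Implication: Software depending upon IA32_PMC4-7 to count only USR events will not operate as expected. Counting OS only events or OS and USR events together is unaffected by this erratum.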

Workaround: None identified.
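
To make the erratum's condition concrete, here is a small Python sketch that builds an IA32_PERFEVTSELx value and checks whether a given counter and filter combination falls under BDE104 as described above. The helper names are mine. The bit positions are the architectural ones (USR = bit 16, OS = bit 17, EN = bit 22), and event 0x3C with umask 0x00 is the unhalted core cycles event. Nothing here touches real MSRs.

```python
USR = 1 << 16  # count while running at CPL > 0 (user mode)
OS = 1 << 17   # count while running at CPL 0 (kernel mode)
EN = 1 << 22   # enable the counter


def perfevtsel(event, umask, user, kernel):
    """Build an IA32_PERFEVTSELx value for the given event and privilege filter."""
    value = (event & 0xFF) | ((umask & 0xFF) << 8) | EN
    if user:
        value |= USR
    if kernel:
        value |= OS
    return value


def hit_by_bde104(counter_index, evtsel):
    """True if the erratum text says this counter will not count with this filter."""
    usr_only = bool(evtsel & USR) and not (evtsel & OS)
    return 4 <= counter_index <= 7 and usr_only


cycles_u = perfevtsel(0x3C, 0x00, user=True, kernel=False)   # like cycles:u
cycles_k = perfevtsel(0x3C, 0x00, user=False, kernel=True)   # like cycles:k
cycles_uk = perfevtsel(0x3C, 0x00, user=True, kernel=True)   # like plain cycles

for name, sel in [("cycles:u", cycles_u), ("cycles:k", cycles_k), ("cycles", cycles_uk)]:
    for idx in (3, 4):
        print(f"{name} on PMC{idx}: evtsel={sel:#x} affected={hit_by_bde104(idx, sel)}")
```

Only the user-only filter on PMC4-7 is flagged, which lines up with the question's observation that cycles and cycles:k count normally while cycles:u reads 0.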


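The NMI watchdog takes up fixed counter #1, the one that can normally count the cycles event. This leads perf to pick a programmable counter for it, apparently picking one of the buggy ones.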

With the NMI watchdog disabled, perf uses fixed counter #1 for cycles. (It apparently supports user/kernel/both masking.)

I tested on my Skylake system, with Hyper-Threading enabled, so there are 4 programmable counters per logical core plus the fixed counters.

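  • NMI watchdog disabled: cycles + instructions + 4 other events - no multiplexing.
  • NMI watchdog disabled: cycles + instructions + 5 other events - multiplexing. (Numbers like (86.32%) appear in a new column on the right, indicating how much of the time that event was active; perf extrapolates from that fraction to the total time.)
  • NMI watchdog disabled: 5 events not including cycles or instructions - multiplexing. (Confirming that cycles and instructions use fixed counters.)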

That confirms the limit of 4 arbitrary events plus any of cycles and instructions. Compare with the NMI watchdog enabled:

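  • NMI watchdog enabled: 4 events not including cycles or instructions - no multiplexing, confirming that the NMI watchdog leaves all 4 programmable counters free.
  • NMI watchdog enabled: 4 events plus cycles - multiplexing, confirming that cycles now has to use a programmable counter, which implies that the NMI watchdog is using that fixed counter.
  • NMI watchdog enabled: cycles + instructions + 3 other events - no multiplexing, as we expect. Further confirmation that cycles becomes one of the events competing for programmable counters.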
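
The pattern in these experiments can be summarized with a deliberately simplified model. This is not how perf's scheduler is implemented; it is only a Python sketch of the counting argument. It assumes instructions always gets fixed counter #0, cycles gets fixed counter #1 only when the NMI watchdog is off, and every other event needs one of 4 programmable counters. The function name is hypothetical.

```python
GP_COUNTERS = 4  # programmable counters per logical core with Hyper-Threading enabled


def needs_multiplexing(events, nmi_watchdog):
    """Toy model: True if the events need more programmable counters than exist."""
    gp_needed = 0
    for ev in events:
        if ev == "instructions":
            continue  # assumed to always fit on fixed counter #0
        if ev == "cycles" and not nmi_watchdog:
            continue  # fixed counter #1 is free when the watchdog is off
        gp_needed += 1  # everything else competes for programmable counters
    return gp_needed > GP_COUNTERS


others = ["uops_issued.any", "uops_executed.thread", "idq.mite_uops", "idq.dsb_uops"]

print(needs_multiplexing(["cycles", "instructions"] + others, nmi_watchdog=False))
print(needs_multiplexing(["cycles"] + others, nmi_watchdog=True))
print(needs_multiplexing(["cycles", "instructions"] + others[:3], nmi_watchdog=True))
```

The three calls correspond to the "cycles + instructions + 4 other events" case with the watchdog off, the "4 events plus cycles" case with it on, and the "cycles + instructions + 3 other events" case with it on.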

These are all the same whether I use perf stat --all-user or cycles:u.

For example (with some horizontal whitespace removed for SO's narrow code blocks):

# with NMI watchdog enabled
$ taskset -c 0 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,idq.dsb_uops    -r1 ./a.out

 Performance counter stats for './a.out':

             40.74 msec task-clock            #    0.994 CPUs utilized          
                 0      context-switches      #    0.000 /sec                   
                 0      cpu-migrations        #    0.000 /sec                   
               119      page-faults           #    2.921 K/sec                  
       165,566,262      cycles                #    4.064 GHz      (61.39%)
       160,597,987      instructions          #    0.97  insn per cycle (83.46%)
       286,675,168      uops_issued.any       #    7.036 G/sec       (85.28%)
       286,258,415      uops_executed.thread  #    7.026 G/sec       (85.28%)
        76,619,024      idq.mite_uops         #    1.881 G/sec       (85.28%)
        77,238,565      idq.dsb_uops          #    1.896 G/sec       (82.77%)

       0.040990242 seconds time elapsed

       0.040912000 seconds user
       0.000000000 seconds sys
$ echo 0 | sudo tee  /proc/sys/kernel/nmi_watchdog
0
$ taskset -c 0 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,idq.dsb_uops    -r1 ./a.out

 Performance counter stats for './a.out':

             45.01 msec task-clock            #    0.992 CPUs utilized          
                 0      context-switches      #    0.000 /sec                   
                 0      cpu-migrations        #    0.000 /sec                   
               120      page-faults           #    2.666 K/sec                  
       177,494,136      cycles                #    3.943 GHz                    
       160,265,384      instructions          #    0.90  insn per cycle         
       287,253,352      uops_issued.any       #    6.382 G/sec                  
       286,705,343      uops_executed.thread  #    6.369 G/sec                  
        78,189,827      idq.mite_uops         #    1.737 G/sec                  
        75,911,530      idq.dsb_uops          #    1.686 G/sec                  

       0.045389998 seconds time elapsed

       0.045165000 seconds user
       0.000000000 seconds sys
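
The percentages in the first run show how long each event was actually on a counter. As a rough Python sketch of the extrapolation perf applies when multiplexing (the function names are mine; time_enabled and time_running are the two values the kernel can report alongside each count):

```python
def scale_count(raw, time_enabled, time_running):
    """Extrapolate a multiplexed count to the full time the event was enabled."""
    if time_running == 0:
        return None  # never scheduled; perf prints "<not counted>"
    return round(raw * time_enabled / time_running)


def active_percent(time_enabled, time_running):
    """Share of the enabled time the event was on a counter (the right-hand column)."""
    return 100.0 * time_running / time_enabled


# Made-up numbers: the event was on a counter for a quarter of the run.
print(scale_count(500, time_enabled=1000, time_running=250))
print(active_percent(time_enabled=1000, time_running=250))
```

An event that was never scheduled has nothing to extrapolate from, which is the `<not counted>` (0.00%) case seen for LLC-load-misses in the question.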

According to https://perfmon-events.intel.com/broadwell_server.html, there is a third fixed counter, for CPU_CLK_UNHALTED.REF_TSC. So it is separate from the counters that count INST_RETIRED.ANY (counter #0) or CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.THREAD_ANY (counter #1).

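REF_TSC counts at a fixed frequency, not core clock cycles; it might be better if the NMI watchdog could use that counter instead, since I expect it's much less widely used. On Intel CPUs the cycles event is CPU_CLK_UNHALTED.THREAD, which counts core clock cycles while this logical core is active. perf counts it by default.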
