Linux: Why does perf stat not count cycles:u on a Broadwell CPU with hyperthreading disabled in BIOS?

Posted on March 2

Given: a Broadwell CPU with hyperthreading disabled in BIOS

[root@ny4srv03 ~]# lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  44
  On-line CPU(s) list:   0-43
Vendor ID:               GenuineIntel
  BIOS Vendor ID:        Intel
  Model name:            Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz
    BIOS Model name:     Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz
    CPU family:          6
    Model:               79
    Thread(s) per core:  1
    Core(s) per socket:  22
    Socket(s):           2
    Stepping:            1
    CPU max MHz:         3700.0000
    CPU min MHz:         1200.0000
    BogoMIPS:            4399.69
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aper
                         fmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_
                         l3 invpcid_single intel_ppin tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local d
                         therm ida arat pln pts
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   1.4 MiB (44 instances)
  L1i:                   1.4 MiB (44 instances)
  L2:                    11 MiB (44 instances)
  L3:                    110 MiB (2 instances)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-21
  NUMA node1 CPU(s):     22-43
Vulnerabilities:
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX vulnerable, SMT disabled
  Mds:                   Vulnerable; SMT disabled
  Meltdown:              Vulnerable
  Mmio stale data:       Vulnerable
  Retbleed:              Not affected
  Spec store bypass:     Vulnerable
  Spectre v1:            Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
  Spectre v2:            Vulnerable, STIBP: disabled, PBRSB-eIBRS: Not affected
  Srbds:                 Not affected
  Tsx async abort:       Vulnerable

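According to the Intel 64 and IA-32 Architectures Software Developer's Manual:

If a processor core is shared by two logical processors, each logical processor can access up to four counters (IA32_PMC0-IA32_PMC3). This is the same as in the prior generation for processors based on Nehalem microarchitecture. If a processor core is not shared by two logical processors, up to eight general-purpose counters are visible. If CPUID.0AH:EAX[15:8] reports 8 counters, then IA32_PMC4-IA32_PMC7 would cover MSR addresses 0C5H through 0C8H. Each counter is accompanied by an event select MSR (IA32_PERFEVTSEL4-IA32_PERFEVTSEL7).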
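As a rough illustration of the CPUID.0AH:EAX layout the manual refers to, the sketch below decodes the low three byte fields of a raw EAX value in plain shell arithmetic. The function name decode_cpuid_0a_eax is hypothetical, and the sample value used in the usage note is made up to show 8 counters; it was not read from the machine in the question.

```shell
# Decode CPUID leaf 0AH, register EAX (architectural performance monitoring).
# Field layout:
#   bits  7:0   version ID
#   bits 15:8   number of general-purpose counters per logical processor
#   bits 23:16  bit width of the general-purpose counters
# decode_cpuid_0a_eax is a hypothetical helper name, not part of any tool.
decode_cpuid_0a_eax() {
    dc_eax=$(( $1 ))    # accepts decimal or 0x-prefixed hex
    printf 'version=%d gp_counters=%d gp_width=%d\n' \
        $(( dc_eax & 0xff )) \
        $(( (dc_eax >> 8) & 0xff )) \
        $(( (dc_eax >> 16) & 0xff ))
}
```

For example, `decode_cpuid_0a_eax 0x07300803` (an illustrative value) prints `version=3 gp_counters=8 gp_width=48`.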

So there should be 8 accessible performance counters, and cpuid reports exactly that:

[root@ny4srv03 ~]# cpuid -1 | grep counters
      number of counters per logical processor = 0x8 (8)
      number of contiguous fixed counters      = 0x3 (3)
      bit width of fixed counters              = 0x30 (48)

However, if I try to use perf as follows (under the root account, with kernel.perf_event_paranoid set to -1), I get some strange results:

[root@ny4srv03 ~]# perf stat \
  -r 100 \
  -e cycles:u \
  -e instructions:u \
  -e branches:u \
  -e branch-misses:u \
  -e cache-references:u \
  -e cache-misses:u \
  -e faults:u \
  ls>/dev/null

 Performance counter stats for 'ls' (100 runs):

                 0      cycles:u
            668753      instructions:u                                                ( +-  0.01% )
            131991      branches:u                                                    ( +-  0.00% )
              6936      branch-misses:u           #    5.25% of all branches          ( +-  0.33% )
             11105      cache-references:u                                            ( +-  0.13% )
                 6      cache-misses:u            #    0.055 % of all cache refs      ( +-  5.86% )
               103      faults:u                                                      ( +-  0.19% )

        0.00100211 +- 0.00000487 seconds time elapsed  ( +-  0.49% )

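No matter how many times I run perf (note the -r 100 parameter), it always shows cycles:u equal to 0 until I remove one of the branches:u, branch-misses:u, cache-references:u, or cache-misses:u events. In that case perf works as expected: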

[root@ny4srv03 ~]# perf stat \
  -r 100 \
  -e cycles:u \
  -e instructions:u \
  -e branches:u \
  -e branch-misses:u \
  -e cache-references:u \
  -e faults:u \
  ls>/dev/null

 Performance counter stats for 'ls' (100 runs):

            614142      cycles:u                                                      ( +-  0.06% )
            668790      instructions:u            #    1.09  insn per cycle           ( +-  0.00% )
            132052      branches:u                                                    ( +-  0.00% )
              6874      branch-misses:u           #    5.21% of all branches          ( +-  0.11% )
             10735      cache-references:u                                            ( +-  0.05% )
               101      faults:u                                                      ( +-  0.06% )

        0.00095650 +- 0.00000108 seconds time elapsed  ( +-  0.11% )

perf also works as expected in these cases:

When the cycles event is measured either with no modifier at all or with the :k modifier:

[root@ny4srv03 ~]# perf stat \
  -r 100 \
  -e cycles \
  -e instructions:u \
  -e branches:u \
  -e branch-misses:u \
  -e cache-references:u \
  -e cache-misses:u \
  -e faults:u \
  ls>/dev/null

 Performance counter stats for 'ls' (100 runs):

           1841276      cycles                                                        ( +-  0.79% )
            668400      instructions:u                                                ( +-  0.00% )
            131966      branches:u                                                    ( +-  0.00% )
              6121      branch-misses:u           #    4.64% of all branches          ( +-  0.40% )
             10987      cache-references:u                                            ( +-  0.16% )
                 0      cache-misses:u            #    0.000 % of all cache refs
               102      faults:u                                                      ( +-  0.18% )

        0.00102359 +- 0.00000649 seconds time elapsed  ( +-  0.63% )

When hyperthreading is enabled in BIOS and disabled with the nosmt kernel parameter:

[root@ny4srv03 ~]# perf stat \
  -r 100 \
  -e cycles:u \
  -e instructions:u \
  -e branches:u \
  -e branch-misses:u \
  -e cache-references:u \
  -e cache-misses:u \
  -e faults:u \
  ls>/dev/null

 Performance counter stats for 'ls' (100 runs):

            618443      cycles:u                                                      ( +-  0.39% )
            668466      instructions:u            #    1.05  insn per cycle           ( +-  0.00% )
            131968      branches:u                                                    ( +-  0.00% )
              6529      branch-misses:u           #    4.95% of all branches          ( +-  0.34% )
             11096      cache-references:u                                            ( +-  0.47% )
                 1      cache-misses:u            #    0.010 % of all cache refs      ( +- 53.16% )
               107      faults:u                                                      ( +-  0.18% )

        0.00097825 +- 0.00000554 seconds time elapsed  ( +-  0.57% )

In this case cpuid also shows that only 4 performance counters are available:

[root@ny4srv03 ~]# cpuid -1 | grep counters
      number of counters per logical processor = 0x4 (4)
      number of contiguous fixed counters      = 0x3 (3)
      bit width of fixed counters              = 0x30 (48)

So I wonder whether there is a bug in perf or some kind of system misconfiguration. Could you please help?

Update 1

Running perf stat with -d shows that the NMI watchdog is enabled:

[root@ny4srv03 likwid]# perf stat \
   -e cycles:u \
   -e instructions:u \
   -e branches:u \
   -e branch-misses:u \
   -e cache-references:u \
   -e cache-misses:u \
   -e faults:u \
   -d \
   ls>/dev/null

 Performance counter stats for 'ls':

                 0      cycles:u
            709098      instructions:u
            140131      branches:u
              6826      branch-misses:u           #    4.87% of all branches
             11287      cache-references:u
                 0      cache-misses:u            #    0.000 % of all cache refs
               104      faults:u
            593753      L1-dcache-loads
             32677      L1-dcache-load-misses     #    5.50% of all L1-dcache accesses
              8679      LLC-loads
     <not counted>      LLC-load-misses                                               (0.00%)

       0.001102213 seconds time elapsed

       0.000000000 seconds user
       0.001134000 seconds sys


Some events weren't counted. Try disabling the NMI watchdog:
    echo 0 > /proc/sys/kernel/nmi_watchdog
    perf stat ...
    echo 1 > /proc/sys/kernel/nmi_watchdog

Disabling it helps to get the expected result:

echo 0 > /proc/sys/kernel/nmi_watchdog

[root@ny4srv03 likwid]# perf stat \
   -e cycles:u \
   -e instructions:u \
   -e branches:u \
   -e branch-misses:u \
   -e cache-references:u \
   -e cache-misses:u \
   -e faults:u \
   -d \
   ls>/dev/null

 Performance counter stats for 'ls':

            745760      cycles:u
            708833      instructions:u            #    0.95  insn per cycle
            140122      branches:u
              6757      branch-misses:u           #    4.82% of all branches
             11503      cache-references:u
                 0      cache-misses:u            #    0.000 % of all cache refs
               101      faults:u
            586223      L1-dcache-loads
             32856      L1-dcache-load-misses     #    5.60% of all L1-dcache accesses
              8794      LLC-loads
                29      LLC-load-misses           #    0.33% of all LL-cache accesses

       0.001000925 seconds time elapsed

       0.000000000 seconds user
       0.001080000 seconds sys
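The disable, measure, re-enable sequence that perf suggests can be wrapped in a small helper so the previous watchdog setting is restored afterwards. This is only a sketch: with_nmi_watchdog_off and the NMI_WATCHDOG_FILE variable are hypothetical names introduced here, and writing to the real file requires root.

```shell
# Path of the watchdog control file. NMI_WATCHDOG_FILE is a hypothetical
# override (not a kernel interface) so the helper can be tried on a scratch file.
NMI_WATCHDOG_FILE="${NMI_WATCHDOG_FILE:-/proc/sys/kernel/nmi_watchdog}"

# Run a command with the NMI watchdog disabled, then restore the old setting.
with_nmi_watchdog_off() {
    wd_saved=$(cat "$NMI_WATCHDOG_FILE") || return 1   # remember current value
    echo 0 > "$NMI_WATCHDOG_FILE" || return 1          # disable the watchdog
    "$@"                                               # run the wrapped command
    wd_status=$?
    echo "$wd_saved" > "$NMI_WATCHDOG_FILE"            # restore previous value
    return "$wd_status"
}
```

It could be used as `with_nmi_watchdog_off perf stat -e cycles:u -d ls >/dev/null`. The helper returns the exit status of the wrapped command.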

But it still does not explain why cycles:u is 0 while nmi_watchdog is enabled, even though dmesg shows:

[    0.300779] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.

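Update 2

I found this nice comment in the likwid toolsuite materials:

Please be aware that the counters PMC4-7 are broken on Intel Broadwell. They don't increment if either user- or kernel-level filtering is applied. User-level filtering is default in LIKWID, hence kernel-level filtering is added automatically for PMC4-7. The returned counts can be much higher.

That could explain the behavior, so now it would be interesting to find the origin of this information, if that is indeed the case.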

Recommended answer

BDE104: General-Purpose Performance Monitoring Counters 4-7 Will Not Increment Do Not Count With USR Mode Only Filtering
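Given this erratum, one possible workaround is to keep each perf run small enough that nothing needs to be scheduled on PMC4-7. This is an assumption based on the observation in the question that dropping one event restores cycles:u; the answer itself does not state it. The sketch below splits an event list into batches of at most N events. The helper name batch_events is hypothetical, and it conservatively treats every event as if it needed a general-purpose counter, even though some events may use fixed counters or be software events.

```shell
# Print the given events as comma-separated batches, at most $1 per line.
# batch_events is a hypothetical helper, not part of perf.
batch_events() {
    be_max=$1
    shift
    be_line=""
    be_n=0
    for be_ev in "$@"; do
        if [ "$be_n" -eq "$be_max" ]; then
            printf '%s\n' "$be_line"    # current batch is full, emit it
            be_line=""
            be_n=0
        fi
        be_line="${be_line:+$be_line,}$be_ev"
        be_n=$((be_n + 1))
    done
    if [ -n "$be_line" ]; then
        printf '%s\n' "$be_line"        # emit the final, possibly short batch
    fi
    return 0
}
```

Each output line can then be passed to a separate run, for instance `perf stat -r 100 -e "$batch" ls >/dev/null`, where `$batch` holds one line of the output. Counts from separate runs come from separate executions of the workload, so they are not strictly comparable with each other.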
