Why is Rust's default sort slightly slower than my selection sort for small arrays?

Posted on December 4

I'm still new to Rust, so I may be missing something simple. I'm using Rust 1.70.0 nightly. Here is the relevant code:

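#![feature(test)] // #[bench] is unstable and needs a nightly toolchain
extern crate test;
use test::Bencher;

fn selection_sort(original_vec: &mut Vec<i32>) -> Vec<i32> {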
    for i in 0..original_vec.len()-1 {
        let mut smallest: usize = i;
        for j in i+1..original_vec.len() {
            if original_vec[j] < original_vec[smallest] {
                smallest = j;
            }
        }
        if smallest != i {
            original_vec.swap(i, smallest);
        }
    };
    original_vec.to_vec()
}

// helper function for testing (uses builtin sort function)
fn rust_sort<A, T>(mut array: A) -> A
where
    A: AsMut<[T]>,
    T: Ord,
{
    let slice = array.as_mut();
    slice.sort();

    array
}

const TEST_VECS: [[i32; 10]; 6] = [
    [1, 3, 2, 9, 6, 7, 4, 10, 8, 5],
    [1, 2, 7, 2, 9, 9, 7, 10, 2, 1],
    [0, 4, 1, 3, 9, 12, 3, 0, 13, 8],
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [-1, -5, -10, 1, 10, 2, -123, -34, 0, -32],
    [i32::MAX, -32984, i32::MIN, 31648, 5349857, -30954, -2343285, 0, 1, i32::MIN],
];

#[bench]
fn bench_rust_sort(b: &mut Bencher) {
    b.iter(|| {
        for i in TEST_VECS {
            (&mut i.to_vec()).sort()
        }
    })
}

#[bench]
fn bench_selection_sort(b: &mut Bencher) {
    b.iter(|| {
        for i in TEST_VECS {
            selection_sort(&mut i.to_vec());
        }
    })
}

When I run cargo bench:

$ cargo bench
  Compiling rust-algs v0.1.0 (/home/josia/projects/rust-algs)
    Finished bench [optimized] target(s) in 0.25s
     Running unittests src/lib.rs (target/release/deps/rustalgs-ae260c07593c3aad)

running 3 tests
test test_selection_sort ... ignored
test bench_rust_sort      ... bench:         106 ns/iter (+/- 8)
test bench_selection_sort ... bench:         102 ns/iter (+/- 9)

test result: ok. 0 passed; 0 failed; 1 ignored; 2 measured; 0 filtered out; finished in 7.65s

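I have tried this many times, and even renamed the test functions to change the order in which they run. No matter what, my custom selection sort still comes out faster. My guess is that the problem is that I have to call a function that wraps the default sort. Calling the sort directly on shared data will not work: even if I clone the TEST_VECS constant into a vector in the bench function, the first iteration sorts it, so later bench iterations would only see already-sorted input. And if I clone the constant inside the benching closure, the clone becomes part of every measured iteration, so I can no longer benchmark only the code I care about.

Is something wrong with the way I'm benchmarking these functions, or is my custom function simply faster?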

# rustc nightly -C opt-level=3 for x86-64
# inside the main repeat loop in bench_selection_sort
  ...
        mov     edi, 40
        mov     esi, 4
        call    rbp                             # call alloc
        test    rax, rax
        je      .LBB0_4                         # check for null
  # pointer to a slice of TEST_VECS in R15
  # pointer to the newly-allocated Vec<i32> in RAX
        movups  xmm0, xmmword ptr [r15]
        movups  xmm1, xmmword ptr [r15 + 16]
        movups  xmmword ptr [rax], xmm0         # with -C target-cpu=x86-64-v3
        movups  xmmword ptr [rax + 16], xmm1    # just one YMM 32-byte copy is needed, vs. 2x 16-byte with baseline SSE2
        mov     rdx, qword ptr [r15 + 32]
        mov     qword ptr [rax + 32], rdx       # copy last 8 of the 40 bytes
  # and now the sorting
        mov     ecx, dword ptr [rax]
        mov     edi, dword ptr [rax + 8]
        xor     esi, esi
        cmp     dword ptr [rax + 4], ecx
        setl    sil                             # 0 or 1 according to vec[1] < vec[0]
        mov     r9d, 2
        cmp     edi, dword ptr [rax + 4*rsi]    # compare at a data-dependent address; surprising for Selection Sort
        jl      .LBB0_9
        mov     r9, rsi                         # strange that this isn't a CMOV; the only branch to .LBB0_9 is the preceding line
.LBB0_9:
        mov     esi, dword ptr [rax + 28]       # load several elements
        mov     edi, dword ptr [rax + 24]       # that it's going to cmp/branch on later
        mov     r10d, dword ptr [rax + 16]
        mov     r8d, dword ptr [rax + 20]
        mov     ebx, dword ptr [rax + 12]
        mov     r11d, 3
        cmp     ebx, dword ptr [rax + 4*r9]
        jge     .LBB0_10
  ...
  over 400 more instructions before the bottom of the loop, mostly mov / cmp/jcc

use core::hint::black_box;

// I don't know how to get it to show asm for the #[bench] version
pub fn xbench_selection_sort(a: i32) {
    // allocate correct amount of storage once, and redundantly copy into it because I don't know Rust very well
    let mut test_vec = TEST_VECS[0].to_vec();
    for _b in 0..a { // just a counted repeat loop
        for i in TEST_VECS {
            //let mut test_vec = i.to_vec(); // would alloc/free inside the loop
            test_vec.clone_from_slice(&i); // copy into already-allocated space
            black_box(&mut test_vec.as_slice()); // force the data (but not the pointers and size) to exist in memory
            selection_sort(&mut test_vec);
            black_box(&mut test_vec.as_slice()); // force the result data to exist. Apparently costs at least 1 extra asm instruction somewhere, unfortunately.
            // actually this does store the pointers+length to the stack, not avoiding it like I hoped.
            //black_box(&mut test_vec[4]); // or just ask for one element; often enough to stop a compiler from optimizing away a whole loop.
        }
    }
}


Recommended answer

`selection_sort` inlined and fully unrolled, `.sort` didn't
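
One way to probe how much of the difference comes from inlining and unrolling is to put both sorts behind `#[inline(never)]` wrappers that take a slice, so the call site no longer hands the optimizer a fixed length of 10. This is only a sketch: the wrapper names `selection_sort_opaque` and `std_sort_opaque` are hypothetical and not part of the question, and the optimizer may still specialize the code in other ways.

```rust
// Selection sort on a slice. Behind #[inline(never)] the length is a runtime
// value inside this function, which makes full unrolling for N = 10 less likely.
#[inline(never)]
pub fn selection_sort_opaque(v: &mut [i32]) {
    let n = v.len();
    // saturating_sub keeps the range empty (instead of underflowing) when n == 0
    for i in 0..n.saturating_sub(1) {
        let mut smallest = i;
        for j in i + 1..n {
            if v[j] < v[smallest] {
                smallest = j;
            }
        }
        if smallest != i {
            v.swap(i, smallest);
        }
    }
}

// The standard-library sort behind the same kind of wrapper, so both
// candidates pay one non-inlined call.
#[inline(never)]
pub fn std_sort_opaque(v: &mut [i32]) {
    v.sort();
}

fn main() {
    let mut a = [1, 3, 2, 9, 6, 7, 4, 10, 8, 5];
    let mut b = a; // arrays of i32 are Copy, so this is an independent copy
    selection_sort_opaque(&mut a);
    std_sort_opaque(&mut b);
    assert_eq!(a, b);
    println!("{:?}", a);
}
```

Benchmarking these two wrappers instead of the original functions would put both on the same footing with respect to the call overhead.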

Test data

O(N^2) sorts like selection sort are slower on larger arrays
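
To make the scaling concrete without relying on timings, the sketch below counts element comparisons. Selection sort always performs N(N-1)/2 of them (45 for N = 10, 499,500 for N = 1000), while the standard sort grows roughly as N log N. The helper names and the small LCG used to generate input are assumptions for illustration, not part of the question.

```rust
/// Selection sort that returns how many element comparisons it performed.
fn selection_sort_counting(v: &mut [i32]) -> u64 {
    let mut comparisons = 0u64;
    let n = v.len();
    for i in 0..n.saturating_sub(1) {
        let mut smallest = i;
        for j in i + 1..n {
            comparisons += 1;
            if v[j] < v[smallest] {
                smallest = j;
            }
        }
        v.swap(i, smallest);
    }
    comparisons
}

/// Standard-library sort, counting calls to the comparator.
fn std_sort_counting(v: &mut [i32]) -> u64 {
    let mut comparisons = 0u64;
    v.sort_by(|a, b| {
        comparisons += 1;
        a.cmp(b)
    });
    comparisons
}

/// Deterministic pseudo-random data from a simple LCG (illustrative only).
fn pseudo_random(n: usize) -> Vec<i32> {
    let mut state: u32 = 12345;
    (0..n)
        .map(|_| {
            state = state.wrapping_mul(1664525).wrapping_add(1013904223);
            (state >> 8) as i32
        })
        .collect()
}

fn main() {
    for n in [10usize, 100, 1000] {
        let data = pseudo_random(n);
        let sel = selection_sort_counting(&mut data.clone());
        let std = std_sort_counting(&mut data.clone());
        println!("n = {n:5}: selection = {sel:7}, std sort = {std:7}");
    }
}
```

At N = 10 the two counts are of the same order, which is consistent with a simple, fully unrolled loop being competitive on inputs this small. The gap widens quickly as N grows.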

`black_box`
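
`std::hint::black_box` (stable since Rust 1.66) is an identity function that the compiler is asked to treat as opaque, on a best-effort basis. A minimal sketch of how it can be used so that constant input is less likely to be folded at compile time and the sorted result is less likely to be discarded; the function name `sort_once` is hypothetical:

```rust
use std::hint::black_box;

/// Sorts one copy of `input`, passing both the input and the result through
/// black_box so the optimizer has less room to precompute or drop the work.
fn sort_once(input: &[i32; 10]) -> [i32; 10] {
    let mut data = *black_box(input); // treat the input as unknown at compile time
    data.sort();
    black_box(data) // treat the result as used
}

fn main() {
    let sorted = sort_once(&[1, 3, 2, 9, 6, 7, 4, 10, 8, 5]);
    println!("{:?}", sorted);
}
```

Because `black_box` returns its argument unchanged, wrapping a value in it never changes the program's result, only what the optimizer is allowed to assume.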

Avoid alloc/dealloc
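
Calling `to_vec()` inside the measured closure allocates and frees a `Vec` on every iteration, and for 10 elements that can cost about as much as the sort itself. A rough sketch of a harness that allocates one buffer up front and overwrites it in place is shown below. It uses `std::time::Instant` so that it runs on stable Rust, and the names `INPUTS`, `time_sort` and `std_sort` are hypothetical.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Two of the question's inputs, under a hypothetical name.
const INPUTS: [[i32; 10]; 2] = [
    [1, 3, 2, 9, 6, 7, 4, 10, 8, 5],
    [-1, -5, -10, 1, 10, 2, -123, -34, 0, -32],
];

/// Runs `sort` on a fresh copy of every input `reps` times, reusing one
/// buffer so no allocation or deallocation happens inside the timed loop.
/// Returns the elapsed time and the buffer contents after the last sort.
fn time_sort(reps: u32, sort: fn(&mut [i32])) -> (Duration, Vec<i32>) {
    let mut buf = vec![0i32; 10]; // allocated once, outside the timed region
    let start = Instant::now();
    for _ in 0..reps {
        for input in &INPUTS {
            // Overwrite the existing storage instead of building a new Vec.
            buf.copy_from_slice(black_box(&input[..]));
            sort(buf.as_mut_slice());
            black_box(buf.as_slice()); // keep the sorted data observable
        }
    }
    (start.elapsed(), buf)
}

fn std_sort(v: &mut [i32]) {
    v.sort();
}

fn main() {
    let (elapsed, last) = time_sort(100_000, std_sort);
    println!("std sort: {:?} total, last result {:?}", elapsed, last);
}
```

Passing the sort as a function pointer is a design choice for brevity. It may also keep the sort from being inlined into the loop, so the absolute numbers are not directly comparable with the `#[bench]` results above.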
