Zero-cost abstractions,看Introduction to rust: a low-level language with high-level abstractions,我试图比较两种计算向量点积的方法:一种使用for循环,另一种使用迭代器.

#![feature(test)]

extern crate rand;
extern crate test;

use std::cmp::min;

fn dot_product_1(x: &[f64], y: &[f64]) -> f64 {
    let mut result: f64 = 0.0;
    for i in 0..min(x.len(), y.len()) {
        result += x[i] * y[i];
    }
    return result;
}

fn dot_product_2(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(&a, &b)| a * b).sum::<f64>()
}

#[cfg(test)]
mod bench {
    use test::Bencher;
    use rand::{Rng,thread_rng};
    use super::*;

    const LEN: usize = 30;

    #[test]
    fn test_1() {
        let x = [1.0, 2.0, 3.0];
        let y = [2.0, 4.0, 6.0];
        let result = dot_product_1(&x, &y);
        assert_eq!(result, 28.0);
    }

    #[test]
    fn test_2() {
        let x = [1.0, 2.0, 3.0];
        let y = [2.0, 4.0, 6.0];
        let result = dot_product_2(&x, &y);
        assert_eq!(result, 28.0);
    }

    fn rand_array(cnt: u32) -> Vec<f64> {
        let mut rng = thread_rng();
        (0..cnt).map(|_| rng.gen::<f64>()).collect()

    }

    #[bench]
    fn bench_small_1(b: &mut Bencher) {
        let samples = rand_array(2*LEN as u32);
        b.iter(|| {
            dot_product_1(&samples[0..LEN], &samples[LEN..2*LEN])
        })
    }

    #[bench]
    fn bench_small_2(b: &mut Bencher) {
        let samples = rand_array(2*LEN as u32);
        b.iter(|| {
            dot_product_2(&samples[0..LEN], &samples[LEN..2*LEN])
        })
    }
}

上面后面的链接声称,带有迭代器的版本应该具有类似的性能,"而且实际上要快一点".然而,在对两者进行基准测试时,我得到了非常不同的结果:

running 2 tests
test bench::bench_small_loop ... bench:          20 ns/iter (+/- 1)
test bench::bench_small_iter ... bench:          24 ns/iter (+/- 2)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out

那么,"零成本抽象"go 了哪里?

Update:添加@wimh提供的foldr个示例,并使用split_at代替切片,得到以下结果.

running 3 tests
test bench::bench_small_fold ... bench:          18 ns/iter (+/- 1)
test bench::bench_small_iter ... bench:          21 ns/iter (+/- 1)
test bench::bench_small_loop ... bench:          24 ns/iter (+/- 1)

test result: ok. 0 passed; 0 failed; 0 ignored; 3 measured; 0 filtered out

因此,额外的时间似乎直接或间接地来自于在测量代码中构造切片.为了验证情况是否属实,我try 了以下两种方法,得到了相同的结果(这里显示的是foldr个 case ,使用map+sum):

#[bench]
fn bench_small_iter(b: &mut Bencher) {
    let samples = rand_array(2 * LEN);
    let s0 = &samples[0..LEN];
    let s1 = &samples[LEN..2 * LEN];
    b.iter(|| dot_product_iter(s0, s1))
}

#[bench]
fn bench_small_fold(b: &mut Bencher) {
    let samples = rand_array(2 * LEN);
    let (s0, s1) = samples.split_at(LEN);
    b.iter(|| dot_product_fold(s0, s1))
}

推荐答案

对我来说似乎是零成本.我用更惯用的方式编写代码,在两个测试中使用相同的随机值,然后进行多次测试:

fn dot_product_1(x: &[f64], y: &[f64]) -> f64 {
    let mut result: f64 = 0.0;
    for i in 0..min(x.len(), y.len()) {
        result += x[i] * y[i];
    }
    result
}

fn dot_product_2(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(&a, &b)| a * b).sum()
}
fn rand_array(cnt: usize) -> Vec<f64> {
    let mut rng = rand::rngs::StdRng::seed_from_u64(42);
    rng.sample_iter(&rand::distributions::Standard).take(cnt).collect()
}

#[bench]
fn bench_small_1(b: &mut Bencher) {
    let samples = rand_array(2 * LEN);
    let (s0, s1) = samples.split_at(LEN);
    b.iter(|| dot_product_1(s0, s1))
}

#[bench]
fn bench_small_2(b: &mut Bencher) {
    let samples = rand_array(2 * LEN);
    let (s0, s1) = samples.split_at(LEN);
    b.iter(|| dot_product_2(s0, s1))
}
bench_small_1   20 ns/iter (+/- 6)
bench_small_2   17 ns/iter (+/- 1)

bench_small_1   19 ns/iter (+/- 3)
bench_small_2   17 ns/iter (+/- 2)

bench_small_1   19 ns/iter (+/- 5)
bench_small_2   17 ns/iter (+/- 3)

bench_small_1   19 ns/iter (+/- 3)
bench_small_2   24 ns/iter (+/- 7)

bench_small_1   28 ns/iter (+/- 1)
bench_small_2   24 ns/iter (+/- 1)

bench_small_1   27 ns/iter (+/- 1)
bench_small_2   25 ns/iter (+/- 1)

bench_small_1   28 ns/iter (+/- 1)
bench_small_2   25 ns/iter (+/- 1)

bench_small_1   28 ns/iter (+/- 1)
bench_small_2   25 ns/iter (+/- 1)

bench_small_1   28 ns/iter (+/- 0)
bench_small_2   25 ns/iter (+/- 1)

bench_small_1   28 ns/iter (+/- 1)
bench_small_2   17 ns/iter (+/- 1)

在10次运行中有9次,惯用代码比for循环快.这是在2.9 GHz内核i9(i9-8950HK)上完成的,带有32 GB RAM,用rustc 1.31.0-nightly (fc403ad98 2018-09-30)字节编译.

Rust相关问答推荐

如何将元素添加到向量并返回对该元素的引用?

亚性状上位性状上的 rust 病伴生型界限

如何提高自定义迭代器的`extend`性能

在Rust中有没有办法在没有UB的情况下在指针和U64之间进行转换?

这是不是在不造成嵌套的情况下从枚举中取出想要的变体的惯用方法?

在铁 rust 中传递所有权

将PathBuf转换为字符串

如何创建一个可变的嵌套迭代器?

如何在嵌套的泛型 struct 中调用泛型方法?

无法将 rust 蚀向量附加到另一个向量

如何在Rust中基于字符串 Select struct ?

如何轮询 Pin>?

Rust与_有何区别?

一旦令牌作为文字使用,声明宏不匹配硬编码值?

为什么要这样编译?

使用自定义 struct 收集 Vec

如何在 Rust 中将 bson::Bson 转换为 Vec

是否可以预测堆栈溢出?

从函数返回 u32 的数组/切片

如何将切片推入数组?