当我编译 Rust 代码时，我是否缺少 AVX512 的目标功能

发布于09月02日

我已经编写了一些Rust函数，它们使用AVX2和AVX512指令来加速图像合成.我使用的是AMD 7950x CPU.

当我跑RUSTFLAGS="-C target-cpu=native" cargo bench分时，我得到:

test overlay_using_avx2   ... bench:     483,596 ns/iter (+/- 10,006)
test overlay_using_avx512 ... bench:     317,818 ns/iter (+/- 729)

然而，我想在一台机器上构建可执行文件，然后在另一台机器上运行它.因此，我显式地启用我的代码需要的特性，并在运行时判断它们是否存在.然而，当我这样做时，AVX512基准测试运行得更慢，我不明白为什么.我在奔跑:

RUSTFLAGS="-C target-feature=+avx2,+avx,+sse2,+avx512f,+avx512bw" cargo bench:

test overlay_using_avx2   ... bench:     490,664 ns/iter (+/- 13,172)
test overlay_using_avx512 ... bench:   1,519,720 ns/iter (+/- 38,608)

我需要启用rustc --print target-features列表中的其他功能(S)吗？是否可以查看通过设置target-cpu=native启用了哪些功能？

我的基准代码如下，每晚运行:

#![feature(stdsimd)]
#![feature(test)]

use std::arch::x86_64::*;

unsafe fn overlay_chunk_avx2(this_chunk: &mut [u8], image_chunk: &[u8], c1: __m256i, c2: __m256i) {
    let this_ptr = this_chunk.as_mut_ptr() as *mut __m128i;
    let image_ptr = image_chunk.as_ptr() as *const __m128i;

    let this_argb = _mm_loadu_si128(this_ptr);
    let image_argb = _mm_loadu_si128(image_ptr);

    let this_u16 = _mm256_cvtepu8_epi16(this_argb);
    let image_u16 = _mm256_cvtepu8_epi16(image_argb);

    let image_alpha = _mm256_shuffle_epi8(image_u16, c1);
    let image_inv_alpha = _mm256_sub_epi8(c2, image_alpha);

    let this_blended = _mm256_mullo_epi16(this_u16, image_inv_alpha);
    let image_blended = _mm256_mullo_epi16(image_u16, image_alpha);

    let blended = _mm256_add_epi16(this_blended, image_blended);
    let divided = _mm256_srli_epi16(blended, 8);

    let lo_lane = _mm256_castsi256_si128(divided);
    let hi_lane = _mm256_extracti128_si256(divided, 1);

    let divided_u8 = _mm_packus_epi16(lo_lane, hi_lane);

    _mm_storeu_si128(this_ptr, divided_u8);
}

unsafe fn overlay_chunk_avx512(this_chunk: &mut [u8], image_chunk: &[u8], c1: __m512i, c2: __m512i) {
    let this_ptr = this_chunk.as_mut_ptr() as *mut i8;
    let image_ptr = image_chunk.as_ptr() as *const i8;

    let this_argb = _mm256_loadu_epi8(this_ptr);
    let image_argb = _mm256_loadu_epi8(image_ptr);

    let this_u16 = _mm512_cvtepu8_epi16(this_argb);
    let image_u16 = _mm512_cvtepu8_epi16(image_argb);

    let image_alpha = _mm512_shuffle_epi8(image_u16, c1);
    let image_inv_alpha = _mm512_sub_epi8(c2, image_alpha);

    let this_blended = _mm512_mullo_epi16(this_u16, image_inv_alpha);
    let image_blended = _mm512_mullo_epi16(image_u16, image_alpha);

    let blended = _mm512_add_epi16(this_blended, image_blended);

    let divided = _mm512_srli_epi16(blended, 8);
    let divided_u8 = _mm512_cvtepi16_epi8(divided);

    _mm256_storeu_epi8(this_ptr, divided_u8);
}

extern crate test;

#[bench]
fn overlay_using_avx2(bencher: &mut test::Bencher) {
    let mut frame = vec![0; 1920 * 1080 * 4];
    let image = vec![0; 1920 * 1080 * 4];

    let constant1 = unsafe { _mm256_set_epi8(-1, 24, -1, 24, -1, 24, -1, -1, -1, 16, -1, 16, -1, 16, -1, -1, -1, 8, -1, 8,  -1, 8, -1, -1, -1, 0, -1, 0, -1, 0, -1, -1) };
    let constant2 = unsafe { _mm256_set_epi8(0, -1, 0, -1, 0, -1, 1, 0, 0, -1, 0, -1, 0, -1, 1, 0, 0, -1, 0, -1, 0, -1, 1, 0, 0, -1, 0, -1, 0, -1, 1, 0) };

    bencher.iter(|| {
        let frame_chunks = frame.chunks_exact_mut(128 / 8);
        let image_chunks = image.chunks_exact(128 / 8);

        for (frame_chunk, image_chunk) in frame_chunks.zip(image_chunks) {
            unsafe { overlay_chunk_avx2(frame_chunk, image_chunk, constant1, constant2); }
        }
    });
}

#[bench]
fn overlay_using_avx512(bencher: &mut test::Bencher) {
    let mut frame = vec![0; 1920 * 1080 * 4];
    let image = vec![0; 1920 * 1080 * 4];

    let constant1 = unsafe { _mm512_set_epi8(-1, 56, -1, 56, -1, 56, -1, -1, -1, 48, -1, 48, -1, 48, -1, -1, -1, 40, -1, 40, -1, 40, -1, -1, -1, 32, -1, 32, -1, 32, -1, -1, -1, 24, -1, 24, -1, 24, -1, -1, -1, 16, -1, 16, -1, 16, -1, -1, -1, 8, -1, 8, -1, 8, -1, -1, -1, 0, -1, 0, -1, 0,  -1, -1) };
    let constant2 = unsafe { _mm512_set_epi8(0, -1, 0, -1, 0, -1, 1, 0, 0, -1, 0, -1, 0, -1, 1, 0, 0, -1, 0, -1, 0, -1, 1, 0, 0, -1, 0, -1, 0, -1, 1, 0, 0, -1, 0, -1, 0, -1, 1, 0, 0, -1, 0, -1, 0, -1, 1, 0, 0, -1, 0, -1, 0, -1, 1, 0, 0, -1, 0, -1, 0, -1, 1, 0) };

    bencher.iter(|| {
        let frame_chunks = frame.chunks_exact_mut(256 / 8);
        let image_chunks = image.chunks_exact(256 / 8);

        for (frame_chunk, image_chunk) in frame_chunks.zip(image_chunks) {
            unsafe { overlay_chunk_avx512(frame_chunk, image_chunk, constant1, constant2); }
        }
    });
}

当我编译 Rust 代码时，我是否缺少 AVX512 的目标功能

推荐答案

Rust相关问答推荐

如何从接收&；self的方法克隆RC

如果A == B，则将Rc A下推到Rc B

Rust：跨多个线程使用hashmap Arc和rwlock

rust 蚀生命周期行为

当一个箱子有自己的依赖关系时，两个人如何克服S每箱1库+n箱的限制？

捕获FnMut闭包的时间不够长

如何在递归数据 struct 中移动所有权时变异引用？

Rust中WPARAM和VIRTUAL_KEY的比较

Rust面向对象设计模式

我应该如何表达具有生命周期参数的类型的总排序，同时允许与不同生命周期进行比较？

std mpsc 发送者通道在闭包中使用时关闭

面临意外的未对齐指针取消引用：地址必须是 0x8 的倍数，但为 0x__错误

部署Rust发布二进制文件的先决条件

实现泛型的 Trait 方法中的文字

仅当函数写为闭包时才会出现生命周期错误

我什么时候应该使用特征作为 Rust 的类型？

我可以在不调用 .clone() 的情况下在类型转换期间重用 struct 字段吗？

为什么我不能为 Display+Debug 的泛型类型实现 std：：error：：Error 但有一个不是泛型参数的类型？

在 Rust 中有条件地导入？

如何从 Rust 中不同类型的多个部分加入 Path？