C++ 生成每个位给定概率为 0 或 1 的伪随机位的快速方法

发布于03月05日

通常，随机数生成器返回一个比特流，在每个位置观察到0或1的概率相等(即50%).让我们称之为无偏见的PRNG.

我需要生成一个具有以下属性的伪随机比特串:在每个位置看到1的概率是p(即看到0的概率是1-p).参数p是介于0和1之间的实数；在我的问题中，它恰好有0.5%的分辨率，也就是说，它可以取0%，0.5%，1%，1.5%...，99.5%, 100%.

注意p是一个概率，而不是一个精确的分数.在n比特流中设置为1的实际比特数必须遵循二项分布B(n，p).

有一种简单的方法可以使用无偏PRNG来生成每个位(伪代码)的值:

generate_biased_stream(n, p):
  result = []
  for i in 1 to n:
    if random_uniform(0, 1) < p:
      result.append(1)
    else:
      result.append(0)
  return result

这种实现比生成无偏流慢得多，因为它每比特调用一次随机数生成器函数；而无偏流生成器按每个字大小调用它一次(例如，一次调用可以生成32或64个随机位).

我想要一个更快的实现，即使它稍微牺牲了随机性.我想到的一个 idea 是预先计算一个查找表:对于p的200个可能值中的每一个，使用较慢的算法计算c8位值，并将它们保存在一个表中.然后，快速算法将随机选取其中一个来生成8个倾斜位.

A back of the envelope calculation to see how much memory is needed: C should be at least 256 (the number of possible 8-bit values), probably more to avoid sampling effects; let's say 1024. Maybe the number should vary depending on p, but let's keep it simple and say the average is 1024. Since there are 200 values of p => total memory usage is 200 KB. This is not bad, and might fit in the L2 cache (256 KB). I still need to evaluate it to see if there are sampling effects that introduce biases, in which case C will have to be increased.

这种解决方案的一个缺点是，它一次只能生成8位，即使需要大量工作，而无偏PRNG只需几条算术指令就可以一次生成64位.

我想知道是否有一种更快的方法，基于位运算而不是查找表.例如，直接修改随机数生成代码， for each 位引入偏差.这将实现与无偏PRNG相同的性能.

编辑3月5日

谢谢大家的建议，我有很多有趣的 idea 和建议.以下是最重要的:

更改问题要求，使p的分辨率为1/256，而不是1/200.这允许更有效地使用BIT，也提供了更多优化机会.我想我可以改变.
使用算术编码有效地消耗来自无偏生成器的位.通过以上分辨率的改变，这变得容易多了.
一些人认为PRNG非常快，因此使用算术编码实际上可能会由于引入的开销而使代码变慢.相反，我应该始终使用最坏情况下的位数并优化代码.参见下面的基准.
@rici建议使用SIMD.这是一个好主意，只有当我们总是消耗固定数量的比特时，它才会起作用.

基准测试(无算术解码)

注意:正如你们很多人所建议的，我将分辨率从1/200改为1/256.

我写了several implementations个朴素的方法，只需要8个随机无偏位，并生成1个有偏位:

没有SIMD
SIMD使用Agner Fog的矢量类库，as suggested by @rici
使用内部函数的SIMD

我使用两个无偏伪随机数生成器:

xorshift128plus
阿格纳·福格图书馆的Ranvec1 (Mersenne Twister-like)人.

为了进行比较，我还测量了无偏PRNG的速度.以下是结果:


RNG: Ranvec1(Mersenne Twister for Graphics Processors + Multiply with Carry)

Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Gbps/s: 16.081 16.125 16.093 [Gb/s]
Number of ones: 536,875,204 536,875,204 536,875,204
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency
Gbps/s: 0.778 0.783 0.812 [Gb/s]
Number of ones: 104,867,269 104,867,269 104,867,269
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=vectorclass
Gbps/s: 2.176 2.184 2.145 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=intrinsics
Gbps/s: 2.129 2.151 2.183 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical   : 104,857,600

与标量方法相比，SIMD将性能提高了3倍.正如预期的那样，它比无偏发生器慢8倍.

最快的偏置发生器达到2.1 Gb/s.


RNG: xorshift128plus

Method: Unbiased with 1/1 efficiency (incorrect, baseline)
Gbps/s: 18.300 21.486 21.483 [Gb/s]
Number of ones: 536,867,655 536,867,655 536,867,655
Theoretical   : 104,857,600

Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Gbps/s: 22.660 22.661 24.662 [Gb/s]
Number of ones: 536,867,655 536,867,655 536,867,655
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency
Gbps/s: 1.065 1.102 1.078 [Gb/s]
Number of ones: 104,868,930 104,868,930 104,868,930
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=vectorclass
Gbps/s: 4.972 4.971 4.970 [Gb/s]
Number of ones: 104,869,407 104,869,407 104,869,407
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=intrinsics
Gbps/s: 4.955 4.971 4.971 [Gb/s]
Number of ones: 104,869,407 104,869,407 104,869,407
Theoretical   : 104,857,600

对于xorshift，与标量方法相比，SIMD将性能提高了5倍.它比无偏发生器慢4倍.请注意，这是xorshift的标量实现.

最快的偏置发生器达到4.9 Gb/s.


RNG: xorshift128plus_avx2

Method: Unbiased with 1/1 efficiency (incorrect, baseline)
Gbps/s: 18.754 21.494 21.878 [Gb/s]
Number of ones: 536,867,655 536,867,655 536,867,655
Theoretical   : 104,857,600

Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Gbps/s: 54.126 54.071 54.145 [Gb/s]
Number of ones: 536,874,540 536,880,718 536,891,316
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency
Gbps/s: 1.093 1.103 1.063 [Gb/s]
Number of ones: 104,868,930 104,868,930 104,868,930
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=vectorclass
Gbps/s: 19.567 19.578 19.555 [Gb/s]
Number of ones: 104,836,115 104,846,215 104,835,129
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=intrinsics
Gbps/s: 19.551 19.589 19.557 [Gb/s]
Number of ones: 104,831,396 104,837,429 104,851,100
Theoretical   : 104,857,600

此实现使用AVX2并行运行4个无偏异或移位生成器.

最快的偏置发生器达到19.5 Gb/s.

算术解码基准

简单的测试表明，算术解码码是瓶颈，而不是PRNG.因此，我只是对最昂贵的PRNG进行基准测试.


RNG: Ranvec1(Mersenne Twister for Graphics Processors + Multiply with Carry)

Method: Arithmetic decoding (floating point)
Gbps/s: 0.068 0.068 0.069 [Gb/s]
Number of ones: 10,235,580 10,235,580 10,235,580
Theoretical   : 10,240,000

Method: Arithmetic decoding (fixed point)
Gbps/s: 0.263 0.263 0.263 [Gb/s]
Number of ones: 10,239,367 10,239,367 10,239,367
Theoretical   : 10,240,000

Method: Unbiased with 1/1 efficiency (incorrect, baseline)
Gbps/s: 12.687 12.686 12.684 [Gb/s]
Number of ones: 536,875,204 536,875,204 536,875,204
Theoretical   : 104,857,600

Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Gbps/s: 14.536 14.536 14.536 [Gb/s]
Number of ones: 536,875,204 536,875,204 536,875,204
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency
Gbps/s: 0.754 0.754 0.754 [Gb/s]
Number of ones: 104,867,269 104,867,269 104,867,269
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=vectorclass
Gbps/s: 2.094 2.095 2.094 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical   : 104,857,600

Method: Biased with 1/8 efficiency, SIMD=intrinsics
Gbps/s: 2.094 2.094 2.095 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical   : 104,857,600

简单定点方法达到0.25 GB/s，朴素标量法快3倍，朴素SIMD法快8倍.可能有一些方法可以进一步优化和/或并行化算术解码方法，但由于其复杂性，我决定停在这里， Select 简单的SIMD实现.

谢谢大家的帮助.

C++ 生成每个位给定概率为 0 或 1 的伪随机位的快速方法

编辑3月5日

基准测试(无算术解码)

算术解码基准

推荐答案

C++相关问答推荐

C中空终止符后面的数字？

数据包未从DPDK端口传输到内核端口

增加getaddrinfo返回的IP地址数量

整型文字后缀在左移中的用途

&；(str[i])和(&；str)[i]有什么区别？

什么是.c.h文件？

==284==错误：AddressSaniizer：堆栈缓冲区下溢

将变量或参数打包到 struct /联合中是否会带来意想不到的性能损失？

C代码可以在在线编译器上运行，但不能在Leetcode上运行

C语言中MPI发送接收字符串时出现的分段错误

安全倒计时循环

C++中PUTS函数的返回值

Linux Posix消息队列

分支预测和UB(未定义的行为)

If语句默认为true

将char*铸造为空**

Struct 内的数组赋值

C/C++编译器可以在编译过程中通过按引用传递来优化按值传递吗？

如何用用户输入的多个字符串填充数组？

在链表中插入一个值