通常,随机数生成器返回一个比特流,在每个位置观察到0或1的概率相等(即50%).让我们称之为无偏见的PRNG.
我需要生成一个具有以下属性的伪随机比特串:在每个位置看到1的概率是p(即看到0的概率是1-p).参数p是介于0和1之间的实数;在我的问题中,它恰好有0.5%的分辨率,也就是说,它可以取0%,0.5%,1%,1.5%...,99.5%, 100%.
注意p是一个概率,而不是一个精确的分数.在n比特流中设置为1的实际比特数必须遵循二项分布B(n,p).
有一种简单的方法可以使用无偏PRNG来生成每个位(伪代码)的值:
generate_biased_stream(n, p):
result = []
for i in 1 to n:
if random_uniform(0, 1) < p:
result.append(1)
else:
result.append(0)
return result
这种实现比生成无偏流慢得多,因为它每比特调用一次随机数生成器函数;而无偏流生成器按每个字大小调用它一次(例如,一次调用可以生成32或64个随机位).
我想要一个更快的实现,即使它稍微牺牲了随机性.我想到的一个 idea 是预先计算一个查找表:对于p的200个可能值中的每一个,使用较慢的算法计算c8位值,并将它们保存在一个表中.然后,快速算法将随机选取其中一个来生成8个倾斜位.
A back of the envelope calculation to see how much memory is needed: C should be at least 256 (the number of possible 8-bit values), probably more to avoid sampling effects; let's say 1024. Maybe the number should vary depending on p, but let's keep it simple and say the average is 1024. Since there are 200 values of p => total memory usage is 200 KB. This is not bad, and might fit in the L2 cache (256 KB). I still need to evaluate it to see if there are sampling effects that introduce biases, in which case C will have to be increased.
这种解决方案的一个缺点是,它一次只能生成8位,即使需要大量工作,而无偏PRNG只需几条算术指令就可以一次生成64位.
我想知道是否有一种更快的方法,基于位运算而不是查找表.例如,直接修改随机数生成代码, for each 位引入偏差.这将实现与无偏PRNG相同的性能.
编辑3月5日
谢谢大家的建议,我有很多有趣的 idea 和建议.以下是最重要的:
- 更改问题要求,使p的分辨率为1/256,而不是1/200.这允许更有效地使用BIT,也提供了更多优化机会.我想我可以改变.
- 使用算术编码有效地消耗来自无偏生成器的位.通过以上分辨率的改变,这变得容易多了.
- 一些人认为PRNG非常快,因此使用算术编码实际上可能会由于引入的开销而使代码变慢.相反,我应该始终使用最坏情况下的位数并优化代码.参见下面的基准.
- @rici建议使用SIMD.这是一个好主意,只有当我们总是消耗固定数量的比特时,它才会起作用.
基准测试(无算术解码)
注意:正如你们很多人所建议的,我将分辨率从1/200改为1/256.
我写了several implementations个朴素的方法,只需要8个随机无偏位,并生成1个有偏位:
- 没有SIMD
- SIMD使用Agner Fog的矢量类库,as suggested by @rici
- 使用内部函数的SIMD
我使用两个无偏伪随机数生成器:
- xorshift128plus
- 阿格纳·福格图书馆的Ranvec1 (Mersenne Twister-like)人.
为了进行比较,我还测量了无偏PRNG的速度.以下是结果:
RNG: Ranvec1(Mersenne Twister for Graphics Processors + Multiply with Carry)
Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Gbps/s: 16.081 16.125 16.093 [Gb/s]
Number of ones: 536,875,204 536,875,204 536,875,204
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency
Gbps/s: 0.778 0.783 0.812 [Gb/s]
Number of ones: 104,867,269 104,867,269 104,867,269
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=vectorclass
Gbps/s: 2.176 2.184 2.145 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=intrinsics
Gbps/s: 2.129 2.151 2.183 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical : 104,857,600
与标量方法相比,SIMD将性能提高了3倍.正如预期的那样,它比无偏发生器慢8倍.
最快的偏置发生器达到2.1 Gb/s.
RNG: xorshift128plus
Method: Unbiased with 1/1 efficiency (incorrect, baseline)
Gbps/s: 18.300 21.486 21.483 [Gb/s]
Number of ones: 536,867,655 536,867,655 536,867,655
Theoretical : 104,857,600
Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Gbps/s: 22.660 22.661 24.662 [Gb/s]
Number of ones: 536,867,655 536,867,655 536,867,655
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency
Gbps/s: 1.065 1.102 1.078 [Gb/s]
Number of ones: 104,868,930 104,868,930 104,868,930
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=vectorclass
Gbps/s: 4.972 4.971 4.970 [Gb/s]
Number of ones: 104,869,407 104,869,407 104,869,407
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=intrinsics
Gbps/s: 4.955 4.971 4.971 [Gb/s]
Number of ones: 104,869,407 104,869,407 104,869,407
Theoretical : 104,857,600
对于xorshift,与标量方法相比,SIMD将性能提高了5倍.它比无偏发生器慢4倍.请注意,这是xorshift的标量实现.
最快的偏置发生器达到4.9 Gb/s.
RNG: xorshift128plus_avx2
Method: Unbiased with 1/1 efficiency (incorrect, baseline)
Gbps/s: 18.754 21.494 21.878 [Gb/s]
Number of ones: 536,867,655 536,867,655 536,867,655
Theoretical : 104,857,600
Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Gbps/s: 54.126 54.071 54.145 [Gb/s]
Number of ones: 536,874,540 536,880,718 536,891,316
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency
Gbps/s: 1.093 1.103 1.063 [Gb/s]
Number of ones: 104,868,930 104,868,930 104,868,930
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=vectorclass
Gbps/s: 19.567 19.578 19.555 [Gb/s]
Number of ones: 104,836,115 104,846,215 104,835,129
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=intrinsics
Gbps/s: 19.551 19.589 19.557 [Gb/s]
Number of ones: 104,831,396 104,837,429 104,851,100
Theoretical : 104,857,600
此实现使用AVX2并行运行4个无偏异或移位生成器.
最快的偏置发生器达到19.5 Gb/s.
算术解码基准
简单的测试表明,算术解码码是瓶颈,而不是PRNG.因此,我只是对最昂贵的PRNG进行基准测试.
RNG: Ranvec1(Mersenne Twister for Graphics Processors + Multiply with Carry)
Method: Arithmetic decoding (floating point)
Gbps/s: 0.068 0.068 0.069 [Gb/s]
Number of ones: 10,235,580 10,235,580 10,235,580
Theoretical : 10,240,000
Method: Arithmetic decoding (fixed point)
Gbps/s: 0.263 0.263 0.263 [Gb/s]
Number of ones: 10,239,367 10,239,367 10,239,367
Theoretical : 10,240,000
Method: Unbiased with 1/1 efficiency (incorrect, baseline)
Gbps/s: 12.687 12.686 12.684 [Gb/s]
Number of ones: 536,875,204 536,875,204 536,875,204
Theoretical : 104,857,600
Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Gbps/s: 14.536 14.536 14.536 [Gb/s]
Number of ones: 536,875,204 536,875,204 536,875,204
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency
Gbps/s: 0.754 0.754 0.754 [Gb/s]
Number of ones: 104,867,269 104,867,269 104,867,269
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=vectorclass
Gbps/s: 2.094 2.095 2.094 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=intrinsics
Gbps/s: 2.094 2.094 2.095 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical : 104,857,600
简单定点方法达到0.25 GB/s,朴素标量法快3倍,朴素SIMD法快8倍.可能有一些方法可以进一步优化和/或并行化算术解码方法,但由于其复杂性,我决定停在这里, Select 简单的SIMD实现.
谢谢大家的帮助.