I have a method that creates a MessageDigest (a hash) from a file, and I need to do this to a lot of files (>= 100,000). How big should I make the buffer used to read from the files to maximize performance?

Most people are familiar with the basic code (which I'll repeat here just in case):

MessageDigest md = MessageDigest.getInstance( "SHA" );
FileInputStream ios = new FileInputStream( "myfile.bmp" );
byte[] buffer = new byte[4 * 1024]; // what should this value be?
int read = 0;
// read() returns -1 at end of stream
while( ( read = ios.read( buffer ) ) != -1 )
    md.update( buffer, 0, read );
ios.close();
byte[] hash = md.digest(); // the finished hash

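What is the ideal buffer size to maximize throughput? I know this is system dependent, and I'm pretty sure it depends on the OS, the file system, and the hard disk, and there may be other hardware and software in the mix.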

(I should point out that I'm somewhat new to Java, so this may just be some Java API call I don't know about.)

Edit: I don't know ahead of time what kinds of systems this will be used on, so I can't assume much. (That is why I'm using Java.)

The code above leaves out things like try..catch to keep the post short.

Recommended answer

The optimum buffer size depends on a number of things: file system block size, CPU cache size, and cache latency.

Most file systems are configured to use block sizes of 4096 or 8192. In theory, if you configure your buffer size so you are reading a few bytes more than the disk block, the operations with the file system can be extremely inefficient (i.e. if you configured your buffer to read 4100 bytes at a time, each read would require 2 block reads by the file system). If the blocks are already in cache, then you wind up paying the price of RAM -> L3/L2 cache latency. If you are unlucky and the blocks are not in cache yet, then you pay the price of the disk->RAM latency as well.

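This is why you see most buffers sized as a power of 2, and generally larger than (or equal to) the disk block size. This means that one of your stream reads could result in multiple disk block reads - but those reads will always use a full block - no wasted reads.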
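The block arithmetic above can be checked with a small counting model. This is only a sketch: the class and method names are made up for illustration, the 4096-byte block size is an assumption, and it counts how many blocks each read overlaps, not how many physical disk reads the OS actually issues (caching changes that picture considerably).

```java
public class BlockReads {
    // Number of file system blocks overlapped by one read of `length`
    // bytes starting at byte `offset` (hypothetical helper).
    public static long blocksTouched(long offset, long length, long blockSize) {
        if (length <= 0) {
            return 0;
        }
        long firstBlock = offset / blockSize;
        long lastBlock = (offset + length - 1) / blockSize;
        return lastBlock - firstBlock + 1;
    }

    // Sum of blocks overlapped by sequential reads of `bufferSize` bytes
    // that together cover a file of `fileSize` bytes.
    public static long totalBlockTouches(long fileSize, long bufferSize, long blockSize) {
        long total = 0;
        for (long offset = 0; offset < fileSize; offset += bufferSize) {
            long length = Math.min(bufferSize, fileSize - offset);
            total += blocksTouched(offset, length, blockSize);
        }
        return total;
    }

    public static void main(String[] args) {
        long blockSize = 4096;       // assumed block size
        long fileSize = 1024 * 1024; // a 1 MiB example file
        for (long bufferSize : new long[] {4096, 4100, 8192}) {
            System.out.println(bufferSize + " byte buffer: "
                    + totalBlockTouches(fileSize, bufferSize, blockSize)
                    + " block touches");
        }
    }
}
```

In this model the 4096 and 8192 byte buffers touch every block exactly once, while the 4100 byte buffer straddles a block boundary on nearly every read and touches almost twice as many.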

Now, this is offset quite a bit in a typical streaming scenario because the block that is read from disk is going to still be in memory when you hit the next read (we are doing sequential reads here, after all) - so you wind up paying the RAM -> L3/L2 cache latency price on the next read, but not the disk->RAM latency. In terms of order of magnitude, disk->RAM latency is so slow that it pretty much swamps any other latency you might be dealing with.

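So, I suspect that if you ran a test with different buffer sizes (I haven't done this myself), you would probably find that buffer size has a big impact up to the size of the file system block. Above that, I suspect things would level out pretty quickly.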
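A rough way to run that kind of test is sketched below. It is a naive timing loop, not a rigorous benchmark: the class name, the range of buffer sizes, and the single warm-up pass are all assumptions, and because the file will usually sit in the OS cache after the first pass, the numbers mostly reflect the cached case. "SHA-1" is used here as the standard algorithm name.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BufferSizeBenchmark {
    // Hash one file, reading it through a buffer of the given size.
    public static byte[] hashFile(Path file, int bufferSize)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[bufferSize];
        try (InputStream in = new FileInputStream(file.toFile())) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        Path file = Paths.get(args[0]); // path of a test file, passed on the command line

        hashFile(file, 8192); // warm-up pass (JIT and OS cache), result ignored

        // Double the buffer size from 512 bytes up to 1 MiB.
        for (int size = 512; size <= 1024 * 1024; size *= 2) {
            long start = System.nanoTime();
            hashFile(file, size);
            long elapsedMicros = (System.nanoTime() - start) / 1_000;
            System.out.println(size + " bytes: " + elapsedMicros + " us");
        }
    }
}
```

For numbers you can trust, repeat each measurement several times, test on files representative of the real workload, and compare cold-cache and warm-cache runs separately.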

There are a ton of conditions and exceptions here - the complexities of the system are actually quite staggering (just getting a handle on L3 -> L2 cache transfers is mind bogglingly complex, and it changes with every CPU type).

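This leads to the "real world" answer: if your app is like 99% of the apps out there, set the buffer size to 8192 and move on (even better, choose encapsulation over performance and use BufferedInputStream to hide the details). If you are in the 1% of apps that are highly dependent on disk throughput, craft your implementation so you can swap out different disk interaction strategies, and provide the knobs and dials that let your users test and optimize (or come up with some self-optimizing system).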
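One possible shape for the "set it to 8192 and move on" advice is sketched below. The FileHasher class and its constructor parameter are hypothetical names, and "SHA-1" is an assumed algorithm choice. The point is only that the buffer size lives in one place, behind a BufferedInputStream, so it can become a tunable knob later without touching the callers.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileHasher {
    public static final int DEFAULT_BUFFER_SIZE = 8192;

    private final int bufferSize;

    public FileHasher() {
        this(DEFAULT_BUFFER_SIZE);
    }

    // The buffer size is the single "knob"; callers never see the read loop.
    public FileHasher(int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        this.bufferSize = bufferSize;
    }

    public byte[] hash(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] chunk = new byte[bufferSize];
        // try-with-resources closes the stream even if reading fails
        try (InputStream in =
                new BufferedInputStream(Files.newInputStream(file), bufferSize)) {
            int read;
            while ((read = in.read(chunk)) != -1) {
                md.update(chunk, 0, read);
            }
        }
        return md.digest();
    }
}
```

Typical use would be `byte[] digest = new FileHasher().hash(Paths.get("myfile.bmp"));`, with a different size passed to the constructor only if measurement on the target system justifies it.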
