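I'm currently experimenting with the ZstdNet library, which uses pre-generated dictionaries to compress small texts. Compression works in most cases, but in some cases I get an exception with the message "Src size is incorrect".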

I've created a minimal test that reproduces the problem:

using System.Linq;
using System.Text;

var text = "bla bla, bla bla bla";
var bytes = Encoding.UTF8.GetBytes(text);
var dic = ZstdNet.DictBuilder.TrainFromBuffer(text.Split(' ').Select(Encoding.UTF8.GetBytes));

Any insights or suggestions?

Recommended answer

First, check whether this is related to the warning in the comment on that method:

/*! ZDICT_trainFromBuffer():
 *  Train a dictionary from an array of samples.
 *  Redirect towards ZDICT_optimizeTrainFromBuffer_fastCover() single-threaded, with d=8, steps=4,
 *  f=20, and accel=1.
 *  Samples must be stored concatenated in a single flat buffer `samplesBuffer`,
 *  supplied with an array of sizes `samplesSizes`, providing the size of each sample, in order.
 *  The resulting dictionary will be saved into `dictBuffer`.
 * @return: size of dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)
 *          or an error code, which can be tested with ZDICT_isError().
 *  Note:  Dictionary training will fail if there are not enough samples to construct a
 *         dictionary, or if most of the samples are too small (< 8 bytes being the lower limit).
 *         If dictionary training fails, you should use zstd without a dictionary, as the dictionary
 *         would've been ineffective anyways. If you believe your samples would benefit from a dictionary
 *         please open an issue with details, and we can look into it.
 *  Note: ZDICT_trainFromBuffer()'s memory usage is about 6 MB.
 *  Tips: In general, a reasonable dictionary has a size of ~ 100 KB.
 *        It's possible to select smaller or larger size, just by specifying `dictBufferCapacity`.
 *        In general, it's recommended to provide a few thousands samples, though this can vary a lot.
 *        It's recommended that total size of all samples be about ~x100 times the target size of dictionary.
 */
ZDICTLIB_API size_t ZDICT_trainFromBuffer(void* dictBuffer, size_t dictBufferCapacity,
                                    const void* samplesBuffer,
                                    const size_t* samplesSizes, unsigned nbSamples);

(This comes from the C library facebook/zstd, but the same idea applies to the wrapper library skbkontur/ZstdNet.)

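The warnings in that comment are:

Dictionary training will fail if there are not enough samples to construct a dictionary, or if most of the samples are too small (< 8 bytes being the lower limit).

If dictionary training fails, you should use zstd without a dictionary, as the dictionary would've been ineffective anyways.

If you believe your samples would benefit from a dictionary please open an issue with details, and we can look into it.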

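The method expects a collection of byte arrays as training samples for dictionary creation, but the way your test generates those samples may not work in every scenario, especially for the minimal and repetitive text in your example: splitting it on spaces yields only a handful of samples, each just a few bytes long.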
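To make the sample-size problem concrete, here is a minimal sketch that measures the samples produced by the split in the question. The `SampleCheck` class and its method names are hypothetical helpers for illustration, not part of ZstdNet; the 8-byte threshold is the lower limit quoted in the comment above.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper, not part of ZstdNet: inspects sample sizes before training.
public static class SampleCheck
{
    // Lower limit quoted in the ZDICT_trainFromBuffer() comment.
    public const int MinSampleBytes = 8;

    // Returns the UTF-8 byte length of each space-separated piece of the text.
    public static int[] SampleSizes(string text) =>
        text.Split(' ').Select(s => Encoding.UTF8.GetByteCount(s)).ToArray();

    // Counts the samples that fall below the quoted lower limit.
    public static int CountTooSmall(int[] sizes) =>
        sizes.Count(size => size < MinSampleBytes);
}
```

For the text in the question, `SampleCheck.SampleSizes("bla bla, bla bla bla")` gives sizes 3, 4, 3, 3, 3, so all five samples are under the 8-byte limit, and five samples is far short of the "few thousands" the comment recommends.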

So make sure the input provides enough distinct samples for dictionary training, each at least 8 bytes long. A larger and more varied dataset might be necessary.
If your use case involves small or very specific text samples, try manually creating a larger and more diverse set of samples for the training process.

var text = "bla bla, bla bla bla";
var bytes = Encoding.UTF8.GetBytes(text);
// Make sure the training set consists of sufficiently diverse and numerous byte arrays
var samples = new List<byte[]>{
    Encoding.UTF8.GetBytes("sample text 1"),
    Encoding.UTF8.GetBytes("sample text 2"),
    // Add many more varied samples; these two alone are still far too few
};
var dic = ZstdNet.DictBuilder.TrainFromBuffer(samples);
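The comment's other piece of advice, using zstd without a dictionary when training fails, could be sketched roughly as follows. This is only an illustration under assumptions: `DictionaryFallback` and its methods are hypothetical names, and the sketch assumes that the training failure surfaces as a `ZstdNet.ZstdException`, which you should verify against the exception you actually observe.

```csharp
using System.Collections.Generic;
using ZstdNet;

// Hypothetical helper: falls back to dictionary-less compression when training fails.
public static class DictionaryFallback
{
    // Returns a trained dictionary, or null if training fails
    // (assumption: the failure is reported as ZstdException).
    public static byte[] TryTrain(IEnumerable<byte[]> samples)
    {
        try
        {
            return DictBuilder.TrainFromBuffer(samples);
        }
        catch (ZstdException)
        {
            return null;
        }
    }

    // Compresses with the dictionary when one is available, otherwise without.
    public static byte[] Compress(byte[] data, byte[] dict)
    {
        if (dict == null)
        {
            using (var compressor = new Compressor())
                return compressor.Wrap(data);
        }

        using (var options = new CompressionOptions(dict))
        using (var compressor = new Compressor(options))
            return compressor.Wrap(data);
    }
}
```

Whichever path is taken, decompression has to match it: data compressed with a dictionary needs the same dictionary when it is unwrapped, so the choice should be recorded alongside the compressed data.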
