First, check whether this is related to the warning in the comment on that method:
/*! ZDICT_trainFromBuffer():
* Train a dictionary from an array of samples.
* Redirect towards ZDICT_optimizeTrainFromBuffer_fastCover() single-threaded, with d=8, steps=4,
* f=20, and accel=1.
* Samples must be stored concatenated in a single flat buffer `samplesBuffer`,
* supplied with an array of sizes `samplesSizes`, providing the size of each sample, in order.
* The resulting dictionary will be saved into `dictBuffer`.
* @return: size of dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)
* or an error code, which can be tested with ZDICT_isError().
* Note: Dictionary training will fail if there are not enough samples to construct a
* dictionary, or if most of the samples are too small (< 8 bytes being the lower limit).
* If dictionary training fails, you should use zstd without a dictionary, as the dictionary
* would've been ineffective anyways. If you believe your samples would benefit from a dictionary
* please open an issue with details, and we can look into it.
* Note: ZDICT_trainFromBuffer()'s memory usage is about 6 MB.
* Tips: In general, a reasonable dictionary has a size of ~ 100 KB.
* It's possible to select smaller or larger size, just by specifying `dictBufferCapacity`.
* In general, it's recommended to provide a few thousands samples, though this can vary a lot.
* It's recommended that total size of all samples be about ~x100 times the target size of dictionary.
*/
ZDICTLIB_API size_t ZDICT_trainFromBuffer(void* dictBuffer, size_t dictBufferCapacity,
const void* samplesBuffer,
const size_t* samplesSizes, unsigned nbSamples);
(That is from the C library facebook/zstd, but the same idea applies to the wrapper library skbkontur/ZstdNet.)
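In that comment, the warning is:
Dictionary training will fail if there are not enough samples to construct a dictionary, or if most of the samples are too small (< 8 bytes being the lower limit).
If dictionary training fails, you should use zstd without a dictionary, as the dictionary would've been ineffective anyways.
If you believe your samples would benefit from a dictionary please open an issue with details, and we can look into it.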
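The method expects a collection of byte arrays as training samples for dictionary creation, but the way those samples are generated may not work in every scenario, especially for the minimal and repetitive text example you provided.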
So make sure the input text provides enough unique samples for dictionary training. A larger and more varied dataset might be necessary.
Try manually creating a larger and more diverse set of samples for the dictionary training process if your use case involves small or very specific text samples.
using System.Collections.Generic;
using System.Text;

var text = "bla bla, bla bla bla";
var bytes = Encoding.UTF8.GetBytes(text); // too short and repetitive to train on by itself

// Make sure the samples are sufficiently diverse and numerous for training;
// the two below are placeholders and are not enough on their own
var samples = new List<byte[]>
{
    Encoding.UTF8.GetBytes("sample text 1"),
    Encoding.UTF8.GetBytes("sample text 2"),
    // Add more varied samples
};
var dic = ZstdNet.DictBuilder.TrainFromBuffer(samples);
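To make the limits in that comment concrete, here is a minimal sketch of a pre-check you could run on your samples before training. `SampleCheck` and `LooksTrainable` are hypothetical names of my own, not part of zstd or ZstdNet, and the thresholds are only a rough reading of the comment (8-byte lower limit, total sample size around 100 times the dictionary size). Treat it as a heuristic, not a guarantee that training will succeed or fail.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: not part of zstd or ZstdNet.
public static class SampleCheck
{
    // The ZDICT_trainFromBuffer() comment gives 8 bytes as the lower limit per sample.
    public const int MinSampleSize = 8;

    // Returns true when the samples roughly match the guidance in the comment.
    // dictCapacity is the target dictionary size in bytes.
    public static bool LooksTrainable(IReadOnlyCollection<byte[]> samples, int dictCapacity)
    {
        if (samples == null || samples.Count == 0) return false;

        // Count the samples that reach the 8-byte lower limit.
        int usable = samples.Count(s => s != null && s.Length >= MinSampleSize);

        // Sum all sample sizes as long to avoid overflow on large sets.
        long totalBytes = samples.Where(s => s != null).Sum(s => (long)s.Length);

        // "Most of the samples are too small" is a stated failure condition.
        bool mostlyUsable = usable * 2 > samples.Count;

        // Recommendation only: total size of about 100x the dictionary size.
        bool enoughData = totalBytes >= 100L * dictCapacity;

        return mostlyUsable && enoughData;
    }
}
```

With the two placeholder samples above and a dictionary size of around 100 KB, this returns false, which matches the failure you are seeing. If the check fails, or if `DictBuilder.TrainFromBuffer` still throws, the advice in the comment applies: compress without a dictionary.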