Why does pickling an sklearn decision tree generate a pickle thousands of times bigger (in terms of memory) than the original estimator?
I ran into this issue at work: a random forest estimator (with 100 decision trees) fit on a dataset with about 1,000,000 samples and 7 features generated a pickle bigger than 2 GB.
I was able to track the problem down to the pickling of a single decision tree, and I was able to replicate the issue with a generated dataset, as shown below.
For the memory estimates I used the pympler library. The sklearn version used is 1.0.1.
# here using a regressor tree but I would expect the same issue to be present with a classification tree
import pickle
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_friedman1  # using a dataset generation function from sklearn
from pympler import asizeof

# function that creates the dataset and trains the estimator
def make_example(n_samples: int):
    X, y = make_friedman1(n_samples=n_samples, n_features=7, noise=1.0, random_state=49)
    estimator = DecisionTreeRegressor(max_depth=50, max_features='auto', min_samples_split=5)
    estimator.fit(X, y)
    return X, y, estimator

# utilities to compute and compare the size of an object and its pickled version
def readable_size(size_in_bytes: int, suffix='B') -> str:
    num = size_in_bytes
    for unit in ['', 'k', 'M', 'G', 'T', 'P', 'E', 'Z']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f %s%s" % (num, 'Yi', suffix)

def print_size(obj, skip_detail=False):
    obj_size = asizeof.asized(obj).size
    print(readable_size(obj_size))
    return obj_size

def compare_with_pickle(obj):
    size_obj = print_size(obj)
    size_pickle = print_size(pickle.dumps(obj))
    print(f"Ratio pickle/obj: {(size_pickle / size_obj):.2f}")

_, _, model100K = make_example(100_000)
compare_with_pickle(model100K)

_, _, model1M = make_example(1_000_000)
compare_with_pickle(model1M)
Output:
1.7 kB
4.9 MB
Ratio pickle/obj: 2876.22
1.7 kB
49.3 MB
Ratio pickle/obj: 28982.84
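As an extra data point, nearly all of the pickled bytes seem to come from the fitted `tree_` attribute rather than from the estimator's parameters, which suggests the discrepancy is about how the two measurements treat the fitted tree, not about the estimator wrapper itself. A minimal sketch checking this (same `make_friedman1` setup as above; the smaller sample count of 20,000 is an arbitrary choice to keep it quick, and I dropped `max_features='auto'` since it is deprecated in newer sklearn versions):

```python
# Sketch: compare the pickle size of the whole estimator with the pickle
# size of just its fitted tree_ attribute.
import pickle
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_friedman1

X, y = make_friedman1(n_samples=20_000, n_features=7, noise=1.0, random_state=49)
estimator = DecisionTreeRegressor(max_depth=50, min_samples_split=5)
estimator.fit(X, y)

full = len(pickle.dumps(estimator))        # bytes for the whole estimator
tree_only = len(pickle.dumps(estimator.tree_))  # bytes for the fitted tree alone
print(full, tree_only, tree_only / full)
```

On my runs the `tree_` pickle accounts for the overwhelming majority of the total, so whatever inflates the pickle lives in the tree's internal arrays.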