Python: Why is a pickled sklearn decision tree so large (30K times bigger)?

Posted on June 15

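Why can pickling a sklearn decision tree generate a pickle thousands of times bigger (in terms of memory) than the original estimator?

I ran into this issue at work, where a random forest estimator (with 100 decision trees) trained on a dataset with around 1,000,000 samples and 7 features generated a pickle larger than 2 GB.

I was able to track the issue down to the pickling of a single decision tree, and I was able to reproduce it with a generated dataset, as shown below.

For memory estimates I used the pympler library. The sklearn version used is 1.0.1.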

# here using a regressor tree but I would expect the same issue to be present with a classification tree
import pickle
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_friedman1  # using a dataset generation function from sklearn
from pympler import asizeof

# function that creates the dataset and trains the estimator
def make_example(n_samples: int):
    X, y = make_friedman1(n_samples=n_samples, n_features=7, noise=1.0, random_state=49)
    estimator = DecisionTreeRegressor(max_depth=50, max_features='auto', min_samples_split=5)
    estimator.fit(X, y)
    return X, y, estimator

# utilities to compute and compare the size of an object and its pickled version
def readable_size(size_in_bytes: int, suffix='B') -> str:
    num = size_in_bytes
    for unit in ['', 'k', 'M', 'G', 'T', 'P', 'E', 'Z']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Yi', suffix)

def print_size(obj, skip_detail=False):
    obj_size = asizeof.asized(obj).size
    print(readable_size(obj_size))
    return obj_size

def compare_with_pickle(obj):
    size_obj = print_size(obj)
    size_pickle = print_size(pickle.dumps(obj))
    print(f"Ratio pickle/obj: {(size_pickle / size_obj):.2f}")
    
_, _, model100K = make_example(100_000)
compare_with_pickle(model100K)
_, _, model1M = make_example(1_000_000)
compare_with_pickle(model1M)

Output:

1.7 kB
4.9 MB
Ratio pickle/obj: 2876.22
1.7 kB
49.3 MB
Ratio pickle/obj: 28982.84
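
One way to check whether the size reported for the estimator object is missing data is to add up the NumPy arrays that a fitted tree exposes through `tree_` and compare that total with the length of the pickle. The following is only a sketch under assumptions: the helper name `tree_array_bytes` and the small sample size are illustrative choices, and `max_features` is left at its default because `'auto'` may not be accepted by later scikit-learn releases. If the two numbers are close, that would suggest the bytes live in buffers that the object-size estimate did not count.

```python
import pickle

from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor


def tree_array_bytes(tree) -> int:
    """Sum the sizes of the NumPy arrays exposed by a fitted Tree object."""
    arrays = [
        tree.children_left,
        tree.children_right,
        tree.feature,
        tree.threshold,
        tree.impurity,
        tree.n_node_samples,
        tree.weighted_n_node_samples,
        tree.value,
    ]
    return sum(a.nbytes for a in arrays)


# Small illustrative dataset so the check runs quickly (assumption, not the original sizes)
X, y = make_friedman1(n_samples=5_000, n_features=7, noise=1.0, random_state=49)
small_model = DecisionTreeRegressor(max_depth=50, min_samples_split=5, random_state=0)
small_model.fit(X, y)

array_bytes = tree_array_bytes(small_model.tree_)
pickle_bytes = len(pickle.dumps(small_model))

print(f"node_count: {small_model.tree_.node_count}")
print(f"bytes held in tree_ arrays: {array_bytes}")
print(f"pickle length in bytes: {pickle_bytes}")
```

Here `len(pickle.dumps(...))` is used instead of `asizeof` so that the comparison depends only on the serialized byte count.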

def print_tree_estimate(tree):
    print(f"A tree with max_depth {tree.max_depth} can have up to {2**(tree.max_depth -1)} nodes")
    print(f"This tree has node_count {tree.node_count} and a size estimate is {readable_size(tree.node_count*8*8)}")

print_tree_estimate(model100K.tree_)
print()
print_tree_estimate(model1M.tree_)

Output:

A tree with max_depth 37 can have up to 68719476736 nodes
This tree has node_count 80159 and a size estimate is 4.9 MB

A tree with max_depth 46 can have up to 35184372088832 nodes
This tree has node_count 807881 and a size estimate is 49.3 MB
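
The arithmetic behind the size estimate can be reproduced without training anything. The sketch below takes the two node counts reported above and multiplies them by the 8 * 8 bytes per node used in the estimate; `readable_mb` is a hypothetical helper introduced only for this check.

```python
BYTES_PER_NODE = 8 * 8  # per-node estimate used in the question (8 fields of 8 bytes)


def readable_mb(size_in_bytes: int) -> str:
    """Format a byte count in units of 1024 * 1024 bytes, as readable_size does for 'M'."""
    return f"{size_in_bytes / 1024 ** 2:.1f} MB"


# Node counts copied from the output above
node_counts = {"model100K": 80159, "model1M": 807881}

estimates = {name: count * BYTES_PER_NODE for name, count in node_counts.items()}
for name, size in estimates.items():
    print(f"{name}: {size} bytes = {readable_mb(size)}")
```

Both estimates round to the pickle sizes measured earlier (4.9 MB and 49.3 MB), so the pickle size is consistent with roughly 64 bytes per tree node.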
