如果我们查看50k元素以下集合的调整大小行为:

>>> import sys
>>> s = set()
>>> seen = {}
>>> for i in range(50_000):
...     size = sys.getsizeof(s)
...     if size not in seen:
...         seen[size] = len(s)
...         print(f"{size=} {len(s)=}")
...     s.add(i)
... 
size=216 len(s)=0
size=728 len(s)=5
size=2264 len(s)=19
size=8408 len(s)=77
size=32984 len(s)=307
size=131288 len(s)=1229
size=524504 len(s)=4915
size=2097368 len(s)=19661

此模式与quadrupling of the backing storage size once the set is 3/5ths full一致,外加PySetObject的一些假定恒定的开销:

>>> for i in range(9, 22, 2):
...     print(2**i + 216)
... 
728
2264
8408
32984
131288
524504
2097368

即使对于更大的集合,类似的模式也会继续,但调整大小系数将切换为加倍,而不是四倍.

小集合的报告大小是一个离群值.不是大小344字节,即16*8+216(新创建的空集的存储数组有8个槽可用,直到第一次调整大小至32个槽为止),只有216个字节被报告为sys.getsizeof.

我遗漏了什么?这些小集合是如何存储的,以便它们只使用216字节而不是344字节?

推荐答案

我们将判断小集合如何使用216个字节.

首先,用下面的C struct 来表示一个set的对象.

typedef struct {
    PyObject_HEAD

    Py_ssize_t fill;            /* Number active and dummy entries*/
    Py_ssize_t used;            /* Number active entries */

    /* The table contains mask + 1 slots, and that's a power of 2.
     * We store the mask instead of the size because the mask is more
     * frequently needed.
     */
    Py_ssize_t mask;

    /* The table points to a fixed-size smalltable for small tables
     * or to additional malloc'ed memory for bigger tables.
     * The table pointer is never NULL which saves us from repeated
     * runtime null-tests.
     */
    setentry *table;
    Py_hash_t hash;             /* Only used by frozenset objects */
    Py_ssize_t finger;          /* Search finger for pop() */

    setentry smalltable[PySet_MINSIZE];
    PyObject *weakreflist;      /* List of weak references */
} PySetObject;

现在记住,getsizeof() calls the object’s __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector.

好的,set implements the __sizeof__个.

static PyObject *
set_sizeof(PySetObject *so, PyObject *Py_UNUSED(ignored))
{
    Py_ssize_t res;

    res = _PyObject_SIZE(Py_TYPE(so));
    if (so->table != so->smalltable)
        res = res + (so->mask + 1) * sizeof(setentry);
    return PyLong_FromSsize_t(res);
}

现在让我们判断一下线路

res = _PyObject_SIZE(Py_TYPE(so));

_PyObject_SIZE is just a macro,这将扩展到(typeobj)->tp_basicsize.

#define _PyObject_SIZE(typeobj) ( (typeobj)->tp_basicsize )

这段代码实质上是试图访问tp_basicsize slot to get the size in bytes of instances of the type,它只有sizeof(PySetObject) in case of set.

PyTypeObject PySet_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "set",                              /* tp_name */
    sizeof(PySetObject),                /* tp_basicsize */
    0,                                  /* tp_itemsize */
    # Skipped rest of the code for brevity.

我对set_sizeofC函数进行了以下更改.

static PyObject *
set_sizeof(PySetObject *so, PyObject *Py_UNUSED(ignored))
{
    Py_ssize_t res;

    unsigned long py_object_head_size = sizeof(so->ob_base); // Because PyObject_HEAD expands to PyObject ob_base;
    unsigned long fill_size = sizeof(so->fill);
    unsigned long used_size = sizeof(so->used);
    unsigned long mask_size = sizeof(so->mask);
    unsigned long table_size = sizeof(so->table);
    unsigned long hash_size = sizeof(so->hash);
    unsigned long finger_size = sizeof(so->finger);
    unsigned long smalltable_size = sizeof(so->smalltable);
    unsigned long weakreflist_size = sizeof(so->weakreflist);
    int is_using_fixed_size_smalltables = so->table == so->smalltable;

    printf("| PySetObject Fields   | Size(bytes) |\n");
    printf("|------------------------------------|\n");
    printf("|    PyObject_HEAD     |     '%zu'    |\n", py_object_head_size);
    printf("|      fill            |      '%zu'    |\n", fill_size);
    printf("|      used            |      '%zu'    |\n", used_size);
    printf("|      mask            |      '%zu'    |\n", mask_size);
    printf("|      table           |      '%zu'    |\n", table_size);
    printf("|      hash            |      '%zu'    |\n", hash_size);
    printf("|      finger          |      '%zu'    |\n", finger_size);
    printf("|    smalltable        |    '%zu'    |\n", smalltable_size); 
    printf("|    weakreflist       |      '%zu'    |\n", weakreflist_size);
    printf("-------------------------------------|\n");
    printf("|       Total          |    '%zu'    |\n", py_object_head_size+fill_size+used_size+mask_size+table_size+hash_size+finger_size+smalltable_size+weakreflist_size);
    printf("\n");
    printf("Total size of PySetObject '%zu' bytes\n", sizeof(PySetObject));
    printf("Is Using fixed size smalltables: '%s'\n", is_using_fixed_size_smalltables ? "True": "False");

    res = _PyObject_SIZE(Py_TYPE(so));
    if (so->table != so->smalltable)
        res = res + (so->mask + 1) * sizeof(setentry);
    return PyLong_FromSsize_t(res);
}

编译和运行这些更改为我提供了

>>> import sys
>>> 
>>> empty_set = set()
>>> sys.getsizeof(empty_set)
| PySetObject Fields   | Size(bytes) |
|------------------------------------|
|    PyObject_HEAD     |     '16'    |
|      fill            |      '8'    |
|      used            |      '8'    |
|      mask            |      '8'    |
|      table           |      '8'    |
|      hash            |      '8'    |
|      finger          |      '8'    |
|    smalltable        |    '128'    |
|    weakreflist       |      '8'    |
-------------------------------------|
|       Total          |    '200'    |

Total size of PySetObject '200' bytes
Is Using fixed size smalltables: 'True'
216

返回值是216字节,因为sys.getsize add 16 bytes of GC overhead.

但这里需要注意的是这一行.

|    smalltable        |    '128'    |

因为对于较小的表(在第一次调整大小之前),so->table只有a referencefixed size(8) so->smalltable(没有错误分配的内存),所以sizeof(PySetObject)足以获得大小,因为它还包括存储大小(128(16(size of setentry) * 8)).

现在,当调整大小时会发生什么.它构造entirely new table(malloc'ed)并使用so->smalltables中的that table instead,这意味着已调整大小的集合还执行128字节的重载(大小为fixed size small table)以及Malloc‘ed so->table的大小.

else {
        newtable = PyMem_NEW(setentry, newsize);
        if (newtable == NULL) {
            PyErr_NoMemory();
            return -1;
        }
    }

    /* Make the set empty, using the new table. */
    assert(newtable != oldtable);
    memset(newtable, 0, sizeof(setentry) * newsize);
    so->mask = newsize - 1;
    so->table = newtable;

Python相关问答推荐

Pandas 有条件轮班操作

不理解Value错误:在Python中使用迭代对象设置时必须具有相等的len键和值

如何在polars(pythonapi)中解构嵌套 struct ?

如何列举Pandigital Prime Set

当独立的网络调用不应该互相阻塞时,'

如果条件不满足,我如何获得掩码的第一个索引并获得None?

我对我应该做什么以及我如何做感到困惑'

实现自定义QWidgets作为QTimeEdit的弹出窗口

将输入聚合到统一词典中

如何使用SentenceTransformers创建矢量嵌入?

在方法中设置属性值时,如何处理语句不可达[Unreacable]";的问题?

为什么Python内存中的列表大小与文档不匹配?

如何为需要初始化的具体类实现依赖反转和接口分离?

Python OPCUA,modbus通信代码运行3小时后出现RuntimeError

pytest、xdist和共享生成的文件依赖项

对包含JSON列的DataFrame进行分组

删除另一个div中的特定div容器

如何在不遇到IndexError的情况下将基数10的整数转换为基数80?

如何在networkx图中提取和绘制直接邻居(以及邻居的邻居)?

如何在Python中使用Polars向SQLite数据库写入数据?