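I have a list of dicts that follows a consistent structure, where each dict holds a single list of integers. However, I need to make sure that the size in bytes of each dict (when converted to a JSON string) is less than a specified threshold.

If a dict exceeds the byte-size threshold, I need to chunk that dict's list of integers.

Attempt: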


import json

payload: list[dict] = [
    {"data1": [1,2,3,4]},
    {"data2": [8,9,10]},
    {"data3": [1,2,3,4,5,6,7]}
]

# Max size in bytes we can allow. This is static and a hard limit that is not variable.
MAX_SIZE: int = 25

def check_and_chunk(arr: list):

    def check_size_bytes(item) -> bool:
        # True when the item's JSON encoding is larger than MAX_SIZE bytes
        return len(json.dumps(item).encode("utf-8")) > MAX_SIZE

    def chunk(item, chunk_size: int = 2):
        # Yield successive slices of at most chunk_size elements
        for i in range(0, len(item), chunk_size):
            yield item[i:i + chunk_size]

    # First check if the entire payload is smaller than the MAX_SIZE
    if not check_size_bytes(arr):
        return arr

    # Let's find the items that are small and the items that are too big, respectively
    small, big = [], []
    for item in arr:
        (big if check_size_bytes(item) else small).append(item)

    # Modify the big items until they are small enough to be moved to the 'small' list
    for item in big:
        print(item)
    # This is where I am unsure of how best to proceed. I'd like to split the big dictionaries in the 'big' list so that each resulting piece is small enough to go into the 'small' result.

Example of a possible expected result:

payload: list[dict] = [
    {"data1": [1,2,3,4]},
    {"data2": [8,9,10]},
    {"data3": [1,2,3,4]},
    {"data3": [5,6,7]}
]

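Recommended answer

As discussed in the comments, one simple way to do this task is to recursively split each input list in half until the resulting dict meets the size requirement. This gives more evenly sized lists in the output, but it may produce more dicts than are strictly necessary (and more than a length-accumulation approach would produce).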

import json

def split_list_dict(dl, limit):
    def split_dict_list(dd, limit):
        # Each dict is assumed to hold exactly one key mapped to a list
        key, ll = next(iter(dd.items()))
        # json.dumps escapes non-ASCII by default, so len() equals the UTF-8 byte count
        if len(json.dumps(dd)) <= limit:
            yield dd
            return
        if len(ll) < 2:
            # Cannot split any further; without this guard the recursion would never end
            raise ValueError(f"{key!r} cannot be made to fit within {limit} bytes")
        split_point = len(ll) // 2
        yield from split_dict_list({key: ll[:split_point]}, limit)
        yield from split_dict_list({key: ll[split_point:]}, limit)

    for dd in dl:
        yield from split_dict_list(dd, limit)

MAX_SIZE: int = 25

payload: list[dict] = [
    {"data1": [1, 2, 3, 4]},
    {"long_data_name": [1, 2, 3, 4]},
    {"data3": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]},
    {"data4": [100, 200, -1, -10, 200, 300, 12, 13]},
]

print(list(split_list_dict(payload, MAX_SIZE)))

Output:

[
  {'data1': [1, 2, 3, 4]},
  {'long_data_name': [1]},
  {'long_data_name': [2]},
  {'long_data_name': [3]},
  {'long_data_name': [4]},
  {'data3': [1, 2, 3]},
  {'data3': [4, 5, 6]},
  {'data3': [7, 8, 9]},
  {'data3': [10, 11, 12]},
  {'data4': [100, 200]},
  {'data4': [-1, -10]},
  {'data4': [200, 300]},
  {'data4': [12, 13]}
]
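
The length-accumulation alternative mentioned above is not spelled out in the answer. The following is only a minimal sketch of one way it could look, assuming each dict holds lists of JSON-serializable values and that size is measured on the default `json.dumps` output. The names `json_size` and `chunk_by_accumulation` are hypothetical and do not appear in the original post.

```python
import json


def json_size(obj) -> int:
    """Byte length of obj when serialized with json.dumps defaults."""
    return len(json.dumps(obj).encode("utf-8"))


def chunk_by_accumulation(dicts, limit):
    """Greedily grow each chunk until adding one more value would exceed limit.

    Hypothetical helper: a sketch of the accumulation idea, not the answer's code.
    """
    for d in dicts:
        for key, values in d.items():
            current = []
            for value in values:
                # A value that does not fit on its own can never be placed
                if json_size({key: [value]}) > limit:
                    raise ValueError(
                        f"{key!r}: value {value!r} cannot fit within {limit} bytes"
                    )
                # Close the current chunk when the next value would overflow it
                if current and json_size({key: current + [value]}) > limit:
                    yield {key: current}
                    current = []
                current.append(value)
            # Emit the final chunk; empty input lists are passed through
            # unchanged (they are not size-checked in this sketch)
            if current or not values:
                yield {key: current}


payload = [
    {"data1": [1, 2, 3, 4]},
    {"long_data_name": [1, 2, 3, 4]},
    {"data3": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]},
    {"data4": [100, 200, -1, -10, 200, 300, 12, 13]},
]

for item in chunk_by_accumulation(payload, 25):
    print(item)
```

On the same payload and limit, this sketch should produce 12 dicts rather than 13: `data4` is packed as `[100, 200, -1]`, `[-10, 200]` and `[300, 12, 13]`. The trade-off is less even chunk sizes and repeated serialization of the growing chunk, which is acceptable for short lists but quadratic within each chunk.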
