I am reading multiple JSON objects into one DataFrame. The problem is that some of the columns contain lists. The data is also very large, so the solutions I found online are not usable: they are too slow and memory-inefficient.

Here is what my data looks like:

import pandas as pd

df = pd.DataFrame({
    'A': ['x1', 'x2', 'x3', 'x4'],
    'B': [['v1', 'v2'], ['v3', 'v4'], ['v5', 'v6'], ['v7', 'v8']],
    'C': [['c1', 'c2'], ['c3', 'c4'], ['c5', 'c6'], ['c7', 'c8']],
    'D': [['d1', 'd2'], ['d3', 'd4'], ['d5', 'd6'], ['d7', 'd8']],
    'E': [['e1', 'e2'], ['e3', 'e4'], ['e5', 'e6'], ['e7', 'e8']],
})
    A       B          C           D           E
0   x1  [v1, v2]    [c1, c2]    [d1, d2]    [e1, e2]
1   x2  [v3, v4]    [c3, c4]    [d3, d4]    [e3, e4]
2   x3  [v5, v6]    [c5, c6]    [d5, d6]    [e5, e6]
3   x4  [v7, v8]    [c7, c8]    [d7, d8]    [e7, e8]

This is the shape of my data: (441079, 12)

My desired output is:

    A       B          C           D           E
0   x1      v1         c1         d1          e1
0   x1      v2         c2         d2          e2
1   x2      v3         c3         d3          e3
1   x2      v4         c4         d4          e4
.....

EDIT: After this was marked as a duplicate, I would like to stress that this question asks for an efficient method of exploding multiple columns. The accepted answer can explode an arbitrary number of columns on very large datasets efficiently, which the answers to the other question failed to do (that is why I asked this question after testing those solutions).

Recommended answer

pandas >= 0.25

Assuming the lists in every column have the same length within each row, you can call Series.explode on each column.

df.set_index(['A']).apply(pd.Series.explode).reset_index()

    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8

The idea is to first set as the index all columns that must NOT be exploded, then reset the index afterwards.
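The set-index, explode, reset-index idea can be wrapped in a small helper when more than one column has to be left alone. This is only a sketch: the name explode_columns, its keep parameter, and the extra K column are made up for illustration, and the length check is an added safeguard for the equal-length assumption, not part of the original answer.

```python
import pandas as pd


def explode_columns(df, keep):
    """Explode every column not listed in `keep` (hypothetical helper)."""
    keep = list(keep)
    list_cols = [c for c in df.columns if c not in keep]

    # Safeguard (an assumption of this sketch): within each row, all
    # lists must have the same length, otherwise the exploded columns
    # would not line up.
    lengths = df[list_cols].apply(lambda s: s.map(len))
    if (lengths.nunique(axis=1) > 1).any():
        raise ValueError("lists within a row differ in length")

    # Columns to keep go into the index, the remaining columns are
    # exploded one by one, then the index becomes columns again.
    return (df.set_index(keep)
              .apply(pd.Series.explode)
              .reset_index())


# Small example with two columns that must not be exploded.
df = pd.DataFrame({
    'A': ['x1', 'x2'],
    'K': [1, 2],
    'B': [['v1', 'v2'], ['v3', 'v4']],
    'C': [['c1', 'c2'], ['c3', 'c4']],
})
out = explode_columns(df, keep=['A', 'K'])
print(out)
```

On pandas 1.3 or later, DataFrame.explode also accepts a list of column names, which may be a simpler option if that version is available.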


It is also faster.

%timeit df.set_index(['A']).apply(pd.Series.explode).reset_index()

%%timeit
(df.set_index('A')
   .apply(lambda x: x.apply(pd.Series).stack())
   .reset_index()
   .drop(columns='level_1'))


2.22 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.14 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
