I have two CSV files: one named web_file, with 25,000 rows, and another named inv_file, with 320,000 rows.

I need to read each row of column 1 of web_file, find all matching values in column 1 of inv_file, and then write the matching inv_file rows to a new CSV file.

A sample file with only 5-10 rows would not illustrate the problem, so I have listed some random values below as an example.

Sample web_file:

Inv_SKU,Web_SKU,Brand,Barcode
225481-34,225481-34,brand1,987654321
0486592,0486592,brand2,654871233
AB56412,AB56412,brand2,651273214
LL-123456,LL-123456,brand3,748912349
JLPD-65,JLPD-65,brand6,341541648
20143966,20143966,brand3,82193714
39585824,39585824,brand5,36837329
78066099,78066099,brand4,98398987
44381051,44381051,brand1,9090428
86529443,86529443,brand4,6861670
DF 5645 12,DF 5645 12,brand1,489456138
9845671325,9845671325,brand4,498451315
59634923,59634923,brand4,35828574
85290760,85290760,brand2,64562216
41217184,41217184,brand4,12816236
AE48915,AE48915,brand1,342536125
93981723,93981723,brand2,58155601

Sample inv_file:

Inv_SKU,Web_SKU,Brand,Barcode
0486592,0486592,brand2,654871233
LL-123456,LL-123456,brand3,748912349
9845671325,9845671325,brand4,498451315
OI3248967,OI3248967,brand2,891513211
AB56412,AB56412,brand2,651273214
DF 5645 12,DF 5645 12,brand1,489456138
225481-34,225481-34,brand1,987654321
123456789,123456789,brand5,654986413
9841531,9841531,brand3,543254512
AE48915,AE48915,brand1,342536125
JLPD-65,JLPD-65,brand6,341541648
MMMM,MMMM,brand7,384941542
23481-4323,23481-4323,brand3,489123157
98451321,98451321,brand4,498121354
23454152,23454152,brand2,894165123
10275690,10275690,brand2,25612670
20143966,20143966,brand3,82193714
59634923,59634923,brand4,35828574
65800253,65800253,brand5,72318134
67722613,67722613,brand6,93290033
92617199,92617199,brand7,95078073
15379652,15379652,brand1,56281224
85290760,85290760,brand2,64562216
78066099,78066099,brand4,98398987
41217184,41217184,brand4,12816236
87152990,87152990,brand4,95058925
73813369,73813369,brand1,2395994
50201544,50201544,brand1,9167830
93981723,93981723,brand2,58155601
39585824,39585824,brand5,36837329
29082963,29082963,brand3,23393947
23856043,23856043,brand8,57295562
74249006,74249006,brand8,83219065
94376071,94376071,brand8,94887004
14553763,14553763,brand8,14223230
44381051,44381051,brand1,9090428
7598085,7598085,brand1,48967969
56383025,56383025,brand2,68864452
44338055,44338055,brand4,47043853
86529443,86529443,brand4,6861670

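I tried using this code, but ended up with many duplicate rows. I want to avoid duplicates, because the files I actually work with are so large that the result ends up with millions of rows.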

with open('inv_file.csv', 'r') as f1, open('web_file.csv', 'r') as f2:
    inv_file = f1.readlines()
    web_file = f2.readlines()


with open('result.csv', 'r+') as f3:
    result_file = f3.readlines()

    while len(result_file) < len(web_file):
        for row in inv_file:
            for row1 in web_file:
                if row[0] in row1[0]:
                    f3.write(row1)
        break

Recommended answer

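You really should use the csv library to parse CSV files. One approach is to store a list of the web SKUs (hopefully I have that the right way round) and then check the inventory SKUs against it. This can be done efficiently by passing a generator to the csv writer's writerows() method.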

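import csv

# newline='' lets the csv module handle line endings itself
with open('inv_file.csv', 'r', newline='') as f1, open('web_file.csv', 'r', newline='') as f2, open('result.csv', 'w', newline='') as f3:
    # First column of every web_file row (the header value is included too)
    web_skus = [row[0] for row in csv.reader(f2)]
    # web_skus = set(row[0] for row in csv.reader(f2))  # use this line instead to remove duplicate web SKUs
    inv_file = csv.reader(f1)
    # Generator: yields each inventory row whose first column is a web SKU
    rows = (row for row in inv_file if row[0] in web_skus)

    writer = csv.writer(f3)
    writer.writerows(rows)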
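
For files of the size mentioned in the question, the set variant is probably the one to prefer, since a set membership test does not scan all 25,000 web SKUs for each of the 320,000 inventory rows. It is also worth noting why the code in the question matches far too much: readlines() returns strings, so row[0] and row1[0] are single characters, and the test only compares the first character of each line. The following is a minimal sketch of the set-based approach, not a definitive implementation. The function name filter_inventory and the in-memory sample data are assumptions made for illustration, and io.StringIO stands in for the real files.

```python
import csv
import io


def filter_inventory(web_lines, inv_lines, out):
    """Write inventory rows whose first column appears in the web file.

    Hypothetical helper for illustration; returns the number of rows written.
    """
    # A set drops duplicate web SKUs and gives fast membership tests.
    # "if row" skips blank lines, which csv.reader yields as empty lists.
    web_skus = {row[0] for row in csv.reader(web_lines) if row}

    writer = csv.writer(out)
    written = 0
    # Each inventory row is visited exactly once, so it is written at most once.
    for row in csv.reader(inv_lines):
        if row and row[0] in web_skus:
            writer.writerow(row)
            written += 1
    return written


# Small in-memory stand-ins for web_file.csv and inv_file.csv (assumed data,
# taken from the samples in the question, with one web SKU repeated).
web = io.StringIO(
    "Inv_SKU,Web_SKU,Brand,Barcode\n"
    "0486592,0486592,brand2,654871233\n"
    "AB56412,AB56412,brand2,651273214\n"
    "0486592,0486592,brand2,654871233\n"
)
inv = io.StringIO(
    "Inv_SKU,Web_SKU,Brand,Barcode\n"
    "0486592,0486592,brand2,654871233\n"
    "OI3248967,OI3248967,brand2,891513211\n"
    "AB56412,AB56412,brand2,651273214\n"
    "MMMM,MMMM,brand7,384941542\n"
)
result = io.StringIO()
count = filter_inventory(web, inv, result)
result_rows = result.getvalue().splitlines()
```

Because both files start with the same header line, the header's first column is itself in the set, so the header row is carried over to the output without special handling. With real files, the same function can be passed open file objects (opened with newline='') instead of the StringIO buffers.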
