I have a list of sample names and a list of file names that contain those sample names:

sample_names = ['SampleA', 'SampleB', 'SampleC']
file_names = ['/path/to/group1/bin_1/group_1_bin_1_SampleA.tsv', 
              '/path/to/group1/bin_1/group_1_bin_1_SampleB.tsv',
              '/path/to/group1/bin_1/group_1_bin_1_SampleC.tsv',
              '/path/to/group1/bin_2/group_1_bin_2_SampleA.tsv',
              '/path/to/group1/bin_2/group_1_bin_2_SampleB.tsv',
              '/path/to/group1/bin_2/group_1_bin_2_SampleC.tsv',
              '/path/to/group1/bin_3/group_1_bin_3_SampleA.tsv',
              '/path/to/group1/bin_3/group_1_bin_3_SampleB.tsv',
              '/path/to/group1/bin_3/group_1_bin_3_SampleC.tsv']

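I would like to match each sample name against all the file names and create a dictionary with multiple values per key, as shown in dictionary below. Any suggestions?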

dictionary = {'SampleA': ['/path/to/group1/bin_1/group_1_bin_1_SampleA.tsv',
                          '/path/to/group1/bin_2/group_1_bin_2_SampleA.tsv',
                          '/path/to/group1/bin_3/group_1_bin_3_SampleA.tsv'],
              'SampleB': ['/path/to/group1/bin_1/group_1_bin_1_SampleB.tsv',
                          '/path/to/group1/bin_2/group_1_bin_2_SampleB.tsv',
                          '/path/to/group1/bin_3/group_1_bin_3_SampleB.tsv'],
              'SampleC': ['/path/to/group1/bin_1/group_1_bin_1_SampleC.tsv',
                          '/path/to/group1/bin_2/group_1_bin_2_SampleC.tsv',
                          '/path/to/group1/bin_3/group_1_bin_3_SampleC.tsv']}

Recommended answer

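Your expected result is ambiguous. You call it a dictionary, but it is actually a list of invalid dictionaries. You can either get a list of single-key dictionaries, or a real dictionary with one key per sample name. In both cases, the value for each key needs to be a list or tuple in order to hold multiple file names: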

For a single dictionary (this makes the most sense):

# file_names is a flat list of paths, so iterate over it directly
result = {sn: [fn for fn in file_names if sn in fn]
          for sn in sample_names}

print(result) # single dictionary

{'SampleA': ['/path/to/group1/bin_1/group_1_bin_1_SampleA.tsv', 
             '/path/to/group1/bin_2/group_1_bin_2_SampleA.tsv', 
             '/path/to/group1/bin_3/group_1_bin_3_SampleA.tsv'],
 'SampleB': ['/path/to/group1/bin_1/group_1_bin_1_SampleB.tsv', 
             '/path/to/group1/bin_2/group_1_bin_2_SampleB.tsv', 
             '/path/to/group1/bin_3/group_1_bin_3_SampleB.tsv'], 
 'SampleC': ['/path/to/group1/bin_1/group_1_bin_1_SampleC.tsv', 
             '/path/to/group1/bin_2/group_1_bin_2_SampleC.tsv', 
             '/path/to/group1/bin_3/group_1_bin_3_SampleC.tsv']}

For a list of single-key dictionaries:

# file_names is a flat list of paths, so iterate over it directly
result = [{sn: [fn for fn in file_names if sn in fn]}
          for sn in sample_names]

print(result) # list of single-key dictionaries

[{'SampleA': ['/path/to/group1/bin_1/group_1_bin_1_SampleA.tsv',
              '/path/to/group1/bin_2/group_1_bin_2_SampleA.tsv', 
              '/path/to/group1/bin_3/group_1_bin_3_SampleA.tsv']},
 {'SampleB': ['/path/to/group1/bin_1/group_1_bin_1_SampleB.tsv', 
              '/path/to/group1/bin_2/group_1_bin_2_SampleB.tsv',
              '/path/to/group1/bin_3/group_1_bin_3_SampleB.tsv']},
 {'SampleC': ['/path/to/group1/bin_1/group_1_bin_1_SampleC.tsv', 
              '/path/to/group1/bin_2/group_1_bin_2_SampleC.tsv', 
              '/path/to/group1/bin_3/group_1_bin_3_SampleC.tsv']}]

[EDIT] New answer based on revised question:

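The approach in your self-answer works the same way (just on a flat list, without the nesting), but it can take noticeably longer if you have a large number of sample names and/or file names, because it scans the whole file list once per sample name.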

You can improve performance by going through the file list only once and using a regular expression to identify the sample-name keys:

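import re
from collections import defaultdict

# escape the names so any regex metacharacters in them are matched literally
pattern = re.compile("|".join(map(re.escape, sample_names)))
result = defaultdict(list)
for fn in file_names:               # single pass over the file list
    for key in pattern.findall(fn):
        result[key].append(fn)

print(dict(result))  # convert so it prints as a plain dict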

{'SampleA': ['/path/to/group1/bin_1/group_1_bin_1_SampleA.tsv',
             '/path/to/group1/bin_2/group_1_bin_2_SampleA.tsv',
             '/path/to/group1/bin_3/group_1_bin_3_SampleA.tsv'],
 'SampleB': ['/path/to/group1/bin_1/group_1_bin_1_SampleB.tsv',
             '/path/to/group1/bin_2/group_1_bin_2_SampleB.tsv',
             '/path/to/group1/bin_3/group_1_bin_3_SampleB.tsv'],
 'SampleC': ['/path/to/group1/bin_1/group_1_bin_1_SampleC.tsv',
             '/path/to/group1/bin_2/group_1_bin_2_SampleC.tsv',
             '/path/to/group1/bin_3/group_1_bin_3_SampleC.tsv']}

Using regular expressions also gives you finer control over pattern matching (e.g. if you need the sample name to come last and be followed by a specific extension).
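
As a rough sketch of that last point, the pattern below only accepts a sample name when it is the final underscore-separated token of the file name and is immediately followed by the extension. The `_` separator and the `.tsv` extension are assumptions read off the example data, and the helper name `group_by_sample` and the extra test paths are hypothetical, not part of the original answer.

```python
import re
from collections import defaultdict


def group_by_sample(sample_names, file_names, extension=".tsv"):
    """Group file names by the sample name that ends the file name."""
    # Escape each name so it is matched literally.
    alternatives = "|".join(re.escape(sn) for sn in sample_names)
    # The name must follow an underscore and be followed by the extension
    # at the very end of the string.
    pattern = re.compile(rf"_({alternatives}){re.escape(extension)}$")
    result = defaultdict(list)
    for fn in file_names:
        match = pattern.search(fn)
        if match:
            result[match.group(1)].append(fn)
    return dict(result)


sample_names = ["SampleA", "SampleB"]
file_names = [
    "/path/to/group1/bin_1/group_1_bin_1_SampleA.tsv",
    "/path/to/SampleB/group_1_bin_1_SampleA.tsv",       # 'SampleB' only in a directory name
    "/path/to/group1/bin_1/group_1_bin_1_SampleB.csv",  # wrong extension
]
groups = group_by_sample(sample_names, file_names)
print(groups)
```

With this stricter pattern, a sample name that only appears in a directory name, or in a file with a different extension, is ignored, whereas a plain `sn in fn` test or an unanchored `findall` would count it.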
