Python SAM对齐：提取查询序列中的特定区域及其在CIGAR字符串中的封闭部分

发布于10月05日

我需要对已进行全局比对的DNA序列的给定区域执行局部比对，并更新全球雪茄串的相应部分.

具体步骤如下:

1. Extract the region of interest from the "query" sequence according to the CIGAR string of the global alignment.
2. Perform the local alignment of the region and generate the local CIGAR string.
3. Extract from the global CIGAR string the operations "before" and "after" the region of interest.
4. Generate the new global CIGAR string

我想得到#1和#3的帮助

下面是一个具体的例子:

Input:
------
Reference sequence: ACGGCATCAGCGATCATCGGCATATCGAC (against which the global alignment was made)
Query sequence:     TCATCAAGCGTCGCCCTC
CIGAR string:       1S5M1I3M4D6M2D2M
CIGAR position:     5 (1-based, position of the leftmost mapped nucleotide in the reference)
Genomic region:     7-23 (1-based, inclusive)

以及对上述数据的解读:

Reference:   ACGGCA|TCA-GCGATCATCGGCAT|ATCGAC
Query:          .CA|TCAAGCG----TCGCCC-|-TC
CIGAR*:         SMM|MMMIMMMDDDDMMMMMMD|DMM
Position:        ↑

^{note:基因组区域在|个字符之间}

因此，在本例中，我需要从Query中提取TCAAGCGTCGCCC，从CIGAR*中提取SMM和DMM.

这是我笨拙的try :

import re
import itertools

# Input:
reference = 'ACGGCATCAGCGATCATCGGCATATCGAC'
query     = 'TCATCAAGCGTCGCCCTC'
cigar     = '1S5M1I3M4D6M2D2M'
region    = (7-1, 23)
position  = 5-1

# Now trying the extract the region in the query and the outer parts of the CIGAR:
x0, x1 = region       # range in reference
y0, y1 = (None, None) # range in query
c0, c1 = ([], [])     # cigar operations, before and after the region
x      = position     # current position in reference
y      = 0            # current position in query

for op in itertools.chain.from_iterable(itertools.repeat(m[1], int(m[0])) for m in re.findall(r'(\d+)([MIDS])', cigar)):

    if x < x0: 
        c0.append(op)
    elif x < x1:
        if y0 == None:
            y0 = y 
        else:
            y1 = y 
    else:
        c1.append(op)

    if op == 'M':
        x += 1
        y += 1
    elif op in ['S','I']:
        y += 1
    elif op == 'D':
        x += 1

y1 += 1

print( (''.join(c0), query[y0:y1], ''.join(c1)) )

('SMM', 'TCAAGCGTCGCCCT', 'DMM')

问题是我在查询区域的末尾得到了多余的T.

increments = { 'M': {'x': 1, 'y': 1}, 'I': {'x': 0, 'y': 1}, 'D': {'x': 1, 'y': 0}, 'S': {'x': 0, 'y': 1}, } for op in itertools.chain.from_iterable( itertools.repeat(m[1], int(m[0])) for m in re.findall(r'(\d+)([MIDS])', cigar) ): if x < x0: c0.append(op) elif x < x1: if increments[op]['y']: if y0 == None: y0 = y else: y1 = y else: c1.append(op) x += increments[op]['x'] y += increments[op]['y'] y1 += 1 print( (''.join(c0), query[y0:y1], ''.join(c1)) )

Python SAM对齐：提取查询序列中的特定区域及其在CIGAR字符串中的封闭部分

推荐答案

Python相关问答推荐

Python 3.12中的通用[T]类方法隐式类型检索

通过优化空间在Python中的饼图中添加标签

pandas DataFrame GroupBy.diff函数的意外输出

如何避免Chained when/then分配中的Mypy不兼容类型警告？

Excel图表-使用openpyxl更改水平轴与Y轴相交的位置(Python)

Pandas - groupby字符串字段并按时间范围 Select

从groupby执行计算后创建新的子框架

为什么以这种方式调用pd.ExcelWriter会创建无效的文件格式或扩展名？

实现自定义QWidgets作为QTimeEdit的弹出窗口

处理Gekko的非最优解

一个telegram 机器人应该发送一个测验如何做？""

Python—在嵌套列表中添加相同索引的元素，然后计算平均值

python的文件. truncate()意外地没有截断'

如何获得满足掩码条件的第一行的索引？

删除Dataframe中的第一个空白行并重新索引列

有了Gekko，可以创建子模型或将模型合并在一起吗？

仅取消堆叠最后三列

在聚合中使用python-polars时如何计算模式

为什么在生成时间序列时，元组索引会超出范围？

Python：在cmd中添加参数时的语法