我需要对已进行全局比对的DNA序列的给定区域执行局部比对,并更新全球雪茄串的相应部分.
具体步骤如下:
1. Extract the region of interest from the "query" sequence according to the CIGAR string of the global alignment.
2. Perform the local alignment of the region and generate the local CIGAR string.
3. Extract from the global CIGAR string the operations "before" and "after" the region of interest.
4. Generate the new global CIGAR string
我想得到#1和#3的帮助
下面是一个具体的例子:
Input:
------
Reference sequence: ACGGCATCAGCGATCATCGGCATATCGAC (against which the global alignment was made)
Query sequence: TCATCAAGCGTCGCCCTC
CIGAR string: 1S5M1I3M4D6M2D2M
CIGAR position: 5 (1-based, position of the leftmost mapped nucleotide in the reference)
Genomic region: 7-23 (1-based, inclusive)
以及对上述数据的解读:
Reference: ACGGCA|TCA-GCGATCATCGGCAT|ATCGAC
Query: .CA|TCAAGCG----TCGCCC-|-TC
CIGAR*: SMM|MMMIMMMDDDDMMMMMMD|DMM
Position: ↑
note:基因组区域在|
个字符之间
因此,在本例中,我需要从Query
中提取TCAAGCGTCGCCC
,从CIGAR*
中提取SMM
和DMM
.
这是我笨拙的try :
import re
import itertools
# Input:
reference = 'ACGGCATCAGCGATCATCGGCATATCGAC'
query = 'TCATCAAGCGTCGCCCTC'
cigar = '1S5M1I3M4D6M2D2M'
region = (7-1, 23)
position = 5-1
# Now trying the extract the region in the query and the outer parts of the CIGAR:
x0, x1 = region # range in reference
y0, y1 = (None, None) # range in query
c0, c1 = ([], []) # cigar operations, before and after the region
x = position # current position in reference
y = 0 # current position in query
for op in itertools.chain.from_iterable(itertools.repeat(m[1], int(m[0])) for m in re.findall(r'(\d+)([MIDS])', cigar)):
if x < x0:
c0.append(op)
elif x < x1:
if y0 == None:
y0 = y
else:
y1 = y
else:
c1.append(op)
if op == 'M':
x += 1
y += 1
elif op in ['S','I']:
y += 1
elif op == 'D':
x += 1
y1 += 1
print( (''.join(c0), query[y0:y1], ''.join(c1)) )
('SMM', 'TCAAGCGTCGCCCT', 'DMM')
问题是我在查询区域的末尾得到了多余的T
.