Problem

A common substring of a collection of strings is a substring of every member of the collection. We say that a common substring is a longest common substring if there does not exist a longer common substring. For example, “CG” is a common substring of “ACGTACGT” and “AACCGTATA”, but it is not as long as possible; in this case, “CGTA” is a longest common substring of “ACGTACGT” and “AACCGTATA”.

Note that the longest common substring is not necessarily unique; for a simple example, “AA” and “CC” are both longest common substrings of “AACC” and “CCAA”.

Given: A collection of k (k≤100) DNA strings of length at most 1 kbp each in FASTA format.

Return: A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)

Sample Dataset

>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA
Sample Output
AC

Solution

这道题要求找出最长的共有子序列,解题思路从最长开始找,一旦找到就是答案。而对于每一个长度,先从第一条序列里生成一个kmer集合,然后在其它的序列中寻找。

#!/usr/bin/env python3  
  
def readFASTA(file):
    f = open(file)
    lines = f.readlines()
    des = list()
    seqs = list()
    first_line = True
    for i in range(len(lines)):
        line = lines[i].rstrip()  
        if line[0] == '>' and first_line == True:
            des.append(line[1:])
            first_line = False
            seq = ''
        elif line[0] == '>' and first_line == False:
            des.append(line[1:])
            seqs.append(seq)
            seq = ''
        else:
            seq += line
    seqs.append(seq)      
    return des, seqs  
      
def lcs(strs):
    maxK = min([len(x) for x in strs])
    kmer = set()
    seq = strs[0]
    strs = strs[1:]      
    for k in reversed(range(maxK)):  
        for i in range(maxK+1-k):
            kmer.add(seq[i:i+k])          
        for kk in kmer:
            found = True
            for x in strs:  
                if x.find(kk) == -1:
                    found = False
                    break
            if found == True:  
                return(kk)
        kmer = set()  
  
if __name__ == "__main__":
  description, sequence = readFASTA("DATA/rosalind_lcsm.txt")
  print(lcs(sequence))

电梯