Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are ‘C’ or ‘G’. For example, the GC-content of “AGCTATAG” is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with ‘>’, followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with ‘>’ indicates the label of the next string.

In Rosalind’s implementation, a string in FASTA format will be labeled by the ID “Rosalind_xxxx”, where “xxxx” denotes a four-digit code between 0000 and 9999.

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

Sample Dataset

>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample Output

Rosalind_0808
60.919540

Solution

#!/usr/bin/env python3  
  
f=open("DATA/rosalind_gc.txt", "r")
seq = f.readlines()  
gc = 0  
n = 0  
max_gc = 0  
for s in seq:
    s = s.strip()  
    if s[0] == '>':  
        if n == 0:
            desc = s[1:]  
            continue
        else:  
            if gc*100/n > max_gc:
                max_gc = gc*100/n
                max_desc = desc
        n = 0
        gc = 0
        desc = s[1:]  
     else:
        n += len(s)
        gc += s.count('G')
        gc += s.count('C')  
  
if max_gc < gc*100/n:
    max_gc = gc*100/n
    max_desc = desc

print(max_desc)
print(max_gc)

这道题第一次出现了fasta文件,直接解析文件算gc含量,当我们第二次看到fasta文件的时候,就该写函数来解析了,而不是像上面这样子。预先上biopython的SeqIO来解析fasta,下次我们将自己写一个解析fasta的函数。

#!/usr/bin/env python3  
  
from Bio import SeqIO

fasta_sequences = SeqIO.parse("DATA/rosalind_gc.txt",'fasta')  
  
def gc(seq):
    return (seq.count('G') + seq.count('C'))/len(seq)

id_vector = []
gc_vector = []  
  
for fasta in fasta_sequences:
    id_vector.append(fasta.id)
    gc_vector.append(gc(str(fasta.seq)))

max_gc = max(gc_vector)
max_gc_idx = gc_vector.index(max_gc)
max_gc_id = id_vector[max_gc_idx]

print(max_gc_id)
print(max_gc * 100)