Frequencies not adding up to one

Question

I am writing a function that goes through a .fasta file of DNA sequences and builds a dictionary of nucleotide (nt) and dinucleotide (dnt) frequencies for each sequence in the file. I then append each dictionary to a list named "frequency". This is the piece of code that behaves strangely:

    for fasta in seq_file:
        freq = {}
        dna = str(fasta.seq)
        for base1 in ['A', 'T', 'G', 'C']:
            onefreq = float(dna.count(base1)) / len(dna)
            freq[base1] = onefreq
            for base2 in ['A', 'T', 'G', 'C']:
                dinucleotide = base1 + base2
                twofreq = float(dna.count(dinucleotide)) / (len(dna) - 1)
                freq[dinucleotide] = twofreq
        frequency.append(freq)

(By the way, I use Biopython so that I don't have to load the entire FASTA file into memory. The data is also ssDNA, so I don't have to consider the antisense strand for the dnt.)

The frequencies recorded for the single nt add up to 1.0, but the frequencies for the dnt do not. This is odd, since as far as I can see the two kinds of frequency are calculated in exactly the same way.

I left the diagnostic print statements and "control" variables in:

    for fasta in seq_file:
        freq = {}
        dna = str(fasta.seq)
        check = 0.0
        check2 = 0.0
        for base1 in ['A', 'T', 'G', 'C']:
            onefreq = float(dna.count(base1)) / len(dna)
            freq[base1] = onefreq
            check2 += onefreq
            for base2 in ['A', 'T', 'G', 'C']:
                dinucleotide = base1 + base2
                twofreq = float(dna.count(dinucleotide)) / (len(dna) - 1)
                check += twofreq
                print(twofreq)
                freq[dinucleotide] = twofreq
        print("\n")
        print(check, check2)
        print(len(dna))
        print("\n")
        frequency.append(freq)

to get this output (for a single sequence):

    0.0894168466523
    0.0760259179266
    0.0946004319654
    0.0561555075594
    0.0431965442765
    0.0423326133909
    0.0747300215983
    0.0488120950324
    0.0976241900648
    0.0483801295896
    0.0539956803456
    0.0423326133909
    0.0863930885529
    0.0419006479482
    0.0190064794816
    0.031101511879

    (0.9460043196544274, 1.0)
    2316

Here we can see the frequency of each of the 16 possible dnt, then the sum of all dnt frequencies (0.946) and the sum of all nt frequencies (1.0), and finally the length of the sequence.

Why don't the dnt frequencies add up to 1.0?

Thank you for your help. I am very new to Python, and this is my first question, so I hope it is acceptable.

+6

python python-2.7 biopython

Bantha May 27 '15 at 16:28

3 answers

To see your problem, try the following FASTA file:

    >test
    AAAAAA

and count the dinucleotide directly:

    "AAAAAA".count("AA")

You are getting:

    3

It should be:

    5

Cause

From the documentation: count returns the number of (non-overlapping) occurrences of substring sub in string s[start:end].
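The difference between the two kinds of counting can be checked directly. The following is a minimal sketch, not part of the original answer; the helper name `count_overlapping` is made up for illustration, and it is written so that it should run on both Python 2.7 and Python 3.

```python
def count_overlapping(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps (hypothetical helper)."""
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        # Advance by one character only, so overlapping matches are found too.
        start = seq.find(motif, start + 1)
    return count


print("AAAAAA".count("AA"))               # non-overlapping matches: 3
print(count_overlapping("AAAAAA", "AA"))  # overlapping matches: 5
```

A sequence of length 6 has 5 dinucleotide positions, so only the overlapping count can yield a frequency of 1.0 for "AA" here.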

Solution using Counter and a chunks function

    from collections import Counter

    from Bio import SeqIO

    def chunks(l, n):
        for i in xrange(0, len(l)-(n-1)):
            yield l[i:i+n]

    frequency = []
    input_file = "test.fasta"
    for fasta in SeqIO.parse(open(input_file), "fasta"):
        dna = str(fasta.seq)
        freq = Counter(dna)  # get counter of single bases
        freq.update(Counter(chunks(dna, 2)))  # update with counter of dinucleotides
        frequency.append(freq)

For "AAAAAA" you get:

    Counter({'A': 6, 'AA': 5})
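The Counter above holds raw counts, while the question asks for frequencies. One way to convert them is sketched below. The function name `to_frequencies` is an assumption made for illustration, and the sliding window is written inline with `range` so the snippet does not depend on Biopython or on `xrange`.

```python
from collections import Counter


def to_frequencies(counts, seq_len):
    """Turn nt and dnt counts for one sequence into frequencies (hypothetical helper)."""
    freqs = {}
    for motif, n in counts.items():
        # A sequence of length L has L windows of size 1 and L - 1 windows of size 2.
        windows = seq_len - len(motif) + 1
        freqs[motif] = n / float(windows)
    return freqs


dna = "AAAAAA"
counts = Counter(dna)
# Overlapping dinucleotides: one window starting at every position but the last.
counts.update(dna[i:i + 2] for i in range(len(dna) - 1))
freqs = to_frequencies(counts, len(dna))
print(freqs)
```

Note that a Counter only contains motifs that actually occur, so dinucleotides absent from the sequence are missing from the result rather than stored as 0.0.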

+3

Jose Ricardo Bustos M. May 27, '15 at 17:02

You scan the string far more often than you need to: 20 times (4 single-nucleotide counts plus 16 dinucleotide counts). This may not matter for small test sequences, but it will be noticeable as they get larger. I would recommend a different approach that fixes the overlap issue as a side effect:

    nucleotides = [ 'A', 'T', 'G', 'C' ]
    dinucleotides = [ x+y for x in nucleotides for y in nucleotides ]
    counts = { x : 0 for x in nucleotides + dinucleotides }

    # count the first nucleotide, which has no previous one
    n_nucl = 1
    prevn = dna[0]
    counts[prevn] += 1

    # count the rest, along with the pairs made with each previous one
    for nucl in dna[1:]:
        counts[nucl] += 1
        counts[prevn + nucl] += 1
        n_nucl += 1
        prevn = nucl

    total = 0.0
    for nucl in nucleotides:
        pct = counts[nucl] / float(n_nucl)
        total += pct
        print "{} : {} {}%".format(nucl, counts[nucl], pct)
    print "Total : {}%".format(total)

    total = 0.0
    for dnucl in dinucleotides:
        pct = counts[dnucl] / float(n_nucl - 1)
        total += pct
        print "{} : {} {}%".format(dnucl, counts[dnucl], pct)
    print "Total : {}%".format(total)

This approach scans the string only once, although it is admittedly more code ...
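The same single-pass idea can also be packaged as a function. The sketch below is one possible variant rather than the answer's own code: the name `count_single_pass` is hypothetical, it tracks the previous nucleotide with `None` so that an empty sequence does not fail on `dna[0]`, and it assumes the sequence contains only uppercase A, T, G and C.

```python
NUCLEOTIDES = "ATGC"


def count_single_pass(dna):
    """Count nt and overlapping dnt in one pass over dna (hypothetical helper)."""
    # Pre-fill all 4 + 16 keys so motifs that never occur are reported as 0.
    counts = {x: 0 for x in NUCLEOTIDES}
    counts.update({x + y: 0 for x in NUCLEOTIDES for y in NUCLEOTIDES})
    prev = None
    for nucl in dna:
        counts[nucl] += 1
        if prev is not None:
            # Pair formed by the previous nucleotide and the current one.
            counts[prev + nucl] += 1
        prev = nucl
    return counts


counts = count_single_pass("AAAAAA")
print(counts["A"], counts["AA"], counts["AT"])
```

Any other character, such as N, would raise a KeyError in this sketch, just as it would in the code it is based on.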

+2

twalberg May 27, '15 at 17:49

Data_addict · Accepted Answer · 2015-05-27T17:06:58+0000

str.count() does not count overlapping occurrences of the motif it is looking for.

Example:

If you have "AAAA" in your sequence and you are looking for the "AA" dinucleotide, you expect "AAAA".count("AA") to return 3, but it returns 2. So:

    print float('AAAA'.count('AA')) / (len('AAAA') - 1)
    0.666666

instead of 1.

You can simply change the line where you count the frequency:

    twofreq = len([i for i in range(len(dna)-1) if dna[i:i+2] == dinucleotide]) / float(len(dna) - 1)
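To check that this change makes the 16 dinucleotide frequencies add up to 1.0, the two counting strategies can be compared on a small sequence. This is a sketch only: the sequence is made up, and the helper names `overlapping_count` and `dnt_frequency_sum` are assumptions introduced for the comparison.

```python
BASES = ['A', 'T', 'G', 'C']


def overlapping_count(dna, dinucleotide):
    # Same test as the replacement line: check every start position.
    return len([i for i in range(len(dna) - 1) if dna[i:i + 2] == dinucleotide])


def dnt_frequency_sum(dna, counter):
    """Sum the 16 dinucleotide frequencies computed with the given counter."""
    total = 0.0
    for base1 in BASES:
        for base2 in BASES:
            total += counter(dna, base1 + base2) / float(len(dna) - 1)
    return total


dna = "AAAATTGGGCCA"  # made-up sequence containing homopolymer runs
print(dnt_frequency_sum(dna, str.count))          # below 1.0: overlapping matches are missed
print(dnt_frequency_sum(dna, overlapping_count))  # 1.0 up to floating-point rounding
```

In this example str.count finds 9 of the 11 dinucleotide positions, because the overlapping matches inside "AAAA" and "GGG" are skipped. The same effect, spread over a 2316-base sequence, would explain a sum of 0.946.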
