I am writing a function that goes through a .fasta file of DNA sequences and builds a dictionary of nucleotide (nt) and dinucleotide (dnt) frequencies for each sequence in the file. I then store each dictionary in a list named "frequency". This is the piece of code that behaves strangely:
    for fasta in seq_file:
        freq = {}
        dna = str(fasta.seq)
        for base1 in ['A', 'T', 'G', 'C']:
            onefreq = float(dna.count(base1)) / len(dna)
            freq[base1] = onefreq
            for base2 in ['A', 'T', 'G', 'C']:
                dinucleotide = base1 + base2
                twofreq = float(dna.count(dinucleotide)) / (len(dna) - 1)
                freq[dinucleotide] = twofreq
        frequency.append(freq)
(By the way, I use Biopython, so I don't need to load the entire FASTA file into memory. The data is also ssDNA, so I don't have to consider the antisense strand for the dnt.)
The frequencies recorded for the single nt add up to 1.0, but the frequencies for the dnt do not. This is odd, since as far as I can see the two kinds of frequency are calculated in the same way.
I left the diagnostic print statements and "control" variables in:
    for fasta in seq_file:
        freq = {}
        dna = str(fasta.seq)
        check = 0.0
        check2 = 0.0
        for base1 in ['A', 'T', 'G', 'C']:
            onefreq = float(dna.count(base1)) / len(dna)
            freq[base1] = onefreq
            check2 += onefreq
            for base2 in ['A', 'T', 'G', 'C']:
                dinucleotide = base1 + base2
                twofreq = float(dna.count(dinucleotide)) / (len(dna) - 1)
                check += twofreq
                print(twofreq)
                freq[dinucleotide] = twofreq
        print("\n")
        print(check, check2)
        print(len(dna))
        print("\n")
        frequency.append(freq)
This produces the following output (for a single sequence):
    0.0894168466523
    0.0760259179266
    0.0946004319654
    0.0561555075594
    0.0431965442765
    0.0423326133909
    0.0747300215983
    0.0488120950324
    0.0976241900648
    0.0483801295896
    0.0539956803456
    0.0423326133909
    0.0863930885529
    0.0419006479482
    0.0190064794816
    0.031101511879
    (0.9460043196544274, 1.0)
    2316
Here we can see the frequency of each of the 16 possible dnt, the sum of all dnt frequencies (0.946), the sum of all nt frequencies (1.0), and the length of the sequence (2316).
Why don't the dnt frequencies add up to 1.0?
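One thing that may be worth checking: Python's `str.count` counts non-overlapping occurrences of a substring, so a run such as "AAA" contains only one counted "AA" even though two dinucleotide windows match. The sketch below isolates the dnt counting step on a made-up toy sequence (not taken from my file) and compares `str.count` with a sliding-window count. The helper `count_overlapping` and the string `toy` are hypothetical names introduced only for this illustration.

```python
def count_overlapping(seq, pair):
    """Count occurrences of `pair` in `seq` with a sliding window (overlaps allowed)."""
    width = len(pair)
    return sum(1 for i in range(len(seq) - width + 1) if seq[i:i + width] == pair)


toy = "AAAATTTG"  # made-up toy sequence, 8 nt, so 7 dinucleotide windows

bases = ['A', 'T', 'G', 'C']
pairs = [b1 + b2 for b1 in bases for b2 in bases]

# Total dnt count using str.count (non-overlapping matches only)
non_overlapping = sum(toy.count(p) for p in pairs)

# Total dnt count using the sliding-window helper
overlapping = sum(count_overlapping(toy, p) for p in pairs)

windows = len(toy) - 1
print("str.count total:      %d of %d windows" % (non_overlapping, windows))
print("sliding-window total: %d of %d windows" % (overlapping, windows))
```

If the two totals differ on the toy string, homopolymer runs in the real sequence (AA, TT, GG, CC) would be a plausible source of the missing fraction in the dnt sum.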
Thank you for your help. I am very new to python, and this is my first question, so I hope this is acceptable.