Binning sequence reads by GC content

I would like to "bin" (split into separate files) a multi-fasta nucleotide sequence file (for example, a Roche-454 run of ~500,000 reads with an average read length of 250 bp). I would like the bins to be based on the GC content of each read. The output would be 8 multi-fasta files:

<20% GC

21-30% GC

31-40% GC

41-50% GC

51-60% GC

61-70% GC

71-80% GC

> 80% GC

Does anyone know of a script or program that does this already? If not, can someone suggest how to sort a multi-fasta file by GC content (which I can then split into the appropriate bins)?

+3

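In R/Bioconductor: (a) load the library, (b) read the fasta file, (c) compute gc%, (d) cut the GC content into bins, and (e) write the reads of each bin to a separate file.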

## load
library(ShortRead)
## input
fa = readFasta("/path/to/fasta.fasta")
## gc content. 'abc' is a matrix, [, c("G", "C")] selects two columns
abc = alphabetFrequency(sread(fa), baseOnly=TRUE)
gc = rowSums(abc[,c("G", "C")]) / rowSums(abc)
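## cut gc content into the eight bins from the question (breaks at 0, .2, .3, ..., .8, 1);
## include.lowest=TRUE keeps reads with 0% GC in the first bin instead of giving NA
bin = cut(gc, c(0, seq(.2, .8, .1), 1), include.lowest=TRUE)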
## output, [bin==lvl] selects the reads whose 'bin' value is lvl
for (lvl in levels(bin)) {
    fileName = sprintf("%s.fasta", lvl)
    writeFasta(fa[bin==lvl], file=fileName)
}

To install R/Bioconductor, see http://bioconductor.org/install. With 454 data the script runs quickly (about 7s for 260 ...).
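For readers who think in Python rather than R, the binning step can be sketched with the standard library. R's cut() places each value in a right-closed interval between consecutive breaks, and `bisect_left` does the same job here. The names `BREAKS`, `LABELS` and `gc_bin_label` are made up for this illustration, and putting a read with exactly 20% GC into the lowest bin is an assumption, since the question leaves that boundary open.

```python
from bisect import bisect_left

# Interior break points, in percent GC, for the eight bins in the question.
BREAKS = [20, 30, 40, 50, 60, 70, 80]
# Hypothetical labels, one per bin (one more label than there are breaks).
LABELS = ["0-20", "21-30", "31-40", "41-50",
          "51-60", "61-70", "71-80", "81-100"]

def gc_bin_label(gc_percent):
    """Return the label of the right-closed interval containing gc_percent."""
    # bisect_left counts the breaks strictly below gc_percent, so a value
    # sitting exactly on a break (e.g. 30.0) stays in the lower bin.
    return LABELS[bisect_left(BREAKS, gc_percent)]
```

With this rule a read at 30.0% GC lands in "21-30" and a read at 30.1% lands in "31-40".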

+2

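You can use Python with Biopython or Perl with Bioperl to read the FASTA file. There is a script for computing GC content with Bioperl, and Biopython has a function for it. Compute the GC content of each read, then go through the reads and write each one to the file for its GC bin.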
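As a rough illustration of that workflow, here is a sketch that uses only the Python standard library instead of Biopython or Bioperl. It reads FASTA records, stores the GC percentage of each read in a dictionary, and orders the read names by GC. The helper names `read_fasta` and `gc_percent` and the three demo reads are invented for the example, and it assumes read names are unique.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def gc_percent(seq):
    """GC content in percent; an empty sequence counts as 0."""
    s = seq.upper()
    return 100.0 * (s.count("G") + s.count("C")) / len(s) if s else 0.0

# Tiny in-memory stand-in for a real multi-fasta file.
demo = io.StringIO(">r1\nATAT\nATAT\n>r2\nGGCC\n>r3\nATGC\n")

# Map each read name to its GC percentage, then order names by GC.
gc_by_read = {header: gc_percent(seq) for header, seq in read_fasta(demo)}
ordered = sorted(gc_by_read, key=gc_by_read.get)
```

For a real file, replace `demo` with `open("reads.fasta")` (a placeholder name). Holding only names and GC values in memory keeps this workable for a run of a few hundred thousand reads.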

?

+1

As I understand it, you need something like this (Python):

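def GC(seq): # determine the GC content, in percent
    s = seq.upper()
    if not s: # avoid dividing by zero on an empty record
        return 0.0
    return 100.0 * (s.count('G') + s.count('C')) / len(s)

def bin(gc): # get the number (1-8) of the 'bin' for this value of GC content
    # note: this name shadows Python's built-in bin()
    if gc < 20: return 1
    elif gc > 80: return 8
    else:
        return int(gc/10) # 20-29 -> 2, 30-39 -> 3, ..., 70-79 -> 7, exactly 80 -> 8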

Then you just need to read the entries from the file, calculate the GC content of each, find the bin it belongs to, and write the entry to the appropriate file. The following example implements this with the Python package that we use in the lab:

from pyteomics import fasta

def split_to_bin_files(multifile):
    """Reads a file and writes the entries to appropriate 'bin' files.
    `multifile` must be a filename (str)"""

    for entry in fasta.read(multifile):
        fasta.write((entry,), (multifile+'_bin_'+
                    str(bin(GC(entry[1])))))

Then you just call it like split_to_bin_files('mybig.fasta').
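If pyteomics is not available, the same idea can be sketched with the standard library alone. This version takes an iterable of (header, sequence) pairs and keeps one output file open per bin, so no file is reopened for every entry. The names `gc_percent`, `bin_number` and `split_records` are illustrative rather than part of any package, the binning rule mirrors the one above, and the demo writes into a temporary directory.

```python
import os
import tempfile
from contextlib import ExitStack

def gc_percent(seq):
    """GC content in percent; an empty sequence counts as 0."""
    s = seq.upper()
    return 100.0 * (s.count("G") + s.count("C")) / len(s) if s else 0.0

def bin_number(gc):
    """Bin 1 for <20% GC, bin 8 for >=80%, else the tens digit (2-7)."""
    if gc < 20:
        return 1
    if gc > 80:
        return 8
    return int(gc // 10)

def split_records(records, prefix):
    """Write (header, sequence) pairs to '<prefix>_bin_<n>' files.

    Returns a dict mapping bin number to the number of records written.
    """
    counts = {}
    handles = {}
    # ExitStack closes every file that was opened, even if an error occurs.
    with ExitStack() as stack:
        for header, seq in records:
            n = bin_number(gc_percent(seq))
            if n not in handles:
                path = f"{prefix}_bin_{n}"
                handles[n] = stack.enter_context(open(path, "w"))
            handles[n].write(f">{header}\n{seq}\n")
            counts[n] = counts.get(n, 0) + 1
    return counts

# Demo on three made-up reads, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    reads = [("r1", "ATATATAT"), ("r2", "GGCC"), ("r3", "ATGC")]
    counts = split_records(reads, os.path.join(tmp, "demo.fasta"))
    written = sorted(os.listdir(tmp))
```

Only bins that actually receive a read get a file, so a run with no very AT-rich reads simply produces no `_bin_1` file.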

0

Source: https://habr.com/ru/post/1780447/

