SeqIO.parse on fasta.gz

Question

I'm new to coding and new to Python / Biopython; this is my first online question ever. How do I open a compressed fasta.gz file so I can extract information and perform calculations in my function? Here is a simplified example of what I'm trying to do (I tried different methods), followed by the error I get. The gzip call that I'm using does not seem to work.

    with gzip.open("practicezip.fasta.gz", "r") as handle:
        for record in SeqIO.parse(handle, "fasta"):
            print(record.id)

    Traceback (most recent call last):
      File "<ipython-input-192-a94ad3309a16>", line 2, in <module>
        for record in SeqIO.parse(handle, "fasta"):
      File "C:\Users\Anaconda3\lib\site-packages\Bio\SeqIO\__init__.py", line 600, in parse
        for r in i:
      File "C:\Users\Anaconda3\lib\site-packages\Bio\SeqIO\FastaIO.py", line 122, in FastaIterator
        for title, sequence in SimpleFastaParser(handle):
      File "C:\Users\Anaconda3\lib\site-packages\Bio\SeqIO\FastaIO.py", line 46, in SimpleFastaParser
        if line[0] == ">":
    IndexError: index out of range

+10

python bioinformatics gz biopython

MelBel88 Mar 13 '17 at 5:45

2 answers

Here is the solution if you want to process both plain text and compressed files:

    import gzip
    from mimetypes import guess_type
    from functools import partial
    from Bio import SeqIO

    input_file = 'input_file.fa.gz'
    encoding = guess_type(input_file)[1]  # uses file extension
    if encoding is None:
        _open = open
    elif encoding == 'gzip':
        _open = partial(gzip.open, mode='rt')
    else:
        raise ValueError('Unknown file encoding: "{}"'.format(encoding))

    with _open(input_file) as f:
        for record in SeqIO.parse(f, 'fasta'):
            print(record)

NOTE: this relies on the file having the correct file extension, which I think is a reasonable assumption almost all the time (and the errors are obvious if the assumption does not hold). However, it is also possible to check the actual contents of a file rather than relying on this assumption.
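One common way to check the contents rather than the extension is to look at the first two bytes, since every gzip stream starts with 0x1f 0x8b. The sketch below is only an illustration under that assumption, not necessarily the approach the answer had in mind; the helper name open_text is made up for this example.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def open_text(path):
    """Hypothetical helper: open `path` as text, decompressing if it looks gzipped."""
    # Peek at the first two bytes instead of trusting the file extension.
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt")  # text mode, so the parser receives str
    return open(path, "r")
```

The returned handle can be used in place of _open(input_file) above, for example with open_text(input_file) as f, and then passed to SeqIO.parse(f, 'fasta') as before.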

+3

Chris_rands Oct 16 '18 at 15:45

klim · Accepted Answer · 2017-03-13T08:48:06+0000

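Are you using Python 3?

If so, gzip.open with mode "r" opens the file in binary mode and yields bytes, while SeqIO.parse expects a text handle for FASTA. Changing the mode ("r" → "rt") may solve your problem.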

    import gzip
    from Bio import SeqIO

    with gzip.open("practicezip.fasta.gz", "rt") as handle:
        for record in SeqIO.parse(handle, "fasta"):
            print(record.id)
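To see what the mode changes, here is a small sketch using only the standard library. The file name demo.fasta.gz and its contents are made up for illustration. It writes a tiny gzipped FASTA file and reads the first line back in both modes: under Python 3, "r" gives bytes and "rt" gives str.

```python
import gzip
import os
import tempfile

# Hypothetical demo file; the name and contents are made up for illustration.
path = os.path.join(tempfile.mkdtemp(), "demo.fasta.gz")
with gzip.open(path, "wt", newline="\n") as out:
    out.write(">seq1\nACGT\n")

# Mode "r" means binary for gzip.open, so each line is a bytes object.
with gzip.open(path, "r") as handle:
    binary_line = handle.readline()

# Mode "rt" decodes the stream, so each line is a str.
with gzip.open(path, "rt") as handle:
    text_line = handle.readline()

print(type(binary_line).__name__, type(text_line).__name__)

# Indexing bytes gives an int, so a comparison with ">" can never be true.
print(binary_line[0] == ">", text_line[0] == ">")
```

A parser that compares line[0] with ">" therefore cannot recognise header lines read in binary mode, which is consistent with the failure in the question.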
