Reading a file block by block using a specified delimiter in Python

I have an input_file.fa file similar to this (FASTA):

> header1 description
data data
data
>header2 description
more data
data
data

I want to read the file one block at a time, so that each block contains one header and its corresponding data; for example, block 1:

> header1 description
data data
data

Of course, I could just read in the whole file and split it:

with open("1.fa") as f:
    for block in f.read().split(">"):
        pass

But I want to avoid reading the entire file into memory, because the files are often large.

I can read the file line by line:

with open("input_file.fa") as f:
    for line in f:
        pass

But ideally, I want something like this:

with open("input_file.fa", newline=">") as f:
    for block in f:
        pass

But I get an error, since open() only accepts None, '', '\n', '\r' and '\r\n' as newline values:

ValueError: illegal newline value: '>'

I also tried using the csv module, but without success.

Is there a way to achieve this in Python 3, e.g. with a generator/iterator? Ideally something like:

with open("input_file.fa") as f:
    blocks = magic_generator_split_by_>
    for block in blocks:
        pass

If that is not possible, any hints toward a workaround would be appreciated.

One option is to write a generator function that yields one group of lines at a time:

def get_groups(seq, group_by):
    data = []
    for line in seq:
        # Here the `startswith()` logic can be replaced with other
        # condition(s) depending on the requirement.
        if line.startswith(group_by):
            if data:
                yield data
                data = []
        data.append(line)

    if data:
        yield data

with open('input_file.fa') as f:
    for i, group in enumerate(get_groups(f, ">"), start=1):
        print("Group #{}".format(i))
        print("".join(group))

Output:

Group #1
> header1 description
data data
data

Group #2
>header2 description
more data
data
data

For parsing FASTA files specifically, also consider Biopython.
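
A minimal sketch of that route, assuming the biopython package is installed; SeqIO.parse streams records lazily, so it also avoids loading the whole file:

from Bio import SeqIO

# SeqIO.parse returns a lazy iterator of SeqRecord objects,
# so large files are never read into memory all at once.
for record in SeqIO.parse("input_file.fa", "fasta"):
    print(record.id)           # e.g. "header1"
    print(record.description)  # e.g. "header1 description"
    print(record.seq)          # the concatenated sequence data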

Another option is itertools.groupby with a stateful key function:

from itertools import groupby


def make_grouper():
    counter = 0
    def key(line):
        nonlocal counter
        if line.startswith('>'):
            counter += 1  # a new header starts a new section
        return counter    # same key for every line of one section
    return key

Usage:

with open('filename') as f:
    for k, group in groupby(f, key=make_grouper()):
        fasta_section = ''.join(group)   # or list(group)
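
To see what drives the grouping, here is a small illustration using a hand-made list of lines from the sample file; runs of equal consecutive keys are what make groupby() emit one section per header:

lines = ["> header1 description\n", "data data\n", "data\n",
         ">header2 description\n", "more data\n"]
key = make_grouper()
print([key(line) for line in lines])  # [1, 1, 1, 2, 2]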

The join is only needed if you want each section as a single string; to process a section line by line instead:

with open('filename') as f:
    for k, group in groupby(f, key=make_grouper()):
        # parse the ">header description" line
        header, description = next(group)[1:].split(maxsplit=1)
        for line in group:
            pass  # handle the rest of the section line by line

A third answer takes the same line-by-line approach, but accumulates each block into a single string:

def read_blocks(file):
    block = ''
    for line in file:
        if line.startswith('>') and block:
            yield block
            block = ''
        block += line
    yield block


with open('input_file.fa') as f:
    for block in read_blocks(f):
        print(block)

This reads the file line by line and hands back the blocks via the yield statement. The generator is lazy, so you don't have to worry about memory even for big files.
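
As a closing note, all of the answers above rely on the delimiter appearing at the start of a line. If the delimiter could occur anywhere in the stream, a chunk-based generator is one possible sketch (split_stream and the chunk size are illustrative choices, not a standard API):

def split_stream(f, delimiter=">", chunk_size=65536):
    """Yield delimiter-separated blocks lazily, holding at most
    one block plus one read chunk in memory at a time."""
    # Note: assumes a single-character delimiter; a multi-character
    # one could straddle a chunk boundary and would need extra care.
    buf = ''
    while True:
        chunk = f.read(chunk_size)
        if not chunk:  # end of file
            break
        buf += chunk
        while delimiter in buf:
            block, buf = buf.split(delimiter, 1)
            if block:  # skip the empty block before a leading ">"
                yield block
    if buf:
        yield buf

with open('input_file.fa') as f:
    for block in split_stream(f):
        pass  # each block is header + data, without the leading ">"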


Source: https://habr.com/ru/post/1694140/

