Using Biopython (Python) to extract a sequence from a FASTA file

OK, so I need to extract part of each sequence from a FASTA file using Python (Biopython, http://biopython.org/DIST/docs/tutorial/Tutorial.html ).

I need to get the first 10 bases from each sequence and put them in one file, keeping the sequence information from the FASTA format. Worst case, I could just use the bases alone if there is no way to keep the sequence information. So here is an example:

>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG

I need a way to get the first 10 bases (and then I plan to do the same for the last 10 bases). The tutorial is pretty thorough, but I'm new to this, and since it doesn't cover this case, I'm not even sure it is possible. Thanks for any help you can give.

2 answers

Biopython is just perfect for such tasks. A SeqRecord object stores the sequence (as a Seq object) together with the information about it. Reading the FASTA file format is straightforward. You can treat the sequence like a simple list and therefore access specific positions directly:

    from Bio import SeqIO

    with open("outfile.txt", "w") as f:
        for seq_record in SeqIO.parse("infile.fasta", "fasta"):
            f.write(str(seq_record.id) + "\n")
            f.write(str(seq_record.seq[:10]) + "\n")   # first 10 base positions
            f.write(str(seq_record.seq[-10:]) + "\n")  # last 10 base positions
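Since the question also asks to keep the FASTA header information, here is a minimal sketch of one way to do that: slicing the SeqRecord itself (rather than its .seq) returns a new SeqRecord that keeps the ID and description, and SeqIO.write can then emit it in FASTA format. The two-record input, the helper name head_records, and the use of StringIO in place of real files are assumptions made only to keep the example self-contained.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical two-record input standing in for infile.fasta.
fasta_text = """>seq1 first example record
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGG
>seq2 second example record
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGG
"""


def head_records(handle, n=10):
    """Yield each record truncated to its first n bases (hypothetical helper)."""
    for record in SeqIO.parse(handle, "fasta"):
        # Slicing a SeqRecord gives a new SeqRecord with the same id and description.
        yield record[:n]


out = StringIO()  # stands in for an output file opened for writing
count = SeqIO.write(head_records(StringIO(fasta_text)), out, "fasta")
result = out.getvalue()
print(result)
```

For the last 10 bases, the same pattern should work with record[-n:] in place of record[:n].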

The Biopython Seq object behaves much like a string or an array, so you can slice out a subsection and get a new Seq object back. Assuming you have read the records into a dictionary of SeqRecord objects (called SeqDict in the code at the end of this answer), you can simply specify the start and end positions, where seq_id is the ID of the record you want:

    SeqDict[seq_id][start:end].seq

This gives you the Seq object for the region between the start and end positions, which are integers. From memory, there are some quirks in how the start and end are indexed (Python slices are zero-based and exclude the end position), so play around with it to get the idea. You can also specify:

    SeqDict[seq_id][:end].seq

to get the sequence from the beginning of the SeqRecord.
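To make the indexing rules concrete, here is a small sketch using a plain Python string. Seq slicing follows the same zero-based, end-exclusive convention as str, so the results carry over. The 30-base string is just the start of the example record from the question, and the variable names are illustrative.

```python
# First 30 bases of the example record from the question.
seq = "CGTAACAAGGTTTCCGTAGGTGAACCTGCG"

first_10 = seq[:10]   # indices 0..9
last_10 = seq[-10:]   # the final 10 characters
middle = seq[5:15]    # starts at index 5, stops before index 15

print(first_10, last_10, middle)
```

Note that seq[5:15] returns 10 characters, not 11, because the end index is excluded.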

For completeness, here is how to read the file into a dictionary:

    from Bio import SeqIO

    # Plain text mode; the old "rU" mode was removed in Python 3.11.
    with open(filename) as inputSeqFile:
        SeqDict = SeqIO.to_dict(SeqIO.parse(inputSeqFile, "fasta"))
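As a rough illustration of the dictionary approach, the sketch below builds the dictionary from an in-memory FASTA string and looks records up by ID. The record IDs seqA and seqB and the sequences are made up for the example. One caveat worth checking against your own data: SeqIO.to_dict raises ValueError when two records share the same ID, which would happen with the example file in the question because all three headers are identical.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical input with two distinct record IDs.
fasta_text = ">seqA example\nCGTAACAAGGTTTCCGTAGG\n>seqB example\nAATCCGGAGGACCGGTGTAC\n"

seq_dict = SeqIO.to_dict(SeqIO.parse(StringIO(fasta_text), "fasta"))

first_10 = str(seq_dict["seqA"].seq[:10])   # first 10 bases of seqA
last_10 = str(seq_dict["seqB"].seq[-10:])   # last 10 bases of seqB

# Records with a repeated ID cannot be stored under unique dictionary keys.
duplicate_text = ">seqA one\nACGT\n>seqA two\nTTTT\n"
try:
    SeqIO.to_dict(SeqIO.parse(StringIO(duplicate_text), "fasta"))
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(first_10, last_10, duplicate_rejected)
```

If the IDs in a file are not unique, iterating with SeqIO.parse avoids the problem, since it does not need dictionary keys.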

Hope this helps.


Source: https://habr.com/ru/post/1442894/

