Basically, the input files look like this:
> U51677 Hemin gene HMG1 (HMG1) human non-histone chromatin, complete
cds.
(some text)
> U51677 Hemin gene HMG1 (HMG1) human non-histone chromatin, complete
Length = 2575
(some text)
(etc....)
I wrote this to extract the line that starts with > and the number from the Length line:
import re
regex = re.compile("^(>.*)\r\n.*Length\s=\s(\d+)", re.MULTILINE)
match = regex.findall(sample_blast.read())
print match[0]
This works well for records where the Length line immediately follows the > line.
Then I tried re.DOTALL, which I expected to let . match newlines as well, so that (.*Length) would match a record whether or not it has an extra line.
regex = re.compile("^(>.*)\r\n.*(?:\r\n*.?)Length\s=\s(\d+)", re.MULTILINE|re.DOTALL)
But that does not work. I also tried combining re.MULTILINE and re.DOTALL without the pipe, and it still doesn't work.
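A minimal sketch of what re.DOTALL changes may help narrow this down. The sample string, the record names, and the CRLF line endings below are made up for illustration and are not the real input file.

```python
import re

# Made-up sample with CRLF line endings (an assumption about the real file).
sample = (
    "> record A, complete\r\n"
    "cds.\r\n"
    "Length = 2575\r\n"
    "> record B, complete\r\n"
    "Length = 1200\r\n"
)

# Without DOTALL, '.' never matches '\n', so the capture stays on one line.
plain = re.search(r"^(>.*)\r\n", sample, re.MULTILINE).group(1)

# With DOTALL, '.' also matches '\n', so the greedy '.*' runs to the end of
# the input and backtracks only as far as the last '\r\n' it can find.
dotall = re.search(r"^(>.*)\r\n", sample, re.MULTILINE | re.DOTALL).group(1)

print(repr(plain))   # just the first header line
print(repr(dotall))  # nearly the whole input, spanning several records
```

In this sketch the DOTALL capture swallows every record at once, which suggests the flag by itself is too broad for pulling out one header per record.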
So, the question is: how do I write a regular expression that matches each record and returns the desired groups, whether or not the record has the extra line? It would be nice if someone could show this using re.VERBOSE. Sorry for the long post, and thanks in advance for any help. :)
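For concreteness, here is a sketch of the kind of re.VERBOSE pattern being asked for. It rests on assumptions not confirmed by the real data: a record has at most one extra line between the > line and the Length line, that extra line does not itself start with >, and lines end in CRLF or LF. The names RECORD_RE, extract_records, and sample are hypothetical.

```python
import re

# Hypothetical pattern: header line, at most one extra line, then Length.
RECORD_RE = re.compile(
    r"""
    ^(>[^\r\n]*)                # group 1: the header line starting with '>'
    \r?\n                       # end of the header line (CRLF or LF)
    (?:(?!>)[^\r\n]*\r?\n)?     # optional extra line that is not a new header
    [ \t]*Length\s=\s(\d+)      # group 2: the digits after 'Length = '
    """,
    re.MULTILINE | re.VERBOSE,
)


def extract_records(text):
    """Return a list of (header, length) string pairs found in text."""
    return RECORD_RE.findall(text)


# Made-up sample: one record with an extra line, one without.
sample = (
    "> record A, complete\r\n"
    "cds.\r\n"
    "Length = 2575\r\n"
    "(some text)\r\n"
    "> record B, complete\r\n"
    "Length = 1200\r\n"
    "(some text)\r\n"
)

for header, length in extract_records(sample):
    print(header, length)
```

The sketch avoids re.DOTALL entirely and uses [^\r\n]* so that no part of the pattern can cross a line boundary unless a line break is written out explicitly.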