Using re.MULTILINE and re.DOTALL together python

Question

Basically the input files are as follows:

> U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
cds. #some records don't have this line (see below)
Length = 2575
(some text)
> U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
Length = 2575
(some text)
(etc....)

I wrote this to extract the line that starts with > and the number given for Length:

 import re
 regex = re.compile("^(>.*)\r\n.*Length\s=\s(\d+)", re.MULTILINE)
 match = regex.findall(sample_blast.read())
 print match[0]

which works well for records where the Length line comes immediately after the > line.

Then I tried re.DOTALL, which should let .* match across line breaks, so that any record matches (.*Length) whether or not it has the extra line.

 regex = re.compile("^(>.*)\r\n.*(?:\r\n*.?)Length\s=\s(\d+)", re.MULTILINE|re.DOTALL)

But that does not work. I also tried passing re.MULTILINE and re.DOTALL without the pipe, but it still doesn't work.

So the question is: how do I write a regular expression that matches every record and returns the desired groups, whether or not the record has the extra line? It would be nice if someone could show this with re.VERBOSE. Sorry for the long post, and thanks in advance for any help. :)

python regex

bioinformant Oct 28 '12 at 16:52

2 answers

David Wolever · Answer 1 · 2012-10-28T16:59:31+0000

Your problem is probably related to using \r\n. Instead, try using only \n:

 >>> import re
 >>> x = """
 ... > U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
 ... cds.  #some records don't have this line (see below)
 ... Length = 2575
 ... (some text)
 ...
 ... > U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete
 ... Length = 2575
 ... (some text)
 ...
 ... (etc...)
 ... """
 >>> re.search("^(>.*)\n.*(?:\n*.?)Length\s=\s(\d+)", x, re.MULTILINE | re.DOTALL)
 <_sre.SRE_Match object at 0x10c937e00>
 >>> _.group(2)
 '2575'
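
If it is unclear whether the text contains \r\n or \n line endings, one option is to accept both with \r?\n. The sketch below is written for Python 3 and is not part of the answer above: it avoids re.DOTALL and instead allows at most one optional extra line between the header and the Length line. The sample string, the second record (X00001, 1234) and the names record_re and sample are made up for illustration.

```python
import re

# Hypothetical sample with Windows-style (\r\n) line endings.
sample = (
    "> U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete\r\n"
    "cds.\r\n"
    "Length = 2575\r\n"
    "(some text)\r\n"
    "> X00001 made-up second record\r\n"
    "Length = 1234\r\n"
    "(some text)\r\n"
)

record_re = re.compile(
    r"^(>[^\r\n]*)\r?\n"       # group 1: header line, then \n or \r\n
    r"(?:[^\r\n]*\r?\n)?"      # at most one optional extra line
    r"Length\s=\s(\d+)",       # group 2: the length value
    re.MULTILINE,
)

matches = record_re.findall(sample)
print(matches)
```

Because the pattern never lets . cross a line break, a record whose Length line is more than one line below the header is skipped rather than matched against a later record.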

Also, your first .* is too greedy. Instead, try using ^(>.*?)$.*?Length\s=\s(\d+):

 >>> re.findall("^(>.*?)$.*?Length\s=\s(\d+)", x, re.MULTILINE | re.DOTALL)
 [('> U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete', '2575'), ('> U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete', '2575')]
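
To make the effect of greediness visible, here is a small Python 3 sketch that runs a greedy and a lazy pattern over the same text. The second record (X00001, 1234) is invented so that the two records can be told apart, and the variable names are illustrative only.

```python
import re

# Hypothetical sample: two records with different lengths.
sample = (
    "> U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete\n"
    "cds.\n"
    "Length = 2575\n"
    "(some text)\n"
    "\n"
    "> X00001 made-up second record\n"
    "Length = 1234\n"
    "(some text)\n"
)
flags = re.MULTILINE | re.DOTALL

# Greedy: with DOTALL the first .* runs across records and swallows
# everything up to the line before the last "Length".
greedy = re.findall(r"^(>.*)\n.*Length\s=\s(\d+)", sample, flags)

# Lazy: each .*? stops at the first position where the rest can match.
lazy = re.findall(r"^(>.*?)$.*?Length\s=\s(\d+)", sample, flags)

print(len(greedy), greedy[0][1])
print(lazy)
```

The greedy pattern yields a single match whose first group spans both records and whose length is the last one in the text; the lazy pattern yields one tuple per record.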

Martin Ender · Answer 2 · 2012-10-28T17:01:53+0000

Try this regex:

 "^(>[^\r\n]*).*?Length\s=\s(\d+)"

Use it with both flags set (combined with the pipe notation).

The first capturing group will match everything up to the first line break after the > (regardless of your operating system). Then .*? will match any characters until the first occurrence of Length is encountered. The rest is the same as in your first attempt.

The problem with your previous attempt seems to be that you are using .*, which can match anything and is greedy at the same time (so it will consume as much as it can, including the following Length = 2575).
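
The question also asked for a re.VERBOSE version. The following Python 3 sketch spells out the same pattern with one comment per part; the name record_re and the sample text (including the made-up record X00001) are assumptions for illustration, not taken from the real input file.

```python
import re

record_re = re.compile(
    r"""
    ^               # start of a line (needs re.MULTILINE)
    (>[^\r\n]*)     # group 1: the '>' header, up to but not including the line break
    .*?             # lazily skip anything, line breaks included (needs re.DOTALL)
    Length\s=\s     # the Length label with its surrounding whitespace
    (\d+)           # group 2: the length value
    """,
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)

# Hypothetical sample using \r\n line endings; the first record has the extra line.
sample = (
    "> U51677 Human non-histone chromatin protein HMG1 (HMG1) gene, complete\r\n"
    "cds.\r\n"
    "Length = 2575\r\n"
    "(some text)\r\n"
    "> X00001 made-up second record\r\n"
    "Length = 42\r\n"
)

records = record_re.findall(sample)
print(records)
```

In verbose mode, whitespace inside the pattern is ignored, which is why the spaces around = are written as \s rather than typed literally.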
