Regex: matching multiple patterns derived from a simple string

I have the following task: Starting with a character sequence of 30 characters (this is actually a DNA sequence, so as not to call it P30) I need to find in the text file all the lines starting with (^ agacatacag ...) with the exact P30 and then with 29 last characters 30, 28 and up to 10 characters. All I have to do is just delete the first character of the template and continue the search. For simplicity, I currently require an exact match, but resolving 1 mismatch for longer (20-30 characters) would be better.

My current, rather slow solution is to create a shell file with one truncated pattern per line and grep [1]. This means that I am reading huge 20 GB text files, and this may take a day +.

I can switch to python, create a list / tuple with all the necessary templates, and then read the file only once, looping instead of 20x for each sequence, speeding up with pypy.

  • Question 1: is there any regular expression that will be faster than such a loop?
  • Question 2: does it speed it up by switching to a faster, compiled language? (I'm trying to understand Dlang)

[1] Since this is a DNA sequence, and the search to be searched is in FASTQ format, I use fqgrep: https://github.com/indraniel/fqgrep with the tre library: https://github.com/laurikari/tre /

edit_1 example change (reduction template). Only the first few steps / shorter drawings are shown:

^abcde ^bcde ^cde 

Or, if you prefer this as DNA:

 ^GATACCA ^ATACCA ^TACCA 

edit_2 Simple grep doesn't actually cut. I need to post-process each 4-line FASTQ format, from which only line # 2 matches. If I do not use fqgrep, then I should:

read 4 lines of input
- check if line No. 2 (sequence) starts with any of 20 patterns (P30-P10)
- if I get a match, I need to cut the first N characters of lines # 2 and # 4, where N denotes the length of the matching pattern - print on the output / write to the lines of the file # 1- $ 4 in the match do nothing

For an internal solution, I can try to use GNU-parallel splitting of the input file, say, 4M from under the pieces and acceleration in that way. But if I want it to be possible for others, every new software, I ask end users to set ads to an additional level of complexity.

** edit 3 ** A simple example of your regular expressions and matching lines from Vyctor:

 starting P30 regex ^agacatacagagacatacagagacatacag matching sequence: ^agacatacagagacatacagagacatacagGAGGACCA P29: ^gacatacagagacatacagagacatacag matching sequence: ^gacatacagagacatacagagacatacagGACCACCA P28: ^acatacagagacatacagagacatacag matching sequence: ^acatacagagacatacagagacatacagGATTACCA 

I delete the DNA characters / bases on the left (or the 5-prime end in the DNA say), because this is how these sequences are degraded by real enzymes. The sequence of regular expressions is not interesting in itself once it is found. The desired result is a read sequence after regular expression. In the above examples, it is in UPERCASE, which can then be displayed in the next step of the genome. It should be emphasized that in addition to this toy example, I get longer, a priori unknown and diverse sequences after the regular expression pattern. In the real world, I don't need to deal with upper / lower case characters for DNA (everything is in upper case), but I will probably run into Ns (= unknown DNA base) in the sequences I'm looking for patterns. They can be ignored as a first approximation, but for a more sensitive version of the algorithm, they should probably be considered as simple inconsistencies. In an ideal scenario, it would be impossible to consider simple inconsistencies in a given position, but to calculate more complex fines taking into account the DNA sequence quality values ​​stored in line 4 of each record of 4 lines of a long sequence stored in the FASTQ format: http: //en.wikipedia. org / wiki / FASTQ_format # Quality

But this is a more complicated method, and so far the method of "accept only reads with excellent regular expression matching" has been quite good and has facilitated the analysis of the next steps.

+5
source share
2 answers

You can generate a regular expression programmatically to look below.
It is simply a progressive alternation of the beginning of a line or the next character.

What this will give you is the ability to do a single pass search.
All you have to do is get the string length of the match to tell you where you are.

Note. Use multi-line mode.

  # (?:^|a)(?:^|g)(?:^|a)(?:^|c)(?:^|a)(?:^|t)(?:^|a)(?:^|c)(?:^|a)(?:^|g)(?!^)0123456789 (?: ^ | a ) # P30 (?: ^ | g ) # P29 (?: ^ | a ) # P28 (?: ^ | c ) # P27 (?: ^ | a ) # P26 (?: ^ | t ) # P25 (?: ^ | a ) # P24 (?: ^ | c ) # P23 (?: ^ | a ) # P22 (?: ^ | g ) # P21 # .. # P11 (?! ^) # Not beginning of line 0123456789 # P10 - P1 

For example, matches these:

 agacatacag0123456789 cag0123456789 gacatacag0123456789 acatacag0123456789 acag0123456789 catacag0123456789 catacag0123456789 0123456789 atacag0123456789 tacag0123456789 ag0123456789 g0123456789 

But not these:

 agaatacag0123456789 ca0123456789 gacataca0123456789 acaacag0123456789 acg0123456789 cataca0123456789 caacag0123456789 123456789 atacg0123456789 tcag0123456789 ag012456789 g012356789 

Refresh

This is a graphic illustration of the fact that one regular expression can replace all 30.
There is no need for 30 separate regular expressions, all you need is 1 constant regular expression.
In this example, the cluster group is replaced by capture groups so you can see what it is doing.

  # (^|G)(^|A)(^|T)(^|A)(^|C)(^|C)(?!^)A ( ^ | G ) # (1) ( ^ | A ) # (2) ( ^ | T ) # (3) ( ^ | A ) # (4) ( ^ | C ) # (5) ( ^ | C ) # (6) (?! ^ ) # Not beginning of line A 

Input, 6 lines:

 GATACCA ATACCA TACCA ACCA CCA CA 

Exit:

  ** Grp 0 - ( pos 0 , len 7 ) GATACCA ** Grp 1 - ( pos 0 , len 1 ) G ** Grp 2 - ( pos 1 , len 1 ) A ** Grp 3 - ( pos 2 , len 1 ) T ** Grp 4 - ( pos 3 , len 1 ) A ** Grp 5 - ( pos 4 , len 1 ) C ** Grp 6 - ( pos 5 , len 1 ) C ------------------------------ ** Grp 0 - ( pos 9 , len 6 ) ATACCA ** Grp 1 - ( pos 9 , len 0 ) EMPTY ** Grp 2 - ( pos 9 , len 1 ) A ** Grp 3 - ( pos 10 , len 1 ) T ** Grp 4 - ( pos 11 , len 1 ) A ** Grp 5 - ( pos 12 , len 1 ) C ** Grp 6 - ( pos 13 , len 1 ) C ------------------------------ ** Grp 0 - ( pos 17 , len 5 ) TACCA ** Grp 1 - ( pos 17 , len 0 ) EMPTY ** Grp 2 - ( pos 17 , len 0 ) EMPTY ** Grp 3 - ( pos 17 , len 1 ) T ** Grp 4 - ( pos 18 , len 1 ) A ** Grp 5 - ( pos 19 , len 1 ) C ** Grp 6 - ( pos 20 , len 1 ) C ------------------------------ ** Grp 0 - ( pos 24 , len 4 ) ACCA ** Grp 1 - ( pos 24 , len 0 ) EMPTY ** Grp 2 - ( pos 24 , len 0 ) EMPTY ** Grp 3 - ( pos 24 , len 0 ) EMPTY ** Grp 4 - ( pos 24 , len 1 ) A ** Grp 5 - ( pos 25 , len 1 ) C ** Grp 6 - ( pos 26 , len 1 ) C ------------------------------ ** Grp 0 - ( pos 30 , len 3 ) CCA ** Grp 1 - ( pos 30 , len 0 ) EMPTY ** Grp 2 - ( pos 30 , len 0 ) EMPTY ** Grp 3 - ( pos 30 , len 0 ) EMPTY ** Grp 4 - ( pos 30 , len 0 ) EMPTY ** Grp 5 - ( pos 30 , len 1 ) C ** Grp 6 - ( pos 31 , len 1 ) C ------------------------------ ** Grp 0 - ( pos 35 , len 2 ) CA ** Grp 1 - ( pos 35 , len 0 ) EMPTY ** Grp 2 - ( pos 35 , len 0 ) EMPTY ** Grp 3 - ( pos 35 , len 0 ) EMPTY ** Grp 4 - ( pos 35 , len 0 ) EMPTY ** Grp 5 - ( pos 35 , len 0 ) EMPTY ** Grp 6 - ( pos 35 , len 1 ) C 
+2
source

If you correctly understood that you have a preset that you want to map:

 agacatacagagacatacagagacatacag 

And then match the lines that match:

 re: agacatacagagacatacagagacatacag 30: agacatacagagacatacagagacatacag 29: agacatacagagacatacagagacatacac 28: agacatacagagacatacagagacataccc 

You do not need regular expressions for this, you just need to find the difference between the two lines, and since this is DNA, I assume that the line abcde and aebcd has a difference of 4, because all the sequences should be in the right places.

If the order doesn't matter and you just want to look for strings that match with at least 28 characters, you can just count the differences in the strings .

 reg = 'agacatacagagacatacagagacatacag' for row in file: letters = diff_letters(reg, row.strip()) if letters == 30: # complete match elif letters == 29: # one different character # so on 

If you need a match that actually starts with the correct sequence, you can just get the difference points between the lines , and if the first difference is at >=28

 reg = 'agacatacagagacatacagagacatacag' for row in file: diffs = list(i for i,(a1,a2) in enumerate(zip(s1,s2)) if a1!=a2) if not len(diffs): difference = len(reg) else: difference = diffs[0] if difference == 30: # First difference is at last offset 
+2
source

Source: https://habr.com/ru/post/1208905/


All Articles