I have the following task: starting from a 30-character sequence (it is actually a DNA sequence, so let's call it P30), I need to find in a text file all lines starting with that pattern (^agacatacag...), first with the exact P30, then with its last 29 characters (P29), then 28, and so on down to 10 characters (P10). In other words, I just keep deleting the first character of the pattern and repeat the search. For simplicity I currently require an exact match, but allowing 1 mismatch for the longer patterns (20-30 characters) would be even better.
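To make the task concrete, the shrinking pattern set can be generated mechanically; a minimal Python sketch (the seed is just the toy 30-mer used later in this post):

```python
# Build the shrinking patterns P30, P29, ..., P10 by repeatedly
# dropping the first character of the seed sequence.
SEED = "agacatacagagacatacagagacatacag"  # toy 30-mer from this post

patterns = [SEED[i:] for i in range(len(SEED) - 10 + 1)]  # lengths 30 .. 10
# patterns[0] is P30, patterns[1] is P29, ..., patterns[-1] is P10
```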
My current, rather slow solution is to create a shell script with one truncated pattern per line and grep for each of them [1]. This means reading the huge 20 GB text files over and over, which can take a day or more.
I could switch to Python, build a list/tuple with all the needed patterns, and then read the file only once, looping over the patterns for each sequence instead of reading the file ~20 times, possibly speeding things up with PyPy.
- Question 1: is there a regular-expression approach that would be faster than such a loop? (See the sketch after these questions.)
- Question 2: would switching to a faster, compiled language speed this up? (I'm trying to learn Dlang.)
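To make Question 1 concrete, one option I am considering is collapsing all the prefixes into a single anchored alternation, longest alternative first, so the regex engine scans each sequence line only once. A minimal sketch using Python's re module (the names and the toy seed are mine):

```python
import re

SEED = "agacatacagagacatacagagacatacag"                   # toy 30-mer (P30)
patterns = [SEED[i:] for i in range(len(SEED) - 10 + 1)]  # P30 .. P10

# Longest alternative first, so the longest possible prefix wins
# (shorter patterns can be prefixes of longer ones here because
# the toy seed is periodic).
PREFIX_RE = re.compile("^(?:" + "|".join(map(re.escape, patterns)) + ")")

m = PREFIX_RE.match("agacatacagagacatacagagacatacagGAGGACCA")
if m:
    print(len(m.group(0)))   # N = length of the matched prefix -> 30 here
```

Whether this actually beats a plain startswith() loop is exactly what I am unsure about.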
[1] Since this is a DNA sequence and the file to be searched is in FASTQ format, I use fqgrep (https://github.com/indraniel/fqgrep), which is built on the TRE library (https://github.com/laurikari/tre).
edit_1: an example of the shrinking pattern; only the first few steps / shortened patterns are shown:
^abcde ^bcde ^cde
Or, if you prefer this as DNA:
^GATACCA ^ATACCA ^TACCA
edit_2: plain grep doesn't actually cut it. I need to post-process each 4-line FASTQ record, in which only line #2 (the sequence) is matched. If I don't use fqgrep, I need to:
- read 4 lines of input (one FASTQ record)
- check whether line #2 (the sequence) starts with any of the 20 patterns (P30-P10)
- if there is a match, cut off the first N characters of lines #2 and #4, where N is the length of the matching pattern, and print / write the record's lines #1-#4 to the output; if there is no match, do nothing (see the sketch after this list)
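Putting those steps together, this is roughly the pure-Python version I have in mind (a sketch under the assumptions above: exact matches only, placeholder file names, and the toy seed from this post):

```python
import sys

SEED = "agacatacagagacatacagagacatacag"                    # toy 30-mer (P30)
PATTERNS = [SEED[i:] for i in range(len(SEED) - 10 + 1)]   # P30 .. P10, longest first

def trim_records(in_path, out_path):
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]    # header, sequence, '+', quality
            if not record[3]:                              # end of file / incomplete record
                break
            seq = record[1]
            for pat in PATTERNS:                           # longest pattern first
                if seq.startswith(pat):
                    n = len(pat)
                    record[1] = seq[n:]                    # trim the sequence line
                    record[3] = record[3][n:]              # trim the quality line by the same N
                    fout.writelines(record)                # write the whole trimmed record
                    break                                  # records without a match are skipped

if __name__ == "__main__":
    trim_records(sys.argv[1], sys.argv[2])
```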
For an in-house solution I could try using GNU parallel to split the input file into, say, 4M chunks and gain speed that way. But if I want this to be usable by others, every extra piece of software I ask end users to install adds another level of complexity.
**edit 3** A simple toy example of the regular expressions and matching lines, as requested by Vyctor:
starting P30 regex: ^agacatacagagacatacagagacatacag
matching sequence: ^agacatacagagacatacagagacatacagGAGGACCA
P29 regex: ^gacatacagagacatacagagacatacag
matching sequence: ^gacatacagagacatacagagacatacagGACCACCA
P28 regex: ^acatacagagacatacagagacatacag
matching sequence: ^acatacagagacatacagagacatacagGATTACCA
I delete the DNA characters/bases from the left (the 5-prime end of the DNA, so to speak) because this is how these sequences are degraded by real enzymes. The matched pattern itself is not interesting once it has been found; the desired result is the read sequence that follows it. In the examples above it is in UPPERCASE, and it is what gets mapped to the genome in the next step. It should be emphasised that, unlike this toy example, the sequences following the pattern are longer, a priori unknown and diverse. In the real world I don't need to deal with upper/lower case for the DNA (everything is upper case), but I will probably run into Ns (= unknown DNA base) in the sequences I am searching for patterns in. They can be ignored as a first approximation, but a more sensitive version of the algorithm should probably treat them as plain mismatches. In an ideal scenario it would not just count plain mismatches at a given position, but compute more complex penalties that take into account the DNA quality values stored in line #4 of each 4-line FASTQ record: https://en.wikipedia.org/wiki/FASTQ_format#Quality
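To illustrate what I mean by quality-aware penalties (nothing I have implemented; the scaling and cut-off below are invented for the example), line #4 stores one Phred quality per base, encoded as ord(char) - 33 in Sanger/Illumina 1.8+ FASTQ, so a mismatch at a confidently-called base could be penalised more heavily than one at a low-quality base or an N:

```python
def mismatch_penalty(pattern, seq, qual, phred_offset=33):
    """Toy scoring: sum a penalty for each mismatching position,
    weighted by the Phred quality of the base call; Ns count as
    mismatches but usually carry a low quality, hence a small penalty."""
    penalty = 0.0
    for p, s, q in zip(pattern, seq, qual):
        if p != s:
            phred = ord(q) - phred_offset   # Sanger / Illumina 1.8+ encoding
            penalty += phred / 10.0         # arbitrary scaling for this sketch
    return penalty

# e.g. accept the prefix only if the accumulated penalty stays small:
# if mismatch_penalty(pat, seq[:len(pat)], qual[:len(pat)]) <= 2.0: ...
```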
But that is a more complicated approach, and so far the "accept only reads with a perfect pattern match" strategy has worked well and has kept the downstream analysis simple.