Fuzzy string matching with grep

I am trying to match lines in a file that contain a given string, for example ACTGGGTAAACTA. If I do

grep "ACTGGGTAAACTA" file 

it gives me only the lines that match exactly. Is there a way to allow a certain number of mismatches (substitutions, insertions, or deletions)? For example, I am looking for sequences with:

  • Up to 3 substitutions, such as "AGTGGGTAACCAA", etc.

  • Insertions/deletions (a partial match such as "ACTGGGAAAATAAACTA" or "ACTAAACTA")

+4
3 answers

There used to be a tool called agrep for fuzzy regular-expression matching, but it was abandoned.

http://en.wikipedia.org/wiki/Agrep

https://github.com/Wikinaut/agrep

Another option is tre-agrep.

+3

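You can use tre-agrep and set the maximum number of errors (the edit distance) with the -E option. For example, given a file foo: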

cat << EOF > foo
ACTGGGAAAATAAACTA
ACTAAACTA
ACTGGGTAAACTA
EOF

To match every line within an edit distance of up to 9:

tre-agrep -s -9 -w ACTGGGTAAACTA foo

Output:

4:ACTGGGAAAATAAACTA
4:ACTAAACTA
0:ACTGGGTAAACTA
+1

Short answer: no.

Long answer: As @JDB said, regex is inherently exact. You can manually allow a mismatch at a given position, for example by writing [ATGC] instead of A, but there is no way to allow only a small number of arbitrary mismatches. I suggest you write your own code to parse it, or try to find a DNA parser somewhere.
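As a rough illustration of the "write your own code" route, here is a minimal sketch in POSIX shell and awk. It covers only the substitution case from the question (a Hamming-distance check), so lines whose length differs from the pattern are skipped and insertions or deletions are not handled. The function name fuzzy_subst and its argument order are made up for this example and are not part of grep or any other tool.

```shell
# Hypothetical helper (not a standard tool): print the lines of FILE that
# differ from PATTERN by at most MAX substitutions, prefixed with the
# number of mismatches. Lines of a different length are ignored.
fuzzy_subst() {
    pattern=$1
    max=$2
    file=$3
    awk -v pat="$pattern" -v max="$max" '
        length($0) == length(pat) {
            mismatches = 0
            # Compare the line and the pattern character by character.
            for (i = 1; i <= length(pat); i++)
                if (substr($0, i, 1) != substr(pat, i, 1))
                    mismatches++
            if (mismatches <= max)
                print mismatches ":" $0
        }
    ' "$file"
}
```

For a file containing the lines AGTGGGTAACCAA, ACTAAACTA and ACTGGGTAAACTA, calling `fuzzy_subst ACTGGGTAAACTA 3 file` should print `3:AGTGGGTAACCAA` and `0:ACTGGGTAAACTA`; the shorter line is skipped because it would need deletions rather than substitutions.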

0

Source: https://habr.com/ru/post/1589054/