Combine partially matched strings

I am trying to combine partially matched strings from two files.

File 1 contains a list of unique lines. Each of these lines partially matches several lines in file 2. How can I pair each line in file 1 with every line in file 2 that it matches?

File1

 mmu-miR-677-5p_MIMAT0017239
 mmu-miR-181a-1-3p_MIMAT0000660

File2

 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
 mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC

Required output

 mmu-miR-677-5p_MIMAT0017239     mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
 mmu-miR-677-5p_MIMAT0017239     mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
 mmu-miR-677-5p_MIMAT0017239     mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
 mmu-miR-181a-1-3p_MIMAT0000660  mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
 mmu-miR-181a-1-3p_MIMAT0000660  mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC

I tried using pmatch() in R, but could not get it to work. It looks like something Perl could handle?

Maybe something like this:

 perl -ne'exec q;perl;, "-ne", q $print (/\Q$.$1.q;/?"$. YES":$. .q\; NO\;);, "file2" if m;^(.*)_pat1;' file1 
+6
3 answers

This is a short Perl solution that stores all the data from file1 in a hash and then looks it up while scanning file2.

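 use strict;
 use warnings;
 use autodie;

 my @files = qw/ file1.txt file2.txt /;

 # Build a hash from file1: text before the first underscore => text after it
 my %file1 = do {
     open my $fh, '<', $files[0];
     map /([^_]+)_(\S+)/, <$fh>;
 };

 # For each line of file2, look up its prefix and print the file1 entry beside it
 open my $fh, '<', $files[1];
 while (<$fh>) {
     my ($key) = /([^_]+)/;
     printf "%-32s%s", "${key}_$file1{$key}", $_;
 }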

Output

 mmu-miR-677-5p_MIMAT0017239     mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
 mmu-miR-677-5p_MIMAT0017239     mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
 mmu-miR-677-5p_MIMAT0017239     mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
 mmu-miR-181a-1-3p_MIMAT0000660  mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
 mmu-miR-181a-1-3p_MIMAT0000660  mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC
+4

Of course, you can do this in R. Indeed, applying pmatch to whole lines will not give the desired result; you need to match on the corresponding substrings.

I assume that the first identifier in file 1 is 677, not 667; otherwise it is difficult to guess the matching pattern. (I also assume that your example is only part of a larger data set.)

 file1 <- readLines(textConnection('mmu-miR-677-5p_MIMAT0017239
 mmu-miR-181a-1-3p_MIMAT0000660'))
 file2 <- readLines(textConnection('mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
 mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC'))

 library(stringi)
 # Extract the identifier before the first underscore in each line
 file1_id <- stri_extract_first_regex(file1, "^.*?(?=_)")
 file2_id <- stri_extract_first_regex(file2, "^.*?(?=_)")
 # Look up each file2 identifier in file1 and bind the pairs side by side
 cbind(file1=file1[match(file2_id, file1_id)], file2=file2)
 ##      file1                            file2
 ## [1,] "mmu-miR-677-5p_MIMAT0017239"    "mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA"
 ## [2,] "mmu-miR-677-5p_MIMAT0017239"    "mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT"
 ## [3,] "mmu-miR-677-5p_MIMAT0017239"    "mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT"
 ## [4,] "mmu-miR-181a-1-3p_MIMAT0000660" "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC"
 ## [5,] "mmu-miR-181a-1-3p_MIMAT0000660" "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC"
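If installing stringi is not an option, the same substring match can probably be done in base R. The following is only a minimal sketch: it assumes the identifier is everything before the first underscore, and the names `id1`, `id2`, `idx` and `combined` are made up for illustration.

```r
# Sample data from the question, one element per line of each file
file1 <- c("mmu-miR-677-5p_MIMAT0017239",
           "mmu-miR-181a-1-3p_MIMAT0000660")
file2 <- c("mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA",
           "mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT",
           "mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT",
           "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC",
           "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC")

# Drop everything from the first underscore onwards to get the identifier
id1 <- sub("_.*$", "", file1)
id2 <- sub("_.*$", "", file2)

# For each file2 identifier, the position of the same identifier in file1
# (NA where file1 has no such identifier)
idx <- match(id2, id1)

# One row per file2 line, paired with its file1 entry
combined <- data.frame(file1 = file1[idx], file2 = file2,
                       stringsAsFactors = FALSE)
print(combined)
```

Rows of file2 whose identifier does not occur in file1 would show up with `NA` in the `file1` column, which makes unmatched lines easy to spot with `is.na(combined$file1)`.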
+3

You can use agrep for fuzzy matching. You have to play with the maximum distance; here I set it manually to 11.

Basically, I do this to extract the line numbers in file2 that match each entry of file1:

 sapply(file1,agrep,file2,max=11)
 $`mmu-miR-677-5p_MIMAT0017239`
 [1] 1 2 3

 $`mmu-miR-181a-1-3p_MIMAT0000660`
 [1] 4 5

To get the result as a data.frame:

 do.call(rbind,
         lapply(file1,
                function(x)
                  data.frame(file1=x,
                             file2=agrep(x,file2,max=11,value=T))))
                            file1                                    file2
 1    mmu-miR-677-5p_MIMAT0017239   mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
 2    mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
 3    mmu-miR-677-5p_MIMAT0017239  mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
 4 mmu-miR-181a-1-3p_MIMAT0000660  mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
 5 mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC
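Since the distance of 11 was picked by hand for this sample, it may be worth checking on larger data that every file2 line is claimed by exactly one file1 entry. Here is a rough sketch of such a check, assuming the `file1` and `file2` vectors from the question; `hits`, `claims` and `ambiguous` are illustrative names, not part of any package.

```r
# Sample data from the question, one element per line of each file
file1 <- c("mmu-miR-677-5p_MIMAT0017239",
           "mmu-miR-181a-1-3p_MIMAT0000660")
file2 <- c("mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA",
           "mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT",
           "mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT",
           "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC",
           "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC")

# For each file1 entry, the indices of file2 lines within the chosen distance
hits <- lapply(file1, agrep, x = file2, max.distance = 11)

# Count how many file1 entries claim each file2 line
claims <- tabulate(as.integer(unlist(hits)), nbins = length(file2))

# Lines claimed by no entry (distance too strict) or by several (too loose)
ambiguous <- which(claims != 1)
print(ambiguous)
```

If `ambiguous` is not empty, the distance probably needs adjusting, or an exact match on the part before the underscore may be the more reliable choice.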
+2

Source: https://habr.com/ru/post/970865/

