Implementation of a proximity matrix for clustering

I'm a little new to this field, so please forgive me if the question sounds trivial or simple.

I have a group of data sets (a bag of words for each one), and I need to create a proximity matrix using the edit distances between them.

I am, however, rather confused about how to keep track of my data / rows in the matrix. I need the proximity matrix for clustering.

Alternatively, how do you usually approach similar problems in this area? I am using Perl and R to implement this.

Here is a typical Perl script I wrote that reads from a text file containing my bag of words:

    use strict;
    use warnings;
    use Text::Levenshtein qw(distance);

    main(@ARGV);

    sub main {
        my $Tokenfile      = 'TokenDistinct.txt';
        my $AppendingCount = 0;
        my @Token;
        my %Levcount;

        open my $in, '<', $Tokenfile or die "Error opening file. $!";
        while ( my $line = <$in> ) {
            chomp $line;
            next if $line =~ /^\s*$/;    # skip blank lines instead of pushing them
            push @Token, $line;
        }
        close $in;

        foreach my $tokenWord (@Token) {
            foreach my $other (@Token) {
                $Levcount{$tokenWord}{$other} = distance( $tokenWord, $other );
            }
            StoreSortedValues( \%Levcount, $tokenWord, $AppendingCount );
            $AppendingCount++;
            %Levcount = ();
        }
    }

    sub StoreSortedValues {
        my ( $Levcount, $tokenWordTopMost, $j ) = @_;
        my $Tokenfile = 'LevResult.txt';

        # overwrite on the first call, append afterwards
        my $mode = $j == 0 ? '>' : '>>';
        open my $out, $mode, $Tokenfile or die "Error opening file. $!";

        my %tokenWordMaster = %{ $Levcount->{$tokenWordTopMost} };

        # numeric sort on the distances (cmp would sort them as strings)
        my @ListToken = sort { $tokenWordMaster{$a} <=> $tokenWordMaster{$b} }
            keys %tokenWordMaster;

        print $out "-------------------------- $tokenWordTopMost -------------------------------------\n";
        print $out "$_=>\t$tokenWordMaster{$_}\n" for @ListToken;
        close $out or die "Error Closing File. $!";
    }

The problem is: how can I form the proximity matrix from this and still keep track of which comparison each entry in the matrix corresponds to?
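To make the goal concrete, here is roughly the structure I have in mind, sketched in Python for brevity (the `lev` function below is a plain dynamic-programming implementation rather than a library call, and the token list is just a stand-in for my real bag of words):

```python
def lev(s, t):
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

tokens = ["apple", "apply", "ape"]   # stand-in for the bag of words

# Proximity "matrix" as a hash of hashes, keyed by the tokens themselves,
# so every entry stays traceable to the pair of words it compares.
prox = {a: {b: lev(a, b) for b in tokens} for a in tokens}

print(prox["apple"]["apply"])   # 1
```

Keying the structure by the tokens themselves (rather than by numeric indices) is what lets me look up any pairwise distance later without a separate bookkeeping table.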

+6
2 answers

The RecordLinkage package has a levenshteinDist function, which is one way of calculating the edit distance between strings.

    install.packages("RecordLinkage")
    library(RecordLinkage)

Set up some data:

    fruit <- c("Apple", "Apricot", "Avocado", "Banana", "Bilberry", "Blackberry",
               "Blackcurrant", "Blueberry", "Currant", "Cherry")

Now create a matrix of zeros to reserve memory for the distance table, then use nested for loops to calculate the individual distances. We end up with a matrix that has a row and a column for each fruit, so we can name the rows and columns after the original vector.

    fdist <- matrix(rep(0, length(fruit)^2), ncol = length(fruit))
    for (i in seq_along(fruit)) {
      for (j in seq_along(fruit)) {
        fdist[i, j] <- levenshteinDist(fruit[i], fruit[j])
      }
    }
    rownames(fdist) <- colnames(fdist) <- fruit

Results:

    fdist
                 Apple Apricot Avocado Banana Bilberry Blackberry Blackcurrant
    Apple            0       5       6      6        7          9           12
    Apricot          5       0       6      7        8         10           10
    Avocado          6       6       0      6        8          9           10
    Banana           6       7       6      0        7          8            8
    Bilberry         7       8       8      7        0          4            9
    Blackberry       9      10       9      8        4          0            5
    Blackcurrant    12      10      10      8        9          5            0
    Blueberry        8       9       9      8        3          3            8
    Currant          7       5       6      5        8         10            6
    Cherry           6       7       7      6        4          6           10

(The R console wraps wide matrices; the remaining columns are omitted here.)
+7

A proximity or similarity (or dissimilarity) matrix is just a table that stores a similarity score for every pair of objects. So if you have N objects, the R code can be simMat <- matrix(nrow = N, ncol = N), and then each entry (i, j) of simMat holds the similarity between point i and point j.
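Once the matrix is labelled like this, a clustering step only ever needs to read those (i, j) entries. As a language-neutral illustration, here is a minimal single-linkage agglomerative clustering sketch in Python; the helper name and the tiny hand-computed distance table are purely illustrative:

```python
# Minimal single-linkage agglomerative clustering over a labelled
# distance matrix stored as a dict of dicts. Toy data; illustrative only.
def single_linkage(dist, k):
    """Repeatedly merge the two closest clusters until k remain."""
    clusters = [{w} for w in dist]              # start: one cluster per object
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] |= clusters.pop(j)          # merge the closest pair
    return clusters

# Hand-computed Levenshtein distances between three toy words.
dist = {
    "cat": {"cat": 0, "hat": 1, "dog": 3},
    "hat": {"cat": 1, "hat": 0, "dog": 3},
    "dog": {"cat": 3, "hat": 3, "dog": 0},
}
print(single_linkage(dist, 2))   # "cat" and "hat" merge first
```

The point is that because rows and columns carry the object labels, the clustering output comes back in terms of the original words, which answers the "how do I track what is what" part of the question.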

In R, you can use several packages, including vwr , to calculate the Levenshtein edit distance.

You can also find this Wikibook interesting: http://en.wikibooks.org/wiki/R_Programming/Text_Processing

+2

Source: https://habr.com/ru/post/894577/

