I am trying to find nearby duplicate values ββin a set of fields to allow an administrator to clear them.
There are two criteria that I compare to
- One line is entirely contained inside another and at least 1/4 of its length
- Lines have an editing distance of less than 5% of the total length of two lines
Pseudo-PHP code:
foreach($values as $value){
$matches = array();
foreach($values as $match){
if(
(
$value['length'] < $match['length']
&&
$value['length'] * 4 > $match['length']
&&
stripos($match['value'], $value['value']) !== false
)
||
(
$match['length'] < $value['length']
&&
$match['length'] * 4 > $value['length']
&&
stripos($value['value'], $match['value']) !== false
)
||
(
abs($value['length'] - $match['length']) * 20 < ($value['length'] + $match['length'])
&&
0 < ($match['changes'] = levenshtein($value['value'], $match['value']))
&&
$match['changes'] * 20 <= ($value['length'] + $match['length'])
)
){
$matches[] = &$match;
}
}
}
I tried to reduce calls to relatively expensive striposand levenshteinfeatures where possible, which reduced the execution time quite a bit. However, like the O (n ^ 2) operation, it just does not scale for large sets of values, and it seems that a significant part of the processing time is spent simply iterating through arrays.
Some properties of multiple sets of values ββthat run on
Total | Strings | # of matches per string | |
Strings | With Matches | Average | Median | Max | Time (s) |
--------+--------------+---------+--------+------+----------+
844 | 413 | 1.8 | 1 | 58 | 140 |
593 | 156 | 1.2 | 1 | 5 | 62 |
272 | 168 | 3.2 | 2 | 26 | 10 |
157 | 47 | 1.5 | 1 | 4 | 3.2 |
106 | 48 | 1.8 | 1 | 8 | 1.3 |
62 | 47 | 2.9 | 2 | 16 | 0.4 |
- , , , , , - (, ), ?
:
$values_count = count($values);
for($vid = 0; $vid < $values_count; $vid++){
for($mid = $vid+1; $mid < $values_count; $mid++){
if(
(
$value['length'] * 4 > $match['length']
&&
stripos($match['value'], $value['value']) !== false
)
||
(
($match['length'] - $value['length']) * 20 < ($value['length'] + $match['length'])
&&
0 < ($changes = levenshtein($value['value'], $match['value']))
&&
$changes * 20 <= ($value['length'] + $match['length'])
)
){
$matches[$vid][$mid] = true;
$matches[$mid][$vid] = true;
}
}
}
foreach($matches as $vid => $mids){
}