Search nearby duplication optimization

I am trying to find nearby duplicate values ​​in a set of fields to allow an administrator to clear them.

There are two criteria that I compare to

  • One line is entirely contained inside another and at least 1/4 of its length
  • Lines have an editing distance of less than 5% of the total length of two lines

Pseudo-PHP code:

foreach($values as $value){
$matches = array();
foreach($values as $match){
  if(
    (
      $value['length'] < $match['length']
      &&
      $value['length'] * 4 > $match['length']
      &&
      stripos($match['value'], $value['value']) !== false
    )
    ||
    (
      $match['length'] < $value['length']
      &&
      $match['length'] * 4 > $value['length']
      &&
      stripos($value['value'], $match['value']) !== false
    )
    ||
    (
      abs($value['length'] - $match['length']) * 20 < ($value['length'] + $match['length'])
      &&
      0 < ($match['changes'] = levenshtein($value['value'], $match['value']))
      &&
      $match['changes'] * 20 <= ($value['length'] + $match['length'])
      )
    ){
      $matches[] = &$match;
    }
}
// output matches for current outer loop value
}

I tried to reduce calls to relatively expensive striposand levenshteinfeatures where possible, which reduced the execution time quite a bit. However, like the O (n ^ 2) operation, it just does not scale for large sets of values, and it seems that a significant part of the processing time is spent simply iterating through arrays.

Some properties of multiple sets of values ​​that run on

Total   | Strings      | # of matches per string |          |
Strings | With Matches | Average | Median |  Max | Time (s) |
--------+--------------+---------+--------+------+----------+
    844 |          413 |     1.8 |      1 |   58 |    140   |
    593 |          156 |     1.2 |      1 |    5 |     62   | 
    272 |          168 |     3.2 |      2 |   26 |     10   |
    157 |           47 |     1.5 |      1 |    4 |      3.2 |
    106 |           48 |     1.8 |      1 |    8 |      1.3 |
     62 |           47 |     2.9 |      2 |   16 |      0.4 |

- , , , , , - (, ), ?


:

// $values is ordered from shortest to longest string length
$values_count = count($values); // saves a ton of time, especially on linux
for($vid = 0; $vid < $values_count; $vid++){
for($mid = $vid+1; $mid < $values_count; $mid++){ // only check against longer strings
  if(
    (
      $value['length'] * 4 > $match['length']
      &&
      stripos($match['value'], $value['value']) !== false
    )
    ||
    (
      ($match['length'] - $value['length']) * 20 < ($value['length'] + $match['length'])
      &&
      0 < ($changes = levenshtein($value['value'], $match['value']))
      &&
      $changes * 20 <= ($value['length'] + $match['length'])
      )
    ){
      // store match in both directions
      $matches[$vid][$mid] = true;
      $matches[$mid][$vid] = true;
    }

}
}
// Sort outer array of matches alphabetically with uksort()
foreach($matches as $vid => $mids){
  // sort inner array of matches by usage count with uksort()
  // output matches
}
+1
2

(O (N)), , , levenshtein , .

, N x N, . N x N, , .

, , . , () . , , - . . , . - , . , , , .

0

100% - , . ?

( , a-z0-9, , , stripos, , ). () (). , , , , .

0

Source: https://habr.com/ru/post/1781580/


All Articles