PHP code for comparing two large text files with ~300,000 entries and outputting the differences

I have two lists, A and B, where B = A + C - D. All elements are unique (no duplicates). How do I get the lists of:
(1) newly added items, C
(2) removed old items, D

C and D contain no more than 10,000 items.

Edit

Crap, sorry guys - forgot an important detail - these are both text files, not in-memory arrays.

6 answers

You said that you already have two files A and B.

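Here is the simplest and fastest solution, assuming you are running on a Unix system and both files are sorted (comm requires sorted input).

// lines found only in B (added items) go to C
system("comm -13 A B > C");
// lines found only in A (removed items) go to D
system("comm -23 A B > D");

// then read C and D in PHP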

Read both files into arrays, then use array_diff():

$a = array( 1, 2, 3, 4 );
$b = array( 1, 3, 5, 7 ); // 2 and 4 removed, 5 and 7 added

$c = array_diff( $b, $a ); // [5, 7]
$d = array_diff( $a, $b ); // [2, 4]
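
Since the lists in the question live in text files, here is a minimal sketch of the same idea reading one entry per line. The function name diffFiles and the one-entry-per-line layout are assumptions for illustration, not something the question states.

```php
<?php
// Hypothetical helper: name and file layout (one entry per line) are assumed.
function diffFiles(string $pathA, string $pathB): array
{
    // Drop trailing newlines and skip empty lines while reading
    $flags = FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES;
    $a = file($pathA, $flags);
    $b = file($pathB, $flags);
    if ($a === false || $b === false) {
        throw new RuntimeException('Could not read one of the input files');
    }
    // array_diff() preserves the original keys, so reindex the results
    $added   = array_values(array_diff($b, $a)); // in B but not in A
    $removed = array_values(array_diff($a, $b)); // in A but not in B
    return [$added, $removed];
}
```

This loads both files fully into memory, which should be workable for a few hundred thousand short lines but is worth measuring on real data.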

Another option is to sort both lists and walk through them together:

<?php

sort($a, SORT_NUMERIC);
sort($b, SORT_NUMERIC);
$c = array();
$d = array();
while (($currA = array_pop($a)) !== null) {
        while (($currB = array_pop($b)) !== null) {
                if ($currB == $currA) {
                        // exists in both, skip value
                        continue 2;
                }
                if ($currA > $currB) {
                        // exists in A only, add to D, push B back on to stack
                        $d[] = $currA;
                        $b[] = $currB;
                        continue 2;
                }
                // exists in B only, add to C
                $c[] = $currB;
        }
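        // B is exhausted, so this value exists in A only
        $d[] = $currA;
}
// whatever is left in B is smaller than every value of A, so it exists in B only
$c = array_merge($c, $b);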


function diffLists($listA,$listB) {

  $resultAdded = array();
  $resultRemoved = array();
  foreach($listB AS $item) {
    if (!in_array($item,$listA)) {
       $resultAdded[] = $item;
    }
  }
  foreach($listA AS $item) {
    if (!in_array($item,$listB)) {
      $resultRemoved[] = $item;
    }
  }
  return array($resultAdded,$resultRemoved);
}



$myListA = array('item1','item2','item3');
$myListB = array('item1','item3','item4');
print_r(diffLists($myListA,$myListB));

This returns 2 arrays: the first holds the items added in B, the second holds the items removed from B.
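
Each in_array() call above scans the whole list, which gets slow at 300,000 entries. A minimal sketch of the same function using array keys for lookups instead, assuming every item is a string or integer (so it can serve as an array key); diffListsFast is a hypothetical name.

```php
<?php
// Hypothetical variant of diffLists(); assumes items are valid array keys.
function diffListsFast(array $listA, array $listB): array
{
    // array_flip() turns values into keys, so isset() becomes a hash lookup
    $inA = array_flip($listA);
    $inB = array_flip($listB);

    $resultAdded = [];
    $resultRemoved = [];
    foreach ($listB as $item) {
        if (!isset($inA[$item])) {
            $resultAdded[] = $item;   // in B but not in A
        }
    }
    foreach ($listA as $item) {
        if (!isset($inB[$item])) {
            $resultRemoved[] = $item; // in A but not in B
        }
    }
    return [$resultAdded, $resultRemoved];
}
```

It is called the same way as diffLists() and returns the added items first, then the removed ones.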


You may want the Levenshtein algorithm if you need something more efficient:

http://en.wikipedia.org/wiki/Levenshtein_distance


Searching for every value of A in B (and vice versa) has complexity O(n^2).

For large amounts of data, you are probably better off sorting each list, which is O(n log n), and then making a single pass through both sorted lists, collecting the added and deleted items as you go. (It’s relatively easy to do since you know that there are no duplicates.)
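
The sort-then-single-pass idea can be sketched roughly as follows. This assumes the entries are compared as strings and that neither list has duplicates; diffSorted is a hypothetical name.

```php
<?php
// Hypothetical helper: sorts both lists, then merges them in one pass.
function diffSorted(array $a, array $b): array
{
    // Sort as strings so the ordering matches the strcmp() calls below
    sort($a, SORT_STRING);
    sort($b, SORT_STRING);

    $added = [];
    $removed = [];
    $i = 0;
    $j = 0;
    $na = count($a);
    $nb = count($b);

    while ($i < $na && $j < $nb) {
        $cmp = strcmp((string) $a[$i], (string) $b[$j]);
        if ($cmp === 0) {          // present in both lists
            $i++;
            $j++;
        } elseif ($cmp < 0) {      // only in A, so it was removed
            $removed[] = $a[$i++];
        } else {                   // only in B, so it was added
            $added[] = $b[$j++];
        }
    }
    // Whatever remains in either list has no counterpart in the other
    while ($i < $na) {
        $removed[] = $a[$i++];
    }
    while ($j < $nb) {
        $added[] = $b[$j++];
    }
    return [$added, $removed];
}
```

If the files are already sorted on disk, the same loop could read them line by line instead of holding both arrays in memory.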


Source: https://habr.com/ru/post/1744338/

