PHP Extracting Similar Parts from Multiple Lines

I am trying to extract the parts that are similar across several lines.

The purpose of this is to attempt to extract the title of a book from several OCRings of the cover page.

This applies only to the beginning of each line; the ends of the lines do not need to be trimmed and can remain as they are.

For example, my lines could be:

    $title[0]='the history of the internet, expanded and revised';
    $title[1]='the history of the internet';
    $title[2]='published by xyz publisher the historv of the internot, expanded and';
    $title[3]='history of the internet';

So basically, I would like to trim each line so that it starts at the most likely starting point. Given that there may be OCR errors (for example, "historv", "internot"), I thought it best to take the number of characters in each word, which would give me an array for each line (so a multidimensional array) holding the length of each word. This can then be used to find matching runs and trim the beginning of each line accordingly.

The lines should be trimmed to:

    $title[0]='the history of the internet, expanded and revised';
    $title[1]='the history of the internet';
    $title[2]='the historv of the internot, expanded and';
    $title[3]='XXX history of the internet';

Therefore, I need to be able to recognize that "history of the internet" (7 2 3 8) is a run that matches in all lines, and that the preceding "the" is most likely correct if it occurs in more than 50% of the lines. The start of each line is then trimmed to "the", and a placeholder of the same length is added to any line that does not contain "the".

So far I have:

    function CompareSimilarStrings($array) {
        $n = count($array);
        $len = array();
        // Get length of each word in each string
        for ($run = 0; $run < $n; $run++) {
            $temp = explode(' ', $array[$run]);
            foreach ($temp as $key => $val)
                $len[$run][$key] = strlen($val);
        }
        for ($run = 0; $run < $n; $run++) {
            // (matching logic still to be written)
        }
    }

As you can see, I am stuck on finding the matches.

Any ideas?

1 answer

You should look into the Smith-Waterman algorithm for local string alignment. It is a dynamic programming algorithm that finds the parts of two strings that are similar, in the sense that they have a low edit distance.

So, if you want to try it, here is a PHP implementation of the algorithm.
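
Independently of the implementation mentioned above, here is a minimal sketch of how local alignment could be applied to the titles from the question. It is an illustration under several assumptions, not a definitive solution: the scoring values (match +2, mismatch -1, gap -1) are arbitrary, the helper `trimToReference` is a hypothetical name, the reference title is picked by hand rather than by the majority vote described in the question, and the syntax assumes PHP 7.4 or later.

```php
<?php
// Smith-Waterman local alignment on characters (sketch).
// The scores (match +2, mismatch -1, gap -1) are illustrative assumptions.
function smithWaterman(string $a, string $b, int $match = 2, int $mismatch = -1, int $gap = -1): array
{
    $n = strlen($a);
    $m = strlen($b);
    // $H[$i][$j]: best score of a local alignment ending at $a[$i-1] and $b[$j-1].
    $H = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    $best = 0;
    $endA = 0;
    $endB = 0;
    for ($i = 1; $i <= $n; $i++) {
        for ($j = 1; $j <= $m; $j++) {
            $sub = ($a[$i - 1] === $b[$j - 1]) ? $match : $mismatch;
            $H[$i][$j] = max(
                0,
                $H[$i - 1][$j - 1] + $sub, // align the two characters
                $H[$i - 1][$j] + $gap,     // skip a character of $a
                $H[$i][$j - 1] + $gap      // skip a character of $b
            );
            if ($H[$i][$j] > $best) {
                $best = $H[$i][$j];
                $endA = $i;
                $endB = $j;
            }
        }
    }
    // Trace back from the best cell to find where the aligned region starts.
    $i = $endA;
    $j = $endB;
    while ($i > 0 && $j > 0 && $H[$i][$j] > 0) {
        $sub = ($a[$i - 1] === $b[$j - 1]) ? $match : $mismatch;
        if ($H[$i][$j] === $H[$i - 1][$j - 1] + $sub) {
            $i--;
            $j--;
        } elseif ($H[$i][$j] === $H[$i - 1][$j] + $gap) {
            $i--;
        } else {
            $j--;
        }
    }
    return ['score' => $best, 'startA' => $i, 'startB' => $j];
}

// Hypothetical helper: trim $line so that it starts where it aligns with $reference.
function trimToReference(string $reference, string $line): string
{
    $r = smithWaterman($reference, $line);
    // Any leading part of the reference that the line lacks becomes an X placeholder
    // (non-space characters are replaced, spaces are kept).
    $placeholder = preg_replace('/\S/', 'X', substr($reference, 0, $r['startA']));
    return $placeholder . substr($line, $r['startB']);
}

// Assumption: the reference title is chosen by hand for this example.
$reference = 'the history of the internet';
$title = [
    'the history of the internet, expanded and revised',
    'the history of the internet',
    'published by xyz publisher the historv of the internot, expanded and',
    'history of the internet',
];
$trimmed = array_map(fn($line) => trimToReference($reference, $line), $title);
print_r($trimmed);
```

With these inputs the first two titles come back unchanged, the third is cut down to 'the historv of the internot, expanded and', and the fourth becomes 'XXX history of the internet'. Choosing the reference automatically (for example, by aligning every pair of titles and keeping the start that most lines agree on) is left open here.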


Source: https://habr.com/ru/post/909248/

