Preview Java Search Results

Suppose I have the following text (from a wiki):

Java is a programming language originally developed by James Gosling at Sun Microsystems (which is now a subsidiary of Oracle Corporation) and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++, but has a simpler object model and fewer low-level facilities. Java applications are typically compiled to bytecode (a class file) that can run on any Java virtual machine (JVM) regardless of computer architecture. Java is general-purpose, concurrent, class-based ...

And I would like to extract the occurrences of "Java" and "programming" into a Google-style result snippet like this:

Java is a programming language originally developed by James Gosling at Sun Microsystems ... Java applications are typically compiled to bytecode (a class file) that can run on any Java virtual machine (JVM) ...

  1. Which tools can I use, and how, to get the result above? Commons, Lucene, Compass?

  2. If there is an algorithm that highlights the keywords, trims the lines, and adds "..." at the end, please share it.

  3. How do you decide how many and which keywords to show in the search-result preview?

2 answers

I do not know of any tools that will help with this, but I can offer an algorithm that should give pretty decent results. Edit: the OP requested example code for the index. I use Trove's TIntObjectHashMap to store this information, but a plain Java HashMap would work just as well.

Step 1: Search the text for each search word and build an index of the offsets at which each one appears.

  TIntObjectHashMap<String> matchIndex = new TIntObjectHashMap<String>();
  // for each word or other string to highlight,
  // find each instance of that word in the text
  // (text and searchStrings are assumed to be defined by the caller)
  for (String searchString : searchStrings)
    for (int x = text.indexOf(searchString); x >= 0; x = text.indexOf(searchString, x + 1))
      matchIndex.put(x, searchString);
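
As the answer notes, the same index can be built with standard collections instead of Trove. The following is a minimal sketch under that assumption, not the answerer's own code: the class name `OffsetIndex` and the method `build` are made up for illustration, matching is case-sensitive, and a `TreeMap` is used in place of a `HashMap` so that the offsets come back already sorted.

```java
import java.util.List;
import java.util.TreeMap;

// Hypothetical helper: maps each offset where a search word starts to that word.
final class OffsetIndex {
    static TreeMap<Integer, String> build(String text, List<String> words) {
        TreeMap<Integer, String> index = new TreeMap<>();
        for (String word : words) {
            if (word.isEmpty()) continue; // indexOf("") would match at every offset
            // indexOf returns -1 once there are no further occurrences
            for (int at = text.indexOf(word); at >= 0; at = text.indexOf(word, at + 1)) {
                index.put(at, word);
            }
        }
        return index;
    }
}
```

Calling `OffsetIndex.build(text, List.of("Java", "programming"))` returns one entry per occurrence, keyed by its starting offset.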

Step 2: Go through every pair of indices from step 1 and record the number of characters between them and the number of hits they enclose.

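  // class to hold a match
  private class Match implements Comparable<Match> {
    private final int x1, x2;
    private final int hitCount;

    public Match(int x1, int x2, int hitCount) {
      this.x1 = x1;
      this.x2 = x2;
      this.hitCount = hitCount;
    }

    // hits per character of the span
    private double sortValue() {
      return (double) hitCount / Math.abs(x2 - x1);
    }

    // ascending by density, so the best spans come last in the TreeSet
    // (read them best-first with descendingSet())
    @Override
    public int compareTo(Match m) {
      int c = Double.compare(this.sortValue(), m.sortValue());
      if (c != 0) return c;
      // tie-break on offsets so the TreeSet keeps distinct spans of equal density
      if (x1 != m.x1) return Integer.compare(x1, m.x1);
      return Integer.compare(x2, m.x2);
    }
  }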

  // go through every combination of keys (string offsets) and record them
  // the TreeSet will automatically sort the results
  TreeSet<Match> matches = new TreeSet<Match>();
  int[] keys = matchIndex.keys();
  Arrays.sort(keys);  // java.util.Arrays; keys() returns the offsets in no particular order
  for (int x1 = 0; x1 < keys.length; x1++)
    for (int x2 = x1 + 1; x2 < keys.length; x2++)
      matches.add(new Match(keys[x1],
                            keys[x2] + matchIndex.get(keys[x2]).length(),
                            1 + x2 - x1));

Step 3: Take the list generated in step 2 and sort it by the number of hits per character of span length.

  // nicely done by the TreeSet

Step 4: Start at the top of the list from step 3 and mark each item as included. Remember to combine overlapping results into one larger result. Stop when the next item would push the total string length past 255 (or so) characters.
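
One way step 4 could look in code is sketched below. This is an illustration under stated assumptions rather than the answerer's implementation: spans are plain `{start, end}` arrays already ordered best-first, the names `SpanSelector`, `select`, `merge`, and `totalLength` are hypothetical, and the budget is checked after merging so that overlapping spans are not counted twice.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: greedily picks ranked spans until a character budget is hit.
final class SpanSelector {
    /** ranked holds {start, end} offsets, best first; the result is sorted by start. */
    static List<int[]> select(List<int[]> ranked, int budget) {
        List<int[]> chosen = new ArrayList<>();
        for (int[] candidate : ranked) {
            List<int[]> trial = merge(chosen, candidate);
            if (totalLength(trial) > budget) break; // the next item would exceed the budget
            chosen = trial;
        }
        return chosen;
    }

    /** Returns spans plus span, with overlapping entries fused into one larger span. */
    static List<int[]> merge(List<int[]> spans, int[] span) {
        List<int[]> all = new ArrayList<>(spans);
        all.add(span);
        all.sort(Comparator.comparingInt((int[] s) -> s[0]));
        List<int[]> out = new ArrayList<>();
        for (int[] s : all) {
            int[] last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (last != null && s[0] <= last[1]) {
                last[1] = Math.max(last[1], s[1]); // overlap: extend the previous span
            } else {
                out.add(s.clone()); // copy so the caller's arrays are never modified
            }
        }
        return out;
    }

    static int totalLength(List<int[]> spans) {
        int total = 0;
        for (int[] s : spans) total += s[1] - s[0];
        return total;
    }
}
```

With a budget of 255, `select` keeps adding and merging spans from the front of the ranked list and stops at the first one that would make the combined length exceed 255 characters.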

Step 5: Display each of the items selected in step 4 with "..." in between. Be sure to include whatever markup is needed to highlight the search words themselves in each item.
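
A rough sketch of step 5 follows, assuming the selected spans arrive as non-overlapping `{start, end}` arrays sorted by start. The names `SnippetRenderer`, `render`, and `highlight` are hypothetical, the `<b>` tags are just one possible markup choice, the search words are assumed to be non-empty, and HTML escaping of the text is left out for brevity.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: joins the selected spans with "..." and wraps hits in <b> tags.
final class SnippetRenderer {
    static String render(String text, List<int[]> spans, List<String> words) {
        if (spans.isEmpty()) return "";
        StringBuilder out = new StringBuilder();
        if (spans.get(0)[0] > 0) out.append("... "); // text was cut before the first span
        for (int i = 0; i < spans.size(); i++) {
            int[] s = spans.get(i);
            if (i > 0) out.append(" ... ");
            out.append(highlight(text.substring(s[0], s[1]), words));
        }
        if (spans.get(spans.size() - 1)[1] < text.length()) out.append(" ..."); // cut after the last span
        return out.toString();
    }

    static String highlight(String piece, List<String> words) {
        if (words.isEmpty()) return piece;
        StringBuilder alternation = new StringBuilder();
        for (String w : words) {
            if (alternation.length() > 0) alternation.append('|');
            alternation.append(Pattern.quote(w)); // treat each word as a literal
        }
        // $0 refers to the whole match, i.e. the search word that was found
        return Pattern.compile(alternation.toString()).matcher(piece).replaceAll("<b>$0</b>");
    }
}
```

Swapping the `<b>` tags for a CSS class or terminal escape codes only requires changing the replacement string in `highlight`.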


Take a look at Lucene for this, and especially at the main components it provides. There is a good example of building one here:

http://www.cocooncenter.org/articles/lucene.html


Source: https://habr.com/ru/post/1336503/

