Inverted Search: Document Phrases

I have a database full of phrases (80-100 characters each) and several long documents (50-100 KB each). For a given document I need a ranked list of the phrases it contains, rather than the usual search engine output, which is a list of documents for a given phrase.

I have used MySQL full-text indexing before, and I have looked into Lucene but never used it. Both seem oriented toward comparing a short text (the search term) with a long one (the document).

How would you do the opposite?

+3
4 answers

- ~ 50 . , , , .

, , .

. , . , , . .

, 1,2,.., n , . , .

, , , .

Here is the code, to whet your appetite:

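            // Fragment from inside a method: docText, params, Utils and count
            // are defined elsewhere in the author's code and are not shown here.
            // Needs java.io.StringReader, java.util.{ArrayList, HashSet, LinkedList},
            // org.apache.lucene.analysis.Token and
            // org.apache.lucene.analysis.standard.StandardTokenizer.
            // The tokenizer calls use the old pre-3.0 Lucene TokenStream API.

            // Hashes of every lexicon phrase seen at least once in this document.
            HashSet<Long> foundHashes = new HashSet<Long>();

            // Sliding window over the last maxPhrase tokens, padded with empty strings.
            LinkedList<String> words = new LinkedList<String>();
            for (int i = 0; i < params.maxPhrase; i++) words.addLast("");

            StandardTokenizer st = new StandardTokenizer(new StringReader(docText));
            Token t = new Token();
            while (st.next(t) != null) {
                // Push the new token into the window and drop the oldest one.
                String token = new String(t.termBuffer(), 0, t.termLength());
                words.addLast(token);
                words.removeFirst();

                // Try every phrase length that ends at the current token.
                // The upper bound is inclusive here, assuming maxPhrase is meant
                // to be the longest phrase length (the original used <).
                for (int len = params.minPhrase; len <= params.maxPhrase; len++) {
                    String term = Utils.join(new ArrayList<String>(words.subList(params.maxPhrase - len, params.maxPhrase)), " ");

                    long hash = Utils.longHash(term);

                    if (params.lexicon.isTermHash(hash)) {
                        foundHashes.add(hash);
                    }
                }
            }

            // Each phrase counts once per document, however often it occurs in it.
            for (long hash : foundHashes) {
                if (count.containsKey(hash)) {
                    count.put(hash, count.get(hash) + 1);
                } else {
                    count.put(hash, 1);
                }
            }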
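
The fragment above depends on Lucene and on helpers that are not shown (Utils, params.lexicon, count). As a rough illustration of the same sliding-window idea, here is a minimal self-contained sketch in plain Java. Everything in it is an assumption made for the example, not the answerer's code: the class name PhraseMatcher, the regex tokenizer standing in for StandardTokenizer, an in-memory Set<String> of lowercased phrases standing in for the hashed lexicon, and ranking by raw occurrence count within a single document (the fragment above instead counts each phrase once per document).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class, for illustration only.
final class PhraseMatcher {

    /** Lowercases the text and splits it on runs of non-letter, non-digit characters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /**
     * Counts how often each lexicon phrase of minWords..maxWords words occurs
     * in docText. Lexicon entries are assumed to be lowercased and to use
     * single spaces between words, matching the output of tokenize().
     */
    static Map<String, Integer> countPhrases(String docText, Set<String> lexicon,
                                             int minWords, int maxWords) {
        Map<String, Integer> counts = new HashMap<>();
        List<String> tokens = tokenize(docText);
        // For every token position, try each phrase length ending at that token.
        for (int end = 1; end <= tokens.size(); end++) {
            for (int len = minWords; len <= maxWords && len <= end; len++) {
                String candidate = String.join(" ", tokens.subList(end - len, end));
                if (lexicon.contains(candidate)) {
                    counts.merge(candidate, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Sorts matched phrases by descending count, breaking ties alphabetically. */
    static List<Map.Entry<String, Integer>> rank(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return ranked;
    }
}
```

Calling PhraseMatcher.rank(PhraseMatcher.countPhrases(docText, lexicon, 1, 5)) would give the phrases found in one document, most frequent first. The work per document grows with the number of tokens times the number of phrase lengths tried, and does not depend on the size of the phrase database beyond the cost of a set lookup. For a very large lexicon, storing 64-bit hashes instead of strings, as the fragment above does, would presumably save memory at the cost of occasional hash collisions.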
+3

, ?

, , ( |) . , . .

0

? , .

:

  • . . , , , , . 5 , 5 , . , , - (, "XX" ), .

  • , ( ) , , , .

  • .

  • , .

  • , . , .

0

, . , , itsadok.

0

Source: https://habr.com/ru/post/1726992/

