Lucene / Solr: store offset information for specific keywords

We use Solr to store documents with keywords; each keyword is associated with a span (a character offset range) within the document.

The keywords were produced by some sophisticated analytics and/or manual work before they were loaded into Solr. A keyword can be repeated several times in a document. On the other hand, different instances of the same string in the same document may be associated with different keywords.

For example, this document

Bill studied The Bill of Rights last summer.

may be accompanied by the following keywords (with offsets in parentheses):

William Brown (0:4)
legal term (13:31)  
summer 2011 (32:43)

(Obviously, in other documents, Bill could refer to Bill Clinton or Bill Gates. Similarly, last summer will refer to different years in different documents. We have all this information for all documents.)
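To make the offset convention concrete, here is a minimal Python sketch that checks which substring each keyword covers. It assumes the offsets are 0-based character positions with an exclusive end (the same convention as Python slices); the helper name `spans` is hypothetical.

```python
text = "Bill studied The Bill of Rights last summer."

# (keyword, start, end) triples from the example above. Offsets are assumed
# to be 0-based character positions with an exclusive end, like Python slices.
keywords = [
    ("William Brown", 0, 4),
    ("legal term", 13, 31),
    ("summer 2011", 32, 43),
]


def spans(text, keywords):
    """Map each keyword to the substring that its offsets select."""
    return {kw: text[start:end] for kw, start, end in keywords}


covered = spans(text, keywords)
for kw, fragment in covered.items():
    print(f"{kw!r} -> {fragment!r}")
```

Under that assumption, the three keywords select "Bill", "The Bill of Rights", and "last summer" respectively.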

I know that a document can have a field, say KEYWORD, that stores William Brown. Then a query for William Brown returns the document above.

, , William Brown 0:4, Bill, .

, TermVectors, , . , ...

EDIT: , , / .

EDIT2: , , ( ).

+4
2

Q Monte

:

  • ,
  • .
  • Solr, Solr.

:

  • .

Solr 4.8+ supports nested (child) documents, so each annotation can be indexed as a child of its source document:

curl 'http://localhost:8983/solr/update/json?softCommit=true' -H 'Content-type:application/json' -d '
[
  {
    "id": "123",
    "text" : "Bill studied The Bill of Rights last summer.",
    "content_type": "source",
    "_childDocuments_": [
      {
        "id": "123-1",
        "content_type": "source_annotation",
        "annotation": "William Brown",
        "start_offset": 0,
        "end_offset": 4
      },
      {
        "id": "123-2",
        "content_type": "source_annotation",
        "annotation": "legal term",
        "start_offset": 13,
        "end_offset": 31
      },
      {
        "id": "123-3",
        "content_type": "source_annotation",
        "annotation": "summer 2011",
        "start_offset": 32,
        "end_offset": 43
      }
    ]
  }
]'
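If the keywords already exist as (keyword, start, end) triples, a payload of this shape could be generated rather than written by hand. The following is only a sketch: the function name `build_nested_doc` and the `<parent id>-<n>` child id scheme are assumptions modelled on the example, and the field names are the ones used in the payload above.

```python
import json


def build_nested_doc(doc_id, text, keywords):
    """Build one parent document with one child document per keyword span.

    `keywords` is an iterable of (annotation, start, end) triples with
    0-based, end-exclusive character offsets (an assumption).
    """
    children = []
    for n, (annotation, start, end) in enumerate(keywords, start=1):
        # Reject spans that fall outside the text or are empty/inverted.
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"bad span {start}:{end} for {annotation!r}")
        children.append({
            "id": f"{doc_id}-{n}",  # hypothetical id scheme, as in the example
            "content_type": "source_annotation",
            "annotation": annotation,
            "start_offset": start,
            "end_offset": end,
        })
    return {
        "id": doc_id,
        "text": text,
        "content_type": "source",
        "_childDocuments_": children,
    }


doc = build_nested_doc(
    "123",
    "Bill studied The Bill of Rights last summer.",
    [("William Brown", 0, 4), ("legal term", 13, 31), ("summer 2011", 32, 43)],
)
payload = json.dumps([doc], indent=2)  # body for the JSON update request
print(payload)
```

The printed JSON can then be sent as the request body of the update call; validating spans before indexing avoids storing offsets that do not fit the text.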

... .

1) : http://localhost:8983/solr/query?fl=id,start_offset,end_offset&q={!child of=content_type:source}annotation:"William Brown"

"response":{"numFound":1,"start":0,
    "docs":[
      {
            "id": "123-1",
            "content_type": "source_annotation",
            "annotation": "William Brown",
            "start_offset": 0,
            "end_offset": 4
      }
    ]
  }
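To narrow a search to one specific span, one option (not taken from the answer itself) is to query the annotation documents directly and filter on the offset fields. The sketch below only builds the request URL with the standard library; it assumes the field names from the example and that `start_offset` and `end_offset` are indexed numeric fields. The helper name `span_query_url` is hypothetical.

```python
from urllib.parse import urlencode


def span_query_url(annotation, start, end,
                   base="http://localhost:8983/solr/query"):
    """Build a URL that matches annotation documents at an exact span."""
    # Escape backslashes and double quotes so the phrase stays well-formed.
    escaped = annotation.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'annotation:"{escaped}"',
        "fq": [
            "content_type:source_annotation",  # only annotation documents
            f"start_offset:{start}",
            f"end_offset:{end}",
        ],
        "fl": "id,start_offset,end_offset",
    }
    # doseq=True emits one fq=... parameter per list element.
    return base + "?" + urlencode(params, doseq=True)


url = span_query_url("William Brown", 0, 4)
print(url)
```

The matching annotation documents come back with their own ids; mapping them to the parent document then relies on whatever id convention was used at indexing time.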

, .

2) + : http://localhost:8983/solr/query?hl=true&hl.fl=text&fq=content_type:source&q=text:"William Brown" OR id:123

(id: 123, , ORed )

"response":{"numFound":1,"start":0,
    "docs":[
      {
            "id": "123",
            "content_type": "source",
            "text": "Bill studied The Bill of Rights last summer."
      }
    ],
    "highlighting":{}
  }

. , - content_type:source. !

content_type:source_annotation content_type:source .


Yonik .

+2

Solr / , , , Tokenizer. . , , , SynonymFilterFactory.

, SynonymFilterFactory, : foo => baz foo bar, , , , . , : "foo is awesome", foo (start=0,end=3) bar(start=0,end=3) ( , SynonymFilterFactory ):

   text:   foo    is    awesome
   start:  0      4     7
   end:    3      6     14

After SynonymFilterFactory:

           bar
   text:   foo    is    awesome
   start:  0      4     7
   end:    3      6     14
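The idea of a synonym sharing the position and offsets of the original token can be imitated in a few lines of Python. This is a toy model, not Lucene's API: the `Token` type, `tokenize`, and `add_synonyms` are hypothetical names, tokens are split on whitespace, and end offsets are exclusive.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    term: str
    start: int     # character offset of the first character
    end: int       # character offset just past the last character
    position: int  # token position in the stream


def tokenize(text):
    """Split on whitespace, recording character offsets and positions."""
    return [
        Token(m.group(), m.start(), m.end(), pos)
        for pos, m in enumerate(re.finditer(r"\S+", text))
    ]


def add_synonyms(tokens, synonyms):
    """Emit each synonym right after its source token, with the same
    offsets and position, so both terms point at the same text span."""
    out = []
    for tok in tokens:
        out.append(tok)
        for syn in synonyms.get(tok.term, []):
            out.append(tok._replace(term=syn))
    return out


tokens = add_synonyms(tokenize("foo is awesome"), {"foo": ["bar"]})
for tok in tokens:
    print(tok)
```

In this model `bar` is a separate term in the stream, yet it carries the offsets of `foo`, which is the property the tables illustrate.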

, foo, , bar , , bar SynonymFilterFactory

, , , Solr. OpenSourceConnections Lucidworks ( Solr/Lucene). .

?

+3

Source: https://habr.com/ru/post/1618818/

